2023-12-08 00:24:27

by Qais Yousef

Subject: [PATCH v2 0/8] sched: cpufreq: Remove magic hardcoded numbers from margins

And replace it with more dynamic logic based on hardware/software limitations.

Margins referred to are the ones in:

* 80% in fits_capacity()
* 125% in map_util_perf()

The current values seem to have worked well in the past, but on modern hardware
they pose the following problems:

* The fits_capacity() headroom is not big enough for little cores
* The fits_capacity() headroom is too big for mid cores
* It leaves open the question of whether the big core can trigger overutilized
prematurely on powerful systems, where the top 20% is still a lot of perf
headroom to be consumed

The 1st causes tasks to get stuck underperforming on little cores for too long.
The 2nd prevents us from better utilizing the mid cores when the workload is at
a steady state above the 80%. Ideally we need to spread to mids and bigs in
this case, but we end up pushing to bigs only. The 3rd I didn't get a chance to
quantify in practice yet, but I do think our current definition of overutilized
being tied to misfit has gotten stale and needs rethinking.

* The 125% in map_util_perf() ends up causing power/thermal issues on powerful
systems by forcing an unnecessarily high headroom of idle time.

We could run slower with no perf impact in many cases.

To address these issues we define the limitation of each as follows:

* fits_capacity() should return true as long as the util will not rise
above the capacity of the CPU before the next load balance.

Load balance is the point of correction for a misplaced task, and we can
afford to keep the task running on a CPU when

task->util_avg < capacity_of(cpu)

as long as task->util_avg won't grow beyond capacity_of(cpu) shortly after,
which would leave the task stuck underperforming until a load balancing event
comes to save the day.
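
As a minimal sketch of the intended logic (this is what patch 5 below ends up
implementing, using the approximate_util_avg() helper added in patch 3):

/* util fits if, grown for one more tick worth of runtime, it still
 * stays below the CPU's capacity */
static inline bool fits_capacity(unsigned long util, unsigned long capacity)
{
	return approximate_util_avg(util, TICK_USEC) < capacity;
}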

* map_util_perf() should provide extra headroom to cater for the fact
we have to wait for N us before requesting another update.

So we need to give cpu->util_avg enough headroom to grow comfortably given
that there will be an N us delay before we can update DVFS again due to the
hardware limitation. For faster DVFS h/w we need only a small headroom, knowing
that we can request another update shortly after if we don't go back to idle.
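
A minimal sketch of that headroom (this is what patches 3 and 6 below
implement; dvfs_update_delay is a per-cpu copy of the governor's
rate_limit_us):

/* grow util by the delay we must wait before the next DVFS request */
static inline unsigned long apply_dvfs_headroom(unsigned long util, int cpu)
{
	return approximate_util_avg(util, per_cpu(dvfs_update_delay, cpu));
}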

To cater for the need to tweak these values, we introduce a new schedutil
knob, response_time_ms, to allow userspace to control how fast they want DVFS
to ramp up for a given policy. It can also be slowed down if power/thermal are
a larger concern.
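
Roughly, the knob scales util before the DVFS headroom is applied (this is the
mechanism in patch 7; freq_response_time_ms is the calculated default response
time for the policy):

/* >1024/1024 when the requested response time is shorter than the default */
mult = freq_response_time_ms * SCHED_CAPACITY_SCALE / response_time_ms;
util = (util * mult) >> SCHED_CAPACITY_SHIFT;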

I opted to remove the 1.25 default speed up as it somehow constituted a policy
that is good for some, but not necessarily best for all systems out there. With
the availability of the knob it is better for userspace to learn to tune for
the best perf/power/thermal trade-off.

With uclamp being available, this system tuning should not be necessary if
userspace is smart and conveys task perf requirements directly.

At the end I opted to keep the patch to control PELT HALFLIFE at boot time.
I know this wasn't popular before and I don't want to conjure the wrath of the
titans; but speaking with Vincent about per-task util_est_faster, he seemed to
think that a boot time control might be better. So the matter still seemed
debatable. Generally, like above, with a smarter userspace that uses uclamp
this won't be necessary. But the need I see for it is that we have a constant
model for all systems in the scheduler, and this gives an easy way to help
underpowered ones be more reactive. I am happy to drop this and explore other
alternatives (whatever they might be), but felt it is necessary for a complete
story on how to allow tweaking the ability to migrate faster on underpowered
HMP systems. Remember, today's high end is tomorrow's low end :). For SMP, the
dvfs_response_time is equivalent to a great extent and I don't see that this
knob should offer any additional benefit for them. So maybe make it conditional
on HMP.
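
For reference, the control added in patch 8 is a boot time parameter; per the
patch below it accepts:

	kernel.sched_pelt_multiplier=1		(default, 32ms PELT HALFLIFE)
	kernel.sched_pelt_multiplier=2		(16ms PELT HALFLIFE)
	kernel.sched_pelt_multiplier=4		(8ms PELT HALFLIFE)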

Testing on Pixel 6 running a mainline(ish) kernel, I see the following default
response times in schedutil:

# grep . /sys/devices/system/cpu/cpufreq/policy*/schedutil/*
/sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy0/schedutil/response_time_ms:8
/sys/devices/system/cpu/cpufreq/policy4/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy4/schedutil/response_time_ms:29
/sys/devices/system/cpu/cpufreq/policy6/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy6/schedutil/response_time_ms:155

Changing response time to replicate the 1.25 dvfs headroom (multiply by 0.8)

# grep . /sys/devices/system/cpu/cpufreq/policy*/schedutil/*
/sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy0/schedutil/response_time_ms:6
/sys/devices/system/cpu/cpufreq/policy4/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy4/schedutil/response_time_ms:23
/sys/devices/system/cpu/cpufreq/policy6/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy6/schedutil/response_time_ms:124

When I set PELT HALFLIFE to 16ms I get:

# grep . /sys/devices/system/cpu/cpufreq/policy*/schedutil/*
/sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy0/schedutil/response_time_ms:4
/sys/devices/system/cpu/cpufreq/policy4/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy4/schedutil/response_time_ms:15
/sys/devices/system/cpu/cpufreq/policy6/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy6/schedutil/response_time_ms:78

Changing response time to replicate the 1.25 dvfs headroom

# grep . /sys/devices/system/cpu/cpufreq/policy*/schedutil/*
/sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy0/schedutil/response_time_ms:3
/sys/devices/system/cpu/cpufreq/policy4/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy4/schedutil/response_time_ms:12
/sys/devices/system/cpu/cpufreq/policy6/schedutil/rate_limit_us:2000
/sys/devices/system/cpu/cpufreq/policy6/schedutil/response_time_ms:62

I didn't try 8ms. As you can see from the values for the littles, 16ms PELT
HALFLIFE is not suitable with TICK being 4ms. With the new definition of
fits_capacity(), we'd skip the littles and only use them in the overutilized
state.

Note that I changed the way I calculate the response time. Reaching 1024 takes
324ms (IIRC). But the new calculation takes into account that max_freq will be
reached as soon as util crosses the util for cap@2nd_highest_freq, which is
around ~988 (IIRC) for this system. Hence the 155ms default for the biggest
cluster.
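
For reference, the default for the biggest cluster (capacity 1024) is derived
roughly as follows (numbers as reported above; approximate_runtime() is added
in patch 4):

	cap@2nd_highest_freq = 2nd_highest_freq * 1024 / max_freq  ~= 988
	response_time_ms     = approximate_runtime(988 + 1)        ~= 155ms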

===

Running the Speedometer browser benchmark, I get the following as an average of
10 runs. Power numbers are not super accurate here due to some limitations in
the test setup:
| baseline | patch | 1.25 headroom | 16ms PELT | 16ms + 1.25 headroom
-------+----------+-----------+---------------+-----------+---------------------
score | 135.14 | 108.03 | 135.72 | 137.48 | 143.96
-------+----------+-----------+---------------+-----------+---------------------
power | 1455.49 | 1204.75 | 1451.79 | 1690.38 | 1644.69

Removing the hardcoded values from the margins loses a lot of perf but with a
large power saving. Re-applying the 1.25 headroom policy regains the same perf
and power.

Speeding up PELT has a high power cost on this system. With the 1.25 DVFS
headroom on top there's a decent boost in perf.

===

For UiBench, the numbers are an average of 3 runs (each iteration already
repeats the subtests several times). Power numbers are more accurate here,
though the benchmark sometimes has higher than desired variance from run to
run:

| baseline | patch | 1.25 headroom | 16ms PELT | 16ms + 1.25 headroom
-------+----------+-----------+---------------+-----------+---------------------
jank | 110.3 | 68.0 | 56.0 | 58.6 | 50.3
-------+----------+-----------+---------------+-----------+---------------------
power | 147.76 | 146.54 | 164.49 | 174.97 | 209.92

Removing the hardcoded values from the margins produces a great improvement in
score without any power loss. I haven't done *any* detailed analysis, but
I think it's because fits_capacity() will return false early on littles, so
we're less likely to end up with UI tasks getting stuck there waiting for
a load balance to save the day and migrate the misfit task.

Re-applying the 1.25 DVFS headroom policy gains more perf at a power cost that
is justifiable compared to 'patch'. It's a massive win against the baseline,
which has the 1.25 headroom.

The 16ms PELT HALFLIFE numbers look better compared to 'patch', but at a higher
power cost.

---

Take away: there are perf and power trade-offs associated with these margins
that are hard to abstract completely. Give sysadmins the power to find their
sweet spot meaningfully, and make scheduler behavior constrained by actual h/w
and software limitations.

---

Changes since v1:

* Rebase on top of tip/sched/core with Vincent's rework patch for schedutil
* Fix bugs in approximate_util_avg()/approximate_runtime() (Dietmar)
* Remove usage of aligned per cpu variable (Peter)
* Calculate the response_time_mult once (Peter)
* Tweaked response_time_ms calculation to cater for the fact that max freq
will be reached well before we hit 1024.

v1 discussion link: https://lore.kernel.org/lkml/[email protected]/

Patch 1 changes the default rate_limit_us when the cpufreq driver doesn't
provide a transition delay. The default 10ms is too high for modern hardware.
The patch can be picked up readily.

Patch 2 renames map_util_perf() to apply_dvfs_headroom(), which I believe
better reflects the actual functionality it performs. It also gives it a doc
comment and moves it back to sched.h. The patch can be picked up readily too.

Patches 3 and 4 add helper functions to help convert between util_avg and time.

Patches 5 and 6 use these new functions to implement the logic to make
fits_capacity() and apply_dvfs_headroom() better approximate the limitations of
the system based on TICK and rate_limit_us, as explained earlier.

Patch 7 adds the new response_time_ms tunable to schedutil. The slow-down
functionality has a limitation that is documented.

Patch 8 adds the ability to modify PELT HALFLIFE via a boot time parameter.
Happy to drop this one if it is still hated and the need to cater for low end
underpowered systems doesn't make sense.

Thanks!

--
Qais Yousef

Qais Yousef (7):
cpufreq: Change default transition delay to 2ms
sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom
sched/pelt: Add a new function to approximate the future util_avg
value
sched/pelt: Add a new function to approximate runtime to reach given
util
sched/fair: Remove magic hardcoded margin in fits_capacity()
sched: cpufreq: Remove magic 1.25 headroom from apply_dvfs_headroom()
sched/schedutil: Add a new tunable to dictate response time

Vincent Donnefort (1):
sched/pelt: Introduce PELT multiplier

Documentation/admin-guide/pm/cpufreq.rst | 17 ++-
drivers/cpufreq/cpufreq.c | 8 +-
include/linux/cpufreq.h | 3 +
include/linux/sched/cpufreq.h | 5 -
kernel/sched/core.c | 3 +-
kernel/sched/cpufreq_schedutil.c | 128 +++++++++++++++++++++--
kernel/sched/fair.c | 21 +++-
kernel/sched/pelt.c | 89 ++++++++++++++++
kernel/sched/pelt.h | 42 +++++++-
kernel/sched/sched.h | 31 ++++++
10 files changed, 323 insertions(+), 24 deletions(-)

--
2.34.1


2023-12-08 00:24:41

by Qais Yousef

Subject: [PATCH v2 2/8] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom

We are providing headroom for the utilization to grow until the next
decision point to pick the next frequency. Give the function a better
name and give it some documentation. It is not really mapping anything.

Also move it to sched.h. This function relies on updating the util signal
appropriately to give it headroom to grow. This is more of a scheduler
functionality than a cpufreq one. Move it to sched.h where all the other util
handling code belongs.

Signed-off-by: Qais Yousef (Google) <[email protected]>
---
include/linux/sched/cpufreq.h | 5 -----
kernel/sched/cpufreq_schedutil.c | 2 +-
kernel/sched/sched.h | 17 +++++++++++++++++
3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index bdd31ab93bc5..d01755d3142f 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -28,11 +28,6 @@ static inline unsigned long map_util_freq(unsigned long util,
{
return freq * util / cap;
}
-
-static inline unsigned long map_util_perf(unsigned long util)
-{
- return util + (util >> 2);
-}
#endif /* CONFIG_CPU_FREQ */

#endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 4ee8ad70be99..79c3b96dc02c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -157,7 +157,7 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long max)
{
/* Add dvfs headroom to actual utilization */
- actual = map_util_perf(actual);
+ actual = apply_dvfs_headroom(actual);
/* Actually we don't need to target the max performance */
if (actual < max)
max = actual;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77d..0da3425200b1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3002,6 +3002,23 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long min,
unsigned long max);

+/*
+ * DVFS decision are made at discrete points. If CPU stays busy, the util will
+ * continue to grow, which means it could need to run at a higher frequency
+ * before the next decision point was reached. IOW, we can't follow the util as
+ * it grows immediately, but there's a delay before we issue a request to go to
+ * higher frequency. The headroom caters for this delay so the system continues
+ * to run at adequate performance point.
+ *
+ * This function provides enough headroom to provide adequate performance
+ * assuming the CPU continues to be busy.
+ *
+ * At the moment it is a constant multiplication with 1.25.
+ */
+static inline unsigned long apply_dvfs_headroom(unsigned long util)
+{
+ return util + (util >> 2);
+}

/*
* Verify the fitness of task @p to run on @cpu taking into account the
--
2.34.1

2023-12-08 00:24:42

by Qais Yousef

Subject: [PATCH v2 3/8] sched/pelt: Add a new function to approximate the future util_avg value

Given a util_avg value, the new function will return the future one
given a runtime delta.

This will be useful in later patches to help replace some magic margins
with more deterministic behavior.

Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/sched/pelt.c | 22 +++++++++++++++++++++-
kernel/sched/sched.h | 2 ++
2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 63b6cf898220..81555a8288be 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -466,4 +466,24 @@ int update_irq_load_avg(struct rq *rq, u64 running)

return ret;
}
-#endif
+#endif /* CONFIG_HAVE_SCHED_AVG_IRQ */
+
+/*
+ * Approximate the new util_avg value assuming an entity has continued to run
+ * for @delta us.
+ */
+unsigned long approximate_util_avg(unsigned long util, u64 delta)
+{
+ struct sched_avg sa = {
+ .util_sum = util * PELT_MIN_DIVIDER,
+ .util_avg = util,
+ };
+
+ if (unlikely(!delta))
+ return util;
+
+ accumulate_sum(delta, &sa, 1, 0, 1);
+ ___update_load_avg(&sa, 0);
+
+ return sa.util_avg;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0da3425200b1..7e5a86a376f8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3002,6 +3002,8 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long min,
unsigned long max);

+unsigned long approximate_util_avg(unsigned long util, u64 delta);
+
/*
* DVFS decision are made at discrete points. If CPU stays busy, the util will
* continue to grow, which means it could need to run at a higher frequency
--
2.34.1

2023-12-08 00:24:47

by Qais Yousef

Subject: [PATCH v2 4/8] sched/pelt: Add a new function to approximate runtime to reach given util

It is basically the ramp-up time from 0 to a given value. It will be used
later to implement a new tunable to control the response time of schedutil.

Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/sched/pelt.c | 21 +++++++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 22 insertions(+)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 81555a8288be..00a1b9c1bf16 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -487,3 +487,24 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)

return sa.util_avg;
}
+
+/*
+ * Approximate the required amount of runtime in ms required to reach @util.
+ */
+u64 approximate_runtime(unsigned long util)
+{
+ struct sched_avg sa = {};
+ u64 delta = 1024; // period = 1024 = ~1ms
+ u64 runtime = 0;
+
+ if (unlikely(!util))
+ return runtime;
+
+ while (sa.util_avg < util) {
+ accumulate_sum(delta, &sa, 1, 0, 1);
+ ___update_load_avg(&sa, 0);
+ runtime++;
+ }
+
+ return runtime;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e5a86a376f8..2de64f59853c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3003,6 +3003,7 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long max);

unsigned long approximate_util_avg(unsigned long util, u64 delta);
+u64 approximate_runtime(unsigned long util);

/*
* DVFS decision are made at discrete points. If CPU stays busy, the util will
--
2.34.1

2023-12-08 00:24:47

by Qais Yousef

Subject: [PATCH v2 7/8] sched/schedutil: Add a new tunable to dictate response time

The new tunable, response_time_ms, allows us to speed up or slow down
the response time of the policy to meet the perf, power and thermal
characteristics desired by the user/sysadmin. There's no single universal
trade-off that we can apply for all systems even if they use the same
SoC. The form factor of the system, the dominant use case, and in case
of battery powered systems, the size of the battery and presence or
absence of active cooling can play a big role on what would be best to
use.

The new tunable provides sensible defaults, but yet gives the power to
control the response time to the user/sysadmin, if they wish to.

This tunable is applied before we apply the DVFS headroom.

The default behavior of applying 1.25 headroom can be re-instated easily
now. But we continue to keep the min required headroom to overcome
hardware limitation in its speed to change DVFS. And any additional
headroom to speed things up must be applied by userspace to match their
expectation for best perf/watt as it dictates a type of policy that will
be better for some systems, but worse for others.

There's a whitespace clean up included in sugov_start().

Signed-off-by: Qais Yousef (Google) <[email protected]>
---
Documentation/admin-guide/pm/cpufreq.rst | 17 +++-
drivers/cpufreq/cpufreq.c | 4 +-
include/linux/cpufreq.h | 3 +
kernel/sched/cpufreq_schedutil.c | 115 ++++++++++++++++++++++-
4 files changed, 132 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index 6adb7988e0eb..fa0d602a920e 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -417,7 +417,7 @@ is passed by the scheduler to the governor callback which causes the frequency
to go up to the allowed maximum immediately and then draw back to the value
returned by the above formula over time.

-This governor exposes only one tunable:
+This governor exposes two tunables:

``rate_limit_us``
Minimum time (in microseconds) that has to pass between two consecutive
@@ -427,6 +427,21 @@ This governor exposes only one tunable:
The purpose of this tunable is to reduce the scheduler context overhead
of the governor which might be excessive without it.

+``response_time_ms``
+ Amount of time (in milliseconds) required to ramp the policy from
+ lowest to highest frequency. Can be decreased to speed up the
+ responsiveness of the system, or increased to slow the system down in
+ hope to save power. The best perf/watt will depend on the system
+ characteristics and the dominant workload you expect to run. For
+ userspace that has smart context on the type of workload running (like
+ in Android), one can tune this to suit the demand of that workload.
+
+ Note that when slowing the response down, you can end up effectively
+ chopping off the top frequencies for that policy as the util is capped
+ to 1024. On HMP systems this chopping effect will only occur on the
+ biggest core whose capacity is 1024. Don't rely on this behavior as
+ this is a limitation that can hopefully be improved in the future.
+
This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 9875284ca6e4..15c397ce3252 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -533,8 +533,8 @@ void cpufreq_disable_fast_switch(struct cpufreq_policy *policy)
}
EXPORT_SYMBOL_GPL(cpufreq_disable_fast_switch);

-static unsigned int __resolve_freq(struct cpufreq_policy *policy,
- unsigned int target_freq, unsigned int relation)
+unsigned int __resolve_freq(struct cpufreq_policy *policy,
+ unsigned int target_freq, unsigned int relation)
{
unsigned int idx;

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 1c5ca92a0555..29c3723653a3 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -613,6 +613,9 @@ int cpufreq_driver_target(struct cpufreq_policy *policy,
int __cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
+unsigned int __resolve_freq(struct cpufreq_policy *policy,
+ unsigned int target_freq,
+ unsigned int relation);
unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
unsigned int target_freq);
unsigned int cpufreq_policy_transition_delay_us(struct cpufreq_policy *policy);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 1d4d6025c15f..788208becc13 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -8,9 +8,12 @@

#define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8)

+DEFINE_PER_CPU_READ_MOSTLY(unsigned long, response_time_mult);
+
struct sugov_tunables {
struct gov_attr_set attr_set;
unsigned int rate_limit_us;
+ unsigned int response_time_ms;
};

struct sugov_policy {
@@ -22,6 +25,7 @@ struct sugov_policy {
raw_spinlock_t update_lock;
u64 last_freq_update_time;
s64 freq_update_delay_ns;
+ unsigned int freq_response_time_ms;
unsigned int next_freq;
unsigned int cached_raw_freq;

@@ -59,6 +63,70 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);

/************************ Governor internals ***********************/

+static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
+{
+ int cpu = cpumask_first(sg_policy->policy->cpus);
+ unsigned long cap = arch_scale_cpu_capacity(cpu);
+ unsigned int max_freq, sec_max_freq;
+
+ max_freq = sg_policy->policy->cpuinfo.max_freq;
+ sec_max_freq = __resolve_freq(sg_policy->policy,
+ max_freq - 1,
+ CPUFREQ_RELATION_H);
+
+ /*
+ * We will request max_freq as soon as util crosses the capacity at
+ * second highest frequency. So effectively our response time is the
+ * util at which we cross the cap@2nd_highest_freq.
+ */
+ cap = sec_max_freq * cap / max_freq;
+
+ return approximate_runtime(cap + 1);
+}
+
+static inline void sugov_update_response_time_mult(struct sugov_policy *sg_policy)
+{
+ unsigned long mult;
+ int cpu;
+
+ if (unlikely(!sg_policy->freq_response_time_ms))
+ sg_policy->freq_response_time_ms = sugov_calc_freq_response_ms(sg_policy);
+
+ mult = sg_policy->freq_response_time_ms * SCHED_CAPACITY_SCALE;
+ mult /= sg_policy->tunables->response_time_ms;
+
+ if (SCHED_WARN_ON(!mult))
+ mult = SCHED_CAPACITY_SCALE;
+
+ for_each_cpu(cpu, sg_policy->policy->cpus)
+ per_cpu(response_time_mult, cpu) = mult;
+}
+
+/*
+ * Shrink or expand how long it takes to reach the maximum performance of the
+ * policy.
+ *
+ * sg_policy->freq_response_time_ms is a constant value defined by PELT
+ * HALFLIFE and the capacity of the policy (assuming HMP systems).
+ *
+ * sg_policy->tunables->response_time_ms is a user defined response time. By
+ * setting it lower than sg_policy->freq_response_time_ms, the system will
+ * respond faster to changes in util, which will result in reaching maximum
+ * performance point quicker. By setting it higher, it'll slow down the amount
+ * of time required to reach the maximum OPP.
+ *
+ * This should be applied when selecting the frequency.
+ */
+static inline unsigned long
+sugov_apply_response_time(unsigned long util, int cpu)
+{
+ unsigned long mult;
+
+ mult = per_cpu(response_time_mult, cpu) * util;
+
+ return mult >> SCHED_CAPACITY_SHIFT;
+}
+
static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
{
s64 delta_ns;
@@ -156,7 +224,10 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long min,
unsigned long max)
{
- /* Add dvfs headroom to actual utilization */
+ /*
+ * Speed up/slow down response time first then apply DVFS headroom.
+ */
+ actual = sugov_apply_response_time(actual, cpu);
actual = apply_dvfs_headroom(actual, cpu);
/* Actually we don't need to target the max performance */
if (actual < max)
@@ -555,8 +626,42 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count

static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);

+static ssize_t response_time_ms_show(struct gov_attr_set *attr_set, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+ return sprintf(buf, "%u\n", tunables->response_time_ms);
+}
+
+static ssize_t
+response_time_ms_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+ struct sugov_policy *sg_policy;
+ unsigned int response_time_ms;
+
+ if (kstrtouint(buf, 10, &response_time_ms))
+ return -EINVAL;
+
+ /* XXX need special handling for high values? */
+
+ tunables->response_time_ms = response_time_ms;
+
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+ if (sg_policy->tunables == tunables) {
+ sugov_update_response_time_mult(sg_policy);
+ break;
+ }
+ }
+
+ return count;
+}
+
+static struct governor_attr response_time_ms = __ATTR_RW(response_time_ms);
+
static struct attribute *sugov_attrs[] = {
&rate_limit_us.attr,
+ &response_time_ms.attr,
NULL
};
ATTRIBUTE_GROUPS(sugov);
@@ -744,11 +849,13 @@ static int sugov_init(struct cpufreq_policy *policy)
goto stop_kthread;
}

- tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
-
policy->governor_data = sg_policy;
sg_policy->tunables = tunables;

+ tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
+ tunables->response_time_ms = sugov_calc_freq_response_ms(sg_policy);
+ sugov_update_response_time_mult(sg_policy);
+
ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
get_governor_parent_kobj(policy), "%s",
schedutil_gov.name);
@@ -808,7 +915,7 @@ static int sugov_start(struct cpufreq_policy *policy)
void (*uu)(struct update_util_data *data, u64 time, unsigned int flags);
unsigned int cpu;

- sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
sg_policy->last_freq_update_time = 0;
sg_policy->next_freq = 0;
sg_policy->work_in_progress = false;
--
2.34.1

2023-12-08 00:24:57

by Qais Yousef

Subject: [PATCH v2 6/8] sched: cpufreq: Remove magic 1.25 headroom from apply_dvfs_headroom()

Replace 1.25 headroom in apply_dvfs_headroom() with better dynamic
logic.

Instead of the magical 1.25 headroom, use the new approximate_util_avg()
to provide headroom based on the dvfs_update_delay, which is the period
at which the cpufreq governor will send DVFS updates to the hardware.

Add a new percpu dvfs_update_delay that can be cheaply accessed whenever
apply_dvfs_headroom() is called. We expect cpufreq governors that rely
on util to drive their DVFS logic/algorithm to populate these percpu
variables. schedutil is the only such governor at the moment.

The behavior of schedutil will change, as the headroom will be less than
1.25 for most systems since rate_limit_us is usually short.

Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/sched/core.c | 1 +
kernel/sched/cpufreq_schedutil.c | 13 +++++++++++--
kernel/sched/sched.h | 18 ++++++++++++++----
3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f..b4a1c8ea9e12 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -116,6 +116,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);

DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_READ_MOSTLY(u64, dvfs_update_delay);

#ifdef CONFIG_SCHED_DEBUG
/*
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 79c3b96dc02c..1d4d6025c15f 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -157,7 +157,7 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long max)
{
/* Add dvfs headroom to actual utilization */
- actual = apply_dvfs_headroom(actual);
+ actual = apply_dvfs_headroom(actual, cpu);
/* Actually we don't need to target the max performance */
if (actual < max)
max = actual;
@@ -535,15 +535,21 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
struct sugov_policy *sg_policy;
unsigned int rate_limit_us;
+ int cpu;

if (kstrtouint(buf, 10, &rate_limit_us))
return -EINVAL;

tunables->rate_limit_us = rate_limit_us;

- list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+
sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;

+ for_each_cpu(cpu, sg_policy->policy->cpus)
+ per_cpu(dvfs_update_delay, cpu) = rate_limit_us;
+ }
+
return count;
}

@@ -824,6 +830,9 @@ static int sugov_start(struct cpufreq_policy *policy)
memset(sg_cpu, 0, sizeof(*sg_cpu));
sg_cpu->cpu = cpu;
sg_cpu->sg_policy = sg_policy;
+
+ per_cpu(dvfs_update_delay, cpu) = sg_policy->tunables->rate_limit_us;
+
cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, uu);
}
return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2de64f59853c..bbece0eb053a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3005,6 +3005,15 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long approximate_util_avg(unsigned long util, u64 delta);
u64 approximate_runtime(unsigned long util);

+/*
+ * Any governor that relies on util signal to drive DVFS, must populate these
+ * percpu dvfs_update_delay variables.
+ *
+ * It should describe the rate/delay at which the governor sends DVFS freq
+ * update to the hardware in us.
+ */
+DECLARE_PER_CPU_READ_MOSTLY(u64, dvfs_update_delay);
+
/*
* DVFS decision are made at discrete points. If CPU stays busy, the util will
* continue to grow, which means it could need to run at a higher frequency
@@ -3014,13 +3023,14 @@ u64 approximate_runtime(unsigned long util);
* to run at adequate performance point.
*
* This function provides enough headroom to provide adequate performance
- * assuming the CPU continues to be busy.
+ * assuming the CPU continues to be busy. This headroom is based on the
+ * dvfs_update_delay of the cpufreq governor.
*
- * At the moment it is a constant multiplication with 1.25.
+ * XXX: Should we provide headroom when the util is decaying?
*/
-static inline unsigned long apply_dvfs_headroom(unsigned long util)
+static inline unsigned long apply_dvfs_headroom(unsigned long util, int cpu)
{
- return util + (util >> 2);
+ return approximate_util_avg(util, per_cpu(dvfs_update_delay, cpu));
}

/*
--
2.34.1

2023-12-08 00:24:57

by Qais Yousef

Subject: [PATCH v2 8/8] sched/pelt: Introduce PELT multiplier

From: Vincent Donnefort <[email protected]>

The new sched_pelt_multiplier boot param allows a user to set a clock
multiplier to x2 or x4 (x1 being the default). This clock multiplier
artificially speeds up PELT ramp up/down similarly to use a faster
half-life than the default 32ms.

- x1: 32ms half-life
- x2: 16ms half-life
- x4: 8ms half-life

Internally, a new clock is created: rq->clock_task_mult. It sits in the
clock hierarchy between rq->clock_task and rq->clock_pelt.

The param is set as read only and can only be changed at boot time via

kernel.sched_pelt_multiplier=[1, 2, 4]

PELT has a big impact on the overall system response and reactiveness to
change. A smaller PELT HF means it'll require less time to reach the
maximum performance point of the system when the system becomes fully
busy; and equally a shorter time to go back to the lowest performance point
when the system goes back to idle.

This faster reaction impacts both the DVFS response and the migration time
between clusters in HMP systems.

Smaller PELT values are expected to give better performance at the cost
of more power. Underpowered systems can particularly benefit from
smaller values. Powerful systems can still benefit from smaller values
if they want to be tuned towards perf more and power is not the major
concern for them.

This combined with response_time_ms from schedutil should give the user
and sysadmin a deterministic way to control the triangular power, perf
and thermal trade-off for their system. The default response_time_ms will
halve as the PELT HF halves.

Update approximate_{util_avg, runtime}() to take into account the PELT
HALFLIFE multiplier.

Signed-off-by: Vincent Donnefort <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
[Converted from sysctl to boot param and updated commit message]
Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/pelt.c | 52 ++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/pelt.h | 42 +++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 1 +
4 files changed, 90 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4a1c8ea9e12..9c8626b4ddff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -745,7 +745,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
update_irq_load_avg(rq, irq_delta + steal);
#endif
- update_rq_clock_pelt(rq, delta);
+ update_rq_clock_task_mult(rq, delta);
}

void update_rq_clock(struct rq *rq)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 00a1b9c1bf16..0a10e56f76c7 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -468,6 +468,54 @@ int update_irq_load_avg(struct rq *rq, u64 running)
}
#endif /* CONFIG_HAVE_SCHED_AVG_IRQ */

+__read_mostly unsigned int sched_pelt_lshift;
+static unsigned int sched_pelt_multiplier = 1;
+
+static int set_sched_pelt_multiplier(const char *val, const struct kernel_param *kp)
+{
+ int ret;
+
+ ret = param_set_int(val, kp);
+ if (ret)
+ goto error;
+
+ switch (sched_pelt_multiplier) {
+ case 1:
+ fallthrough;
+ case 2:
+ fallthrough;
+ case 4:
+ WRITE_ONCE(sched_pelt_lshift,
+ sched_pelt_multiplier >> 1);
+ break;
+ default:
+ ret = -EINVAL;
+ goto error;
+ }
+
+ return 0;
+
+error:
+ sched_pelt_multiplier = 1;
+ return ret;
+}
+
+static const struct kernel_param_ops sched_pelt_multiplier_ops = {
+ .set = set_sched_pelt_multiplier,
+ .get = param_get_int,
+};
+
+#ifdef MODULE_PARAM_PREFIX
+#undef MODULE_PARAM_PREFIX
+#endif
+/* XXX: should we use sched as prefix? */
+#define MODULE_PARAM_PREFIX "kernel."
+module_param_cb(sched_pelt_multiplier, &sched_pelt_multiplier_ops, &sched_pelt_multiplier, 0444);
+MODULE_PARM_DESC(sched_pelt_multiplier, "PELT HALFLIFE helps control the responsiveness of the system.");
+MODULE_PARM_DESC(sched_pelt_multiplier, "Accepted value: 1 32ms PELT HALIFE - roughly 200ms to go from 0 to max performance point (default).");
+MODULE_PARM_DESC(sched_pelt_multiplier, " 2 16ms PELT HALIFE - roughly 100ms to go from 0 to max performance point.");
+MODULE_PARM_DESC(sched_pelt_multiplier, " 4 8ms PELT HALIFE - roughly 50ms to go from 0 to max performance point.");
+
/*
* Approximate the new util_avg value assuming an entity has continued to run
* for @delta us.
@@ -482,7 +530,7 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
if (unlikely(!delta))
return util;

- accumulate_sum(delta, &sa, 1, 0, 1);
+ accumulate_sum(delta << sched_pelt_lshift, &sa, 1, 0, 1);
___update_load_avg(&sa, 0);

return sa.util_avg;
@@ -494,7 +542,7 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
u64 approximate_runtime(unsigned long util)
{
struct sched_avg sa = {};
- u64 delta = 1024; // period = 1024 = ~1ms
+ u64 delta = 1024 << sched_pelt_lshift; // period = 1024 = ~1ms
u64 runtime = 0;

if (unlikely(!util))
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 3a0e0dc28721..9b35b5072bae 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
WRITE_ONCE(avg->util_est.enqueued, enqueued);
}

+static inline u64 rq_clock_task_mult(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+ assert_clock_updated(rq);
+
+ return rq->clock_task_mult;
+}
+
static inline u64 rq_clock_pelt(struct rq *rq)
{
lockdep_assert_rq_held(rq);
@@ -72,7 +80,7 @@ static inline u64 rq_clock_pelt(struct rq *rq)
/* The rq is idle, we can sync to clock_task */
static inline void _update_idle_rq_clock_pelt(struct rq *rq)
{
- rq->clock_pelt = rq_clock_task(rq);
+ rq->clock_pelt = rq_clock_task_mult(rq);

u64_u32_store(rq->clock_idle, rq_clock(rq));
/* Paired with smp_rmb in migrate_se_pelt_lag() */
@@ -121,6 +129,27 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
rq->clock_pelt += delta;
}

+extern unsigned int sched_pelt_lshift;
+
+/*
+ * absolute time |1 |2 |3 |4 |5 |6 |
+ * @ mult = 1 --------****************--------****************-
+ * @ mult = 2 --------********----------------********---------
+ * @ mult = 4 --------****--------------------****-------------
+ * clock task mult
+ * @ mult = 2 | | |2 |3 | | | | |5 |6 | | |
+ * @ mult = 4 | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
+ *
+ */
+static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
+{
+ delta <<= READ_ONCE(sched_pelt_lshift);
+
+ rq->clock_task_mult += delta;
+
+ update_rq_clock_pelt(rq, delta);
+}
+
/*
* When rq becomes idle, we have to check if it has lost idle time
* because it was fully busy. A rq is fully used when the /Sum util_sum
@@ -147,7 +176,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
* rq's clock_task.
*/
if (util_sum >= divider)
- rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
+ rq->lost_idle_time += rq_clock_task_mult(rq) - rq->clock_pelt;

_update_idle_rq_clock_pelt(rq);
}
@@ -218,13 +247,18 @@ update_irq_load_avg(struct rq *rq, u64 running)
return 0;
}

-static inline u64 rq_clock_pelt(struct rq *rq)
+static inline u64 rq_clock_task_mult(struct rq *rq)
{
return rq_clock_task(rq);
}

+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+ return rq_clock_task_mult(rq);
+}
+
static inline void
-update_rq_clock_pelt(struct rq *rq, s64 delta) { }
+update_rq_clock_task_mult(struct rq *rq, s64 delta) { }

static inline void
update_idle_rq_clock_pelt(struct rq *rq) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bbece0eb053a..a7c89c623250 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1029,6 +1029,7 @@ struct rq {
u64 clock;
/* Ensure that all clocks are in the same cache line */
u64 clock_task ____cacheline_aligned;
+ u64 clock_task_mult;
u64 clock_pelt;
unsigned long lost_idle_time;
u64 clock_pelt_idle;
--
2.34.1

2023-12-08 00:24:57

by Qais Yousef

Subject: [PATCH v2 5/8] sched/fair: Remove magic hardcoded margin in fits_capacity()

Replace hardcoded margin value in fits_capacity() with better dynamic
logic.

The 80% margin is a magic value that has served its purpose for now, but it
no longer fits the variety of systems that exist today. If a system is
overpowered specifically, this 80% means we leave a lot of capacity unused
before we decide to upmigrate on an HMP system.

On some systems the little cores are underpowered and the ability to migrate
away from them faster is desired.

The upmigration behavior should rely on the fact that a bad placement decision
will need load balance to kick in to perform a misfit migration. And I think
this is an adequate definition for what to consider as enough headroom when
deciding whether a util fits a capacity or not.

Use the new approximate_util_avg() function to predict the util if the
task continues to run for TICK_USEC. If the value is not strictly less
than the capacity, then the task must not be placed there, i.e. it is
considered misfit.

Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/sched/fair.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcea3d55d95d..b83448be3f79 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -101,16 +101,31 @@ int __weak arch_asym_cpu_priority(int cpu)
}

/*
- * The margin used when comparing utilization with CPU capacity.
+ * The util will fit the capacity if it has enough headroom to grow within the
+ * next tick - which is when any load balancing activity happens to do the
+ * correction.
*
- * (default: ~20%)
+ * If util stays within the capacity before tick has elapsed, then it should be
+ * fine. If not, then a correction action must happen shortly after it starts
+ * running, hence we treat it as !fit.
+ *
+ * TODO: TICK is not actually accurate enough. balance_interval is the correct
+ * one to use as the next load balance doesn't happen religiously at tick.
+ * Accessing balance_interval might be tricky and will require some refactoring
+ * first.
*/
-#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
+static inline bool fits_capacity(unsigned long util, unsigned long capacity)
+{
+ return approximate_util_avg(util, TICK_USEC) < capacity;
+}

/*
* The margin used when comparing CPU capacities.
* is 'cap1' noticeably greater than 'cap2'
*
+ * TODO: use approximate_util_avg() to give something more quantifiable based
+ * on time? Like 1ms?
+ *
* (default: ~5%)
*/
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
--
2.34.1

2023-12-08 18:07:08

by Rafael J. Wysocki

Subject: Re: [PATCH v2 7/8] sched/schedutil: Add a new tunable to dictate response time

On Fri, Dec 8, 2023 at 1:24 AM Qais Yousef <[email protected]> wrote:
>
> The new tunable, response_time_ms, allow us to speed up or slow down
> the response time of the policy to meet the perf, power and thermal
> characteristic desired by the user/sysadmin. There's no single universal
> trade-off that we can apply for all systems even if they use the same
> SoC. The form factor of the system, the dominant use case, and in case
> of battery powered systems, the size of the battery and presence or
> absence of active cooling can play a big role on what would be best to
> use.
>
> The new tunable provides sensible defaults, but yet gives the power to
> control the response time to the user/sysadmin, if they wish to.
>
> This tunable is applied before we apply the DVFS headroom.
>
> The default behavior of applying 1.25 headroom can be re-instated easily
> now. But we continue to keep the min required headroom to overcome
> hardware limitation in its speed to change DVFS. And any additional
> headroom to speed things up must be applied by userspace to match their
> expectation for best perf/watt as it dictates a type of policy that will
> be better for some systems, but worse for others.
>
> There's a whitespace clean up included in sugov_start().
>
> Signed-off-by: Qais Yousef (Google) <[email protected]>

I thought that there was an agreement to avoid adding any new tunables
to schedutil.

> ---
> Documentation/admin-guide/pm/cpufreq.rst | 17 +++-
> drivers/cpufreq/cpufreq.c | 4 +-
> include/linux/cpufreq.h | 3 +
> kernel/sched/cpufreq_schedutil.c | 115 ++++++++++++++++++++++-
> 4 files changed, 132 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
> index 6adb7988e0eb..fa0d602a920e 100644
> --- a/Documentation/admin-guide/pm/cpufreq.rst
> +++ b/Documentation/admin-guide/pm/cpufreq.rst
> @@ -417,7 +417,7 @@ is passed by the scheduler to the governor callback which causes the frequency
> to go up to the allowed maximum immediately and then draw back to the value
> returned by the above formula over time.
>
> -This governor exposes only one tunable:
> +This governor exposes two tunables:
>
> ``rate_limit_us``
> Minimum time (in microseconds) that has to pass between two consecutive
> @@ -427,6 +427,21 @@ This governor exposes only one tunable:
> The purpose of this tunable is to reduce the scheduler context overhead
> of the governor which might be excessive without it.
>
> +``respone_time_ms``
> + Amount of time (in milliseconds) required to ramp the policy from
> + lowest to highest frequency. Can be decreased to speed up the
> + responsiveness of the system, or increased to slow the system down in
> + hope to save power. The best perf/watt will depend on the system
> + characteristics and the dominant workload you expect to run. For
> + userspace that has smart context on the type of workload running (like
> + in Android), one can tune this to suite the demand of that workload.
> +
> + Note that when slowing the response down, you can end up effectively
> + chopping off the top frequencies for that policy as the util is capped
> + to 1024. On HMP systems this chopping effect will only occur on the
> + biggest core whose capacity is 1024. Don't rely on this behavior as
> + this is a limitation that can hopefully be improved in the future.
> +
> This governor generally is regarded as a replacement for the older `ondemand`_
> and `conservative`_ governors (described below), as it is simpler and more
> tightly integrated with the CPU scheduler, its overhead in terms of CPU context
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 9875284ca6e4..15c397ce3252 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -533,8 +533,8 @@ void cpufreq_disable_fast_switch(struct cpufreq_policy *policy)
> }
> EXPORT_SYMBOL_GPL(cpufreq_disable_fast_switch);
>
> -static unsigned int __resolve_freq(struct cpufreq_policy *policy,
> - unsigned int target_freq, unsigned int relation)
> +unsigned int __resolve_freq(struct cpufreq_policy *policy,
> + unsigned int target_freq, unsigned int relation)
> {
> unsigned int idx;
>
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 1c5ca92a0555..29c3723653a3 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -613,6 +613,9 @@ int cpufreq_driver_target(struct cpufreq_policy *policy,
> int __cpufreq_driver_target(struct cpufreq_policy *policy,
> unsigned int target_freq,
> unsigned int relation);
> +unsigned int __resolve_freq(struct cpufreq_policy *policy,
> + unsigned int target_freq,
> + unsigned int relation);
> unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
> unsigned int target_freq);
> unsigned int cpufreq_policy_transition_delay_us(struct cpufreq_policy *policy);
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 1d4d6025c15f..788208becc13 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -8,9 +8,12 @@
>
> #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8)
>
> +DEFINE_PER_CPU_READ_MOSTLY(unsigned long, response_time_mult);
> +
> struct sugov_tunables {
> struct gov_attr_set attr_set;
> unsigned int rate_limit_us;
> + unsigned int response_time_ms;
> };
>
> struct sugov_policy {
> @@ -22,6 +25,7 @@ struct sugov_policy {
> raw_spinlock_t update_lock;
> u64 last_freq_update_time;
> s64 freq_update_delay_ns;
> + unsigned int freq_response_time_ms;
> unsigned int next_freq;
> unsigned int cached_raw_freq;
>
> @@ -59,6 +63,70 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>
> /************************ Governor internals ***********************/
>
> +static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
> +{
> + int cpu = cpumask_first(sg_policy->policy->cpus);
> + unsigned long cap = arch_scale_cpu_capacity(cpu);
> + unsigned int max_freq, sec_max_freq;
> +
> + max_freq = sg_policy->policy->cpuinfo.max_freq;
> + sec_max_freq = __resolve_freq(sg_policy->policy,
> + max_freq - 1,
> + CPUFREQ_RELATION_H);
> +
> + /*
> + * We will request max_freq as soon as util crosses the capacity at
> + * second highest frequency. So effectively our response time is the
> + * util at which we cross the cap@2nd_highest_freq.
> + */
> + cap = sec_max_freq * cap / max_freq;
> +
> + return approximate_runtime(cap + 1);
> +}
> +
> +static inline void sugov_update_response_time_mult(struct sugov_policy *sg_policy)
> +{
> + unsigned long mult;
> + int cpu;
> +
> + if (unlikely(!sg_policy->freq_response_time_ms))
> + sg_policy->freq_response_time_ms = sugov_calc_freq_response_ms(sg_policy);
> +
> + mult = sg_policy->freq_response_time_ms * SCHED_CAPACITY_SCALE;
> + mult /= sg_policy->tunables->response_time_ms;
> +
> + if (SCHED_WARN_ON(!mult))
> + mult = SCHED_CAPACITY_SCALE;
> +
> + for_each_cpu(cpu, sg_policy->policy->cpus)
> + per_cpu(response_time_mult, cpu) = mult;
> +}
> +
> +/*
> + * Shrink or expand how long it takes to reach the maximum performance of the
> + * policy.
> + *
> + * sg_policy->freq_response_time_ms is a constant value defined by PELT
> + * HALFLIFE and the capacity of the policy (assuming HMP systems).
> + *
> + * sg_policy->tunables->response_time_ms is a user defined response time. By
> + * setting it lower than sg_policy->freq_response_time_ms, the system will
> + * respond faster to changes in util, which will result in reaching maximum
> + * performance point quicker. By setting it higher, it'll slow down the amount
> + * of time required to reach the maximum OPP.
> + *
> + * This should be applied when selecting the frequency.
> + */
> +static inline unsigned long
> +sugov_apply_response_time(unsigned long util, int cpu)
> +{
> + unsigned long mult;
> +
> + mult = per_cpu(response_time_mult, cpu) * util;
> +
> + return mult >> SCHED_CAPACITY_SHIFT;
> +}
> +
> static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> {
> s64 delta_ns;
> @@ -156,7 +224,10 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
> unsigned long min,
> unsigned long max)
> {
> - /* Add dvfs headroom to actual utilization */
> + /*
> + * Speed up/slow down response timee first then apply DVFS headroom.
> + */
> + actual = sugov_apply_response_time(actual, cpu);
> actual = apply_dvfs_headroom(actual, cpu);
> /* Actually we don't need to target the max performance */
> if (actual < max)
> @@ -555,8 +626,42 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
>
> static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
>
> +static ssize_t response_time_ms_show(struct gov_attr_set *attr_set, char *buf)
> +{
> + struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> +
> + return sprintf(buf, "%u\n", tunables->response_time_ms);
> +}
> +
> +static ssize_t
> +response_time_ms_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
> +{
> + struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> + struct sugov_policy *sg_policy;
> + unsigned int response_time_ms;
> +
> + if (kstrtouint(buf, 10, &response_time_ms))
> + return -EINVAL;
> +
> + /* XXX need special handling for high values? */
> +
> + tunables->response_time_ms = response_time_ms;
> +
> + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
> + if (sg_policy->tunables == tunables) {
> + sugov_update_response_time_mult(sg_policy);
> + break;
> + }
> + }
> +
> + return count;
> +}
> +
> +static struct governor_attr response_time_ms = __ATTR_RW(response_time_ms);
> +
> static struct attribute *sugov_attrs[] = {
> &rate_limit_us.attr,
> + &response_time_ms.attr,
> NULL
> };
> ATTRIBUTE_GROUPS(sugov);
> @@ -744,11 +849,13 @@ static int sugov_init(struct cpufreq_policy *policy)
> goto stop_kthread;
> }
>
> - tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
> -
> policy->governor_data = sg_policy;
> sg_policy->tunables = tunables;
>
> + tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
> + tunables->response_time_ms = sugov_calc_freq_response_ms(sg_policy);
> + sugov_update_response_time_mult(sg_policy);
> +
> ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
> get_governor_parent_kobj(policy), "%s",
> schedutil_gov.name);
> @@ -808,7 +915,7 @@ static int sugov_start(struct cpufreq_policy *policy)
> void (*uu)(struct update_util_data *data, u64 time, unsigned int flags);
> unsigned int cpu;
>
> - sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
> + sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
> sg_policy->last_freq_update_time = 0;
> sg_policy->next_freq = 0;
> sg_policy->work_in_progress = false;
> --
> 2.34.1
>

2023-12-10 20:41:06

by Qais Yousef

Subject: Re: [PATCH v2 7/8] sched/schedutil: Add a new tunable to dictate response time

On 12/08/23 19:06, Rafael J. Wysocki wrote:
> On Fri, Dec 8, 2023 at 1:24 AM Qais Yousef <[email protected]> wrote:
> >
> > The new tunable, response_time_ms, allow us to speed up or slow down
> > the response time of the policy to meet the perf, power and thermal
> > characteristic desired by the user/sysadmin. There's no single universal
> > trade-off that we can apply for all systems even if they use the same
> > SoC. The form factor of the system, the dominant use case, and in case
> > of battery powered systems, the size of the battery and presence or
> > absence of active cooling can play a big role on what would be best to
> > use.
> >
> > The new tunable provides sensible defaults, but yet gives the power to
> > control the response time to the user/sysadmin, if they wish to.
> >
> > This tunable is applied before we apply the DVFS headroom.
> >
> > The default behavior of applying 1.25 headroom can be re-instated easily
> > now. But we continue to keep the min required headroom to overcome
> > hardware limitation in its speed to change DVFS. And any additional
> > headroom to speed things up must be applied by userspace to match their
> > expectation for best perf/watt as it dictates a type of policy that will
> > be better for some systems, but worse for others.
> >
> > There's a whitespace clean up included in sugov_start().
> >
> > Signed-off-by: Qais Yousef (Google) <[email protected]>
>
> I thought that there was an agreement to avoid adding any new tunables
> to schedutil.

Oh. I didn't know that.

What alternatives do we have? I couldn't see how we can universally make the
response work for every possible system (not just SoC, but different platforms
with the same SoC even) and workloads. We see big power savings with no or
little perf impact on many workloads when not applying the current 125%. Others
want to push it faster under gaming scenarios etc to get more stable FPS.

Hopefully uclamp will make the need for this tuning obsolete over time. But
until userspace gains critical mass, I can't see how we can know the best
trade-offs for all the myriads of use cases/systems.

Some are happy to gain more perf and lose power. Others prefer to save power
over perf. DVFS response time plays a critical role in this trade-off and I'm
not sure how we can crystal ball it without delegating.


Thanks!

--
Qais Yousef

2023-12-11 20:21:58

by Rafael J. Wysocki

Subject: Re: [PATCH v2 7/8] sched/schedutil: Add a new tunable to dictate response time

On Sun, Dec 10, 2023 at 9:40 PM Qais Yousef <[email protected]> wrote:
>
> On 12/08/23 19:06, Rafael J. Wysocki wrote:
> > On Fri, Dec 8, 2023 at 1:24 AM Qais Yousef <[email protected]> wrote:
> > >
> > > The new tunable, response_time_ms, allow us to speed up or slow down
> > > the response time of the policy to meet the perf, power and thermal
> > > characteristic desired by the user/sysadmin. There's no single universal
> > > trade-off that we can apply for all systems even if they use the same
> > > SoC. The form factor of the system, the dominant use case, and in case
> > > of battery powered systems, the size of the battery and presence or
> > > absence of active cooling can play a big role on what would be best to
> > > use.
> > >
> > > The new tunable provides sensible defaults, but yet gives the power to
> > > control the response time to the user/sysadmin, if they wish to.
> > >
> > > This tunable is applied before we apply the DVFS headroom.
> > >
> > > The default behavior of applying 1.25 headroom can be re-instated easily
> > > now. But we continue to keep the min required headroom to overcome
> > > hardware limitation in its speed to change DVFS. And any additional
> > > headroom to speed things up must be applied by userspace to match their
> > > expectation for best perf/watt as it dictates a type of policy that will
> > > be better for some systems, but worse for others.
> > >
> > > There's a whitespace clean up included in sugov_start().
> > >
> > > Signed-off-by: Qais Yousef (Google) <[email protected]>
> >
> > I thought that there was an agreement to avoid adding any new tunables
> > to schedutil.
>
> Oh. I didn't know that.
>
> What alternatives do we have? I couldn't see how can we universally make the
> response work for every possible system (not just SoC, but different platforms
> with same SoC even) and workloads. We see big power saving with no or little
> perf impact on many workloads when not applying the current 125%. Others want
> to push it faster under gaming scenarios etc to get more stable FPS.
>
> Hopefully uclamp will make the need for this tuning obsolete over time. But
> until userspace gains critical mass; I can't see how we can know best
> trade-offs for all myriads of use cases/systems.
>
> Some are happy to gain more perf and lose power. Others prefer to save power
> over perf. DVFS response time plays a critical role in this trade-off and I'm
> not sure how we can crystal ball it without delegating.

I understand the motivation, but the counter-arguments are based on
experience with the cpufreq governors predating schedutil, especially
ondemand. Namely, at one point people focused on adjusting all of the
governor tunables to their needs without contributing any code or even
insights back, so when schedutil was introduced, a decision was made
to reduce the tunability to a minimum (preferably no tunables at all,
but it turned out to be hard to avoid the one tunable existing today).
Peter was involved in those discussions and I think that the point
made then is still valid.

The headroom formula was based on the observation that it would be a
good idea to have some headroom in the majority of cases and on the
balance between the simplicity of computation and general suitability.

Of course, it is hard to devise a single value that will work for
everyone, but tunables complicate things from the maintenance
perspective. For example, the more tunables there are, the harder it
is to make changes without altering the behavior in ways that will
break someone's setup.
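
For reference, the fixed headroom being discussed amounts to roughly the
following (a minimal sketch of the 1.25x computation, not a quote of the
exact in-tree helper):

static inline unsigned long map_util_perf(unsigned long util)
{
	/* Request ~25% more than the measured utilization. */
	return util + (util >> 2);
}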

2023-12-12 13:16:46

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v2 7/8] sched/schedutil: Add a new tunable to dictate response time

On 12/11/23 21:20, Rafael J. Wysocki wrote:

> I understand the motivation, but counter-arguments are based on the
> experience with the cpufreq governors predating schedutil, especially
> ondemand. Namely, at one point people focused on adjusting all of the
> governor tunables to their needs without contributing any code or even
> insights back, so when schedutil was introduced, a decision was made
> to reduce the tunability to a minimum (preferably no tunables at all,
> but it turned out to be hard to avoid the one tunable existing today).
> Peter was involved in those discussions and I think that the point
> made then is still valid.
>
> The headroom formula was based on the observation that it would be a
> good idea to have some headroom in the majority of cases and on the
> balance between the simplicity of computation and general suitability.
>
> Of course, it is hard to devise a single value that will work for
> everyone, but tunables complicate things from the maintenance
> perspective. For example, the more tunables there are, the harder it
> is to make changes without altering the behavior in ways that will
> break someone's setup.

Okay, thanks for the insights, Rafael! I hope the matter is open for debate
at least. I do agree and share the sentiment, and if there's another way to
avoid the tunable I'm all for trying it out. I just personally failed to see
how we can do this without delegating. And the current choice of 25% headroom
is too aggressive for modern hardware IMHO. I'm not sure we can pick a value
that will truly work for most use cases. In the mobile world, it is really
hard to cover all use cases. Different OEMs tend to focus on different use
cases and design their systems to work optimally for those. And those use
cases don't necessarily hit the same bottlenecks on different systems.

If we consider all the possible systems that Linux gets incorporated in, it is
even harder to tell what's a sensible default.

And generally, if there's indeed a default that works for most users, what
should we do if we fall into the minority for whom this default is not
suitable? I think we still need to handle this, so we need a way somehow even
if this proposal doesn't hit the mark. Although again, I hope the matter is
open for debate.

The only ultimate solution I see is userspace becoming fully uclamp aware and
telling us its perf requirements. Then this value will be a NOP, as we'd have
direct info from the use cases to give them the performance they need when
they need it. And if their usage ends up with bad perf or power, we can at
least shift the blame onto their bad usage :-) /me runs

But this is years away from hitting critical mass. We need to get to a point
where we can enable the uclamp config by default, as not all distros enable
it yet.

Anyway, looking forward to learning more on how we can do better.


Thanks!

--
Qais Yousef

2024-01-20 07:53:47

by Ashay Jaiswal

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] sched/pelt: Introduce PELT multiplier

Hello Qais Yousef,

We ran a few benchmarks with the PELT multiplier patch on a Snapdragon 8Gen2
based internal Android device, and we are observing significant improvements
with the PELT8 configuration compared to PELT32.

Following are some of the benchmark results with the PELT32 and PELT8
configurations:

+-----------------+---------------+----------------+----------------+
| Test case       | Metric        | PELT32         | PELT8          |
+-----------------+---------------+----------------+----------------+
| | Overall | 711543 | 971275 |
| +---------------+----------------+----------------+
| | CPU | 193704 | 224378 |
| +---------------+----------------+----------------+
|ANTUTU V9.3.9 | GPU | 284650 | 424774 |
| +---------------+----------------+----------------+
| | MEM | 125207 | 160548 |
| +---------------+----------------+----------------+
| | UX | 107982 | 161575 |
+-----------------+---------------+----------------+----------------+
| | Single core | 1170 | 1268 |
|GeekBench V5.4.4 +---------------+----------------+----------------+
| | Multi core | 2530 | 3797 |
+-----------------+---------------+----------------+----------------+
| | Twitter | >50 Janks | 0 |
| SCROLL +---------------+----------------+----------------+
| | Contacts | >30 Janks | 0 |
+-----------------+---------------+----------------+----------------+

Please let us know if you need any support with running any further
workloads for the PELT32/PELT8 experiments; we can help with running them.

Thank you,
Ashay Jaiswal

On 12/8/2023 5:53 AM, Qais Yousef wrote:
> From: Vincent Donnefort <[email protected]>
>
> The new sched_pelt_multiplier boot param allows a user to set a clock
> multiplier to x2 or x4 (x1 being the default). This clock multiplier
> artificially speeds up PELT ramp up/down similarly to use a faster
> half-life than the default 32ms.
>
> - x1: 32ms half-life
> - x2: 16ms half-life
> - x4: 8ms half-life
>
> Internally, a new clock is created: rq->clock_task_mult. It sits in the
> clock hierarchy between rq->clock_task and rq->clock_pelt.
>
> The param is set as read only and can only be changed at boot time via
>
> kernel.sched_pelt_multiplier=[1, 2, 4]
>
> PELT has a big impact on the overall system response and reactiveness to
> change. Smaller PELT HF means it'll require less time to reach the
> maximum performance point of the system when the system become fully
> busy; and equally shorter time to go back to lowest performance point
> when the system goes back to idle.
>
> This faster reaction impacts both dvfs response and migration time
> between clusters in HMP system.
>
> Smaller PELT values are expected to give better performance at the cost
> of more power. Under powered systems can particularly benefit from
> smaller values. Powerful systems can still benefit from smaller values
> if they want to be tuned towards perf more and power is not the major
> concern for them.
>
> This combined with respone_time_ms from schedutil should give the user
> and sysadmin a deterministic way to control the triangular power, perf
> and thermals for their system. The default response_time_ms will half
> as PELT HF halves.
>
> Update approximate_{util_avg, runtime}() to take into account the PELT
> HALFLIFE multiplier.
>
> Signed-off-by: Vincent Donnefort <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> [Converted from sysctl to boot param and updated commit message]
> Signed-off-by: Qais Yousef (Google) <[email protected]>
> ---
> kernel/sched/core.c | 2 +-
> kernel/sched/pelt.c | 52 ++++++++++++++++++++++++++++++++++++++++++--
> kernel/sched/pelt.h | 42 +++++++++++++++++++++++++++++++----
> kernel/sched/sched.h | 1 +
> 4 files changed, 90 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b4a1c8ea9e12..9c8626b4ddff 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -745,7 +745,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> update_irq_load_avg(rq, irq_delta + steal);
> #endif
> - update_rq_clock_pelt(rq, delta);
> + update_rq_clock_task_mult(rq, delta);
> }
>
> void update_rq_clock(struct rq *rq)
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 00a1b9c1bf16..0a10e56f76c7 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -468,6 +468,54 @@ int update_irq_load_avg(struct rq *rq, u64 running)
> }
> #endif /* CONFIG_HAVE_SCHED_AVG_IRQ */
>
> +__read_mostly unsigned int sched_pelt_lshift;
> +static unsigned int sched_pelt_multiplier = 1;
> +
> +static int set_sched_pelt_multiplier(const char *val, const struct kernel_param *kp)
> +{
> + int ret;
> +
> + ret = param_set_int(val, kp);
> + if (ret)
> + goto error;
> +
> + switch (sched_pelt_multiplier) {
> + case 1:
> + fallthrough;
> + case 2:
> + fallthrough;
> + case 4:
> + WRITE_ONCE(sched_pelt_lshift,
> + sched_pelt_multiplier >> 1);
> + break;
> + default:
> + ret = -EINVAL;
> + goto error;
> + }
> +
> + return 0;
> +
> +error:
> + sched_pelt_multiplier = 1;
> + return ret;
> +}
> +
> +static const struct kernel_param_ops sched_pelt_multiplier_ops = {
> + .set = set_sched_pelt_multiplier,
> + .get = param_get_int,
> +};
> +
> +#ifdef MODULE_PARAM_PREFIX
> +#undef MODULE_PARAM_PREFIX
> +#endif
> +/* XXX: should we use sched as prefix? */
> +#define MODULE_PARAM_PREFIX "kernel."
> +module_param_cb(sched_pelt_multiplier, &sched_pelt_multiplier_ops, &sched_pelt_multiplier, 0444);
> +MODULE_PARM_DESC(sched_pelt_multiplier, "PELT HALFLIFE helps control the responsiveness of the system.");
> +MODULE_PARM_DESC(sched_pelt_multiplier, "Accepted value: 1 32ms PELT HALIFE - roughly 200ms to go from 0 to max performance point (default).");
> +MODULE_PARM_DESC(sched_pelt_multiplier, " 2 16ms PELT HALIFE - roughly 100ms to go from 0 to max performance point.");
> +MODULE_PARM_DESC(sched_pelt_multiplier, " 4 8ms PELT HALIFE - roughly 50ms to go from 0 to max performance point.");
> +
> /*
> * Approximate the new util_avg value assuming an entity has continued to run
> * for @delta us.
> @@ -482,7 +530,7 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
> if (unlikely(!delta))
> return util;
>
> - accumulate_sum(delta, &sa, 1, 0, 1);
> + accumulate_sum(delta << sched_pelt_lshift, &sa, 1, 0, 1);
> ___update_load_avg(&sa, 0);
>
> return sa.util_avg;
> @@ -494,7 +542,7 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
> u64 approximate_runtime(unsigned long util)
> {
> struct sched_avg sa = {};
> - u64 delta = 1024; // period = 1024 = ~1ms
> + u64 delta = 1024 << sched_pelt_lshift; // period = 1024 = ~1ms
> u64 runtime = 0;
>
> if (unlikely(!util))
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index 3a0e0dc28721..9b35b5072bae 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
> WRITE_ONCE(avg->util_est.enqueued, enqueued);
> }
>
> +static inline u64 rq_clock_task_mult(struct rq *rq)
> +{
> + lockdep_assert_rq_held(rq);
> + assert_clock_updated(rq);
> +
> + return rq->clock_task_mult;
> +}
> +
> static inline u64 rq_clock_pelt(struct rq *rq)
> {
> lockdep_assert_rq_held(rq);
> @@ -72,7 +80,7 @@ static inline u64 rq_clock_pelt(struct rq *rq)
> /* The rq is idle, we can sync to clock_task */
> static inline void _update_idle_rq_clock_pelt(struct rq *rq)
> {
> - rq->clock_pelt = rq_clock_task(rq);
> + rq->clock_pelt = rq_clock_task_mult(rq);
>
> u64_u32_store(rq->clock_idle, rq_clock(rq));
> /* Paired with smp_rmb in migrate_se_pelt_lag() */
> @@ -121,6 +129,27 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
> rq->clock_pelt += delta;
> }
>
> +extern unsigned int sched_pelt_lshift;
> +
> +/*
> + * absolute time |1 |2 |3 |4 |5 |6 |
> + * @ mult = 1 --------****************--------****************-
> + * @ mult = 2 --------********----------------********---------
> + * @ mult = 4 --------****--------------------****-------------
> + * clock task mult
> + * @ mult = 2 | | |2 |3 | | | | |5 |6 | | |
> + * @ mult = 4 | | | | |2|3| | | | | | | | | | |5|6| | | | | | |
> + *
> + */
> +static inline void update_rq_clock_task_mult(struct rq *rq, s64 delta)
> +{
> + delta <<= READ_ONCE(sched_pelt_lshift);
> +
> + rq->clock_task_mult += delta;
> +
> + update_rq_clock_pelt(rq, delta);
> +}
> +
> /*
> * When rq becomes idle, we have to check if it has lost idle time
> * because it was fully busy. A rq is fully used when the /Sum util_sum
> @@ -147,7 +176,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> * rq's clock_task.
> */
> if (util_sum >= divider)
> - rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> + rq->lost_idle_time += rq_clock_task_mult(rq) - rq->clock_pelt;
>
> _update_idle_rq_clock_pelt(rq);
> }
> @@ -218,13 +247,18 @@ update_irq_load_avg(struct rq *rq, u64 running)
> return 0;
> }
>
> -static inline u64 rq_clock_pelt(struct rq *rq)
> +static inline u64 rq_clock_task_mult(struct rq *rq)
> {
> return rq_clock_task(rq);
> }
>
> +static inline u64 rq_clock_pelt(struct rq *rq)
> +{
> + return rq_clock_task_mult(rq);
> +}
> +
> static inline void
> -update_rq_clock_pelt(struct rq *rq, s64 delta) { }
> +update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
>
> static inline void
> update_idle_rq_clock_pelt(struct rq *rq) { }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bbece0eb053a..a7c89c623250 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1029,6 +1029,7 @@ struct rq {
> u64 clock;
> /* Ensure that all clocks are in the same cache line */
> u64 clock_task ____cacheline_aligned;
> + u64 clock_task_mult;
> u64 clock_pelt;
> unsigned long lost_idle_time;
> u64 clock_pelt_idle;

2024-01-21 00:05:00

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] sched/pelt: Introduce PELT multiplier

Hi Ashay

On 01/20/24 13:22, Ashay Jaiswal wrote:
> Hello Qais Yousef,
>
> We ran few benchmarks with PELT multiplier patch on a Snapdragon 8Gen2
> based internal Android device and we are observing significant
> improvements with PELT8 configuration compared to PELT32.
>
> Following are some of the benchmark results with PELT32 and PELT8
> configuration:
>
> +-----------------+---------------+----------------+----------------+
> | Test case | PELT32 | PELT8 |
> +-----------------+---------------+----------------+----------------+
> | | Overall | 711543 | 971275 |
> | +---------------+----------------+----------------+
> | | CPU | 193704 | 224378 |
> | +---------------+----------------+----------------+
> |ANTUTU V9.3.9 | GPU | 284650 | 424774 |
> | +---------------+----------------+----------------+
> | | MEM | 125207 | 160548 |
> | +---------------+----------------+----------------+
> | | UX | 107982 | 161575 |
> +-----------------+---------------+----------------+----------------+
> | | Single core | 1170 | 1268 |
> |GeekBench V5.4.4 +---------------+----------------+----------------+
> | | Multi core | 2530 | 3797 |
> +-----------------+---------------+----------------+----------------+
> | | Twitter | >50 Janks | 0 |
> | SCROLL +---------------+----------------+----------------+
> | | Contacts | >30 Janks | 0 |
> +-----------------+---------------+----------------+----------------+
>
> Please let us know if you need any support with running any further
> workloads for PELT32/PELT8 experiments, we can help with running the
> experiments.

Thanks a lot for the test results. Was this tried with this patch alone or
with the whole series applied?

Have you tried to tweak each policy's response_time_ms introduced in patch
7 instead? With the series applied, boot with PELT8 and record the response
time values for each policy, then boot back again with PELT32 and write those
values back. Does this produce similar results?
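
A rough userspace sketch of that experiment (the sysfs path below is an
assumption based on where the per-policy schedutil attributes such as
rate_limit_us live; adjust it for the target system):

#include <stdio.h>

#define RESP_PATH \
	"/sys/devices/system/cpu/cpufreq/policy%s/schedutil/response_time_ms"

int main(int argc, char **argv)
{
	const char *policy = argc > 1 ? argv[1] : "0";
	char path[128];
	unsigned int val;
	FILE *f;

	snprintf(path, sizeof(path), RESP_PATH, policy);

	/* Read the current value, e.g. recorded while booted with PELT8. */
	f = fopen(path, "r");
	if (!f || fscanf(f, "%u", &val) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("policy%s: response_time_ms=%u\n", policy, val);

	/* Optionally write a value back, e.g. replayed after booting PELT32. */
	if (argc > 2) {
		f = fopen(path, "w");
		if (!f) {
			perror(path);
			return 1;
		}
		fprintf(f, "%s\n", argv[2]);
		fclose(f);
	}
	return 0;
}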

You didn't share power numbers, so I assume the perf gains are more
important than the power cost for you.


Thanks!

--
Qais Yousef

2024-01-28 16:51:45

by Ashay Jaiswal

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] sched/pelt: Introduce PELT multiplier

Hello Qais Yousef,

Thank you for your response.

On 1/21/2024 5:34 AM, Qais Yousef wrote:
> Hi Ashay
>
> On 01/20/24 13:22, Ashay Jaiswal wrote:
>> Hello Qais Yousef,
>>
>> We ran few benchmarks with PELT multiplier patch on a Snapdragon 8Gen2
>> based internal Android device and we are observing significant
>> improvements with PELT8 configuration compared to PELT32.
>>
>> Following are some of the benchmark results with PELT32 and PELT8
>> configuration:
>>
>> +-----------------+---------------+----------------+----------------+
>> | Test case | PELT32 | PELT8 |
>> +-----------------+---------------+----------------+----------------+
>> | | Overall | 711543 | 971275 |
>> | +---------------+----------------+----------------+
>> | | CPU | 193704 | 224378 |
>> | +---------------+----------------+----------------+
>> |ANTUTU V9.3.9 | GPU | 284650 | 424774 |
>> | +---------------+----------------+----------------+
>> | | MEM | 125207 | 160548 |
>> | +---------------+----------------+----------------+
>> | | UX | 107982 | 161575 |
>> +-----------------+---------------+----------------+----------------+
>> | | Single core | 1170 | 1268 |
>> |GeekBench V5.4.4 +---------------+----------------+----------------+
>> | | Multi core | 2530 | 3797 |
>> +-----------------+---------------+----------------+----------------+
>> | | Twitter | >50 Janks | 0 |
>> | SCROLL +---------------+----------------+----------------+
>> | | Contacts | >30 Janks | 0 |
>> +-----------------+---------------+----------------+----------------+
>>
>> Please let us know if you need any support with running any further
>> workloads for PELT32/PELT8 experiments, we can help with running the
>> experiments.
>
> Thanks a lot for the test results. Was this tried with this patch alone or
> the whole series applied?
>
I have only applied patch 8 (sched/pelt: Introduce PELT multiplier) for the tests.

> Have you tried to tweak each policy response_time_ms introduced in patch
> 7 instead? With the series applied, boot with PELT8, record the response time
> values for each policy, then boot back again to PELT32 and use those values.
> Does this produce similar results?
>
As the device is based on the 5.15 kernel, I will try to pull all 8 patches
along with the dependency patches onto 5.15 and try out the experiments as
suggested.

> You didn't share power numbers which I assume the perf gains are more important
> than the power cost for you.
>
If possible, I will try to collect the power numbers for future tests and
share the details.

>
> Thanks!
>
> --
> Qais Yousef

2024-01-30 17:29:04

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] sched/pelt: Introduce PELT multiplier

On Sun, 28 Jan 2024 at 17:22, Ashay Jaiswal <[email protected]> wrote:
>
> Hello Qais Yousef,
>
> Thank you for your response.
>
> On 1/21/2024 5:34 AM, Qais Yousef wrote:
> > Hi Ashay
> >
> > On 01/20/24 13:22, Ashay Jaiswal wrote:
> >> Hello Qais Yousef,
> >>
> >> We ran few benchmarks with PELT multiplier patch on a Snapdragon 8Gen2
> >> based internal Android device and we are observing significant
> >> improvements with PELT8 configuration compared to PELT32.
> >>
> >> Following are some of the benchmark results with PELT32 and PELT8
> >> configuration:
> >>
> >> +-----------------+---------------+----------------+----------------+
> >> | Test case | PELT32 | PELT8 |
> >> +-----------------+---------------+----------------+----------------+
> >> | | Overall | 711543 | 971275 |
> >> | +---------------+----------------+----------------+
> >> | | CPU | 193704 | 224378 |
> >> | +---------------+----------------+----------------+
> >> |ANTUTU V9.3.9 | GPU | 284650 | 424774 |
> >> | +---------------+----------------+----------------+
> >> | | MEM | 125207 | 160548 |
> >> | +---------------+----------------+----------------+
> >> | | UX | 107982 | 161575 |
> >> +-----------------+---------------+----------------+----------------+
> >> | | Single core | 1170 | 1268 |
> >> |GeekBench V5.4.4 +---------------+----------------+----------------+
> >> | | Multi core | 2530 | 3797 |
> >> +-----------------+---------------+----------------+----------------+
> >> | | Twitter | >50 Janks | 0 |
> >> | SCROLL +---------------+----------------+----------------+
> >> | | Contacts | >30 Janks | 0 |
> >> +-----------------+---------------+----------------+----------------+
> >>
> >> Please let us know if you need any support with running any further
> >> workloads for PELT32/PELT8 experiments, we can help with running the
> >> experiments.
> >
> > Thanks a lot for the test results. Was this tried with this patch alone or
> > the whole series applied?
> >
> I have only applied patch8(sched/pelt: Introduce PELT multiplier) for the tests.
>
> > Have you tried to tweak each policy response_time_ms introduced in patch
> > 7 instead? With the series applied, boot with PELT8, record the response time
> > values for each policy, then boot back again to PELT32 and use those values.
> > Does this produce similar results?
> >
> As the device is based on 5.15 kernel, I will try to pull all the 8 patches
> along with the dependency patches on 5.15 and try out the experiments as
> suggested.

Generally speaking, it would be better to compare with the latest kernel, or
at least one close to it that includes the new features added since v5.15
(which is more than 2 years old now). I understand that this is not always
easy or doable, but you could be surprised by the benefit of some features
merged since v5.15, like [0].

[0] https://lore.kernel.org/lkml/[email protected]/

>
> > You didn't share power numbers which I assume the perf gains are more important
> > than the power cost for you.
> >
> If possible I will try to collect the power number for future test and share the
> details.
>
> >
> > Thanks!
> >
> > --
> > Qais Yousef

2024-01-30 17:38:38

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] sched/pelt: Introduce PELT multiplier

On Friday 08 Dec 2023 at 00:23:42 (+0000), Qais Yousef wrote:
> From: Vincent Donnefort <[email protected]>
>
> The new sched_pelt_multiplier boot param allows a user to set a clock
> multiplier to x2 or x4 (x1 being the default). This clock multiplier
> artificially speeds up PELT ramp up/down similarly to use a faster
> half-life than the default 32ms.
>
> - x1: 32ms half-life
> - x2: 16ms half-life
> - x4: 8ms half-life
>
> Internally, a new clock is created: rq->clock_task_mult. It sits in the
> clock hierarchy between rq->clock_task and rq->clock_pelt.
>
> The param is set as read only and can only be changed at boot time via
>
> kernel.sched_pelt_multiplier=[1, 2, 4]
>
> PELT has a big impact on the overall system response and reactiveness to
> change. Smaller PELT HF means it'll require less time to reach the
> maximum performance point of the system when the system become fully
> busy; and equally shorter time to go back to lowest performance point
> when the system goes back to idle.
>
> This faster reaction impacts both dvfs response and migration time
> between clusters in HMP system.
>
> Smaller PELT values are expected to give better performance at the cost
> of more power. Under powered systems can particularly benefit from
> smaller values. Powerful systems can still benefit from smaller values
> if they want to be tuned towards perf more and power is not the major
> concern for them.
>
> This combined with respone_time_ms from schedutil should give the user
> and sysadmin a deterministic way to control the triangular power, perf
> and thermals for their system. The default response_time_ms will half
> as PELT HF halves.
>
> Update approximate_{util_avg, runtime}() to take into account the PELT
> HALFLIFE multiplier.
>
> Signed-off-by: Vincent Donnefort <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> [Converted from sysctl to boot param and updated commit message]
> Signed-off-by: Qais Yousef (Google) <[email protected]>
> ---
> kernel/sched/core.c | 2 +-
> kernel/sched/pelt.c | 52 ++++++++++++++++++++++++++++++++++++++++++--
> kernel/sched/pelt.h | 42 +++++++++++++++++++++++++++++++----
> kernel/sched/sched.h | 1 +
> 4 files changed, 90 insertions(+), 7 deletions(-)
>

..

> +__read_mostly unsigned int sched_pelt_lshift;
> +static unsigned int sched_pelt_multiplier = 1;
> +
> +static int set_sched_pelt_multiplier(const char *val, const struct kernel_param *kp)
> +{
> + int ret;
> +
> + ret = param_set_int(val, kp);
> + if (ret)
> + goto error;
> +
> + switch (sched_pelt_multiplier) {
> + case 1:
> + fallthrough;
> + case 2:
> + fallthrough;
> + case 4:
> + WRITE_ONCE(sched_pelt_lshift,
> + sched_pelt_multiplier >> 1);
> + break;
> + default:
> + ret = -EINVAL;
> + goto error;
> + }
> +
> + return 0;
> +
> +error:
> + sched_pelt_multiplier = 1;
> + return ret;
> +}
> +
> +static const struct kernel_param_ops sched_pelt_multiplier_ops = {
> + .set = set_sched_pelt_multiplier,
> + .get = param_get_int,
> +};
> +
> +#ifdef MODULE_PARAM_PREFIX
> +#undef MODULE_PARAM_PREFIX
> +#endif
> +/* XXX: should we use sched as prefix? */
> +#define MODULE_PARAM_PREFIX "kernel."
> +module_param_cb(sched_pelt_multiplier, &sched_pelt_multiplier_ops, &sched_pelt_multiplier, 0444);
> +MODULE_PARM_DESC(sched_pelt_multiplier, "PELT HALFLIFE helps control the responsiveness of the system.");
> +MODULE_PARM_DESC(sched_pelt_multiplier, "Accepted value: 1 32ms PELT HALIFE - roughly 200ms to go from 0 to max performance point (default).");
> +MODULE_PARM_DESC(sched_pelt_multiplier, " 2 16ms PELT HALIFE - roughly 100ms to go from 0 to max performance point.");
> +MODULE_PARM_DESC(sched_pelt_multiplier, " 4 8ms PELT HALIFE - roughly 50ms to go from 0 to max performance point.");
> +
> /*
> * Approximate the new util_avg value assuming an entity has continued to run
> * for @delta us.

..

> +
> static inline void
> -update_rq_clock_pelt(struct rq *rq, s64 delta) { }
> +update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
>
> static inline void
> update_idle_rq_clock_pelt(struct rq *rq) { }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bbece0eb053a..a7c89c623250 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1029,6 +1029,7 @@ struct rq {
> u64 clock;
> /* Ensure that all clocks are in the same cache line */
> u64 clock_task ____cacheline_aligned;
> + u64 clock_task_mult;

I'm not sure that we want yet another clock, and this doesn't apply to irq_avg.

What about the below? It is simpler and I think it covers all cases:

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index f951c44f1d52..5cdd147b7abe 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -180,6 +180,7 @@ static __always_inline int
___update_load_sum(u64 now, struct sched_avg *sa,
unsigned long load, unsigned long runnable, int running)
{
+ int time_shift;
u64 delta;

delta = now - sa->last_update_time;
@@ -195,12 +196,17 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
/*
* Use 1024ns as the unit of measurement since it's a reasonable
* approximation of 1us and fast to compute.
+ * On top of this, we can change the half-time period from the default
+ * 32ms to a shorter value. This is equivalent to left shifting the
+ * time.
+ * Merge both right and left shifts in one single right shift
*/
- delta >>= 10;
+ time_shift = 10 - sched_pelt_lshift;
+ delta >>= time_shift;
if (!delta)
return 0;

- sa->last_update_time += delta << 10;
+ sa->last_update_time += delta << time_shift;

/*
* running is a subset of runnable (weight) so running can't be set if



> u64 clock_pelt;
> unsigned long lost_idle_time;
> u64 clock_pelt_idle;
> --
> 2.34.1
>
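
To illustrate the effect of the merged shift above (a sketch, not the in-tree
code): with sched_pelt_lshift = 2, i.e. the x4 multiplier, the divisor drops
from 1024ns to 256ns per PELT segment, so the same wall-clock delta advances
the PELT sums four times as fast, giving roughly an 8ms half-life instead of
32ms.

static inline unsigned long long pelt_segments(unsigned long long delta_ns,
					       unsigned int lshift)
{
	unsigned int time_shift = 10 - lshift;	/* 1024ns == 1 << 10 */

	/* Same result as (delta_ns << lshift) >> 10, in a single shift. */
	return delta_ns >> time_shift;
}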

2024-02-01 22:24:40

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] sched/pelt: Introduce PELT multiplier

On 01/30/24 18:38, Vincent Guittot wrote:
> On Friday 08 Dec 2023 at 00:23:42 (+0000), Qais Yousef wrote:
> > From: Vincent Donnefort <[email protected]>
> >
> > The new sched_pelt_multiplier boot param allows a user to set a clock
> > multiplier to x2 or x4 (x1 being the default). This clock multiplier
> > artificially speeds up PELT ramp up/down similarly to use a faster
> > half-life than the default 32ms.
> >
> > - x1: 32ms half-life
> > - x2: 16ms half-life
> > - x4: 8ms half-life
> >
> > Internally, a new clock is created: rq->clock_task_mult. It sits in the
> > clock hierarchy between rq->clock_task and rq->clock_pelt.
> >
> > The param is set as read only and can only be changed at boot time via
> >
> > kernel.sched_pelt_multiplier=[1, 2, 4]
> >
> > PELT has a big impact on the overall system response and reactiveness to
> > change. Smaller PELT HF means it'll require less time to reach the
> > maximum performance point of the system when the system become fully
> > busy; and equally shorter time to go back to lowest performance point
> > when the system goes back to idle.
> >
> > This faster reaction impacts both dvfs response and migration time
> > between clusters in HMP system.
> >
> > Smaller PELT values are expected to give better performance at the cost
> > of more power. Under powered systems can particularly benefit from
> > smaller values. Powerful systems can still benefit from smaller values
> > if they want to be tuned towards perf more and power is not the major
> > concern for them.
> >
> > This combined with respone_time_ms from schedutil should give the user
> > and sysadmin a deterministic way to control the triangular power, perf
> > and thermals for their system. The default response_time_ms will half
> > as PELT HF halves.
> >
> > Update approximate_{util_avg, runtime}() to take into account the PELT
> > HALFLIFE multiplier.
> >
> > Signed-off-by: Vincent Donnefort <[email protected]>
> > Signed-off-by: Dietmar Eggemann <[email protected]>
> > [Converted from sysctl to boot param and updated commit message]
> > Signed-off-by: Qais Yousef (Google) <[email protected]>
> > ---
> > kernel/sched/core.c | 2 +-
> > kernel/sched/pelt.c | 52 ++++++++++++++++++++++++++++++++++++++++++--
> > kernel/sched/pelt.h | 42 +++++++++++++++++++++++++++++++----
> > kernel/sched/sched.h | 1 +
> > 4 files changed, 90 insertions(+), 7 deletions(-)
> >
>
> ...
>
> > +__read_mostly unsigned int sched_pelt_lshift;
> > +static unsigned int sched_pelt_multiplier = 1;
> > +
> > +static int set_sched_pelt_multiplier(const char *val, const struct kernel_param *kp)
> > +{
> > + int ret;
> > +
> > + ret = param_set_int(val, kp);
> > + if (ret)
> > + goto error;
> > +
> > + switch (sched_pelt_multiplier) {
> > + case 1:
> > + fallthrough;
> > + case 2:
> > + fallthrough;
> > + case 4:
> > + WRITE_ONCE(sched_pelt_lshift,
> > + sched_pelt_multiplier >> 1);
> > + break;
> > + default:
> > + ret = -EINVAL;
> > + goto error;
> > + }
> > +
> > + return 0;
> > +
> > +error:
> > + sched_pelt_multiplier = 1;
> > + return ret;
> > +}
> > +
> > +static const struct kernel_param_ops sched_pelt_multiplier_ops = {
> > + .set = set_sched_pelt_multiplier,
> > + .get = param_get_int,
> > +};
> > +
> > +#ifdef MODULE_PARAM_PREFIX
> > +#undef MODULE_PARAM_PREFIX
> > +#endif
> > +/* XXX: should we use sched as prefix? */
> > +#define MODULE_PARAM_PREFIX "kernel."
> > +module_param_cb(sched_pelt_multiplier, &sched_pelt_multiplier_ops, &sched_pelt_multiplier, 0444);
> > +MODULE_PARM_DESC(sched_pelt_multiplier, "PELT HALFLIFE helps control the responsiveness of the system.");
> > +MODULE_PARM_DESC(sched_pelt_multiplier, "Accepted value: 1 32ms PELT HALIFE - roughly 200ms to go from 0 to max performance point (default).");
> > +MODULE_PARM_DESC(sched_pelt_multiplier, " 2 16ms PELT HALIFE - roughly 100ms to go from 0 to max performance point.");
> > +MODULE_PARM_DESC(sched_pelt_multiplier, " 4 8ms PELT HALIFE - roughly 50ms to go from 0 to max performance point.");
> > +
> > /*
> > * Approximate the new util_avg value assuming an entity has continued to run
> > * for @delta us.
>
> ...
>
> > +
> > static inline void
> > -update_rq_clock_pelt(struct rq *rq, s64 delta) { }
> > +update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
> >
> > static inline void
> > update_idle_rq_clock_pelt(struct rq *rq) { }
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index bbece0eb053a..a7c89c623250 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1029,6 +1029,7 @@ struct rq {
> > u64 clock;
> > /* Ensure that all clocks are in the same cache line */
> > u64 clock_task ____cacheline_aligned;
> > + u64 clock_task_mult;
>
> I'm not sure that we want yet another clock and this doesn't apply for irq_avg.
>
> What about the below is simpler and I think cover all cases ?

Looks better, yes. I'll change to this, and if no issues come up I'll add
your Signed-off-by for the next version, if that's okay with you.


Thanks!

--
Qais Yousef

>
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index f951c44f1d52..5cdd147b7abe 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -180,6 +180,7 @@ static __always_inline int
> ___update_load_sum(u64 now, struct sched_avg *sa,
> unsigned long load, unsigned long runnable, int running)
> {
> + int time_shift;
> u64 delta;
>
> delta = now - sa->last_update_time;
> @@ -195,12 +196,17 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
> /*
> * Use 1024ns as the unit of measurement since it's a reasonable
> * approximation of 1us and fast to compute.
> + * On top of this, we can change the half-time period from the default
> + * 32ms to a shorter value. This is equivalent to left shifting the
> + * time.
> + * Merge both right and left shifts in one single right shift
> */
> - delta >>= 10;
> + time_shift = 10 - sched_pelt_lshift;
> + delta >>= time_shift;
> if (!delta)
> return 0;
>
> - sa->last_update_time += delta << 10;
> + sa->last_update_time += delta << time_shift;
>
> /*
> * running is a subset of runnable (weight) so running can't be set if
>
>
>
> > u64 clock_pelt;
> > unsigned long lost_idle_time;
> > u64 clock_pelt_idle;
> > --
> > 2.34.1
> >

2024-02-01 22:44:13

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v2 7/8] sched/schedutil: Add a new tunable to dictate response time

On 12/08/23 00:23, Qais Yousef wrote:

> +static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
> +{
> + int cpu = cpumask_first(sg_policy->policy->cpus);
> + unsigned long cap = arch_scale_cpu_capacity(cpu);
> + unsigned int max_freq, sec_max_freq;
> +
> + max_freq = sg_policy->policy->cpuinfo.max_freq;
> + sec_max_freq = __resolve_freq(sg_policy->policy,
> + max_freq - 1,
> + CPUFREQ_RELATION_H);
> +
> + /*
> + * We will request max_freq as soon as util crosses the capacity at
> + * second highest frequency. So effectively our response time is the
> + * util at which we cross the cap@2nd_highest_freq.
> + */
> + cap = sec_max_freq * cap / max_freq;
> +
> + return approximate_runtime(cap + 1);

After Linus' problem (and more testing) I realized this is not correct. This
value is correct for the biggest core only; for smaller cores I must stretch
the time with capacity. I have similar invariance issues that I need to
address in this series and in the uclamp max aggregation series.
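
For illustration only, one possible shape of that stretch (the helper name
and the exact scaling below are assumptions, not something posted in the
series): scale the runtime computed for the biggest core by the inverse of
the CPU's capacity, so smaller cores get a proportionally longer response
time.

static inline u64 sugov_calc_freq_response_ms_capinv(struct sugov_policy *sg_policy)
{
	int cpu = cpumask_first(sg_policy->policy->cpus);
	unsigned long cap = arch_scale_cpu_capacity(cpu);
	u64 response_ms = sugov_calc_freq_response_ms(sg_policy);

	/*
	 * A CPU with capacity cap ramps its util roughly 1024/cap times
	 * slower than the biggest CPU, so stretch the response accordingly.
	 */
	return div64_ul(response_ms * SCHED_CAPACITY_SCALE, cap);
}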

2024-02-04 11:32:59

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] sched/pelt: Introduce PELT multiplier

On Thu, 1 Feb 2024 at 23:24, Qais Yousef <[email protected]> wrote:
>
> On 01/30/24 18:38, Vincent Guittot wrote:
> > On Friday 08 Dec 2023 at 00:23:42 (+0000), Qais Yousef wrote:
> > > From: Vincent Donnefort <[email protected]>
> > >
> > > The new sched_pelt_multiplier boot param allows a user to set a clock
> > > multiplier to x2 or x4 (x1 being the default). This clock multiplier
> > > artificially speeds up PELT ramp up/down similarly to use a faster
> > > half-life than the default 32ms.
> > >
> > > - x1: 32ms half-life
> > > - x2: 16ms half-life
> > > - x4: 8ms half-life
> > >
> > > Internally, a new clock is created: rq->clock_task_mult. It sits in the
> > > clock hierarchy between rq->clock_task and rq->clock_pelt.
> > >
> > > The param is set as read only and can only be changed at boot time via
> > >
> > > kernel.sched_pelt_multiplier=[1, 2, 4]
> > >
> > > PELT has a big impact on the overall system response and reactiveness to
> > > change. Smaller PELT HF means it'll require less time to reach the
> > > maximum performance point of the system when the system become fully
> > > busy; and equally shorter time to go back to lowest performance point
> > > when the system goes back to idle.
> > >
> > > This faster reaction impacts both dvfs response and migration time
> > > between clusters in HMP system.
> > >
> > > Smaller PELT values are expected to give better performance at the cost
> > > of more power. Under powered systems can particularly benefit from
> > > smaller values. Powerful systems can still benefit from smaller values
> > > if they want to be tuned towards perf more and power is not the major
> > > concern for them.
> > >
> > > This combined with respone_time_ms from schedutil should give the user
> > > and sysadmin a deterministic way to control the triangular power, perf
> > > and thermals for their system. The default response_time_ms will half
> > > as PELT HF halves.
> > >
> > > Update approximate_{util_avg, runtime}() to take into account the PELT
> > > HALFLIFE multiplier.
> > >
> > > Signed-off-by: Vincent Donnefort <[email protected]>
> > > Signed-off-by: Dietmar Eggemann <[email protected]>
> > > [Converted from sysctl to boot param and updated commit message]
> > > Signed-off-by: Qais Yousef (Google) <[email protected]>
> > > ---
> > > kernel/sched/core.c | 2 +-
> > > kernel/sched/pelt.c | 52 ++++++++++++++++++++++++++++++++++++++++++--
> > > kernel/sched/pelt.h | 42 +++++++++++++++++++++++++++++++----
> > > kernel/sched/sched.h | 1 +
> > > 4 files changed, 90 insertions(+), 7 deletions(-)
> > >
> >
> > ...
> >
> > > +__read_mostly unsigned int sched_pelt_lshift;
> > > +static unsigned int sched_pelt_multiplier = 1;
> > > +
> > > +static int set_sched_pelt_multiplier(const char *val, const struct kernel_param *kp)
> > > +{
> > > + int ret;
> > > +
> > > + ret = param_set_int(val, kp);
> > > + if (ret)
> > > + goto error;
> > > +
> > > + switch (sched_pelt_multiplier) {
> > > + case 1:
> > > + fallthrough;
> > > + case 2:
> > > + fallthrough;
> > > + case 4:
> > > + WRITE_ONCE(sched_pelt_lshift,
> > > + sched_pelt_multiplier >> 1);
> > > + break;
> > > + default:
> > > + ret = -EINVAL;
> > > + goto error;
> > > + }
> > > +
> > > + return 0;
> > > +
> > > +error:
> > > + sched_pelt_multiplier = 1;
> > > + return ret;
> > > +}
> > > +
> > > +static const struct kernel_param_ops sched_pelt_multiplier_ops = {
> > > + .set = set_sched_pelt_multiplier,
> > > + .get = param_get_int,
> > > +};
> > > +
> > > +#ifdef MODULE_PARAM_PREFIX
> > > +#undef MODULE_PARAM_PREFIX
> > > +#endif
> > > +/* XXX: should we use sched as prefix? */
> > > +#define MODULE_PARAM_PREFIX "kernel."
> > > +module_param_cb(sched_pelt_multiplier, &sched_pelt_multiplier_ops, &sched_pelt_multiplier, 0444);
> > > +MODULE_PARM_DESC(sched_pelt_multiplier, "PELT HALFLIFE helps control the responsiveness of the system.");
> > > +MODULE_PARM_DESC(sched_pelt_multiplier, "Accepted value: 1 32ms PELT HALIFE - roughly 200ms to go from 0 to max performance point (default).");
> > > +MODULE_PARM_DESC(sched_pelt_multiplier, " 2 16ms PELT HALIFE - roughly 100ms to go from 0 to max performance point.");
> > > +MODULE_PARM_DESC(sched_pelt_multiplier, " 4 8ms PELT HALIFE - roughly 50ms to go from 0 to max performance point.");
> > > +
> > > /*
> > > * Approximate the new util_avg value assuming an entity has continued to run
> > > * for @delta us.
> >
> > ...
> >
> > > +
> > > static inline void
> > > -update_rq_clock_pelt(struct rq *rq, s64 delta) { }
> > > +update_rq_clock_task_mult(struct rq *rq, s64 delta) { }
> > >
> > > static inline void
> > > update_idle_rq_clock_pelt(struct rq *rq) { }
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index bbece0eb053a..a7c89c623250 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -1029,6 +1029,7 @@ struct rq {
> > > u64 clock;
> > > /* Ensure that all clocks are in the same cache line */
> > > u64 clock_task ____cacheline_aligned;
> > > + u64 clock_task_mult;
> >
> > I'm not sure that we want yet another clock and this doesn't apply for irq_avg.
> >
> > What about the below is simpler and I think cover all cases ?
>
> Looks better, yes. I'll change to this and if no issues come up I'll add your
> signed-off-by if that's okay with you for the next version.

Yes, that's okay

>
>
> Thanks!
>
> --
> Qais Yousef
>
> >
> > diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> > index f951c44f1d52..5cdd147b7abe 100644
> > --- a/kernel/sched/pelt.c
> > +++ b/kernel/sched/pelt.c
> > @@ -180,6 +180,7 @@ static __always_inline int
> > ___update_load_sum(u64 now, struct sched_avg *sa,
> > unsigned long load, unsigned long runnable, int running)
> > {
> > + int time_shift;
> > u64 delta;
> >
> > delta = now - sa->last_update_time;
> > @@ -195,12 +196,17 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
> > /*
> > * Use 1024ns as the unit of measurement since it's a reasonable
> > * approximation of 1us and fast to compute.
> > + * On top of this, we can change the half-time period from the default
> > + * 32ms to a shorter value. This is equivalent to left shifting the
> > + * time.
> > + * Merge both right and left shifts in one single right shift
> > */
> > - delta >>= 10;
> > + time_shift = 10 - sched_pelt_lshift;
> > + delta >>= time_shift;
> > if (!delta)
> > return 0;
> >
> > - sa->last_update_time += delta << 10;
> > + sa->last_update_time += delta << time_shift;
> >
> > /*
> > * running is a subset of runnable (weight) so running can't be set if
> >
> >
> >
> > > u64 clock_pelt;
> > > unsigned long lost_idle_time;
> > > u64 clock_pelt_idle;
> > > --
> > > 2.34.1
> > >

2024-02-06 17:08:36

by Ashay Jaiswal

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] sched/pelt: Introduce PELT multiplier



On 1/30/2024 10:58 PM, Vincent Guittot wrote:
> On Sun, 28 Jan 2024 at 17:22, Ashay Jaiswal <[email protected]> wrote:
>>
>> Hello Qais Yousef,
>>
>> Thank you for your response.
>>
>> On 1/21/2024 5:34 AM, Qais Yousef wrote:
>>> Hi Ashay
>>>
>>> On 01/20/24 13:22, Ashay Jaiswal wrote:
>>>> Hello Qais Yousef,
>>>>
>>>> We ran few benchmarks with PELT multiplier patch on a Snapdragon 8Gen2
>>>> based internal Android device and we are observing significant
>>>> improvements with PELT8 configuration compared to PELT32.
>>>>
>>>> Following are some of the benchmark results with PELT32 and PELT8
>>>> configuration:
>>>>
>>>> +-----------------+---------------+----------------+----------------+
>>>> | Test case | PELT32 | PELT8 |
>>>> +-----------------+---------------+----------------+----------------+
>>>> | | Overall | 711543 | 971275 |
>>>> | +---------------+----------------+----------------+
>>>> | | CPU | 193704 | 224378 |
>>>> | +---------------+----------------+----------------+
>>>> |ANTUTU V9.3.9 | GPU | 284650 | 424774 |
>>>> | +---------------+----------------+----------------+
>>>> | | MEM | 125207 | 160548 |
>>>> | +---------------+----------------+----------------+
>>>> | | UX | 107982 | 161575 |
>>>> +-----------------+---------------+----------------+----------------+
>>>> | | Single core | 1170 | 1268 |
>>>> |GeekBench V5.4.4 +---------------+----------------+----------------+
>>>> | | Multi core | 2530 | 3797 |
>>>> +-----------------+---------------+----------------+----------------+
>>>> | | Twitter | >50 Janks | 0 |
>>>> | SCROLL +---------------+----------------+----------------+
>>>> | | Contacts | >30 Janks | 0 |
>>>> +-----------------+---------------+----------------+----------------+
>>>>
>>>> Please let us know if you need any support with running any further
>>>> workloads for PELT32/PELT8 experiments, we can help with running the
>>>> experiments.
>>>
>>> Thanks a lot for the test results. Was this tried with this patch alone or
>>> the whole series applied?
>>>
>> I have only applied patch8(sched/pelt: Introduce PELT multiplier) for the tests.
>>
>>> Have you tried to tweak each policy response_time_ms introduced in patch
>>> 7 instead? With the series applied, boot with PELT8, record the response time
>>> values for each policy, then boot back again to PELT32 and use those values.
>>> Does this produce similar results?
>>>
>> As the device is based on 5.15 kernel, I will try to pull all the 8 patches
>> along with the dependency patches on 5.15 and try out the experiments as
>> suggested.
>
> Generally speaking, it would be better to compare with the latest
> kernel or at least close and which includes new features added since
> v5.15 (which is more than 2 years old now). I understand that this is
> not always easy or doable but you could be surprised by the benefit of
> some features like [0] merged since v5.15
>
> [0] https://lore.kernel.org/lkml/[email protected]/
>
Thank you, Vincent, for the suggestion. I will try to get the results on a device
running the most recent kernel and update.

Thanks,
Ashay Jaiswal
>>
>>> You didn't share power numbers which I assume the perf gains are more important
>>> than the power cost for you.
>>>
>> If possible I will try to collect the power number for future test and share the
>> details.
>>
>>>
>>> Thanks!
>>>
>>> --
>>> Qais Yousef

2024-04-12 10:08:03

by Ashay Jaiswal

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] sched/pelt: Introduce PELT multiplier

On 2/6/2024 10:37 PM, Ashay Jaiswal wrote:
>
>
> On 1/30/2024 10:58 PM, Vincent Guittot wrote:
>> On Sun, 28 Jan 2024 at 17:22, Ashay Jaiswal <[email protected]> wrote:
>>>
>>> Hello Qais Yousef,
>>>
>>> Thank you for your response.
>>>
>>> On 1/21/2024 5:34 AM, Qais Yousef wrote:
>>>> Hi Ashay
>>>>
>>>> On 01/20/24 13:22, Ashay Jaiswal wrote:
>>>>> Hello Qais Yousef,
>>>>>
>>>>> We ran few benchmarks with PELT multiplier patch on a Snapdragon 8Gen2
>>>>> based internal Android device and we are observing significant
>>>>> improvements with PELT8 configuration compared to PELT32.
>>>>>
>>>>> Following are some of the benchmark results with PELT32 and PELT8
>>>>> configuration:
>>>>>
>>>>> +-----------------+---------------+----------------+----------------+
>>>>> | Test case | PELT32 | PELT8 |
>>>>> +-----------------+---------------+----------------+----------------+
>>>>> | | Overall | 711543 | 971275 |
>>>>> | +---------------+----------------+----------------+
>>>>> | | CPU | 193704 | 224378 |
>>>>> | +---------------+----------------+----------------+
>>>>> |ANTUTU V9.3.9 | GPU | 284650 | 424774 |
>>>>> | +---------------+----------------+----------------+
>>>>> | | MEM | 125207 | 160548 |
>>>>> | +---------------+----------------+----------------+
>>>>> | | UX | 107982 | 161575 |
>>>>> +-----------------+---------------+----------------+----------------+
>>>>> | | Single core | 1170 | 1268 |
>>>>> |GeekBench V5.4.4 +---------------+----------------+----------------+
>>>>> | | Multi core | 2530 | 3797 |
>>>>> +-----------------+---------------+----------------+----------------+
>>>>> | | Twitter | >50 Janks | 0 |
>>>>> | SCROLL +---------------+----------------+----------------+
>>>>> | | Contacts | >30 Janks | 0 |
>>>>> +-----------------+---------------+----------------+----------------+
>>>>>
>>>>> Please let us know if you need any support with running any further
>>>>> workloads for PELT32/PELT8 experiments, we can help with running the
>>>>> experiments.
>>>>
>>>> Thanks a lot for the test results. Was this tried with this patch alone or
>>>> the whole series applied?
>>>>
>>> I have only applied patch8(sched/pelt: Introduce PELT multiplier) for the tests.
>>>
>>>> Have you tried to tweak each policy response_time_ms introduced in patch
>>>> 7 instead? With the series applied, boot with PELT8, record the response time
>>>> values for each policy, then boot back again to PELT32 and use those values.
>>>> Does this produce similar results?
>>>>
>>> As the device is based on 5.15 kernel, I will try to pull all the 8 patches
>>> along with the dependency patches on 5.15 and try out the experiments as
>>> suggested.
>>
>> Generally speaking, it would be better to compare with the latest
>> kernel or at least close and which includes new features added since
>> v5.15 (which is more than 2 years old now). I understand that this is
>> not always easy or doable but you could be surprised by the benefit of
>> some features like [0] merged since v5.15
>>
>> [0] https://lore.kernel.org/lkml/[email protected]/
>>
> Thank you Vincent for the suggestion, I will try to get the results on device running
> with most recent kernel and update.
>
> Thanks,
> Ashay Jaiswal

Hello Qais Yousef and Vincent,

Sorry for the delay; setting up the internal device on the latest kernel is taking more
time than anticipated. We are trying to bring up the latest kernel on the device and will
complete the testing with the latest cpufreq patches as you suggested.

Regarding the PELT multiplier patch [1], are we planning to merge it separately, or will
it be merged along with the cpufreq patches?

[1]: https://lore.kernel.org/all/[email protected]/

Thanks and Regards,
Ashay Jaiswal

>>>
>>>> You didn't share power numbers which I assume the perf gains are more important
>>>> than the power cost for you.
>>>>
>>> If possible I will try to collect the power number for future test and share the
>>> details.
>>>
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Qais Yousef

2024-04-19 13:20:56

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] sched/pelt: Introduce PELT multiplier

Hi Ashay

On 04/12/24 15:36, Ashay Jaiswal wrote:
> On 2/6/2024 10:37 PM, Ashay Jaiswal wrote:
> >
> >
> > On 1/30/2024 10:58 PM, Vincent Guittot wrote:
> >> On Sun, 28 Jan 2024 at 17:22, Ashay Jaiswal <[email protected]> wrote:
> >>>
> >>> Hello Qais Yousef,
> >>>
> >>> Thank you for your response.
> >>>
> >>> On 1/21/2024 5:34 AM, Qais Yousef wrote:
> >>>> Hi Ashay
> >>>>
> >>>> On 01/20/24 13:22, Ashay Jaiswal wrote:
> >>>>> Hello Qais Yousef,
> >>>>>
> >>>>> We ran few benchmarks with PELT multiplier patch on a Snapdragon 8Gen2
> >>>>> based internal Android device and we are observing significant
> >>>>> improvements with PELT8 configuration compared to PELT32.
> >>>>>
> >>>>> Following are some of the benchmark results with PELT32 and PELT8
> >>>>> configuration:
> >>>>>
> >>>>> +-----------------+---------------+----------------+----------------+
> >>>>> | Test case | PELT32 | PELT8 |
> >>>>> +-----------------+---------------+----------------+----------------+
> >>>>> | | Overall | 711543 | 971275 |
> >>>>> | +---------------+----------------+----------------+
> >>>>> | | CPU | 193704 | 224378 |
> >>>>> | +---------------+----------------+----------------+
> >>>>> |ANTUTU V9.3.9 | GPU | 284650 | 424774 |
> >>>>> | +---------------+----------------+----------------+
> >>>>> | | MEM | 125207 | 160548 |
> >>>>> | +---------------+----------------+----------------+
> >>>>> | | UX | 107982 | 161575 |
> >>>>> +-----------------+---------------+----------------+----------------+
> >>>>> | | Single core | 1170 | 1268 |
> >>>>> |GeekBench V5.4.4 +---------------+----------------+----------------+
> >>>>> | | Multi core | 2530 | 3797 |
> >>>>> +-----------------+---------------+----------------+----------------+
> >>>>> | | Twitter | >50 Janks | 0 |
> >>>>> | SCROLL +---------------+----------------+----------------+
> >>>>> | | Contacts | >30 Janks | 0 |
> >>>>> +-----------------+---------------+----------------+----------------+
> >>>>>
> >>>>> Please let us know if you need any support with running any further
> >>>>> workloads for PELT32/PELT8 experiments, we can help with running the
> >>>>> experiments.
> >>>>
> >>>> Thanks a lot for the test results. Was this tried with this patch alone or
> >>>> the whole series applied?
> >>>>
> >>> I have only applied patch8(sched/pelt: Introduce PELT multiplier) for the tests.
> >>>
> >>>> Have you tried to tweak each policy response_time_ms introduced in patch
> >>>> 7 instead? With the series applied, boot with PELT8, record the response time
> >>>> values for each policy, then boot back again to PELT32 and use those values.
> >>>> Does this produce similar results?
> >>>>
> >>> As the device is based on 5.15 kernel, I will try to pull all the 8 patches
> >>> along with the dependency patches on 5.15 and try out the experiments as
> >>> suggested.
> >>
> >> Generally speaking, it would be better to compare with the latest
> >> kernel or at least close and which includes new features added since
> >> v5.15 (which is more than 2 years old now). I understand that this is
> >> not always easy or doable but you could be surprised by the benefit of
> >> some features like [0] merged since v5.15
> >>
> >> [0] https://lore.kernel.org/lkml/[email protected]/
> >>
> > Thank you Vincent for the suggestion, I will try to get the results on device running
> > with most recent kernel and update.
> >
> > Thanks,
> > Ashay Jaiswal
>
> Hello Qais Yousef and Vincent,
>
> Sorry for the delay, setting up internal device on latest kernel is taking more time than anticipated.
> We are trying to bring-up latest kernel on the device and will complete the testing with the latest
> cpufreq patches as you suggested.
>
> Regarding PELT multiplier patch [1], are we planning to merge it separately or will it be merged
> altogether with the cpufreq patches?
>
> [1]: https://lore.kernel.org/all/[email protected]/

I am working on an updated version. I've been analysing the problem more since
the last posting and found more issues to fix to enhance the response time of
the system in terms of migration and DVFS.

For the PELT multiplier, I have an updated patch now based on Vincent's
suggestion of a different implementation without a new clock. I will include
that, but haven't tested this part so far.

I hope to send the new version soon. I will CC you so you can try it and see
if these improvements help. In my view the boot-time PELT multiplier is only
necessary to help low-end types of systems where the max performance is
relatively low, but the scheduler has a constant model (and response time),
which means they need a different default behavior so workloads reach this
max performance point faster.

I can potentially see a powerful system needing that, but IMHO the trade-off
with power will be very costly. If everything goes to plan, I hope we can
introduce a per-task util_est_faster like Peter suggested in an earlier
discussion, which should help workloads that need the best ST perf to reach
it faster without changing the system's default behavior.

The biggest challenge is handling those bursty tasks, and I hope the proposal
I am working on will put us in the right direction in terms of better default
behavior for those tasks.

If you have any analysis on why you think faster PELT helps, that'd be great to
share.


Cheers

--
Qais Yousef