2023-10-04 09:06:27

by Hongyan Xia

Subject: [RFC PATCH 0/6] sched: uclamp sum aggregation

The current implementation of uclamp leaves a lot to be desired. There
are several problems:

1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or
with the default value of 1024) can ruin the previous settings the
moment it is enqueued, as shown in the uclamp frequency spike problem
in Section 5.2 of Documentation/scheduler/sched-util-clamp.rst. A
utilization constantly sitting at 1024 used to imply that the CPU is
driven at its max capacity, but with UCLAMP_MAX this assumption is no
longer true. To mitigate this, one idea is to filter out uclamp values
for short-running tasks. However, the effectiveness of this idea
remains to be seen.

2. There is no way to differentiate between a CPU throttled by
UCLAMP_MAX and a CPU genuinely running at its peak, as both show a
utilization of 1024. However, in certain cases it's important to tell
the difference, as we still want to take the opportunity to enqueue
tasks in the former case.

It's also debatable whether uclamp should be a frequency hint. An
example is a system with a mid CPU of capacity 500, and a little CPU
with capacity 100. If we have 10 tasks with UCLAMP_MIN of 101, then
util_fits_cpu() will schedule them on the mid CPU because feec()
thinks they don't fit on the little CPU, even though we should use the
little core at some point instead of making the mid CPU even more
crowded. Of course, in reality this doesn't happen because of other
mechanisms like load balancing, but it's also not good when other
mechanisms can cancel the effect of uclamp in feec(). A CPU running at
capacity 500 for
1ms or for 1000ms gives completely different performance levels, so
trying to fit only the frequency does not give us any performance
guarantees. If we then talk about not only running at some frequency but
also for some amount of time, then what we really mean is a capacity
hint, not a frequency hint.

It's even worse when the tasks scheduled on the mid CPU also have
UCLAMP_MAX values. In the previous example, a single UCLAMP_MAX of, say,
110 can make the mid CPU run at a much lower frequency than 500, so
honoring the UCLAMP_MIN frequency hint by placing the task on the mid
CPU is then a really bad decision, compared to running it on the little
CPU, which is less crowded and has pretty much the same capacity (100 vs
110) anyway.

This series attempts to solve these problems by tracking a
util_avg_uclamp signal in tasks and cfs_rq. At task level,
p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
must always be the sum of all util_avg_uclamp of all the entities on
this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum
aggregation of all the clamped values, which indicates the frequency
this rq should run at and what the utilization is.
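
As an illustration of the invariant (a simplified sketch only; the
helper below is hypothetical, and the real patches also have to handle
group entities, migration and removed-util accounting):

	/*
	 * Sketch: a task's contribution is its util_avg clamped into
	 * [uclamp_min, uclamp_max], and the cfs_rq-level value is simply
	 * the sum of the contributions of all enqueued entities.
	 */
	static void sketch_enqueue_util_avg_uclamp(struct cfs_rq *cfs_rq,
						   struct sched_entity *se)
	{
		struct task_struct *p = task_of(se);
		unsigned long lo = uclamp_eff_value(p, UCLAMP_MIN);
		unsigned long hi = uclamp_eff_value(p, UCLAMP_MAX);
		unsigned long util = clamp(READ_ONCE(se->avg.util_avg), lo, hi);

		WRITE_ONCE(se->avg.util_avg_uclamp, util);
		cfs_rq->avg.util_avg_uclamp += util;
	}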

This idea solves Problem 1 by capping the utilization of an
always-running task throttled by UCLAMP_MAX. Although the task (denoted
by Task A) has no idle time, the util_avg_uclamp signal gives its
UCLAMP_MAX value instead of 1024, so even if another task (Task B) with
a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's
UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization,
which means we no longer have the frequency spike problem caused by B.
This should mean that we might completely avoid the need for uclamp
filtering.
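
With purely illustrative numbers (not taken from a real trace): if A is
always running with UCLAMP_MAX = 300 and B wakes up with a utilization
of 200 and the default UCLAMP_MAX of 1024, the rq-level signal becomes
roughly

	util_avg_uclamp = 300 (A, clamped) + 200 (B) = 500

instead of the ~1024 that max aggregation effectively requests while B
is runnable.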

It also solves Problem 2 by tracking the capped utilization of a rq.
Using util_avg_uclamp, we are able to differentiate between a CPU
throttled by UCLAMP_MAX and one that is actually running at its peak
capacity. An always-running rq with a task having UCLAMP_MAX at 300 will
report a cfs_rq util_avg_uclamp at 300, not 1024, which means task
placement sees spare capacity on this CPU and will allocate some tasks
to it, instead of treating it the same way as a CPU actually running at
the peak. This brings us closer to the benefits of Android's sum
aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at
https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510),
which shows energy benefits because we are able to schedule tasks on
smaller cores which are UCLAMP_MAX capped, instead of finding a fresh
big CPU. However, a major difference is that this series has an
amortized cost of O(1), instead of the O(n) cost of cpu_util_next() in
Android kernels.

It also avoids the shortcomings of treating uclamp as a frequency hint.
Under sum aggregation, the scheduler sees the sum-aggregated
util_avg_uclamp, avoids the problem of enqueueing too many tasks just to
fit the frequency, and will place tasks on the little CPUs in the
previous example.

Patch 2/6 tries to simulate PELT decay in the new util_avg_uclamp
signal, as this gives us a gradual decay of utilization which avoids
premature CPU frequency drops. This is a major caveat of this series. We
should explore whether there's a better way to integrate uclamp directly
into the util_avg signal, instead of introducing a new util_avg_uclamp
and then simulating PELT on it.
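
As a rough illustration of what simulating PELT decay means (a sketch
only; decay_load() is the existing PELT helper in kernel/sched/pelt.c,
but the function below and its parameters are made up and are not the
code in patch 2/6):

	/*
	 * Sketch: move util_avg_uclamp towards its clamped target. When
	 * the target is lower than the current value, decay geometrically
	 * as PELT would, instead of dropping to the target immediately,
	 * so the requested frequency comes down gradually rather than
	 * instantly.
	 */
	static void sketch_decay_util_avg_uclamp(struct sched_avg *avg,
						 unsigned long target,
						 u64 periods)
	{
		unsigned long util = READ_ONCE(avg->util_avg_uclamp);

		if (util <= target) {
			/* Going up: simply track the clamped target. */
			WRITE_ONCE(avg->util_avg_uclamp, target);
			return;
		}

		/* Going down: approximate PELT's y^periods decay. */
		util = decay_load(util, periods);
		if (util < target)
			util = target;

		WRITE_ONCE(avg->util_avg_uclamp, util);
	}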

This new design significantly simplifies uclamp logic.
Patch 4/6 removes the tri-state return value of util_fits_cpu().
Patch 5/6 then completely removes uclamp buckets. Because the
util_avg_uclamp is already a clamped value propagated from bottom to
top, there's no need to clamp anything at the top level and we can just
use this value for frequency selection and spare capacity search.
Basically, the idea is that all uclamp logic happens inside
util_avg_uclamp, and other code using this new signal can just use it
like util_avg, without even knowing that uclamp exists. This
drastically reduces the complexity of uclamp and makes the code that
considers all the uclamp corner cases unnecessary. At the end of the
series, we remove 749 lines of code while adding a bit more than 300
(although once the documentation is updated, it will be a bit more than
that).

Note that this series is still considered RFC status. TODO items are:

1. Implement sum aggregation for RT tasks.
2. Improve handling of cpu_util(boost).

TESTING:

Initial test and benchmark results suggest that sum aggregation is not
only significantly simpler, but also generally performs much better at
achieving what uclamp is supposed to do. Two notebooks are shared at

https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/upstream.ipynb
https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/sum.ipynb

The experiments in the notebooks were done on an Arm Juno r2 board.
CPU0-3 are little cores with a capacity of 383. CPU4-5 are big cores.
The rt-app profiles used for these experiments are included in the
notebooks.

Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.

The scheduling decisions are plotted in Out[11] and Out[14]
respectively. Max aggregation fails to recognize the opportunity to run
all of them on the efficient little Power Domain (PD), whereas sum
aggregation runs all 4 on the little PD, leaving the big PD completely
powered off, satisfying uclamp requests while saving power.

Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
slightly above the capacity of the little CPU.

Results are in Out[17] and Out[82]. The purpose is to use UCLAMP_MIN to
place tasks on the big cores. Max aggregation is pretty much in an
overutilized state the entire time. Sum aggregation sees that each big
CPU can hold two such tasks (satisfying both the UCLAMP_MIN and
UCLAMP_MAX requests of all 4 tasks), quickly settles down, and stays
under EAS without triggering overutilization.

Scenario 3: Task A is a task with a small utilization pinned to CPU4.
Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
at 200. After a while, task A is then pinned to CPU5, joining B.

Results are in Out[23] and Out[95]. The blue util_avg signal is the
original root CFS util_avg. The yellow line is the root CFS utilization
after considering uclamp. Max aggregation sees a frequency
spike at 579.1s. When zoomed in, one can see square-wave-like
utilization values caused by A periodically going to sleep. When A
wakes up, its default UCLAMP_MAX of 1024 uncaps B and pushes the CPU to
its highest frequency. When A sleeps, B's UCLAMP_MAX takes effect again
and reduces the rq utilization to 200. This happens repeatedly, hence
the square wave. In contrast, sum aggregation sees a normal increase in
utilization when A joins B, at around 2507s, without any square-wave
behavior.

Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the
little PD (CPU0-3). 4 identical tasks pinned to the big PD (CPU4-5).
After a while, the CPU pinning of the 4 tasks on the big PD is removed.

Results are in Out[29] and Out[101]. Max aggregation fails to identify
that all 8 tasks can be packed on the little PD, whereas sum aggregation
immediately moves the 4 tasks previously pinned to the big PD over to
the little PD the moment pinning is removed. Sum aggregation understands
that even when the rq seems to have a utilization of 1024, this is only
because of UCLAMP_MAX and there is still room to pack 2 such tasks on
each little CPU, leaving the big PD untouched and saving power.

BENCHMARKS:

One TODO item is to handle cpu_util(boost) better. The current handling
of boost is far from ideal and is a known caveat. All benchmark numbers
below were obtained without any boosting in cpu_util(), for both max and
sum aggregation.

Geekbench 6 (on Rock-5B board)
+-----+-------------+------------+
| | Single-core | Multi-core |
+-----+-------------+------------+
| Max | 800.6 | 2977.0 |
| Sum | 801.2 | 2981.4 |
+-----+-------------+------------+

No regression is seen after switching to sum aggregation.

Jankbench (backporting sum aggregation to Android 5.18 mainline kernel)

+------+-----------------+-------+-----------+
| | variable | value | perc_diff |
+------+-----------------+-------+-----------+
| main | jank_percentage | 1.1 | 0.00% |
| sum | jank_percentage | 0.5 | -53.92% |
+------+-----------------+-------+-----------+

+------------+--------+------+-------+-----------+
| | metric | tag | value | perc_diff |
+------------+--------+------+-------+-----------+
| CPU | gmean | main | 166.1 | 0.00% |
| CPU-Big | gmean | main | 55.1 | 0.00% |
| CPU-Little | gmean | main | 91.7 | 0.00% |
| CPU-Mid | gmean | main | 19.2 | 0.00% |
| CPU | gmean | sum | 161.3 | -2.85% |
| CPU-Big | gmean | sum | 52.9 | -3.97% |
| CPU-Little | gmean | sum | 86.7 | -5.39% |
| CPU-Mid | gmean | sum | 21.6 | 12.63% |
+------------+--------+------+-------+-----------+

The TL;DR of Jankbench is that sum aggregation reduces jank by 54%
while using 2.85% less CPU power. Note that on Pixel 6 this benchmark by
default has some sort of uclamp feedback applied to it: when jank is
detected, a UCLAMP_MIN value of 512 is applied to certain display and
benchmark threads (any help in identifying where these UCLAMP_MIN values
come from in the mainline Android kernel is appreciated). In contrast,
we constantly apply a UCLAMP_MIN value of 60 to these threads without
any feedback loop. If a similar feedback loop were used to apply
UCLAMP_MIN only when needed, power consumption could be expected to be
even lower.

Hongyan Xia (6):
sched/uclamp: Track uclamped util_avg in sched_avg
sched/uclamp: Simulate PELT decay in util_avg_uclamp
sched/fair: Use CFS util_avg_uclamp for utilization and frequency
sched/fair: Rewrite util_fits_cpu()
sched/uclamp: Remove all uclamp bucket logic
sched/uclamp: Simplify uclamp_eff_value()

include/linux/sched.h | 13 +-
init/Kconfig | 32 ---
kernel/sched/core.c | 312 ++-------------------------
kernel/sched/cpufreq_schedutil.c | 19 +-
kernel/sched/fair.c | 354 +++++++++----------------------
kernel/sched/pelt.c | 146 ++++++++++++-
kernel/sched/rt.c | 4 -
kernel/sched/sched.h | 208 ++++++------------
8 files changed, 339 insertions(+), 749 deletions(-)

--
2.34.1


2023-10-04 09:06:39

by Hongyan Xia

Subject: [RFC PATCH 3/6] sched/fair: Use CFS util_avg_uclamp for utilization and frequency

From: Hongyan Xia <[email protected]>

Switch to the new util_avg_uclamp for task and runqueue utilization.
Since util_est() calls task_util(), this means util_est is now also a
clamped value.

Now that we have the sum aggregated CFS util value, we do not need to
consult uclamp buckets to know how the frequency should be clamped. We
simply look at the aggregated top level root_cfs_util_uclamp to know
what frequency to choose. Because we simulate PELT decay in
root_cfs_util_uclamp anyway, there's no need for extra logic in
cpufreq_schedutil.c to avoid premature frequency drops.

Consequently, there is no need for uclamp_rq_util_with(). This function
takes the un-clamped util value and sends it through various clamping
filters to get the final value. However, util_avg_uclamp is propagated
with clamping in mind already, so it does not need to be clamped again.

TODO: There are two major caveats in this patch.
1. At the moment sum aggregation does not consider RT tasks. The avg_rt
signal considers all RT tasks on this rq as a single entity, which
means the utilization of individual RT tasks is not tracked
separately. If we want to use sum aggregation, we might have to track
utilization of RT tasks individually.
2. Busy time accounting in compute_energy() now takes the uclamp'ed
value. Ideally, it should reflect reality and use the un-clamp'ed
values. However, that would require maintaining both the normal and
uclamp'ed values for util_est. This needs to be revisited if it
causes real problems in practice.

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/core.c | 10 +--
kernel/sched/cpufreq_schedutil.c | 19 +++---
kernel/sched/fair.c | 38 +++++------
kernel/sched/sched.h | 106 +++++++++----------------------
4 files changed, 59 insertions(+), 114 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index efe3848978a0..32511ee63f01 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7402,10 +7402,12 @@ int sched_core_idle_cpu(int cpu)
* The DL bandwidth number otoh is not a measured metric but a value computed
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
+ *
+ * The util_cfs parameter has already taken uclamp into account (unless uclamp
+ * support is not compiled in).
*/
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- enum cpu_util_type type,
- struct task_struct *p)
+ enum cpu_util_type type)
{
unsigned long dl_util, util, irq, max;
struct rq *rq = cpu_rq(cpu);
@@ -7439,8 +7441,6 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* frequency will be gracefully reduced with the utilization decay.
*/
util = util_cfs + cpu_util_rt(rq);
- if (type == FREQUENCY_UTIL)
- util = uclamp_rq_util_with(rq, util, p);

dl_util = cpu_util_dl(rq);

@@ -7493,7 +7493,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,

unsigned long sched_cpu_util(int cpu)
{
- return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
+ return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL);
}
#endif /* CONFIG_SMP */

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 4492608b7d7f..6e63952b8063 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -159,8 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
struct rq *rq = cpu_rq(sg_cpu->cpu);

sg_cpu->bw_dl = cpu_bw_dl(rq);
- sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
- FREQUENCY_UTIL, NULL);
+ sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util, FREQUENCY_UTIL);
}

/**
@@ -282,7 +281,11 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
* into the same scale so we can compare.
*/
boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
- boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
+ /*
+ * TODO: Investigate what should be done here. In sum aggregation there
+ * is no such thing as uclamp_max on a rq, so how do we cap the boost
+ * value, or do we want to cap the boost frequency here at all?
+ */
if (sg_cpu->util < boost)
sg_cpu->util = boost;
}
@@ -346,11 +349,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
/*
* Do not reduce the frequency if the CPU has not been idle
* recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
*/
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq) {
+ if (sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq) {
next_f = sg_policy->next_freq;

/* Restore cached freq as next_freq has changed */
@@ -399,11 +399,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
/*
* Do not reduce the target performance level if the CPU has not been
* idle recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
*/
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
+ if (sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;

cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 420af57d01ee..31004aae5f09 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4572,10 +4572,17 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)

static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);

+#ifdef CONFIG_UCLAMP_TASK
+static inline unsigned long task_util(struct task_struct *p)
+{
+ return READ_ONCE(p->se.avg.util_avg_uclamp);
+}
+#else
static inline unsigned long task_util(struct task_struct *p)
{
return READ_ONCE(p->se.avg.util_avg);
}
+#endif

static inline unsigned long _task_util_est(struct task_struct *p)
{
@@ -4589,22 +4596,6 @@ static inline unsigned long task_util_est(struct task_struct *p)
return max(task_util(p), _task_util_est(p));
}

-#ifdef CONFIG_UCLAMP_TASK
-static inline unsigned long uclamp_task_util(struct task_struct *p,
- unsigned long uclamp_min,
- unsigned long uclamp_max)
-{
- return clamp(task_util_est(p), uclamp_min, uclamp_max);
-}
-#else
-static inline unsigned long uclamp_task_util(struct task_struct *p,
- unsigned long uclamp_min,
- unsigned long uclamp_max)
-{
- return task_util_est(p);
-}
-#endif
-
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
struct task_struct *p)
{
@@ -7468,11 +7459,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
static unsigned long
cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
{
- struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
- unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
+ struct rq *rq = cpu_rq(cpu);
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ unsigned long util = root_cfs_util(rq);
+ bool capped = uclamp_rq_is_capped(rq);
unsigned long runnable;

- if (boost) {
+ if (boost && !capped) {
runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
util = max(util, runnable);
}
@@ -7629,7 +7622,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
for_each_cpu(cpu, pd_cpus) {
unsigned long util = cpu_util(cpu, p, -1, 0);

- busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
+ busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL);
}

eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7650,7 +7643,6 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
int cpu;

for_each_cpu(cpu, pd_cpus) {
- struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
unsigned long eff_util;

@@ -7661,7 +7653,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
+ eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL);
max_util = max(max_util, eff_util);
}

@@ -7758,7 +7750,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target = prev_cpu;

sync_entity_load_avg(&p->se);
- if (!uclamp_task_util(p, p_util_min, p_util_max))
+ if (!task_util_est(p))
goto unlock;

eenv_task_busy_time(&eenv, p, prev_cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 98fa5e79f4e9..e73aedd9a76b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2997,8 +2997,7 @@ enum cpu_util_type {
};

unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- enum cpu_util_type type,
- struct task_struct *p);
+ enum cpu_util_type type);

/*
* Verify the fitness of task @p to run on @cpu taking into account the
@@ -3055,85 +3054,44 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
}

-/**
- * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
- * @rq: The rq to clamp against. Must not be NULL.
- * @util: The util value to clamp.
- * @p: The task to clamp against. Can be NULL if you want to clamp
- * against @rq only.
- *
- * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
- *
- * If sched_uclamp_used static key is disabled, then just return the util
- * without any clamping since uclamp aggregation at the rq level in the fast
- * path is disabled, rendering this operation a NOP.
+/*
+ * When uclamp is compiled in, the aggregation at rq level is 'turned off'
+ * by default in the fast path and only gets turned on once userspace performs
+ * an operation that requires it.
*
- * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
- * will return the correct effective uclamp value of the task even if the
- * static key is disabled.
+ * Returns true if userspace opted-in to use uclamp and aggregation at rq level
+ * hence is active.
*/
-static __always_inline
-unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
- struct task_struct *p)
+static inline bool uclamp_is_used(void)
{
- unsigned long min_util = 0;
- unsigned long max_util = 0;
-
- if (!static_branch_likely(&sched_uclamp_used))
- return util;
-
- if (p) {
- min_util = uclamp_eff_value(p, UCLAMP_MIN);
- max_util = uclamp_eff_value(p, UCLAMP_MAX);
-
- /*
- * Ignore last runnable task's max clamp, as this task will
- * reset it. Similarly, no need to read the rq's min clamp.
- */
- if (uclamp_rq_is_idle(rq))
- goto out;
- }
-
- min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
- max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
-out:
- /*
- * Since CPU's {min,max}_util clamps are MAX aggregated considering
- * RUNNABLE tasks with _different_ clamps, we can end up with an
- * inversion. Fix it now when the clamps are applied.
- */
- if (unlikely(min_util >= max_util))
- return min_util;
+ return static_branch_likely(&sched_uclamp_used);
+}

- return clamp(util, min_util, max_util);
+static inline unsigned long root_cfs_util(struct rq *rq)
+{
+ return READ_ONCE(rq->root_cfs_util_uclamp);
}

/* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq)
{
- unsigned long rq_util;
- unsigned long max_util;
+ unsigned long uclamp_util, real_util;

- if (!static_branch_likely(&sched_uclamp_used))
+ if (!uclamp_is_used())
return false;

- rq_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
- max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
-
- return max_util != SCHED_CAPACITY_SCALE && rq_util >= max_util;
-}
+ /*
+ * At the moment there's no such thing as uclamp_max for RT tasks, so
+ * we only see if CFS is capped.
+ *
+ * TODO: Implement uclamp sum aggregation for RT.
+ */
+ uclamp_util = root_cfs_util(rq);
+ real_util = READ_ONCE(rq->cfs.avg.util_avg);

-/*
- * When uclamp is compiled in, the aggregation at rq level is 'turned off'
- * by default in the fast path and only gets turned on once userspace performs
- * an operation that requires it.
- *
- * Returns true if userspace opted-in to use uclamp and aggregation at rq level
- * hence is active.
- */
-static inline bool uclamp_is_used(void)
-{
- return static_branch_likely(&sched_uclamp_used);
+ /* XXX: The 80 margin here isn't backed by science. */
+ return uclamp_util < SCHED_CAPACITY_SCALE &&
+ real_util > uclamp_util + 80;
}

static inline void enqueue_util_avg_uclamp(struct cfs_rq *cfs_rq,
@@ -3172,13 +3130,6 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,
return SCHED_CAPACITY_SCALE;
}

-static inline
-unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
- struct task_struct *p)
-{
- return util;
-}
-
static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }

static inline bool uclamp_is_used(void)
@@ -3205,6 +3156,11 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return false;
}

+static inline unsigned long root_cfs_util(struct rq *rq)
+{
+ return READ_ONCE(rq->cfs.avg.util_avg);
+}
+
static inline void enqueue_util_avg_uclamp(struct cfs_rq *cfs_rq,
struct sched_entity *se)
{
--
2.34.1

2023-10-04 09:07:15

by Hongyan Xia

Subject: [RFC PATCH 5/6] sched/uclamp: Remove all uclamp bucket logic

From: Hongyan Xia <[email protected]>

Also rewrite uclamp_update_active() so that the effective uclamp values
are updated every time task group properties change, system defaults
change, or a request is issued from userspace.

TODO: Rewrite documentation to match the new logic.

Signed-off-by: Hongyan Xia <[email protected]>
---
include/linux/sched.h | 4 -
init/Kconfig | 32 -----
kernel/sched/core.c | 295 +++---------------------------------------
kernel/sched/fair.c | 4 -
kernel/sched/rt.c | 4 -
kernel/sched/sched.h | 85 ------------
6 files changed, 16 insertions(+), 408 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 825d7b86b006..5b8d5abb2bba 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -685,9 +685,6 @@ struct sched_dl_entity {
};

#ifdef CONFIG_UCLAMP_TASK
-/* Number of utilization clamp buckets (shorter alias) */
-#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
-
/*
* Utilization clamp for a scheduling entity
* @value: clamp value "assigned" to a se
@@ -713,7 +710,6 @@ struct sched_dl_entity {
*/
struct uclamp_se {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
- unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int active : 1;
unsigned int user_defined : 1;
};
diff --git a/init/Kconfig b/init/Kconfig
index 5e7d4885d1bf..4ec0023d2149 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -808,38 +808,6 @@ config UCLAMP_TASK
enforce or grant any specific bandwidth for tasks.

If in doubt, say N.
-
-config UCLAMP_BUCKETS_COUNT
- int "Number of supported utilization clamp buckets"
- range 5 20
- default 5
- depends on UCLAMP_TASK
- help
- Defines the number of clamp buckets to use. The range of each bucket
- will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
- number of clamp buckets the finer their granularity and the higher
- the precision of clamping aggregation and tracking at run-time.
-
- For example, with the minimum configuration value we will have 5
- clamp buckets tracking 20% utilization each. A 25% boosted tasks will
- be refcounted in the [20..39]% bucket and will set the bucket clamp
- effective value to 25%.
- If a second 30% boosted task should be co-scheduled on the same CPU,
- that task will be refcounted in the same bucket of the first task and
- it will boost the bucket clamp effective value to 30%.
- The clamp effective value of a bucket is reset to its nominal value
- (20% in the example above) when there are no more tasks refcounted in
- that bucket.
-
- An additional boost/capping margin can be added to some tasks. In the
- example above the 25% task will be boosted to 30% until it exits the
- CPU. If that should be considered not acceptable on certain systems,
- it's always possible to reduce the margin by increasing the number of
- clamp buckets to trade off used memory for run-time tracking
- precision.
-
- If in doubt, use the default value.
-
endmenu

#
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 32511ee63f01..c5bf01e7df28 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1387,17 +1387,9 @@ static struct uclamp_se uclamp_default[UCLAMP_CNT];
*/
DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

-/* Integer rounded range for each bucket */
-#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
-
#define for_each_clamp_id(clamp_id) \
for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++)

-static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
-{
- return min_t(unsigned int, clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1);
-}
-
static inline unsigned int uclamp_none(enum uclamp_id clamp_id)
{
if (clamp_id == UCLAMP_MIN)
@@ -1409,58 +1401,9 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
unsigned int value, bool user_defined)
{
uc_se->value = value;
- uc_se->bucket_id = uclamp_bucket_id(value);
uc_se->user_defined = user_defined;
}

-static inline unsigned int
-uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- /*
- * Avoid blocked utilization pushing up the frequency when we go
- * idle (which drops the max-clamp) by retaining the last known
- * max-clamp.
- */
- if (clamp_id == UCLAMP_MAX) {
- rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
- return clamp_value;
- }
-
- return uclamp_none(UCLAMP_MIN);
-}
-
-static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- /* Reset max-clamp retention only on idle exit */
- if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
- return;
-
- uclamp_rq_set(rq, clamp_id, clamp_value);
-}
-
-static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
- int bucket_id = UCLAMP_BUCKETS - 1;
-
- /*
- * Since both min and max clamps are max aggregated, find the
- * top most bucket with tasks in.
- */
- for ( ; bucket_id >= 0; bucket_id--) {
- if (!bucket[bucket_id].tasks)
- continue;
- return bucket[bucket_id].value;
- }
-
- /* No tasks -- default clamp values */
- return uclamp_idle_value(rq, clamp_id, clamp_value);
-}
-
static void __uclamp_update_util_min_rt_default(struct task_struct *p)
{
unsigned int default_util_min;
@@ -1542,196 +1485,24 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)

unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
{
- struct uclamp_se uc_eff;
-
- /* Task currently refcounted: use back-annotated (effective) value */
- if (p->uclamp[clamp_id].active)
- return (unsigned long)p->uclamp[clamp_id].value;
-
- uc_eff = uclamp_eff_get(p, clamp_id);
-
- return (unsigned long)uc_eff.value;
-}
-
-/*
- * When a task is enqueued on a rq, the clamp bucket currently defined by the
- * task's uclamp::bucket_id is refcounted on that rq. This also immediately
- * updates the rq's clamp value if required.
- *
- * Tasks can have a task-specific value requested from user-space, track
- * within each bucket the maximum value for tasks refcounted in it.
- * This "local max aggregation" allows to track the exact "requested" value
- * for each bucket when all its RUNNABLE tasks require the same clamp.
- */
-static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
- struct uclamp_se *uc_se = &p->uclamp[clamp_id];
- struct uclamp_bucket *bucket;
-
- lockdep_assert_rq_held(rq);
+ if (!uclamp_is_used() || !p->uclamp[clamp_id].active)
+ return uclamp_none(clamp_id);

- /* Update task effective clamp */
- p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
-
- bucket = &uc_rq->bucket[uc_se->bucket_id];
- bucket->tasks++;
- uc_se->active = true;
-
- uclamp_idle_reset(rq, clamp_id, uc_se->value);
-
- /*
- * Local max aggregation: rq buckets always track the max
- * "requested" clamp value of its RUNNABLE tasks.
- */
- if (bucket->tasks == 1 || uc_se->value > bucket->value)
- bucket->value = uc_se->value;
-
- if (uc_se->value > uclamp_rq_get(rq, clamp_id))
- uclamp_rq_set(rq, clamp_id, uc_se->value);
+ return p->uclamp[clamp_id].value;
}

-/*
- * When a task is dequeued from a rq, the clamp bucket refcounted by the task
- * is released. If this is the last task reference counting the rq's max
- * active clamp value, then the rq's clamp value is updated.
- *
- * Both refcounted tasks and rq's cached clamp values are expected to be
- * always valid. If it's detected they are not, as defensive programming,
- * enforce the expected state and warn.
- */
-static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
- struct uclamp_se *uc_se = &p->uclamp[clamp_id];
- struct uclamp_bucket *bucket;
- unsigned int bkt_clamp;
- unsigned int rq_clamp;
-
- lockdep_assert_rq_held(rq);
-
- /*
- * If sched_uclamp_used was enabled after task @p was enqueued,
- * we could end up with unbalanced call to uclamp_rq_dec_id().
- *
- * In this case the uc_se->active flag should be false since no uclamp
- * accounting was performed at enqueue time and we can just return
- * here.
- *
- * Need to be careful of the following enqueue/dequeue ordering
- * problem too
- *
- * enqueue(taskA)
- * // sched_uclamp_used gets enabled
- * enqueue(taskB)
- * dequeue(taskA)
- * // Must not decrement bucket->tasks here
- * dequeue(taskB)
- *
- * where we could end up with stale data in uc_se and
- * bucket[uc_se->bucket_id].
- *
- * The following check here eliminates the possibility of such race.
- */
- if (unlikely(!uc_se->active))
- return;
-
- bucket = &uc_rq->bucket[uc_se->bucket_id];
-
- SCHED_WARN_ON(!bucket->tasks);
- if (likely(bucket->tasks))
- bucket->tasks--;
-
- uc_se->active = false;
-
- /*
- * Keep "local max aggregation" simple and accept to (possibly)
- * overboost some RUNNABLE tasks in the same bucket.
- * The rq clamp bucket value is reset to its base value whenever
- * there are no more RUNNABLE tasks refcounting it.
- */
- if (likely(bucket->tasks))
- return;
-
- rq_clamp = uclamp_rq_get(rq, clamp_id);
- /*
- * Defensive programming: this should never happen. If it happens,
- * e.g. due to future modification, warn and fixup the expected value.
- */
- SCHED_WARN_ON(bucket->value > rq_clamp);
- if (bucket->value >= rq_clamp) {
- bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
- uclamp_rq_set(rq, clamp_id, bkt_clamp);
- }
-}
-
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
-{
- enum uclamp_id clamp_id;
-
- /*
- * Avoid any overhead until uclamp is actually used by the userspace.
- *
- * The condition is constructed such that a NOP is generated when
- * sched_uclamp_used is disabled.
- */
- if (!static_branch_unlikely(&sched_uclamp_used))
- return;
-
- if (unlikely(!p->sched_class->uclamp_enabled))
- return;
-
- for_each_clamp_id(clamp_id)
- uclamp_rq_inc_id(rq, p, clamp_id);
-
- /* Reset clamp idle holding when there is one RUNNABLE task */
- if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
- rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
-}
-
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
+static inline void
+uclamp_update_active_nolock(struct task_struct *p)
{
enum uclamp_id clamp_id;

- /*
- * Avoid any overhead until uclamp is actually used by the userspace.
- *
- * The condition is constructed such that a NOP is generated when
- * sched_uclamp_used is disabled.
- */
- if (!static_branch_unlikely(&sched_uclamp_used))
- return;
-
- if (unlikely(!p->sched_class->uclamp_enabled))
- return;
-
for_each_clamp_id(clamp_id)
- uclamp_rq_dec_id(rq, p, clamp_id);
-}
-
-static inline void uclamp_rq_reinc_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- if (!p->uclamp[clamp_id].active)
- return;
-
- uclamp_rq_dec_id(rq, p, clamp_id);
- uclamp_rq_inc_id(rq, p, clamp_id);
-
- /*
- * Make sure to clear the idle flag if we've transiently reached 0
- * active tasks on rq.
- */
- if (clamp_id == UCLAMP_MAX && (rq->uclamp_flags & UCLAMP_FLAG_IDLE))
- rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
+ p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
}

static inline void
uclamp_update_active(struct task_struct *p)
{
- enum uclamp_id clamp_id;
struct rq_flags rf;
struct rq *rq;

@@ -1745,14 +1516,7 @@ uclamp_update_active(struct task_struct *p)
*/
rq = task_rq_lock(p, &rf);

- /*
- * Setting the clamp bucket is serialized by task_rq_lock().
- * If the task is not yet RUNNABLE and its task_struct is not
- * affecting a valid clamp bucket, the next time it's enqueued,
- * it will already see the updated clamp bucket value.
- */
- for_each_clamp_id(clamp_id)
- uclamp_rq_reinc_id(rq, p, clamp_id);
+ uclamp_update_active_nolock(p);

task_rq_unlock(rq, p, &rf);
}
@@ -1983,26 +1747,22 @@ static void __setscheduler_uclamp(struct task_struct *p,
uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
attr->sched_util_max, true);
}
+
+ uclamp_update_active_nolock(p);
}

static void uclamp_fork(struct task_struct *p)
{
enum uclamp_id clamp_id;

- /*
- * We don't need to hold task_rq_lock() when updating p->uclamp_* here
- * as the task is still at its early fork stages.
- */
- for_each_clamp_id(clamp_id)
- p->uclamp[clamp_id].active = false;
-
- if (likely(!p->sched_reset_on_fork))
- return;
-
- for_each_clamp_id(clamp_id) {
- uclamp_se_set(&p->uclamp_req[clamp_id],
- uclamp_none(clamp_id), false);
+ if (unlikely(p->sched_reset_on_fork)) {
+ for_each_clamp_id(clamp_id) {
+ uclamp_se_set(&p->uclamp_req[clamp_id],
+ uclamp_none(clamp_id), false);
+ }
}
+
+ uclamp_update_active(p);
}

static void uclamp_post_fork(struct task_struct *p)
@@ -2010,28 +1770,10 @@ static void uclamp_post_fork(struct task_struct *p)
uclamp_update_util_min_rt_default(p);
}

-static void __init init_uclamp_rq(struct rq *rq)
-{
- enum uclamp_id clamp_id;
- struct uclamp_rq *uc_rq = rq->uclamp;
-
- for_each_clamp_id(clamp_id) {
- uc_rq[clamp_id] = (struct uclamp_rq) {
- .value = uclamp_none(clamp_id)
- };
- }
-
- rq->uclamp_flags = UCLAMP_FLAG_IDLE;
-}
-
static void __init init_uclamp(void)
{
struct uclamp_se uc_max = {};
enum uclamp_id clamp_id;
- int cpu;
-
- for_each_possible_cpu(cpu)
- init_uclamp_rq(cpu_rq(cpu));

for_each_clamp_id(clamp_id) {
uclamp_se_set(&init_task.uclamp_req[clamp_id],
@@ -2050,8 +1792,6 @@ static void __init init_uclamp(void)
}

#else /* CONFIG_UCLAMP_TASK */
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
static inline int uclamp_validate(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -2098,7 +1838,6 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
}

- uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);

if (sched_core_enabled(rq))
@@ -2118,7 +1857,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}

- uclamp_rq_dec(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}

@@ -10659,7 +10397,6 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
if (eff[clamp_id] == uc_se[clamp_id].value)
continue;
uc_se[clamp_id].value = eff[clamp_id];
- uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
if (!clamps) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 75a8f7d50e9c..bfe01f534a21 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12708,10 +12708,6 @@ DEFINE_SCHED_CLASS(fair) = {
#ifdef CONFIG_SCHED_CORE
.task_is_throttled = task_is_throttled_fair,
#endif
-
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0597ba0f85ff..68f257150c16 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2732,10 +2732,6 @@ DEFINE_SCHED_CLASS(rt) = {
#ifdef CONFIG_SCHED_CORE
.task_is_throttled = task_is_throttled_rt,
#endif
-
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
};

#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e73aedd9a76b..30dee8eb2ed9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -903,46 +903,6 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif /* CONFIG_SMP */

#ifdef CONFIG_UCLAMP_TASK
-/*
- * struct uclamp_bucket - Utilization clamp bucket
- * @value: utilization clamp value for tasks on this clamp bucket
- * @tasks: number of RUNNABLE tasks on this clamp bucket
- *
- * Keep track of how many tasks are RUNNABLE for a given utilization
- * clamp value.
- */
-struct uclamp_bucket {
- unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
- unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
-};
-
-/*
- * struct uclamp_rq - rq's utilization clamp
- * @value: currently active clamp values for a rq
- * @bucket: utilization clamp buckets affecting a rq
- *
- * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
- * A clamp value is affecting a rq when there is at least one task RUNNABLE
- * (or actually running) with that value.
- *
- * There are up to UCLAMP_CNT possible different clamp values, currently there
- * are only two: minimum utilization and maximum utilization.
- *
- * All utilization clamping values are MAX aggregated, since:
- * - for util_min: we want to run the CPU at least at the max of the minimum
- * utilization required by its currently RUNNABLE tasks.
- * - for util_max: we want to allow the CPU to run up to the max of the
- * maximum utilization allowed by its currently RUNNABLE tasks.
- *
- * Since on each system we expect only a limited number of different
- * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
- * the metrics required to compute all the per-rq utilization clamp values.
- */
-struct uclamp_rq {
- unsigned int value;
- struct uclamp_bucket bucket[UCLAMP_BUCKETS];
-};
-
DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
#endif /* CONFIG_UCLAMP_TASK */

@@ -989,12 +949,8 @@ struct rq {
u64 nr_switches;

#ifdef CONFIG_UCLAMP_TASK
- /* Utilization clamp values based on CPU's RUNNABLE tasks */
- struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
- unsigned int uclamp_flags;
unsigned int root_cfs_util_uclamp;
unsigned int root_cfs_util_uclamp_removed;
-#define UCLAMP_FLAG_IDLE 0x01
#endif

struct cfs_rq cfs;
@@ -2229,11 +2185,6 @@ struct affinity_context {
};

struct sched_class {
-
-#ifdef CONFIG_UCLAMP_TASK
- int uclamp_enabled;
-#endif
-
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
@@ -3037,23 +2988,6 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
#ifdef CONFIG_UCLAMP_TASK
unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);

-static inline unsigned long uclamp_rq_get(struct rq *rq,
- enum uclamp_id clamp_id)
-{
- return READ_ONCE(rq->uclamp[clamp_id].value);
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int value)
-{
- WRITE_ONCE(rq->uclamp[clamp_id].value, value);
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
- return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
-}
-
/*
* When uclamp is compiled in, the aggregation at rq level is 'turned off'
* by default in the fast path and only gets turned on once userspace performs
@@ -3137,25 +3071,6 @@ static inline bool uclamp_is_used(void)
return false;
}

-static inline unsigned long uclamp_rq_get(struct rq *rq,
- enum uclamp_id clamp_id)
-{
- if (clamp_id == UCLAMP_MIN)
- return 0;
-
- return SCHED_CAPACITY_SCALE;
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int value)
-{
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
- return false;
-}
-
static inline unsigned long root_cfs_util(struct rq *rq)
{
return READ_ONCE(rq->cfs.avg.util_avg);
--
2.34.1

2023-10-04 09:07:17

by Hongyan Xia

Subject: [RFC PATCH 4/6] sched/fair: Rewrite util_fits_cpu()

From: Hongyan Xia <[email protected]>

Currently, there's no way to distinguish between 1) a CPU that is
actually maxed out at its highest frequency and 2) one that is throttled
because of UCLAMP_MAX, since both present a util_avg value of 1024. This
is problematic because when we try to pick a CPU for a task to run on,
we would like to give 2) a chance, or at least prefer 2) to 1).

Current upstream gives neither a chance because the spare capacity is 0
in either case. There are patches that fix this problem by considering 0
capacities [1], but this might still be inefficient because it ends up
treating 1) and 2) equally and will always pick the same one, since we
don't change how we iterate through all CPUs. If we end up putting many
tasks on 1), then this creates a seriously unbalanced load across the
two CPUs.

Fix this by using util_avg_uclamp in util_fits_cpu(). This way, case 1)
still keeps its utilization at 1024, whereas 2) shows spare capacity if
the sum of util_avg_uclamp values is still under the CPU capacity.
Note that this is roughly what the sum aggregation does in the Android
kernel [2] (although we clamp UCLAMP_MIN as well in this patch, which
may need some discussion), and it shows superior energy savings because
there's a better chance that a task can get scheduled on 2) instead of
finding a big CPU to run on.

Under sum aggregation, checking whether a task fits a CPU becomes much
simpler. We simply do fits_capacity() and there is no need for code that
checks all the uclamp corner cases. This means util_fits_cpu() goes back
to returning true or false instead of a tri-state value, simplifying a
significant amount of code.

[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/fair.c | 253 ++++----------------------------------------
1 file changed, 23 insertions(+), 230 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 31004aae5f09..75a8f7d50e9c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4729,135 +4729,19 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
trace_sched_util_est_se_tp(&p->se);
}

-static inline int util_fits_cpu(unsigned long util,
- unsigned long uclamp_min,
- unsigned long uclamp_max,
- int cpu)
+/* util must be the uclamp'ed value (i.e. from util_avg_uclamp). */
+static inline int util_fits_cpu(unsigned long util, int cpu)
{
- unsigned long capacity_orig, capacity_orig_thermal;
unsigned long capacity = capacity_of(cpu);
- bool fits, uclamp_max_fits;

- /*
- * Check if the real util fits without any uclamp boost/cap applied.
- */
- fits = fits_capacity(util, capacity);
-
- if (!uclamp_is_used())
- return fits;
-
- /*
- * We must use capacity_orig_of() for comparing against uclamp_min and
- * uclamp_max. We only care about capacity pressure (by using
- * capacity_of()) for comparing against the real util.
- *
- * If a task is boosted to 1024 for example, we don't want a tiny
- * pressure to skew the check whether it fits a CPU or not.
- *
- * Similarly if a task is capped to capacity_orig_of(little_cpu), it
- * should fit a little cpu even if there's some pressure.
- *
- * Only exception is for thermal pressure since it has a direct impact
- * on available OPP of the system.
- *
- * We honour it for uclamp_min only as a drop in performance level
- * could result in not getting the requested minimum performance level.
- *
- * For uclamp_max, we can tolerate a drop in performance level as the
- * goal is to cap the task. So it's okay if it's getting less.
- */
- capacity_orig = capacity_orig_of(cpu);
- capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
-
- /*
- * We want to force a task to fit a cpu as implied by uclamp_max.
- * But we do have some corner cases to cater for..
- *
- *
- * C=z
- * | ___
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | |
- * | | | | | | | (util somewhere in this region)
- * | | | | | | |
- * | | | | | | |
- * +----------------------------------------
- * cpu0 cpu1 cpu2
- *
- * In the above example if a task is capped to a specific performance
- * point, y, then when:
- *
- * * util = 80% of x then it does not fit on cpu0 and should migrate
- * to cpu1
- * * util = 80% of y then it is forced to fit on cpu1 to honour
- * uclamp_max request.
- *
- * which is what we're enforcing here. A task always fits if
- * uclamp_max <= capacity_orig. But when uclamp_max > capacity_orig,
- * the normal upmigration rules should withhold still.
- *
- * Only exception is when we are on max capacity, then we need to be
- * careful not to block overutilized state. This is so because:
- *
- * 1. There's no concept of capping at max_capacity! We can't go
- * beyond this performance level anyway.
- * 2. The system is being saturated when we're operating near
- * max capacity, it doesn't make sense to block overutilized.
- */
- uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
- uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
- fits = fits || uclamp_max_fits;
-
- /*
- *
- * C=z
- * | ___ (region a, capped, util >= uclamp_max)
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | | (region b, uclamp_min <= util <= uclamp_max)
- * |_ _ _|_ _|_ _ _ _| _ | _ _ _| _ | _ _ _ _ _ uclamp_min
- * | | | | | | |
- * | | | | | | | (region c, boosted, util < uclamp_min)
- * +----------------------------------------
- * cpu0 cpu1 cpu2
- *
- * a) If util > uclamp_max, then we're capped, we don't care about
- * actual fitness value here. We only care if uclamp_max fits
- * capacity without taking margin/pressure into account.
- * See comment above.
- *
- * b) If uclamp_min <= util <= uclamp_max, then the normal
- * fits_capacity() rules apply. Except we need to ensure that we
- * enforce we remain within uclamp_max, see comment above.
- *
- * c) If util < uclamp_min, then we are boosted. Same as (b) but we
- * need to take into account the boosted value fits the CPU without
- * taking margin/pressure into account.
- *
- * Cases (a) and (b) are handled in the 'fits' variable already. We
- * just need to consider an extra check for case (c) after ensuring we
- * handle the case uclamp_min > uclamp_max.
- */
- uclamp_min = min(uclamp_min, uclamp_max);
- if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
- return -1;
-
- return fits;
+ return fits_capacity(util, capacity);
}

static inline int task_fits_cpu(struct task_struct *p, int cpu)
{
- unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
- unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
unsigned long util = task_util_est(p);
- /*
- * Return true only if the cpu fully fits the task requirements, which
- * include the utilization but also the performance hints.
- */
- return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
+
+ return util_fits_cpu(util, cpu);
}

static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
@@ -6424,11 +6308,8 @@ static inline void hrtick_update(struct rq *rq)
#ifdef CONFIG_SMP
static inline bool cpu_overutilized(int cpu)
{
- unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
- unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
-
/* Return true only if the utilization doesn't fit CPU's capacity */
- return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+ return !util_fits_cpu(cpu_util_cfs(cpu), cpu);
}

static inline void update_overutilized_status(struct rq *rq)
@@ -7248,8 +7129,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
- unsigned long task_util, util_min, util_max, best_cap = 0;
- int fits, best_fits = 0;
+ unsigned long task_util, best_cap = 0;
int cpu, best_cpu = -1;
struct cpumask *cpus;

@@ -7257,8 +7137,6 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

task_util = task_util_est(p);
- util_min = uclamp_eff_value(p, UCLAMP_MIN);
- util_max = uclamp_eff_value(p, UCLAMP_MAX);

for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);
@@ -7266,44 +7144,22 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
continue;

- fits = util_fits_cpu(task_util, util_min, util_max, cpu);
-
- /* This CPU fits with all requirements */
- if (fits > 0)
+ if (util_fits_cpu(task_util, cpu))
return cpu;
- /*
- * Only the min performance hint (i.e. uclamp_min) doesn't fit.
- * Look for the CPU with best capacity.
- */
- else if (fits < 0)
- cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));

- /*
- * First, select CPU which fits better (-1 being better than 0).
- * Then, select the one with best capacity at same level.
- */
- if ((fits < best_fits) ||
- ((fits == best_fits) && (cpu_cap > best_cap))) {
+ if (cpu_cap > best_cap) {
best_cap = cpu_cap;
best_cpu = cpu;
- best_fits = fits;
}
}

return best_cpu;
}

-static inline bool asym_fits_cpu(unsigned long util,
- unsigned long util_min,
- unsigned long util_max,
- int cpu)
+static inline bool asym_fits_cpu(unsigned long util, int cpu)
{
if (sched_asym_cpucap_active())
- /*
- * Return true only if the cpu fully fits the task requirements
- * which include the utilization and the performance hints.
- */
- return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ return util_fits_cpu(util, cpu);

return true;
}
@@ -7315,7 +7171,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
bool has_idle_core = false;
struct sched_domain *sd;
- unsigned long task_util, util_min, util_max;
+ unsigned long task_util;
int i, recent_used_cpu;

/*
@@ -7325,8 +7181,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (sched_asym_cpucap_active()) {
sync_entity_load_avg(&p->se);
task_util = task_util_est(p);
- util_min = uclamp_eff_value(p, UCLAMP_MIN);
- util_max = uclamp_eff_value(p, UCLAMP_MAX);
}

/*
@@ -7335,7 +7189,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
lockdep_assert_irqs_disabled();

if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
- asym_fits_cpu(task_util, util_min, util_max, target))
+ asym_fits_cpu(task_util, target))
return target;

/*
@@ -7343,7 +7197,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
- asym_fits_cpu(task_util, util_min, util_max, prev))
+ asym_fits_cpu(task_util, prev))
return prev;

/*
@@ -7358,7 +7212,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
in_task() &&
prev == smp_processor_id() &&
this_rq()->nr_running <= 1 &&
- asym_fits_cpu(task_util, util_min, util_max, prev)) {
+ asym_fits_cpu(task_util, prev)) {
return prev;
}

@@ -7370,7 +7224,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
cpus_share_cache(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
- asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
+ asym_fits_cpu(task_util, recent_used_cpu)) {
return recent_used_cpu;
}

@@ -7721,13 +7575,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
- unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
- unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
struct root_domain *rd = this_rq()->rd;
int cpu, best_energy_cpu, target = -1;
- int prev_fits = -1, best_fits = -1;
- unsigned long best_thermal_cap = 0;
- unsigned long prev_thermal_cap = 0;
struct sched_domain *sd;
struct perf_domain *pd;
struct energy_env eenv;
@@ -7756,14 +7605,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv_task_busy_time(&eenv, p, prev_cpu);

for (; pd; pd = pd->next) {
- unsigned long util_min = p_util_min, util_max = p_util_max;
unsigned long cpu_cap, cpu_thermal_cap, util;
unsigned long cur_delta, max_spare_cap = 0;
- unsigned long rq_util_min, rq_util_max;
unsigned long prev_spare_cap = 0;
int max_spare_cap_cpu = -1;
unsigned long base_energy;
- int fits, max_fits = -1;

cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);

@@ -7779,8 +7625,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv.pd_cap = 0;

for_each_cpu(cpu, cpus) {
- struct rq *rq = cpu_rq(cpu);
-
eenv.pd_cap += cpu_thermal_cap;

if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
@@ -7791,31 +7635,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

util = cpu_util(cpu, p, cpu, 0);
cpu_cap = capacity_of(cpu);
-
- /*
- * Skip CPUs that cannot satisfy the capacity request.
- * IOW, placing the task there would make the CPU
- * overutilized. Take uclamp into account to see how
- * much capacity we can get out of the CPU; this is
- * aligned with sched_cpu_util().
- */
- if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
- /*
- * Open code uclamp_rq_util_with() except for
- * the clamp() part. Ie: apply max aggregation
- * only. util_fits_cpu() logic requires to
- * operate on non clamped util but must use the
- * max-aggregated uclamp_{min, max}.
- */
- rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
- rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
-
- util_min = max(rq_util_min, p_util_min);
- util_max = max(rq_util_max, p_util_max);
- }
-
- fits = util_fits_cpu(util, util_min, util_max, cpu);
- if (!fits)
+ if (!util_fits_cpu(util, cpu))
continue;

lsub_positive(&cpu_cap, util);
@@ -7823,9 +7643,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (cpu == prev_cpu) {
/* Always use prev_cpu as a candidate. */
prev_spare_cap = cpu_cap;
- prev_fits = fits;
- } else if ((fits > max_fits) ||
- ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
+ } else if (cpu_cap > max_spare_cap) {
/*
* Find the CPU with the maximum spare capacity
* among the remaining CPUs in the performance
@@ -7833,7 +7651,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
*/
max_spare_cap = cpu_cap;
max_spare_cap_cpu = cpu;
- max_fits = fits;
}
}

@@ -7852,50 +7669,26 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (prev_delta < base_energy)
goto unlock;
prev_delta -= base_energy;
- prev_thermal_cap = cpu_thermal_cap;
best_delta = min(best_delta, prev_delta);
}

/* Evaluate the energy impact of using max_spare_cap_cpu. */
if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
- /* Current best energy cpu fits better */
- if (max_fits < best_fits)
- continue;
-
- /*
- * Both don't fit performance hint (i.e. uclamp_min)
- * but best energy cpu has better capacity.
- */
- if ((max_fits < 0) &&
- (cpu_thermal_cap <= best_thermal_cap))
- continue;
-
cur_delta = compute_energy(&eenv, pd, cpus, p,
max_spare_cap_cpu);
/* CPU utilization has changed */
if (cur_delta < base_energy)
goto unlock;
cur_delta -= base_energy;
-
- /*
- * Both fit for the task but best energy cpu has lower
- * energy impact.
- */
- if ((max_fits > 0) && (best_fits > 0) &&
- (cur_delta >= best_delta))
- continue;
-
- best_delta = cur_delta;
- best_energy_cpu = max_spare_cap_cpu;
- best_fits = max_fits;
- best_thermal_cap = cpu_thermal_cap;
+ if (cur_delta < best_delta) {
+ best_delta = cur_delta;
+ best_energy_cpu = max_spare_cap_cpu;
+ }
}
}
rcu_read_unlock();

- if ((best_fits > prev_fits) ||
- ((best_fits > 0) && (best_delta < prev_delta)) ||
- ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
+ if (best_delta < prev_delta)
target = best_energy_cpu;

return target;
--
2.34.1

2023-10-04 09:07:31

by Hongyan Xia

[permalink] [raw]
Subject: [RFC PATCH 6/6] sched/uclamp: Simplify uclamp_eff_value()

From: Hongyan Xia <[email protected]>

The commit

sched: Remove all uclamp bucket logic

removes uclamp_{inc/dec}() functions, so now p->uclamp contains the
correct values all the time after a update_uclamp_active() call, and
there's no need to toggle the boolean `active` after an update. As a
result, this function is fairly simple now and can live as a static
inline function.

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/core.c | 13 ++++---------
kernel/sched/sched.h | 14 ++++++++++++--
2 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c5bf01e7df28..737921a9dd91 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1483,21 +1483,15 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
return uc_req;
}

-unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
-{
- if (!uclamp_is_used() || !p->uclamp[clamp_id].active)
- return uclamp_none(clamp_id);
-
- return p->uclamp[clamp_id].value;
-}
-
static inline void
uclamp_update_active_nolock(struct task_struct *p)
{
enum uclamp_id clamp_id;

- for_each_clamp_id(clamp_id)
+ for_each_clamp_id(clamp_id) {
p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
+ p->uclamp[clamp_id].active = 1;
+ }
}

static inline void
@@ -1759,6 +1753,7 @@ static void uclamp_fork(struct task_struct *p)
for_each_clamp_id(clamp_id) {
uclamp_se_set(&p->uclamp_req[clamp_id],
uclamp_none(clamp_id), false);
+ p->uclamp[clamp_id].active = 0;
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 30dee8eb2ed9..896626afbedc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2986,8 +2986,6 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
#endif

#ifdef CONFIG_UCLAMP_TASK
-unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
-
/*
* When uclamp is compiled in, the aggregation at rq level is 'turned off'
* by default in the fast path and only gets turned on once userspace performs
@@ -3001,6 +2999,18 @@ static inline bool uclamp_is_used(void)
return static_branch_likely(&sched_uclamp_used);
}

+static inline unsigned long uclamp_eff_value(struct task_struct *p,
+ enum uclamp_id clamp_id)
+{
+ if (uclamp_is_used() && p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].value;
+
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+
+ return SCHED_CAPACITY_SCALE;
+}
+
static inline unsigned long root_cfs_util(struct rq *rq)
{
return READ_ONCE(rq->root_cfs_util_uclamp);
--
2.34.1

2023-10-30 18:47:47

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] sched: uclamp sum aggregation

On 04/10/2023 11:04, Hongyan Xia wrote:
> Current implementation of uclamp leaves many things to be desired.
> There are several problems:
>
> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or
> with the default value, which is 1024) can ruin the previous
> settings the moment it is enqueued, as shown in the uclamp frequency
> spike problem in Section 5.2 of
> Documentation/scheduler/sched-util-clamp.rst. Constantly running at
> 1024 utilization implies that the CPU is driven at its max capacity.
> However, with UCLAMP_MAX, this assumption is no longer true. To
> mitigate this, one idea is to filter out uclamp values for
> short-running tasks. However, the effectiveness of this idea remains
> to be seen.

The difference is that we no longer have to lift the uclamp_max cap
(< 1024) of a runnable p1 when a short-running p2 with uclamp_max = 1024
becomes runnable? Since now, when this happens, we would just add p2's
util_avg_uclamp to the cfs_rq's util_avg_uclamp, which is tiny compared
to its uclamp_max = 1024.

> 2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU
> running at its peak, as both show utilization of 1024. However, in
> certain cases it's important to tell the difference, as we still want
> to take the opportunity to enqueue tasks in the former case.

Is this related to the `best_fits/max_fits` logic in
find_energy_efficient_cpu() (feec()) and select_idle_capacity() (sic())
(commit e5ed0550c04c "sched/fair: unlink misfit task from cpu
overutilized")?
With your approach, having the cfs_rq's `util_avg_uclamp` next to its
`util_avg` would allow us to see that difference by comparing the two signals?

> It's also debatable whether uclamp should be a frequency hint. An
> example is a system with a mid CPU of capacity 500, and a little CPU
> with capacity 100. If we have 10 tasks with UCLAMP_MIN of 101, then
> util_fits_cpu() will schedule them on the mid CPU because feec()
> thinks they don't fit on the little, even though we should use the
> little core at some point instead of making the mid even more crowded.
> Of course, this in reality doesn't happen because of other mechanisms
> like load balancing, but it's also not good when other mechanisms can

CPU overutilization detection enables CFS load-balance & `sis()->sic()`
and disables feec().

> cancel the effect of uclamp in feec(). A CPU running at capacity 500 for
> 1ms or for 1000ms gives completely different performance levels, so
> trying to fit only the frequency does not give us any performance
> guarantees. If we then talk about not only running at some frequency but
> also for some amount of time, then what we really mean is a capacity
> hint, not a frequency hint.

Aren't CPU frequency and capacity going in the same direction?

IPC * CPU frequency = Instructions per Second == Performance (Capacity).

And in the scheduler, utilization is the portion of the Capacity
currently used.

What sum aggregation does differently is that you can sum up
individually clamped utilization contributions and compare them against
capacity, rather than being forced to use the maximum clamp value of one
(runnable) task to guide frequency. This avoids those discontinuity
moments when a task with a high uclamp value switches between the
runnable and sleeping state.
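
To make that concrete, here is a tiny user-space sketch of the two
schemes (illustrative numbers and helper names only, not the kernel
code):

#include <stdio.h>

static unsigned long clamp_val(unsigned long v, unsigned long lo,
                               unsigned long hi)
{
        return v < lo ? lo : v > hi ? hi : v;
}

int main(void)
{
        /* p1: always running, uclamp_max = 200; p2: util 50, uclamp_max = 1024 */
        unsigned long p1_util = 1024, p1_max = 200;
        unsigned long p2_util = 50, p2_max = 1024;

        /* Max aggregation: the raw rq util (capped at 1024) is clamped by
         * the max-aggregated uclamp_max of all runnable tasks, which p2
         * lifts back to 1024 -> frequency spike. */
        unsigned long rq_util = clamp_val(p1_util + p2_util, 0, 1024);
        unsigned long rq_max = p1_max > p2_max ? p1_max : p2_max;
        printf("max aggregation: %lu\n", clamp_val(rq_util, 0, rq_max)); /* 1024 */

        /* Sum aggregation: each contribution is clamped first, then
         * summed, so p2 only adds its own (small) utilization. */
        unsigned long sum = clamp_val(p1_util, 0, p1_max) +
                            clamp_val(p2_util, 0, p2_max);
        printf("sum aggregation: %lu\n", sum); /* 250 */

        return 0;
}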

> It's even worse when the tasks scheduled on the mid CPU also have
> UCLAMP_MAX values. In the previous example, instead of running at 500, a
> single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much
> lower frequency than 500, so it is then a really bad decision to honor
> the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of
> running it on the little CPU which is less crowded, and has pretty much
> the same capacity (100 vs 110) anyway.
>
> This series attempts to solve these problems by tracking a
> util_avg_uclamp signal in tasks and cfs_rq. At task level,
> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
> must always be the sum of all util_avg_uclamp of all the entities on
> this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum
> aggregation of all the clamped values, which indicates the frequency
> this rq should run at and what the utilization is.
>
> This idea solves Problem 1 by capping the utilization of an
> always-running task throttled by UCLAMP_MAX. Although the task (denoted
> by Task A) has no idle time, the util_avg_uclamp signal gives its
> UCLAMP_MAX value instead of 1024, so even if another task (Task B) with
> a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's
> UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization,
> which means we no longer have the frequency spike problem caused by B.
> This should mean that we might completely avoid the need for uclamp
> filtering.

That would be very nice since I remember that this filtering approach
had to figure out the actual runtime of the task, and the implementation
couldn't live just in the sched class code but had to be done in core
code as well.

>
> It also solves Problem 2 by tracking the capped utilization of a rq.
> Using util_avg_uclamp, we are able to differentiate between a CPU
> throttled by UCLAMP_MAX and one that is actually running at its peak
> capacity. An always-running rq with a task having UCLAMP_MAX at 300 will
> report a cfs_rq util_avg_uclamp at 300, not 1024, which means task
> placement sees spare capacity on this CPU and will allocate some tasks
> to it, instead of treating it the same way as a CPU actually running at
> the peak. This brings us closer to the benefits of Android's sum
> aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at
> https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510),

That's true, although the latest Pixel phone does not seem to rely on
CONFIG_USE_GROUP_THROTTLE anymore but on customized PELT tracking for
so-called Vendor CFS Util Groups and its sum aggregation on CPU (root
cfs_rq) level (CONFIG_USE_VENDOR_GROUP_UTIL).

CONFIG_USE_GROUP_THROTTLE was using the fact that Android only uses
first-level taskgroups to map its different task types (e.g. foreground,
background etc.) into:

for_each_leaf_cfs_rq_safe()
calc subgroup_util_sum and throttled_subgroup_util_sum and skip root

cpu_util = root_util - subgroup_util_sum + throttled_subgroup_util_sum

So it sums up the (throttled (capped)) contributions of the first-level
taskgroups plus the root taskgroup util.

With CONFIG_USE_VENDOR_GROUP_UTIL each task has a vendor pointer into a
CFS util array indexed by an Android task type ID to which the task
contributes its util. CPU util is then sum-aggregated over this array.
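
Roughly this shape, if I understand the description correctly (all names
below are made up just to illustrate the aggregation, they are not the
vendor code):

enum task_type { TYPE_FG, TYPE_BG, TYPE_SYS, NR_TYPES };

struct vendor_cpu_util {
        unsigned long util[NR_TYPES];   /* per task-type (clamped) util */
};

/* Each task contributes its clamped util to the slot of its type ... */
static void contribute(struct vendor_cpu_util *v, enum task_type type,
                       unsigned long util_clamped)
{
        v->util[type] += util_clamped;
}

/* ... and CPU util is the sum over the array. */
static unsigned long vendor_cpu_util(struct vendor_cpu_util *v)
{
        unsigned long sum = 0;
        int type;

        for (type = 0; type < NR_TYPES; type++)
                sum += v->util[type];

        return sum;
}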

We should recall that this is all done because the current uclamp_max
max aggregation isn't working for Android's use-cases.

So overcoming this issue in mainline is key here.

> which shows energy benefits because we are able to schedule tasks on
> smaller cores which are UCLAMP_MAX capped, instead of finding a fresh
> big CPU. However, a major difference is that this series has an
> amortized cost of O(1), instead of O(n) in cpu_util_next() in Android
> kernels.
>
> It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum
> aggregation, the scheduler sees the sum aggregated util_avg_uclamp and
> will avoid the problem of enqueueing too many tasks just to fit the
> frequency, and will place tasks on little CPUs in the previous example.
>
> Patch 2/6 tries to simulate PELT decay in the new util_avg_uclamp
> signal, as this gives us gradual decay of utilization which avoids
> premature CPU frequency drops. This is a major caveat of this series. We
> should explore if there's a better way to integrate uclamp directly into
> the util_avg signal, instead of introducing a new util_avg_uclamp and
> then simulate PELT on it.
>
> This new design significantly simplifies uclamp logic.
> Patch 4/6 removes the tri-state return value of util_fits_cpu().
> Patch 5/6 then completely removes uclamp buckets. Because the
> util_avg_uclamp is already a clamped value propagated from bottom to
> top, there's no need to clamp anything at the top level and we can just
> use this value for frequency selection and spare capacity search.
> Basically, the idea is that all uclamp logic happens inside
> util_avg_uclamp, and other code using this new signal can just use it
> like util_avg, without even knowing that uclamp exists. This drastically
> reduces the complexity of uclamp and makes the code considering all the
> uclamp corner cases unnecessary. At the end of the series, we remove 749
> lines of code while adding a bit more than 300 (although once we update
> Documentation.rst, it will be a bit more than that).
>
> Note that this series is still considered RFC status. TODO items are:
>
> 1. Implement sum aggregation for RT tasks.
> 2. Improve handling of cpu_util(boost).

What about the integration with util_est here?

In cpu_util(), &rq->cfs->avg.util_avg is replaced by
rq->root_cfs_util_uclamp

and in

task_util_est() (should be actually named task_util() to be in sync with
cpu_util(), i.e. returning max(util, util_est)), task_util(p) returns
p->se.avg.util_avg_uclamp.
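
A heavily condensed sketch of how I read that (not a verbatim copy of
the patches):

static unsigned long task_util(struct task_struct *p)
{
        /* clamped signal instead of p->se.avg.util_avg */
        return READ_ONCE(p->se.avg.util_avg_uclamp);
}

static unsigned long task_util_est(struct task_struct *p)
{
        /* unchanged shape; _task_util_est(p) itself is not clamped */
        return max(task_util(p), _task_util_est(p));
}

static unsigned long cpu_util_sketch(int cpu)
{
        /* rq->root_cfs_util_uclamp instead of rq->cfs.avg.util_avg */
        return root_cfs_util(cpu_rq(cpu));
}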

Are there use cases for the original avg.util_avg still in place?

> TESTING:
>
> Initial test and benchmark results suggest that sum aggregation not only
> is significantly simpler, but generally performs much better in
> achieving what uclamp is supposed to do. Two notebooks are shared at
>
> https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/upstream.ipynb
> https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/sum.ipynb
>
> The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are
> little cores with capacity of 383. CPU4-5 are big cores. The rt-app
> profiles used for these experiments are included in the notebooks.
>
> Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.
>
> The scheduling decisions are plotted in Out[11] and Out[14]
> respectively. Max aggregation fails to recognize the opportunity to run
> all of them on the efficient little Power Domain (PD), whereas sum
> aggregation runs all 4 on the little PD, leaving the big PD completely
> powered off, satisfying uclamp requests while saving power.

Does `upstream` already contain the v6.7 fixes `Fix uclamp code corner
cases` ?

https://lkml.kernel.org/r/[email protected]

> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
> slightly above the capacity of the little CPU.
>
> Results are in Out[17] and Out[82]. The purpose is to use UCLAMP_MIN to
> place tasks on the big core. Max aggregation is pretty much in an
> overutilized state the entire time. Sum aggregation sees that the big
> CPU can hold two such tasks (satisfying both UCLAMP_MIN and UCLAMP_MAX
> requests for all 4 tasks) on each big CPU and quickly settles down, and
> is still under EAS without overutilization.

thread0-[0-3] uclamp_max = 309. So p->se.avg.util_avg_uclamp is
constrained by this value for all 4 tasks, letting 2 tasks fit on each
of the big CPUs. You have to zoom into Out[82] to actually see this.
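
(Assuming the Juno r2 big CPUs have the full capacity of 1024, two such
tasks sum to 2 * 309 = 618, which still passes fits_capacity()'s ~20%
headroom check (618 * 1.25 ~= 773 < 1024), while a third task would not
fit (3 * 309 = 927, 927 * 1.25 ~= 1159 > 1024).)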

And I guess for max aggregation cpu_overutilized() can't hold the clamp
continuously because of all the other short running uclamp_max = 1024
(default) tasks on the rq's.

> Scenario 3: Task A is a task with a small utilization pinned to CPU4.
> Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
> at 200. After a while, task A is then pinned to CPU5, joining B.
>
> Results are in Out[23] and Out[95]. The blue util_avg signal is the
> original root CFS util_avg. The yellow line is the root CFS utilization
> after considering uclamp. Max aggregation sees a frequency
> spike at 579.1s. When zoomed in, one can see square-wave-like
> utilization values because of A periodically going to sleep. When A
> wakes up, its default UCLAMP_MAX of 1024 will uncap B and reach the
> highest CPU frequency. When A sleeps, B's UCLAMP_MAX will be in effect
> and will reduce rq utilization to 200. This happens repeatedly, hence
> the square wave. In contrast, sum aggregation sees a normal increase in
> utilization when A joins B, at around 2507s, without any square-wave
> behavior.

Makes sense. But there shouldn't be a root_cfs_util_uclamp in main?
Which signal does the yellow line represent in Out[23]?

> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the
> little PD (CPU0-3). 4 same tasks pinned to the big PD (CPU4-5).
> After a while, remove the CPU pinning of the 4 tasks on the big PD.
>
> Results are in Out[29] and Out[101]. Max aggregation fails to identify
> that all 8 tasks can be packed on the little PD, whereas sum
> aggregation immediately moves the 4 tasks pinned to big PD to the
> little PD the moment pinning is removed. Sum aggregation understands
> that even when the rq seems to have utilization of 1024, this is because
> of UCLAMP_MAX and there's still opportunity to pack 2 such tasks on each
> little CPU, leaving the big PD untouched, saving power.

Looks good to me.

>
> BENCHMARKS:
>
> One TODO item is to handle cpu_util(boost) better. The current handling
> of boost is far from ideal and is a known caveat. All below benchmarks
> numbers are done without any boosting in cpu_util(), in both max and sum
> aggregation.

Maybe too early for black-box tests at this stage, but good to see that
nothing seems to go sideways with uclamp sum aggregation.


>
> Geekbench 6 (on Rock-5B board)
> +-----+-------------+------------+
> | | Single-core | Multi-core |
> +-----+-------------+------------+
> | Max | 800.6 | 2977.0 |
> | Sum | 801.2 | 2981.4 |
> +-----+-------------+------------+
>
> No regression is seen after switching to sum aggregation.
>
> Jankbench (backporting sum aggregation to Android 5.18 mainline kernel)
>
> +------+-----------------+-------+-----------+
> | | variable | value | perc_diff |
> +------+-----------------+-------+-----------+
> | main | jank_percentage | 1.1 | 0.00% |
> | sum | jank_percentage | 0.5 | -53.92% |
> +------+-----------------+-------+-----------+
>
> +------------+--------+------+-------+-----------+
> | | metric | tag | value | perc_diff |
> +------------+--------+------+-------+-----------+
> | CPU | gmean | main | 166.1 | 0.00% |
> | CPU-Big | gmean | main | 55.1 | 0.00% |
> | CPU-Little | gmean | main | 91.7 | 0.00% |
> | CPU-Mid | gmean | main | 19.2 | 0.00% |
> | CPU | gmean | sum | 161.3 | -2.85% |
> | CPU-Big | gmean | sum | 52.9 | -3.97% |
> | CPU-Little | gmean | sum | 86.7 | -5.39% |
> | CPU-Mid | gmean | sum | 21.6 | 12.63% |
> +------------+--------+------+-------+-----------+
>
> The TL;DR of Jankbench is that sum aggregation reduces jank by 54% while
> with 2.85% less CPU power. Note that this benchmark on Pixel 6 by
> default has some sort of uclamp feedback applied to it. When jank is
> detected, certain display and benchmark threads are applied with a
> UCLAMP_MIN value of 512 (any help in identifying where these UCLAMP_MIN
> values come from in the mainline Android kernel is appreciated). In
> contrast, we constantly apply a UCLAMP_MIN value of 60 to these threads
> without any feedback loop. If a similar feedback loop is involved to
> only apply UCLAMP_MIN when needed, power consumption can be expected to
> be even lower.

[...]

2023-11-01 22:35:01

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched/fair: Use CFS util_avg_uclamp for utilization and frequency

On 04/10/2023 11:04, Hongyan Xia wrote:
> From: Hongyan Xia <[email protected]>
>
> Switch to the new util_avg_uclamp for task and runqueue utilization.
> Since util_est() calls task_util(), this means util_est is now also a
> clamped value.

s/util_est()/task_util_est()

but task_util_est() is max(task_util(p), _task_util_est(p))

So I don't immediately spot why util_est is a clamped value as well now.

We have a naming mismatch between the CPU- and task-related functions on
this level: cpu_util() vs. task_util_est().

> Now that we have the sum aggregated CFS util value, we do not need to
> consult uclamp buckets to know how the frequency should be clamped. We
> simply look at the aggregated top level root_cfs_util_uclamp to know
> what frequency to choose. Because we simulate PELT decay in
> root_cfs_util_uclamp anyway, there's no need in cpufreq_schedutil.c to
> avoid premature frequency drops.
>
> Consequently, there is no need for uclamp_rq_util_with(). This function
> takes the un-clamped util value and sends it through various clamping
> filters to get the final value. However, util_avg_uclamp is propagated
> with clamping in mind already, so it does not need to be clamped again.
>
> TODO: There are two major caveats in this patch.
> 1. At the moment sum aggregation does not consider RT tasks. The avg_rt
> signal considers all RT tasks on this rq as a single entity, which
> means the utilization of individual RT tasks is not tracked
> separately. If we want to use sum aggregation, we might have to track
> utilization of RT tasks individually.

Not sure if the RT class will accept PELT task tracking (plus there is
CONFIG_RT_GROUP_SCHED too).

> 2. Busy time accounting in compute_energy() now takes the uclamp'ed
> value. Ideally, it should reflect reality and use the un-clamp'ed
> values. However, that would require maintaining both the normal and
> uclamp'ed values for util_est. This needs to be revisited if it
> causes real problems in practice.

You could use your new rq->root_cfs_util_uclamp for eenv_pd_max_util
(FREQUENCY_UTIL) and use rq->cfs.avg.util_avg in eenv_pd_busy_time()
(ENERGY_UTIL).
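
Something along these lines (kernel-style sketch only; the wrapper and
its signature are illustrative, not the actual eenv code):

/*
 * Sketch of the suggested split, per rq of the perf domain:
 * FREQUENCY_UTIL picks the frequency from the clamped signal,
 * ENERGY_UTIL accounts busy time with the unclamped signal.
 */
static void eenv_pd_util_split_sketch(struct rq *rq,
                                      unsigned long *max_util,
                                      unsigned long *busy_time)
{
        *max_util = max(*max_util, root_cfs_util(rq));
        *busy_time += READ_ONCE(rq->cfs.avg.util_avg);
}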

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index efe3848978a0..32511ee63f01 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7402,10 +7402,12 @@ int sched_core_idle_cpu(int cpu)
> * The DL bandwidth number otoh is not a measured metric but a value computed
> * based on the task model parameters and gives the minimal utilization
> * required to meet deadlines.
> + *
> + * The util_cfs parameter has already taken uclamp into account (unless uclamp
> + * support is not compiled in).
> */
> unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> - enum cpu_util_type type,
> - struct task_struct *p)
> + enum cpu_util_type type)

There are changes proposed in the area of uclamping right now in:

https://lkml.kernel.org/r/[email protected]

[...]

> /**
> @@ -282,7 +281,11 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> * into the same scale so we can compare.
> */
> boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
> - boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
> + /*
> + * TODO: Investigate what should be done here. In sum aggregation there
> + * is no such thing as uclamp_max on a rq, so how do we cap the boost
> + * value, or do we want to cap the boost frequency here at all?
> + */

https://lkml.kernel.org/r/[email protected]

is proposing to cap iowait boost with `max` (set in effective_cpu_util()),
and `max` depends on uclamp_rq_get(rq, UCLAMP_MAX) too.

So you could cap iowait boost in case uclamp_rq_is_capped(), i.e. when:

if (rq->cfs.avg.util_avg > rq->root_cfs_util_uclamp + margin)
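
A minimal sketch of one way to do that in sugov_iowait_apply()
(illustrative only; `margin` is a made-up headroom value and capping the
boost to root_cfs_util() is just one possible choice):

static bool rq_is_uclamp_capped_sketch(struct rq *rq, unsigned long margin)
{
        return READ_ONCE(rq->cfs.avg.util_avg) >
               READ_ONCE(rq->root_cfs_util_uclamp) + margin;
}

        /* in sugov_iowait_apply(), after computing `boost`: */
        if (rq_is_uclamp_capped_sketch(rq, margin))
                boost = min(boost, root_cfs_util(rq));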

[...]

> @@ -7468,11 +7459,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> static unsigned long
> cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> {
> - struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> - unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
> + struct rq *rq = cpu_rq(cpu);
> + struct cfs_rq *cfs_rq = &rq->cfs;
> + unsigned long util = root_cfs_util(rq);
> + bool capped = uclamp_rq_is_capped(rq);
> unsigned long runnable;
>
> - if (boost) {
> + if (boost && !capped) {
> runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
> util = max(util, runnable);
> }

IMHO, this makes sense. Only allow runnable boosting in case the rq is
not uclamp_max capped.

[...]

2023-11-02 17:38:22

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Rewrite util_fits_cpu()

On 04/10/2023 11:04, Hongyan Xia wrote:
> From: Hongyan Xia <[email protected]>
>
> Currently, there's no way to distinguish the difference between 1) a CPU
> that is actually maxed out at its highest frequency, or 2) one that is
> throttled because of UCLAMP_MAX, since both present util_avg values of
> 1024. This is problematic because when we try to pick a CPU for a task
> to run, we would like to give 2) a chance, or at least prefer 2) to 1).
>
> Current upstream gives neither a chance because the spare capacity is 0
> for either case. There are patches to fix this problem by considering 0
> capacities [1], but this might still be inefficient because this ends
> up treating 1) and 2) equally, and will always pick the same one because
> we don't change how we iterate through all CPUs. If we end up putting
> many tasks on 1), then this creates a seriously unbalanced load for the
> two CPUs.
>
> Fix by using util_avg_uclamp for util_fits_cpu(). This way, case 1) will
> still keep its utilization at 1024 whereas 2) shows spare capacities if
> the sum of util_avg_uclamp values is still under the CPU capacity.
> Note that this is roughly what the sum aggregation does in the Android
> kernel [2] (although we clamp UCLAMP_MIN as well in this patch, which
> may need some discussions), which shows superior energy savings because
> there's more chance that a task can get scheduled on 2) instead of
> finding a big CPU to run on.
>
> Under sum aggregation, checking whether a task fits a CPU becomes much
> simpler. We simply do fits_capacity() and there does not need to be code
> checking all corner cases for uclamp. This means util_fits_cpu() returns
> to true and false instead of tri-state, simplifying a significant amount
> of code.

You could remove util_fits_cpu() and task_fits_cpu() and call
fits_capacity() directly. We should try to keep the zoo of util-related
functions as small as possible.

[...]

2023-11-03 11:19:41

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] sched: uclamp sum aggregation

On 30/10/2023 18:46, Dietmar Eggemann wrote:
> On 04/10/2023 11:04, Hongyan Xia wrote:
>> Current implementation of uclamp leaves many things to be desired.
>> There are several problems:
>>
>> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or
>> with the default value, which is 1024) can ruin the previous
>> settings the moment it is enqueued, as shown in the uclamp frequency
>> spike problem in Section 5.2 of
>> Documentation/scheduler/sched-util-clamp.rst. Constantly running at
>> 1024 utilization implies that the CPU is driven at its max capacity.
>> However, with UCLAMP_MAX, this assumption is no longer true. To
>> mitigate this, one idea is to filter out uclamp values for
>> short-running tasks. However, the effectiveness of this idea remains
>> to be seen.
>
> The difference is that we no longer have to lift the uclamp_max cap
> (< 1024) of a runnable p1 when a short-running p2 with uclamp_max = 1024
> becomes runnable? Since now, when this happens, we would just add p2's
> util_avg_uclamp to the cfs_rq's util_avg_uclamp, which is tiny compared
> to its uclamp_max = 1024.

Yes. That's a correct interpretation. I can add this version to the
cover letter to make things clearer.

>> 2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU
>> running at its peak, as both show utilization of 1024. However, in
>> certain cases it's important to tell the difference, as we still want
>> to take the opportunity to enqueue tasks in the former case.
>
> Is this related to the `best_fits/max_fits` logic in
> find_energy_efficient_cpu() (feec()) and select_idle_capacity() (sic())
> (commit e5ed0550c04c "sched/fair: unlink misfit task from cpu
> overutilized")?
> With your approach, having the cfs_rq's `util_avg_uclamp` next to its
> `util_avg` would allow us to see that difference by comparing the two signals?

Somewhat related. The usefulness is mostly around cpu_util(). Max
aggregation will see cpu_util() of 1024 (spare capacity of 0) even when
this is caused by UCLAMP_MAX, not by actually running the CPU at max
capacity, whereas this series will have a util < 1024 if capped by
UCLAMP_MAX.

This means the patch to consider 0 spare capacities (because under max
aggregation 0 capacity doesn't mean the CPU is at max) is unnecessary:

6b00a40147653c8ea748e8f4396510f252763364
sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

>> It's also debatable whether uclamp should be a frequency hint. An
>> example is a system with a mid CPU of capacity 500, and a little CPU
>> with capacity 100. If we have 10 tasks with UCLAMP_MIN of 101, then
>> util_fits_cpu() will schedule them on the mid CPU because feec()
>> thinks they don't fit on the little, even though we should use the
>> little core at some point instead of making the mid even more crowded.
>> Of course, this in reality doesn't happen because of other mechanisms
>> like load balancing, but it's also not good when other mechanisms can
>
> CPU overutilization detection enables CFS load-balance & `sis()->sic()`
> and disables feec().
>
>> cancel the effect of uclamp in feec(). A CPU running at capacity 500 for
>> 1ms or for 1000ms gives completely different performance levels, so
>> trying to fit only the frequency does not give us any performance
>> guarantees. If we then talk about not only running at some frequency but
>> also for some amount of time, then what we really mean is a capacity
>> hint, not a frequency hint.
>
> Aren't CPU frequency and capacity going in the same direction?
>
> IPC * CPU frequency = Instructions per Second == Performance (Capacity).
>
> And in the scheduler, utilization is the portion of the Capacity
> currently used.
>
> What sum aggregation does differently is that you can sum up
> individually clamped utilization contributions and compare them against
> capacity, rather than being forced to use the maximum clamp value of one
> (runnable) task to guide frequency. This avoids those discontinuity
> moments when a task with a high uclamp value switches between the
> runnable and sleeping state.

I guess what I was describing was a new metric:

Work = Capacity * Time

Max aggregation aims to fit capacity, but if a task can run at a high
capacity only for a very short period of time, then this scheduling
decision is not particularly useful. Like in the above example, moving a
task with UCLAMP_MIN of 101 to the mid CPU isn't helpful when there are
already 10 tasks on the mid CPU, because Time is reduced.
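
Rough numbers for that example: one of 10 equal tasks time-sharing the
mid CPU gets about 500 * 1/10 = 50 units of work per unit time, whereas
the same task running alone on the little CPU gets about 100 * 1 = 100,
so honoring the UCLAMP_MIN = 101 hint by crowding the mid CPU roughly
halves the work actually done.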

>> It's even worse when the tasks scheduled on the mid CPU also have
>> UCLAMP_MAX values. In the previous example, instead of running at 500, a
>> single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much
>> lower frequency than 500, so it is then a really bad decision to honor
>> the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of
>> running it on the little CPU which is less crowded, and has pretty much
>> the same capacity (100 vs 110) anyway.
>>
>> This series attempts to solve these problems by tracking a
>> util_avg_uclamp signal in tasks and cfs_rq. At task level,
>> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
>> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
>> must always be the sum of all util_avg_uclamp of all the entities on
>> this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum
>> aggregation of all the clamped values, which indicates the frequency
>> this rq should run at and what the utilization is.
>>
>> This idea solves Problem 1 by capping the utilization of an
>> always-running task throttled by UCLAMP_MAX. Although the task (denoted
>> by Task A) has no idle time, the util_avg_uclamp signal gives its
>> UCLAMP_MAX value instead of 1024, so even if another task (Task B) with
>> a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's
>> UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization,
>> which means we no longer have the frequency spike problem caused by B.
>> This should mean that we might completely avoid the need for uclamp
>> filtering.
>
> That would be very nice since I remember that this filtering approach
> had to figure out the actual runtime of the task, and the implementation
> couldn't live just in the sched class code but had to be done in core
> code as well.

Yes. So far I don't see any need for filtering and runtime accounting in
uclamp sum aggregation.

>>
>> It also solves Problem 2 by tracking the capped utilization of a rq.
>> Using util_avg_uclamp, we are able to differentiate between a CPU
>> throttled by UCLAMP_MAX and one that is actually running at its peak
>> capacity. An always-running rq with a task having UCLAMP_MAX at 300 will
>> report a cfs_rq util_avg_uclamp at 300, not 1024, which means task
>> placement sees spare capacity on this CPU and will allocate some tasks
>> to it, instead of treating it the same way as a CPU actually running at
>> the peak. This brings us closer to the benefits of Android's sum
>> aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at
>> https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510),
>
> That's true, although the latest Pixel phone does not seem to rely on
> CONFIG_USE_GROUP_THROTTLE anymore but on customized PELT tracking for
> so-called Vendor CFS Util Groups and its sum aggregation on CPU (root
> cfs_rq) level (CONFIG_USE_VENDOR_GROUP_UTIL).
>
> CONFIG_USE_GROUP_THROTTLE was using the fact that Android only uses
> first-level taskgroups to map its different task types (e.g. foreground,
> background etc.) into:
>
> for_each_leaf_cfs_rq_safe()
> calc subgroup_util_sum and throttled_subgroup_util_sum and skip root
>
> cpu_util = root_util - subgroup_util_sum + throttled_subgroup_util_sum
>
> So it sums up the (throttled (capped)) contributions of the first-level
> taskgroups plus the root taskgroup util.
>
> With CONFIG_USE_VENDOR_GROUP_UTIL each task has a vendor pointer into a
> CFS util array indexed by an Android task type ID to which the task
> contributes its util. CPU util is then sum-aggregated over this array.
>
> We should recall that this is all done because the current uclamp_max
> max aggregation isn't working for Android's use-cases.
>
> So overcoming this issue in mainline is key here.

Thanks for pointing me to the latest Pixel code.

The basic ideas of that and this series look very similar. Both sum up
CFS utilization instead of directly tracking the root CFS utilization.
This series so far has only implemented uclamp on tasks, but the same
code can also be implemented on group sched_entities. Once I implement
that, then it essentially does the same thing as GROUP_THROTTLE. I think
if the current Pixel code works, then it's likely this series works too.

The big difference is that this series relies on
CONFIG_FAIR_GROUP_SCHED. There seem to be complaints about it and I may
need to know why it's not desirable for Android.

>> which shows energy benefits because we are able to schedule tasks on
>> smaller cores which are UCLAMP_MAX capped, instead of finding a fresh
>> big CPU. However, a major difference is that this series has an
>> amortized cost of O(1), instead of O(n) in cpu_util_next() in Android
>> kernels.
>>
>> It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum
>> aggregation, the scheduler sees the sum aggregated util_avg_uclamp and
>> will avoid the problem of enqueueing too many tasks just to fit the
>> frequency, and will place tasks on little CPUs in the previous example.
>>
>> Patch 2/6 tries to simulate PELT decay in the new util_avg_uclamp
>> signal, as this gives us gradual decay of utilization which avoids
>> premature CPU frequency drops. This is a major caveat of this series. We
>> should explore if there's a better way to integrate uclamp directly into
>> the util_avg signal, instead of introducing a new util_avg_uclamp and
>> then simulate PELT on it.
>>
>> This new design significantly simplifies uclamp logic.
>> Patch 4/6 removes the tri-state return value of util_fits_cpu().
>> Patch 5/6 then completely removes uclamp buckets. Because the
>> util_avg_uclamp is already a clamped value propagated from bottom to
>> top, there's no need to clamp anything at the top level and we can just
>> use this value for frequency selection and spare capacity search.
>> Basically, the idea is that all uclamp logic happens inside
>> util_avg_uclamp, and other code using this new signal can just use it
>> like util_avg, without even knowing that uclamp exists. This drastically
>> reduces the complexity of uclamp and makes the code considering all the
>> uclamp corner cases unnecessary. At the end of the series, we remove 749
>> lines of code while adding a bit more than 300 (although once we update
>> Documentation.rst, it will be a bit more than that).
>>
>> Note that this series is still considered RFC status. TODO items are:
>>
>> 1. Implement sum aggregation for RT tasks.
>> 2. Improve handling of cpu_util(boost).
>
> What about the integration with util_est here?
>
> In cpu_util(), &rq->cfs->avg.util_avg is replaced by
> rq->root_cfs_util_uclamp
>
> and in
>
> task_util_est() (should be actually named task_util() to be in sync with
> cpu_util(), i.e. returning max(util, util_est)), task_util(p) returns
> p->se.avg.util_avg_uclamp.
>
> Are there use cases for the original avg.util_avg still in place?

We still need the original avg.util_avg for tasks because
util_avg_uclamp is clamped from it, but other than that, at the moment
there's no reason to use the normal util_avg.

>> TESTING:
>>
>> Initial test and benchmark results suggest that sum aggregation not only
>> is significantly simpler, but generally performs much better in
>> achieving what uclamp is supposed to do. Two notebooks are shared at
>>
>> https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/upstream.ipynb
>> https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/sum.ipynb
>>
>> The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are
>> little cores with capacity of 383. CPU4-5 are big cores. The rt-app
>> profiles used for these experiments are included in the notebooks.
>>
>> Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.
>>
>> The scheduling decisions are plotted in Out[11] and Out[14]
>> respectively. Max aggregation fails to recognize the opportunity to run
>> all of them on the efficient little Power Domain (PD), whereas sum
>> aggregation runs all 4 on the little PD, leaving the big PD completely
>> powered off, satisfying uclamp requests while saving power.
>
> Does `upstream` already contain the v6.7 fixes `Fix uclamp code corner
> cases` ?
>
> https://lkml.kernel.org/r/[email protected]

Unfortunately, no. These experiments were done before that patch. I
quickly did a test and that patch did fix this issue. Now the four
little tasks are scheduled on the small PD.

But I remember that you have another test indicating that the patch may
expose some new issues?

>> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
>> slightly above the capacity of the little CPU.
>>
>> Results are in Out[17] and Out[82]. The purpose is to use UCLAMP_MIN to
>> place tasks on the big core. Max aggregation is pretty much in an
>> overutilized state the entire time. Sum aggregation sees that the big
>> CPU can hold two such tasks (satisfying both UCLAMP_MIN and UCLAMP_MAX
>> requests for all 4 tasks) on each big CPU and quickly settles down, and
>> is still under EAS without overutilization.
>
> thread0-[0-3] uclamp_max = 309. So p->se.avg.util_avg_uclamp is
> constrained by this value for all 4 tasks, letting 2 tasks fit on each
> of the big CPUs. You have to zoom into Out[82] to actually see this.
>
> And I guess for max aggregation cpu_overutilized() can't hold the clamp
> continuously because of all the other short running uclamp_max = 1024
> (default) tasks on the rq's.

Actually I'm not 100% sure about the reason why max aggregation under
this experiment never stabilizes. I can investigate further.

>> Scenario 3: Task A is a task with a small utilization pinned to CPU4.
>> Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
>> at 200. After a while, task A is then pinned to CPU5, joining B.
>>
>> Results are in Out[23] and Out[95]. The blue util_avg signal is the
>> original root CFS util_avg. The yellow line is the root CFS utilization
>> after considering uclamp. Max aggregation sees a frequency
>> spike at 579.1s. When zoomed in, one can see square-wave-like
>> utilization values because of A periodically going to sleep. When A
>> wakes up, its default UCLAMP_MAX of 1024 will uncap B and reach the
>> highest CPU frequency. When A sleeps, B's UCLAMP_MAX will be in effect
>> and will reduce rq utilization to 200. This happens repeatedly, hence
>> the square wave. In contrast, sum aggregation sees a normal increase in
>> utilization when A joins B, at around 2507s, without any square-wave
>> behavior.
>
> Makes sense. But there shouldn't be a root_cfs_util_uclamp in main?
> Which signal does the yellow line represent in Out[23]?

Ah, that can be confusing. For current upstream root_cfs_util_uclamp
comes from uclamp_rq_util_with(rq, cfs_rq->avg.util_avg).

>> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the
>> little PD (CPU0-3). 4 same tasks pinned to the big PD (CPU4-5).
>> After a while, remove the CPU pinning of the 4 tasks on the big PD.
>>
>> Results are in Out[29] and Out[101]. Max aggregation fails to identify
>> that all 8 tasks can be packed on the little PD, whereas sum
>> aggregation immediately moves the 4 tasks pinned to big PD to the
>> little PD the moment pinning is removed. Sum aggregation understands
>> that even when the rq seems to have utilization of 1024, this is because
>> of UCLAMP_MAX and there's still opportunity to pack 2 such tasks on each
>> little CPU, leaving the big PD untouched, saving power.
>
> Looks good to me.
>
>>
>> BENCHMARKS:
>>
>> One TODO item is to handle cpu_util(boost) better. The current handling
>> of boost is far from ideal and is a known caveat. All below benchmarks
>> numbers are done without any boosting in cpu_util(), in both max and sum
>> aggregation.
>
> Maybe too early for black-box tests at this stage, but good to see that
> nothing seems to go sideways with uclamp sum aggregation.
>
>
>>
>> Geekbench 6 (on Rock-5B board)
>> +-----+-------------+------------+
>> | | Single-core | Multi-core |
>> +-----+-------------+------------+
>> | Max | 800.6 | 2977.0 |
>> | Sum | 801.2 | 2981.4 |
>> +-----+-------------+------------+
>>
>> No regression is seen after switching to sum aggregation.
>>
>> Jankbench (backporting sum aggregation to Android 5.18 mainline kernel)
>>
>> +------+-----------------+-------+-----------+
>> | | variable | value | perc_diff |
>> +------+-----------------+-------+-----------+
>> | main | jank_percentage | 1.1 | 0.00% |
>> | sum | jank_percentage | 0.5 | -53.92% |
>> +------+-----------------+-------+-----------+
>>
>> +------------+--------+------+-------+-----------+
>> | | metric | tag | value | perc_diff |
>> +------------+--------+------+-------+-----------+
>> | CPU | gmean | main | 166.1 | 0.00% |
>> | CPU-Big | gmean | main | 55.1 | 0.00% |
>> | CPU-Little | gmean | main | 91.7 | 0.00% |
>> | CPU-Mid | gmean | main | 19.2 | 0.00% |
>> | CPU | gmean | sum | 161.3 | -2.85% |
>> | CPU-Big | gmean | sum | 52.9 | -3.97% |
>> | CPU-Little | gmean | sum | 86.7 | -5.39% |
>> | CPU-Mid | gmean | sum | 21.6 | 12.63% |
>> +------------+--------+------+-------+-----------+
>>
>> The TL;DR of Jankbench is that sum aggregation reduces jank by 54% while
>> with 2.85% less CPU power. Note that this benchmark on Pixel 6 by
>> default has some sort of uclamp feedback applied to it. When jank is
>> detected, certain display and benchmark threads are applied with a
>> UCLAMP_MIN value of 512 (any help in identifying where these UCLAMP_MIN
>> values come from in the mainline Android kernel is appreciated). In
>> contrast, we constantly apply a UCLAMP_MIN value of 60 to these threads
>> without any feedback loop. If a similar feedback loop is involved to
>> only apply UCLAMP_MIN when needed, power consumption can be expected to
>> be even lower.
>
> [...]

2023-11-03 13:50:57

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/uclamp: Remove all uclamp bucket logic

On 04/10/2023 11:04, Hongyan Xia wrote:
> From: Hongyan Xia <[email protected]>
>
> Also rewrite uclamp_update_active() so that the effective uclamp values
> are updated every time we change task group properties, change system
> defaults or a request is issued from userspace.

Tested it with

# cgcreate -g cpu:/A
# echo $$ > /sys/fs/cgroup/cpu/A/tasks

(1) per-task

# uclampset --pid $$ -m 256 -M 768

(2) per taskgroup

# echo 25.0 > /sys/fs/cgroup/cpu/A/cpu.uclamp.min
# echo 75.0 > /sys/fs/cgroup/cpu/A/cpu.uclamp.max

(3) system-wide

# echo 256 > /proc/sys/kernel/sched_util_clamp_min
# echo 768 > /proc/sys/kernel/sched_util_clamp_max

uclamp_update_active() -> uclamp_update_active_nolock() is called in all
cases.

[...]

uclamp_eff_get()'s function header still mentions `clamp bucket index`.

> @@ -1542,196 +1485,24 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>
> unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> {
> - struct uclamp_se uc_eff;
> -
> - /* Task currently refcounted: use back-annotated (effective) value */
> - if (p->uclamp[clamp_id].active)
> - return (unsigned long)p->uclamp[clamp_id].value;
> -
> - uc_eff = uclamp_eff_get(p, clamp_id);
> -
> - return (unsigned long)uc_eff.value;
> -}

[...]

2023-11-03 14:01:49

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/uclamp: Remove all uclamp bucket logic

On 03/11/2023 13:50, Dietmar Eggemann wrote:
> On 04/10/2023 11:04, Hongyan Xia wrote:
>> From: Hongyan Xia <[email protected]>
>>
>> Also rewrite uclamp_update_active() so that the effective uclamp values
>> are updated every time we change task group properties, change system
>> defaults or a request is issued from userspace.
>
> Tested it with
>
> # cgcreate -g cpu:/A
> # echo $$ > /sys/fs/cgroup/cpu/A/tasks
>
> (1) per-task
>
> # uclampset --pid $$ -m 256 -M 768
>
> (2) per taskgroup
>
> # echo 25.0 > /sys/fs/cgroup/cpu/A/cpu.uclamp.min
> # echo 75.0 > /sys/fs/cgroup/cpu/A/cpu.uclamp.max
>
> (3) system-wide
>
> # echo 256 > /proc/sys/kernel/sched_util_clamp_min
> # echo 768 > /proc/sys/kernel/sched_util_clamp_max
>
> uclamp_update_active() -> uclamp_update_active_nolock() is called in all
> cases.
>

Thanks for testing!

>
> uclamp_eff_get()'s function header still mentions `clamp bucket index`.

This is indeed confusing. I have changed it from 'clamp bucket index' to
just 'uclamp value'.

>> [...]
>
> [...]

2023-11-03 14:51:16

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] sched/uclamp: Simplify uclamp_eff_value()

On 04/10/2023 11:04, Hongyan Xia wrote:
> From: Hongyan Xia <[email protected]>
>
> The commit
>
> sched: Remove all uclamp bucket logic
>
> removes uclamp_{inc/dec}() functions, so now p->uclamp contains the

s/uclamp_{inc/dec}/uclamp_rq_{inc/dec}

> correct values all the time after a update_uclamp_active() call, and

s/update_uclamp_active()/uclamp_update_active()

> there's no need to toggle the boolean `active` after an update. As a
> result, this function is fairly simple now and can live as a static
> inline function.

[...]

> -unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> -{
> - if (!uclamp_is_used() || !p->uclamp[clamp_id].active)
> - return uclamp_none(clamp_id);
> -
> - return p->uclamp[clamp_id].value;
> -}
> -

Is there still a need for p->uclamp[clamp_id].active ? Does
uclamp_eff_value() ever get called with !active ?

And why do we have to set uclamp default values in case (!used ||
!active)? Shouldn't they be set already in this situation?

[...]

2023-12-03 01:00:37

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] sched: uclamp sum aggregation

Hi Hongyan

On 10/04/23 10:04, Hongyan Xia wrote:
> Current implementation of uclamp leaves many things to be desired.
> There are several problems:
>
> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or
> with the default value, which is 1024) can ruin the previous
> settings the moment it is enqueued, as shown in the uclamp frequency
> spike problem in Section 5.2 of
> Documentation/scheduler/sched-util-clamp.rst. Constantly running at
> 1024 utilization implies that the CPU is driven at its max capacity.
> However, with UCLAMP_MAX, this assumption is no longer true. To
> mitigate this, one idea is to filter out uclamp values for
> short-running tasks. However, the effectiveness of this idea remains
> to be seen.

This was proved when I was at arm and it is used in products now too. It is
effective.

>
> 2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU
> running at its peak, as both show utilization of 1024. However, in
> certain cases it's important to tell the difference, as we still want
> to take the opportunity to enqueue tasks in the former case.
>
> It's also debatable whether uclamp should be a frequency hint. An

It is not debatable. Uclamp is a performance hint which translates into task
placement (for HMP) and frequency selection. On SMP the latter is the only
meaning for uclamp. For the time being at least until we get a new technology
that can impact performance ;-)

> example is a system with a mid CPU of capacity 500, and a little CPU
> with capacity 100. If we have 10 tasks with UCLAMP_MIN of 101, then
> util_fits_cpu() will schedule them on the mid CPU because feec()
> thinks they don't fit on the little, even though we should use the
> little core at some point instead of making the mid even more crowded.

This is not a problem. One of the major design points for uclamp is to allow
small tasks to request to run faster. So if you have 100 tasks that run for
100us and to meet that they need to run at the mid CPU, that is fine. That
doesn't necessarily mean they'll overutilize the cluster.

If the cluster gets overutilized, then EAS will be disabled and
select_idle_capacity() will spread onto the idle CPU with the highest spare
capacity, which includes little CPUs. And this behavior is no different from
having tasks with util greater than 100 wanting to run on mid cores. The only
difference is that we can pack more on the mid cores before overutilized can
be triggered, which is the desired effect.

So this is a non-issue and we should spread once the cluster is busy.

> Of course, this in reality doesn't happen because of other mechanisms
> like load balancing, but it's also not good when other mechanisms can
> cancel the effect of uclamp in feec(). A CPU running at capacity 500 for
> 1ms or for 1000ms gives completely different performance levels, so
> trying to fit only the frequency does not give us any performance
> guarantees. If we then talk about not only running at some frequency but
> also for some amount of time, then what we really mean is a capacity
> hint, not a frequency hint.

I'm not sure I parsed that. But again, it's a performance hint which implies
both.

>
> It's even worse when the tasks scheduled on the mid CPU also have
> UCLAMP_MAX values. In the previous example, instead of running at 500, a
> single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much
> lower frequency than 500, so it is then a really bad decision to honor
> the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of
> running it on the little CPU which is less crowded, and has pretty much
> the same capacity (100 vs 110) anyway.

I can't follow this reasoning.

It seems you're assuming the user has picked 110 by mistake and it is okay to
ignore it.

FYI, one of the use cases is to set uclamp_min to capacity_of(little_core)+1
because the little cores are too crappy to handle their workload. It is still
better than using affinity, as when the system gets busy we can still use the
little core. But only when the system is busy and crowded already.

I don't know what the reference to the mid core having a task with
uclamp_max is about, but it seems like a bug in feec() if it is not
predicting the frequency correctly.

>
> This series attempts to solve these problems by tracking a
> util_avg_uclamp signal in tasks and cfs_rq. At task level,

This was discussed offline and brought up several times on the list already.
uclamp is a performance hint, not a bandwidth hint. I remember Dietmar was
suggesting this concept in the past and I objected that this converts uclamp
into a bandwidth hint. This overlaps with the CFS bandwidth controller. We can
use that if somebody wants to limit the bandwidth a task can consume. uclamp
is about limiting the performance level it can reach, i.e. it is not about
cycles consumed, but the Performance Point reached.

> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
> must always be the sum of all util_avg_uclamp of all the entities on
> this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum
> aggregation of all the clamped values, which indicates the frequency
> this rq should run at and what the utilization is.

This is a no-go for me. A task's performance requirements cannot be added up
to resemble the 'size' of the task. Tasks with util_avg 100 but a uclamp_min
of 1024 should allow us to pack them on the big core, which is an important
use case. A bunch of small tasks that need to run faster are okay to pack on
a bigger core, as long as this doesn't cause overutilized to trigger.

>
> This idea solves Problem 1 by capping the utilization of an
> always-running task throttled by UCLAMP_MAX. Although the task (denoted
> by Task A) has no idle time, the util_avg_uclamp signal gives its
> UCLAMP_MAX value instead of 1024, so even if another task (Task B) with
> a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's
> UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization,
> which means we no longer have the frequency spike problem caused by B.
> This should mean that we might completely avoid the need for uclamp
> filtering.
>
> It also solves Problem 2 by tracking the capped utilization of a rq.
> Using util_avg_uclamp, we are able to differentiate between a CPU
> throttled by UCLAMP_MAX and one that is actually running at its peak
> capacity. An always-running rq with a task having UCLAMP_MAX at 300 will
> report a cfs_rq util_avg_uclamp at 300, not 1024, which means task
> placement sees spare capacity on this CPU and will allocate some tasks
> to it, instead of treating it the same way as a CPU actually running at

We don't need to know the actual spare capacity, though that would be nice to
have.

But generally in my view this CPU should be available for non-capped tasks to
run on. I view this as a load balancer problem where the wake up path can
consider this CPU as a candidate and rely on the load balancer to move this
task away to another (smaller) CPU ASAP. If this capped task is already on the
smaller CPU and there's nowhere else to place the uncapped task in the system,
then we must be quite busy already.

If userspace ends up capping too many tasks so hard that we end up with no
idle time in the system (where otherwise we should have some), I consider this
bad usage in userspace and will ask them not to cap that hard. This is a case
of one shooting themselves in the foot. Maybe we need to care in the future,
but so far this doesn't seem a valid problem to address. I'd like to see why
userspace has failed to do proper management first before jumping to fix this.

What I'd expect is more use cases of tasks that are truly 1024 but being
capped. In this case there's truly no idle time on the CPU and your proposal
won't address this. I don't think we have a way to distinguish between the two
cases at the moment.

> the peak. This brings us closer to the benefits of Android's sum
> aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at
> https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510),
> which shows energy benefits because we are able to schedule tasks on
> smaller cores which are UCLAMP_MAX capped, instead of finding a fresh

You're jumping to the wrong conclusion about why this is being used. This was
to overcome issues when using cpu.shares. This has nothing to do with
UCLAMP_MAX. The uclamp_max filter is what is being used, based on the proposal
I posted long ago:

https://android-review.googlesource.com/c/kernel/common/+/1914696

> big CPU. However, a major difference is that this series has an
> amortized cost of O(1), instead of O(n) in cpu_util_next() in Android
> kernels.
>
> It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum

This is not a shortcoming. Performance implies frequency. We should not
confuse how schedutil estimates the perf requirement based on historical
run time (util_avg) with the fact that a task with a short historical runtime
needs to run faster. Running faster means increasing frequency.

> aggregation, the scheduler sees the sum aggregated util_avg_uclamp and
> will avoid the problem of enqueueing too many tasks just to fit the
> frequency, and will place tasks on little CPUs in the previous example.

I think the issue of uclamp being a performance hint but not a bandwidth hint
was raised several times. Did I miss something that makes this not convert it
into a bandwidth hint?

Generally I think the max aggregation should go away too. We've been
overcompensating for a problem that must be fixed by the cpufreq governor. I've
been building up to this solution with patches to improve the governor and will
post them soon.


Cheers

--
Qais Yousef

2023-12-04 01:49:04

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] sched: uclamp sum aggregation

Hi Qais,

On 03/12/2023 00:25, Qais Yousef wrote:
> Hi Hongyan
>
> On 10/04/23 10:04, Hongyan Xia wrote:
>> Current implementation of uclamp leaves many things to be desired.
>> There are several problems:
>>
>> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or
>> with the default value, which is 1024) can ruin the previous
>> settings the moment it is enqueued, as shown in the uclamp frequency
>> spike problem in Section 5.2 of
>> Documentation/scheduler/sched-util-clamp.rst. Constantly running at
>> 1024 utilization implies that the CPU is driven at its max capacity.
>> However, with UCLAMP_MAX, this assumption is no longer true. To
>> mitigate this, one idea is to filter out uclamp values for
>> short-running tasks. However, the effectiveness of this idea remains
>> to be seen.
>
> This was proved when I was at arm and it is used in products now too. It is
> effective.
>
>>
>> 2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU
>> running at its peak, as both show utilization of 1024. However, in
>> certain cases it's important to tell the difference, as we still want
>> to take the opportunity to enqueue tasks in the former case.
>>
>> It's also debatable whether uclamp should be a frequency hint. An
>
> It is not debatable. Uclamp is a performance hint which translates into task
> placement (for HMP) and frequency selection. On SMP the latter is the only
> meaning for uclamp. For the time being at least until we get a new technology
> that can impact performance ;-)

I agree. In this new series, uclamp is still a performance hint which
affects task placement and frequency, so we don't have a disagreement here.

>> example is a system with a mid CPU of capacity 500, and a little CPU
>> with capacity 100. If we have 10 tasks with UCLAMP_MIN of 101, then
>> util_fits_cpu() will schedule them on the mid CPU because feec()
>> thinks they don't fit on the little, even though we should use the
>> little core at some point instead of making the mid even more crowded.
>
> This is not a problem. One of the major design points for uclamp is to allow
> small tasks to request to run faster. So if you have 100 tasks that run for
> 100us and to meet that they need to run at mid CPU that is fine. That doesn't
> necessarily mean they'll overutilized the cluster.

I understand. This design point is what sum aggregation does as well.
This alone is not a problem, but moving a task to another CPU to fit its
uclamp_min requirement can be a problem when the destination CPU is
crowded and mixed with uclamp_max tasks. I will elaborate on this below.

>
> If the cluster gets overutilized, then EAS will be disabled and
> find_idle_capacity() will spread into highest spare capacity idle CPU, which
> includes little CPUs. And this behavior is no different to have tasks with util
> greater than 100 wanting to run on mid cores. The only difference is that we
> can pack more on mid core before overutilized can be triggered. Which is the
> desired effect.
>
> So this is a no issue and we should spread once the cluster is busy.
>> Of course, this in reality doesn't happen because of other mechanisms
>> like load balancing, but it's also not good when other mechanisms can
>> cancel the effect of uclamp in feec(). A CPU running at capacity 500 for
>> 1ms or for 1000ms gives completely different performance levels, so
>> trying to fit only the frequency does not give us any performance
>> guarantees. If we then talk about not only running at some frequency but
>> also for some amount of time, then what we really mean is a capacity
>> hint, not a frequency hint.
>
> I'm not sure I parsed that. But again, it's a performance hint which implies
> both.
>
>>
>> It's even worse when the tasks scheduled on the mid CPU also have
>> UCLAMP_MAX values. In the previous example, instead of running at 500, a
>> single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much
>> lower frequency than 500, so it is then a really bad decision to honor
>> the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of
>> running it on the little CPU which is less crowded, and has pretty much
>> the same capacity (100 vs 110) anyway.
>
> I can't follow this reasoning.
>
> It seems you're assuming the user has picked 110 by mistake and it is okay to
> ignore it.
>
> FYI, one of the use cases is to set uclamp_min to capacity_of(little_core)+1
> because they're so crappy to handle their workload. It is still better than
> using affinity as when the system gets busy we can still use the little core.
> But only when the system is busy and crowded already.
>
> I don't know what's the reference to the mid core having a task with
> uclamp_max, but seems a bug in feec() if not predicting the frequency
> correctly.

The example in the cover letter might be complex. I can simplify it as an
"N tasks" problem.

Say 2 CPUs, one with capacity 1024 and the other 512, and assume there's
a task with uclamp_min 513 and uclamp_max 800. Then the big CPU can, in
theory, fit an infinite number of such tasks without triggering
overutilization. What does uclamp_min even mean when this happens?
uclamp_min should be a hint of the minimum performance the task should
get, but it breaks down when there are N tasks, so the hint from
uclamp_min is divided by N. Under max aggregation, uclamp_min doesn't
give you the "performance" of this task. It gives the performance of the
CPU, and this is not the performance of this task when N > 1. In fact, N
can be infinity. This is the same problem shared by multiple uclamp
users from LPC, where max aggregation hints at the performance of the
CPU, but not of the task itself.
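
To put numbers on this, here is a minimal sketch (plain userspace C, not
kernel code; the values follow the example above and everything else is
made up for illustration) of how the per-task share shrinks with N under
max aggregation, while a sum-aggregated rq signal grows with N and
eventually flags the CPU as full:

/* Minimal sketch, not kernel code: N always-running tasks with
 * uclamp_min = 513 and uclamp_max = 800 packed on a 1024-capacity CPU. */
#include <stdio.h>

int main(void)
{
	const int capacity = 1024, uclamp_min = 513, uclamp_max = 800;

	for (int n = 1; n <= 8; n *= 2) {
		/* Max aggregation: the rq-level clamp is the max over the
		 * tasks, so the CPU runs at ~uclamp_max regardless of N
		 * and each task only gets a 1/N share of it. */
		int per_task = uclamp_max / n;

		/* Sum aggregation: each task contributes its clamped
		 * utilization, so the rq signal scales with N and the
		 * CPU is seen as full once the sum exceeds capacity. */
		int rq_sum = n * uclamp_min;

		printf("N=%d: per-task share under max agg = %d, "
		       "sum-agg rq signal = %d (capacity %d)\n",
		       n, per_task, rq_sum, capacity);
	}
	return 0;
}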

As to "FYI, one of the use cases is to set uclamp_min to
capacity_of(little_core)+1 because they're so crappy to handle their
workload", I'm afraid I disagree with this. I think uclamp is a simple
hinting mechanism to the scheduler to say: please try to give me
[uclamp_min, uclamp_max] performance. It's then the job of EAS to pick
the most efficient CPU. We know the little CPU is crappy because we have
deep knowledge of the device, but uclamp values could just come from
game developers (using ADPF) who have no knowledge of the SoC, and in
the game developer's view, asking for a uclamp of little CPU capacity + 1
is simply because he/she wants that performance, not because the developer
thinks the little CPU should be avoided. I don't think a soft-affinity-like
property belongs here. We should improve the EM and let EAS do its job
instead of teaching it to interpret uclamp as affinity hints.

>>
>> This series attempts to solve these problems by tracking a
>> util_avg_uclamp signal in tasks and cfs_rq. At task level,
>
> This was discussed offline and brought several times on the list already.
> uclamp is a performance not a bandwidth hint. I remember Dietmar was suggesting
> this concept in the past and I objected that this converts uclamp into
> a bandwidth hint. This overlaps with CFS bandwidth controller. We can use that
> if somebody wants to limit the bandwidth the task can consume. uclamp is about
> limiting the performance level it can reach. ie: it is not about cycles
> consumed, but Performance Point reached.

I'd argue the current max aggregation implementation is further away
from a performance hint. The moment there are more tasks interacting
with each other on the same rq, the performance hinting ability of max
aggregation gets worse: uclamp_min being divided by N, frequency spikes,
uneven task distribution (shown in the evaluation section of my cover
letter), etc., whereas sum aggregation still performs well.

Other shortcomings are not that critical, but the fact that uclamp_min's
effectiveness is divided by N under max aggregation I think is not
acceptable.

>> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
>> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
>> must always be the sum of all util_avg_uclamp of all the entities on
>> this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum
>> aggregation of all the clamped values, which indicates the frequency
>> this rq should run at and what the utilization is.
>
> This is a no go for me. A task performance requirements can not be added to
> resemble the 'size' of the tasks. Tasks with util_avg 100 but a uclamp_min of
> 1024 should allow us to pack them in the big core. Which is an important use
> case to use. A bunch of small tasks that need to run faster are okay to pack on
> a bigger core if they need to run faster as long as this doesn't cause
> overutilized to trigger.

I see. I now understand where we start to diverge. It seems we want
completely different things from uclamp.

Max aggregation: please try to place me on a CPU that can give CPU
capacity [uclamp_min, uclamp_max] and run that CPU within that range.
Sum aggregation: please try to treat me like a task with utilization
within [uclamp_min, uclamp_max].
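
To make the contrast concrete, here is a minimal, self-contained sketch
of the two rq-level rules as I understand them (plain C with made-up
task values and helper names, not the actual kernel implementation):

/* Minimal sketch, not kernel code: the rq-level request under the two
 * aggregation rules, for a toy set of enqueued tasks. */
#include <stdio.h>

struct toy_task { int util, uclamp_min, uclamp_max; };

static int clamp_val(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

int main(void)
{
	struct toy_task tasks[] = {
		{ .util = 100, .uclamp_min = 0,   .uclamp_max = 300  },
		{ .util = 200, .uclamp_min = 512, .uclamp_max = 1024 },
	};
	int n = sizeof(tasks) / sizeof(tasks[0]);
	int rq_util = 0, rq_min = 0, rq_max = 0, sum_agg = 0;

	for (int i = 0; i < n; i++) {
		rq_util += tasks[i].util;

		/* Max aggregation: the rq clamps are the max over the
		 * enqueued tasks, applied to the summed utilization. */
		if (tasks[i].uclamp_min > rq_min)
			rq_min = tasks[i].uclamp_min;
		if (tasks[i].uclamp_max > rq_max)
			rq_max = tasks[i].uclamp_max;

		/* Sum aggregation: each task contributes its own clamped
		 * utilization to the rq signal. */
		sum_agg += clamp_val(tasks[i].util, tasks[i].uclamp_min,
				     tasks[i].uclamp_max);
	}

	printf("max aggregation request: %d\n",
	       clamp_val(rq_util, rq_min, rq_max));
	printf("sum aggregation request: %d\n", sum_agg);
	return 0;
}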

Like I mentioned in the previous comment, I'm afraid I disagree with
using uclamp as a CPU affinity hint, even if it's just a soft affinity.
Its job should just be a performance hint and EM should tell EAS which
CPU to pick under that hint. If the little CPU is crappy, then the
crappiness should be reflected in the EM, and maybe dynamic EM can help.

>>
>> This idea solves Problem 1 by capping the utilization of an
>> always-running task throttled by UCLAMP_MAX. Although the task (denoted
>> by Task A) has no idle time, the util_avg_uclamp signal gives its
>> UCLAMP_MAX value instead of 1024, so even if another task (Task B) with
>> a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's
>> UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization,
>> which means we no longer have the frequency spike problem caused by B.
>> This should mean that we might completely avoid the need for uclamp
>> filtering.
>>
>> It also solves Problem 2 by tracking the capped utilization of a rq.
>> Using util_avg_uclamp, we are able to differentiate between a CPU
>> throttled by UCLAMP_MAX and one that is actually running at its peak
>> capacity. An always-running rq with a task having UCLAMP_MAX at 300 will
>> report a cfs_rq util_avg_uclamp at 300, not 1024, which means task
>> placement sees spare capacity on this CPU and will allocate some tasks
>> to it, instead of treating it the same way as a CPU actually running at
>
> We don't need to know the actual spare capacity, though that would be nice to
> have.
>
> But generally in my view this CPU should be available for non capped task to
> run on. I view this a load balancer problem where wake up path can consider
> this CPU as a candidate and rely on load balancer to move this task away to
> another (smaller) CPU ASAP. If this capped task is already on the smaller CPU
> and there's nowhere else to place the uncapped task in the system then we must
> be quite busy already.

I'm not quite sure relying on the load balancer is a good idea, because
then it means we need uclamp code in the load balancer and the
complexity keeps growing. An example is that after applying the patch

'sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0'

even though the scheduler now knows that CPUs with 0 spare capacity are
candidates, it creates a very unbalanced task placement and CPU0
tends to receive a lot of tasks in my tests. This certainly needs some
uclamp-aware code in the load balancer, whereas sum aggregation has no
such problem and balance is achieved during task placement in the first
place.

> If userspace ends up capping too many tasks too hard that we end up with no
> idle time in the system (where otherwise we should) I consider this a bad usage
> in the userspace and will ask them not to cap that hard. This is a case of one
> shooting themselves in the foot. Maybe we need to care in the future, but so
> far this doesn't seem a valid problem to address. I'd like to see why userspace
> has failed to do proper management first before jumping to fix this.
>
> What I'd expect more use cases of tasks that are truly 1024 but being capped.
> In this case there's truly no idle time on the CPU and your proposal won't
> address this. I don't think we have a way to distinguish between the two cases
> at the moment.

I don't quite get what you mean. In the examples in my cover letter the
capped tasks are truly 1024: two CPUs, one with an uncapped 1024 task
and the other with a 1024 task but with uclamp_max of 300. The scheduler
sees cfs_rq->avg.util_avg_uclamp of 1024 on the first CPU and 300 on the
2nd, so my proposal indeed sees the difference, and will have the
opportunity to place more tasks on the 2nd CPU without any corner-case
code.
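
As a small hedged illustration (plain C with made-up values, not kernel
code), this is the difference the clamped signal exposes between a CPU
throttled by UCLAMP_MAX and one genuinely at its peak:

/* Minimal sketch, not kernel code: both CPUs host an always-running
 * (util 1024) task, but only the second task is capped at 300. */
#include <stdio.h>

static int clamp_val(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

int main(void)
{
	int capacity = 1024;
	/* CPU0: uncapped 1024 task. CPU1: 1024 task with uclamp_max 300. */
	int cpu0_signal = clamp_val(1024, 0, 1024);
	int cpu1_signal = clamp_val(1024, 0, 300);

	printf("CPU0 util_avg_uclamp=%d spare=%d\n",
	       cpu0_signal, capacity - cpu0_signal);
	printf("CPU1 util_avg_uclamp=%d spare=%d\n",
	       cpu1_signal, capacity - cpu1_signal);
	return 0;
}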

>> the peak. This brings us closer to the benefits of Android's sum
>> aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at
>> https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510),
>> which shows energy benefits because we are able to schedule tasks on
>> smaller cores which are UCLAMP_MAX capped, instead of finding a fresh
>
> You're jumping to the wrong conclusion about why this is being used. This was
> to overcome issues when using cpu.shares. This has nothing to do with
> UCLAMP_MAX. uclamp_max filter is what is being used. Based on the proposal
> I posted long ago
>
> https://android-review.googlesource.com/c/kernel/common/+/1914696
>

Maybe I need to rephrase. I said that GROUP_THROTTLE can give us some
energy benefits. I didn't draw any conclusion that this was why it was
invented. I was also trying to suggest that, under sum aggregation, the
same code can work for both uclamp and GROUP_THROTTLE. However, I
probably need to better understand the issues around cpu.shares before
saying this concretely.

Also, uclamp filtering might be another example of why I think max
aggregation is over-complicated. Sum aggregation has only around 300
lines of code so far, without the need for filtering at all.

>> big CPU. However, a major difference is that this series has an
>> amortized cost of O(1), instead of O(n) in cpu_util_next() in Android
>> kernels.
>>
>> It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum
>
> This is not a short coming. Performance implies frequency. We should not
> confused how schedutil works to estimate perf requirement based on historical
> run time (util_avg) and the fact that a task with short historical runtime
> needs to run faster. Running faster means increasing frequency.
>
>> aggregation, the scheduler sees the sum aggregated util_avg_uclamp and
>> will avoid the problem of enqueueing too many tasks just to fit the
>> frequency, and will place tasks on little CPUs in the previous example.
>
> I think the issue of uclamp being a performance but not a bandwidth hint was
> raised several times. Did I miss something that makes this not converting it to
> a bandwidth hint?
>

I wonder what a performance hint even means if it doesn't at least
correlate with bandwidth in some way. Yes, specifying uclamp_min under
max aggregation can run the CPU at least at that frequency, but if you
have N tasks on the same rq, then each task can only get a share of
uclamp_min/N. I highly doubt that when a mobile phone developer
increases the uclamp_min of an application, he or she is okay with
doubling the CPU frequency when it runs but running it only for 1/10 of
the time because it gets moved to a crowded big CPU, which essentially
gives 0.2x the "performance" in my opinion, so I'm not sure fully
disconnecting bandwidth from performance is meaningful. Of course, a
(not yet existing) uclamp-aware load balancer could come in and migrate
those tasks, but why not avoid this problem in the first place via sum
aggregation without involving the load balancer, especially when today
we don't have any indication of how well a uclamp-aware load balancer
can perform anyway?

In the end I may still not be able to fully grasp what the issues of sum
aggregation are. I think so far I get that it's closer to "bandwidth"
than to "performance" (though I think "performance" doesn't mean much if
it doesn't correlate with bandwidth) and that it doesn't work as a CPU
affinity hint (it actually does to some degree, but the uclamp values
need to be different). But I think focusing on these ignores the whole
picture, which is the benefits of sum aggregation (already presented in
the evaluation section of the cover letter of this series):

1. It's fewer than 350 lines of code (max aggregation is 750+ and still
growing).
2. It doesn't suffer from frequency spikes or load balancing issues.
3. It needs no code to consider uclamp as a corner case in the
scheduler. No need to carry around uclamp_min and uclamp_max in so many
functions. util_fits_cpu() is just one line of code.
4. It's effective. When you say a uclamp_min of X, the scheduler really
tries to give you X for your task, not for the CPU you are on, as seen
in 54% reduction of jank in jankbench, without increasing energy
consumption.
5. It looks like something multiple developers want from LPC 2023. The
use cases of "additive uclamp", and hinting the host from the VM via
util_guest can be achieved exactly by uclamp sum aggregation.

Given these benefits, I think it's probably a good idea to give sum
aggregation a serious look. On my Pixel 6, sum aggregation either gives
better benchmark scores at the same energy as max aggregation, or
consumes less power at the same benchmark scores. I'll keep benchmarking
it but so far so good.

Hongyan

2023-12-04 16:13:00

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] sched: uclamp sum aggregation

On Mon, 4 Dec 2023 at 02:48, Hongyan Xia <[email protected]> wrote:
>
> Hi Qais,
>
> On 03/12/2023 00:25, Qais Yousef wrote:
> > Hi Hongyan
> >
> > On 10/04/23 10:04, Hongyan Xia wrote:
> >> Current implementation of uclamp leaves many things to be desired.
> >> There are several problems:
> >>
> >> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or
> >> with the default value, which is 1024) can ruin the previous
> >> settings the moment it is enqueued, as shown in the uclamp frequency
> >> spike problem in Section 5.2 of
> >> Documentation/scheduler/sched-util-clamp.rst. Constantly running at
> >> 1024 utilization implies that the CPU is driven at its max capacity.
> >> However, with UCLAMP_MAX, this assumption is no longer true. To
> >> mitigate this, one idea is to filter out uclamp values for
> >> short-running tasks. However, the effectiveness of this idea remains
> >> to be seen.
> >
> > This was proved when I was at arm and it is used in products now too. It is
> > effective.
> >
> >>
> >> 2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU
> >> running at its peak, as both show utilization of 1024. However, in
> >> certain cases it's important to tell the difference, as we still want
> >> to take the opportunity to enqueue tasks in the former case.
> >>
> >> It's also debatable whether uclamp should be a frequency hint. An
> >
> > It is not debatable. Uclamp is a performance hint which translates into task
> > placement (for HMP) and frequency selection. On SMP the latter is the only
> > meaning for uclamp. For the time being at least until we get a new technology
> > that can impact performance ;-)
>
> I agree. In this new series, uclamp is still a performance hint which
> affects task placement and frequency, so we don't have a disagreement here.
>
> >> example is a system with a mid CPU of capacity 500, and a little CPU
> >> with capacity 100. If we have 10 tasks with UCLAMP_MIN of 101, then
> >> util_fits_cpu() will schedule them on the mid CPU because feec()
> >> thinks they don't fit on the little, even though we should use the
> >> little core at some point instead of making the mid even more crowded.
> >
> > This is not a problem. One of the major design points for uclamp is to allow
> > small tasks to request to run faster. So if you have 100 tasks that run for
> > 100us and to meet that they need to run at mid CPU that is fine. That doesn't
> > necessarily mean they'll overutilized the cluster.
>
> I understand. This design point is what sum aggregation does as well.
> This alone is not a problem, but moving a CPU to another CPU to fit
> uclamp_min requirement can be a problem, when the destination CPU is
> crowded and mixed with uclamp_max. I will elaborate on this below.
>
> >
> > If the cluster gets overutilized, then EAS will be disabled and
> > find_idle_capacity() will spread into highest spare capacity idle CPU, which
> > includes little CPUs. And this behavior is no different to have tasks with util
> > greater than 100 wanting to run on mid cores. The only difference is that we
> > can pack more on mid core before overutilized can be triggered. Which is the
> > desired effect.
> >
> > So this is a no issue and we should spread once the cluster is busy.
> >> Of course, this in reality doesn't happen because of other mechanisms
> >> like load balancing, but it's also not good when other mechanisms can
> >> cancel the effect of uclamp in feec(). A CPU running at capacity 500 for
> >> 1ms or for 1000ms gives completely different performance levels, so
> >> trying to fit only the frequency does not give us any performance
> >> guarantees. If we then talk about not only running at some frequency but
> >> also for some amount of time, then what we really mean is a capacity
> >> hint, not a frequency hint.
> >
> > I'm not sure I parsed that. But again, it's a performance hint which implies
> > both.
> >
> >>
> >> It's even worse when the tasks scheduled on the mid CPU also have
> >> UCLAMP_MAX values. In the previous example, instead of running at 500, a
> >> single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much
> >> lower frequency than 500, so it is then a really bad decision to honor
> >> the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of
> >> running it on the little CPU which is less crowded, and has pretty much
> >> the same capacity (100 vs 110) anyway.
> >
> > I can't follow this reasoning.
> >
> > It seems you're assuming the user has picked 110 by mistake and it is okay to
> > ignore it.
> >
> > FYI, one of the use cases is to set uclamp_min to capacity_of(little_core)+1
> > because they're so crappy to handle their workload. It is still better than
> > using affinity as when the system gets busy we can still use the little core.
> > But only when the system is busy and crowded already.
> >
> > I don't know what's the reference to the mid core having a task with
> > uclamp_max, but seems a bug in feec() if not predicting the frequency
> > correctly.
>
> The example in the cover letter might be complex. I can simplify as a "N
> tasks" problem.
>
> Say 2 CPUs, one with capacity 1024 and the other 512. Assume there's a
> task with uclamp_min 513 and uclamp_max of 800, then the big CPU can, in
> theory, fit an infinite number of such tasks on the big CPU, without
> triggering overutilization. What does uclamp_min even mean when this
> happens? uclamp_min should be a hint of a minimum performance it should
> get, but it breaks down when there are N tasks, so the hint from
> uclamp_min is divided by N. Under max aggregation, uclamp_min doesn't
> give you the "performance" of this task. It gives performance of the
> CPU, and this is not the performance of this task when N > 1. In fact, N
> can be infinity. This is the same problem shared by multiple uclamp
> users from LPC, where max aggregation hints the performance of the CPU,
> but not the task itself.
>
> As to "FYI, one of the use cases is to set uclamp_min to
> capacity_of(little_core)+1 because they're so crappy to handle their
> workload", I'm afraid I disagree with this. I think uclamp is a simple
> hinting mechanism to the scheduler to say: please try to give me
> [uclamp_min, uclamp_max] performance. It's then the job of EAS to pick

You get the point. It's an EAS decision and must stay in EAS, not go
into PELT. You have all the information: actual utilization,
uclamp_min and uclamp_max. Make EAS smarter.


> the most efficient CPU. We know the little CPU is crappy because we have
> deep knowledge of the device, but uclamp values could just come from
> game developers (using ADPF) who have no knowledge of the SoC, and in
> the game developer's view, asking uclamp of little CPU + 1 is simply
> because he/she wants that performance, not because the developer thinks
> the little CPU should be avoided. I don't think a soft-affinity-like
> property belongs here. We should improve EM and let EAS do its job
> instead of teaching it to interpret uclamp as affinity hints.
>
> >>
> >> This series attempts to solve these problems by tracking a
> >> util_avg_uclamp signal in tasks and cfs_rq. At task level,
> >
> > This was discussed offline and brought several times on the list already.
> > uclamp is a performance not a bandwidth hint. I remember Dietmar was suggesting
> > this concept in the past and I objected that this converts uclamp into
> > a bandwidth hint. This overlaps with CFS bandwidth controller. We can use that
> > if somebody wants to limit the bandwidth the task can consume. uclamp is about
> > limiting the performance level it can reach. ie: it is not about cycles
> > consumed, but Performance Point reached.
>
> I'd argue the current max aggregation implementation is further away
> from a performance hint. The moment there are more tasks interacting
> with each other on the same rq, the performance hinting ability of max
> aggregation tends to do worse, like uclamp_min being divided by N,
> frequency spikes, uneven task distribution (both shown in the evaluation
> section of my cover letter) etc, whereas sum aggregation still performs
> well.
>
> Other shortcomings are not that critical, but the fact that uclamp_min's
> effectiveness is divided by N under max aggregation I think is not
> acceptable.

Change EAS task placement policy in this case to take into account
actual utilization and uclamp_min/max

>
> >> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
> >> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
> >> must always be the sum of all util_avg_uclamp of all the entities on
> >> this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum
> >> aggregation of all the clamped values, which indicates the frequency
> >> this rq should run at and what the utilization is.
> >
> > This is a no go for me. A task performance requirements can not be added to
> > resemble the 'size' of the tasks. Tasks with util_avg 100 but a uclamp_min of
> > 1024 should allow us to pack them in the big core. Which is an important use
> > case to use. A bunch of small tasks that need to run faster are okay to pack on
> > a bigger core if they need to run faster as long as this doesn't cause
> > overutilized to trigger.
>
> I see. I now understand where we start to diverge. It seems we want
> completely different things from uclamp.
>
> Max aggregation: please try to place me on a CPU that can give CPU
> capacity [uclamp_min, uclamp_max] and run that CPU within that range.
> Sum aggregation: please try to treat me like a task with utilization
> within [uclamp_min, uclamp_max].
>
> Like I mentioned in the previous comment, I'm afraid I disagree with
> using uclamp as a CPU affinity hint, even if it's just a soft affinity.
> Its job should just be a performance hint and EM should tell EAS which
> CPU to pick under that hint. If the little CPU is crappy, then the
> crappiness should be reflected in the EM, and maybe dynamic EM can help.
>
> >>
> >> This idea solves Problem 1 by capping the utilization of an
> >> always-running task throttled by UCLAMP_MAX. Although the task (denoted
> >> by Task A) has no idle time, the util_avg_uclamp signal gives its
> >> UCLAMP_MAX value instead of 1024, so even if another task (Task B) with
> >> a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's
> >> UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization,
> >> which means we no longer have the frequency spike problem caused by B.
> >> This should mean that we might completely avoid the need for uclamp
> >> filtering.
> >>
> >> It also solves Problem 2 by tracking the capped utilization of a rq.
> >> Using util_avg_uclamp, we are able to differentiate between a CPU
> >> throttled by UCLAMP_MAX and one that is actually running at its peak
> >> capacity. An always-running rq with a task having UCLAMP_MAX at 300 will
> >> report a cfs_rq util_avg_uclamp at 300, not 1024, which means task
> >> placement sees spare capacity on this CPU and will allocate some tasks
> >> to it, instead of treating it the same way as a CPU actually running at
> >
> > We don't need to know the actual spare capacity, though that would be nice to
> > have.
> >
> > But generally in my view this CPU should be available for non capped task to
> > run on. I view this a load balancer problem where wake up path can consider
> > this CPU as a candidate and rely on load balancer to move this task away to
> > another (smaller) CPU ASAP. If this capped task is already on the smaller CPU
> > and there's nowhere else to place the uncapped task in the system then we must
> > be quite busy already.
>
> I'm not quite sure relying on the load balancer is a good idea, because
> then it means we need uclamp code in the load balancer and the

But you will have to, because uclamp_max can make any task behave like
an always-running task (i.e. without a wakeup event to migrate the task).

> complexity keeps growing. An example is that after applying the patch
>
> 'sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0'
>
> even though the scheduler now knows that CPUs with 0 spare capacity are
> now candidates, it creates a very unbalanced task placement and CPU0
> tends to receive a lot of tasks in my tests. This certainly needs some
> uclamp-aware code in the load balancer, whereas sum aggregation has no
> such problem and balance is achieved during task placement in the first
> place.
>
> > If userspace ends up capping too many tasks too hard that we end up with no
> > idle time in the system (where otherwise we should) I consider this a bad usage
> > in the userspace and will ask them not to cap that hard. This is a case of one
> > shooting themselves in the foot. Maybe we need to care in the future, but so
> > far this doesn't seem a valid problem to address. I'd like to see why userspace
> > has failed to do proper management first before jumping to fix this.
> >
> > What I'd expect more use cases of tasks that are truly 1024 but being capped.
> > In this case there's truly no idle time on the CPU and your proposal won't
> > address this. I don't think we have a way to distinguish between the two cases
> > at the moment.
>
> I don't quite get what you mean. In the examples in my cover letter the
> capped tasks are truly 1024. Two CPUs, one with an uncapped 1024 task
> and the other a 1024 task but with uclamp_max of 300. The scheduler sees
> cfs_rq->avg.util_avg_uclamp of 1024 in the first CPU and 300 in the 2nd,
> so my proposal indeed sees the difference, and will have the opportunity
> to place more tasks on the 2nd CPU without any corner case code.
>
> >> the peak. This brings us closer to the benefits of Android's sum
> >> aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at
> >> https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510),
> >> which shows energy benefits because we are able to schedule tasks on
> >> smaller cores which are UCLAMP_MAX capped, instead of finding a fresh
> >
> > You're jumping to the wrong conclusion about why this is being used. This was
> > to overcome issues when using cpu.shares. This has nothing to do with
> > UCLAMP_MAX. uclamp_max filter is what is being used. Based on the proposal
> > I posted long ago
> >
> > https://android-review.googlesource.com/c/kernel/common/+/1914696
> >
>
> Maybe I need to rephrase. I said that GROUP_THROTTLE can give us some
> energy benefits. I didn't draw any conclusion that this was why it was
> invented. I was also trying to suggest that, under sum aggregation, the
> same code can work for both uclamp and GROUP_THROTTLE. However, I
> probably need to better understand the issues around cpu.shares before
> saying this concretely.
>
> Also, uclamp filtering might be another example why I think max
> aggregation is over-complicated. Sum aggregation has only around 300
> lines of code so far without the need for filtering at all.
>
> >> big CPU. However, a major difference is that this series has an
> >> amortized cost of O(1), instead of O(n) in cpu_util_next() in Android
> >> kernels.
> >>
> >> It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum
> >
> > This is not a short coming. Performance implies frequency. We should not
> > confused how schedutil works to estimate perf requirement based on historical
> > run time (util_avg) and the fact that a task with short historical runtime
> > needs to run faster. Running faster means increasing frequency.
> >
> >> aggregation, the scheduler sees the sum aggregated util_avg_uclamp and
> >> will avoid the problem of enqueueing too many tasks just to fit the
> >> frequency, and will place tasks on little CPUs in the previous example.
> >
> > I think the issue of uclamp being a performance but not a bandwidth hint was
> > raised several times. Did I miss something that makes this not converting it to
> > a bandwidth hint?
> >
>
> I wonder what does performance hint even mean if it doesn't at least
> correlate to bandwidth in some way? Yes, specifying uclamp_min under max
> aggregation can run the CPU at least at that frequency, but if you have
> N tasks on the same rq, then each task can only get a share of
> uclamp_min/N. I highly doubt that when a mobile phone developer
> increases the uclamp_min of an application, he or she is okay with
> doubling the CPU frequency when it runs and running it only for 1/10 of
> the time because it gets moved to a crowded big CPU, which essentially
> gives 0.2x the "performance" in my opinion, so I'm not sure fully
> disconnecting bandwidth from performance is meaningful. Of course, a
> (not yet) uclamp-aware load balancer could come in and migrate those
> tasks, but why not avoid this problem in the first place via sum
> aggregation without involving the load balancer, especially when today
> we don't have any indication of how well a uclamp-aware load balancer
> can perform anyway?
>
> In the end I may still not be able to fully grasp what the issues of sum
> aggregations are. I think so far I get that it's closer to "bandwidth"
> and not "performance" (which I think "performance" doesn't mean much if
> it doesn't correlate to bandwidth) and it doesn't work as a CPU affinity
> hint (it actually does to some degree, but the uclamp values need to be
> different). But I think focusing on these ignores the whole picture,
> which are the benefits of sum aggregation (already presented in the
> evaluation section of the cover letter of this series):
>
> 1. It's fewer than 350 lines of code (max aggregation is 750+ and still
> growing).
> 2. It doesn't suffer from frequency spikes or load balancing issues.
> 3. It needs no code to consider uclamp as a corner case in the
> scheduler. No need to carry around uclamp_min and uclamp_max in so many
> functions. util_fits_cpu() is just one line of code.
> 4. It's effective. When you say a uclamp_min of X, the scheduler really
> tries to give you X for your task, not for the CPU you are on, as seen
> in 54% reduction of jank in jankbench, without increasing energy
> consumption.
> 5. It looks like something multiple developers want from LPC 2023. The
> use cases of "additive uclamp", and hinting the host from the VM via
> util_guest can be achieved exactly by uclamp sum aggregation.
>
> Given these benefits, I think it's probably a good idea to give sum
> aggregation a serious look. On my Pixel 6, sum aggregation either gives
> better benchmark scores at the same energy as max aggregation, or
> consumes less power at the same benchmark scores. I'll keep benchmarking
> it but so far so good.
>
> Hongyan

2023-12-05 15:19:15

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] sched: uclamp sum aggregation

On 04/12/2023 16:12, Vincent Guittot wrote:
> On Mon, 4 Dec 2023 at 02:48, Hongyan Xia <[email protected]> wrote:
>>
>> [...]
>>
>> Other shortcomings are not that critical, but the fact that uclamp_min's
>> effectiveness is divided by N under max aggregation I think is not
>> acceptable.
>
> Change EAS task placement policy in this case to take into account
> actual utilization and uclamp_min/max

Thank you. I agree. I want to emphasize this specifically because this
is exactly what I'm trying to do. The whole series can be rephrased in a
different way:

- The PELT signal is distorted when uclamp is active.
- Let's consider the [PELT, uclamp_min, uclamp_max] tuple.
- Always carrying all three variables is too much, but [PELT,
clamped(PELT)] is an approximation that works really well.

Of course, I'll explore whether there's a way to make things less messy.
I just realized why I didn't do things the util_est way but instead
clamp PELT directly: it's because util_est boosts util_avg and can't
work for uclamp_max. I'll keep exploring options.
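
For clarity, here is a minimal sketch of the [PELT, clamped(PELT)] idea
(plain C, all structures and helpers are made up for illustration; this
is only the shape of the approach, not the actual patches): keep the
clamped signal next to util_avg per entity, and maintain the cfs_rq
value as the sum over its enqueued entities.

/* Minimal sketch, not the actual patches: a per-task clamped PELT
 * signal plus a cfs_rq-level sum maintained on enqueue. */
#include <stdio.h>

struct toy_entity {
	unsigned int util_avg;		/* ordinary PELT utilization */
	unsigned int uclamp_min;
	unsigned int uclamp_max;
	unsigned int util_avg_uclamp;	/* clamped(PELT) */
};

struct toy_cfs_rq {
	unsigned int util_avg_uclamp;	/* sum over enqueued entities */
};

static unsigned int clamp_val(unsigned int v, unsigned int lo, unsigned int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Refresh the per-task signal whenever util_avg is updated. */
static void toy_update_entity(struct toy_entity *se)
{
	se->util_avg_uclamp = clamp_val(se->util_avg, se->uclamp_min,
					se->uclamp_max);
}

/* The cfs_rq signal is always the sum of its enqueued entities. */
static void toy_enqueue(struct toy_cfs_rq *cfs_rq, struct toy_entity *se)
{
	toy_update_entity(se);
	cfs_rq->util_avg_uclamp += se->util_avg_uclamp;
}

int main(void)
{
	struct toy_cfs_rq rq = { 0 };
	/* An always-running task capped at 300, plus an uncapped 200 task. */
	struct toy_entity a = { 1024, 0, 300, 0 };
	struct toy_entity b = { 200, 0, 1024, 0 };

	toy_enqueue(&rq, &a);
	toy_enqueue(&rq, &b);

	/* Prints 500 (300 + 200), not 1224: B joining causes no spike. */
	printf("cfs_rq util_avg_uclamp = %u\n", rq.util_avg_uclamp);
	return 0;
}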

>> [...]

2023-12-05 16:27:59

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] sched: uclamp sum aggregation

On Tue, 5 Dec 2023 at 16:19, Hongyan Xia <[email protected]> wrote:
>
> On 04/12/2023 16:12, Vincent Guittot wrote:
> > On Mon, 4 Dec 2023 at 02:48, Hongyan Xia <[email protected]> wrote:
> >>
> >> [...]
> >>
> >> Other shortcomings are not that critical, but the fact that uclamp_min's
> >> effectiveness is divided by N under max aggregation I think is not
> >> acceptable.
> >
> > Change EAS task placement policy in this case to take into account
> > actual utilization and uclamp_min/max
>
> Thank you. I agree. I want to emphasize this specifically because this
> is exactly what I'm trying to do. The whole series can be rephrased in a
> different way:
>
> - The PELT signal is distorted when uclamp is active.

Sorry but no it's not

> - Let's consider the [PELT, uclamp_min, uclamp_max] tuple.

That's what we are already doing with effective_cpu_util. We might
want to improve how we use them in EAS but that's another story than
your proposal

> - Always carrying all three variables is too much, but [PELT,
> clamped(PELT)] is an approximation that works really well.

As said before, it's a no go for this mix.

>
> Of course, I'll explore if there's a way to make things less messy. I
> just realized why I didn't do things util_est way but instead directly
> clamping on PELT, it's because util_est boosts util_avg and can't work
> for uclamp_max. I'll keep exploring options.
>
> >> [...]

2023-12-05 17:24:25

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] sched: uclamp sum aggregation

On 05/12/2023 16:26, Vincent Guittot wrote:
> On Tue, 5 Dec 2023 at 16:19, Hongyan Xia <[email protected]> wrote:
>>
>> On 04/12/2023 16:12, Vincent Guittot wrote:
>>> On Mon, 4 Dec 2023 at 02:48, Hongyan Xia <[email protected]> wrote:
>>>>
>>>> [...]
>>>>
>>>> Other shortcomings are not that critical, but the fact that uclamp_min's
>>>> effectiveness is divided by N under max aggregation I think is not
>>>> acceptable.
>>>
>>> Change EAS task placement policy in this case to take into account
>>> actual utilization and uclamp_min/max
>>
>> Thank you. I agree. I want to emphasize this specifically because this
>> is exactly what I'm trying to do. The whole series can be rephrased in a
>> different way:
>>
>> - The PELT signal is distorted when uclamp is active.
>
> Sorry but no it's not
>
>> - Let's consider the [PELT, uclamp_min, uclamp_max] tuple.
>
> That's what we are already doing with effective_cpu_util. We might
> want to improve how we use them in EAS but that's another story than
> your proposal

It's different. We never capture how we *got* the PELT value. If we wake
up a task, what we have now is the following:

[p->util_avg, p->uclamp_min, p->uclamp_max, target_rq->uclamp_min,
target_rq->uclamp_max]

But to best understand how big this task really was, we want:

[p->util_avg, previous_rq->uclamp_min_back_then,
previous_rq->uclamp_max_back_then]

Without such information, issues cannot be avoided because we have no
idea how big the task really was. Frequency spikes are just one of the
symptoms of mis-interpreting how big the task was.

>> - Always carrying all three variables is too much, but [PELT,
>> clamped(PELT)] is an approximation that works really well.
>
> As said before. It's a no go for this mix

I see your concern. To rephrase this series again, I'm simply arguing that

[p->util_avg, previous_uclamp_min, previous_uclamp_max]

is better than

[p->util_avg, p->uclamp_min, p->uclamp_max, target_rq->uclamp_min,
target_rq->uclamp_max] plus the code to mitigate the issues

in estimating how big the task is.

I anticipate this series to be significantly smaller than the current
max aggregation approach plus future code to mitigate the problems, but
I'll keep trying to improve it to hopefully address your concerns.

>>
>> Of course, I'll explore if there's a way to make things less messy. I
>> just realized why I didn't do things util_est way but instead directly
>> clamping on PELT, it's because util_est boosts util_avg and can't work
>> for uclamp_max. I'll keep exploring options.
>>
>>>> [...]