2024-02-01 13:12:24

by Hongyan Xia

Subject: [RFC PATCH v2 0/7] uclamp sum aggregation

The current implementation of uclamp leaves many things to be desired.
There are three noteworthy problems:

1. Max aggregation only takes into account the max uclamp value. All
other uclamp values are not in effect.
2. Uclamp max aggregation gives UCLAMP_MIN and UCLAMP_MAX at the rq
level, and whether that translates to the desired performance of a
specific task is unknown, because it depends on the other tasks on the
rq.
3. Complexity. Uclamp max aggregation now sits at more than 750 lines of
code and there is ongoing work to rework several interfaces to
prepare for further uclamp changes. Uclamp max aggregation itself
also needs future improvements in uclamp filtering and load balancing.

The first 2 points can manifest as the following symptoms:

1. High-rate OPP changes (the "frequency spike" problem). An
always-running task with a UCLAMP_MAX of 200 will drive the CPU at 200
even though its utilization is 1024. However, when a task with a
util_avg of 300 but the default UCLAMP_MAX of 1024 joins the rq, the
rq UCLAMP_MAX is uncapped and the UCLAMP_MAX of the first task is no
longer in effect, driving the CPU at 1024, the highest OPP. When the
second task sleeps, the OPP is reduced back to 200. This fast and
sudden OPP switch every time the 2nd task wakes up or sleeps is
unnecessary.
2. Using UCLAMP_MIN to boost performance under max aggregation has been
shown to be less effective than "sum aggregated" approaches, including
the util_guest proposal [1] and the uclamp sum aggregation in this
series. The performance level a task gets from UCLAMP_MIN under max
aggregation is unpredictable when there is more than one runnable task
on the rq.

This series solves these problems by tracking a util_avg_uclamp signal
in tasks and in the root cfs_rq. At the task level,
p->se.avg.util_avg_uclamp basically tracks the normal util_avg, but
clamped within the task's uclamp min and max. At the cfs_rq level,
util_avg_uclamp must always be the sum of the util_avg_uclamp of all
entities on this cfs_rq. As a result, rq->cfs.avg.util_avg_uclamp is the
sum aggregation of all the clamped values, which hints at the frequency
this rq should run at and what its utilization is. This proposal has
some similarities to the util_guest series for VM workloads, in that it
brings the desired performance to the task that requested it, not to the
rq, where the share of each task is unpredictable.

Note: This new signal does not change the existing PELT signal. The new
signal is only an extra hint to the scheduler to improve scheduling
decisions.
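To make the invariant concrete, here is a minimal, illustrative sketch
(not the actual implementation: helper names like task_util_uclamp() and
rq_util_uclamp_sum() are made up for this example, and the series
maintains the sum incrementally at enqueue/dequeue/update time rather
than by iterating over tasks):

/* Task level: the normal PELT util_avg, clamped by the task's hints. */
static inline unsigned long task_util_uclamp(struct task_struct *p)
{
        unsigned long util = READ_ONCE(p->se.avg.util_avg);

        return clamp(util, uclamp_eff_value(p, UCLAMP_MIN),
                           uclamp_eff_value(p, UCLAMP_MAX));
}

/*
 * rq level: the sum of the clamped contributions of all runnable CFS
 * tasks, which is the value rq->cfs.avg.util_avg_uclamp is kept equal to.
 */
static unsigned long rq_util_uclamp_sum(struct rq *rq)
{
        struct task_struct *p;
        unsigned long sum = 0;

        list_for_each_entry(p, &rq->cfs_tasks, se.group_node)
                sum += task_util_uclamp(p);

        return sum;
}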

TL;DR OF THIS SERIES:

- Our evaluation shows that sum aggregation is significantly more
effective than max aggregation. UI benchmarks and VM workloads show
improved latency and higher scores at the same or even reduced power
consumption.
- For other benchmarks that do not involve uclamp, we have not observed
any noticeable regressions.
- This series is the entirety of sum aggregation. No groundwork or
refactoring in the scheduler is needed. The complexity of several
existing uclamp-related functions is massively reduced and sum
aggregation code is less than half of that in max aggregation (304+,
701-). The complexity gap will be even greater with all the ongoing
patches for max aggregation.

DECOMPOSITION OF SUM AGGREGATION:

- Patch 1 reverts a max aggregation mitigation patch. Sum aggregation
does not show the problem it mitigates, so the patch is unnecessary,
and it also has other undesirable side effects.
- Patches 2 and 3 introduce new sum-aggregated signals that serve as a
more effective hint to the scheduler. Patch 3 employs a math trick to
make it significantly simpler to track on-rq and off-rq task
utilization contributions.
- Patches 4, 5 and 6 start using the new signal while significantly
simplifying existing uclamp code, including the total removal of
uclamp buckets and max aggregation.
- Patch 7 and part of Patch 6 remove the overhead of uclamp on task
enqueue and dequeue, because uclamp values and buckets no longer need
to be re-computed every time a uclamp'ed task joins or leaves the rq.

TESTING:

Sum aggregation generally performs better in tests. Two notebooks, max
vs. sum aggregation, are shared at

https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/max.ipynb
https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/sum.ipynb

The experiments in the notebooks were done on an Arm Juno r2 board.
CPU0-3 are little cores with a capacity of 383. CPU4-5 are big cores.
The rt-app profiles used for these experiments are included in the
notebooks.

Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.

The scheduling decisions are plotted in Out[11]. Both max and sum
aggregation recognize the UCLAMP_MAX hints and run all the threads
on the little Performance Domain (PD). However, max aggregation in this
test shows some randomness in task placement, and we see that the 4
threads are often not evenly distributed across the 4 little CPUs. This
uneven task placement is also the reason why we revert the patch:

"sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

When the spare capacity is 0, the reverted patch tends to gather many
tasks on the same CPU, because its Energy Model (EM) calculation is
bogus and it thinks this placement is more energy efficient.

Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
slightly above the capacity of the little CPU.

Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
on the big cores but not run them at the highest OPP. Both max and sum
aggregation accomplish this successfully, running the threads on the
big cluster while not driving the frequency to the max.

Scenario 3: Task A is a task with small utilization, pinned to CPU4.
Task B is an always-running task pinned to CPU5, but with UCLAMP_MAX
capped at 300. After a while, task A is also pinned to CPU5, joining B.

Results are in Out[23]. The util_avg curve is the original root CFS
util_avg. The root_cfs_util_uclamp is the root CFS utilization after
considering uclamp. Max aggregation sees a frequency spike at 751.7s.
When zoomed in, one can see square-wave-like utilization and CPU
frequency values caused by task A periodically going to sleep. When A
wakes up, its default UCLAMP_MAX of 1024 uncaps the rq and the CPU runs
at the highest frequency. When A sleeps, B's UCLAMP_MAX takes effect
and reduces the rq utilization to 300. This happens repeatedly, hence
the square wave. In contrast, sum aggregation sees a normal increase in
utilization when A joins B, at around 430.64s, without any square-wave
behavior. The CPU frequency also stays stable while the two tasks are on
the same rq.

Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the
little PD (CPU0-3). 4 identical tasks pinned to the big PD (CPU4-5).
After a while, remove the CPU pinning of the 4 tasks on the big PD.

Results are in Out[29]. Both max and sum aggregation understand that we
can move the 4 tasks from the big PD to the little PD to reduce power
because of the UCLAMP_MAX hints. However, max aggregation shows severely
unbalanced task placement, scheduling 5 tasks on CPU0 while 1 each on
CPU1-3. Sum aggregation schedules 2 tasks on each little CPU, honoring
UCLAMP_MAX while maintaining balanced task placement.

Again, this unbalanced task placement is the reason why we reverted:

"sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

Scenario 5: 8 tasks with UCLAMP_MAX of 120.

This test is similar to Scenario 4, shown in Out[35]. Both max and sum
aggregation understand the UCLAMP_MAX hints and schedule all tasks on
the 4 little CPUs. Max aggregation again shows unstable and unbalanced
task placement while sum aggregation schedules 2 tasks on each little
CPU, and the task placement remains stable. The total task residency is
shown in Out[36], which highlights how unbalanced max aggregation is.

BENCHMARKS:

Geekbench 6, no uclamp (on Rock-5B board)
+-----+-------------+------------+
| | Single-core | Multi-core |
+-----+-------------+------------+
| Max | 801.3 | 2976.8 |
| Sum | 802.8 | 2989.2 |
+-----+-------------+------------+

No regression is seen after switching to sum aggregation.

Jankbench (backporting sum aggregation to Pixel 6 Android 5.18 mainline kernel):

Jank percentage:
+------+-------+-----------+
| | value | perc_diff |
+------+-------+-----------+
| main | 1.1 | 0.00% |
| sum | 0.5 | -53.92% |
+------+-------+-----------+

Average power:
+------------+------+-------+-----------+
| | tag | value | perc_diff |
+------------+------+-------+-----------+
| CPU | max | 166.1 | 0.00% |
| CPU-Big | max | 55.1 | 0.00% |
| CPU-Little | max | 91.7 | 0.00% |
| CPU-Mid | max | 19.2 | 0.00% |
| CPU | sum | 161.3 | -2.85% |
| CPU-Big | sum | 52.9 | -3.97% |
| CPU-Little | sum | 86.7 | -5.39% |
| CPU-Mid | sum | 21.6 | 12.63% |
+------------+------+-------+-----------+

UIBench (backporting sum aggregation to Pixel 6 Android 6.6 mainline kernel):

Jank percentage:
+-------------+-------+-----------+
| tag | value | perc_diff |
+-------------+-------+-----------+
| max_aggr | 0.3 | 0.0 |
| sum_aggr | 0.26 | -12.5 |
+-------------+-------+-----------+

Average input latency:
+-------------+--------+-----------+
| tag | value | perc_diff |
+-------------+--------+-----------+
| max_aggr | 107.39 | 0.0 |
| sum_aggr | 81.135 | -24.5 |
+-------------+--------+-----------+

Average power:
+------------+--------------+--------+-----------+
| channel | tag | value | perc_diff |
+------------+--------------+--------+-----------+
| CPU | max_aggr | 209.85 | 0.0% |
| CPU-Big | max_aggr | 89.8 | 0.0% |
| CPU-Little | max_aggr | 94.45 | 0.0% |
| CPU-Mid | max_aggr | 25.45 | 0.0% |
| GPU | max_aggr | 22.9 | 0.0% |
| Total | max_aggr | 232.75 | 0.0% |
| CPU | sum_aggr | 206.05 | -1.81% |
| CPU-Big | sum_aggr | 84.7 | -5.68% |
| CPU-Little | sum_aggr | 94.9 | 0.48% |
| CPU-Mid | sum_aggr | 26.3 | 3.34% |
| GPU | sum_aggr | 22.45 | -1.97% |
| Total | sum_aggr | 228.5 | -1.83% |
+------------+--------------+--------+-----------+

It should be noted that sum aggregation reduces both jank and input
latency while consuming less power.

VM cpufreq hypercall driver [1], on Rock-5B board. Baseline indicates a
setup without the VM cpufreq driver:

Geekbench 6 uncontended. No other host threads running.
+------+-------------+-----------+------------+-----------+
| | Single-core | perc_diff | Multi-core | perc_diff |
+------+-------------+-----------+------------+-----------+
| base | 796.4 | 0 | 2947.0 | 0 |
| max | 795.6 | -0.10 | 2929.6 | -0.59 |
| sum | 794.6 | -0.23 | 2935.6 | -0.39 |
+------+-------------+-----------+------------+-----------+

Geekbench 6 contended. Each host CPU has a 50% duty-cycle task running.
+------+-------------+-----------+------------+-----------+
| | Single-core | perc_diff | Multi-core | perc_diff |
+------+-------------+-----------+------------+-----------+
| base | 604.6 | 0 | 2330.2 | 0 |
| max | 599.4 | -0.86 | 2327.2 | -0.13 |
| sum | 645.8 | 6.81 | 2336.6 | 0.27 |
+------+-------------+-----------+------------+-----------+

The VM cpufreq driver using sum aggregation outperforms max aggregation
when the host is contended. When the host has no contention (only the VM
vCPUs are running and each host CPU accommodates one guest vCPU), the
two aggregation methods are roughly the same and, a bit surprisingly,
offer no speed-up versus the baseline. This is probably because of the
overhead of hypercalls, and because Geekbench is CPU-intensive and not
the best workload to show the effectiveness of the VM cpufreq driver.
We will try to benchmark more VM workloads.

LIMITATIONS:

1. RT sum aggregation is not shown in the series.

A proof-of-concept RT sum aggregation implementation is done and going
through testing, with < 50 lines of code, using the same ideas as in
CFS. It will be sent out separately if we can agree on CFS sum
aggregation and once the testing is done.

2. A heavily UCLAMP_MAX-throttled task may prevent the CFS rq from
triggering over-utilization.

For example, consider two always-running tasks, each with a utilization
of 512. If one of the tasks is severely UCLAMP_MAX restricted, say with
a UCLAMP_MAX of 1, then the total CFS sum aggregation will be 512 + 1 =
513, which won't trigger over-utilization even though the other task has
no UCLAMP_MAX and wants more performance.
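As a rough, standalone illustration of the arithmetic (the ~20% headroom
constants below mirror mainline's fits_capacity() check at the time of
writing and are an assumption of this example, not part of the series):

#include <stdbool.h>
#include <stdio.h>

/* Assumed ~20% headroom check, mirroring mainline's fits_capacity(). */
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

int main(void)
{
        unsigned long rq_util = 512 + 1; /* unclamped task + UCLAMP_MAX=1 task */
        unsigned long cpu_cap = 1024;

        /* 513 * 1280 = 656640 < 1024 * 1024 = 1048576, so the CPU still "fits". */
        bool overutilized = !fits_capacity(rq_util, cpu_cap);

        printf("overutilized: %d\n", overutilized); /* prints 0 */
        return 0;
}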

I'm working on a fix for this problem. At the moment it can be solved
either by not giving long-running tasks ridiculously low UCLAMP_MAX
values, or by adjusting the priority of UCLAMP_MAX tasks to make sure
they do not get a share of CPU run-time that vastly exceeds their
UCLAMP_MAX. However, my personal view is that maybe UCLAMP_MIN and
UCLAMP_MAX just don't belong together, and the proper way is to have a
per-task bandwidth throttling mechanism; what we want from UCLAMP_MAX
may actually belong to that mechanism.

However, since the Android GROUP_THROTTLE feature [2] has the exact same
problem and has been used in actual products, we don't think this is a
big limitation in practice.

[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510

---
Changed in v2:
- Rework util_avg_uclamp to be closer to the style of util_est.
- Rewrite patch notes to reflect the new style.
- Add the discussion of the under-utilized example in limitations,
found by Vincent G.
- Remove task group uclamp to focus on tasks first.
- Fix several bugs in task migration.
- Add benchmark numbers from UIBench and VM cpufreq.
- Update python notebooks to reflect the latest max vs. sum aggregation.

Hongyan Xia (7):
Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
0"
sched/uclamp: Track uclamped util_avg in sched_avg
sched/uclamp: Introduce root_cfs_util_uclamp for rq
sched/fair: Use CFS util_avg_uclamp for utilization and frequency
sched/fair: Massively simplify util_fits_cpu()
sched/uclamp: Remove all uclamp bucket logic
sched/uclamp: Simplify uclamp_eff_value()

include/linux/sched.h | 7 +-
init/Kconfig | 32 ---
kernel/sched/core.c | 324 +++---------------------------
kernel/sched/cpufreq_schedutil.c | 10 +-
kernel/sched/fair.c | 333 +++++++------------------------
kernel/sched/pelt.c | 144 ++++++++++++-
kernel/sched/pelt.h | 6 +-
kernel/sched/rt.c | 4 -
kernel/sched/sched.h | 145 ++++++--------
9 files changed, 304 insertions(+), 701 deletions(-)

--
2.34.1



2024-02-01 13:12:31

by Hongyan Xia

Subject: [RFC PATCH v2 1/7] Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

From: Hongyan Xia <[email protected]>

That commit creates further problems because 0 spare capacity can be
either a real indication that the CPU is maxed out, or a sign that the
CPU is UCLAMP_MAX throttled, but we end up giving all of them a chance,
which can result in bogus energy calculations. It also tends to schedule
tasks on the same CPU and requires load balancing patches. Sum
aggregation solves these problems and this patch is not needed.

This reverts commit 6b00a40147653c8ea748e8f4396510f252763364.

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/fair.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030c3a03..d5cc87db4845 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7978,10 +7978,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
for (; pd; pd = pd->next) {
unsigned long util_min = p_util_min, util_max = p_util_max;
unsigned long cpu_cap, cpu_thermal_cap, util;
- long prev_spare_cap = -1, max_spare_cap = -1;
+ unsigned long cur_delta, max_spare_cap = 0;
unsigned long rq_util_min, rq_util_max;
- unsigned long cur_delta, base_energy;
+ unsigned long prev_spare_cap = 0;
int max_spare_cap_cpu = -1;
+ unsigned long base_energy;
int fits, max_fits = -1;

cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
@@ -8044,7 +8045,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
prev_spare_cap = cpu_cap;
prev_fits = fits;
} else if ((fits > max_fits) ||
- ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
+ ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
/*
* Find the CPU with the maximum spare capacity
* among the remaining CPUs in the performance
@@ -8056,7 +8057,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
}
}

- if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
+ if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
continue;

eenv_pd_busy_time(&eenv, cpus, p);
@@ -8064,7 +8065,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
base_energy = compute_energy(&eenv, pd, cpus, p, -1);

/* Evaluate the energy impact of using prev_cpu. */
- if (prev_spare_cap > -1) {
+ if (prev_spare_cap > 0) {
prev_delta = compute_energy(&eenv, pd, cpus, p,
prev_cpu);
/* CPU utilization has changed */
--
2.34.1


2024-02-01 13:13:31

by Hongyan Xia

Subject: [RFC PATCH v2 5/7] sched/fair: Massively simplify util_fits_cpu()

Currently, there's no way to distinguish between 1) a CPU that is
actually maxed out at its highest frequency and 2) one that is throttled
because of UCLAMP_MAX, since both present util_avg values of 1024. This
is problematic because when we try to pick a CPU for a task to run on,
we would like to give 2) a chance, or at least prefer 2) over 1).

Current upstream now gives all 0 spare capacity CPUs a chance to
consider queuing more tasks because there's a chance that 0 spare
capacity is due to UCLAMP_MAX. However, this creates further problems
because energy calculations are now bogus when spare capacity is already
0, and tasks tend to pile up on one CPU.

Fix this by using util_avg_uclamp in util_fits_cpu(). This way, case 1)
will still keep its utilization at 1024, whereas case 2) shows spare
capacity if the sum of util_avg_uclamp values is still under the CPU
capacity.

Under sum aggregation, checking whether a task fits a CPU becomes much
simpler. We simply do fits_capacity() and no longer need all kinds of
code checking corner cases for uclamp. This means util_fits_cpu()
returns to true and false instead of a tri-state value, simplifying a
huge amount of code.

[1]: https://lore.kernel.org/all/[email protected]/

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/fair.c | 253 ++++----------------------------------------
1 file changed, 23 insertions(+), 230 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b92739e1c52f..49997f1f58fb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4974,135 +4974,19 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
trace_sched_util_est_se_tp(&p->se);
}

-static inline int util_fits_cpu(unsigned long util,
- unsigned long uclamp_min,
- unsigned long uclamp_max,
- int cpu)
+/* util must be the uclamp'ed value (i.e. from util_avg_uclamp). */
+static inline int util_fits_cpu(unsigned long util, int cpu)
{
- unsigned long capacity_orig, capacity_orig_thermal;
unsigned long capacity = capacity_of(cpu);
- bool fits, uclamp_max_fits;

- /*
- * Check if the real util fits without any uclamp boost/cap applied.
- */
- fits = fits_capacity(util, capacity);
-
- if (!uclamp_is_used())
- return fits;
-
- /*
- * We must use arch_scale_cpu_capacity() for comparing against uclamp_min and
- * uclamp_max. We only care about capacity pressure (by using
- * capacity_of()) for comparing against the real util.
- *
- * If a task is boosted to 1024 for example, we don't want a tiny
- * pressure to skew the check whether it fits a CPU or not.
- *
- * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
- * should fit a little cpu even if there's some pressure.
- *
- * Only exception is for thermal pressure since it has a direct impact
- * on available OPP of the system.
- *
- * We honour it for uclamp_min only as a drop in performance level
- * could result in not getting the requested minimum performance level.
- *
- * For uclamp_max, we can tolerate a drop in performance level as the
- * goal is to cap the task. So it's okay if it's getting less.
- */
- capacity_orig = arch_scale_cpu_capacity(cpu);
- capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
-
- /*
- * We want to force a task to fit a cpu as implied by uclamp_max.
- * But we do have some corner cases to cater for..
- *
- *
- * C=z
- * | ___
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | |
- * | | | | | | | (util somewhere in this region)
- * | | | | | | |
- * | | | | | | |
- * +----------------------------------------
- * cpu0 cpu1 cpu2
- *
- * In the above example if a task is capped to a specific performance
- * point, y, then when:
- *
- * * util = 80% of x then it does not fit on cpu0 and should migrate
- * to cpu1
- * * util = 80% of y then it is forced to fit on cpu1 to honour
- * uclamp_max request.
- *
- * which is what we're enforcing here. A task always fits if
- * uclamp_max <= capacity_orig. But when uclamp_max > capacity_orig,
- * the normal upmigration rules should withhold still.
- *
- * Only exception is when we are on max capacity, then we need to be
- * careful not to block overutilized state. This is so because:
- *
- * 1. There's no concept of capping at max_capacity! We can't go
- * beyond this performance level anyway.
- * 2. The system is being saturated when we're operating near
- * max capacity, it doesn't make sense to block overutilized.
- */
- uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
- uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
- fits = fits || uclamp_max_fits;
-
- /*
- *
- * C=z
- * | ___ (region a, capped, util >= uclamp_max)
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | | (region b, uclamp_min <= util <= uclamp_max)
- * |_ _ _|_ _|_ _ _ _| _ | _ _ _| _ | _ _ _ _ _ uclamp_min
- * | | | | | | |
- * | | | | | | | (region c, boosted, util < uclamp_min)
- * +----------------------------------------
- * cpu0 cpu1 cpu2
- *
- * a) If util > uclamp_max, then we're capped, we don't care about
- * actual fitness value here. We only care if uclamp_max fits
- * capacity without taking margin/pressure into account.
- * See comment above.
- *
- * b) If uclamp_min <= util <= uclamp_max, then the normal
- * fits_capacity() rules apply. Except we need to ensure that we
- * enforce we remain within uclamp_max, see comment above.
- *
- * c) If util < uclamp_min, then we are boosted. Same as (b) but we
- * need to take into account the boosted value fits the CPU without
- * taking margin/pressure into account.
- *
- * Cases (a) and (b) are handled in the 'fits' variable already. We
- * just need to consider an extra check for case (c) after ensuring we
- * handle the case uclamp_min > uclamp_max.
- */
- uclamp_min = min(uclamp_min, uclamp_max);
- if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
- return -1;
-
- return fits;
+ return fits_capacity(util, capacity);
}

static inline int task_fits_cpu(struct task_struct *p, int cpu)
{
- unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
- unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
unsigned long util = task_util_est(p);
- /*
- * Return true only if the cpu fully fits the task requirements, which
- * include the utilization but also the performance hints.
- */
- return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
+
+ return util_fits_cpu(util, cpu);
}

static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
@@ -6678,11 +6562,8 @@ static inline void hrtick_update(struct rq *rq)
#ifdef CONFIG_SMP
static inline bool cpu_overutilized(int cpu)
{
- unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
- unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
-
/* Return true only if the utilization doesn't fit CPU's capacity */
- return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+ return !util_fits_cpu(cpu_util_cfs(cpu), cpu);
}

static inline void update_overutilized_status(struct rq *rq)
@@ -7463,8 +7344,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
- unsigned long task_util, util_min, util_max, best_cap = 0;
- int fits, best_fits = 0;
+ unsigned long task_util, best_cap = 0;
int cpu, best_cpu = -1;
struct cpumask *cpus;

@@ -7472,8 +7352,6 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

task_util = task_util_est(p);
- util_min = uclamp_eff_value(p, UCLAMP_MIN);
- util_max = uclamp_eff_value(p, UCLAMP_MAX);

for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);
@@ -7481,44 +7359,22 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
continue;

- fits = util_fits_cpu(task_util, util_min, util_max, cpu);
-
- /* This CPU fits with all requirements */
- if (fits > 0)
+ if (util_fits_cpu(task_util, cpu))
return cpu;
- /*
- * Only the min performance hint (i.e. uclamp_min) doesn't fit.
- * Look for the CPU with best capacity.
- */
- else if (fits < 0)
- cpu_cap = arch_scale_cpu_capacity(cpu) - thermal_load_avg(cpu_rq(cpu));

- /*
- * First, select CPU which fits better (-1 being better than 0).
- * Then, select the one with best capacity at same level.
- */
- if ((fits < best_fits) ||
- ((fits == best_fits) && (cpu_cap > best_cap))) {
+ if (cpu_cap > best_cap) {
best_cap = cpu_cap;
best_cpu = cpu;
- best_fits = fits;
}
}

return best_cpu;
}

-static inline bool asym_fits_cpu(unsigned long util,
- unsigned long util_min,
- unsigned long util_max,
- int cpu)
+static inline bool asym_fits_cpu(unsigned long util, int cpu)
{
if (sched_asym_cpucap_active())
- /*
- * Return true only if the cpu fully fits the task requirements
- * which include the utilization and the performance hints.
- */
- return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ return util_fits_cpu(util, cpu);

return true;
}
@@ -7530,7 +7386,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
bool has_idle_core = false;
struct sched_domain *sd;
- unsigned long task_util, util_min, util_max;
+ unsigned long task_util;
int i, recent_used_cpu, prev_aff = -1;

/*
@@ -7540,8 +7396,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (sched_asym_cpucap_active()) {
sync_entity_load_avg(&p->se);
task_util = task_util_est(p);
- util_min = uclamp_eff_value(p, UCLAMP_MIN);
- util_max = uclamp_eff_value(p, UCLAMP_MAX);
}

/*
@@ -7550,7 +7404,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
lockdep_assert_irqs_disabled();

if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
- asym_fits_cpu(task_util, util_min, util_max, target))
+ asym_fits_cpu(task_util, target))
return target;

/*
@@ -7558,7 +7412,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
- asym_fits_cpu(task_util, util_min, util_max, prev)) {
+ asym_fits_cpu(task_util, prev)) {

if (!static_branch_unlikely(&sched_cluster_active) ||
cpus_share_resources(prev, target))
@@ -7579,7 +7433,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
in_task() &&
prev == smp_processor_id() &&
this_rq()->nr_running <= 1 &&
- asym_fits_cpu(task_util, util_min, util_max, prev)) {
+ asym_fits_cpu(task_util, prev)) {
return prev;
}

@@ -7591,7 +7445,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
cpus_share_cache(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
- asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
+ asym_fits_cpu(task_util, recent_used_cpu)) {

if (!static_branch_unlikely(&sched_cluster_active) ||
cpus_share_resources(recent_used_cpu, target))
@@ -7966,13 +7820,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
- unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
- unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
struct root_domain *rd = this_rq()->rd;
int cpu, best_energy_cpu, target = -1;
- int prev_fits = -1, best_fits = -1;
- unsigned long best_thermal_cap = 0;
- unsigned long prev_thermal_cap = 0;
struct sched_domain *sd;
struct perf_domain *pd;
struct energy_env eenv;
@@ -8001,14 +7850,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv_task_busy_time(&eenv, p, prev_cpu);

for (; pd; pd = pd->next) {
- unsigned long util_min = p_util_min, util_max = p_util_max;
unsigned long cpu_cap, cpu_thermal_cap, util;
unsigned long cur_delta, max_spare_cap = 0;
- unsigned long rq_util_min, rq_util_max;
unsigned long prev_spare_cap = 0;
int max_spare_cap_cpu = -1;
unsigned long base_energy;
- int fits, max_fits = -1;

cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);

@@ -8024,8 +7870,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv.pd_cap = 0;

for_each_cpu(cpu, cpus) {
- struct rq *rq = cpu_rq(cpu);
-
eenv.pd_cap += cpu_thermal_cap;

if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
@@ -8036,31 +7880,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

util = cpu_util(cpu, p, cpu, 0);
cpu_cap = capacity_of(cpu);
-
- /*
- * Skip CPUs that cannot satisfy the capacity request.
- * IOW, placing the task there would make the CPU
- * overutilized. Take uclamp into account to see how
- * much capacity we can get out of the CPU; this is
- * aligned with sched_cpu_util().
- */
- if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
- /*
- * Open code uclamp_rq_util_with() except for
- * the clamp() part. Ie: apply max aggregation
- * only. util_fits_cpu() logic requires to
- * operate on non clamped util but must use the
- * max-aggregated uclamp_{min, max}.
- */
- rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
- rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
-
- util_min = max(rq_util_min, p_util_min);
- util_max = max(rq_util_max, p_util_max);
- }
-
- fits = util_fits_cpu(util, util_min, util_max, cpu);
- if (!fits)
+ if (!util_fits_cpu(util, cpu))
continue;

lsub_positive(&cpu_cap, util);
@@ -8068,9 +7888,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (cpu == prev_cpu) {
/* Always use prev_cpu as a candidate. */
prev_spare_cap = cpu_cap;
- prev_fits = fits;
- } else if ((fits > max_fits) ||
- ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
+ } else if (cpu_cap > max_spare_cap) {
/*
* Find the CPU with the maximum spare capacity
* among the remaining CPUs in the performance
@@ -8078,7 +7896,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
*/
max_spare_cap = cpu_cap;
max_spare_cap_cpu = cpu;
- max_fits = fits;
}
}

@@ -8097,50 +7914,26 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (prev_delta < base_energy)
goto unlock;
prev_delta -= base_energy;
- prev_thermal_cap = cpu_thermal_cap;
best_delta = min(best_delta, prev_delta);
}

/* Evaluate the energy impact of using max_spare_cap_cpu. */
if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
- /* Current best energy cpu fits better */
- if (max_fits < best_fits)
- continue;
-
- /*
- * Both don't fit performance hint (i.e. uclamp_min)
- * but best energy cpu has better capacity.
- */
- if ((max_fits < 0) &&
- (cpu_thermal_cap <= best_thermal_cap))
- continue;
-
cur_delta = compute_energy(&eenv, pd, cpus, p,
max_spare_cap_cpu);
/* CPU utilization has changed */
if (cur_delta < base_energy)
goto unlock;
cur_delta -= base_energy;
-
- /*
- * Both fit for the task but best energy cpu has lower
- * energy impact.
- */
- if ((max_fits > 0) && (best_fits > 0) &&
- (cur_delta >= best_delta))
- continue;
-
- best_delta = cur_delta;
- best_energy_cpu = max_spare_cap_cpu;
- best_fits = max_fits;
- best_thermal_cap = cpu_thermal_cap;
+ if (cur_delta < best_delta) {
+ best_delta = cur_delta;
+ best_energy_cpu = max_spare_cap_cpu;
+ }
}
}
rcu_read_unlock();

- if ((best_fits > prev_fits) ||
- ((best_fits > 0) && (best_delta < prev_delta)) ||
- ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
+ if (best_delta < prev_delta)
target = best_energy_cpu;

return target;
--
2.34.1


2024-02-01 13:13:46

by Hongyan Xia

Subject: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

The problem with rq->cfs.avg.util_avg_uclamp is that it only tracks the
sum of the contributions of CFS tasks that are on the rq. However, CFS
tasks that belong to this CPU but have just been dequeued from rq->cfs
still have decaying contributions to the rq utilization due to PELT.
Introduce root_cfs_util_uclamp to capture the total utilization of CFS
tasks both on and off this rq.

Theoretically, keeping track of the sum of all tasks on a CPU (either
on or off the rq) requires us to periodically sample the decaying PELT
utilization of all off-rq tasks and then sum them up, which introduces
substantial extra code and overhead. However, we can avoid that
overhead, as shown in this example:

Let's assume 3 tasks, A, B and C. A is still on rq->cfs but B and C have
just been dequeued. The cfs.avg.util_avg_uclamp has dropped from A + B +
C to just A but the instantaneous utilization only just starts to decay
and is now still A + B + C. Let's denote root_cfs_util_uclamp_old as the
instantaneous total utilization right before B and C are dequeued.

After p periods, with y being the decay factor, the new
root_cfs_util_uclamp becomes:

root_cfs_util_uclamp
= A + B * y^p + C * y^p
= A + (A + B + C - A) * y^p
= cfs.avg.util_avg_uclamp +
(root_cfs_util_uclamp_old - cfs.avg.util_avg_uclamp) * y^p
= cfs.avg.util_avg_uclamp + diff * y^p

So, whenever we want to calculate the new root_cfs_util_uclamp
(including both on- and off-rq CFS tasks of a CPU), we could just decay
the diff between root_cfs_util_uclamp and cfs.avg.util_avg_uclamp, and
add the decayed diff to cfs.avg.util_avg_uclamp to obtain the new
root_cfs_util_uclamp, without bothering to periodically sample off-rq
CFS tasks and sum them up. This significantly reduces the overhead
needed to maintain this signal, and makes sure we now also include the
decaying contributions of CFS tasks that are dequeued.
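The patch implements this in ___update_util_uclamp_towards(); the
following is only a simplified sketch of the same idea, with the helper
name and the surrounding locking assumed for illustration:

/*
 * Simplified sketch of the diff-decay idea. decay_load() is the usual
 * PELT decay helper and 'periods' is the number of elapsed PELT periods.
 */
static void root_cfs_util_uclamp_decay(struct rq *rq, u64 periods)
{
        unsigned int sum  = READ_ONCE(rq->cfs.avg.util_avg_uclamp);
        unsigned int root = READ_ONCE(rq->root_cfs_util_uclamp);
        unsigned int diff;

        /* The invariant is root >= sum; nothing left to decay otherwise. */
        if (root <= sum) {
                WRITE_ONCE(rq->root_cfs_util_uclamp, sum);
                return;
        }

        /* Decay only the off-rq remainder instead of every dequeued task. */
        diff = decay_load(root - sum, periods);
        WRITE_ONCE(rq->root_cfs_util_uclamp, sum + diff);
}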

NOTE: In no way do we change how PELT and util_avg work. The original
PELT signal is kept as-is and is used when needed. The new signals,
util_avg_uclamp and root_cfs_util_uclamp, are additional hints to the
scheduler and are not meant to replace the original PELT signals.

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/fair.c | 7 +++
kernel/sched/pelt.c | 106 +++++++++++++++++++++++++++++++++++++++----
kernel/sched/pelt.h | 3 +-
kernel/sched/sched.h | 16 +++++++
4 files changed, 123 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f535c96463b..36357cfaf48d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6710,6 +6710,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
+ bool __maybe_unused migrated = p->se.avg.last_update_time == 0;

/*
* The code below (indirectly) updates schedutil which looks at
@@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
#ifdef CONFIG_UCLAMP_TASK
util_uclamp_enqueue(&rq->cfs.avg, p);
update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
+ if (migrated)
+ rq->root_cfs_util_uclamp += p->se.avg.util_avg_uclamp;
+ rq->root_cfs_util_uclamp = max(rq->root_cfs_util_uclamp,
+ rq->cfs.avg.util_avg_uclamp);
/* TODO: Better skip the frequency update in the for loop above. */
cpufreq_update_util(rq, 0);
#endif
@@ -8252,6 +8257,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
migrate_se_pelt_lag(se);
}

+ remove_root_cfs_util_uclamp(p);
/* Tell new CPU we are migrated */
se->avg.last_update_time = 0;

@@ -8261,6 +8267,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
static void task_dead_fair(struct task_struct *p)
{
remove_entity_load_avg(&p->se);
+ remove_root_cfs_util_uclamp(p);
}

static int
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index eca45a863f9f..9ba208ac26db 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -267,14 +267,78 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
}

#ifdef CONFIG_UCLAMP_TASK
+static int ___update_util_uclamp_towards(u64 now,
+ u64 last_update_time,
+ u32 period_contrib,
+ unsigned int *old,
+ unsigned int new_val)
+{
+ unsigned int old_val = READ_ONCE(*old);
+ u64 delta, periods;
+
+ if (old_val <= new_val) {
+ WRITE_ONCE(*old, new_val);
+ return old_val < new_val;
+ }
+
+ if (!last_update_time)
+ return 0;
+ delta = now - last_update_time;
+ if ((s64)delta < 0)
+ return 0;
+ delta >>= 10;
+ if (!delta)
+ return 0;
+
+ delta += period_contrib;
+ periods = delta / 1024;
+ if (periods) {
+ u64 diff = old_val - new_val;
+
+ /*
+ * Let's assume 3 tasks, A, B and C. A is still on rq but B and
+ * C have just been dequeued. The cfs.avg.util_avg_uclamp has
+ * become A but root_cfs_util_uclamp just starts to decay and is
+ * now still A + B + C.
+ *
+ * After p periods with y being the decay factor, the new
+ * root_cfs_util_uclamp should become
+ *
+ * A + B * y^p + C * y^p == A + (A + B + C - A) * y^p
+ * == cfs.avg.util_avg_uclamp +
+ * (root_cfs_util_uclamp_at_the_start - cfs.avg.util_avg_uclamp) * y^p
+ * == cfs.avg.util_avg_uclamp + diff * y^p
+ *
+ * So, instead of summing up each individual decayed values, we
+ * could just decay the diff and not bother with the summation
+ * at all. This is why we decay the diff here.
+ */
+ diff = decay_load(diff, periods);
+ WRITE_ONCE(*old, new_val + diff);
+ return old_val != *old;
+ }
+
+ return 0;
+}
+
/* avg must belong to the queue this se is on. */
-void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+void update_util_uclamp(u64 now,
+ u64 last_update_time,
+ u32 period_contrib,
+ struct sched_avg *avg,
+ struct task_struct *p)
{
unsigned int util, uclamp_min, uclamp_max;
int delta;

- if (!p->se.on_rq)
+ if (!p->se.on_rq) {
+ ___update_util_uclamp_towards(now,
+ last_update_time,
+ period_contrib,
+ &p->se.avg.util_avg_uclamp,
+ 0);
return;
+ }

if (!avg)
return;
@@ -294,7 +358,11 @@ void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
WRITE_ONCE(avg->util_avg_uclamp, util);
}
#else /* !CONFIG_UCLAMP_TASK */
-void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+void update_util_uclamp(u64 now,
+ u64 last_update_time,
+ u32 period_contrib,
+ struct sched_avg *avg,
+ struct task_struct *p)
{
}
#endif
@@ -327,23 +395,32 @@ void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)

void __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
{
+ u64 last_update_time = se->avg.last_update_time;
+ u32 period_contrib = se->avg.period_contrib;
+
if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
___update_load_avg(&se->avg, se_weight(se));
if (entity_is_task(se))
- update_util_uclamp(NULL, task_of(se));
+ update_util_uclamp(now, last_update_time,
+ period_contrib, NULL, task_of(se));
trace_pelt_se_tp(se);
}
}

void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ u64 last_update_time = se->avg.last_update_time;
+ u32 period_contrib = se->avg.period_contrib;
+
if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
cfs_rq->curr == se)) {

___update_load_avg(&se->avg, se_weight(se));
cfs_se_util_change(&se->avg);
if (entity_is_task(se))
- update_util_uclamp(&rq_of(cfs_rq)->cfs.avg,
+ update_util_uclamp(now, last_update_time,
+ period_contrib,
+ &rq_of(cfs_rq)->cfs.avg,
task_of(se));
trace_pelt_se_tp(se);
}
@@ -351,17 +428,30 @@ void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *s

int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
{
+ u64 __maybe_unused last_update_time = cfs_rq->avg.last_update_time;
+ u32 __maybe_unused period_contrib = cfs_rq->avg.period_contrib;
+ int ret = 0;
+
if (___update_load_sum(now, &cfs_rq->avg,
scale_load_down(cfs_rq->load.weight),
cfs_rq->h_nr_running,
cfs_rq->curr != NULL)) {

___update_load_avg(&cfs_rq->avg, 1);
- trace_pelt_cfs_tp(cfs_rq);
- return 1;
+ ret = 1;
}

- return 0;
+#ifdef CONFIG_UCLAMP_TASK
+ if (&rq_of(cfs_rq)->cfs == cfs_rq)
+ ret = ___update_util_uclamp_towards(now,
+ last_update_time, period_contrib,
+ &rq_of(cfs_rq)->root_cfs_util_uclamp,
+ READ_ONCE(cfs_rq->avg.util_avg_uclamp));
+#endif
+ if (ret)
+ trace_pelt_cfs_tp(cfs_rq);
+
+ return ret;
}

/*
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 6862f79e0fcd..a2852d5e862d 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -1,7 +1,8 @@
#ifdef CONFIG_SMP
#include "sched-pelt.h"

-void update_util_uclamp(struct sched_avg *avg, struct task_struct *p);
+void update_util_uclamp(u64 now, u64 last_update_time, u32 period_contrib,
+ struct sched_avg *avg, struct task_struct *p);
void __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 35036246824b..ce80b87b549b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -998,6 +998,7 @@ struct rq {
/* Utilization clamp values based on CPU's RUNNABLE tasks */
struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
unsigned int uclamp_flags;
+ unsigned int root_cfs_util_uclamp;
#define UCLAMP_FLAG_IDLE 0x01
#endif

@@ -3112,6 +3113,17 @@ static inline void util_uclamp_dequeue(struct sched_avg *avg,

WRITE_ONCE(avg->util_avg_uclamp, new_val);
}
+
+static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
+{
+ struct rq *rq = task_rq(p);
+ unsigned int root_util = READ_ONCE(rq->root_cfs_util_uclamp);
+ unsigned int p_util = READ_ONCE(p->se.avg.util_avg_uclamp), new_util;
+
+ new_util = (root_util > p_util) ? root_util - p_util : 0;
+ new_util = max(new_util, READ_ONCE(rq->cfs.avg.util_avg_uclamp));
+ WRITE_ONCE(rq->root_cfs_util_uclamp, new_util);
+}
#else /* CONFIG_UCLAMP_TASK */
static inline unsigned long uclamp_eff_value(struct task_struct *p,
enum uclamp_id clamp_id)
@@ -3147,6 +3159,10 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
{
return false;
}
+
+static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
+{
+}
#endif /* CONFIG_UCLAMP_TASK */

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
--
2.34.1


2024-02-01 13:13:58

by Hongyan Xia

Subject: [RFC PATCH v2 4/7] sched/fair: Use CFS util_avg_uclamp for utilization and frequency

Switch to the new util_avg_uclamp for task and runqueue utilization.
Since task_util_est() calls task_util() which now uses util_avg_uclamp,
this means util_est is now also a clamped value.

Now that we have the sum aggregated CFS util value, we do not need to
consult uclamp buckets to know how the frequency should be clamped. We
simply look at the aggregated top level root_cfs_util_uclamp to know
what frequency to choose.

TODO: Sum aggregation for RT tasks. I have already implemented RT sum
aggregation, which is only 49 lines of code, but I don't want RT to
distract from this series, which is mainly CFS-focused. RT will be sent
in a separate mini series.

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/core.c | 17 ++++----------
kernel/sched/cpufreq_schedutil.c | 10 ++------
kernel/sched/fair.c | 39 ++++++++++++++++----------------
kernel/sched/sched.h | 21 +++++++++++++----
4 files changed, 42 insertions(+), 45 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f..0bedc05c883f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7465,6 +7465,9 @@ int sched_core_idle_cpu(int cpu)
* The DL bandwidth number otoh is not a measured metric but a value computed
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
+ *
+ * The util_cfs parameter has already taken uclamp into account (unless uclamp
+ * support is not compiled in).
*/
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
unsigned long *min,
@@ -7490,13 +7493,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
}

if (min) {
- /*
- * The minimum utilization returns the highest level between:
- * - the computed DL bandwidth needed with the IRQ pressure which
- * steals time to the deadline task.
- * - The minimum performance requirement for CFS and/or RT.
- */
- *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
+ *min = irq + cpu_bw_dl(rq);

/*
* When an RT task is runnable and uclamp is not used, we must
@@ -7515,12 +7512,8 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
util = util_cfs + cpu_util_rt(rq);
util += cpu_util_dl(rq);

- /*
- * The maximum hint is a soft bandwidth requirement, which can be lower
- * than the actual utilization because of uclamp_max requirements.
- */
if (max)
- *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+ *max = scale;

if (util >= scale)
return scale;
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..48a4e4a685d0 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -381,11 +381,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
/*
* Do not reduce the frequency if the CPU has not been idle
* recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
*/
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
+ if (sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
!sg_policy->need_freq_update) {
next_f = sg_policy->next_freq;

@@ -435,11 +432,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
/*
* Do not reduce the target performance level if the CPU has not been
* idle recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
*/
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
+ if (sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;

cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 36357cfaf48d..b92739e1c52f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4821,10 +4821,17 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)

static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);

+#ifdef CONFIG_UCLAMP_TASK
+static inline unsigned long task_util(struct task_struct *p)
+{
+ return READ_ONCE(p->se.avg.util_avg_uclamp);
+}
+#else
static inline unsigned long task_util(struct task_struct *p)
{
return READ_ONCE(p->se.avg.util_avg);
}
+#endif

static inline unsigned long task_runnable(struct task_struct *p)
{
@@ -4932,8 +4939,13 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* To avoid underestimate of task utilization, skip updates of EWMA if
* we cannot grant that thread got all CPU time it wanted.
*/
- if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
+ if ((READ_ONCE(p->se.avg.util_avg) + UTIL_EST_MARGIN) <
+ task_runnable(p)) {
+ ewma = clamp(ewma,
+ uclamp_eff_value(p, UCLAMP_MIN),
+ uclamp_eff_value(p, UCLAMP_MAX));
goto done;
+ }


/*
@@ -7685,11 +7697,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
static unsigned long
cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
{
- struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
- unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
+ struct rq *rq = cpu_rq(cpu);
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ unsigned long util = root_cfs_util(rq);
+ bool capped = uclamp_rq_is_capped(rq);
unsigned long runnable;

- if (boost) {
+ if (boost && !capped) {
runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
util = max(util, runnable);
}
@@ -7867,7 +7881,6 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
int cpu;

for_each_cpu(cpu, pd_cpus) {
- struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
unsigned long eff_util, min, max;

@@ -7880,20 +7893,6 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
*/
eff_util = effective_cpu_util(cpu, util, &min, &max);

- /* Task's uclamp can modify min and max value */
- if (tsk && uclamp_is_used()) {
- min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
- /*
- * If there is no active max uclamp constraint,
- * directly use task's one, otherwise keep max.
- */
- if (uclamp_rq_is_idle(cpu_rq(cpu)))
- max = uclamp_eff_value(p, UCLAMP_MAX);
- else
- max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
- }
-
eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
max_util = max(max_util, eff_util);
}
@@ -7996,7 +7995,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target = prev_cpu;

sync_entity_load_avg(&p->se);
- if (!task_util_est(p) && p_util_min == 0)
+ if (!task_util_est(p))
goto unlock;

eenv_task_busy_time(&eenv, p, prev_cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce80b87b549b..3ee28822f48f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3062,16 +3062,17 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
/* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq)
{
- unsigned long rq_util;
- unsigned long max_util;
+ unsigned long rq_uclamp_util, rq_real_util;

if (!static_branch_likely(&sched_uclamp_used))
return false;

- rq_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
- max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+ rq_uclamp_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
+ rq_real_util = READ_ONCE(rq->cfs.avg.util_avg) +
+ READ_ONCE(rq->avg_rt.util_avg);

- return max_util != SCHED_CAPACITY_SCALE && rq_util >= max_util;
+ return rq_uclamp_util < SCHED_CAPACITY_SCALE &&
+ rq_real_util > rq_uclamp_util;
}

/*
@@ -3087,6 +3088,11 @@ static inline bool uclamp_is_used(void)
return static_branch_likely(&sched_uclamp_used);
}

+static inline unsigned long root_cfs_util(struct rq *rq)
+{
+ return READ_ONCE(rq->root_cfs_util_uclamp);
+}
+
static inline void util_uclamp_enqueue(struct sched_avg *avg,
struct task_struct *p)
{
@@ -3160,6 +3166,11 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return false;
}

+static inline unsigned long root_cfs_util(struct rq *rq)
+{
+ return READ_ONCE(rq->cfs.avg.util_avg);
+}
+
static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
{
}
--
2.34.1


2024-02-01 13:14:05

by Hongyan Xia

Subject: [RFC PATCH v2 7/7] sched/uclamp: Simplify uclamp_eff_value()

The commit

sched: Remove all uclamp bucket logic

removes the uclamp_rq_{inc/dec}() functions, so p->uclamp now contains
the correct values at all times after a uclamp_update_active() call, and
there's no need to toggle the boolean `active` after an update. As a
result, uclamp_eff_value() is now fairly simple and can live as a static
inline function.

Signed-off-by: Hongyan Xia <[email protected]>
---
kernel/sched/core.c | 13 ++++---------
kernel/sched/sched.h | 14 ++++++++++++--
2 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a3b36adc4dcc..f5f5f056525c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1499,21 +1499,15 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
return uc_req;
}

-unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
-{
- if (!uclamp_is_used() || !p->uclamp[clamp_id].active)
- return uclamp_none(clamp_id);
-
- return p->uclamp[clamp_id].value;
-}
-
static inline void
uclamp_update_active_nolock(struct task_struct *p)
{
enum uclamp_id clamp_id;

- for_each_clamp_id(clamp_id)
+ for_each_clamp_id(clamp_id) {
p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
+ p->uclamp[clamp_id].active = 1;
+ }
}

static inline void
@@ -1773,6 +1767,7 @@ static void uclamp_fork(struct task_struct *p)
for_each_clamp_id(clamp_id) {
uclamp_se_set(&p->uclamp_req[clamp_id],
uclamp_none(clamp_id), false);
+ p->uclamp[clamp_id].active = 0;
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 81578410984c..2caefc3344bb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2991,8 +2991,6 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
#endif

#ifdef CONFIG_UCLAMP_TASK
-unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
-
/* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq)
{
@@ -3022,6 +3020,18 @@ static inline bool uclamp_is_used(void)
return static_branch_likely(&sched_uclamp_used);
}

+static inline unsigned long uclamp_eff_value(struct task_struct *p,
+ enum uclamp_id clamp_id)
+{
+ if (uclamp_is_used() && p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].value;
+
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+
+ return SCHED_CAPACITY_SCALE;
+}
+
static inline unsigned long root_cfs_util(struct rq *rq)
{
return READ_ONCE(rq->root_cfs_util_uclamp);
--
2.34.1


2024-02-01 13:14:36

by Hongyan Xia

Subject: [RFC PATCH v2 6/7] sched/uclamp: Remove all uclamp bucket logic

Also rewrite uclamp_update_active() so that the effective uclamp values
are updated every time we change task group properties, change system
defaults, or a request is issued from userspace.

This also significantly reduces uclamp overhead, because we no longer
need to compute effective uclamp values and manipulate buckets every
time a task is enqueued or dequeued (in uclamp_rq_{inc/dec}()).

TODO: Rewrite documentation to match the new logic.

Signed-off-by: Hongyan Xia <[email protected]>

---
Changed in v2:
- Remove stale comments about 'uclamp buckets'.
---
include/linux/sched.h | 4 -
init/Kconfig | 32 -----
kernel/sched/core.c | 300 +++---------------------------------------
kernel/sched/fair.c | 4 -
kernel/sched/rt.c | 4 -
kernel/sched/sched.h | 85 ------------
6 files changed, 19 insertions(+), 410 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f28eeff169ff..291b6781b221 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -678,9 +678,6 @@ struct sched_dl_entity {
};

#ifdef CONFIG_UCLAMP_TASK
-/* Number of utilization clamp buckets (shorter alias) */
-#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
-
/*
* Utilization clamp for a scheduling entity
* @value: clamp value "assigned" to a se
@@ -706,7 +703,6 @@ struct sched_dl_entity {
*/
struct uclamp_se {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
- unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int active : 1;
unsigned int user_defined : 1;
};
diff --git a/init/Kconfig b/init/Kconfig
index 9ffb103fc927..1c8e11dcda17 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -808,38 +808,6 @@ config UCLAMP_TASK
enforce or grant any specific bandwidth for tasks.

If in doubt, say N.
-
-config UCLAMP_BUCKETS_COUNT
- int "Number of supported utilization clamp buckets"
- range 5 20
- default 5
- depends on UCLAMP_TASK
- help
- Defines the number of clamp buckets to use. The range of each bucket
- will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
- number of clamp buckets the finer their granularity and the higher
- the precision of clamping aggregation and tracking at run-time.
-
- For example, with the minimum configuration value we will have 5
- clamp buckets tracking 20% utilization each. A 25% boosted tasks will
- be refcounted in the [20..39]% bucket and will set the bucket clamp
- effective value to 25%.
- If a second 30% boosted task should be co-scheduled on the same CPU,
- that task will be refcounted in the same bucket of the first task and
- it will boost the bucket clamp effective value to 30%.
- The clamp effective value of a bucket is reset to its nominal value
- (20% in the example above) when there are no more tasks refcounted in
- that bucket.
-
- An additional boost/capping margin can be added to some tasks. In the
- example above the 25% task will be boosted to 30% until it exits the
- CPU. If that should be considered not acceptable on certain systems,
- it's always possible to reduce the margin by increasing the number of
- clamp buckets to trade off used memory for run-time tracking
- precision.
-
- If in doubt, use the default value.
-
endmenu

#
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0bedc05c883f..a3b36adc4dcc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1408,17 +1408,9 @@ static struct uclamp_se uclamp_default[UCLAMP_CNT];
*/
DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

-/* Integer rounded range for each bucket */
-#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
-
#define for_each_clamp_id(clamp_id) \
for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++)

-static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
-{
- return min_t(unsigned int, clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1);
-}
-
static inline unsigned int uclamp_none(enum uclamp_id clamp_id)
{
if (clamp_id == UCLAMP_MIN)
@@ -1430,58 +1422,9 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
unsigned int value, bool user_defined)
{
uc_se->value = value;
- uc_se->bucket_id = uclamp_bucket_id(value);
uc_se->user_defined = user_defined;
}

-static inline unsigned int
-uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- /*
- * Avoid blocked utilization pushing up the frequency when we go
- * idle (which drops the max-clamp) by retaining the last known
- * max-clamp.
- */
- if (clamp_id == UCLAMP_MAX) {
- rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
- return clamp_value;
- }
-
- return uclamp_none(UCLAMP_MIN);
-}
-
-static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- /* Reset max-clamp retention only on idle exit */
- if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
- return;
-
- uclamp_rq_set(rq, clamp_id, clamp_value);
-}
-
-static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
- int bucket_id = UCLAMP_BUCKETS - 1;
-
- /*
- * Since both min and max clamps are max aggregated, find the
- * top most bucket with tasks in.
- */
- for ( ; bucket_id >= 0; bucket_id--) {
- if (!bucket[bucket_id].tasks)
- continue;
- return bucket[bucket_id].value;
- }
-
- /* No tasks -- default clamp values */
- return uclamp_idle_value(rq, clamp_id, clamp_value);
-}
-
static void __uclamp_update_util_min_rt_default(struct task_struct *p)
{
unsigned int default_util_min;
@@ -1537,8 +1480,7 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
}

/*
- * The effective clamp bucket index of a task depends on, by increasing
- * priority:
+ * The effective uclamp value of a task depends on, by increasing priority:
* - the task specific clamp value, when explicitly requested from userspace
* - the task group effective clamp value, for tasks not either in the root
* group or in an autogroup
@@ -1559,196 +1501,24 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)

unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
{
- struct uclamp_se uc_eff;
-
- /* Task currently refcounted: use back-annotated (effective) value */
- if (p->uclamp[clamp_id].active)
- return (unsigned long)p->uclamp[clamp_id].value;
-
- uc_eff = uclamp_eff_get(p, clamp_id);
-
- return (unsigned long)uc_eff.value;
-}
-
-/*
- * When a task is enqueued on a rq, the clamp bucket currently defined by the
- * task's uclamp::bucket_id is refcounted on that rq. This also immediately
- * updates the rq's clamp value if required.
- *
- * Tasks can have a task-specific value requested from user-space, track
- * within each bucket the maximum value for tasks refcounted in it.
- * This "local max aggregation" allows to track the exact "requested" value
- * for each bucket when all its RUNNABLE tasks require the same clamp.
- */
-static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
- struct uclamp_se *uc_se = &p->uclamp[clamp_id];
- struct uclamp_bucket *bucket;
-
- lockdep_assert_rq_held(rq);
+ if (!uclamp_is_used() || !p->uclamp[clamp_id].active)
+ return uclamp_none(clamp_id);

- /* Update task effective clamp */
- p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
-
- bucket = &uc_rq->bucket[uc_se->bucket_id];
- bucket->tasks++;
- uc_se->active = true;
-
- uclamp_idle_reset(rq, clamp_id, uc_se->value);
-
- /*
- * Local max aggregation: rq buckets always track the max
- * "requested" clamp value of its RUNNABLE tasks.
- */
- if (bucket->tasks == 1 || uc_se->value > bucket->value)
- bucket->value = uc_se->value;
-
- if (uc_se->value > uclamp_rq_get(rq, clamp_id))
- uclamp_rq_set(rq, clamp_id, uc_se->value);
+ return p->uclamp[clamp_id].value;
}

-/*
- * When a task is dequeued from a rq, the clamp bucket refcounted by the task
- * is released. If this is the last task reference counting the rq's max
- * active clamp value, then the rq's clamp value is updated.
- *
- * Both refcounted tasks and rq's cached clamp values are expected to be
- * always valid. If it's detected they are not, as defensive programming,
- * enforce the expected state and warn.
- */
-static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
- struct uclamp_se *uc_se = &p->uclamp[clamp_id];
- struct uclamp_bucket *bucket;
- unsigned int bkt_clamp;
- unsigned int rq_clamp;
-
- lockdep_assert_rq_held(rq);
-
- /*
- * If sched_uclamp_used was enabled after task @p was enqueued,
- * we could end up with unbalanced call to uclamp_rq_dec_id().
- *
- * In this case the uc_se->active flag should be false since no uclamp
- * accounting was performed at enqueue time and we can just return
- * here.
- *
- * Need to be careful of the following enqueue/dequeue ordering
- * problem too
- *
- * enqueue(taskA)
- * // sched_uclamp_used gets enabled
- * enqueue(taskB)
- * dequeue(taskA)
- * // Must not decrement bucket->tasks here
- * dequeue(taskB)
- *
- * where we could end up with stale data in uc_se and
- * bucket[uc_se->bucket_id].
- *
- * The following check here eliminates the possibility of such race.
- */
- if (unlikely(!uc_se->active))
- return;
-
- bucket = &uc_rq->bucket[uc_se->bucket_id];
-
- SCHED_WARN_ON(!bucket->tasks);
- if (likely(bucket->tasks))
- bucket->tasks--;
-
- uc_se->active = false;
-
- /*
- * Keep "local max aggregation" simple and accept to (possibly)
- * overboost some RUNNABLE tasks in the same bucket.
- * The rq clamp bucket value is reset to its base value whenever
- * there are no more RUNNABLE tasks refcounting it.
- */
- if (likely(bucket->tasks))
- return;
-
- rq_clamp = uclamp_rq_get(rq, clamp_id);
- /*
- * Defensive programming: this should never happen. If it happens,
- * e.g. due to future modification, warn and fixup the expected value.
- */
- SCHED_WARN_ON(bucket->value > rq_clamp);
- if (bucket->value >= rq_clamp) {
- bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
- uclamp_rq_set(rq, clamp_id, bkt_clamp);
- }
-}
-
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
-{
- enum uclamp_id clamp_id;
-
- /*
- * Avoid any overhead until uclamp is actually used by the userspace.
- *
- * The condition is constructed such that a NOP is generated when
- * sched_uclamp_used is disabled.
- */
- if (!static_branch_unlikely(&sched_uclamp_used))
- return;
-
- if (unlikely(!p->sched_class->uclamp_enabled))
- return;
-
- for_each_clamp_id(clamp_id)
- uclamp_rq_inc_id(rq, p, clamp_id);
-
- /* Reset clamp idle holding when there is one RUNNABLE task */
- if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
- rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
-}
-
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
+static inline void
+uclamp_update_active_nolock(struct task_struct *p)
{
enum uclamp_id clamp_id;

- /*
- * Avoid any overhead until uclamp is actually used by the userspace.
- *
- * The condition is constructed such that a NOP is generated when
- * sched_uclamp_used is disabled.
- */
- if (!static_branch_unlikely(&sched_uclamp_used))
- return;
-
- if (unlikely(!p->sched_class->uclamp_enabled))
- return;
-
for_each_clamp_id(clamp_id)
- uclamp_rq_dec_id(rq, p, clamp_id);
-}
-
-static inline void uclamp_rq_reinc_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- if (!p->uclamp[clamp_id].active)
- return;
-
- uclamp_rq_dec_id(rq, p, clamp_id);
- uclamp_rq_inc_id(rq, p, clamp_id);
-
- /*
- * Make sure to clear the idle flag if we've transiently reached 0
- * active tasks on rq.
- */
- if (clamp_id == UCLAMP_MAX && (rq->uclamp_flags & UCLAMP_FLAG_IDLE))
- rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
+ p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
}

static inline void
uclamp_update_active(struct task_struct *p)
{
- enum uclamp_id clamp_id;
struct rq_flags rf;
struct rq *rq;

@@ -1762,14 +1532,7 @@ uclamp_update_active(struct task_struct *p)
*/
rq = task_rq_lock(p, &rf);

- /*
- * Setting the clamp bucket is serialized by task_rq_lock().
- * If the task is not yet RUNNABLE and its task_struct is not
- * affecting a valid clamp bucket, the next time it's enqueued,
- * it will already see the updated clamp bucket value.
- */
- for_each_clamp_id(clamp_id)
- uclamp_rq_reinc_id(rq, p, clamp_id);
+ uclamp_update_active_nolock(p);

task_rq_unlock(rq, p, &rf);
}
@@ -1998,26 +1761,22 @@ static void __setscheduler_uclamp(struct task_struct *p,
uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
attr->sched_util_max, true);
}
+
+ uclamp_update_active_nolock(p);
}

static void uclamp_fork(struct task_struct *p)
{
enum uclamp_id clamp_id;

- /*
- * We don't need to hold task_rq_lock() when updating p->uclamp_* here
- * as the task is still at its early fork stages.
- */
- for_each_clamp_id(clamp_id)
- p->uclamp[clamp_id].active = false;
-
- if (likely(!p->sched_reset_on_fork))
- return;
-
- for_each_clamp_id(clamp_id) {
- uclamp_se_set(&p->uclamp_req[clamp_id],
- uclamp_none(clamp_id), false);
+ if (unlikely(p->sched_reset_on_fork)) {
+ for_each_clamp_id(clamp_id) {
+ uclamp_se_set(&p->uclamp_req[clamp_id],
+ uclamp_none(clamp_id), false);
+ }
}
+
+ uclamp_update_active(p);
}

static void uclamp_post_fork(struct task_struct *p)
@@ -2025,28 +1784,10 @@ static void uclamp_post_fork(struct task_struct *p)
uclamp_update_util_min_rt_default(p);
}

-static void __init init_uclamp_rq(struct rq *rq)
-{
- enum uclamp_id clamp_id;
- struct uclamp_rq *uc_rq = rq->uclamp;
-
- for_each_clamp_id(clamp_id) {
- uc_rq[clamp_id] = (struct uclamp_rq) {
- .value = uclamp_none(clamp_id)
- };
- }
-
- rq->uclamp_flags = UCLAMP_FLAG_IDLE;
-}
-
static void __init init_uclamp(void)
{
struct uclamp_se uc_max = {};
enum uclamp_id clamp_id;
- int cpu;
-
- for_each_possible_cpu(cpu)
- init_uclamp_rq(cpu_rq(cpu));

for_each_clamp_id(clamp_id) {
uclamp_se_set(&init_task.uclamp_req[clamp_id],
@@ -2065,8 +1806,7 @@ static void __init init_uclamp(void)
}

#else /* CONFIG_UCLAMP_TASK */
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_update_active_nolock(struct task_struct *p) { }
static inline int uclamp_validate(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -2113,7 +1853,6 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
}

- uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);

if (sched_core_enabled(rq))
@@ -2133,7 +1872,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}

- uclamp_rq_dec(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}

@@ -10480,6 +10218,7 @@ void sched_move_task(struct task_struct *tsk)
put_prev_task(rq, tsk);

sched_change_group(tsk, group);
+ uclamp_update_active_nolock(tsk);

if (queued)
enqueue_task(rq, tsk, queue_flags);
@@ -10612,7 +10351,6 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
if (eff[clamp_id] == uc_se[clamp_id].value)
continue;
uc_se[clamp_id].value = eff[clamp_id];
- uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
if (!clamps) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49997f1f58fb..ac1dd5739ec6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12996,10 +12996,6 @@ DEFINE_SCHED_CLASS(fair) = {
#ifdef CONFIG_SCHED_CORE
.task_is_throttled = task_is_throttled_fair,
#endif
-
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..86733bed0e3c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2681,10 +2681,6 @@ DEFINE_SCHED_CLASS(rt) = {
#ifdef CONFIG_SCHED_CORE
.task_is_throttled = task_is_throttled_rt,
#endif
-
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
};

#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3ee28822f48f..81578410984c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -913,46 +913,6 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif /* CONFIG_SMP */

#ifdef CONFIG_UCLAMP_TASK
-/*
- * struct uclamp_bucket - Utilization clamp bucket
- * @value: utilization clamp value for tasks on this clamp bucket
- * @tasks: number of RUNNABLE tasks on this clamp bucket
- *
- * Keep track of how many tasks are RUNNABLE for a given utilization
- * clamp value.
- */
-struct uclamp_bucket {
- unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
- unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
-};
-
-/*
- * struct uclamp_rq - rq's utilization clamp
- * @value: currently active clamp values for a rq
- * @bucket: utilization clamp buckets affecting a rq
- *
- * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
- * A clamp value is affecting a rq when there is at least one task RUNNABLE
- * (or actually running) with that value.
- *
- * There are up to UCLAMP_CNT possible different clamp values, currently there
- * are only two: minimum utilization and maximum utilization.
- *
- * All utilization clamping values are MAX aggregated, since:
- * - for util_min: we want to run the CPU at least at the max of the minimum
- * utilization required by its currently RUNNABLE tasks.
- * - for util_max: we want to allow the CPU to run up to the max of the
- * maximum utilization allowed by its currently RUNNABLE tasks.
- *
- * Since on each system we expect only a limited number of different
- * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
- * the metrics required to compute all the per-rq utilization clamp values.
- */
-struct uclamp_rq {
- unsigned int value;
- struct uclamp_bucket bucket[UCLAMP_BUCKETS];
-};
-
DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
#endif /* CONFIG_UCLAMP_TASK */

@@ -995,11 +955,7 @@ struct rq {
u64 nr_switches;

#ifdef CONFIG_UCLAMP_TASK
- /* Utilization clamp values based on CPU's RUNNABLE tasks */
- struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
- unsigned int uclamp_flags;
unsigned int root_cfs_util_uclamp;
-#define UCLAMP_FLAG_IDLE 0x01
#endif

struct cfs_rq cfs;
@@ -2247,11 +2203,6 @@ struct affinity_context {
extern s64 update_curr_common(struct rq *rq);

struct sched_class {
-
-#ifdef CONFIG_UCLAMP_TASK
- int uclamp_enabled;
-#endif
-
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
@@ -3042,23 +2993,6 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
#ifdef CONFIG_UCLAMP_TASK
unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);

-static inline unsigned long uclamp_rq_get(struct rq *rq,
- enum uclamp_id clamp_id)
-{
- return READ_ONCE(rq->uclamp[clamp_id].value);
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int value)
-{
- WRITE_ONCE(rq->uclamp[clamp_id].value, value);
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
- return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
-}
-
/* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq)
{
@@ -3147,25 +3081,6 @@ static inline bool uclamp_is_used(void)
return false;
}

-static inline unsigned long uclamp_rq_get(struct rq *rq,
- enum uclamp_id clamp_id)
-{
- if (clamp_id == UCLAMP_MIN)
- return 0;
-
- return SCHED_CAPACITY_SCALE;
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int value)
-{
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
- return false;
-}
-
static inline unsigned long root_cfs_util(struct rq *rq)
{
return READ_ONCE(rq->cfs.avg.util_avg);
--
2.34.1


2024-02-01 13:32:09

by Hongyan Xia

[permalink] [raw]
Subject: [RFC PATCH v2 2/7] sched/uclamp: Track uclamped util_avg in sched_avg

Track a uclamped version of util_avg in sched_avg, which clamps util_avg
within [uclamp[UCLAMP_MIN], uclamp[UCLAMP_MAX]] every time util_avg is
updated. At the root CFS rq level, just like util_est,
rq->cfs.avg.util_avg_uclamp must always be the sum of all
util_avg_uclamp of CFS tasks on this rq. So, each time the
util_avg_uclamp of a task gets updated, we also track the delta and
update the root cfs_rq. When a CFS task gets enqueued or dequeued, the
rq->cfs.avg.util_avg_uclamp also needs to add or subtract the
util_avg_uclamp of this task.
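
Purely as an illustration of the rule above, here is a minimal
standalone model of the delta idea (the numbers and local names are made
up; the actual implementation is update_util_uclamp() plus
util_uclamp_enqueue()/util_uclamp_dequeue() in the diff below):

#include <stdio.h>

/* Stand-in for the kernel's clamp(). */
static unsigned int clamp_val(unsigned int v, unsigned int lo, unsigned int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

int main(void)
{
	/* Hypothetical task: util_avg 700, clamped to [0, 300]. */
	unsigned int task_util_avg = 700;
	unsigned int uclamp_min = 0, uclamp_max = 300;
	unsigned int task_util_avg_uclamp = 250;	/* previous clamped value */
	unsigned int rq_util_avg_uclamp = 1000;		/* sum over runnable tasks */

	/* On every update of the task's util_avg: re-clamp, propagate the delta. */
	unsigned int new_val = clamp_val(task_util_avg, uclamp_min, uclamp_max);
	int delta = (int)new_val - (int)task_util_avg_uclamp;

	task_util_avg_uclamp = new_val;
	rq_util_avg_uclamp += delta;	/* root cfs_rq stays the sum of its tasks */

	/* Enqueue adds the task's value to the rq sum; dequeue subtracts it. */
	printf("task=%u rq=%u\n", task_util_avg_uclamp, rq_util_avg_uclamp);
	return 0;
}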

Signed-off-by: Hongyan Xia <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/fair.c | 21 +++++++++++++++++++
kernel/sched/pelt.c | 48 +++++++++++++++++++++++++++++++++++--------
kernel/sched/pelt.h | 5 +++--
kernel/sched/sched.h | 27 ++++++++++++++++++++++++
5 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03bfe9ab2951..f28eeff169ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -470,6 +470,9 @@ struct sched_avg {
unsigned long runnable_avg;
unsigned long util_avg;
unsigned int util_est;
+#ifdef CONFIG_UCLAMP_TASK
+ unsigned int util_avg_uclamp;
+#endif
} ____cacheline_aligned;

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d5cc87db4845..4f535c96463b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1089,6 +1089,9 @@ void post_init_entity_util_avg(struct task_struct *p)
}

sa->runnable_avg = sa->util_avg;
+#ifdef CONFIG_UCLAMP_TASK
+ sa->util_avg_uclamp = sa->util_avg;
+#endif
}

#else /* !CONFIG_SMP */
@@ -6763,6 +6766,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

/* At this point se is NULL and we are at root level*/
add_nr_running(rq, 1);
+#ifdef CONFIG_UCLAMP_TASK
+ util_uclamp_enqueue(&rq->cfs.avg, p);
+ update_util_uclamp(&rq->cfs.avg, p);
+ /* TODO: Better skip the frequency update in the for loop above. */
+ cpufreq_update_util(rq, 0);
+#endif

/*
* Since new tasks are assigned an initial util_avg equal to
@@ -6854,6 +6863,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

/* At this point se is NULL and we are at root level*/
sub_nr_running(rq, 1);
+#ifdef CONFIG_UCLAMP_TASK
+ util_uclamp_dequeue(&rq->cfs.avg, p);
+#endif

/* balance early to pull high priority tasks */
if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
@@ -6862,6 +6874,15 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
+
+#ifdef CONFIG_UCLAMP_TASK
+ if (rq->cfs.h_nr_running == 0) {
+ WARN_ONCE(rq->cfs.avg.util_avg_uclamp,
+ "0 tasks on CFS of CPU %d, but util_avg_uclamp is %u\n",
+ rq->cpu, rq->cfs.avg.util_avg_uclamp);
+ WRITE_ONCE(rq->cfs.avg.util_avg_uclamp, 0);
+ }
+#endif
}

#ifdef CONFIG_SMP
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 63b6cf898220..eca45a863f9f 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -266,6 +266,39 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
}

+#ifdef CONFIG_UCLAMP_TASK
+/* avg must belong to the queue this se is on. */
+void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+{
+ unsigned int util, uclamp_min, uclamp_max;
+ int delta;
+
+ if (!p->se.on_rq)
+ return;
+
+ if (!avg)
+ return;
+
+ util = READ_ONCE(p->se.avg.util_avg);
+ uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
+ uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
+ util = clamp(util, uclamp_min, uclamp_max);
+
+ delta = util - READ_ONCE(p->se.avg.util_avg_uclamp);
+ if (delta == 0)
+ return;
+
+ WRITE_ONCE(p->se.avg.util_avg_uclamp, util);
+ util = READ_ONCE(avg->util_avg_uclamp);
+ util += delta;
+ WRITE_ONCE(avg->util_avg_uclamp, util);
+}
+#else /* !CONFIG_UCLAMP_TASK */
+void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+{
+}
+#endif
+
/*
* sched_entity:
*
@@ -292,29 +325,28 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
* load_avg = \Sum se->avg.load_avg
*/

-int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
+void __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
{
if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
___update_load_avg(&se->avg, se_weight(se));
+ if (entity_is_task(se))
+ update_util_uclamp(NULL, task_of(se));
trace_pelt_se_tp(se);
- return 1;
}
-
- return 0;
}

-int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
+void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
cfs_rq->curr == se)) {

___update_load_avg(&se->avg, se_weight(se));
cfs_se_util_change(&se->avg);
+ if (entity_is_task(se))
+ update_util_uclamp(&rq_of(cfs_rq)->cfs.avg,
+ task_of(se));
trace_pelt_se_tp(se);
- return 1;
}
-
- return 0;
}

int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 9e1083465fbc..6862f79e0fcd 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -1,8 +1,9 @@
#ifdef CONFIG_SMP
#include "sched-pelt.h"

-int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
-int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
+void update_util_uclamp(struct sched_avg *avg, struct task_struct *p);
+void __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
+void __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77d..35036246824b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3085,6 +3085,33 @@ static inline bool uclamp_is_used(void)
{
return static_branch_likely(&sched_uclamp_used);
}
+
+static inline void util_uclamp_enqueue(struct sched_avg *avg,
+ struct task_struct *p)
+{
+ unsigned int avg_val = READ_ONCE(avg->util_avg_uclamp);
+ unsigned int p_val = READ_ONCE(p->se.avg.util_avg_uclamp);
+
+ WRITE_ONCE(avg->util_avg_uclamp, avg_val + p_val);
+}
+
+static inline void util_uclamp_dequeue(struct sched_avg *avg,
+ struct task_struct *p)
+{
+ unsigned int avg_val = READ_ONCE(avg->util_avg_uclamp);
+ unsigned int p_val = READ_ONCE(p->se.avg.util_avg_uclamp), new_val;
+
+ if (avg_val > p_val)
+ new_val = avg_val - p_val;
+ else {
+ WARN_ONCE(avg_val < p_val,
+ "avg_val underflow. avg_val %u is even less than p_val %u before subtraction\n",
+ avg_val, p_val);
+ new_val = 0;
+ }
+
+ WRITE_ONCE(avg->util_avg_uclamp, new_val);
+}
#else /* CONFIG_UCLAMP_TASK */
static inline unsigned long uclamp_eff_value(struct task_struct *p,
enum uclamp_id clamp_id)
--
2.34.1


2024-02-06 15:36:07

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On 02/01/24 13:11, Hongyan Xia wrote:

> [1]: https://lore.kernel.org/all/[email protected]/

Their solution is not acceptable for the same reason yours isn't. Saravana and
David know this and we discussed at LPC. uclamp hints are limits and should not
be summed.

> [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510

I think I have clarified several times so far that this is not related to
uclamp. Could you please refrain from referring to it again in the future?
This is misleading and helps neither your cause nor its cause. The fact
that you keep relating the two makes me very worried, as both links
demonstrate a lack of understanding of, or confusion about, what uclamp is
supposed to be.

Again, this solution is not acceptable and you're moving things in the wrong
direction. We don't want to redesign what uclamp means, but fix some corner
cases. And you're doing the former not the latter.

2024-02-06 17:33:23

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On 06/02/2024 15:20, Qais Yousef wrote:
> On 02/01/24 13:11, Hongyan Xia wrote:
>
>> [1]: https://lore.kernel.org/all/[email protected]/
>
> Their solution is not acceptable for the same reason yours isn't. Saravana and
> David know this and we discussed at LPC. uclamp hints are limits and should not
> be summed.

Uclamp is a performance hint and nothing in its definition says it can't
be summed. Clearly, whether a uclamp approach should be taken should be
determined by how well it works as a hint, not by how we calculate it. I
am not rejecting max aggregation simply because it throws away all
uclamp values except the max; it's because I have real evaluation
results showing sum aggregation works as a much better hint.

>> [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510
>
> I think I have clarified several times so far that this is not related to
> uclamp. Could you please refrain from referring to it again in the future?
> This is misleading and helps neither your cause nor its cause. The fact
> that you keep relating the two makes me very worried, as both links
> demonstrate a lack of understanding of, or confusion about, what uclamp is
> supposed to be.

The intention of the code is irrelevant. What I'm talking about is the
effect the code actually has. The fact that you keep thinking I don't
understand what the code does, even after I explained "I know what the
intention of the code is, I'm just talking about the actual effect of
the code", is an even more worrying sign.

> Again, this solution is not acceptable and you're moving things in the wrong
> direction. We don't want to redesign what uclamp means, but fix some corner
> cases. And you're doing the former not the latter.

I'm saying max aggregation is not effective and proposing a more
effective implementation. In fact, you have sent a series that removes
max aggregation. Clearly that does not count as fixing corner cases but
is actually a redesign, and I don't understand why you are allowed to do
such things and I am not. Also, when something becomes harder and harder
to fix, a redesign that solves all the problems is clearly justified.

What I can summarize from sum aggregation is:

Pros:
1. A more effective implementation, proven by evaluation numbers
2. Consuming the same or even less power in benchmarks
3. 350 lines of code in total, less than half of max aggregation
4. This series shows the entirety and effectiveness of sum aggregation,
at this very moment, today. Max aggregation needs further filtering and
load balancing patches which we have not seen yet.
5. Resolves the drawbacks from max aggregation (which you might say is
the same as 4)
6. Significantly reduces uclamp overhead, no bucket operations

Cons:
1. should not be summed (although the scheduler used to sum up
utilization and util_est sums up a processed PELT signal today)
2. The under-utilization case, sketched below (which is a problem
GROUP_THROTTLE also has, and can be worked around. Please, I know the
intention of GROUP_THROTTLE; I'm just talking about its actual effects).
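
For reference, here is a standalone sketch of that corner case, using
the numbers from the limitations section of the cover letter (two
always-running tasks of utilization ~512, one with a UCLAMP_MAX of 1);
the 80% over-utilization margin below is only an assumption for
illustration:

#include <stdio.h>

int main(void)
{
	unsigned int capacity = 1024;
	unsigned int util_a = 512;			/* uncapped task */
	unsigned int util_b = 512, b_uclamp_max = 1;	/* heavily capped task */

	unsigned int b_clamped = util_b > b_uclamp_max ? b_uclamp_max : util_b;
	unsigned int sum = util_a + b_clamped;		/* 512 + 1 = 513 */

	/* Assumed rule of thumb: over-utilized above ~80% of capacity. */
	int overutilized = sum * 5 > capacity * 4;

	printf("sum aggregated = %u, over-utilized = %s\n",
	       sum, overutilized ? "yes" : "no");
	return 0;
}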

I don't see how the things I listed above are moving in the wrong direction.

2024-02-12 09:17:10

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On Thu, 1 Feb 2024 at 14:12, Hongyan Xia <[email protected]> wrote:
>
> Current implementation of uclamp leaves many things to be desired.
> There are three noteworthy problems:
>
> 1. Max aggregation only takes into account the max uclamp value. All
> other uclamp values are not in effect.
> 2. Uclamp max aggregation gives UCLAMP_MIN and UCLAMP_MAX at the rq
> level, and whether that translates to the desired performance of a
> specific task is unknown, because it depends on other tasks on rq.
> 3. Complexity. Uclamp max aggregation now sits at more than 750 lines of
> code and there is ongoing work to rework several interfaces to
> prepare for further uclamp changes. Uclamp max aggregation itself
> also needs future improvements in uclamp filtering and load balancing
>
> The first 2 points can manifest into the following symptoms,
>
> 1. High-rate OPP changes ("frequency spike" problem). An always-running
> task with a UCLAMP_MAX of 200 will drive the CPU at 200 even though
> its utilization is 1024. However, when a util_avg task of 300 but
> with default UCLAMP_MAX of 1024 joins the rq, the rq UCLAMP_MAX will
> be uncapped, and the UCLAMP_MAX of the first task is no longer in
> effect therefore driving the CPU at 1024, the highest OPP. When the
> second task sleeps, the OPP will be reduced to 200. This fast and
> sudden OPP switch every time the 2nd task wakes up or sleeps is
> unnecessary.
> 2. Using UCLAMP_MIN to boost performance under max aggregation has been
> shown to have weaker effectiveness than "sum aggregated" approaches,
> including the util_guest proposal [1] and uclamp sum aggregation in
> this series. The performance level of UCLAMP_MIN for a task under max
> aggregation is unpredictable when there are more than 1 task runnable
> on the rq.
>
> This series solves these problems by tracking a
> util_avg_uclamp signal in tasks and root cfs_rq. At task level,
> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
> must always be the sum of all util_avg_uclamp of all the entities on
> this cfs_rq. As a result, rq->cfs.avg.util_avg_uclamp is the sum
> aggregation of all the clamped values, which hints the frequency
> this rq should run at and what the utilization is. This proposal has
> some similarities to the util_guest series for VM workloads, in that it
> brings the desired performance to the task that requested it, not to the
> rq, in which the share of the task is unpredictable.

As I mentioned in your previous version, I don't want such mixed
signals, which are hard to maintain (as you then try to compensate for
all the side effects, compensating for some decaying as an example) and
don't mean anything in the end. You must find another way if you want
to go in the direction of sum aggregation.

>
> Note: This new signal does not change the existing PELT signal. The new
> signal is only an extra hint to the scheduler to improve scheduling
> decisions.
>
> TL;DR OF THIS SERIES:
>
> - Our evaluation shows significantly better effectiveness than max
> aggregation. UI benchmarks and VM workloads have improved latency and
> higher scores at the same or even reduced power consumption.
> - For other benchmarks that do not involve uclamp, we have not observed
> any noticeable regressions.
> - This series is the entirety of sum aggregation. No groundwork or
> refactoring in the scheduler is needed. The complexity of several
> existing uclamp-related functions is massively reduced and sum
> aggregation code is less than half of that in max aggregation (304+,
> 701-). The complexity gap will be even greater with all the ongoing
> patches for max aggregation.
>
> DECOMPOSITION OF SUM AGGREGATION:
>
> - Patch 1 reverts some max aggregation code. Sum aggregation shows no

Sorry, I don't get what you mean here by reverting some max aggregation
code. This is not really linked to max aggregation but to how EAS and
feec() try to compute energy.

The code that is reverted here mainly highlights where your problem is
and, more generally speaking, one of the problems of the current
algorithm of EAS and feec() when it tries to estimate the energy cost.
We don't take into account that a utilization becoming larger than the
CPU capacity means more running time. Such a problem has been
highlighted previously and is one root cause of the problems that you
are trying to solve here.

> such problems so mitigation patches are not necessary, and that
> patch has other undesirable side effects.
> - Patch 2 and 3 introduce new sum aggregated signals to be a more
> effective hint to the scheduler. Patch 3 employs a math trick to make
> it significantly simpler to track on-rq and off-rq task utilization
> contributions.
> - Patch 4, 5 and 6 start using the new signal while significantly
> simplifying existing uclamp code, including the total removal of
> uclamp buckets and max aggregation.
> - Patch 7 and part of 6 remove the overhead of uclamp on task
> enqueue and dequeue, because uclamp values and buckets no longer need
> to be re-computed every time a uclamp'ed task joins or leaves the rq.
>
> TESTING:
>
> Sum aggregation generally performs better in tests. Two notebooks, max
> vs. sum aggregation, are shared at
>
> https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/max.ipynb
> https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/sum.ipynb
>
> The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are
> little cores with capacity of 383. CPU4-5 are big cores. The rt-app
> profiles used for these experiments are included in the notebooks.
>
> Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.
>
> The scheduling decisions are plotted in Out[11]. Both max and sum
> aggregation recognizes the UCLAMP_MAX hints and run all the threads
> on the little Performance Domain (PD). However, max aggregation in the
> test shows some randomness in task placement, and we see the 4 threads
> are often not evenly distributed across the 4 little CPUs. This uneven
> task placement is also the reason why we revert the patch:
>
> "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"
>
> When the spare capacity is 0, the reverted patch tends to gather many
> tasks on the same CPU because its EM calculation is bogus and it thinks
> this placement is more energy efficient.

So the problem is not max vs. sum aggregation but fixing feec().

>
> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
> slightly above the capacity of the little CPU.
>
> Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
> on the big core but not to run at the highest OPP. Both max and sum
> aggregation accomplish this task successfully, running the two threads
> at the big cluster while not driving the frequency at the max.

No problem here

>
> Scenario 3: Task A is a task with a small utilization pinned to CPU4.
> Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
> at 300. After a while, task A is then pinned to CPU5, joining B.
>
> Results are in Out[23]. The util_avg curve is the original root CFS
> util_avg. The root_cfs_util_uclamp is the root CFS utilization after
> considering uclamp. Max aggregation sees a frequency spike at 751.7s.
> When zoomed in, one can see square-wave-like utilization and CPU

Side note: there is no graph shown in your links, so there is nothing
to see or zoom into.

> frequency values because of task A periodically going to sleep. When A
> wakes up, its default UCLAMP_MAX of 1024 will uncap B and reach the
> highest CPU frequency. When A sleeps, B's UCLAMP_MAX will be in effect
> and will reduce rq utilization to 300. This happens repeatedly, hence
> the square wave. In contrast, sum aggregation sees a normal increase in

What do you mean by a normal increase?

> utilization when A joins B, at around 430.64s, without any square-wave
> behavior. The CPU frequency also stays stable while the two tasks are on
> the same rq.

It is difficult to make any comment as there is no graph to see in your
link.

What should be the right frequency in such a case?

>
> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the
> little PD (CPU0-3). 4 same tasks pinned to the big PD (CPU4-5).
> After a while, remove the CPU pinning of the 4 tasks on the big PD.
>
> Results are in Out[29]. Both max and sum aggregation understand that we
> can move the 4 tasks from the big PD to the little PD to reduce power
> because of the UCLAMP_MAX hints. However, max aggregation shows severely
> unbalanced task placement, scheduling 5 tasks on CPU0 while 1 each on
> CPU1-3. Sum aggregation schedules 2 tasks on each little CPU, honoring
> UCLAMP_MAX while maintaining balanced task placement.
>
> Again, this unbalanced task placement is the reason why we reverted:
>
> "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

So then it's not a matter of sum vs. max aggregation but of fixing the
EAS policy to place tasks correctly.

>
> Scenario 5: 8 tasks with UCLAMP_MAX of 120.
>
> This test is similar to Scenario 4, shown in Out[35]. Both max and sum
> aggregation understand the UCLAMP_MAX hints and schedule all tasks on
> the 4 little CPUs. Max aggregation again shows unstable and unbalanced
> task placement while sum aggregation schedules 2 tasks on each little
> CPU, and the task placement remains stable. The total task residency is
> shown in Out[36], showing how unbalanced max aggregation is.

same comment as previous test

>
> BENCHMARKS:
>
> Geekbench 6, no uclamp (on Rock-5B board)
> +-----+-------------+------------+
> | | Single-core | Multi-core |
> +-----+-------------+------------+
> | Max | 801.3 | 2976.8 |
> | Sum | 802.8 | 2989.2 |
> +-----+-------------+------------+
>
> No regression is seen after switching to sum aggregation.
>
> Jankbench (backporting sum aggregation to Pixel 6 Android 5.18 mainline kernel):
>
> Jank percentage:
> +------+-------+-----------+
> | | value | perc_diff |
> +------+-------+-----------+
> | main | 1.1 | 0.00% |
> | sum | 0.5 | -53.92% |
> +------+-------+-----------+
>
> Average power:
> +------------+------+-------+-----------+
> | | tag | value | perc_diff |
> +------------+------+-------+-----------+
> | CPU | max | 166.1 | 0.00% |
> | CPU-Big | max | 55.1 | 0.00% |
> | CPU-Little | max | 91.7 | 0.00% |
> | CPU-Mid | max | 19.2 | 0.00% |
> | CPU | sum | 161.3 | -2.85% |
> | CPU-Big | sum | 52.9 | -3.97% |
> | CPU-Little | sum | 86.7 | -5.39% |
> | CPU-Mid | sum | 21.6 | 12.63% |
> +------------+------+-------+-----------+
>
> UIBench (backporting sum aggregation to Pixel 6 Android 6.6 mainline kernel):
>
> Jank percentage:
> +-------------+-------+-----------+
> | tag | value | perc_diff |
> +-------------+-------+-----------+
> | max_aggr | 0.3 | 0.0 |
> | sum_aggr | 0.26 | -12.5 |
> +-------------+-------+-----------+
>
> Average input latency:
> +-------------+--------+-----------+
> | tag | value | perc_diff |
> +-------------+--------+-----------+
> | max_aggr | 107.39 | 0.0 |
> | sum_aggr | 81.135 | -24.5 |
> +-------------+--------+-----------+
>
> Average power:
> +------------+--------------+--------+-----------+
> | channel | tag | value | perc_diff |
> +------------+--------------+--------+-----------+
> | CPU | max_aggr | 209.85 | 0.0% |
> | CPU-Big | max_aggr | 89.8 | 0.0% |
> | CPU-Little | max_aggr | 94.45 | 0.0% |
> | CPU-Mid | max_aggr | 25.45 | 0.0% |
> | GPU | max_aggr | 22.9 | 0.0% |
> | Total | max_aggr | 232.75 | 0.0% |
> | CPU | sum_aggr | 206.05 | -1.81% |
> | CPU-Big | sum_aggr | 84.7 | -5.68% |
> | CPU-Little | sum_aggr | 94.9 | 0.48% |
> | CPU-Mid | sum_aggr | 26.3 | 3.34% |
> | GPU | sum_aggr | 22.45 | -1.97% |
> | Total | sum_aggr | 228.5 | -1.83% |
> +------------+--------------+--------+-----------+
>
> It should be noted that sum aggregation reduces jank and reduces input
> latency while consuming less power.
>
> VM cpufreq hypercall driver [1], on Rock-5B board. Baseline indicates a
> setup without the VM cpufreq driver:
>
> Geekbench 6 uncontended. No other host threads running.
> +------+-------------+-----------+------------+-----------+
> | | Single-core | perc_diff | Multi-core | perc_diff |
> +------+-------------+-----------+------------+-----------+
> | base | 796.4 | 0 | 2947.0 | 0 |
> | max | 795.6 | -0.10 | 2929.6 | -0.59 |
> | sum | 794.6 | -0.23 | 2935.6 | -0.39 |
> +------+-------------+-----------+------------+-----------+
>
> Geekbench 6 contended. Host CPUs each has a 50% duty-cycle task running.
> +------+-------------+-----------+------------+-----------+
> | | Single-core | perc_diff | Multi-core | perc_diff |
> +------+-------------+-----------+------------+-----------+
> | base | 604.6 | 0 | 2330.2 | 0 |
> | max | 599.4 | -0.86 | 2327.2 | -0.13 |
> | sum | 645.8 | 6.81 | 2336.6 | 0.27 |
> +------+-------------+-----------+------------+-----------+
>
> VM CPUfreq driver using sum aggregation outperforms max aggregation when
> the host is contended. When the host has no contention (only the VM
> vCPUs are running and each host CPU accommodates one guest vCPU), the
> two aggregation methods are roughly the same, and a bit surprisingly,
> offers no speed-up versus the baseline. This is probably because of the
> overhead of hypercalls, and the fact that Geekbench is CPU intensive and
> is not the best workload to show the effectiveness of VM cpufreq driver.
> We will try to benchmark on more VM workloads.
>
> LIMITATIONS:
>
> 1. RT sum aggregation is not shown in the series.
>
> A proof-of-concept RT sum aggregation implementation is done and going
> through testing, with < 50 lines of code, using the same ideas as in
> CFS. They will be sent out separately if we can agree on CFS sum
> aggregation and once the testing is done.
>
> 2. A heavily UCLAMP_MAX-throttled task may prevent the CFS rq from
> triggering over-utilization.
>
> For example, two always-running task each having utilization of 512. If
> one of the task is severely UCLAMP_MAX restricted, say, with a
> UCLAMP_MAX of 1, then the total CFS sum aggregation will be 512 + 1 =
> 513, which won't trigger over-utilization even though the other task has
> no UCLAMP_MAX and wants more performance.

But what if they are pinned to the same CPU, or the other CPUs are
already fully busy? This is a blocking point because the uclamp of one
task then impacts the CPU bandwidth of another, unclamped task.

>
> I'm working on a fix for this problem. This at the moment can be solved
> by either not giving long-running tasks ridiculously low UCLAMP_MAX
> values, or adjusting the priority of UCLAMP_MAX tasks to make sure it
> does not get a share of CPU run-time that vastly exceeds its UCLAMP_MAX.
> However, my personal view is that maybe UCLAMP_MIN and UCLAMP_MAX just
> don't belong together, and the proper way is to have a per-task

No, this is a blocking point. We don't want to add a dependency between
nice priority and uclamp.

> bandwidth throttling mechanism and what we want as UCLAMP_MAX maybe
> actually belongs to that mechanism.
>
> However, since the Android GROUP_THROTTLE feature [2] has the exact same
> problem and has been used in actual products, we don't think this is a
> big limitation in practice.

The fact that an out-of-tree patchset is fine with going ahead with a
wrong behavior does not make this patchset acceptable.

>
> [1]: https://lore.kernel.org/all/[email protected]/
> [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510
>
> ---
> Changed in v2:
> - Rework util_avg_uclamp to be closer to the style of util_est.
> - Rewrite patch notes to reflect the new style.
> - Add the discussion of the under-utilizated example in limitations,
> found by Vincent G.
> - Remove task group uclamp to focus on tasks first.
> - Fix several bugs in task migration.
> - Add benchmark numbers from UIBench and VM cpufreq.
> - Update python notebooks to reflect the latest max vs. sum aggregation.
>
> Hongyan Xia (7):
> Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
> 0"
> sched/uclamp: Track uclamped util_avg in sched_avg
> sched/uclamp: Introduce root_cfs_util_uclamp for rq
> sched/fair: Use CFS util_avg_uclamp for utilization and frequency
> sched/fair: Massively simplify util_fits_cpu()
> sched/uclamp: Remove all uclamp bucket logic
> sched/uclamp: Simplify uclamp_eff_value()
>
> include/linux/sched.h | 7 +-
> init/Kconfig | 32 ---
> kernel/sched/core.c | 324 +++---------------------------
> kernel/sched/cpufreq_schedutil.c | 10 +-
> kernel/sched/fair.c | 333 +++++++------------------------
> kernel/sched/pelt.c | 144 ++++++++++++-
> kernel/sched/pelt.h | 6 +-
> kernel/sched/rt.c | 4 -
> kernel/sched/sched.h | 145 ++++++--------
> 9 files changed, 304 insertions(+), 701 deletions(-)
>
> --
> 2.34.1
>

2024-02-12 14:33:53

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On 12/02/2024 09:16, Vincent Guittot wrote:
> On Thu, 1 Feb 2024 at 14:12, Hongyan Xia <[email protected]> wrote:
>>
>> Current implementation of uclamp leaves many things to be desired.
>> There are three noteworthy problems:
>>
>> 1. Max aggregation only takes into account the max uclamp value. All
>> other uclamp values are not in effect.
>> 2. Uclamp max aggregation gives UCLAMP_MIN and UCLAMP_MAX at the rq
>> level, and whether that translates to the desired performance of a
>> specific task is unknown, because it depends on other tasks on rq.
>> 3. Complexity. Uclamp max aggregation now sits at more than 750 lines of
>> code and there is ongoing work to rework several interfaces to
>> prepare for further uclamp changes. Uclamp max aggregation itself
>> also needs future improvements in uclamp filtering and load balancing
>>
>> The first 2 points can manifest into the following symptoms,
>>
>> 1. High-rate OPP changes ("frequency spike" problem). An always-running
>> task with a UCLAMP_MAX of 200 will drive the CPU at 200 even though
>> its utilization is 1024. However, when a util_avg task of 300 but
>> with default UCLAMP_MAX of 1024 joins the rq, the rq UCLAMP_MAX will
>> be uncapped, and the UCLAMP_MAX of the first task is no longer in
>> effect therefore driving the CPU at 1024, the highest OPP. When the
>> second task sleeps, the OPP will be reduced to 200. This fast and
>> sudden OPP switch every time the 2nd task wakes up or sleeps is
>> unnecessary.
>> 2. Using UCLAMP_MIN to boost performance under max aggregation has been
>> shown to have weaker effectiveness than "sum aggregated" approaches,
>> including the util_guest proposal [1] and uclamp sum aggregation in
>> this series. The performance level of UCLAMP_MIN for a task under max
>> aggregation is unpredictable when there are more than 1 task runnable
>> on the rq.
>>
>> This series solves these problems by tracking a
>> util_avg_uclamp signal in tasks and root cfs_rq. At task level,
>> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
>> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
>> must always be the sum of all util_avg_uclamp of all the entities on
>> this cfs_rq. As a result, rq->cfs.avg.util_avg_uclamp is the sum
>> aggregation of all the clamped values, which hints the frequency
>> this rq should run at and what the utilization is. This proposal has
>> some similarities to the util_guest series for VM workloads, in that it
>> brings the desired performance to the task that requested it, not to the
>> rq, in which the share of the task is unpredictable.
>
> As I mentioned in your previous version, I don't want such mixed
> signals, which are hard to maintain (as you then try to compensate for
> all the side effects, compensating for some decaying as an example) and
> don't mean anything in the end. You must find another way if you want
> to go in the direction of sum aggregation.

The new style in v2 follows util_est and the old scheduler way of
tracking root CFS. The old root CFS tracking was changed because we
didn't want the cost of updating all the intermediate task group levels,
and this series brings back root CFS sum tracking without that task
group cost.

I'm not convinced that this new signal shouldn't exist or that it is
hard to maintain, when it has proven to be more effective than max
aggregation, which is more than twice the amount of code (and still
growing, to fix all the different cases) and whose bucket logic is
already fairly complicated to understand and maintain. I also just
learned that the math to decay the CFS sum in Patch 3/7 isn't something
I invented but what CFS itself used to do.

>>
>> Note: This new signal does not change the existing PELT signal. The new
>> signal is only an extra hint to the scheduler to improve scheduling
>> decisions.
>>
>> TL;DR OF THIS SERIES:
>>
>> - Our evaluation shows significantly better effectiveness than max
>> aggregation. UI benchmarks and VM workloads have improved latency and
>> higher scores at the same or even reduced power consumption.
>> - For other benchmarks that do not involve uclamp, we have not observed
>> any noticeable regressions.
>> - This series is the entirety of sum aggregation. No groundwork or
>> refactoring in the scheduler is needed. The complexity of several
>> existing uclamp-related functions is massively reduced and sum
>> aggregation code is less than half of that in max aggregation (304+,
>> 701-). The complexity gap will be even greater with all the ongoing
>> patches for max aggregation.
>>
>> DECOMPOSITION OF SUM AGGREGATION:
>>
>> - Patch 1 reverts some max aggregation code. Sum aggregation shows no
>
> Sorry, I don't get what you mean here by reverting some max aggregation
> code. This is not really linked to max aggregation but to how EAS and
> feec() try to compute energy.

I agree. Strictly speaking this is EAS related and not directly related
to max aggregation. I'll reword the message.

>
> The code that is reverted here mainly highlights where your problem is
> and, more generally speaking, one of the problems of the current
> algorithm of EAS and feec() when it tries to estimate the energy cost.
> We don't take into account that a utilization becoming larger than the
> CPU capacity means more running time. Such a problem has been
> highlighted previously and is one root cause of the problems that you
> are trying to solve here.

Under max aggregation, one always-running task capped at 200 on a CPU
and 10 such tasks on the same CPU result in the same total running time
and CPU frequency. I call this the 'infinite tasks problem' because
theoretically you can have an infinite number of such tasks on the same
CPU without max aggregation complaining.
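
Purely as an illustration (the numbers and the 1024 capacity cap are
assumptions, not taken from the series), the difference between the two
rules for N such tasks looks like this:

#include <stdio.h>

int main(void)
{
	unsigned int capacity = 1024, clamp = 200, n;

	for (n = 1; n <= 10; n++) {
		/* Max aggregation: the rq-level hint is max(200, ..., 200). */
		unsigned int max_aggr = clamp;
		/* Sum aggregation: clamped contributions add up and saturate. */
		unsigned int sum_aggr = n * clamp;

		if (sum_aggr > capacity)
			sum_aggr = capacity;

		printf("N=%2u  max-aggregated=%4u  sum-aggregated=%4u\n",
		       n, max_aggr, sum_aggr);
	}
	return 0;
}

The max-aggregated hint never moves no matter how many such tasks pile
onto the CPU, whereas the sum-aggregated signal grows with each extra
task until it hits the CPU capacity.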

This problem and the reverted patch both suggest to me that, yes, we
need to consider the time axis like you said. However, I must say it
feels hacky when the uclamp implementation needs to take extra
parameters into account to work well.

>> such problems so mitigation patches are not necessary, and that
>> patch has other undesirable side effects.
>> - Patch 2 and 3 introduce new sum aggregated signals to be a more
>> effective hint to the scheduler. Patch 3 employs a math trick to make
>> it significantly simpler to track on-rq and off-rq task utilization
>> contributions.
>> - Patch 4, 5 and 6 start using the new signal while significantly
>> simplifying existing uclamp code, including the total removal of
>> uclamp buckets and max aggregation.
>> - Patch 7 and part of 6 remove the overhead of uclamp on task
>> enqueue and dequeue, because uclamp values and buckets no longer need
>> to be re-computed every time a uclamp'ed task joins or leaves the rq.
>>
>> TESTING:
>>
>> Sum aggregation generally performs better in tests. Two notebooks, max
>> vs. sum aggregation, are shared at
>>
>> https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/max.ipynb
>> https://nbviewer.org/github/honxia02/notebooks/blob/618de22a8da96205015fefabee203536683bd4e2/whitebox/sum.ipynb
>>
>> The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are
>> little cores with capacity of 383. CPU4-5 are big cores. The rt-app
>> profiles used for these experiments are included in the notebooks.
>>
>> Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.
>>
>> The scheduling decisions are plotted in Out[11]. Both max and sum
>> aggregation recognizes the UCLAMP_MAX hints and run all the threads
>> on the little Performance Domain (PD). However, max aggregation in the
>> test shows some randomness in task placement, and we see the 4 threads
>> are often not evenly distributed across the 4 little CPUs. This uneven
>> task placement is also the reason why we revert the patch:
>>
>> "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"
>>
>> When the spare capacity is 0, the reverted patch tends to gather many
>> tasks on the same CPU because its EM calculation is bogus and it thinks
>> this placement is more energy efficient.
>
> So the problem is not max vs. sum aggregation but fixing feec().

As in my previous comment, I agree, although I think max aggregation
makes an accurate feec() tricky when the whole rq can be capped and
uncapped by a single task, and running time may need to be considered
to get it to work well.

>>
>> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
>> slightly above the capacity of the little CPU.
>>
>> Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
>> on the big core but not to run at the highest OPP. Both max and sum
>> aggregation accomplish this task successfully, running the two threads
>> at the big cluster while not driving the frequency at the max.
>
> No problem here
>
>>
>> Scenario 3: Task A is a task with a small utilization pinned to CPU4.
>> Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
>> at 300. After a while, task A is then pinned to CPU5, joining B.
>>
>> Results are in Out[23]. The util_avg curve is the original root CFS
>> util_avg. The root_cfs_util_uclamp is the root CFS utilization after
>> considering uclamp. Max aggregation sees a frequency spike at 751.7s.
>> When zoomed in, one can see square-wave-like utilization and CPU
>
> Side note: there is no graph shown in your links, so there is nothing
> to see or zoom into.

Hmm, I confirmed with a couple of people that the graphs showed up
before sending out the series. This sometimes happens when you have an
ad or cookie blocker. Maybe disabling them or viewing in a private
window helps. If it doesn't, I'll try other ways of sharing the
results.

>> frequency values because of task A periodically going to sleep. When A
>> wakes up, its default UCLAMP_MAX of 1024 will uncap B and reach the
>> highest CPU frequency. When A sleeps, B's UCLAMP_MAX will be in effect
>> and will reduce rq utilization to 300. This happens repeatedly, hence
>> the square wave. In contrast, sum aggregation sees a normal increase in
>
> What do you mean by a normal increase?

I mean not having a frequency spike and not jumping between the highest
and lowest OPP on every time slice.

>> utilization when A joins B, at around 430.64s, without any square-wave
>> behavior. The CPU frequency also stays stable while the two tasks are on
>> the same rq.
>
> It is difficult to make any comment as there is no graph to see in
> your link.
>
> What should be the right frequency in such a case?

A frequency high enough to satisfy util(A) and clamped(B).
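
To put rough, purely hypothetical numbers on that (A's utilization is
only described as "small" in the profile, so 100 here is an assumption;
B is always-running and capped at 300):

#include <stdio.h>

int main(void)
{
	/* Assumed figures: util(A) ~ 100, util(B) ~ 1024 capped at 300. */
	unsigned int util_a = 100;
	unsigned int util_b = 1024, b_uclamp_max = 300;

	unsigned int b_clamped = util_b > b_uclamp_max ? b_uclamp_max : util_b;
	unsigned int sum = util_a + b_clamped;	/* ~400, and it stays there */

	printf("sum-aggregated request = %u\n", sum);
	/*
	 * Max aggregation instead flips between ~300 (A asleep, B's cap in
	 * effect) and 1024 (A runnable, rq uncapped), hence the square wave.
	 */
	return 0;
}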

>>
>> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the
>> little PD (CPU0-3). 4 same tasks pinned to the big PD (CPU4-5).
>> After a while, remove the CPU pinning of the 4 tasks on the big PD.
>>
>> Results are in Out[29]. Both max and sum aggregation understand that we
>> can move the 4 tasks from the big PD to the little PD to reduce power
>> because of the UCLAMP_MAX hints. However, max aggregation shows severely
>> unbalanced task placement, scheduling 5 tasks on CPU0 while 1 each on
>> CPU1-3. Sum aggregation schedules 2 tasks on each little CPU, honoring
>> UCLAMP_MAX while maintaining balanced task placement.
>>
>> Again, this unbalanced task placement is the reason why we reverted:
>>
>> "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"
>
> So then it's not a matter of sum vs max aggregation but of fixing the
> EAS policy to place tasks correctly
>
>>
>> Scenario 5: 8 tasks with UCLAMP_MAX of 120.
>>
>> This test is similar to Scenario 4, shown in Out[35]. Both max and sum
>> aggregation understand the UCLAMP_MAX hints and schedule all tasks on
>> the 4 little CPUs. Max aggregation again shows unstable and unbalanced
>> task placement while sum aggregation schedules 2 tasks on each little
>> CPU, and the task placement remains stable. The total task residency is
>> shown in Out[36], showing how unbalanced max aggregation is.
>
> same comment as previous test
>
>>
>> BENCHMARKS:
>>
>> Geekbench 6, no uclamp (on Rock-5B board)
>> +-----+-------------+------------+
>> | | Single-core | Multi-core |
>> +-----+-------------+------------+
>> | Max | 801.3 | 2976.8 |
>> | Sum | 802.8 | 2989.2 |
>> +-----+-------------+------------+
>>
>> No regression is seen after switching to sum aggregation.
>>
>> Jankbench (backporting sum aggregation to Pixel 6 Android 5.18 mainline kernel):
>>
>> Jank percentage:
>> +------+-------+-----------+
>> | | value | perc_diff |
>> +------+-------+-----------+
>> | main | 1.1 | 0.00% |
>> | sum | 0.5 | -53.92% |
>> +------+-------+-----------+
>>
>> Average power:
>> +------------+------+-------+-----------+
>> | | tag | value | perc_diff |
>> +------------+------+-------+-----------+
>> | CPU | max | 166.1 | 0.00% |
>> | CPU-Big | max | 55.1 | 0.00% |
>> | CPU-Little | max | 91.7 | 0.00% |
>> | CPU-Mid | max | 19.2 | 0.00% |
>> | CPU | sum | 161.3 | -2.85% |
>> | CPU-Big | sum | 52.9 | -3.97% |
>> | CPU-Little | sum | 86.7 | -5.39% |
>> | CPU-Mid | sum | 21.6 | 12.63% |
>> +------------+------+-------+-----------+
>>
>> UIBench (backporting sum aggregation to Pixel 6 Android 6.6 mainline kernel):
>>
>> Jank percentage:
>> +-------------+-------+-----------+
>> | tag | value | perc_diff |
>> +-------------+-------+-----------+
>> | max_aggr | 0.3 | 0.0 |
>> | sum_aggr | 0.26 | -12.5 |
>> +-------------+-------+-----------+
>>
>> Average input latency:
>> +-------------+--------+-----------+
>> | tag | value | perc_diff |
>> +-------------+--------+-----------+
>> | max_aggr | 107.39 | 0.0 |
>> | sum_aggr | 81.135 | -24.5 |
>> +-------------+--------+-----------+
>>
>> Average power:
>> +------------+--------------+--------+-----------+
>> | channel | tag | value | perc_diff |
>> +------------+--------------+--------+-----------+
>> | CPU | max_aggr | 209.85 | 0.0% |
>> | CPU-Big | max_aggr | 89.8 | 0.0% |
>> | CPU-Little | max_aggr | 94.45 | 0.0% |
>> | CPU-Mid | max_aggr | 25.45 | 0.0% |
>> | GPU | max_aggr | 22.9 | 0.0% |
>> | Total | max_aggr | 232.75 | 0.0% |
>> | CPU | sum_aggr | 206.05 | -1.81% |
>> | CPU-Big | sum_aggr | 84.7 | -5.68% |
>> | CPU-Little | sum_aggr | 94.9 | 0.48% |
>> | CPU-Mid | sum_aggr | 26.3 | 3.34% |
>> | GPU | sum_aggr | 22.45 | -1.97% |
>> | Total | sum_aggr | 228.5 | -1.83% |
>> +------------+--------------+--------+-----------+
>>
>> It should be noted that sum aggregation reduces jank and reduces input
>> latency while consuming less power.
>>
>> VM cpufreq hypercall driver [1], on Rock-5B board. Baseline indicates a
>> setup without the VM cpufreq driver:
>>
>> Geekbench 6 uncontended. No other host threads running.
>> +------+-------------+-----------+------------+-----------+
>> | | Single-core | perc_diff | Multi-core | perc_diff |
>> +------+-------------+-----------+------------+-----------+
>> | base | 796.4 | 0 | 2947.0 | 0 |
>> | max | 795.6 | -0.10 | 2929.6 | -0.59 |
>> | sum | 794.6 | -0.23 | 2935.6 | -0.39 |
>> +------+-------------+-----------+------------+-----------+
>>
>> Geekbench 6 contended. Each host CPU has a 50% duty-cycle task running.
>> +------+-------------+-----------+------------+-----------+
>> | | Single-core | perc_diff | Multi-core | perc_diff |
>> +------+-------------+-----------+------------+-----------+
>> | base | 604.6 | 0 | 2330.2 | 0 |
>> | max | 599.4 | -0.86 | 2327.2 | -0.13 |
>> | sum | 645.8 | 6.81 | 2336.6 | 0.27 |
>> +------+-------------+-----------+------------+-----------+
>>
>> The VM cpufreq driver using sum aggregation outperforms max aggregation
>> when the host is contended. When the host has no contention (only the
>> VM vCPUs are running and each host CPU accommodates one guest vCPU),
>> the two aggregation methods are roughly the same and, a bit
>> surprisingly, offer no speed-up versus the baseline. This is probably
>> because of the overhead of hypercalls, and because Geekbench is
>> CPU-intensive and not the best workload to show the effectiveness of
>> the VM cpufreq driver. We will try to benchmark more VM workloads.
>>
>> LIMITATIONS:
>>
>> 1. RT sum aggregation is not shown in the series.
>>
>> A proof-of-concept RT sum aggregation implementation is done and going
>> through testing, with < 50 lines of code, using the same ideas as in
>> CFS. It will be sent out separately once the testing is done and if we
>> can agree on CFS sum aggregation.
>>
>> 2. A heavily UCLAMP_MAX-throttled task may prevent the CFS rq from
>> triggering over-utilization.
>>
>> For example, consider two always-running tasks, each with a utilization
>> of 512. If one of the tasks is severely UCLAMP_MAX-restricted, say with
>> a UCLAMP_MAX of 1, then the total CFS sum aggregation will be 512 + 1 =
>> 513, which won't trigger over-utilization even though the other task
>> has no UCLAMP_MAX and wants more performance.
>
> But what if they are pinned to the same CPU, or the other CPUs are
> already fully busy? This is a blocking point because the uclamp of one
> task then impacts the CPU bandwidth of another, unclamped task

True. This is a limitation I acknowledge. I'm working on a fix.
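
To put rough numbers on the quoted example, a toy calculation (the real
over-utilization check has more to it than the ~20% margin sketched
here):

#include <stdbool.h>
#include <stdio.h>

/* Roughly the kernel's "fits with ~20% headroom" margin; illustrative. */
static bool fits(unsigned long util, unsigned long cap)
{
	return util * 1280 < cap * 1024;
}

int main(void)
{
	unsigned long cap = 1024;
	unsigned long sum_uclamp = 512 + 1;	/* clamped sum: 513 */
	unsigned long sum_real   = 512 + 512;	/* what actually runs: 1024 */

	/* 513 still "fits" 1024, so over-utilization is not triggered,
	 * even though the unclamped task alone wants far more. */
	printf("clamped sum fits: %d, real sum fits: %d\n",
	       fits(sum_uclamp, cap), fits(sum_real, cap));
	return 0;
}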

>>
>> I'm working on a fix for this problem. At the moment this can be worked
>> around either by not giving long-running tasks ridiculously low
>> UCLAMP_MAX values, or by adjusting the priority of UCLAMP_MAX tasks so
>> that they do not get a share of CPU run-time that vastly exceeds their
>> UCLAMP_MAX. However, my personal view is that maybe UCLAMP_MIN and
>> UCLAMP_MAX just don't belong together, and the proper way is to have a
>> per-task
>
> No, this is a blocking point. We don't want to add a dependency between
> nice priority and uclamp.

Actually, in my testing I find the 'infinite tasks' problem to be a far
bigger blocker than this one. Under max aggregation, uclamp_max tasks are
often placed on the same CPU because the energy calculation is bogus,
just like in Scenario 4. I'm just saying that at least the
under-utilization problem can easily be worked around with priorities
when you use uclamp_max in sum aggregation.

>> bandwidth throttling mechanism and what we want as UCLAMP_MAX maybe
>> actually belongs to that mechanism.
>>
>> However, since the Android GROUP_THROTTLE feature [2] has the exact same
>> problem and has been used in actual products, we don't think this is a
>> big limitation in practice.
>
> Just because an out-of-tree patchset is OK with going with the wrong
> behavior does not make this patchset acceptable

I completely agree.

I was not arguing that what somebody else does makes a limitation
acceptable. I was saying that often one doesn't run into that
limitation. It's rare in practice that one wants a task to run at normal
or even higher priority but with a heavily throttled uclamp_max. In the
testing and evaluation, one runs into the limitations of max aggregation
more often than seeing this problem in sum aggregation. But, you are
right, this problem should be fixed.

Finally, I may not have expressed this clearly in email threads, but I'm
not strongly against max aggregation; it's just that it has many
drawbacks that need to be addressed, and we do not yet know how many more
lines of code are needed and how well it will perform in the end. If we
have a way forward and the effectiveness is shown by evaluation numbers,
then I don't see why we can't go with the current uclamp implementation.

>>
>> [1]: https://lore.kernel.org/all/[email protected]/
>> [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510
>>
>> ---
>> Changed in v2:
>> - Rework util_avg_uclamp to be closer to the style of util_est.
>> - Rewrite patch notes to reflect the new style.
>> - Add the discussion of the under-utilized example in limitations,
>> found by Vincent G.
>> - Remove task group uclamp to focus on tasks first.
>> - Fix several bugs in task migration.
>> - Add benchmark numbers from UIBench and VM cpufreq.
>> - Update python notebooks to reflect the latest max vs. sum aggregation.
>>
>> Hongyan Xia (7):
>> Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
>> 0"
>> sched/uclamp: Track uclamped util_avg in sched_avg
>> sched/uclamp: Introduce root_cfs_util_uclamp for rq
>> sched/fair: Use CFS util_avg_uclamp for utilization and frequency
>> sched/fair: Massively simplify util_fits_cpu()
>> sched/uclamp: Remove all uclamp bucket logic
>> sched/uclamp: Simplify uclamp_eff_value()
>>
>> include/linux/sched.h | 7 +-
>> init/Kconfig | 32 ---
>> kernel/sched/core.c | 324 +++---------------------------
>> kernel/sched/cpufreq_schedutil.c | 10 +-
>> kernel/sched/fair.c | 333 +++++++------------------------
>> kernel/sched/pelt.c | 144 ++++++++++++-
>> kernel/sched/pelt.h | 6 +-
>> kernel/sched/rt.c | 4 -
>> kernel/sched/sched.h | 145 ++++++--------
>> 9 files changed, 304 insertions(+), 701 deletions(-)
>>
>> --
>> 2.34.1
>>

2024-02-20 15:49:44

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On 02/06/24 17:32, Hongyan Xia wrote:
> On 06/02/2024 15:20, Qais Yousef wrote:
> > On 02/01/24 13:11, Hongyan Xia wrote:
> >
> > > [1]: https://lore.kernel.org/all/[email protected]/
> >
> > Their solution is not acceptable for the same reason yours isn't. Saravana and
> > David know this and we discussed at LPC. uclamp hints are limits and should not
> > be summed.
>
> Uclamp is a performance hint and nothing in its definition says it can't be

That's the major problem here. The definition says it can't be summed because
they're limits. So if two tasks boosted to 1024, then their combined
performance limit is 2048? Or if 4 tasks are capped to 512, then the combined
limits is 2048? You're changing the behavior. If not, please help me understand
how not. You're treating them as 'bandwidth' hints.

> summed. Clearly whether a uclamp approach should be taken should be
> determined by how well it works as a hint, not by how we calculate it. I
> would not say I want to reject max aggregation simply because it throws away
> all other uclamp values except the max. It's because I have real evaluation
> results showing sum aggregation works as a much better hint.

It is easy to get numbers that show improvements. That doesn't
necessarily mean it is correct.

>
> > > [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510
> >
> > I think I clarified several times so far that this is not related to uclamp.
> > Could you please refrain from referring to it again in the future? This is
> > misleading and neither helps your cause nor its cause. The fact that you're
> > relating to it makes me very worried as both links demonstrate lack of
> > understanding/confusion of what uclamp is supposed to be.
>
> The intention of the code is irrelevant. What I'm talking about is what

It is relevant.

1. You failed to get consent on using this as an example.
2. You failed to demonstrate the purpose of the code and why it was
actually added. I don't think you know the full details anyway.
3. You're using a hack that is problematic and needs to be deleted as
proof of something good that _we do not want_.
4. How do you know it works the way you think it does? You're making
wrong assumptions that are misleading and quite frankly annoying.

Generally we don't want to discuss this in this venue. So for the last time
please stop bringing it up altogether and let's focus on the problems you are
seeing and trying to fix and leave other people's code out of the discussion.

> effect the code actually has. The fact that you keep thinking I don't
> understand what the code does even after me explaining "I know what the
> intention of the code is, I'm just talking about the actual effect of the
> code" is an even more worrying sign.
>
> > Again, this solution is not acceptable and you're moving things in the wrong
> > direction. We don't want to redesign what uclamp means, but fix some corner
> > cases. And you're doing the former not the latter.
>
> I'm saying max aggregation is not effective and proposing a more effective
> implementation. In fact, you have sent a series that removes max
> aggregation. Clearly that does not count as fixing corner cases but is
> actually a redesign, and I don't understand why you are allowed to do such
> things and I am not. Also, when something becomes harder and harder to fix,
> a redesign that solves all the problems is clearly justified.

I don't have issues with max aggregation removal. I actually would love to see
it gone. But summing uclamp values changes the meaning of uclamp hints. Which
is my number 1 concern.

And you're lumping too many problems onto max aggregation. It was brought
up several times that the overutilized definition being tied to misfit is
not fit for purpose anymore. This needs to be fixed. feec() fails to
distribute fairly when capacities are equal. This is a known old problem.
This needs fixing. Energy computation needs to evolve to deal with a
number of new realities. This needs fixing. The load balancer can't
correct bad wake-up decisions in a way aligned with feec(); this needs
fixing. Adding a new signal is not the right way to fix these problems.
And the only issue with max-aggregation in practice is the fact that it
is not effective in restricting power use for uclamp_max. It also has the
minor problem of boosting unnecessarily for the whole time the task is
enqueued. The correct behavior, which we want and which other vendors
have expressed to me in the past, is to apply the boost/cap only when
required, i.e. when the task is running. The hardware can cope with this.
And this is tied to how the cpufreq governor works; if there are hardware
limitations that make this hard to achieve, then it is up to the governor
to decide this, not the scheduler. We already have util_avg and util_est.
I really don't think we need another addition. We just need to fix issues
elsewhere and improve the interface with the cpufreq governor to honour
the performance limits the tasks have, so that we can get rid of
max-aggregation.

>
> What I can summarize from sum aggregation is:
>
> Pros:
> 1. A more effective implementation, proven by evaluation numbers
> 2. Consuming the same or even less power in benchmarks
> 3. 350 lines of code in total, less than half of max aggregation
> 4. This series shows the entirety and effectiveness of sum aggregation, at
> this very moment, today. Max aggregation needs further filtering and load
> balancing patches which we have not seen yet.
> 5. Resolves the drawbacks from max aggregation (which you might say is the
> same as 4)
> 6. Significantly reduces uclamp overhead, no bucket operations
>
> Cons:
> 1. should not be summed (although the scheduler used to sum up utilization
> and util_est sums up a processed PELT signal today)
> 2. Under-utilization case (which is a problem GROUP_THROTTLE also has, and
> can be worked around. Please, I know the intention of GROUP_THROTTLE, I'm
> just talking about its actual effects).
>
> I don't see why the things I listed above are in the wrong direction.

2024-02-20 22:43:13

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/7] uclamp sum aggregation

On 20/02/2024 15:47, Qais Yousef wrote:
> On 02/06/24 17:32, Hongyan Xia wrote:
>> On 06/02/2024 15:20, Qais Yousef wrote:
>>> On 02/01/24 13:11, Hongyan Xia wrote:
>>>
>>>> [1]: https://lore.kernel.org/all/[email protected]/
>>>
>>> Their solution is not acceptable for the same reason yours isn't. Saravana and
>>> David know this and we discussed at LPC. uclamp hints are limits and should not
>>> be summed.
>>
>> Uclamp is a performance hint and nothing in its definition says it can't be
>
> That's the major problem here. The definition says it can't be summed because
> they're limits. So if two tasks boosted to 1024, then their combined
> performance limit is 2048? Or if 4 tasks are capped to 512, then the combined
> limits is 2048? You're changing the behavior. If not, please help me understand
> how not. You're treating them as 'bandwidth' hints.

You have not looked at this series closely. '4 tasks are capped to 512,
then the combined limits is 2048?'. This is a misinterpretation of this
series. The fact that you conclude this series like this is already
alarming to me.

Also, if you spend time playing with this series or just look at my
evaluation results, you'll see it gives a much better 'dynamic range'
than max aggregation. You need only a smaller uclamp value in sum
aggregation to achieve the same results, often with less power
consumption. Higher values can give you better performance in a way max
aggregation can't, and the same amount of power budget often gives you
less jank.

About changing behavior, yes, I'm changing it from a less effective hint
to a more effective one.

>> summed. Clearly whether a uclamp approach should be taken should be
>> determined by how well it works as a hint, not by how we calculate it. I
>> would not say I want to reject max aggregation simply because it throws away
>> all other uclamp values except the max. It's because I have real evaluation
>> results showing sum aggregation works as a much better hint.
>
> It is easy to get numbers that show improvements. That doesn't
> necessarily mean it is correct.

But telling people that your implementation meets a definition of uclamp
without showing improved results or a roadmap to address all the
drawbacks won't convince people that it's the correct solution.

Also, it's likely something that is simpler and works better is correct.
It's not always true, but it's likely.

>>
>>>> [2]: https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510
>>>
>>> I think I clarified several times so far that this is not related to uclamp.
>>> Could you please refrain from referring to it again in the future? This is
>>> misleading and neither helps your cause nor its cause. The fact that you're
>>> relating to it makes me very worried as both links demonstrate lack of
>>> understanding/confusion of what uclamp is supposed to be.
>>
>> The intention of the code is irrelevant. What I'm talking about is what
>
> It is relevant.
>
> 1. You failed to get consent on using this as an example.
> 2. You failed to demonstrate the purpose of the code and why it was
> actually added. I don't think you know the full details anyway.
> 3. You're using a hack that is problematic and needs to be deleted as
> proof of something good that _we do not want_.
> 4. How do you know it works the way you think it does? You're making
> wrong assumptions that are misleading and quite frankly annoying.
>
> Generally we don't want to discuss this in this venue. So for the last time
> please stop bringing it up altogether and let's focus on the problems you are
> seeing and trying to fix and leave other people's code out of the discussion.

I find these 4 points arrogant and offensive, and they do not belong on
the mailing list.

I do not need consent from you, or need to know the full back story, to
talk about what a piece of public code does. The fun thing is, I even
spent time understanding what the code intends to do, but clearly to you
I still 'don't know the full details anyway' and am not qualified to
talk about what the code actually does in the real world, as if the
intention of a kitchen knife matters so much when I explain that the
knife is dangerous and can kill people.

Meanwhile you are making the incorrect comment that this series is
'combining the limits'. Of course, I could just say you failed to
demonstrate a good understanding of the series (this time not just the
intention, but also what it actually does), but that kind of language
rarely helps in making any technical progress on mailing lists.

>> effect the code actually has. The fact that you keep thinking I don't
>> understand what the code does even after me explaining "I know what the
>> intention of the code is, I'm just talking about the actual effect of the
>> code" is an even more worrying sign.
>>
>>> Again, this solution is not acceptable and you're moving things in the wrong
>>> direction. We don't want to redesign what uclamp means, but fix some corner
>>> cases. And you're doing the former not the latter.
>>
>> I'm saying max aggregation is not effective and proposing a more effective
>> implementation. In fact, you have sent a series that removes max
>> aggregation. Clearly that does not count as fixing corner cases but is
>> actually a redesign, and I don't understand why you are allowed to do such
>> things and I am not. Also, when something becomes harder and harder to fix,
>> a redesign that solves all the problems is clearly justified.
>
> I don't have issues with max aggregation removal. I actually would love to see
> it gone. But summing uclamp values changes the meaning of uclamp hints. Which
> is my number 1 concern.
>
> And you're lumping too many problems onto max aggregation. It was brought
> up several times that the overutilized definition being tied to misfit is
> not fit for purpose anymore. This needs to be fixed. feec() fails to
> distribute fairly when capacities are equal. This is a known old problem.
> This needs fixing. Energy computation needs to evolve to deal with a
> number of new realities. This needs fixing. The load balancer can't
> correct bad wake-up decisions in a way aligned with feec(); this needs
> fixing. Adding a new signal is not the right way to fix these problems.
> And the only issue with max-aggregation in practice is the fact that it
> is not effective in restricting power use for uclamp_max. It also has the
> minor problem of boosting unnecessarily for the whole time the task is
> enqueued. The correct behavior, which we want and which other vendors
> have expressed to me in the past, is to apply the boost/cap only when
> required, i.e. when the task is running. The hardware can cope with this.
> And this is tied to how the cpufreq governor works; if there are hardware
> limitations that make this hard to achieve, then it is up to the governor
> to decide this, not the scheduler. We already have util_avg and util_est.
> I really don't think we need another addition. We just need to fix issues
> elsewhere and improve the interface with the cpufreq governor to honour
> the performance limits the tasks have, so that we can get rid of
> max-aggregation.

We need to lay quite some groundwork for the existing implementation, as
seen in the many patches merged recently. A patch like "sched/uclamp: Set
max_spare_cap_cpu even if max_spare_cap is 0" has already created other
problems, shown in my evaluation results (which I doubt you have looked
at), and now you are saying we need to cope with the cpufreq governor and
hardware limits. It sounds like a much bigger hack when the effectiveness
of uclamp depends on the governor and hardware limits. Also, we still
don't know how many lines of code are needed to address all these
problems.

This conversation is going nowhere. The fact that we have gone back and
forth so many times, that you have spent quite some time lecturing me on
what I supposedly don't know, and that you have not commented even once
on the drawbacks shown in my whitebox tests, on the evaluation numbers
from the benchmarks, or on anything technical in this series is very
alarming to me. Meanwhile, other people have spotted and pointed out
technical issues like the under-utilization problem. We have wasted too
much time on this, and I will probably only reply to you on technical
discussions from now on.

>>
>> What I can summarize from sum aggregation is:
>>
>> Pros:
>> 1. A more effective implementation, proven by evaluation numbers
>> 2. Consuming the same or even less power in benchmarks
>> 3. 350 lines of code in total, less than half of max aggregation
>> 4. This series shows the entirety and effectiveness of sum aggregation, at
>> this very moment, today. Max aggregation needs further filtering and load
>> balancing patches which we have not seen yet.
>> 5. Resolves the drawbacks from max aggregation (which you might say is the
>> same as 4)
>> 6. Significantly reduces uclamp overhead, no bucket operations
>>
>> Cons:
>> 1. should not be summed (although the scheduler used to sum up utilization
>> and util_est sums up a processed PELT signal today)
>> 2. Under-utilization case (which is a problem GROUP_THROTTLE also has, and
>> can be worked around. Please, I know the intention of GROUP_THROTTLE, I'm
>> just talking about its actual effects).
>>
>> I don't see why the things I listed above are in the wrong direction.

2024-03-15 12:31:29

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v2 4/7] sched/fair: Use CFS util_avg_uclamp for utilization and frequency

On 01/02/2024 14:12, Hongyan Xia wrote:

[...]

> @@ -7685,11 +7697,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> static unsigned long
> cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> {
> - struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> - unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
> + struct rq *rq = cpu_rq(cpu);
> + struct cfs_rq *cfs_rq = &rq->cfs;
> + unsigned long util = root_cfs_util(rq);
> + bool capped = uclamp_rq_is_capped(rq);

I tried to rerun your tests in your 2 ipynbs (cover letter) but this
lets the sum aggr stack go sideways ...

If 'sched_uclamp_used' is set, then uclamp_rq_is_capped() will call
cpu_util_cfs()->cpu_util(), which then calls uclamp_rq_is_capped()
recursively, resulting in a stack overflow.

Do you have a fix for that you can share? For the time being I have
removed the call to uclamp_rq_is_capped() in cpu_util().

[...]

2024-03-15 19:49:06

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 4/7] sched/fair: Use CFS util_avg_uclamp for utilization and frequency

On 15/03/2024 12:31, Dietmar Eggemann wrote:
> On 01/02/2024 14:12, Hongyan Xia wrote:
>
> [...]
>
>> @@ -7685,11 +7697,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> static unsigned long
>> cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
>> {
>> - struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
>> - unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
>> + struct rq *rq = cpu_rq(cpu);
>> + struct cfs_rq *cfs_rq = &rq->cfs;
>> + unsigned long util = root_cfs_util(rq);
>> + bool capped = uclamp_rq_is_capped(rq);
>
> I tried to rerun your tests in your 2 ipynbs (cover letter) but this
> lets the sum aggr stack go sideways ...
>
> If 'sched_uclamp_used' is set, then uclamp_rq_is_capped() will call
> cpu_util_cfs()->cpu_util(), which then calls uclamp_rq_is_capped()
> recursively, resulting in a stack overflow.
>
> Do you have a fix for that you can share? For the time being I have
> removed the call to uclamp_rq_is_capped() in cpu_util().

My apologies. This was fixed long ago. Here is the diff, which breaks the
recursion by reading the tracked signals directly instead of going
through cpu_util_cfs():

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1ebdd0b9ebca..d5dcda036e0d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3018,9 +3018,8 @@ static inline bool uclamp_rq_is_capped(struct rq *rq)
if (!static_branch_likely(&sched_uclamp_used))
return false;

- rq_uclamp_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
- rq_real_util = READ_ONCE(rq->cfs.avg.util_avg) +
- READ_ONCE(rq->avg_rt.util_avg);
+ rq_uclamp_util = READ_ONCE(rq->root_cfs_util_uclamp);
+ rq_real_util = READ_ONCE(rq->cfs.avg.util_avg);

return rq_uclamp_util < SCHED_CAPACITY_SCALE &&
rq_real_util > rq_uclamp_util;

> [...]
>

2024-03-18 18:22:07

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

On 01/02/2024 14:11, Hongyan Xia wrote:

[...]

> /*
> * The code below (indirectly) updates schedutil which looks at
> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> #ifdef CONFIG_UCLAMP_TASK
> util_uclamp_enqueue(&rq->cfs.avg, p);
> update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
> + if (migrated)

IMHO, you don't need 'bool __maybe_unused migrated'. You can use:

if (flags & ENQUEUE_MIGRATED)

> + rq->root_cfs_util_uclamp += p->se.avg.util_avg_uclamp;
> + rq->root_cfs_util_uclamp = max(rq->root_cfs_util_uclamp,
> + rq->cfs.avg.util_avg_uclamp);
> /* TODO: Better skip the frequency update in the for loop above. */
> cpufreq_update_util(rq, 0);
> #endif
> @@ -8252,6 +8257,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> migrate_se_pelt_lag(se);
> }
>
> + remove_root_cfs_util_uclamp(p);

You can't always do this here. In the '!task_on_rq_migrating()' case we
don't hold the 'old' rq->lock.

Have a look into remove_entity_load_avg() for what we do for PELT in
this case.

And:

144d8487bc6e ("sched/fair: Implement synchonous PELT detach on load-balance migrate")
e1f078f50478 ("sched/fair: Combine detach into dequeue when migrating task")

@@ -3081,6 +3081,8 @@ static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
unsigned int root_util = READ_ONCE(rq->root_cfs_util_uclamp);
unsigned int p_util = READ_ONCE(p->se.avg.util_avg_uclamp), new_util;

+ lockdep_assert_rq_held(task_rq(p));
+
new_util = (root_util > p_util) ? root_util - p_util : 0;
new_util = max(new_util, READ_ONCE(rq->cfs.avg.util_avg_uclamp));
WRITE_ONCE(rq->root_cfs_util_uclamp, new_util);

[...]

> /* avg must belong to the queue this se is on. */
> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
> +void update_util_uclamp(u64 now,
> + u64 last_update_time,
> + u32 period_contrib,
> + struct sched_avg *avg,
> + struct task_struct *p)
> {

I was wondering why you use such a long parameter list for this
function.

IMHO

update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)

would work as well. You could check whether se represents a task inside
update_util_uclamp() as well as get last_update_time and period_contrib.

The only reason I see is that you want to use this function for the RT
class as well later, where you have to deal with 'struct rt_rq' and
'struct sched_rt_entity'.

IMHO, it's always better to keep the implementation to the minimum and
only introduce changes which are related to the functionality you
present. This would make reviewing so much easier.


> unsigned int util, uclamp_min, uclamp_max;
> int delta;
>
> - if (!p->se.on_rq)
> + if (!p->se.on_rq) {
> + ___update_util_uclamp_towards(now,
> + last_update_time,
> + period_contrib,
> + &p->se.avg.util_avg_uclamp,
> + 0);
> return;
> + }

You decay 'p->se.avg.util_avg_uclamp' which is not really related to
root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.

This is the util_avg_uclamp handling for a se (task):

enqueue_task_fair()

util_uclamp_enqueue()

update_util_uclamp() (1)

if (!p->se.on_rq) (x)
___update_util_uclamp_towards() (2)

dequeue_task_fair()

util_uclamp_dequeue()

__update_load_avg_blocked_se()

update_util_uclamp()

(x)

__update_load_avg_se()

update_util_uclamp()

(x)

Why is it so unbalanced? Why do you need (1) and (2)?

Isn't this just an indication that the se util_avg_uclamp
is done at the wrong places?

Is there an other way to provide a task/rq signal as the base
for uclamp sum aggregation?

[...]

2024-03-19 11:50:54

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

On 18/03/2024 18:21, Dietmar Eggemann wrote:
> On 01/02/2024 14:11, Hongyan Xia wrote:
>
> [...]
>
>> /*
>> * The code below (indirectly) updates schedutil which looks at
>> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> #ifdef CONFIG_UCLAMP_TASK
>> util_uclamp_enqueue(&rq->cfs.avg, p);
>> update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
>> + if (migrated)
>
> IMHO, you don't need 'bool __maybe_unused migrated'. You can use:
>
> if (flags & ENQUEUE_MIGRATED)

I'm not sure if they are entirely equivalent. Both
task_on_rq_migrating() and !task_on_rq_migrating() can have
last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former.
Maybe I'm missing something?

>> + rq->root_cfs_util_uclamp += p->se.avg.util_avg_uclamp;
>> + rq->root_cfs_util_uclamp = max(rq->root_cfs_util_uclamp,
>> + rq->cfs.avg.util_avg_uclamp);
>> /* TODO: Better skip the frequency update in the for loop above. */
>> cpufreq_update_util(rq, 0);
>> #endif
>> @@ -8252,6 +8257,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>> migrate_se_pelt_lag(se);
>> }
>>
>> + remove_root_cfs_util_uclamp(p);
>
> You can't always do this here. In the '!task_on_rq_migrating()' case we
> don't hold the 'old' rq->lock.
>
> Have a look into remove_entity_load_avg() for what we do for PELT in
> this case.
>
> And:
>
> 144d8487bc6e ("sched/fair: Implement synchonous PELT detach on load-balance migrate")
> e1f078f50478 ("sched/fair: Combine detach into dequeue when migrating task")
>
> @@ -3081,6 +3081,8 @@ static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
> unsigned int root_util = READ_ONCE(rq->root_cfs_util_uclamp);
> unsigned int p_util = READ_ONCE(p->se.avg.util_avg_uclamp), new_util;
>
> + lockdep_assert_rq_held(task_rq(p));
> +
> new_util = (root_util > p_util) ? root_util - p_util : 0;
> new_util = max(new_util, READ_ONCE(rq->cfs.avg.util_avg_uclamp));
> WRITE_ONCE(rq->root_cfs_util_uclamp, new_util);

Ack. I saw the removed_* functions. I will change into that style.

> [...]
>
>> /* avg must belong to the queue this se is on. */
>> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
>> +void update_util_uclamp(u64 now,
>> + u64 last_update_time,
>> + u32 period_contrib,
>> + struct sched_avg *avg,
>> + struct task_struct *p)
>> {
>
> I was wondering why you use such a long parameter list for this
> function.
>
> IMHO
>
> update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
>
> would work as well. You could check whether se represents a task inside
> update_util_uclamp() as well as get last_update_time and period_contrib.
>
> The only reason I see is that you want to use this function for the RT
> class as well later, where you have to deal with 'struct rt_rq' and
> 'struct sched_rt_entity'.
>
> IMHO, it's always better to keep the implementation to the minimum and
> only introduce changes which are related to the functionality you
> present. This would make reviewing so much easier.

Those parameters are necessary because of

if (___update_load_sum()) {
___update_load_avg();
update_util_uclamp();
}

We have to cache last_update_time and period_contrib, because after
___update_load_sum() is done, both parameters in sched_avg have already
been updated and overwritten and we lose the timestamp when the
sched_avg was previously updated. update_util_uclamp() needs to know
when sched_avg was previously updated.
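
Roughly, the ordering constraint looks like this (a sketch using the v2
names, with argument lists elided as above; not a drop-in hunk):

	u64 last_update_time = se->avg.last_update_time;
	u32 period_contrib = se->avg.period_contrib;

	/*
	 * ___update_load_sum() advances last_update_time and
	 * period_contrib, so snapshot them first and pass the old values
	 * on, letting util_avg_uclamp decay over the same window that
	 * util_avg was just updated for.
	 */
	if (___update_load_sum()) {
		___update_load_avg();
		update_util_uclamp(now, last_update_time, period_contrib,
				   &cfs_rq->avg, task_of(se));
	}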

>
>> unsigned int util, uclamp_min, uclamp_max;
>> int delta;
>>
>> - if (!p->se.on_rq)
>> + if (!p->se.on_rq) {
>> + ___update_util_uclamp_towards(now,
>> + last_update_time,
>> + period_contrib,
>> + &p->se.avg.util_avg_uclamp,
>> + 0);
>> return;
>> + }
>
> You decay 'p->se.avg.util_avg_uclamp' which is not really related to
> root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.

I would say this still belongs to 3/7, because 2/7 only implements
utilization for on_rq tasks. This patch implements utilization for both
on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for
on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).

For tasks, we also have two utilization numbers, one is on_rq and the
other is on_rq plus !on_rq. However, we know they do not have to be
stored in separate variables and util_avg_uclamp can capture both.

Moving this to 2/7 might be fine, although then this would be the only
!on_rq utilization in 2/7. I can add comments to clarify the situation.

> This is the util_avg_uclamp handling for a se (task):
>
> enqueue_task_fair()
>
> util_uclamp_enqueue()
>
> update_util_uclamp() (1)
>
> if (!p->se.on_rq) (x)
> ___update_util_uclamp_towards() (2)
>
> dequeue_task_fair()
>
> util_uclamp_dequeue()
>
> __update_load_avg_blocked_se()
>
> update_util_uclamp()
>
> (x)
>
> __update_load_avg_se()
>
> update_util_uclamp()
>
> (x)
>
> Why is it so unbalanced? Why do you need (1) and (2)?
>
> Isn't this just an indication that the se util_avg_uclamp
> is done at the wrong places?
>
> Is there an other way to provide a task/rq signal as the base
> for uclamp sum aggregation?

(2) won't happen, because at that point p->se.on_rq must be 1.

The sequence during enqueue_task_fair() is:

enqueue_task_fair(p)
enqueue_entity(se)
update_load_avg(se)
update_util_uclamp(p) (decay path)
p->se.on_rq = 1;
util_uclamp_enqueue(p)
update_util_uclamp(p) (update path)

The only reason why we want to issue update_util_uclamp() after seeing
on_rq == 1 is that now it goes down the normal uclamp path and not the
decay path. Otherwise, uclamp won't immediately engage during enqueue
and has to wait a timer tick.

Ideally, we should:

enqueue_task_fair(p)
enqueue_entity(se)
update_load_avg(se)
util_uclamp_enqueue(p)
update_util_uclamp(p) (force update path)
p->se.on_rq = 1;

This requires us to invent a new flag to update_util_uclamp() to force
the update path even when p->se.on_rq is 0.
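
A rough sketch of what such a flag could look like (the flag name and
the extra parameter are invented here purely for illustration and are
not part of the posted series; the normal update path is elided):

/* Hypothetical flag, invented for illustration only. */
#define UTIL_UCLAMP_FORCE	0x1

void update_util_uclamp(u64 now, u64 last_update_time, u32 period_contrib,
			struct sched_avg *avg, struct task_struct *p,
			int flags)
{
	if (!p->se.on_rq && !(flags & UTIL_UCLAMP_FORCE)) {
		/* Decay path: converge the task signal towards 0. */
		___update_util_uclamp_towards(now, last_update_time,
					      period_contrib,
					      &p->se.avg.util_avg_uclamp, 0);
		return;
	}

	/*
	 * Update path (elided): clamp the task's util within its
	 * uclamp_min/uclamp_max and propagate the delta to avg, even
	 * before p->se.on_rq is set during enqueue.
	 */
}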

At the moment I'm treating util_avg_uclamp in the same way as util_est
after the comments in v1, which is independent and has its own code
path. We can go back to the old style, where util_avg_uclamp is closer
to how we treat se rather than a separate thing like util_est.

> [...]

2024-03-19 15:42:47

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/7] Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

On 01/02/2024 14:11, Hongyan Xia wrote:
> From: Hongyan Xia <[email protected]>
>
> That commit creates further problems because 0 spare capacity can be
> either a real indication that the CPU is maxed out, or the CPU is
> UCLAMP_MAX throttled, but we end up giving all of them a chance which
> can result in bogus energy calculations. It also tends to schedule
> tasks on the same CPU and requires load balancing patches. Sum
> aggregation solves these problems and this patch is not needed.
>
> This reverts commit 6b00a40147653c8ea748e8f4396510f252763364.

I assume you did this revert especially for the 'Scenario 5: 8 tasks
with UCLAMP_MAX of 120' testcase?

IMHO, the issue is especially visible in compute_energy()'s busy_time
computation with a valid destination CPU (dst_cpu >= 0). I.e. when we
have to add performance domain (pd) and task busy time.

find_energy_efficient_cpu() (feec())

for each pd
for each cpu in pd

set {prev_,max}_spare_cap

bail if prev_ and max_spare_cap < 0 (was == 0 before )

{base_,prev_,cur_}energy = compute_energy

So with the patch we potentially compute energy for a saturated PD
according:

compute_energy()

if (dst_cpu >= 0)
busy_time = min(eenv->pd_cap, eenv->busy_time + eenv->task_busy_time)
<----(a)---> <--------------(b)------------------->

energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap)

If (b) > (a) then we're saturated and 'energy' is bogus.

The way to fix this is up for discussion:

(1) feec() returning prev_cpu
(2) feec() returning -1 (forcing wakeup into sis() -> sic())
(3) using uclamped values for task and rq utilization

None of those have immediately given the desired task placement on
mainline (2 tasks on each of the 4 little CPUs and no task on the 2 big
CPUs on my [l B B l l l] w/ CPU capacities = [446 1024 1024 446 446 446]
machine) you can achieve with uclamp sum aggregation.

[...]

2024-03-19 17:23:55

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/7] Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

On 19/03/2024 15:34, Dietmar Eggemann wrote:
> On 01/02/2024 14:11, Hongyan Xia wrote:
>> From: Hongyan Xia <[email protected]>
>>
>> That commit creates further problems because 0 spare capacity can be
>> either a real indication that the CPU is maxed out, or the CPU is
>> UCLAMP_MAX throttled, but we end up giving all of them a chance which
>> can result in bogus energy calculations. It also tends to schedule
>> tasks on the same CPU and requires load balancing patches. Sum
>> aggregation solves these problems and this patch is not needed.
>>
>> This reverts commit 6b00a40147653c8ea748e8f4396510f252763364.
>
> I assume you did this revert especially for the 'Scenario 5: 8 tasks
> with UCLAMP_MAX of 120' testcase?

More or less. Actually, you can already see the problem in Scenario 1.
Ideally the 4 uclamp_max tasks should be evenly distributed across the 4
little CPUs, but from time to time task placement puts more than 1 such
task on the same CPU, leaving some other little CPUs unoccupied.

> IMHO, the issue is especially visible in compute_energy()'s busy_time
> computation with a valid destination CPU (dst_cpu >= 0). I.e. when we
> have to add performance domain (pd) and task busy time.
>
> find_energy_efficient_cpu() (feec())
>
> for each pd
> for each cpu in pd
>
> set {prev_,max}_spare_cap
>
> bail if prev_ and max_spare_cap < 0 (was == 0 before )
>
> {base_,prev_,cur_}energy = compute_energy
>
> So with the patch we potentially compute energy for a saturated PD
> according:
>
> compute_energy()
>
> if (dst_cpu >= 0)
> busy_time = min(eenv->pd_cap, eenv->busy_time + eenv->task_busy_time)
> <----(a)---> <--------------(b)------------------->
>
> energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap)
>
> If (b) > (a) then we're saturated and 'energy' is bogus.

Yeah, I think what's happening is that placing more tasks on the same
CPU won't increase the energy computation, so in the end task placement
thinks it's the better decision. The root issue is that once you have
uclamp_max, you can theoretically fit an infinite number of such tasks
on the same CPU.
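
To put rough numbers on the min() clamping quoted above (assumed values
for the 4 x 446 little PD of the machine you describe, not taken from a
real trace or real eenv contents):

#include <stdio.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned long pd_cap = 4 * 446;		/* 1784 */
	unsigned long pd_busy = 1700;		/* PD already close to full */
	unsigned long task_busy = 300;		/* capped task being placed */

	/* Once the sum saturates at pd_cap, placing this task (and the
	 * next one, and the next...) no longer changes busy_time, so the
	 * energy estimate stays flat and the placement looks "free". */
	printf("busy_time = %lu\n", min_ul(pd_cap, pd_busy + task_busy));
	return 0;
}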

> The way to fix this is up for discussion:
>
> (1) feec() returning prev_cpu
> (2) feec() returning -1 (forcing wakeup into sis() -> sic())
> (3) using uclamped values for task and rq utilization
>
> None of those have immediately given the desired task placement on
> mainline (2 tasks on each of the 4 little CPUs and no task on the 2 big
> CPUs on my [l B B l l l] w/ CPU capacities = [446 1024 1024 446 446 446]
> machine) you can achieve with uclamp sum aggregation.

Personally, from the results I've seen I definitely prefer (3), although
(3) has other problems. One issue is that sum aggregation pushes
utilization up with uclamp_min, but the energy consumption definitely
won't be that high; the real energy lies somewhere between its util_avg
and util_avg_uclamp.

I haven't seen this become a real problem, but maybe we could see even
better task placement if this were accounted for.

> [...]

2024-03-20 15:28:20

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

On 19/03/2024 12:50, Hongyan Xia wrote:
> On 18/03/2024 18:21, Dietmar Eggemann wrote:
>> On 01/02/2024 14:11, Hongyan Xia wrote:
>>
>> [...]
>>
>>>       /*
>>>        * The code below (indirectly) updates schedutil which looks at
>>> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct
>>> task_struct *p, int flags)
>>>   #ifdef CONFIG_UCLAMP_TASK
>>>       util_uclamp_enqueue(&rq->cfs.avg, p);
>>>       update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
>>> +    if (migrated)
>>
>> IMHO, you don't need 'bool __maybe_unused migrated'. You can use:
>>
>>    if (flags & ENQUEUE_MIGRATED)
>
> I'm not sure if they are entirely equivalent. Both
> task_on_rq_migrating() and !task_on_rq_migrating() can have
> last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former.
> Maybe I'm missing something?

That's true. There are 2:

(!task_on_rq_migrating() && !p->se.avg.last_update_time)

cases:

(1) wakeup migration: ENQUEUE_MIGRATED (0x40) set in ttwu_do_wakeup()

(2) new task: wake_up_new_task() -> activate_task(), ENQUEUE_MIGRATED is
not set.

I assume you want to add the task's util_avg_uclamp to
rq->root_cfs_util_uclamp in (2) too? So:

if (flags & ENQUEUE_MIGRATED || task_new)

[...]

>>>   /* avg must belong to the queue this se is on. */
>>> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
>>> +void update_util_uclamp(u64 now,
>>> +            u64 last_update_time,
>>> +            u32 period_contrib,
>>> +            struct sched_avg *avg,
>>> +            struct task_struct *p)
>>>   {
>>
>> I was wondering why you use such a long parameter list for this
>> function.
>>
>> IMHO
>>
>>    update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct
>> sched_entity *se)
>>
>> would work as well. You could check whether se represents a task inside
>> update_util_uclamp() as well as get last_update_time and period_contrib.
>>
>> The only reason I see is that you want to use this function for the RT
>> class as well later, where you have to deal with 'struct rt_rq' and
>> 'struct sched_rt_entity'.
>>
>> IMHO, it's always better to keep the implementation to the minimum and
>> only introduce changes which are related to the functionality you
>> present. This would make reviewing so much easier.
>
> Those parameters are necessary because of
>
> if (___update_load_sum()) {
>     ___update_load_avg();
>     update_util_uclamp();
> }

So you need ___update_load_avg() happening for the `normal uclamp path`
and `last_update_time` and `period_contrib` for the `decay path` (1) of
update_util_uclamp()?

This is pretty hard to grasp. Isn't there a cleaner solution for this?

Why do you need the:

if (!avg)
return;

thing in update_util_uclamp()? __update_load_avg_blocked_se() calls
update_util_uclamp(..., avg = NULL, ...) but this should follow (1)?

> We have to cache last_update_time and period_contrib, because after
> ___update_load_sum() is done, both parameters in sched_avg have already
> been updated and overwritten and we lose the timestamp when the
> sched_avg was previously updated. update_util_uclamp() needs to know
> when sched_avg was previously updated.
>
>>
>>>       unsigned int util, uclamp_min, uclamp_max;
>>>       int delta;
>>>   -    if (!p->se.on_rq)
>>> +    if (!p->se.on_rq) {
>>> +        ___update_util_uclamp_towards(now,
>>> +                          last_update_time,
>>> +                          period_contrib,
>>> +                          &p->se.avg.util_avg_uclamp,
>>> +                          0);
>>>           return;
>>> +    }
>>
>> You decay 'p->se.avg.util_avg_uclamp' which is not really related to
>> root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.
>
> I would say this still belongs to 3/7, because 2/7 only implements
> utilization for on_rq tasks. This patch implements utilization for both
> on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for
> on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).

Looks like you maintain `rq->cfs.avg.util_avg_uclamp` (2) (consider all
runnable tasks) to be able to:

(a) set rq->root_cfs_util_uclamp (3) to max((3), (2))

(b) check that if 'rq->cfs.h_nr_running == 0' that (2) = 0 too.

uclamp is based on runnable tasks (so enqueue/dequeue) but you uclamp
around util_avg which has contributions from blocked tasks. And that's
why you have to maintain (3). And (3) only decays within
__update_load_avg_cfs_rq().
Can this be the reason why the implementation is so convoluted? It's
definitely more complicated than util_est with its easy layout:

enqueue_task_fair()
util_est_enqueue()

dequeue_task_fair()
util_est_dequeue()
util_est_update()

> For tasks, we also have two utilization numbers, one is on_rq and the
> other is on_rq plus !on_rq. However, we know they do not have to be
> stored in separate variables and util_avg_uclamp can capture both.

Here you lost me. Which value does 'p->se.avg.util_avg_uclamp' store?
'runnable' or 'runnable + blocking'? I would say it's the latter one but
like in PELT we don't update the task signal when it's sleeping.

> Moving this to 2/7 might be fine, although then this would be the only
> !on_rq utilization in 2/7. I can add comments to clarify the situation.
>
>> This is the util_avg_uclamp handling for a se (task):
>>
>> enqueue_task_fair()
>>
>>    util_uclamp_enqueue()
>>
>>    update_util_uclamp()                 (1)
>>
>>      if (!p->se.on_rq)                  (x)
>>        ___update_util_uclamp_towards()  (2)
>>
>> dequeue_task_fair()
>>
>>    util_uclamp_dequeue()
>>
>> __update_load_avg_blocked_se()
>>
>>    update_util_uclamp()
>>
>>      (x)
>>
>> __update_load_avg_se()
>>
>>    update_util_uclamp()
>>
>>      (x)
>>
>> Why is it so unbalanced? Why do you need (1) and (2)?
>>
>> Isn't this just an indication that the se util_avg_uclamp
>> is done at the wrong places?
>>
>> Is there an other way to provide a task/rq signal as the base
>> for uclamp sum aggregation?
>
> (2) won't happen, because at that point p->se.on_rq must be 1.

I see.

> The sequence during enqueue_task_fair() is:
>
> enqueue_task_fair(p)
>     enqueue_entity(se)
>         update_load_avg(se)
>             update_util_uclamp(p) (decay path)
>         p->se.on_rq = 1;
>     util_uclamp_enqueue(p)
>     update_util_uclamp(p) (update path)
>
> The only reason why we want to issue update_util_uclamp() after seeing
> on_rq == 1 is that now it goes down the normal uclamp path and not the
> decay path. Otherwise, uclamp won't immediately engage during enqueue
> and has to wait a timer tick.

OK, I see, the task contribution should be visible immediately after the
enqueue.

> Ideally, we should:
>
> enqueue_task_fair(p)
>     enqueue_entity(se)
>         update_load_avg(se)   
>             util_uclamp_enqueue(p)
>             update_util_uclamp(p) (force update path)
>         p->se.on_rq = 1;
>
> This requires us to invent a new flag to update_util_uclamp() to force
> the update path even when p->se.on_rq is 0.

I guess you have to go this way to achieve a cleaner/easier integration
of 'util_avg_uclamp'.

> At the moment I'm treating util_avg_uclamp in the same way as util_est
> after the comments in v1, which is independent and has its own code
> path. We can go back to the old style, where util_avg_uclamp is closer
> to how we treat se rather than a separate thing like util_est.

Except that the 'util_est' integration is much easier to understand. And
this is because 'util_est' is clearly runnable-based only and is not
linked to any blocked part.

[...]


2024-03-20 17:23:23

by Hongyan Xia

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

On 20/03/2024 15:27, Dietmar Eggemann wrote:
> On 19/03/2024 12:50, Hongyan Xia wrote:
>> On 18/03/2024 18:21, Dietmar Eggemann wrote:
>>> On 01/02/2024 14:11, Hongyan Xia wrote:
>>>
>>> [...]
>>>
>>>>       /*
>>>>        * The code below (indirectly) updates schedutil which looks at
>>>> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct
>>>> task_struct *p, int flags)
>>>>   #ifdef CONFIG_UCLAMP_TASK
>>>>       util_uclamp_enqueue(&rq->cfs.avg, p);
>>>>       update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
>>>> +    if (migrated)
>>>
>>> IMHO, you don't need 'bool __maybe_unused migrated'. You can use:
>>>
>>>    if (flags & ENQUEUE_MIGRATED)
>>
>> I'm not sure if they are entirely equivalent. Both
>> task_on_rq_migrating() and !task_on_rq_migrating() can have
>> last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former.
>> Maybe I'm missing something?
>
> That's true. There are 2:
>
> (!task_on_rq_migrating() && !p->se.avg.last_update_time)
>
> cases:
>
> (1) wakeup migration: ENQUEUE_MIGRATED (0x40) set in ttwu_do_wakeup()
>
> (2) new task: wake_up_new_task() -> activate_task(), ENQUEUE_MIGRATED is
> not set.
>
> I assume you want to add the task's util_avg_uclamp to
> rq->root_cfs_util_uclamp in (2) too? So:
>
> if (flags & ENQUEUE_MIGRATED || task_new)

I see. Maybe we don't need to check last_update_time. I'll take a look.

Although if we want to integrate sum aggregation with the se rather than
making it fully independent like util_est (as in your comments below),
we'll probably just do that in

update_load_avg()
if (!se->avg.last_update_time && (flags & DO_ATTACH)) { ... }

and not bother doing it outside the loop in enqueue_task_fair() anyway.

> [...]
>
>>>>   /* avg must belong to the queue this se is on. */
>>>> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
>>>> +void update_util_uclamp(u64 now,
>>>> +            u64 last_update_time,
>>>> +            u32 period_contrib,
>>>> +            struct sched_avg *avg,
>>>> +            struct task_struct *p)
>>>>   {
>>>
>>> I was wondering why you use such a long parameter list for this
>>> function.
>>>
>>> IMHO
>>>
>>>    update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct
>>> sched_entity *se)
>>>
>>> would work as well. You could check whether se represents a task inside
>>> update_util_uclamp() as well as get last_update_time and period_contrib.
>>>
>>> The only reason I see is that you want to use this function for the RT
>>> class as well later, where you have to deal with 'struct rt_rq' and
>>> 'struct sched_rt_entity'.
>>>
>>> IMHO, it's always better to keep the implementation to the minimum and
>>> only introduce changes which are related to the functionality you
>>> present. This would make reviewing so much easier.
>>
>> Those parameters are necessary because of
>>
>> if (___update_load_sum()) {
>>     ___update_load_avg();
>>     update_util_uclamp();
>> }
>
> So you need ___update_load_avg() happening for the `normal uclamp path`
> and `last_update_time` and `period_contrib` for the `decay path` (1) of
> update_util_uclamp()?
>
> This is pretty hard to grasp. Isn't there a cleaner solution for this?

You are correct. Not sure if there's a better way.

> Why do you need the:
>
> if (!avg)
> return;
>
> thing in update_util_uclamp()? __update_load_avg_blocked_se() calls
> update_util_uclamp(..., avg = NULL, ...) but this should follow (1)?

I added it as a guard to rule out edge cases, but I think when you do
__update_load_avg_blocked_se(), it has to be !on_rq so it will never
enter this path. I think it doesn't hurt but I can remove it.

>> We have to cache last_update_time and period_contrib, because after
>> ___update_load_sum() is done, both parameters in sched_avg have already
>> been updated and overwritten and we lose the timestamp when the
>> sched_avg was previously updated. update_util_uclamp() needs to know
>> when sched_avg was previously updated.
>>
>>>
>>>>       unsigned int util, uclamp_min, uclamp_max;
>>>>       int delta;
>>>>   -    if (!p->se.on_rq)
>>>> +    if (!p->se.on_rq) {
>>>> +        ___update_util_uclamp_towards(now,
>>>> +                          last_update_time,
>>>> +                          period_contrib,
>>>> +                          &p->se.avg.util_avg_uclamp,
>>>> +                          0);
>>>>           return;
>>>> +    }
>>>
>>> You decay 'p->se.avg.util_avg_uclamp' which is not really related to
>>> root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.
>>
>> I would say this still belongs to 3/7, because 2/7 only implements
>> utilization for on_rq tasks. This patch implements utilization for both
>> on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for
>> on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).
>
> Looks like you maintain `rq->cfs.avg.util_avg_uclamp` (2) (consider all
> runnable tasks) to be able to:
>
> (a) set rq->root_cfs_util_uclamp (3) to max((3), (2))
>
> (b) check that if 'rq->cfs.h_nr_running == 0' that (2) = 0 too.
>
> uclamp is based on runnable tasks (so enqueue/dequeue) but you uclamp
> around util_avg which has contributions from blocked tasks. And that's
> why you have to maintain (3). And (3) only decays within
> __update_load_avg_cfs_rq().
> Can this be the reason why the implementation is so convoluted? It's
> definitely more complicated than util_est with its easy layout:
>
> enqueue_task_fair()
> util_est_enqueue()
>
> dequeue_task_fair()
> util_est_dequeue()
> util_est_update()

There's only one rule here, which is that the root value must be the
sum of all task util_avg_uclamp values. If PELT slowly decays each
util_avg, then the sum will also decay in a similar fashion. The code
here is just doing that.

This 'convoluted' property is from PELT and not from sum aggregation.
Actually this is what PELT used to do a while back, when the root CFS
util was the sum of all tasks instead of being tracked separately, and
we do the same decay here as we did back then.

You are right that this feels convoluted, but sum aggregation doesn't
attempt to change how PELT works, so the decaying behavior comes from
PELT itself. I can keep exploring ways to make the logic easier to
follow.
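
To spell the property out (in pseudo-comment form; decay_load() is the
existing PELT helper, and the equality holds up to integer rounding):

/*
 * Invariant: rq->root_cfs_util_uclamp == sum of util_avg_uclamp over
 * all tasks attached to this rq, runnable or blocked.
 *
 * PELT decay is linear in the signal value:
 *
 *      decay_load(u1, n) + decay_load(u2, n) == decay_load(u1 + u2, n)
 *
 * so decaying the root sum directly (from __update_load_avg_cfs_rq())
 * gives the same result as decaying every task signal and re-summing,
 * without having to walk all tasks.
 */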

>> For tasks, we also have two utilization numbers, one is on_rq and the
>> other is on_rq plus !on_rq. However, we know they do not have to be
>> stored in separate variables and util_avg_uclamp can capture both.
>
> Here you lost me. Which value does 'p->se.avg.util_avg_uclamp' store?
> 'runnable' or 'runnable + blocking'? I would say it's the latter one but
> like in PELT we don't update the task signal when it's sleeping.

The latter. We don't update the task signal while it's sleeping, but we
do bring it up to date when we need to use it, and that's enough. The
same is true for all blocked util_avg.
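
I.e. something like this on the consumer side (sketch only,
task_util_uclamp() is a made-up helper name):

static unsigned int task_util_uclamp(struct task_struct *p, u64 now)
{
        /*
         * Bring a sleeping task's clamped signal up to date on demand,
         * the same way blocked util_avg is only decayed when somebody
         * needs the value.
         */
        if (!p->se.on_rq)
                ___update_util_uclamp_towards(now,
                                              p->se.avg.last_update_time,
                                              p->se.avg.period_contrib,
                                              &p->se.avg.util_avg_uclamp, 0);

        return READ_ONCE(p->se.avg.util_avg_uclamp);
}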

>> Moving this to 2/7 might be fine, although then this would be the only
>> !on_rq utilization in 2/7. I can add comments to clarify the situation.
>>
>>> This is the util_avg_uclamp handling for a se (task):
>>>
>>> enqueue_task_fair()
>>>
>>>    util_uclamp_enqueue()
>>>
>>>    update_util_uclamp()                 (1)
>>>
>>>      if (!p->se.on_rq)                  (x)
>>>        ___update_util_uclamp_towards()  (2)
>>>
>>> dequeue_task_fair()
>>>
>>>    util_uclamp_dequeue()
>>>
>>> __update_load_avg_blocked_se()
>>>
>>>    update_util_uclamp()
>>>
>>>      (x)
>>>
>>> __update_load_avg_se()
>>>
>>>    update_util_uclamp()
>>>
>>>      (x)
>>>
>>> Why is it so unbalanced? Why do you need (1) and (2)?
>>>
>>> Isn't this just an indication that the se util_avg_uclamp
>>> is done at the wrong places?
>>>
>>> Is there an other way to provide a task/rq signal as the base
>>> for uclamp sum aggregation?
>>
>> (2) won't happen, because at that point p->se.on_rq must be 1.
>
> I see.
>
>> The sequence during enqueue_task_fair() is:
>>
>> enqueue_task_fair(p)
>>     enqueue_entity(se)
>>         update_load_avg(se)
>>             update_util_uclamp(p) (decay path)
>>         p->se.on_rq = 1;
>>     util_uclamp_enqueue(p)
>>     update_util_uclamp(p) (update path)
>>
>> The only reason why we want to issue update_util_uclamp() after seeing
>> on_rq == 1 is that now it goes down the normal uclamp path and not the
>> decay path. Otherwise, uclamp won't immediately engage during enqueue
>> and has to wait a timer tick.
>
> OK, I see, the task contribution should be visible immediately after the
> enqueue.
>
>> Ideally, we should:
>>
>> enqueue_task_fair(p)
>>     enqueue_entity(se)
>>         update_load_avg(se)
>>             util_uclamp_enqueue(p)
>>             update_util_uclamp(p) (force update path)
>>         p->se.on_rq = 1;
>>
>> This requires us to invent a new flag to update_util_uclamp() to force
>> the update path even when p->se.on_rq is 0.
>
> I guess you have to go this way to achieve a cleaner/easier integration
> of 'util_avg_uclamp'.

If we don't want to keep util_avg_uclamp separate like util_est, but
instead move it closer to the se, then we can explore this option.
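
Something along these lines, where UTIL_UCLAMP_FORCE is a hypothetical
flag (just a sketch of the idea, untested):

#define UTIL_UCLAMP_FORCE       0x1     /* hypothetical flag */

void update_util_uclamp(u64 now, u64 last_update_time, u32 period_contrib,
                        struct sched_avg *avg, struct task_struct *p,
                        int flags)
{
        if (!p->se.on_rq && !(flags & UTIL_UCLAMP_FORCE)) {
                /* Decay path, as in the current patch. */
                ___update_util_uclamp_towards(now, last_update_time,
                                              period_contrib,
                                              &p->se.avg.util_avg_uclamp, 0);
                return;
        }

        /*
         * Normal path: clamp util_avg within uclamp_min/max and
         * propagate the delta to the root, as today.
         */
}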

>> At the moment I'm treating util_avg_uclamp in the same way as util_est
>> after the comments in v1, which is independent and has its own code
>> path. We can go back to the old style, where util_avg_uclamp is closer
>> to how we treat se rather than a separate thing like util_est.
>
> Except that 'util_est' integration is much easier to understand. And
> this is because 'util_est' is clearly runnable-based only and is not
> linked to any blocked part.

True.

Well, I can go back to the style in RFC v1. One big advantage of v2 is
that we can compile out uclamp support very easily, because it's
treated more or less as an independent thing.
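
For example (a sketch; only the helper names come from this series, the
stub signatures are guessed):

#ifdef CONFIG_UCLAMP_TASK
void util_uclamp_enqueue(struct sched_avg *avg, struct task_struct *p);
void util_uclamp_dequeue(struct sched_avg *avg, struct task_struct *p);
#else
static inline void util_uclamp_enqueue(struct sched_avg *avg,
                                       struct task_struct *p) { }
static inline void util_uclamp_dequeue(struct sched_avg *avg,
                                       struct task_struct *p) { }
#endif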

> [...]
>