The relationship between uclamp and fits_capacity() is currently broken,
mostly due to how uclamp should interact with the migration margin and
capacity pressure, but also because not all users were converted to
consider uclamp before calling fits_capacity(), namely cpu_overutilized().
The meat of the series is patch 1 where we introduce a new function,
util_fits_cpu(), that takes uclamp into account. The new function should
replace all call sites to fits_capacity(), which is what the subsequent
patches do, except for patch 7 where we fix the handling of the early exit
condition in find_energy_efficient_cpu() (AKA feec()), which must be
uclamp aware too.
util_fits_cpu() falls back to a simple call to fits_capacity() if uclamp
is not being used on the system.
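A simplified sketch of the shape introduced in patch 1 (details elided
here; the full logic is in the patch itself):

	static inline int util_fits_cpu(unsigned long util,
					unsigned long uclamp_min,
					unsigned long uclamp_max,
					int cpu)
	{
		/* Check the real util against the pressured capacity. */
		bool fits = fits_capacity(util, capacity_of(cpu));

		if (!uclamp_is_used())
			return fits;

		/*
		 * The uclamp_min/uclamp_max aware checks against
		 * capacity_orig_of(cpu) follow in the full version.
		 */
		return fits;
	}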
The last two patches still need testing and verification, but they address
the various thermal handling issues raised in v1. We can re-order the
patches, but they were added at the end to facilitate review at this
stage. Hopefully the approach and the logic make sense.
I'll be on holidays, so if I don't respond to review comments fast enough,
apologies in advance.
Changes in v2:
* use uclamp_rq_is_idle() in uclamp_rq_util_with() (Xuewen)
* Simplify logic in update_sg_wakeup_stats() when converting
task_fits_cpu() (Vincent)
* Better handling of thermal pressure (Vincent)
- We consider thermal pressure for uclamp_min in patch 1
- 2 new patches to handle capacity inversion which improves handling
when:
			+ There are multiple big CPUs on separate perf domains.
			+ A task is boosted by uclamp_min and, in spite of thermal
			  pressure, the big CPU is still the best placement.
e.g: p0->util_avg = 300, p0->uclamp_min = 1024
This task should stay on big CPU until thermal pressure is in
capacity inversion.
v1 discussion: https://lore.kernel.org/lkml/[email protected]/
Thanks to all the reviewers of v1!
Qais Yousef (9):
sched/uclamp: Fix relationship between uclamp and migration margin
sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
sched/uclamp: Fix fits_capacity() check in feec()
sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early
exit condition
sched/fair: Detect capacity inversion
sched/fair: Consider capacity inversion in util_fits_cpu()
kernel/sched/core.c | 10 +-
kernel/sched/fair.c | 290 ++++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 70 ++++++++++-
3 files changed, 329 insertions(+), 41 deletions(-)
--
2.25.1
We do consider thermal pressure in util_fits_cpu() for uclamp_min only,
with the exception of the biggest cores, which by definition are the max
performance point of the system and on which all tasks by definition
should fit. Even under thermal pressure, the capacity of the biggest CPU
is the highest in the system and should still fit every task. Except when
it reaches the capacity inversion point; then this is no longer true.
We can handle this by using the inverted capacity as capacity_orig in
util_fits_cpu(), which not only addresses the problem above, but also
ensures uclamp_max now considers the inverted capacity. Force-fitting
a task when a CPU is in this adverse state will contribute to making the
thermal throttling last longer.
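For illustration, with assumed numbers (a big CPU with capacity_orig =
1024 under thermal_load_avg = 300, next to a medium perf domain with
capacity_orig = 768):

	cpu_capacity_inverted(big) = 1024 - 300 = 724	(724 < 768)

	util_fits_cpu() then uses 724 as both capacity_orig and
	capacity_orig_thermal, so a uclamp_min == 1024 or
	uclamp_max == 1024 request is compared against 724 rather than
	1024 and the throttled big CPU no longer force-fits.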
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/fair.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cb32dc9a057f..77ae343e32a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4293,12 +4293,16 @@ static inline int util_fits_cpu(unsigned long util,
* For uclamp_max, we can tolerate a drop in performance level as the
* goal is to cap the task. So it's okay if it's getting less.
*
- * In case of capacity inversion, which is not handled yet, we should
- * honour the inverted capacity for both uclamp_min and uclamp_max all
- * the time.
+ * In case of capacity inversion we should honour the inverted capacity
+ * for both uclamp_min and uclamp_max all the time.
*/
- capacity_orig = capacity_orig_of(cpu);
- capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
+ capacity_orig = cpu_in_capacity_inversion(cpu);
+ if (capacity_orig) {
+ capacity_orig_thermal = capacity_orig;
+ } else {
+ capacity_orig = capacity_orig_of(cpu);
+ capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
+ }
/*
* We want to force a task to fit a cpu as implied by uclamp_max.
--
2.25.1
As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024,
it'll always pick the previous CPU because fits_capacity() will always
return false in this case.
The new util_fits_cpu() logic should handle this correctly for us,
besides more corner cases where similar failures could occur, like when
using UCLAMP_MAX.
We open code uclamp_rq_util_with() except for the clamp() part;
util_fits_cpu() needs the 'raw' values to be passed to it.
Also introduce uclamp_rq_{set,get}() shorthand accessors for the rq's
uclamp values. They make the code more readable and ensure the right
rules (use of READ_ONCE/WRITE_ONCE) are respected transparently.
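As a small illustration of the max aggregation (values assumed): with rq
clamps of UCLAMP_MIN = 512 / UCLAMP_MAX = 800 and task clamps of
700 / 1024:

	util_min = max(512, 700)  = 700;
	util_max = max(800, 1024) = 1024;

	/*
	 * util_fits_cpu() receives the raw util plus these aggregated
	 * clamps instead of a pre-clamped util.
	 */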
[1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html
Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Reported-by: Yun Hsiang <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/core.c | 10 +++++-----
kernel/sched/fair.c | 26 ++++++++++++++++++++++++--
kernel/sched/sched.h | 42 +++++++++++++++++++++++++++++++++++++++---
3 files changed, 68 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 64c08993221b..ea66c525d3ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1398,7 +1398,7 @@ static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
return;
- WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
+ uclamp_rq_set(rq, clamp_id, clamp_value);
}
static inline
@@ -1549,8 +1549,8 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
if (bucket->tasks == 1 || uc_se->value > bucket->value)
bucket->value = uc_se->value;
- if (uc_se->value > READ_ONCE(uc_rq->value))
- WRITE_ONCE(uc_rq->value, uc_se->value);
+ if (uc_se->value > uclamp_rq_get(rq, clamp_id))
+ uclamp_rq_set(rq, clamp_id, uc_se->value);
}
/*
@@ -1616,7 +1616,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
if (likely(bucket->tasks))
return;
- rq_clamp = READ_ONCE(uc_rq->value);
+ rq_clamp = uclamp_rq_get(rq, clamp_id);
/*
* Defensive programming: this should never happen. If it happens,
* e.g. due to future modification, warn and fixup the expected value.
@@ -1624,7 +1624,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
SCHED_WARN_ON(bucket->value > rq_clamp);
if (bucket->value >= rq_clamp) {
bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
- WRITE_ONCE(uc_rq->value, bkt_clamp);
+ uclamp_rq_set(rq, clamp_id, bkt_clamp);
}
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 78feb9ca1e41..ea02c64cd933 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6993,6 +6993,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
+ unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
+ unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
struct root_domain *rd = this_rq()->rd;
int cpu, best_energy_cpu, target = -1;
struct sched_domain *sd;
@@ -7025,6 +7027,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
for (; pd; pd = pd->next) {
unsigned long cpu_cap, cpu_thermal_cap, util;
unsigned long cur_delta, max_spare_cap = 0;
+ unsigned long rq_util_min, rq_util_max;
+ unsigned long util_min, util_max;
bool compute_prev_delta = false;
int max_spare_cap_cpu = -1;
unsigned long base_energy;
@@ -7061,8 +7065,26 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
* much capacity we can get out of the CPU; this is
* aligned with sched_cpu_util().
*/
- util = uclamp_rq_util_with(cpu_rq(cpu), util, p);
- if (!fits_capacity(util, cpu_cap))
+ if (uclamp_is_used()) {
+ if (uclamp_rq_is_idle(cpu_rq(cpu))) {
+ util_min = p_util_min;
+ util_max = p_util_max;
+ } else {
+ /*
+ * Open code uclamp_rq_util_with() except for
+ * the clamp() part. Ie: apply max aggregation
+ * only. util_fits_cpu() logic requires to
+ * operate on non clamped util but must use the
+ * max-aggregated uclamp_{min, max}.
+ */
+ rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
+ rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
+
+ util_min = max(rq_util_min, p_util_min);
+ util_max = max(rq_util_max, p_util_max);
+ }
+ }
+ if (!util_fits_cpu(util, util_min, util_max, cpu))
continue;
lsub_positive(&cpu_cap, util);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eec1cac3eef4..caf017f7def6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2966,6 +2966,23 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
#ifdef CONFIG_UCLAMP_TASK
unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
+static inline unsigned long uclamp_rq_get(struct rq *rq,
+ enum uclamp_id clamp_id)
+{
+ return READ_ONCE(rq->uclamp[clamp_id].value);
+}
+
+static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
+ unsigned int value)
+{
+ WRITE_ONCE(rq->uclamp[clamp_id].value, value);
+}
+
+static inline bool uclamp_rq_is_idle(struct rq *rq)
+{
+ return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
+}
+
/**
* uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
* @rq: The rq to clamp against. Must not be NULL.
@@ -3001,12 +3018,12 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
* Ignore last runnable task's max clamp, as this task will
* reset it. Similarly, no need to read the rq's min clamp.
*/
- if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
+ if (uclamp_rq_is_idle(rq))
goto out;
}
- min_util = max_t(unsigned long, min_util, READ_ONCE(rq->uclamp[UCLAMP_MIN].value));
- max_util = max_t(unsigned long, max_util, READ_ONCE(rq->uclamp[UCLAMP_MAX].value));
+ min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
+ max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
out:
/*
* Since CPU's {min,max}_util clamps are MAX aggregated considering
@@ -3069,6 +3086,25 @@ static inline bool uclamp_is_used(void)
{
return false;
}
+
+static inline unsigned long uclamp_rq_get(struct rq *rq,
+ enum uclamp_id clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+
+ return SCHED_CAPACITY_SCALE;
+}
+
+static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
+ unsigned int value)
+{
+}
+
+static inline bool uclamp_rq_is_idle(struct rq *rq)
+{
+ return false;
+}
#endif /* CONFIG_UCLAMP_TASK */
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
--
2.25.1
If the utilization of the woken up task is 0, we skip the energy
calculation because it has no impact.
But a task that is boosted (uclamp_min != 0) will have an impact on task
placement and frequency selection. Only skip if the util is truly
0 after applying uclamp values.
Change the uclamp_task_util() signature to avoid unnecessary additional
calls to uclamp_eff_get(). feec() is the only user now.
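A hedged example of the difference (values assumed): a woken task with
task_util_est() == 0 but an effective UCLAMP_MIN of 512:

	task_util_est(p)               == 0	/* old check: skip feec()   */
	uclamp_task_util(p, 512, 1024) == 512	/* new check: keep going;
						   the boost affects placement
						   and frequency selection  */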
Fixes: 732cd75b8c920 ("sched/fair: Select an energy-efficient CPU on task wake-up")
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/fair.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4c3a5240d7e3..59ba7106ddc6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4108,14 +4108,16 @@ static inline unsigned long task_util_est(struct task_struct *p)
}
#ifdef CONFIG_UCLAMP_TASK
-static inline unsigned long uclamp_task_util(struct task_struct *p)
+static inline unsigned long uclamp_task_util(struct task_struct *p,
+ unsigned long uclamp_min,
+ unsigned long uclamp_max)
{
- return clamp(task_util_est(p),
- uclamp_eff_value(p, UCLAMP_MIN),
- uclamp_eff_value(p, UCLAMP_MAX));
+ return clamp(task_util_est(p), uclamp_min, uclamp_max);
}
#else
-static inline unsigned long uclamp_task_util(struct task_struct *p)
+static inline unsigned long uclamp_task_util(struct task_struct *p,
+ unsigned long uclamp_min,
+ unsigned long uclamp_max)
{
return task_util_est(p);
}
@@ -7029,7 +7031,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target = prev_cpu;
sync_entity_load_avg(&p->se);
- if (!task_util_est(p))
+ if (!uclamp_task_util(p, p_util_min, p_util_max))
goto unlock;
eenv_task_busy_time(&eenv, p, prev_cpu);
--
2.25.1
Use the new util_fits_cpu() to ensure the migration margin and capacity
pressure are taken into account correctly when uclamp is being used;
otherwise we will fail to consider CPUs as fitting in scenarios where
they should.
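For example (assumed values): a task with util_avg = 300 boosted with
UCLAMP_MIN = 1024 on a big CPU that sees a little irq pressure
(capacity_orig = 1024, capacity_of() = 1000):

	fits_capacity(clamp(300, 1024, 1024), 1000)	/* false: big CPU rejected */
	util_fits_cpu(300, 1024, 1024, big_cpu)		/* true:  big CPU accepted */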
Fixes: b4c9c9f15649 ("sched/fair: Prefer prev cpu in asymmetric wakeup path")
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/fair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea02c64cd933..3079ca867f2c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6603,21 +6603,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
- unsigned long task_util, best_cap = 0;
+ unsigned long task_util, util_min, util_max, best_cap = 0;
int cpu, best_cpu = -1;
struct cpumask *cpus;
cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
- task_util = uclamp_task_util(p);
+ task_util = task_util_est(p);
+ util_min = uclamp_eff_value(p, UCLAMP_MIN);
+ util_max = uclamp_eff_value(p, UCLAMP_MAX);
for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);
if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
continue;
- if (fits_capacity(task_util, cpu_cap))
+ if (util_fits_cpu(task_util, util_min, util_max, cpu))
return cpu;
if (cpu_cap > best_cap) {
--
2.25.1
Check each performance domain to see if thermal pressure is causing its
capacity to be lower than another performance domain.
We assume that each performance domain has CPUs with the same capacities,
which is similar to an assumption made in energy_model.c.
We also assume that thermal pressure impacts all CPUs in a performance
domain equally.
If there are multiple performance domains with the same capacity_orig, we
will trigger a capacity inversion if the domain is under thermal
pressure.
The new cpu_in_capacity_inversion() should help users know when the
information about capacity_orig is not reliable, so they can opt in to
using the inverted capacity as the 'actual' capacity_orig.
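A short walk through the detection with assumed numbers: two big perf
domains, PD-A and PD-B, both with capacity_orig = 1024, where only PD-A
is throttled with thermal_load_avg = 300:

	inv_cap(PD-A CPUs) = 1024 - 300 = 724
	PD-B: pd_cap = 1024 - 0 = 1024 > 724
		-> rq->cpu_capacity_inverted = 724 on PD-A's CPUs

	PD-B is not throttled, so its CPUs keep cpu_capacity_inverted = 0.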
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/fair.c | 63 +++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 19 +++++++++++++
2 files changed, 79 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59ba7106ddc6..cb32dc9a057f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8659,16 +8659,73 @@ static unsigned long scale_rt_capacity(int cpu)
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
+ unsigned long capacity_orig = arch_scale_cpu_capacity(cpu);
unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups;
+ struct rq *rq = cpu_rq(cpu);
- cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
+ rq->cpu_capacity_orig = capacity_orig;
if (!capacity)
capacity = 1;
- cpu_rq(cpu)->cpu_capacity = capacity;
- trace_sched_cpu_capacity_tp(cpu_rq(cpu));
+ rq->cpu_capacity = capacity;
+
+ /*
+ * Detect if the performance domain is in capacity inversion state.
+ *
+ * Capacity inversion happens when another perf domain with equal or
+ * lower capacity_orig_of() ends up having higher capacity than this
+ * domain after subtracting thermal pressure.
+ *
+ * We only take into account thermal pressure in this detection as it's
+ * the only metric that actually results in *real* reduction of
+ * capacity due to performance points (OPPs) being dropped/become
+ * unreachable due to thermal throttling.
+ *
+ * We assume:
+ * * That all cpus in a perf domain have the same capacity_orig
+ * (same uArch).
+ * * Thermal pressure will impact all cpus in this perf domain
+ * equally.
+ */
+ if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+ unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
+ struct perf_domain *pd = rcu_dereference(rq->rd->pd);
+
+ rq->cpu_capacity_inverted = 0;
+
+ for (; pd; pd = pd->next) {
+ struct cpumask *pd_span = perf_domain_span(pd);
+ unsigned long pd_cap_orig, pd_cap;
+
+ cpu = cpumask_any(pd_span);
+ pd_cap_orig = arch_scale_cpu_capacity(cpu);
+
+ if (capacity_orig < pd_cap_orig)
+ continue;
+
+ /*
+ * handle the case of multiple perf domains have the
+ * same capacity_orig but one of them is under higher
+ * thermal pressure. We record it as capacity
+ * inversion.
+ */
+ if (capacity_orig == pd_cap_orig) {
+ pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
+
+ if (pd_cap > inv_cap) {
+ rq->cpu_capacity_inverted = inv_cap;
+ break;
+ }
+ } else if (pd_cap_orig > inv_cap) {
+ rq->cpu_capacity_inverted = inv_cap;
+ break;
+ }
+ }
+ }
+
+ trace_sched_cpu_capacity_tp(rq);
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index caf017f7def6..541a70fa55b3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1033,6 +1033,7 @@ struct rq {
unsigned long cpu_capacity;
unsigned long cpu_capacity_orig;
+ unsigned long cpu_capacity_inverted;
struct callback_head *balance_callback;
@@ -2865,6 +2866,24 @@ static inline unsigned long capacity_orig_of(int cpu)
return cpu_rq(cpu)->cpu_capacity_orig;
}
+/*
+ * Returns inverted capacity if the CPU is in capacity inversion state.
+ * 0 otherwise.
+ *
+ * Capacity inversion detection only considers thermal impact where actual
+ * performance points (OPPs) gets dropped.
+ *
+ * Capacity inversion state happens when another performance domain that has
+ * equal or lower capacity_orig_of() becomes effectively larger than the perf
+ * domain this CPU belongs to due to thermal pressure throttling it hard.
+ *
+ * See comment in update_cpu_capacity().
+ */
+static inline unsigned long cpu_in_capacity_inversion(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_inverted;
+}
+
/**
* enum cpu_util_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
--
2.25.1
Use the new util_fits_cpu() to ensure the migration margin and capacity
pressure are taken into account correctly when uclamp is being used;
otherwise we will fail to consider CPUs as fitting in scenarios where
they should.
s/asym_fits_capacity/asym_fits_cpu/ to better reflect what it does now.
Fixes: b4c9c9f15649 ("sched/fair: Prefer prev cpu in asymmetric wakeup path")
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/fair.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3079ca867f2c..38cf56ade66d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6631,10 +6631,13 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
return best_cpu;
}
-static inline bool asym_fits_capacity(unsigned long task_util, int cpu)
+static inline bool asym_fits_cpu(unsigned long util,
+ unsigned long util_min,
+ unsigned long util_max,
+ int cpu)
{
if (sched_asym_cpucap_active())
- return fits_capacity(task_util, capacity_of(cpu));
+ return util_fits_cpu(util, util_min, util_max, cpu);
return true;
}
@@ -6646,7 +6649,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
bool has_idle_core = false;
struct sched_domain *sd;
- unsigned long task_util;
+ unsigned long task_util, util_min, util_max;
int i, recent_used_cpu;
/*
@@ -6655,7 +6658,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (sched_asym_cpucap_active()) {
sync_entity_load_avg(&p->se);
- task_util = uclamp_task_util(p);
+ task_util = task_util_est(p);
+ util_min = uclamp_eff_value(p, UCLAMP_MIN);
+ util_max = uclamp_eff_value(p, UCLAMP_MAX);
}
/*
@@ -6664,7 +6669,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
lockdep_assert_irqs_disabled();
if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
- asym_fits_capacity(task_util, target))
+ asym_fits_cpu(task_util, util_min, util_max, target))
return target;
/*
@@ -6672,7 +6677,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
- asym_fits_capacity(task_util, prev))
+ asym_fits_cpu(task_util, util_min, util_max, prev))
return prev;
/*
@@ -6687,7 +6692,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
in_task() &&
prev == smp_processor_id() &&
this_rq()->nr_running <= 1 &&
- asym_fits_capacity(task_util, prev)) {
+ asym_fits_cpu(task_util, util_min, util_max, prev)) {
return prev;
}
@@ -6699,7 +6704,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
cpus_share_cache(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
- asym_fits_capacity(task_util, recent_used_cpu)) {
+ asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
return recent_used_cpu;
}
--
2.25.1
So that the new uclamp rules in regard to migration margin and capacity
pressure are taken into account correctly.
Fixes: a7008c07a568 ("sched/fair: Make task_fits_capacity() consider uclamp restrictions")
Co-developed-by: Vincent Guittot <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/fair.c | 26 ++++++++++++++++----------
kernel/sched/sched.h | 9 +++++++++
2 files changed, 25 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 00c2de800685..78feb9ca1e41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4377,10 +4377,12 @@ static inline int util_fits_cpu(unsigned long util,
return fits;
}
-static inline int task_fits_capacity(struct task_struct *p,
- unsigned long capacity)
+static inline int task_fits_cpu(struct task_struct *p, int cpu)
{
- return fits_capacity(uclamp_task_util(p), capacity);
+ unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
+ unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
+ unsigned long util = task_util_est(p);
+ return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
}
static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
@@ -4393,7 +4395,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
return;
}
- if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+ if (task_fits_cpu(p, cpu_of(rq))) {
rq->misfit_task_load = 0;
return;
}
@@ -8234,7 +8236,7 @@ static int detach_tasks(struct lb_env *env)
case migrate_misfit:
/* This is not a misfit task */
- if (task_fits_capacity(p, capacity_of(env->src_cpu)))
+ if (task_fits_cpu(p, env->src_cpu))
goto next;
env->imbalance = 0;
@@ -9239,6 +9241,10 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
memset(sgs, 0, sizeof(*sgs));
+ /* Assume that task can't fit any CPU of the group */
+ if (sd->flags & SD_ASYM_CPUCAPACITY)
+ sgs->group_misfit_task_load = 1;
+
for_each_cpu(i, sched_group_span(group)) {
struct rq *rq = cpu_rq(i);
unsigned int local;
@@ -9258,12 +9264,12 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
if (!nr_running && idle_cpu_without(i, p))
sgs->idle_cpus++;
- }
+ /* Check if task fits in the CPU */
+ if (sd->flags & SD_ASYM_CPUCAPACITY &&
+ sgs->group_misfit_task_load &&
+ task_fits_cpu(p, i))
+ sgs->group_misfit_task_load = 0;
- /* Check if task fits in the group */
- if (sd->flags & SD_ASYM_CPUCAPACITY &&
- !task_fits_capacity(p, group->sgc->max_capacity)) {
- sgs->group_misfit_task_load = 1;
}
sgs->group_capacity = group->sgc->capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3ccd35c22f0f..eec1cac3eef4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3047,6 +3047,15 @@ static inline bool uclamp_is_used(void)
return static_branch_likely(&sched_uclamp_used);
}
#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned long uclamp_eff_value(struct task_struct *p,
+ enum uclamp_id clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+
+ return SCHED_CAPACITY_SCALE;
+}
+
static inline
unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
struct task_struct *p)
--
2.25.1
fits_capacity() verifies that a util is within 20% margin of the
capacity of a CPU, which is an attempt to speed up upmigration.
But when uclamp is used, this 20% margin is problematic because for
example if a task is boosted to 1024, then it will not fit on any CPU
according to fits_capacity() logic.
Or if a task is boosted to capacity_orig_of(medium_cpu), it will end up
on a big CPU instead of the desired medium CPU.
Similar corner cases exist for uclamp and the usage of capacity_of().
The slightest irq pressure on the biggest CPU, for example, will make a
1024 boosted task look like it can't fit.
What we really want is for uclamp comparisons to ignore the migration
margin and capacity pressure, yet retain them for when checking the
_actual_ util signal.
For example, task p:
p->util_avg = 300
p->uclamp[UCLAMP_MIN] = 1024
Will fit a big CPU. But
p->util_avg = 900
p->uclamp[UCLAMP_MIN] = 1024
will not; this should trigger the overutilized state because the big CPU
is now *actually* being saturated.
Similar reasoning applies to capping tasks with UCLAMP_MAX. For example:
p->util_avg = 1024
p->uclamp[UCLAMP_MAX] = capacity_orig_of(medium_cpu)
Should fit the task on medium cpus without triggering overutilized
state.
Inline comments expand more on the desired behavior in more scenarios.
Introduce new util_fits_cpu() function which encapsulates the new logic.
The new function is not used anywhere yet, but will be used to update
various users of fits_capacity() in later patches.
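To make the intended outcomes concrete, here is a sketch of what the new
function is expected to return for the examples above (assuming a big CPU
with capacity_orig == 1024, a medium CPU with capacity_orig == 768 and no
capacity pressure; big_cpu/medium_cpu are hypothetical CPU ids):

	util_fits_cpu(300,  1024, 1024, big_cpu);    /* 1: boost request satisfiable  */
	util_fits_cpu(900,  1024, 1024, big_cpu);    /* 0: util saturates the big CPU */
	util_fits_cpu(1024, 0,    768,  medium_cpu); /* 1: capped task forced to fit  */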
Fixes: af24bde8df202 ("sched/uclamp: Add uclamp support to energy_compute()")
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/fair.c | 123 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 123 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52dc9d6f811e..00c2de800685 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4254,6 +4254,129 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
trace_sched_util_est_se_tp(&p->se);
}
+static inline int util_fits_cpu(unsigned long util,
+ unsigned long uclamp_min,
+ unsigned long uclamp_max,
+ int cpu)
+{
+ unsigned long capacity_orig, capacity_orig_thermal;
+ unsigned long capacity = capacity_of(cpu);
+ bool fits, uclamp_max_fits;
+
+ /*
+ * Check if the real util fits without any uclamp boost/cap applied.
+ */
+ fits = fits_capacity(util, capacity);
+
+ if (!uclamp_is_used())
+ return fits;
+
+ /*
+ * We must use capacity_orig_of() for comparing against uclamp_min and
+ * uclamp_max. We only care about capacity pressure (by using
+ * capacity_of()) for comparing against the real util.
+ *
+ * If a task is boosted to 1024 for example, we don't want a tiny
+ * pressure to skew the check whether it fits a CPU or not.
+ *
+ * Similarly if a task is capped to capacity_orig_of(little_cpu), it
+ * should fit a little cpu even if there's some pressure.
+ *
+ * Only exception is for thermal pressure since it has a direct impact
+ * on available OPP of the system.
+ *
+ * We honour it for uclamp_min only as a drop in performance level
+ * could result in not getting the requested minimum performance level.
+ *
+ * For uclamp_max, we can tolerate a drop in performance level as the
+ * goal is to cap the task. So it's okay if it's getting less.
+ *
+ * In case of capacity inversion, which is not handled yet, we should
+ * honour the inverted capacity for both uclamp_min and uclamp_max all
+ * the time.
+ */
+ capacity_orig = capacity_orig_of(cpu);
+ capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
+
+ /*
+ * We want to force a task to fit a cpu as implied by uclamp_max.
+ * But we do have some corner cases to cater for..
+ *
+ *
+ * C=z
+ * | ___
+ * | C=y | |
+ * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
+ * | C=x | | | |
+ * | ___ | | | |
+ * | | | | | | | (util somewhere in this region)
+ * | | | | | | |
+ * | | | | | | |
+ * +----------------------------------------
+ * cpu0 cpu1 cpu2
+ *
+ * In the above example if a task is capped to a specific performance
+ * point, y, then when:
+ *
+ * * util = 80% of x then it does not fit on cpu0 and should migrate
+ * to cpu1
+ * * util = 80% of y then it is forced to fit on cpu1 to honour
+ * uclamp_max request.
+ *
+ * which is what we're enforcing here. A task always fits if
+ * uclamp_max <= capacity_orig. But when uclamp_max > capacity_orig,
+ * the normal upmigration rules should withhold still.
+ *
+ * Only exception is when we are on max capacity, then we need to be
+ * careful not to block overutilized state. This is so because:
+ *
+ * 1. There's no concept of capping at max_capacity! We can't go
+ * beyond this performance level anyway.
+ * 2. The system is being saturated when we're operating near
+ * max capacity, it doesn't make sense to block overutilized.
+ */
+ uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
+ uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
+ fits = fits || uclamp_max_fits;
+
+ /*
+ *
+ * C=z
+ * | ___ (region a, capped, util >= uclamp_max)
+ * | C=y | |
+ * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
+ * | C=x | | | |
+ * | ___ | | | | (region b, uclamp_min <= util <= uclamp_max)
+ * |_ _ _|_ _|_ _ _ _| _ | _ _ _| _ | _ _ _ _ _ uclamp_min
+ * | | | | | | |
+ * | | | | | | | (region c, boosted, util < uclamp_min)
+ * +----------------------------------------
+ * cpu0 cpu1 cpu2
+ *
+ * a) If util > uclamp_max, then we're capped, we don't care about
+ * actual fitness value here. We only care if uclamp_max fits
+ * capacity without taking margin/pressure into account.
+ * See comment above.
+ *
+ * b) If uclamp_min <= util <= uclamp_max, then the normal
+ * fits_capacity() rules apply. Except we need to ensure that we
+ * enforce we remain within uclamp_max, see comment above.
+ *
+ * c) If util < uclamp_min, then we are boosted. Same as (b) but we
+ * need to take into account the boosted value fits the CPU without
+ * taking margin/pressure into account.
+ *
+ * Cases (a) and (b) are handled in the 'fits' variable already. We
+ * just need to consider an extra check for case (c) after ensuring we
+ * handle the case uclamp_min > uclamp_max.
+ */
+ uclamp_min = min(uclamp_min, uclamp_max);
+ if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
+ fits = fits && (uclamp_min <= capacity_orig_thermal);
+
+ return fits;
+}
+
static inline int task_fits_capacity(struct task_struct *p,
unsigned long capacity)
{
--
2.25.1
So that it is now uclamp aware.
This fixes a major problem of busy tasks capped with UCLAMP_MAX keeping
the system in overutilized state which disables EAS and leads to wasting
energy in the long run.
Without this patch running a busy background activity like JIT
compilation on Pixel 6 causes the system to be in overutilized state
74.5% of the time.
With this patch this goes down to 9.79%.
It also fixes another problem with long running tasks that have their
UCLAMP_MIN changed while running such that they need to upmigrate to
honour the new UCLAMP_MIN value. The upmigration doesn't get triggered
because the overutilized state never gets set in this case, hence misfit
migration never happens at the tick until the task wakes up again.
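As an assumed numeric illustration of the first problem: if every
runnable task on a 1024-capacity CPU is capped with UCLAMP_MAX = 512, the
rq's max-aggregated UCLAMP_MAX stays at 512 even when cpu_util_cfs()
climbs to, say, 900:

	rq_util_min = 0;	/* no boosted tasks on the rq       */
	rq_util_max = 512;	/* all runnable tasks capped to 512 */

	util_fits_cpu(900, 0, 512, cpu)	/* true: uclamp_max <= capacity_orig,
					   so the capped load is forced to fit
					   and the CPU is not overutilized */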
Fixes: af24bde8df202 ("sched/uclamp: Add uclamp support to energy_compute()")
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/fair.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 38cf56ade66d..4c3a5240d7e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5807,7 +5807,10 @@ static inline void hrtick_update(struct rq *rq)
#ifdef CONFIG_SMP
static inline bool cpu_overutilized(int cpu)
{
- return !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
+ unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
+ unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
+
+ return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
}
static inline void update_overutilized_status(struct rq *rq)
--
2.25.1
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 44c7b80bffc3a657a36857098d5d9c49d94e652b
Gitweb: https://git.kernel.org/tip/44c7b80bffc3a657a36857098d5d9c49d94e652b
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 04 Aug 2022 15:36:08 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 27 Oct 2022 11:01:20 +02:00
sched/fair: Detect capacity inversion
Check each performance domain to see if thermal pressure is causing its
capacity to be lower than another performance domain.
We assume that each performance domain has CPUs with the same capacities,
which is similar to an assumption made in energy_model.c.
We also assume that thermal pressure impacts all CPUs in a performance
domain equally.
If there are multiple performance domains with the same capacity_orig, we
will trigger a capacity inversion if the domain is under thermal
pressure.
The new cpu_in_capacity_inversion() should help users know when the
information about capacity_orig is not reliable, so they can opt in to
using the inverted capacity as the 'actual' capacity_orig.
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 19 +++++++++++++-
2 files changed, 79 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f32acb..4c4ea47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8824,16 +8824,73 @@ static unsigned long scale_rt_capacity(int cpu)
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
+ unsigned long capacity_orig = arch_scale_cpu_capacity(cpu);
unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups;
+ struct rq *rq = cpu_rq(cpu);
- cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
+ rq->cpu_capacity_orig = capacity_orig;
if (!capacity)
capacity = 1;
- cpu_rq(cpu)->cpu_capacity = capacity;
- trace_sched_cpu_capacity_tp(cpu_rq(cpu));
+ rq->cpu_capacity = capacity;
+
+ /*
+ * Detect if the performance domain is in capacity inversion state.
+ *
+ * Capacity inversion happens when another perf domain with equal or
+ * lower capacity_orig_of() ends up having higher capacity than this
+ * domain after subtracting thermal pressure.
+ *
+ * We only take into account thermal pressure in this detection as it's
+ * the only metric that actually results in *real* reduction of
+ * capacity due to performance points (OPPs) being dropped/become
+ * unreachable due to thermal throttling.
+ *
+ * We assume:
+ * * That all cpus in a perf domain have the same capacity_orig
+ * (same uArch).
+ * * Thermal pressure will impact all cpus in this perf domain
+ * equally.
+ */
+ if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+ unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
+ struct perf_domain *pd = rcu_dereference(rq->rd->pd);
+
+ rq->cpu_capacity_inverted = 0;
+
+ for (; pd; pd = pd->next) {
+ struct cpumask *pd_span = perf_domain_span(pd);
+ unsigned long pd_cap_orig, pd_cap;
+
+ cpu = cpumask_any(pd_span);
+ pd_cap_orig = arch_scale_cpu_capacity(cpu);
+
+ if (capacity_orig < pd_cap_orig)
+ continue;
+
+ /*
+ * handle the case of multiple perf domains have the
+ * same capacity_orig but one of them is under higher
+ * thermal pressure. We record it as capacity
+ * inversion.
+ */
+ if (capacity_orig == pd_cap_orig) {
+ pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
+
+ if (pd_cap > inv_cap) {
+ rq->cpu_capacity_inverted = inv_cap;
+ break;
+ }
+ } else if (pd_cap_orig > inv_cap) {
+ rq->cpu_capacity_inverted = inv_cap;
+ break;
+ }
+ }
+ }
+
+ trace_sched_cpu_capacity_tp(rq);
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d6d488e..5f18460 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1041,6 +1041,7 @@ struct rq {
unsigned long cpu_capacity;
unsigned long cpu_capacity_orig;
+ unsigned long cpu_capacity_inverted;
struct balance_callback *balance_callback;
@@ -2878,6 +2879,24 @@ static inline unsigned long capacity_orig_of(int cpu)
return cpu_rq(cpu)->cpu_capacity_orig;
}
+/*
+ * Returns inverted capacity if the CPU is in capacity inversion state.
+ * 0 otherwise.
+ *
+ * Capacity inversion detection only considers thermal impact where actual
+ * performance points (OPPs) gets dropped.
+ *
+ * Capacity inversion state happens when another performance domain that has
+ * equal or lower capacity_orig_of() becomes effectively larger than the perf
+ * domain this CPU belongs to due to thermal pressure throttling it hard.
+ *
+ * See comment in update_cpu_capacity().
+ */
+static inline unsigned long cpu_in_capacity_inversion(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_inverted;
+}
+
/**
* enum cpu_util_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
The following commit has been merged into the sched/core branch of tip:
Commit-ID: aa69c36f31aadc1669bfa8a3de6a47b5e6c98ee8
Gitweb: https://git.kernel.org/tip/aa69c36f31aadc1669bfa8a3de6a47b5e6c98ee8
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 04 Aug 2022 15:36:09 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 27 Oct 2022 11:01:20 +02:00
sched/fair: Consider capacity inversion in util_fits_cpu()
We do consider thermal pressure in util_fits_cpu() for uclamp_min only,
with the exception of the biggest cores, which by definition are the max
performance point of the system and on which all tasks by definition
should fit. Even under thermal pressure, the capacity of the biggest CPU
is the highest in the system and should still fit every task. Except when
it reaches the capacity inversion point; then this is no longer true.
We can handle this by using the inverted capacity as capacity_orig in
util_fits_cpu(), which not only addresses the problem above, but also
ensures uclamp_max now considers the inverted capacity. Force-fitting
a task when a CPU is in this adverse state will contribute to making the
thermal throttling last longer.
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4c4ea47..919d016 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4465,12 +4465,16 @@ static inline int util_fits_cpu(unsigned long util,
* For uclamp_max, we can tolerate a drop in performance level as the
* goal is to cap the task. So it's okay if it's getting less.
*
- * In case of capacity inversion, which is not handled yet, we should
- * honour the inverted capacity for both uclamp_min and uclamp_max all
- * the time.
+ * In case of capacity inversion we should honour the inverted capacity
+ * for both uclamp_min and uclamp_max all the time.
*/
- capacity_orig = capacity_orig_of(cpu);
- capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
+ capacity_orig = cpu_in_capacity_inversion(cpu);
+ if (capacity_orig) {
+ capacity_orig_thermal = capacity_orig;
+ } else {
+ capacity_orig = capacity_orig_of(cpu);
+ capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
+ }
/*
* We want to force a task to fit a cpu as implied by uclamp_max.
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 244226035a1f9b2b6c326e55ae5188fab4f428cb
Gitweb: https://git.kernel.org/tip/244226035a1f9b2b6c326e55ae5188fab4f428cb
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 04 Aug 2022 15:36:03 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 27 Oct 2022 11:01:18 +02:00
sched/uclamp: Fix fits_capacity() check in feec()
As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024,
it'll always pick the previous CPU because fits_capacity() will always
return false in this case.
The new util_fits_cpu() logic should handle this correctly for us,
besides more corner cases where similar failures could occur, like when
using UCLAMP_MAX.
We open code uclamp_rq_util_with() except for the clamp() part;
util_fits_cpu() needs the 'raw' values to be passed to it.
Also introduce uclamp_rq_{set,get}() shorthand accessors for the rq's
uclamp values. They make the code more readable and ensure the right
rules (use of READ_ONCE/WRITE_ONCE) are respected transparently.
[1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html
Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Reported-by: Yun Hsiang <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/core.c | 10 +++++-----
kernel/sched/fair.c | 26 ++++++++++++++++++++++++--
kernel/sched/sched.h | 42 +++++++++++++++++++++++++++++++++++++++---
3 files changed, 68 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb2aa2b..069da4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1392,7 +1392,7 @@ static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
return;
- WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
+ uclamp_rq_set(rq, clamp_id, clamp_value);
}
static inline
@@ -1543,8 +1543,8 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
if (bucket->tasks == 1 || uc_se->value > bucket->value)
bucket->value = uc_se->value;
- if (uc_se->value > READ_ONCE(uc_rq->value))
- WRITE_ONCE(uc_rq->value, uc_se->value);
+ if (uc_se->value > uclamp_rq_get(rq, clamp_id))
+ uclamp_rq_set(rq, clamp_id, uc_se->value);
}
/*
@@ -1610,7 +1610,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
if (likely(bucket->tasks))
return;
- rq_clamp = READ_ONCE(uc_rq->value);
+ rq_clamp = uclamp_rq_get(rq, clamp_id);
/*
* Defensive programming: this should never happen. If it happens,
* e.g. due to future modification, warn and fixup the expected value.
@@ -1618,7 +1618,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
SCHED_WARN_ON(bucket->value > rq_clamp);
if (bucket->value >= rq_clamp) {
bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
- WRITE_ONCE(uc_rq->value, bkt_clamp);
+ uclamp_rq_set(rq, clamp_id, bkt_clamp);
}
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db6174b..c8eb5ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7169,6 +7169,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
+ unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
+ unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
struct root_domain *rd = this_rq()->rd;
int cpu, best_energy_cpu, target = -1;
struct sched_domain *sd;
@@ -7201,6 +7203,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
for (; pd; pd = pd->next) {
unsigned long cpu_cap, cpu_thermal_cap, util;
unsigned long cur_delta, max_spare_cap = 0;
+ unsigned long rq_util_min, rq_util_max;
+ unsigned long util_min, util_max;
bool compute_prev_delta = false;
int max_spare_cap_cpu = -1;
unsigned long base_energy;
@@ -7237,8 +7241,26 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
* much capacity we can get out of the CPU; this is
* aligned with sched_cpu_util().
*/
- util = uclamp_rq_util_with(cpu_rq(cpu), util, p);
- if (!fits_capacity(util, cpu_cap))
+ if (uclamp_is_used()) {
+ if (uclamp_rq_is_idle(cpu_rq(cpu))) {
+ util_min = p_util_min;
+ util_max = p_util_max;
+ } else {
+ /*
+ * Open code uclamp_rq_util_with() except for
+ * the clamp() part. Ie: apply max aggregation
+ * only. util_fits_cpu() logic requires to
+ * operate on non clamped util but must use the
+ * max-aggregated uclamp_{min, max}.
+ */
+ rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
+ rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
+
+ util_min = max(rq_util_min, p_util_min);
+ util_max = max(rq_util_max, p_util_max);
+ }
+ }
+ if (!util_fits_cpu(util, util_min, util_max, cpu))
continue;
lsub_positive(&cpu_cap, util);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0ab091b..d6d488e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2979,6 +2979,23 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
#ifdef CONFIG_UCLAMP_TASK
unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
+static inline unsigned long uclamp_rq_get(struct rq *rq,
+ enum uclamp_id clamp_id)
+{
+ return READ_ONCE(rq->uclamp[clamp_id].value);
+}
+
+static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
+ unsigned int value)
+{
+ WRITE_ONCE(rq->uclamp[clamp_id].value, value);
+}
+
+static inline bool uclamp_rq_is_idle(struct rq *rq)
+{
+ return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
+}
+
/**
* uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
* @rq: The rq to clamp against. Must not be NULL.
@@ -3014,12 +3031,12 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
* Ignore last runnable task's max clamp, as this task will
* reset it. Similarly, no need to read the rq's min clamp.
*/
- if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
+ if (uclamp_rq_is_idle(rq))
goto out;
}
- min_util = max_t(unsigned long, min_util, READ_ONCE(rq->uclamp[UCLAMP_MIN].value));
- max_util = max_t(unsigned long, max_util, READ_ONCE(rq->uclamp[UCLAMP_MAX].value));
+ min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
+ max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
out:
/*
* Since CPU's {min,max}_util clamps are MAX aggregated considering
@@ -3082,6 +3099,25 @@ static inline bool uclamp_is_used(void)
{
return false;
}
+
+static inline unsigned long uclamp_rq_get(struct rq *rq,
+ enum uclamp_id clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+
+ return SCHED_CAPACITY_SCALE;
+}
+
+static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
+ unsigned int value)
+{
+}
+
+static inline bool uclamp_rq_is_idle(struct rq *rq)
+{
+ return false;
+}
#endif /* CONFIG_UCLAMP_TASK */
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
The following commit has been merged into the sched/core branch of tip:
Commit-ID: b759caa1d9f667b94727b2ad12589cbc4ce13a82
Gitweb: https://git.kernel.org/tip/b759caa1d9f667b94727b2ad12589cbc4ce13a82
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 04 Aug 2022 15:36:04 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 27 Oct 2022 11:01:18 +02:00
sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
Use the new util_fits_cpu() to ensure the migration margin and capacity
pressure are taken into account correctly when uclamp is being used;
otherwise we will fail to consider CPUs as fitting in scenarios where
they should.
Fixes: b4c9c9f15649 ("sched/fair: Prefer prev cpu in asymmetric wakeup path")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8eb5ff..c877bbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6779,21 +6779,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
- unsigned long task_util, best_cap = 0;
+ unsigned long task_util, util_min, util_max, best_cap = 0;
int cpu, best_cpu = -1;
struct cpumask *cpus;
cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
- task_util = uclamp_task_util(p);
+ task_util = task_util_est(p);
+ util_min = uclamp_eff_value(p, UCLAMP_MIN);
+ util_max = uclamp_eff_value(p, UCLAMP_MAX);
for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);
if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
continue;
- if (fits_capacity(task_util, cpu_cap))
+ if (util_fits_cpu(task_util, util_min, util_max, cpu))
return cpu;
if (cpu_cap > best_cap) {
The following commit has been merged into the sched/core branch of tip:
Commit-ID: b48e16a69792b5dc4a09d6807369d11b2970cc36
Gitweb: https://git.kernel.org/tip/b48e16a69792b5dc4a09d6807369d11b2970cc36
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 04 Aug 2022 15:36:02 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 27 Oct 2022 11:01:18 +02:00
sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
So that the new uclamp rules in regard to migration margin and capacity
pressure are taken into account correctly.
Fixes: a7008c07a568 ("sched/fair: Make task_fits_capacity() consider uclamp restrictions")
Co-developed-by: Vincent Guittot <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 26 ++++++++++++++++----------
kernel/sched/sched.h | 9 +++++++++
2 files changed, 25 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d193ef..db6174b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4549,10 +4549,12 @@ static inline int util_fits_cpu(unsigned long util,
return fits;
}
-static inline int task_fits_capacity(struct task_struct *p,
- unsigned long capacity)
+static inline int task_fits_cpu(struct task_struct *p, int cpu)
{
- return fits_capacity(uclamp_task_util(p), capacity);
+ unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
+ unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
+ unsigned long util = task_util_est(p);
+ return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
}
static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
@@ -4565,7 +4567,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
return;
}
- if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+ if (task_fits_cpu(p, cpu_of(rq))) {
rq->misfit_task_load = 0;
return;
}
@@ -8399,7 +8401,7 @@ static int detach_tasks(struct lb_env *env)
case migrate_misfit:
/* This is not a misfit task */
- if (task_fits_capacity(p, capacity_of(env->src_cpu)))
+ if (task_fits_cpu(p, env->src_cpu))
goto next;
env->imbalance = 0;
@@ -9404,6 +9406,10 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
memset(sgs, 0, sizeof(*sgs));
+ /* Assume that task can't fit any CPU of the group */
+ if (sd->flags & SD_ASYM_CPUCAPACITY)
+ sgs->group_misfit_task_load = 1;
+
for_each_cpu(i, sched_group_span(group)) {
struct rq *rq = cpu_rq(i);
unsigned int local;
@@ -9423,12 +9429,12 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
if (!nr_running && idle_cpu_without(i, p))
sgs->idle_cpus++;
- }
+ /* Check if task fits in the CPU */
+ if (sd->flags & SD_ASYM_CPUCAPACITY &&
+ sgs->group_misfit_task_load &&
+ task_fits_cpu(p, i))
+ sgs->group_misfit_task_load = 0;
- /* Check if task fits in the group */
- if (sd->flags & SD_ASYM_CPUCAPACITY &&
- !task_fits_capacity(p, group->sgc->max_capacity)) {
- sgs->group_misfit_task_load = 1;
}
sgs->group_capacity = group->sgc->capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a4a2004..0ab091b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3060,6 +3060,15 @@ static inline bool uclamp_is_used(void)
return static_branch_likely(&sched_uclamp_used);
}
#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned long uclamp_eff_value(struct task_struct *p,
+ enum uclamp_id clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+
+ return SCHED_CAPACITY_SCALE;
+}
+
static inline
unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
struct task_struct *p)
The following commit has been merged into the sched/core branch of tip:
Commit-ID: c56ab1b3506ba0e7a872509964b100912bde165d
Gitweb: https://git.kernel.org/tip/c56ab1b3506ba0e7a872509964b100912bde165d
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 04 Aug 2022 15:36:06 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 27 Oct 2022 11:01:19 +02:00
sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
So that it is now uclamp aware.
This fixes a major problem of busy tasks capped with UCLAMP_MAX keeping
the system in overutilized state which disables EAS and leads to wasting
energy in the long run.
Without this patch running a busy background activity like JIT
compilation on Pixel 6 causes the system to be in overutilized state
74.5% of the time.
With this patch this goes down to 9.79%.
It also fixes another problem with long running tasks that have their
UCLAMP_MIN changed while running such that they need to upmigrate to
honour the new UCLAMP_MIN value. The upmigration doesn't get triggered
because the overutilized state never gets set in this case, hence misfit
migration never happens at the tick until the task wakes up again.
Fixes: af24bde8df202 ("sched/uclamp: Add uclamp support to energy_compute()")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cabbdac..a0ee319 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5987,7 +5987,10 @@ static inline void hrtick_update(struct rq *rq)
#ifdef CONFIG_SMP
static inline bool cpu_overutilized(int cpu)
{
- return !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
+ unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
+ unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
+
+ return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
}
static inline void update_overutilized_status(struct rq *rq)
The following commit has been merged into the sched/core branch of tip:
Commit-ID: d81304bc6193554014d4372a01debdf65e1e9a4d
Gitweb: https://git.kernel.org/tip/d81304bc6193554014d4372a01debdf65e1e9a4d
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 04 Aug 2022 15:36:07 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 27 Oct 2022 11:01:19 +02:00
sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
If the utilization of the woken up task is 0, we skip the energy
calculation because it has no impact.
But a task that is boosted (uclamp_min != 0) will have an impact on task
placement and frequency selection. Only skip if the util is truly
0 after applying uclamp values.
Change the uclamp_task_util() signature to avoid unnecessary additional
calls to uclamp_eff_get(). feec() is the only user now.
Fixes: 732cd75b8c920 ("sched/fair: Select an energy-efficient CPU on task wake-up")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0ee319..0f32acb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4280,14 +4280,16 @@ static inline unsigned long task_util_est(struct task_struct *p)
}
#ifdef CONFIG_UCLAMP_TASK
-static inline unsigned long uclamp_task_util(struct task_struct *p)
+static inline unsigned long uclamp_task_util(struct task_struct *p,
+ unsigned long uclamp_min,
+ unsigned long uclamp_max)
{
- return clamp(task_util_est(p),
- uclamp_eff_value(p, UCLAMP_MIN),
- uclamp_eff_value(p, UCLAMP_MAX));
+ return clamp(task_util_est(p), uclamp_min, uclamp_max);
}
#else
-static inline unsigned long uclamp_task_util(struct task_struct *p)
+static inline unsigned long uclamp_task_util(struct task_struct *p,
+ unsigned long uclamp_min,
+ unsigned long uclamp_max)
{
return task_util_est(p);
}
@@ -7205,7 +7207,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target = prev_cpu;
sync_entity_load_avg(&p->se);
- if (!task_util_est(p))
+ if (!uclamp_task_util(p, p_util_min, p_util_max))
goto unlock;
eenv_task_busy_time(&eenv, p, prev_cpu);
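As a rough standalone sketch of why the early exit must look at the clamped
value (the numbers are assumptions for illustration):

#include <stdio.h>

/* clamp() as used by uclamp_task_util(): restrict util to [lo, hi]. */
static unsigned long clamp_util(unsigned long util, unsigned long lo, unsigned long hi)
{
	return util < lo ? lo : (util > hi ? hi : util);
}

int main(void)
{
	/* Assumed: a newly woken task with no accumulated util, boosted to 512. */
	unsigned long task_util_est = 0;
	unsigned long uclamp_min = 512, uclamp_max = 1024;
	unsigned long boosted = clamp_util(task_util_est, uclamp_min, uclamp_max);

	/* Before: task_util_est == 0, so feec() bailed out and ignored the boost. */
	/* After: boosted == 512, so the energy-aware placement still runs. */
	printf("boosted util = %lu\n", boosted);
	return 0;
}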
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 48d5e9daa8b767e75ed9421665b037a49ce4bc04
Gitweb: https://git.kernel.org/tip/48d5e9daa8b767e75ed9421665b037a49ce4bc04
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 04 Aug 2022 15:36:01 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 27 Oct 2022 11:01:17 +02:00
sched/uclamp: Fix relationship between uclamp and migration margin
fits_capacity() verifies that a util is within 20% margin of the
capacity of a CPU, which is an attempt to speed up upmigration.
But when uclamp is used, this 20% margin is problematic because for
example if a task is boosted to 1024, then it will not fit on any CPU
according to fits_capacity() logic.
Or if a task is boosted to capacity_orig_of(medium_cpu). The task will
end up on a big CPU instead of the desired medium CPU.
Similar corner cases exist for uclamp and usage of capacity_of().
Slightest irq pressure on biggest CPU for example will make a 1024
boosted task look like it can't fit.
What we really want is for uclamp comparisons to ignore the migration
margin and capacity pressure, yet retain them for when checking the
_actual_ util signal.
For example, task p:
p->util_avg = 300
p->uclamp[UCLAMP_MIN] = 1024
Will fit a big CPU. But
p->util_avg = 900
p->uclamp[UCLAMP_MIN] = 1024
will not. This should trigger the overutilized state because the big CPU
is now *actually* being saturated.
Similar reasoning applies to capping tasks with UCLAMP_MAX. For example:
p->util_avg = 1024
p->uclamp[UCLAMP_MAX] = capacity_orig_of(medium_cpu)
Should fit the task on medium cpus without triggering overutilized
state.
Inlined comments expand on the desired behavior in more scenarios.
Introduce new util_fits_cpu() function which encapsulates the new logic.
The new function is not used anywhere yet, but will be used to update
various users of fits_capacity() in later patches.
Fixes: af24bde8df202 ("sched/uclamp: Add uclamp support to energy_compute()")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 123 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 123 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8b..0d193ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4426,6 +4426,129 @@ done:
trace_sched_util_est_se_tp(&p->se);
}
+static inline int util_fits_cpu(unsigned long util,
+ unsigned long uclamp_min,
+ unsigned long uclamp_max,
+ int cpu)
+{
+ unsigned long capacity_orig, capacity_orig_thermal;
+ unsigned long capacity = capacity_of(cpu);
+ bool fits, uclamp_max_fits;
+
+ /*
+ * Check if the real util fits without any uclamp boost/cap applied.
+ */
+ fits = fits_capacity(util, capacity);
+
+ if (!uclamp_is_used())
+ return fits;
+
+ /*
+ * We must use capacity_orig_of() for comparing against uclamp_min and
+ * uclamp_max. We only care about capacity pressure (by using
+ * capacity_of()) for comparing against the real util.
+ *
+ * If a task is boosted to 1024 for example, we don't want a tiny
+ * pressure to skew the check whether it fits a CPU or not.
+ *
+ * Similarly if a task is capped to capacity_orig_of(little_cpu), it
+ * should fit a little cpu even if there's some pressure.
+ *
+ * Only exception is for thermal pressure since it has a direct impact
+ * on available OPP of the system.
+ *
+ * We honour it for uclamp_min only as a drop in performance level
+ * could result in not getting the requested minimum performance level.
+ *
+ * For uclamp_max, we can tolerate a drop in performance level as the
+ * goal is to cap the task. So it's okay if it's getting less.
+ *
+ * In case of capacity inversion, which is not handled yet, we should
+ * honour the inverted capacity for both uclamp_min and uclamp_max all
+ * the time.
+ */
+ capacity_orig = capacity_orig_of(cpu);
+ capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
+
+ /*
+ * We want to force a task to fit a cpu as implied by uclamp_max.
+ * But we do have some corner cases to cater for..
+ *
+ *
+ * C=z
+ * | ___
+ * | C=y | |
+ * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
+ * | C=x | | | |
+ * | ___ | | | |
+ * | | | | | | | (util somewhere in this region)
+ * | | | | | | |
+ * | | | | | | |
+ * +----------------------------------------
+ * cpu0 cpu1 cpu2
+ *
+ * In the above example if a task is capped to a specific performance
+ * point, y, then when:
+ *
+ * * util = 80% of x then it does not fit on cpu0 and should migrate
+ * to cpu1
+ * * util = 80% of y then it is forced to fit on cpu1 to honour
+ * uclamp_max request.
+ *
+ * which is what we're enforcing here. A task always fits if
+ * uclamp_max <= capacity_orig. But when uclamp_max > capacity_orig,
+ * the normal upmigration rules should withhold still.
+ *
+ * Only exception is when we are on max capacity, then we need to be
+ * careful not to block overutilized state. This is so because:
+ *
+ * 1. There's no concept of capping at max_capacity! We can't go
+ * beyond this performance level anyway.
+ * 2. The system is being saturated when we're operating near
+ * max capacity, it doesn't make sense to block overutilized.
+ */
+ uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
+ uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
+ fits = fits || uclamp_max_fits;
+
+ /*
+ *
+ * C=z
+ * | ___ (region a, capped, util >= uclamp_max)
+ * | C=y | |
+ * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
+ * | C=x | | | |
+ * | ___ | | | | (region b, uclamp_min <= util <= uclamp_max)
+ * |_ _ _|_ _|_ _ _ _| _ | _ _ _| _ | _ _ _ _ _ uclamp_min
+ * | | | | | | |
+ * | | | | | | | (region c, boosted, util < uclamp_min)
+ * +----------------------------------------
+ * cpu0 cpu1 cpu2
+ *
+ * a) If util > uclamp_max, then we're capped, we don't care about
+ * actual fitness value here. We only care if uclamp_max fits
+ * capacity without taking margin/pressure into account.
+ * See comment above.
+ *
+ * b) If uclamp_min <= util <= uclamp_max, then the normal
+ * fits_capacity() rules apply. Except we need to ensure that we
+ * enforce we remain within uclamp_max, see comment above.
+ *
+ * c) If util < uclamp_min, then we are boosted. Same as (b) but we
+ * need to take into account the boosted value fits the CPU without
+ * taking margin/pressure into account.
+ *
+ * Cases (a) and (b) are handled in the 'fits' variable already. We
+ * just need to consider an extra check for case (c) after ensuring we
+ * handle the case uclamp_min > uclamp_max.
+ */
+ uclamp_min = min(uclamp_min, uclamp_max);
+ if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
+ fits = fits && (uclamp_min <= capacity_orig_thermal);
+
+ return fits;
+}
+
static inline int task_fits_capacity(struct task_struct *p,
unsigned long capacity)
{
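For reference, here is a minimal userspace model of the fitting rules above,
with the worked examples from the changelog plugged in. This is only a sketch,
not the kernel code; the 20% margin, the medium-CPU capacity of 768 and the
pressure values are assumptions for illustration:

#include <stdbool.h>
#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* util must stay below ~80% of capacity, i.e. the 20% migration margin. */
static bool fits_capacity(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

/*
 * Simplified model of util_fits_cpu(): capacity is the pressure-adjusted
 * capacity_of(), capacity_orig the original capacity and thermal an assumed
 * thermal pressure value.
 */
static bool util_fits_cpu_model(unsigned long util, unsigned long uclamp_min,
				unsigned long uclamp_max,
				unsigned long capacity_orig,
				unsigned long capacity, unsigned long thermal)
{
	unsigned long capacity_orig_thermal = capacity_orig - thermal;
	bool fits = fits_capacity(util, capacity);
	bool uclamp_max_fits;

	/* uclamp_max forces a fit unless we are at the max capacity level. */
	uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) &&
			  (uclamp_max == SCHED_CAPACITY_SCALE);
	uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
	fits = fits || uclamp_max_fits;

	/* A boosted task must also get its requested performance level. */
	uclamp_min = uclamp_min < uclamp_max ? uclamp_min : uclamp_max;
	if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
		fits = fits && (uclamp_min <= capacity_orig_thermal);

	return fits;
}

int main(void)
{
	/* util=300, uclamp_min=1024 on a big CPU with a little pressure: fits. */
	printf("%d\n", util_fits_cpu_model(300, 1024, 1024, 1024, 1000, 0));
	/* util=900, uclamp_min=1024 on the same big CPU: no longer fits. */
	printf("%d\n", util_fits_cpu_model(900, 1024, 1024, 1024, 1000, 0));
	/* util=1024 capped to an assumed medium capacity_orig=768: forced fit. */
	printf("%d\n", util_fits_cpu_model(1024, 0, 768, 768, 760, 0));
	return 0;
}

With these assumptions the three calls print 1, 0 and 1, matching the three
scenarios described in the changelog.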
The following commit has been merged into the sched/core branch of tip:
Commit-ID: a2e7f03ed28fce26c78b985f87913b6ce3accf9d
Gitweb: https://git.kernel.org/tip/a2e7f03ed28fce26c78b985f87913b6ce3accf9d
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 04 Aug 2022 15:36:05 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 27 Oct 2022 11:01:19 +02:00
sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
Use the new util_fits_cpu() to ensure migration margin and capacity
pressure are taken into account correctly when uclamp is being used;
otherwise we will fail to consider CPUs as fitting in scenarios where
they should.
s/asym_fits_capacity/asym_fits_cpu/ to better reflect what it does now.
Fixes: b4c9c9f15649 ("sched/fair: Prefer prev cpu in asymmetric wakeup path")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c877bbf..cabbdac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6807,10 +6807,13 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
return best_cpu;
}
-static inline bool asym_fits_capacity(unsigned long task_util, int cpu)
+static inline bool asym_fits_cpu(unsigned long util,
+ unsigned long util_min,
+ unsigned long util_max,
+ int cpu)
{
if (sched_asym_cpucap_active())
- return fits_capacity(task_util, capacity_of(cpu));
+ return util_fits_cpu(util, util_min, util_max, cpu);
return true;
}
@@ -6822,7 +6825,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
bool has_idle_core = false;
struct sched_domain *sd;
- unsigned long task_util;
+ unsigned long task_util, util_min, util_max;
int i, recent_used_cpu;
/*
@@ -6831,7 +6834,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (sched_asym_cpucap_active()) {
sync_entity_load_avg(&p->se);
- task_util = uclamp_task_util(p);
+ task_util = task_util_est(p);
+ util_min = uclamp_eff_value(p, UCLAMP_MIN);
+ util_max = uclamp_eff_value(p, UCLAMP_MAX);
}
/*
@@ -6840,7 +6845,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
lockdep_assert_irqs_disabled();
if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
- asym_fits_capacity(task_util, target))
+ asym_fits_cpu(task_util, util_min, util_max, target))
return target;
/*
@@ -6848,7 +6853,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
- asym_fits_capacity(task_util, prev))
+ asym_fits_cpu(task_util, util_min, util_max, prev))
return prev;
/*
@@ -6863,7 +6868,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
in_task() &&
prev == smp_processor_id() &&
this_rq()->nr_running <= 1 &&
- asym_fits_capacity(task_util, prev)) {
+ asym_fits_cpu(task_util, util_min, util_max, prev)) {
return prev;
}
@@ -6875,7 +6880,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
cpus_share_cache(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
- asym_fits_capacity(task_util, recent_used_cpu)) {
+ asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
return recent_used_cpu;
}
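To see why the raw util and the clamp values are now passed separately
instead of a pre-clamped value, here is a hedged numeric sketch (the
little-CPU capacity of 512 and the task numbers are assumptions):

#include <stdbool.h>
#include <stdio.h>

static bool fits_capacity(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

int main(void)
{
	/* Assumed: task_util_est=900 capped with UCLAMP_MAX=512, little CPU
	 * with capacity_orig = capacity_of = 512. */
	unsigned long task_util = 900, uclamp_max = 512;
	unsigned long little_cap = 512;

	/* Old path: clamp first, then apply the 20% margin -> never fits. */
	unsigned long clamped = task_util > uclamp_max ? uclamp_max : task_util;
	bool old_fits = fits_capacity(clamped, little_cap);

	/* New path (simplified util_fits_cpu()): uclamp_max <= capacity_orig
	 * forces a fit, so the capped task can be placed on the little CPU. */
	bool new_fits = fits_capacity(task_util, little_cap) ||
			uclamp_max <= little_cap;

	printf("old=%d new=%d\n", old_fits, new_fits);	/* old=0 new=1 */
	return 0;
}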
On 04/08/22 15:36, Qais Yousef wrote:
> We do consider thermal pressure in util_fits_cpu() for uclamp_min only.
> With the exception of the biggest cores which by definition are the max
> performance point of the system and all tasks by definition should fit.
>
> Even under thermal pressure, the capacity of the biggest CPU is the
> highest in the system and should still fit every task. Except when it
> reaches capacity inversion point, then this is no longer true.
>
> We can handle this by using the inverted capacity as capacity_orig in
> util_fits_cpu(). Which not only addresses the problem above, but also
> ensure uclamp_max now considers the inverted capacity. Force fitting
> a task when a CPU is in this adverse state will contribute to making the
> thermal throttling last longer.
>
> Signed-off-by: Qais Yousef <[email protected]>
> ---
> kernel/sched/fair.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cb32dc9a057f..77ae343e32a3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4293,12 +4293,16 @@ static inline int util_fits_cpu(unsigned long util,
> * For uclamp_max, we can tolerate a drop in performance level as the
> * goal is to cap the task. So it's okay if it's getting less.
> *
> - * In case of capacity inversion, which is not handled yet, we should
> - * honour the inverted capacity for both uclamp_min and uclamp_max all
> - * the time.
> + * In case of capacity inversion we should honour the inverted capacity
> + * for both uclamp_min and uclamp_max all the time.
> */
> - capacity_orig = capacity_orig_of(cpu);
> - capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
> + capacity_orig = cpu_in_capacity_inversion(cpu);
> + if (capacity_orig) {
> + capacity_orig_thermal = capacity_orig;
> + } else {
> + capacity_orig = capacity_orig_of(cpu);
> + capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
> + }
>
IIUC the rq->cpu_capacity_inverted computation in update_cpu_capacity() can be
summarised as:
- If there is a PD with equal cap_orig, but higher effective (orig - thermal)
capacity
OR
there is a PD with pd_cap_orig > cpu_effective_cap:
rq->cpu_capacity_inverted = capacity_orig - thermal_load_avg(rq)
- Else:
rq->cpu_capacity_inverted = 0
Then, the code above uses either rq->cpu_capacity_inverted if it is
non-zero, otherwise:
capacity_orig - arch_scale_thermal_pressure(cpu);
Why use average thermal pressure in one case, and use instantaneous
thermal pressure in the other?
Can't we get rid of rq->cpu_capacity_inverted and replace this whole thing
with an unconditional
capacity_orig_thermal = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
?
> /*
> * We want to force a task to fit a cpu as implied by uclamp_max.
> --
> 2.25.1
I'm only seeing this now that it has hit tip/sched/core and I've had a
stroll through fair.c, apologies for this being late.
On 04/08/22 15:36, Qais Yousef wrote:
> +static inline int util_fits_cpu(unsigned long util,
> + unsigned long uclamp_min,
> + unsigned long uclamp_max,
> + int cpu)
> +{
> + unsigned long capacity_orig, capacity_orig_thermal;
> + unsigned long capacity = capacity_of(cpu);
> + bool fits, uclamp_max_fits;
> +
> + /*
> + * Check if the real util fits without any uclamp boost/cap applied.
> + */
> + fits = fits_capacity(util, capacity);
> +
> + if (!uclamp_is_used())
> + return fits;
> +
> + /*
> + * We must use capacity_orig_of() for comparing against uclamp_min and
> + * uclamp_max. We only care about capacity pressure (by using
> + * capacity_of()) for comparing against the real util.
> + *
> + * If a task is boosted to 1024 for example, we don't want a tiny
> + * pressure to skew the check whether it fits a CPU or not.
> + *
> + * Similarly if a task is capped to capacity_orig_of(little_cpu), it
> + * should fit a little cpu even if there's some pressure.
> + *
> + * Only exception is for thermal pressure since it has a direct impact
> + * on available OPP of the system.
> + *
> + * We honour it for uclamp_min only as a drop in performance level
> + * could result in not getting the requested minimum performance level.
> + *
Why specifically care about OPPs here? Per our CPU capacity model, a task
alone on a CPUx throttled to f=fmax/2 and a task coscheduled on a CPUy with
RT/DL tasks and/or IRQs such that cpu_capacity(CPUy) = 50% are both getting
(roughly) the same performance level.
On 11/04/22 17:35, Valentin Schneider wrote:
> > + /*
> > + * We must use capacity_orig_of() for comparing against uclamp_min and
> > + * uclamp_max. We only care about capacity pressure (by using
> > + * capacity_of()) for comparing against the real util.
> > + *
> > + * If a task is boosted to 1024 for example, we don't want a tiny
> > + * pressure to skew the check whether it fits a CPU or not.
> > + *
> > + * Similarly if a task is capped to capacity_orig_of(little_cpu), it
> > + * should fit a little cpu even if there's some pressure.
> > + *
> > + * Only exception is for thermal pressure since it has a direct impact
> > + * on available OPP of the system.
> > + *
> > + * We honour it for uclamp_min only as a drop in performance level
> > + * could result in not getting the requested minimum performance level.
> > + *
>
> Why specifically care about OPPs here? Per our CPU capacity model, a task
> alone on a CPUx throttled to f=fmax/2 and a task coscheduled on a CPUy with
> RT/DL tasks and/or IRQs such that cpu_capacity(CPUy) = 50% are both getting
> (roughly) the same performance level.
Depends how you define performance level. What you call performance level,
I think is better called bandwidth. Uclamp is a performance and not a bandwidth
hint.
If a 10% task:
p->util_avg = 10% * 1024
is requesting max performance level
p->uclamp_min = 1024
This will translate to running at highest frequency and in case of big.LITTLE
system, the biggest CPU too.
RT/DL pressure has no impact on the task being able to achieve this; that is,
running at max frequency and on the biggest cpu.
If the cpu has no bandwidth to fit this task, then our usual comparison of
util_avg with capacity_of() should fail as usual.
In the example above, the RT/DL pressure has to be pretty high for the 10% task
not to fit from bandwidth point of view. Which has nothing to do with
uclamp_min. Only thermal pressure which drops OPPs can actually affect the
uclamp_min hint/request.
That is, when the task runs it will run at maximum frequency regardless of the
RT/DL pressure. The fact that the bandwidth of the CPU can be stolen has
nothing to do with uclamp_min hint.
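As a concrete sketch of this distinction (numbers assumed; a medium CPU is
used so the thermal term in util_fits_cpu() actually applies):

#include <stdbool.h>
#include <stdio.h>

static bool fits_capacity(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

int main(void)
{
	/* Assumed: ~10% task (util 102) with uclamp_min=768 on a medium CPU
	 * whose capacity_orig is 768. */
	unsigned long util = 102, uclamp_min = 768, capacity_orig = 768;

	/* Case 1: 200 of RT/DL/irq pressure, no thermal pressure. */
	bool bandwidth_ok = fits_capacity(util, capacity_orig - 200);
	bool perf_ok = uclamp_min <= capacity_orig - 0;
	printf("rt/dl pressure:   bandwidth=%d perf=%d\n", bandwidth_ok, perf_ok);

	/* Case 2: 200 of thermal pressure, no RT/DL/irq pressure. The top OPP
	 * is gone, so the uclamp_min request can no longer be honoured here. */
	bandwidth_ok = fits_capacity(util, capacity_orig - 0);
	perf_ok = uclamp_min <= capacity_orig - 200;
	printf("thermal pressure: bandwidth=%d perf=%d\n", bandwidth_ok, perf_ok);
	return 0;
}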
Thanks!
--
Qais Yousef
On 11/04/22 17:35, Valentin Schneider wrote:
> On 04/08/22 15:36, Qais Yousef wrote:
> > We do consider thermal pressure in util_fits_cpu() for uclamp_min only.
> > With the exception of the biggest cores which by definition are the max
> > performance point of the system and all tasks by definition should fit.
> >
> > Even under thermal pressure, the capacity of the biggest CPU is the
> > highest in the system and should still fit every task. Except when it
> > reaches capacity inversion point, then this is no longer true.
> >
> > We can handle this by using the inverted capacity as capacity_orig in
> > util_fits_cpu(). Which not only addresses the problem above, but also
> > ensure uclamp_max now considers the inverted capacity. Force fitting
> > a task when a CPU is in this adverse state will contribute to making the
> > thermal throttling last longer.
> >
> > Signed-off-by: Qais Yousef <[email protected]>
> > ---
> > kernel/sched/fair.c | 14 +++++++++-----
> > 1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index cb32dc9a057f..77ae343e32a3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4293,12 +4293,16 @@ static inline int util_fits_cpu(unsigned long util,
> > * For uclamp_max, we can tolerate a drop in performance level as the
> > * goal is to cap the task. So it's okay if it's getting less.
> > *
> > - * In case of capacity inversion, which is not handled yet, we should
> > - * honour the inverted capacity for both uclamp_min and uclamp_max all
> > - * the time.
> > + * In case of capacity inversion we should honour the inverted capacity
> > + * for both uclamp_min and uclamp_max all the time.
> > */
> > - capacity_orig = capacity_orig_of(cpu);
> > - capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
> > + capacity_orig = cpu_in_capacity_inversion(cpu);
> > + if (capacity_orig) {
> > + capacity_orig_thermal = capacity_orig;
> > + } else {
> > + capacity_orig = capacity_orig_of(cpu);
> > + capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
> > + }
> >
>
> IIUC the rq->cpu_capacity_inverted computation in update_cpu_capacity() can be
> summarised as:
>
> - If there is a PD with equal cap_orig, but higher effective (orig - thermal)
> capacity
> OR
> there is a PD with pd_cap_orig > cpu_effective_cap:
> rq->cpu_capacity_inverted = capacity_orig - thermal_load_avg(rq)
>
> - Else:
> rq->cpu_capacity_inverted = 0
>
> Then, the code above uses either rq->cpu_capacity_inverted if it is
> non-zero, otherwise:
>
> capacity_orig - arch_scale_thermal_pressure(cpu);
>
> Why use average thermal pressure in one case, and use instantaneous
> thermal pressure in the other?
There was a big debate on [1] about using avg vs instantaneous.
I used avg for detecting inversion to be consistent with using the average in
scale_rt_capacity(). I didn't want the inversion state to be flipping too
quickly too.
I used the instantaneous in the other check based on that discussion. It seemed
using the average is hurtful when for example the medium drops an OPP and by
not reacting quickly at wake up we lose the chance to place it on a big; which
if my memory didn't fail me is what Xuewen was seeing.
[1] https://lore.kernel.org/lkml/[email protected]/
>
> Can't we get rid of rq->cpu_capacity_inverted and replace this whole thing
> with an unconditional
>
> capacity_orig_thermal = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
>
> ?
I can't see how we end up with equivalent behavior then. Or address the
concerns raised by Xuewen and Lukasz on the RT thread in regards to avg vs
instantaneous.
Specifically, if we don't use the new rq->cpu_capacity_inverted we can't handle
the case where the task is requesting to run at maximum performance but a small
drop in capacity due to thermal pressure means it won't fit anywhere. That PD is
the best fit until it hits an inversion.
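A rough outcome-level sketch of that case (not the actual detection code; all
numbers assumed: big capacity_orig=1024, medium capacity_orig=768, a boosted
task with uclamp_min=1024):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	/* Assumed: big capacity_orig=1024 with 50 of thermal pressure,
	 * medium capacity_orig=768, task with uclamp_min=1024. */
	unsigned long big_orig = 1024, medium_orig = 768;
	unsigned long thermal = 50, uclamp_min = 1024;

	/* Always subtracting thermal pressure rejects the big CPU immediately. */
	bool fits_plain = uclamp_min <= big_orig - thermal;	/* 1024 <= 974: no */

	/* With inversion handling we keep using capacity_orig until the big
	 * CPU's effective capacity drops below the next capacity level down. */
	bool inverted = (big_orig - thermal) < medium_orig;	/* 974 < 768: no */
	unsigned long eff_orig = inverted ? big_orig - thermal : big_orig;
	bool fits_inv = uclamp_min <= eff_orig;			/* 1024 <= 1024: yes */

	printf("plain=%d inverted=%d with_inversion_handling=%d\n",
	       fits_plain, inverted, fits_inv);
	return 0;
}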
Originally I wanted to defer handling thermal pressure into a different series.
But Vincent thought it's better to handle it now. We want more data points from
more systems tbh. But I think what we have now is still a good improvement over
what we had before.
Lukasz had a patch [2] which could allow making thermal_load_avg() more
acceptable for systems that care about faster response times.
[2] https://lore.kernel.org/lkml/[email protected]/
Thanks
--
Qais Yousef
On 05/11/22 20:41, Qais Yousef wrote:
> On 11/04/22 17:35, Valentin Schneider wrote:
>> On 04/08/22 15:36, Qais Yousef wrote:
>> > We do consider thermal pressure in util_fits_cpu() for uclamp_min only.
>> > With the exception of the biggest cores which by definition are the max
>> > performance point of the system and all tasks by definition should fit.
>> >
>> > Even under thermal pressure, the capacity of the biggest CPU is the
>> > highest in the system and should still fit every task. Except when it
>> > reaches capacity inversion point, then this is no longer true.
>> >
>> > We can handle this by using the inverted capacity as capacity_orig in
>> > util_fits_cpu(). Which not only addresses the problem above, but also
>> > ensure uclamp_max now considers the inverted capacity. Force fitting
>> > a task when a CPU is in this adverse state will contribute to making the
>> > thermal throttling last longer.
>> >
>> > Signed-off-by: Qais Yousef <[email protected]>
>> > ---
>> > kernel/sched/fair.c | 14 +++++++++-----
>> > 1 file changed, 9 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index cb32dc9a057f..77ae343e32a3 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -4293,12 +4293,16 @@ static inline int util_fits_cpu(unsigned long util,
>> > * For uclamp_max, we can tolerate a drop in performance level as the
>> > * goal is to cap the task. So it's okay if it's getting less.
>> > *
>> > - * In case of capacity inversion, which is not handled yet, we should
>> > - * honour the inverted capacity for both uclamp_min and uclamp_max all
>> > - * the time.
>> > + * In case of capacity inversion we should honour the inverted capacity
>> > + * for both uclamp_min and uclamp_max all the time.
>> > */
>> > - capacity_orig = capacity_orig_of(cpu);
>> > - capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
>> > + capacity_orig = cpu_in_capacity_inversion(cpu);
>> > + if (capacity_orig) {
>> > + capacity_orig_thermal = capacity_orig;
>> > + } else {
>> > + capacity_orig = capacity_orig_of(cpu);
>> > + capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
>> > + }
>> >
>>
>> IIUC the rq->cpu_capacity_inverted computation in update_cpu_capacity() can be
>> summarised as:
>>
>> - If there is a PD with equal cap_orig, but higher effective (orig - thermal)
>> capacity
>> OR
>> there is a PD with pd_cap_orig > cpu_effective_cap:
>> rq->cpu_capacity_inverted = capacity_orig - thermal_load_avg(rq)
>>
>> - Else:
>> rq->cpu_capacity_inverted = 0
>>
>> Then, the code above uses either rq->cpu_capacity_inverted if it is
>> non-zero, otherwise:
>>
>> capacity_orig - arch_scale_thermal_pressure(cpu);
>>
>> Why use average thermal pressure in one case, and use instantaneous
>> thermal pressure in the other?
>
> There was a big debate on [1] about using avg vs instantaneous.
>
Interesting thread, thanks for the link!
> I used avg for detecting inversion to be consistent with using the average in
> scale_rt_capacity(). I didn't want the inversion state to be flipping too
> quickly too.
>
> I used the instantaneous in the other check based on that discussion. It seemed
> using the average is hurtful when for example the medium drops an OPP and by
> not reacting quickly at wake up we lose the chance to place it on a big; which
> if my memory didn't fail me is what Xuewen was seeing.
>
OK So IIUC by using the inst. pressure you start excluding CPUs sooner, and
with the avg pressure you keep those CPUs out (if the pressure remained
long enough).
> [1] https://lore.kernel.org/lkml/[email protected]/
>
>>
>> Can't we get rid of rq->cpu_capacity_inverted and replace this whole thing
>> with an unconditional
>>
>> capacity_orig_thermal = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
>>
>> ?
>
> I can't see how we end up with equivalent behavior then. Or address the
> concerns raised by Xuewen and Lukasz on the RT thread in regards to avg vs
> instantaneous.
>
> Specifically, if we don't use the new rq->cpu_capacity_inverted we can't handle
> the case where the task is requesting to run at maximum performance but a small
> drop in thermal pressure means it won't fit anywhere. That PD is the best fit
> until it hits an inversion.
>
> Originally I wanted to defer handling thermal pressure into a different series.
> But Vincent thought it's better to handle it now. We want more data points from
> more systems tbh. But I think what we have now is still a good improvement over
> what we had before.
>
> Lukasz had a patch [2] which could allow making thermal_load_avg() more
> acceptable for systems that care about faster response times.
>
> [2] https://lore.kernel.org/lkml/[email protected]/
>
>
> Thanks
>
> --
> Qais Yousef
On 05/11/22 19:24, Qais Yousef wrote:
> On 11/04/22 17:35, Valentin Schneider wrote:
>
>> > + /*
>> > + * We must use capacity_orig_of() for comparing against uclamp_min and
>> > + * uclamp_max. We only care about capacity pressure (by using
>> > + * capacity_of()) for comparing against the real util.
>> > + *
>> > + * If a task is boosted to 1024 for example, we don't want a tiny
>> > + * pressure to skew the check whether it fits a CPU or not.
>> > + *
>> > + * Similarly if a task is capped to capacity_orig_of(little_cpu), it
>> > + * should fit a little cpu even if there's some pressure.
>> > + *
>> > + * Only exception is for thermal pressure since it has a direct impact
>> > + * on available OPP of the system.
>> > + *
>> > + * We honour it for uclamp_min only as a drop in performance level
>> > + * could result in not getting the requested minimum performance level.
>> > + *
>>
>> Why specifically care about OPPs here? Per our CPU capacity model, a task
>> alone on a CPUx throttled to f=fmax/2 and a task coscheduled on a CPUy with
>> RT/DL tasks and/or IRQs such that cpu_capacity(CPUy) = 50% are both getting
>> (roughly) the same performance level.
>
> Depends how you define performance level. What you call performance level,
> I think is better called bandwidth. Uclamp is a performance and not a bandwidth
> hint.
>
> If a 10% task:
>
> p->util_avg = 10% * 1024
>
> is requesting max performance level
>
> p->uclamp_min = 1024
>
> This will translate to running at highest frequency and in case of big.LITTLE
> system, the biggest CPU too.
>
> RT/DL pressure has no impact in the task being able to achieve this; that is
> running at max frequency and biggest cpu.
>
> If the cpu has no bandwidth to fit this task, then our usual comparison of
> util_avg with capacity_of() should fail as usual.
>
Ok so we *do* have this with how the fitting criteria are combined (I
didn't get that when I first scanned through the code); thanks for
elaborating on that.
> In the example above, the RT/DL pressure has to be pretty high for the 10% task
> not to fit from bandwidth point of view. Which has nothing to do with
> uclamp_min. Only thermal pressure which drops OPPs can actually affect the
> uclamp_min hint/request.
>
> That is, when the task runs it will run at maximum frequency regardless of the
> RT/DL pressure. The fact that the bandwidth of the CPU can be stolen has
> nothing to do with uclamp_min hint.
>
>
> Thanks!
>
> --
> Qais Yousef
On 11/07/22 18:58, Valentin Schneider wrote:
> On 05/11/22 19:24, Qais Yousef wrote:
> > On 11/04/22 17:35, Valentin Schneider wrote:
> >
> >> > + /*
> >> > + * We must use capacity_orig_of() for comparing against uclamp_min and
> >> > + * uclamp_max. We only care about capacity pressure (by using
> >> > + * capacity_of()) for comparing against the real util.
> >> > + *
> >> > + * If a task is boosted to 1024 for example, we don't want a tiny
> >> > + * pressure to skew the check whether it fits a CPU or not.
> >> > + *
> >> > + * Similarly if a task is capped to capacity_orig_of(little_cpu), it
> >> > + * should fit a little cpu even if there's some pressure.
> >> > + *
> >> > + * Only exception is for thermal pressure since it has a direct impact
> >> > + * on available OPP of the system.
> >> > + *
> >> > + * We honour it for uclamp_min only as a drop in performance level
> >> > + * could result in not getting the requested minimum performance level.
> >> > + *
> >>
> >> Why specifically care about OPPs here? Per our CPU capacity model, a task
> >> alone on a CPUx throttled to f=fmax/2 and a task coscheduled on a CPUy with
> >> RT/DL tasks and/or IRQs such that cpu_capacity(CPUy) = 50% are both getting
> >> (roughly) the same performance level.
> >
> > Depends how you define performance level. What you call performance level,
> > I think is better called bandwidth. Uclamp is a performance and not a bandwidth
> > hint.
> >
> > If a 10% task:
> >
> > p->util_avg = 10% * 1024
> >
> > is requesting max performance level
> >
> > p->uclamp_min = 1024
> >
> > This will translate to running at highest frequency and in case of big.LITTLE
> > system, the biggest CPU too.
> >
> > RT/DL pressure has no impact in the task being able to achieve this; that is
> > running at max frequency and biggest cpu.
> >
> > If the cpu has no bandwidth to fit this task, then our usual comparison of
> > util_avg with capacity_of() should fail as usual.
> >
>
> Ok so we *do* have this with how the fitting criteria are combined (I
> didn't get that when I first scanned through the code); thanks for
> elaborating on that.
Oh yeah, this hasn't changed.
>
> > In the example above, the RT/DL pressure has to be pretty high for the 10% task
> > not to fit from bandwidth point of view. Which has nothing to do with
> > uclamp_min. Only thermal pressure which drops OPPs can actually affect the
> > uclamp_min hint/request.
> >
> > That is, when the task runs it will run at maximum frequency regardless of the
> > RT/DL pressure. The fact that the bandwidth of the CPU can be stolen has
> > nothing to do with uclamp_min hint.
> >
> >
> > Thanks!
> >
> > --
> > Qais Yousef
>
On 11/07/22 18:58, Valentin Schneider wrote:
> On 05/11/22 20:41, Qais Yousef wrote:
> > On 11/04/22 17:35, Valentin Schneider wrote:
> >> On 04/08/22 15:36, Qais Yousef wrote:
> >> > We do consider thermal pressure in util_fits_cpu() for uclamp_min only.
> >> > With the exception of the biggest cores which by definition are the max
> >> > performance point of the system and all tasks by definition should fit.
> >> >
> >> > Even under thermal pressure, the capacity of the biggest CPU is the
> >> > highest in the system and should still fit every task. Except when it
> >> > reaches capacity inversion point, then this is no longer true.
> >> >
> >> > We can handle this by using the inverted capacity as capacity_orig in
> >> > util_fits_cpu(). Which not only addresses the problem above, but also
> >> > ensure uclamp_max now considers the inverted capacity. Force fitting
> >> > a task when a CPU is in this adverse state will contribute to making the
> >> > thermal throttling last longer.
> >> >
> >> > Signed-off-by: Qais Yousef <[email protected]>
> >> > ---
> >> > kernel/sched/fair.c | 14 +++++++++-----
> >> > 1 file changed, 9 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > index cb32dc9a057f..77ae343e32a3 100644
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -4293,12 +4293,16 @@ static inline int util_fits_cpu(unsigned long util,
> >> > * For uclamp_max, we can tolerate a drop in performance level as the
> >> > * goal is to cap the task. So it's okay if it's getting less.
> >> > *
> >> > - * In case of capacity inversion, which is not handled yet, we should
> >> > - * honour the inverted capacity for both uclamp_min and uclamp_max all
> >> > - * the time.
> >> > + * In case of capacity inversion we should honour the inverted capacity
> >> > + * for both uclamp_min and uclamp_max all the time.
> >> > */
> >> > - capacity_orig = capacity_orig_of(cpu);
> >> > - capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
> >> > + capacity_orig = cpu_in_capacity_inversion(cpu);
> >> > + if (capacity_orig) {
> >> > + capacity_orig_thermal = capacity_orig;
> >> > + } else {
> >> > + capacity_orig = capacity_orig_of(cpu);
> >> > + capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
> >> > + }
> >> >
> >>
> >> IIUC the rq->cpu_capacity_inverted computation in update_cpu_capacity() can be
> >> summarised as:
> >>
> >> - If there is a PD with equal cap_orig, but higher effective (orig - thermal)
> >> capacity
> >> OR
> >> there is a PD with pd_cap_orig > cpu_effective_cap:
> >> rq->cpu_capacity_inverted = capacity_orig - thermal_load_avg(rq)
> >>
> >> - Else:
> >> rq->cpu_capacity_inverted = 0
> >>
> >> Then, the code above uses either rq->cpu_capacity_inverted if it is
> >> non-zero, otherwise:
> >>
> >> capacity_orig - arch_scale_thermal_pressure(cpu);
> >>
> >> Why use average thermal pressure in one case, and use instantaneous
> >> thermal pressure in the other?
> >
> > There was a big debate on [1] about using avg vs instantaneous.
> >
>
> Interesting thread, thanks for the link!
>
> > I used avg for detecting inversion to be consistent with using the average in
> > scale_rt_capacity(). I didn't want the inversion state to be flipping too
> > quickly too.
> >
> > I used the instantaneous in the other check based on that discussion. It seemed
> > using the average is hurtful when for example the medium drops an OPP and by
> > not reacting quickly at wake up we lose the chance to place it on a big; which
> > if my memory didn't fail me is what Xuewen was seeing.
> >
>
> OK So IIUC by using the inst. pressure you start excluding CPUs sooner, and
> with the avg pressure you keep those CPUs out (if the pressure remained
> long enough).
Yes. I hope the discussion on avg vs instantaneous will be continued and we can
unify the usages.
>
> > [1] https://lore.kernel.org/lkml/[email protected]/
> >
>
> >>
> >> Can't we get rid of rq->cpu_capacity_inverted and replace this whole thing
> >> with an unconditional
> >>
> >> capacity_orig_thermal = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
> >>
> >> ?
> >
> > I can't see how we end up with equivalent behavior then. Or address the
> > concerns raised by Xuewen and Lukasz on the RT thread in regards to avg vs
> > instantaneous.
> >
> > Specifically, if we don't use the new rq->cpu_capacity_inverted we can't handle
> > the case where the task is requesting to run at maximum performance but a small
> > drop in thermal pressure means it won't fit anywhere. That PD is the best fit
> > until it hits an inversion.
> >
> > Originally I wanted to defer handling thermal pressure into a different series.
> > But Vincent thought it's better to handle it now. We want more data points from
> > more systems tbh. But I think what we have now is still a good improvement over
> > what we had before.
> >
> > Lukasz had a patch [2] which could allow making thermal_load_avg() more
> > acceptable for systems that care about faster response times.
> >
> > [2] https://lore.kernel.org/lkml/[email protected]/
> >
> >
> > Thanks
> >
> > --
> > Qais Yousef
>
On 04/08/2022 16:36, Qais Yousef wrote:
[...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 52dc9d6f811e..00c2de800685 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4254,6 +4254,129 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> trace_sched_util_est_se_tp(&p->se);
> }
>
> +static inline int util_fits_cpu(unsigned long util,
> + unsigned long uclamp_min,
> + unsigned long uclamp_max,
> + int cpu)
> +{
> + unsigned long capacity_orig, capacity_orig_thermal;
> + unsigned long capacity = capacity_of(cpu);
> + bool fits, uclamp_max_fits;
> +
> + /*
> + * Check if the real util fits without any uclamp boost/cap applied.
> + */
> + fits = fits_capacity(util, capacity);
> +
> + if (!uclamp_is_used())
> + return fits;
> +
> + /*
> + * We must use capacity_orig_of() for comparing against uclamp_min and
> + * uclamp_max. We only care about capacity pressure (by using
> + * capacity_of()) for comparing against the real util.
> + *
> + * If a task is boosted to 1024 for example, we don't want a tiny
> + * pressure to skew the check whether it fits a CPU or not.
> + *
> + * Similarly if a task is capped to capacity_orig_of(little_cpu), it
> + * should fit a little cpu even if there's some pressure.
> + *
> + * Only exception is for thermal pressure since it has a direct impact
> + * on available OPP of the system.
> + *
> + * We honour it for uclamp_min only as a drop in performance level
> + * could result in not getting the requested minimum performance level.
> + *
> + * For uclamp_max, we can tolerate a drop in performance level as the
> + * goal is to cap the task. So it's okay if it's getting less.
> + *
> + * In case of capacity inversion, which is not handled yet, we should
> + * honour the inverted capacity for both uclamp_min and uclamp_max all
> + * the time.
> + */
> + capacity_orig = capacity_orig_of(cpu);
> + capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
Did you and Vincent agree in the end to use `cap_orig - _instantaneous_
thermal pressure` (1) here?
Last email in v1 from Vincent on this one was
https://lkml.kernel.org/r/20220722151300.GA30193@vingu-book :
"Using capacity_orig_of(cpu) - thermal_load_avg(rq_of(cpu)) seems like
a simple solution to cover thermal mitigation".
And there is no Acked-By/Reviewed-By so far.
We use (1) in feec() to cater for the thermal throttling (thermal
restricting policy->max), which schedutil takes into account immediately
when asking for a frequency (performance). EAS and schedutil should see
the same thing.
Do you want to use the same in util_fits_cpu() since it's used in feec()?
[...]
- Qais Yousef <[email protected]>
+ Qais Yousef <[email protected]>
On 04/08/2022 16:36, Qais Yousef wrote:
I was surprised to see these capacity inversion patches in v2. They were
not part of v1 so I never reviewed them (even internally).
> Check each performance domain to see if thermal pressure is causing its
I guess that's `PELT thermal pressure` instead of `instantaneous thermal
pressure`. IMHO an important detail here to understand the patch.
> capacity to be lower than another performance domain.
^^^^^^^
s/another/next lower (CPU capacity) level/ ?
I assume that is the definition of `capacity inversion`? IMHO it
appeared the first time in your discussion with Xuewen and Lukasz:
https://lkml.kernel.org/r/20220503144352.lxduzhl6jq6xdhw2@airbuntu
> We assume that each performance domain has CPUs with the same
> capacities, which is similar to an assumption made in energy_model.c
>
> We also assume that thermal pressure impacts all CPUs in a performance
> domain equally.
>
> If there're multiple performance domains with the same capacity_orig, we
Not aware of such a system. At least it wouldn't make much sense. Not
sure if EAS would correctly work on such a system.
> will trigger a capacity inversion if the domain is under thermal
> pressure.
>
> The new cpu_in_capacity_inversion() should help users to know when
> information about capacity_orig is not reliable so they can opt in to use
> the inverted capacity as the 'actual' capacity_orig.
>
> Signed-off-by: Qais Yousef <[email protected]>
> ---
> kernel/sched/fair.c | 63 +++++++++++++++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 19 +++++++++++++
> 2 files changed, 79 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 59ba7106ddc6..cb32dc9a057f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8659,16 +8659,73 @@ static unsigned long scale_rt_capacity(int cpu)
>
> static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> {
> + unsigned long capacity_orig = arch_scale_cpu_capacity(cpu);
> unsigned long capacity = scale_rt_capacity(cpu);
> struct sched_group *sdg = sd->groups;
> + struct rq *rq = cpu_rq(cpu);
>
> - cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
> + rq->cpu_capacity_orig = capacity_orig;
>
> if (!capacity)
> capacity = 1;
>
> - cpu_rq(cpu)->cpu_capacity = capacity;
> - trace_sched_cpu_capacity_tp(cpu_rq(cpu));
> + rq->cpu_capacity = capacity;
> +
> + /*
> + * Detect if the performance domain is in capacity inversion state.
> + *
> + * Capacity inversion happens when another perf domain with equal or
> + * lower capacity_orig_of() ends up having higher capacity than this
> + * domain after subtracting thermal pressure.
> + *
> + * We only take into account thermal pressure in this detection as it's
> + * the only metric that actually results in *real* reduction of
> + * capacity due to performance points (OPPs) being dropped/become
> + * unreachable due to thermal throttling.
> + *
> + * We assume:
> + * * That all cpus in a perf domain have the same capacity_orig
> + * (same uArch).
> + * * Thermal pressure will impact all cpus in this perf domain
> + * equally.
> + */
> + if (static_branch_unlikely(&sched_asym_cpucapacity)) {
This should be sched_energy_enabled(). Performance Domains (PDs) are an
EAS thing.
> + unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
rcu_read_lock()
> + struct perf_domain *pd = rcu_dereference(rq->rd->pd);
rcu_read_unlock()
It's called from build_sched_domains() too. I assume
static_branch_unlikely(&sched_asym_cpucapacity) hides this issue so far.
> +
> + rq->cpu_capacity_inverted = 0;
> +
> + for (; pd; pd = pd->next) {
> + struct cpumask *pd_span = perf_domain_span(pd);
> + unsigned long pd_cap_orig, pd_cap;
> +
> + cpu = cpumask_any(pd_span);
> + pd_cap_orig = arch_scale_cpu_capacity(cpu);
> +
> + if (capacity_orig < pd_cap_orig)
> + continue;
> +
> + /*
> + * handle the case of multiple perf domains have the
> + * same capacity_orig but one of them is under higher
Like I said above, I'm not aware of such an EAS system.
> + * thermal pressure. We record it as capacity
> + * inversion.
> + */
> + if (capacity_orig == pd_cap_orig) {
> + pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
> +
> + if (pd_cap > inv_cap) {
> + rq->cpu_capacity_inverted = inv_cap;
> + break;
> + }
In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
erroneously since thermal_load_avg(rq) can return different values for
inv_cap and pd_cap.
So even on a classical big little system, this condition can set
rq->cpu_capacity_inverted for a CPU in the little or big cluster.
thermal_load_avg(rq) would have to stay constant for all CPUs within the
PD to avoid this.
This is one example of the `thermal pressure` is per PD (or Frequency
Domain) in Thermal but per-CPU in the task scheduler.
> + } else if (pd_cap_orig > inv_cap) {
> + rq->cpu_capacity_inverted = inv_cap;
> + break;
> + }
> + }
> + }
> +
> + trace_sched_cpu_capacity_tp(rq);
>
> sdg->sgc->capacity = capacity;
> sdg->sgc->min_capacity = capacity;
[...]
- Qais Yousef <[email protected]>
On 05/11/2022 21:41, Qais Yousef wrote:
> On 11/04/22 17:35, Valentin Schneider wrote:
>> On 04/08/22 15:36, Qais Yousef wrote:
[...]
>> IIUC the rq->cpu_capacity_inverted computation in update_cpu_capacity() can be
>> summarised as:
>>
>> - If there is a PD with equal cap_orig, but higher effective (orig - thermal)
>> capacity
>> OR
>> there is a PD with pd_cap_orig > cpu_effective_cap:
>> rq->cpu_capacity_inverted = capacity_orig - thermal_load_avg(rq)
>>
>> - Else:
>> rq->cpu_capacity_inverted = 0
>>
>> Then, the code above uses either rq->cpu_capacity_inverted if it is
>> non-zero, otherwise:
>>
>> capacity_orig - arch_scale_thermal_pressure(cpu);
>>
>> Why use average thermal pressure in one case, and use instantaneous
>> thermal pressure in the other?
>
> There was a big debate on [1] about using avg vs instantaneous.
>
> I used avg for detecting inversion to be consistent with using the average in
> scale_rt_capacity(). I didn't want the inversion state to be flipping too
> quickly too.
>
> I used the instantaneous in the other check based on that discussion. It seemed
> using the average is hurtful when for example the medium drops an OPP and by
> not reacting quickly at wake up we lose the chance to place it on a big; which
> if my memory didn't fail me is what Xuewen was seeing.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
>>
>> Can't we get rid of rq->cpu_capacity_inverted and replace this whole thing
>> with an unconditional
>>
>> capacity_orig_thermal = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
>>
>> ?
>
> I can't see how we end up with equivalent behavior then. Or address the
> concerns raised by Xuewen and Lukasz on the RT thread in regards to avg vs
> instantaneous.
>
> Specifically, if we don't use the new rq->cpu_capacity_inverted we can't handle
> the case where the task is requesting to run at maximum performance but a small
> drop in thermal pressure means it won't fit anywhere. That PD is the best fit
> until it hits an inversion.
>
> Originally I wanted to defer handling thermal pressure into a different series.
> But Vincent thought it's better to handle it now. We want more data points from
> more systems tbh. But I think what we have now is still a good improvement over
> what we had before.
I can't see the rationale in using:
!inversion: `cap_orig - instantaneous thermal pressure`
inversion: `cap_orig - PELT thermal pressure`
I can see that there was a lot of discussion on this topic but hardly
any agreement IMHO.
AFAICS, the 2 capacity inversion patches just appeared in v2 and haven't
seen any review yet I'm afraid.
[...]
On 11/09/22 11:42, Dietmar Eggemann wrote:
[...]
> > + /*
> > + * Detect if the performance domain is in capacity inversion state.
> > + *
> > + * Capacity inversion happens when another perf domain with equal or
> > + * lower capacity_orig_of() ends up having higher capacity than this
> > + * domain after subtracting thermal pressure.
> > + *
> > + * We only take into account thermal pressure in this detection as it's
> > + * the only metric that actually results in *real* reduction of
> > + * capacity due to performance points (OPPs) being dropped/become
> > + * unreachable due to thermal throttling.
> > + *
> > + * We assume:
> > + * * That all cpus in a perf domain have the same capacity_orig
> > + * (same uArch).
> > + * * Thermal pressure will impact all cpus in this perf domain
> > + * equally.
> > + */
> > + if (static_branch_unlikely(&sched_asym_cpucapacity)) {
>
> This should be sched_energy_enabled(). Performance Domains (PDs) are an
> EAS thing.
Bummer. I had a version that used cpumasks only, but I thought using pds is
cleaner and will save unnecessary extra traversal. But I missed that it's
conditional on sched_energy_enabled().
This is not good news for CAS.
>
> > + unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
>
> rcu_read_lock()
>
> > + struct perf_domain *pd = rcu_dereference(rq->rd->pd);
>
> rcu_read_unlock()
Shouldn't we continue to hold it while traversing the pd too?
>
> It's called from build_sched_domains() too. I assume
> static_branch_unlikely(&sched_asym_cpucapacity) hides this issue so far.
>
> > +
> > + rq->cpu_capacity_inverted = 0;
> > +
> > + for (; pd; pd = pd->next) {
> > + struct cpumask *pd_span = perf_domain_span(pd);
> > + unsigned long pd_cap_orig, pd_cap;
> > +
> > + cpu = cpumask_any(pd_span);
> > + pd_cap_orig = arch_scale_cpu_capacity(cpu);
> > +
> > + if (capacity_orig < pd_cap_orig)
> > + continue;
> > +
> > + /*
> > + * handle the case of multiple perf domains have the
> > + * same capacity_orig but one of them is under higher
>
> Like I said above, I'm not aware of such an EAS system.
I did argue against that. But Vincent's PoV was that we shouldn't make
assumptions and should handle the case where we have big cores each on its
own domain.
>
> > + * thermal pressure. We record it as capacity
> > + * inversion.
> > + */
> > + if (capacity_orig == pd_cap_orig) {
> > + pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
> > +
> > + if (pd_cap > inv_cap) {
> > + rq->cpu_capacity_inverted = inv_cap;
> > + break;
> > + }
>
> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
> erroneously since thermal_load_avg(rq) can return different values for
> inv_cap and pd_cap.
Good catch!
>
> So even on a classical big little system, this condition can set
> rq->cpu_capacity_inverted for a CPU in the little or big cluster.
>
> thermal_load_avg(rq) would have to stay constant for all CPUs within the
> PD to avoid this.
>
> This is one example of the `thermal pressure` is per PD (or Frequency
> Domain) in Thermal but per-CPU in the task scheduler.
Only compile tested so far; does this patch address all your points? I should
get hardware soon to run some tests and send the patch. I might re-write it to
avoid using pds; it seems cleaner this way but we miss CAS support.
Thoughts?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89dadaafc1ec..b01854984994 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8856,16 +8856,24 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
* * Thermal pressure will impact all cpus in this perf domain
* equally.
*/
- if (static_branch_unlikely(&sched_asym_cpucapacity)) {
- unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
- struct perf_domain *pd = rcu_dereference(rq->rd->pd);
+ if (sched_energy_enabled()) {
+ struct perf_domain *pd;
+ unsigned long inv_cap;
+
+ rcu_read_lock();
+ inv_cap = capacity_orig - thermal_load_avg(rq);
+ pd = rcu_dereference(rq->rd->pd);
rq->cpu_capacity_inverted = 0;
for (; pd; pd = pd->next) {
struct cpumask *pd_span = perf_domain_span(pd);
unsigned long pd_cap_orig, pd_cap;
+ /* We can't be inverted against our own pd */
+ if (cpumask_test_cpu(cpu_of(rq), pd_span))
+ continue;
+
cpu = cpumask_any(pd_span);
pd_cap_orig = arch_scale_cpu_capacity(cpu);
@@ -8890,6 +8898,8 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
break;
}
}
+
+ rcu_read_unlock();
}
Thanks!
--
Qais Yousef
On 12/11/2022 20:35, Qais Yousef wrote:
> On 11/09/22 11:42, Dietmar Eggemann wrote:
>
[...]
>>> + * thermal pressure. We record it as capacity
>>> + * inversion.
>>> + */
>>> + if (capacity_orig == pd_cap_orig) {
>>> + pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
>>> +
>>> + if (pd_cap > inv_cap) {
>>> + rq->cpu_capacity_inverted = inv_cap;
>>> + break;
>>> + }
>>
>> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
>> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
>> erroneously since thermal_load_avg(rq) can return different values for
>> inv_cap and pd_cap.
>
> Good catch!
>
>>
>> So even on a classical big little system, this condition can set
>> rq->cpu_capacity_inverted for a CPU in the little or big cluster.
>>
>> thermal_load_avg(rq) would have to stay constant for all CPUs within the
>> PD to avoid this.
>>
>> This is one example of the `thermal pressure` is per PD (or Frequency
>> Domain) in Thermal but per-CPU in the task scheduler.
>
> Only compile tested so far, does this patch address all your points? I should
> get hardware soon to run some tests and send the patch. I might re-write it to
> avoid using pds; though it seems cleaner this way but we miss CAS support.
>
> Thoughts?
I still don't think that the `CPU capacity inversion` implementation
which uses `cap_orig' = cap_orig - thermal load avg (2)` instead of
`cap_orig'' = cap_orig - thermal pressure (1)` for inverted CPUs (i.e.
other PD exists w/ cap_orig > cap_orig') is the right answer, besides
the EAS vs. CAS coverage.
The basic question for me is why do we have to switch between (1) and
(2)? IIRC we introduced (1) in feec() to cater for the CPUfreq policy
min/max capping between schedutil and the CPUfreq driver
__resolve_freq() [drivers/cpufreq/cpufreq.c] (3).
The policy caps are set together with thermal pressure in
cpufreq_set_cur_state() [drivers/thermal/cpufreq_cooling.c] via
freq_qos_update_request().
I think we should only use (2) in the task scheduler even though the
EAS-schedutil machinery would not be 100% in sync in some cases due to (3).
Thermal load avg has similar properties to all the other EWMA-based
signals we use and we have to live with a certain degree of inaccuracy
anyway (e.g. also because of lock-less CPU statistic fetching when
selecting CPU).
And in this case we wouldn't have to have infrastructure to switch
between (1) and (2) at all.
To illustrate the problem I ran 2 workloads (hackbench/sleep) on a H960
board with step-wise thermal governor tracing thermal load_avg
(sched_pelt_thermal), thermal pressure (thermal_pressure_update) and CPU
capacity (sched_cpu_capacity). Would we really gain something substantial
and reliable by knowing the diff between (1) and (2)?
https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/thermal_pressure.ipynb
On 11/16/22 18:45, Dietmar Eggemann wrote:
> On 12/11/2022 20:35, Qais Yousef wrote:
> > On 11/09/22 11:42, Dietmar Eggemann wrote:
> >
>
> [...]
>
> >>> + * thermal pressure. We record it as capacity
> >>> + * inversion.
> >>> + */
> >>> + if (capacity_orig == pd_cap_orig) {
> >>> + pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
> >>> +
> >>> + if (pd_cap > inv_cap) {
> >>> + rq->cpu_capacity_inverted = inv_cap;
> >>> + break;
> >>> + }
> >>
> >> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
> >> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
> >> erroneously since thermal_load_avg(rq) can return different values for
> >> inv_cap and pd_cap.
> >
> > Good catch!
> >
> >>
> >> So even on a classical big little system, this condition can set
> >> rq->cpu_capacity_inverted for a CPU in the little or big cluster.
> >>
> >> thermal_load_avg(rq) would have to stay constant for all CPUs within the
> >> PD to avoid this.
> >>
> >> This is one example of the `thermal pressure` is per PD (or Frequency
> >> Domain) in Thermal but per-CPU in the task scheduler.
> >
> > Only compile tested so far, does this patch address all your points? I should
> > get hardware soon to run some tests and send the patch. I might re-write it to
> > avoid using pds; though it seems cleaner this way but we miss CAS support.
> >
> > Thoughts?
>
> I still don't think that the `CPU capacity inversion` implementation
> which uses `cap_orig' = cap_orig - thermal load avg (2)` instead of
> `cap_orig'' = cap_orig - thermal pressure (1)` for inverted CPUs (i.e.
> other PD exists w/ cap_orig > cap_orig') is the right answer, besides
> the EAS vs. CAS coverage.
>
> The basic question for me is why do we have to switch between (1) and
> (2)? IIRC we introduced (1) in feec() to cater for the CPUfreq policy
> min/max capping between schedutil and the CPUfreq driver
> __resolve_freq() [drivers/cpufreq/cpufreq.c] (3).
>
> The policy caps are set together with thermal pressure in
> cpufreq_set_cur_state() [drivers/thermal/cpufreq_cooling.c] via
> freq_qos_update_request().
>
> I think we should only use (2) in the task scheduler even though the
> EAS-schedutil machinery would be not 100% in sync in some cases due to (3).
> Thermal load avg has similar properties like all the other EWMA-based
> signals we use and we have to live with a certain degree of inaccuracy
> anyway (e.g. also because of lock-less CPU statistic fetching when
> selecting CPU).
>
> And in this case we wouldn't have to have infrastructure to switch
> between (1) and (2) at all.
>
> To illustrate the problem I ran 2 workloads (hackbench/sleep) on a H960
> board with step-wise thermal governor tracing thermal load_avg
> (sched_pelt_thermal), thermal pressure (thermal_pressure_update) and CPU
> capacity (sched_cpu_capacity). Would we really gain something
> substantial reliably when we would know the diff between (1) and (2)?
>
> https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/thermal_pressure.ipynb
>
So what you're asking for is to switch to this?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b01854984994..989f1947bd34 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8862,7 +8862,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
rcu_read_lock();
- inv_cap = capacity_orig - thermal_load_avg(rq);
+	inv_cap = capacity_orig - arch_scale_thermal_pressure(cpu_of(rq));
pd = rcu_dereference(rq->rd->pd);
rq->cpu_capacity_inverted = 0;
@@ -8887,7 +8887,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
* inversion.
*/
if (capacity_orig == pd_cap_orig) {
- pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
+			pd_cap = pd_cap_orig - arch_scale_thermal_pressure(cpu);
if (pd_cap > inv_cap) {
rq->cpu_capacity_inverted = inv_cap;
My main worry is that rq->cpu_capacity, which is updated in the same location,
uses thermal_load_avg(). Consistency here was important IMO. Besides, I think we
need good certainty that the inversion is there - we don't want to be oscillating.
Say the big core thermal pressure is increasing and it is entering capacity
inversion. If we don't use the average we'd be avoiding the cpu one tick, but
place something that drives freq high on it the next tick. This ping-pong could
end up not giving the big cores some breathing room to cool down and settle on
one state, no?
I think Lukasz's patch [1] is very important in controlling this aspect. And it
might help make the code more consistent by enabling switching all users to
thermal_load_avg() if we can speed up its reaction time sufficiently.
That said; I don't have a bullet proof answer or a very strong opinion about
it. Either direction we take I think we'll have room for improvements. It
seemed the safer more sensible option to me at this stage.
[1] https://lore.kernel.org/lkml/[email protected]/
Thanks!
--
Qais Yousef