Changes since v3:
* Update commit message of patch 2 to be less verbose
Changes since v2:
* Convert access of asym_cap_list to be rcu protected
* Add new patch to sort the list in descending order
* Move some declarations inside affinity check block
* Remove now redundant check against max_cpu_capacity in check_misfit_status()
(thanks Pierre!)
Changes since v1:
* Use asym_cap_list (thanks Dietmar) to iterate instead of iterating
through every cpu, which Vincent was concerned about.
* Use uclamped util to compare with capacity instead of util_fits_cpu()
when iterating through capacities (Dietmar).
* Update commit log with test results to better demonstrate the problem
v1 discussion: https://lore.kernel.org/lkml/[email protected]/
v2 discussion: https://lore.kernel.org/lkml/[email protected]/
v3 discussion: https://lore.kernel.org/lkml/[email protected]/
Food for thought:
------------------
Should misfit cause balance_interval to double? This patch will still be needed
even if the answer is yes, to avoid unnecessary misfit load balancing triggering
repeatedly anyway.
Should the doubling be made independent of the tick value? As it stands, 3
failures for TICK = 1ms will increase it to 8ms. But for a 4ms tick this will
become 32ms after 3 failures, which I think is too high too soon.
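As a rough standalone illustration (a hypothetical sketch in plain C, not kernel
code, assuming the interval starts at one tick and doubles once per failed
attempt as described above):

#include <stdio.h>

int main(void)
{
	unsigned int tick_ms[] = { 1, 4 };

	for (int i = 0; i < 2; i++) {
		/* balance_interval is in jiffies, so start at one tick */
		unsigned int interval_ms = tick_ms[i];

		/* double once per failed balance attempt */
		for (int failures = 0; failures < 3; failures++)
			interval_ms *= 2;

		printf("tick = %ums -> interval = %ums after 3 failures\n",
		       tick_ms[i], interval_ms);
	}

	return 0;
}

This prints 8ms for the 1ms tick and 32ms for the 4ms tick, matching the numbers
above, which is why the growth feels tick dependent.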
Should the balance_interval be capped to something more reasonable? On systems
that require fast response (an interactive desktop for example),
a balance_interval of 64ms and above seems too high.
Thanks!
--
Qais Yousef
Qais Yousef (2):
sched/fair: Check a task has a fitting cpu when updating misfit
sched/topology: Sort asym_cap_list in descending order
kernel/sched/fair.c | 65 ++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 14 +++++++++
kernel/sched/topology.c | 43 +++++++++++++++------------
3 files changed, 94 insertions(+), 28 deletions(-)
--
2.34.1
So that searches always start from the biggest CPU, which helps make the
misfit detection logic more efficient.
Suggested-by: Pierre Gondois <[email protected]>
Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/sched/topology.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ba4a0b18ae25..1505677e4247 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1384,18 +1384,30 @@ static void free_asym_cap_entry(struct rcu_head *head)
static inline void asym_cpu_capacity_update_data(int cpu)
{
unsigned long capacity = arch_scale_cpu_capacity(cpu);
- struct asym_cap_data *entry = NULL;
+ struct asym_cap_data *insert_entry = NULL;
+ struct asym_cap_data *entry;
+ /*
+ * Search if capacity already exists. If not, track the entry after
+ * which we should insert to keep the list sorted in descending order.
+ */
list_for_each_entry(entry, &asym_cap_list, link) {
if (capacity == entry->capacity)
goto done;
+ else if (!insert_entry && capacity > entry->capacity)
+ insert_entry = list_prev_entry(entry, link);
}
entry = kzalloc(sizeof(*entry) + cpumask_size(), GFP_KERNEL);
if (WARN_ONCE(!entry, "Failed to allocate memory for asymmetry data\n"))
return;
entry->capacity = capacity;
- list_add_rcu(&entry->link, &asym_cap_list);
+
+ /* If NULL then the new capacity is the smallest, add last. */
+ if (!insert_entry)
+ list_add_tail_rcu(&entry->link, &asym_cap_list);
+ else
+ list_add_rcu(&entry->link, &insert_entry->link);
done:
__cpumask_set_cpu(cpu, cpu_capacity_span(entry));
}
--
2.34.1
From: Qais Yousef <[email protected]>
If a misfit task is affined to a subset of the possible cpus, we need to
verify that one of these cpus can fit it. Otherwise the load balancer
code will continuously trigger needlessly, leading the balance_interval
to increase in return, and we eventually end up with a situation where
real imbalances take a long time to address because of this impossible
imbalance situation.
This can happen in Android world where it's common for background tasks
to be restricted to little cores.
Similarly, if the task does not fit even the biggest core, triggering misfit is
pointless as that is the best it can ever get on this system.
To be able to detect that, we use asym_cap_list to iterate through the
capacities in the system to see if the task is able to run at a higher
capacity level based on its p->cpus_ptr. To do so safely, we convert the
list to be RCU protected.
To be able to iterate through capacity levels, export asym_cap_list to
allow for fast traversal of all available capacity levels in the system.
Test:
=====
Add
trace_printk("balance_interval = %lu\n", interval)
in get_sd_balance_interval().
run
if [ "$MASK" != "0" ]; then
adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
fi
sleep 10
// parse ftrace buffer counting the occurrence of each value
Where MASK is either:
* 0: no busy task running
* 1: busy task is pinned to 1 cpu; handled today to not cause
misfit
* f: busy task pinned to little cores, simulates busy background
task, demonstrates the problem to be fixed
Results:
========
Note how occurrence of balance_interval = 128 overshoots for MASK = f.
BEFORE
------
MASK=0
1 balance_interval = 175
120 balance_interval = 128
846 balance_interval = 64
55 balance_interval = 63
215 balance_interval = 32
2 balance_interval = 31
2 balance_interval = 16
4 balance_interval = 8
1870 balance_interval = 4
65 balance_interval = 2
MASK=1
27 balance_interval = 175
37 balance_interval = 127
840 balance_interval = 64
167 balance_interval = 63
449 balance_interval = 32
84 balance_interval = 31
304 balance_interval = 16
1156 balance_interval = 8
2781 balance_interval = 4
428 balance_interval = 2
MASK=f
1 balance_interval = 175
1328 balance_interval = 128
44 balance_interval = 64
101 balance_interval = 63
25 balance_interval = 32
5 balance_interval = 31
23 balance_interval = 16
23 balance_interval = 8
4306 balance_interval = 4
177 balance_interval = 2
AFTER
-----
Note how the high values almost disappear for all MASK values. The
system has background tasks that could trigger the problem even without
simulating it, i.e. even with MASK=0.
MASK=0
103 balance_interval = 63
19 balance_interval = 31
194 balance_interval = 8
4827 balance_interval = 4
179 balance_interval = 2
MASK=1
131 balance_interval = 63
1 balance_interval = 31
87 balance_interval = 8
3600 balance_interval = 4
7 balance_interval = 2
MASK=f
8 balance_interval = 127
182 balance_interval = 63
3 balance_interval = 31
9 balance_interval = 16
415 balance_interval = 8
3415 balance_interval = 4
21 balance_interval = 2
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/sched/fair.c | 65 ++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 14 +++++++++
kernel/sched/topology.c | 29 ++++++++----------
3 files changed, 81 insertions(+), 27 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcea3d55d95d..0830ceb7ca07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5065,17 +5065,61 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
{
+ unsigned long uclamp_min, uclamp_max;
+ unsigned long util, cpu_cap;
+ int cpu = cpu_of(rq);
+
if (!sched_asym_cpucap_active())
return;
- if (!p || p->nr_cpus_allowed == 1) {
- rq->misfit_task_load = 0;
- return;
- }
+ if (!p || p->nr_cpus_allowed == 1)
+ goto out;
- if (task_fits_cpu(p, cpu_of(rq))) {
- rq->misfit_task_load = 0;
- return;
+ cpu_cap = arch_scale_cpu_capacity(cpu);
+
+ /* If we can't fit the biggest CPU, that's the best we can ever get. */
+ if (cpu_cap == SCHED_CAPACITY_SCALE)
+ goto out;
+
+ uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
+ uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
+ util = task_util_est(p);
+
+ if (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0)
+ goto out;
+
+ /*
+ * If the task affinity is not set to default, make sure it is not
+ * restricted to a subset where no CPU can ever fit it. Triggering
+ * misfit in this case is pointless as it has nowhere better to move
+ * to. And it can lead to balance_interval growing too high as we'll
+ * continuously fail to move it anywhere.
+ */
+ if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
+ unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
+ bool has_fitting_cpu = false;
+ struct asym_cap_data *entry;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(entry, &asym_cap_list, link) {
+ if (entry->capacity > cpu_cap) {
+ cpumask_t *cpumask;
+
+ if (clamped_util > entry->capacity)
+ continue;
+
+ cpumask = cpu_capacity_span(entry);
+ if (!cpumask_intersects(p->cpus_ptr, cpumask))
+ continue;
+
+ has_fitting_cpu = true;
+ break;
+ }
+ }
+ rcu_read_unlock();
+
+ if (!has_fitting_cpu)
+ goto out;
}
/*
@@ -5083,6 +5127,9 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
* task_h_load() returns 0.
*/
rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
+ return;
+out:
+ rq->misfit_task_load = 0;
}
#else /* CONFIG_SMP */
@@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
*/
static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
{
- return rq->misfit_task_load &&
- (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
- check_cpu_capacity(rq, sd));
+ return rq->misfit_task_load && check_cpu_capacity(rq, sd);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77d..a653017a1b9b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -109,6 +109,20 @@ extern int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;
extern int sched_rr_timeslice;
+/*
+ * Asymmetric CPU capacity bits
+ */
+struct asym_cap_data {
+ struct list_head link;
+ struct rcu_head rcu;
+ unsigned long capacity;
+ unsigned long cpus[];
+};
+
+extern struct list_head asym_cap_list;
+
+#define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus)
+
/*
* Helpers for converting nanosecond timing to jiffy resolution
*/
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 10d1391e7416..ba4a0b18ae25 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1329,24 +1329,13 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
update_group_capacity(sd, cpu);
}
-/*
- * Asymmetric CPU capacity bits
- */
-struct asym_cap_data {
- struct list_head link;
- unsigned long capacity;
- unsigned long cpus[];
-};
-
/*
* Set of available CPUs grouped by their corresponding capacities
* Each list entry contains a CPU mask reflecting CPUs that share the same
* capacity.
* The lifespan of data is unlimited.
*/
-static LIST_HEAD(asym_cap_list);
-
-#define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus)
+LIST_HEAD(asym_cap_list);
/*
* Verify whether there is any CPU capacity asymmetry in a given sched domain.
@@ -1386,6 +1375,12 @@ asym_cpu_capacity_classify(const struct cpumask *sd_span,
}
+static void free_asym_cap_entry(struct rcu_head *head)
+{
+ struct asym_cap_data *entry = container_of(head, struct asym_cap_data, rcu);
+ kfree(entry);
+}
+
static inline void asym_cpu_capacity_update_data(int cpu)
{
unsigned long capacity = arch_scale_cpu_capacity(cpu);
@@ -1400,7 +1395,7 @@ static inline void asym_cpu_capacity_update_data(int cpu)
if (WARN_ONCE(!entry, "Failed to allocate memory for asymmetry data\n"))
return;
entry->capacity = capacity;
- list_add(&entry->link, &asym_cap_list);
+ list_add_rcu(&entry->link, &asym_cap_list);
done:
__cpumask_set_cpu(cpu, cpu_capacity_span(entry));
}
@@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
if (cpumask_empty(cpu_capacity_span(entry))) {
- list_del(&entry->link);
- kfree(entry);
+ list_del_rcu(&entry->link);
+ call_rcu(&entry->rcu, free_asym_cap_entry);
}
}
@@ -1434,8 +1429,8 @@ static void asym_cpu_capacity_scan(void)
*/
if (list_is_singular(&asym_cap_list)) {
entry = list_first_entry(&asym_cap_list, typeof(*entry), link);
- list_del(&entry->link);
- kfree(entry);
+ list_del_rcu(&entry->link);
+ call_rcu(&entry->rcu, free_asym_cap_entry);
}
}
--
2.34.1
Hi Vincent
On 01/05/24 22:20, Qais Yousef wrote:
> Changes since v3:
>
> * Update commit message of patch 2 to be less verbose
>
> Changes since v2:
>
> * Convert access of asym_cap_list to be rcu protected
> * Add new patch to sort the list in descending order
> * Move some declarations inside affinity check block
> * Remove now redundant check against max_cpu_capacity in check_misfit_status()
>
> (thanks Pierre!)
>
> Changes since v1:
>
> * Use asym_cap_list (thanks Dietmar) to iterate instead of iterating
> through every cpu which Vincent was concerned about.
> * Use uclamped util to compare with capacity instead of util_fits_cpu()
> when iterating through capcities (Dietmar).
> * Update commit log with test results to better demonstrate the problem
>
> v1 discussion: https://lore.kernel.org/lkml/[email protected]/
> v2 discussion: https://lore.kernel.org/lkml/[email protected]/
> v3 discussion: https://lore.kernel.org/lkml/[email protected]/
>
> Food for thoughts:
> ------------------
>
> Should misfit cause balance_interval to double? This patch will still be needed
> if the answer is yes to avoid unnecessary misfit-lb to trigger repeatedly
> anyway.
>
> Should the doubling be made independent of tick value? As it stands 3 failures
> for TICK = 1ms will increase it to 8ms. But for 4ms tick this will become 32ms
> after 3 failures. Which I think is too high too soon.
>
> Should the balance_interval be capped to something more reasonable? On systems
> that require fast response (interactive Desktop for example);
> a balance_interval of 64ms and above seem too high.
Does this series address your concerns about scalability now?
If you have thoughts on the above that'd be great to hear too.
Thanks
--
Qais Yousef
On 05/01/2024 23:20, Qais Yousef wrote:
> From: Qais Yousef <[email protected]>
>
> If a misfit task is affined to a subset of the possible cpus, we need to
> verify that one of these cpus can fit it. Otherwise the load balancer
> code will continuously trigger needlessly leading the balance_interval
> to increase in return and eventually end up with a situation where real
> imbalances take a long time to address because of this impossible
> imbalance situation.
>
> This can happen in Android world where it's common for background tasks
> to be restricted to little cores.
>
> Similarly if we can't fit the biggest core, triggering misfit is
> pointless as it is the best we can ever get on this system.
>
> To be able to detect that; we use asym_cap_list to iterate through
> capacities in the system to see if the task is able to run at a higher
> capacity level based on its p->cpus_ptr. To do so safely, we convert the
> list to be RCU protected.
>
> To be able to iterate through capacity levels, export asym_cap_list to
> allow for fast traversal of all available capacity levels in the system.
>
> Test:
> =====
>
> Add
>
> trace_printk("balance_interval = %lu\n", interval)
>
> in get_sd_balance_interval().
>
> run
> if [ "$MASK" != "0" ]; then
> adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
> fi
> sleep 10
> // parse ftrace buffer counting the occurrence of each value
>
> Where MASK is either:
>
> * 0: no busy task running
... no busy task stands for no misfit scenario?
> * 1: busy task is pinned to 1 cpu; handled today to not cause
> misfit
> * f: busy task pinned to little cores, simulates busy background
> task, demonstrates the problem to be fixed
>
[...]
> + /*
> + * If the task affinity is not set to default, make sure it is not
> + * restricted to a subset where no CPU can ever fit it. Triggering
> + * misfit in this case is pointless as it has no where better to move
> + * to. And it can lead to balance_interval to grow too high as we'll
> + * continuously fail to move it anywhere.
> + */
> + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
Shouldn't this be cpu_active_mask ?
include/linux/cpumask.h
* cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
* cpu_present_mask - has bit 'cpu' set iff cpu is populated
* cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
* cpu_active_mask - has bit 'cpu' set iff cpu available to migration
> + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> + bool has_fitting_cpu = false;
> + struct asym_cap_data *entry;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
> + if (entry->capacity > cpu_cap) {
> + cpumask_t *cpumask;
> +
> + if (clamped_util > entry->capacity)
> + continue;
> +
> + cpumask = cpu_capacity_span(entry);
> + if (!cpumask_intersects(p->cpus_ptr, cpumask))
> + continue;
> +
> + has_fitting_cpu = true;
> + break;
> + }
> + }
What happens when we hotplug out all CPUs of one CPU capacity value?
IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
(partition_sched_domains_locked()).
> + rcu_read_unlock();
> +
> + if (!has_fitting_cpu)
> + goto out;
> }
>
> /*
> @@ -5083,6 +5127,9 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> * task_h_load() returns 0.
> */
> rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
> + return;
> +out:
> + rq->misfit_task_load = 0;
> }
>
> #else /* CONFIG_SMP */
> @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> */
> static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> {
> - return rq->misfit_task_load &&
> - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> - check_cpu_capacity(rq, sd));
> + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
You removed 'arch_scale_cpu_capacity(rq->cpu) <
rq->rd->max_cpu_capacity' here. Why? I can see that with the standard
setup (max CPU capacity equal 1024) which is what we probably use 100%
of the time now. It might get useful again when Vincent introduces
his 'user space system pressure' implementation?
> }
[...]
> @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
>
> list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> if (cpumask_empty(cpu_capacity_span(entry))) {
> - list_del(&entry->link);
> - kfree(entry);
> + list_del_rcu(&entry->link);
> + call_rcu(&entry->rcu, free_asym_cap_entry);
Looks like there could be brief moments in which one CPU capacity group
of CPUs could be twice in asym_cap_list. I'm thinking about initial
startup + max CPU frequency related adjustment of CPU capacity
(init_cpu_capacity_callback()) for instance. Not sure if this is really
an issue?
[...]
On 01/22/24 09:59, Dietmar Eggemann wrote:
> On 05/01/2024 23:20, Qais Yousef wrote:
> > From: Qais Yousef <[email protected]>
> >
> > If a misfit task is affined to a subset of the possible cpus, we need to
> > verify that one of these cpus can fit it. Otherwise the load balancer
> > code will continuously trigger needlessly leading the balance_interval
> > to increase in return and eventually end up with a situation where real
> > imbalances take a long time to address because of this impossible
> > imbalance situation.
> >
> > This can happen in Android world where it's common for background tasks
> > to be restricted to little cores.
> >
> > Similarly if we can't fit the biggest core, triggering misfit is
> > pointless as it is the best we can ever get on this system.
> >
> > To be able to detect that; we use asym_cap_list to iterate through
> > capacities in the system to see if the task is able to run at a higher
> > capacity level based on its p->cpus_ptr. To do so safely, we convert the
> > list to be RCU protected.
> >
> > To be able to iterate through capacity levels, export asym_cap_list to
> > allow for fast traversal of all available capacity levels in the system.
> >
> > Test:
> > =====
> >
> > Add
> >
> > trace_printk("balance_interval = %lu\n", interval)
> >
> > in get_sd_balance_interval().
> >
> > run
> > if [ "$MASK" != "0" ]; then
> > adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
> > fi
> > sleep 10
> > // parse ftrace buffer counting the occurrence of each value
> >
> > Where MASK is either:
> >
> > * 0: no busy task running
>
> ... no busy task stands for no misfit scenario?
Yes
>
> > * 1: busy task is pinned to 1 cpu; handled today to not cause
> > misfit
> > * f: busy task pinned to little cores, simulates busy background
> > task, demonstrates the problem to be fixed
> >
>
> [...]
>
> > + /*
> > + * If the task affinity is not set to default, make sure it is not
> > + * restricted to a subset where no CPU can ever fit it. Triggering
> > + * misfit in this case is pointless as it has no where better to move
> > + * to. And it can lead to balance_interval to grow too high as we'll
> > + * continuously fail to move it anywhere.
> > + */
> > + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
>
> Shouldn't this be cpu_active_mask ?
Hmm. So the intention was to check if the affinity was changed from default.
If we hotplug out all but the littles we could end up with the same problem,
yes you're right.
But if the affinity is set to only the littles and cpu_active_mask is only the
littles too, then we'll also end up with the same problem as they both are
equal.
Better to drop this check then? With the sorted list the common case should be
quick to return as they'll have 1024 as a possible CPU.
>
> include/linux/cpumask.h
>
> * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
> * cpu_present_mask - has bit 'cpu' set iff cpu is populated
> * cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
> * cpu_active_mask - has bit 'cpu' set iff cpu available to migration
>
>
> > + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> > + bool has_fitting_cpu = false;
> > + struct asym_cap_data *entry;
> > +
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
> > + if (entry->capacity > cpu_cap) {
> > + cpumask_t *cpumask;
> > +
> > + if (clamped_util > entry->capacity)
> > + continue;
> > +
> > + cpumask = cpu_capacity_span(entry);
> > + if (!cpumask_intersects(p->cpus_ptr, cpumask))
> > + continue;
> > +
> > + has_fitting_cpu = true;
> > + break;
> > + }
> > + }
>
> What happen when we hotplug out all CPUs of one CPU capacity value?
> IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
> (partition_sched_domains_locked()).
Right. I missed that. We can add another intersection check against
cpu_active_mask.
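A minimal sketch of what that extra check could look like inside the existing
loop (hypothetical and untested, written on top of this patch):

	list_for_each_entry_rcu(entry, &asym_cap_list, link) {
		if (entry->capacity > cpu_cap) {
			cpumask_t *cpumask;

			if (clamped_util > entry->capacity)
				continue;

			cpumask = cpu_capacity_span(entry);

			/* Hypothetical: ignore capacity levels whose CPUs are all offline. */
			if (!cpumask_intersects(cpumask, cpu_active_mask))
				continue;

			if (!cpumask_intersects(p->cpus_ptr, cpumask))
				continue;

			has_fitting_cpu = true;
			break;
		}
	}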
I am assuming the skipping was done by design, not a bug that needs fixing?
I see for suspend (cpuhp_tasks_frozen) the domains are rebuilt, but not for
hotplug.
>
> > + rcu_read_unlock();
> > +
> > + if (!has_fitting_cpu)
> > + goto out;
> > }
> >
> > /*
> > @@ -5083,6 +5127,9 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > * task_h_load() returns 0.
> > */
> > rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
> > + return;
> > +out:
> > + rq->misfit_task_load = 0;
> > }
> >
> > #else /* CONFIG_SMP */
> > @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> > */
> > static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> > {
> > - return rq->misfit_task_load &&
> > - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> > - check_cpu_capacity(rq, sd));
> > + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
>
> You removed 'arch_scale_cpu_capacity(rq->cpu) <
> rq->rd->max_cpu_capacity' here. Why? I can see that with the standard
Based on Pierre's review, since we no longer trigger misfit for big cores.
I thought Pierre's remark was correct so I did the change in v3:
https://lore.kernel.org/lkml/[email protected]/
> setup (max CPU capacity equal 1024) which is what we probably use 100%
> of the time now. It might get useful again when Vincent will introduce
> his 'user space system pressure' implementation?
I don't mind putting it back if you think it'd be required again in the near
future. I still didn't get a chance to look at Vincent's patches properly, but if
there's a clash let's reduce the work.
>
> > }
>
> [...]
>
> > @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
> >
> > list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> > if (cpumask_empty(cpu_capacity_span(entry))) {
> > - list_del(&entry->link);
> > - kfree(entry);
> > + list_del_rcu(&entry->link);
> > + call_rcu(&entry->rcu, free_asym_cap_entry);
>
> Looks like there could be brief moments in which one CPU capacity group
> of CPUs could be twice in asym_cap_list. I'm thinking about initial
> startup + max CPU frequency related adjustment of CPU capacity
> (init_cpu_capacity_callback()) for instance. Not sure if this is really
> an issue?
I don't think so. As long as the reader sees a consistent value and no crashes
ensue, a momentarily wrong decision during the transient or some extra work is
fine IMO.
I don't foresee a big impact.
Thanks!
--
Qais Yousef
>
> [...]
>
On Fri, 5 Jan 2024 at 23:20, Qais Yousef <[email protected]> wrote:
>
> From: Qais Yousef <[email protected]>
>
> If a misfit task is affined to a subset of the possible cpus, we need to
> verify that one of these cpus can fit it. Otherwise the load balancer
> code will continuously trigger needlessly leading the balance_interval
> to increase in return and eventually end up with a situation where real
> imbalances take a long time to address because of this impossible
> imbalance situation.
If your problem is about increasing balance_interval, it would be
better to not increase the interval in such a case.
I mean that we are able to detect misfit_task conditions for the
periodic load balance so we should be able to not increase the
interval in such cases.
If I'm not wrong, your problem only happens when the system is
overutilized and we have disabled EAS.
>
> This can happen in Android world where it's common for background tasks
> to be restricted to little cores.
>
> Similarly if we can't fit the biggest core, triggering misfit is
> pointless as it is the best we can ever get on this system.
>
> To be able to detect that; we use asym_cap_list to iterate through
> capacities in the system to see if the task is able to run at a higher
> capacity level based on its p->cpus_ptr. To do so safely, we convert the
> list to be RCU protected.
>
> To be able to iterate through capacity levels, export asym_cap_list to
> allow for fast traversal of all available capacity levels in the system.
>
> Test:
> =====
>
> Add
>
> trace_printk("balance_interval = %lu\n", interval)
>
> in get_sd_balance_interval().
>
> run
> if [ "$MASK" != "0" ]; then
> adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
> fi
> sleep 10
> // parse ftrace buffer counting the occurrence of each value
>
> Where MASK is either:
>
> * 0: no busy task running
> * 1: busy task is pinned to 1 cpu; handled today to not cause
> misfit
> * f: busy task pinned to little cores, simulates busy background
> task, demonstrates the problem to be fixed
>
> Results:
> ========
>
> Note how occurrence of balance_interval = 128 overshoots for MASK = f.
>
> BEFORE
> ------
>
> MASK=0
>
> 1 balance_interval = 175
> 120 balance_interval = 128
> 846 balance_interval = 64
> 55 balance_interval = 63
> 215 balance_interval = 32
> 2 balance_interval = 31
> 2 balance_interval = 16
> 4 balance_interval = 8
> 1870 balance_interval = 4
> 65 balance_interval = 2
>
> MASK=1
>
> 27 balance_interval = 175
> 37 balance_interval = 127
> 840 balance_interval = 64
> 167 balance_interval = 63
> 449 balance_interval = 32
> 84 balance_interval = 31
> 304 balance_interval = 16
> 1156 balance_interval = 8
> 2781 balance_interval = 4
> 428 balance_interval = 2
>
> MASK=f
>
> 1 balance_interval = 175
> 1328 balance_interval = 128
> 44 balance_interval = 64
> 101 balance_interval = 63
> 25 balance_interval = 32
> 5 balance_interval = 31
> 23 balance_interval = 16
> 23 balance_interval = 8
> 4306 balance_interval = 4
> 177 balance_interval = 2
>
> AFTER
> -----
>
> Note how the high values almost disappear for all MASK values. The
> system has background tasks that could trigger the problem without
> simulate it even with MASK=0.
>
> MASK=0
>
> 103 balance_interval = 63
> 19 balance_interval = 31
> 194 balance_interval = 8
> 4827 balance_interval = 4
> 179 balance_interval = 2
>
> MASK=1
>
> 131 balance_interval = 63
> 1 balance_interval = 31
> 87 balance_interval = 8
> 3600 balance_interval = 4
> 7 balance_interval = 2
>
> MASK=f
>
> 8 balance_interval = 127
> 182 balance_interval = 63
> 3 balance_interval = 31
> 9 balance_interval = 16
> 415 balance_interval = 8
> 3415 balance_interval = 4
> 21 balance_interval = 2
>
> Signed-off-by: Qais Yousef <[email protected]>
> Signed-off-by: Qais Yousef (Google) <[email protected]>
> ---
> kernel/sched/fair.c | 65 ++++++++++++++++++++++++++++++++++-------
> kernel/sched/sched.h | 14 +++++++++
> kernel/sched/topology.c | 29 ++++++++----------
> 3 files changed, 81 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bcea3d55d95d..0830ceb7ca07 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5065,17 +5065,61 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
>
> static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> {
> + unsigned long uclamp_min, uclamp_max;
> + unsigned long util, cpu_cap;
> + int cpu = cpu_of(rq);
> +
> if (!sched_asym_cpucap_active())
> return;
>
> - if (!p || p->nr_cpus_allowed == 1) {
> - rq->misfit_task_load = 0;
> - return;
> - }
> + if (!p || p->nr_cpus_allowed == 1)
> + goto out;
>
> - if (task_fits_cpu(p, cpu_of(rq))) {
> - rq->misfit_task_load = 0;
> - return;
> + cpu_cap = arch_scale_cpu_capacity(cpu);
> +
> + /* If we can't fit the biggest CPU, that's the best we can ever get. */
> + if (cpu_cap == SCHED_CAPACITY_SCALE)
> + goto out;
> +
> + uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> + uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> + util = task_util_est(p);
> +
> + if (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0)
> + goto out;
> +
> + /*
> + * If the task affinity is not set to default, make sure it is not
> + * restricted to a subset where no CPU can ever fit it. Triggering
> + * misfit in this case is pointless as it has no where better to move
> + * to. And it can lead to balance_interval to grow too high as we'll
> + * continuously fail to move it anywhere.
> + */
> + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
> + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> + bool has_fitting_cpu = false;
> + struct asym_cap_data *entry;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
> + if (entry->capacity > cpu_cap) {
> + cpumask_t *cpumask;
> +
> + if (clamped_util > entry->capacity)
> + continue;
> +
> + cpumask = cpu_capacity_span(entry);
> + if (!cpumask_intersects(p->cpus_ptr, cpumask))
> + continue;
> +
> + has_fitting_cpu = true;
> + break;
> + }
> + }
> + rcu_read_unlock();
> +
> + if (!has_fitting_cpu)
> + goto out;
> }
>
> /*
> @@ -5083,6 +5127,9 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> * task_h_load() returns 0.
> */
> rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
> + return;
> +out:
> + rq->misfit_task_load = 0;
> }
>
> #else /* CONFIG_SMP */
> @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> */
> static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> {
> - return rq->misfit_task_load &&
> - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> - check_cpu_capacity(rq, sd));
> + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
> }
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e58a54bda77d..a653017a1b9b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -109,6 +109,20 @@ extern int sysctl_sched_rt_period;
> extern int sysctl_sched_rt_runtime;
> extern int sched_rr_timeslice;
>
> +/*
> + * Asymmetric CPU capacity bits
> + */
> +struct asym_cap_data {
> + struct list_head link;
> + struct rcu_head rcu;
> + unsigned long capacity;
> + unsigned long cpus[];
> +};
> +
> +extern struct list_head asym_cap_list;
> +
> +#define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus)
> +
> /*
> * Helpers for converting nanosecond timing to jiffy resolution
> */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 10d1391e7416..ba4a0b18ae25 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1329,24 +1329,13 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
> update_group_capacity(sd, cpu);
> }
>
> -/*
> - * Asymmetric CPU capacity bits
> - */
> -struct asym_cap_data {
> - struct list_head link;
> - unsigned long capacity;
> - unsigned long cpus[];
> -};
> -
> /*
> * Set of available CPUs grouped by their corresponding capacities
> * Each list entry contains a CPU mask reflecting CPUs that share the same
> * capacity.
> * The lifespan of data is unlimited.
> */
> -static LIST_HEAD(asym_cap_list);
> -
> -#define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus)
> +LIST_HEAD(asym_cap_list);
>
> /*
> * Verify whether there is any CPU capacity asymmetry in a given sched domain.
> @@ -1386,6 +1375,12 @@ asym_cpu_capacity_classify(const struct cpumask *sd_span,
>
> }
>
> +static void free_asym_cap_entry(struct rcu_head *head)
> +{
> + struct asym_cap_data *entry = container_of(head, struct asym_cap_data, rcu);
> + kfree(entry);
> +}
> +
> static inline void asym_cpu_capacity_update_data(int cpu)
> {
> unsigned long capacity = arch_scale_cpu_capacity(cpu);
> @@ -1400,7 +1395,7 @@ static inline void asym_cpu_capacity_update_data(int cpu)
> if (WARN_ONCE(!entry, "Failed to allocate memory for asymmetry data\n"))
> return;
> entry->capacity = capacity;
> - list_add(&entry->link, &asym_cap_list);
> + list_add_rcu(&entry->link, &asym_cap_list);
> done:
> __cpumask_set_cpu(cpu, cpu_capacity_span(entry));
> }
> @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
>
> list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> if (cpumask_empty(cpu_capacity_span(entry))) {
> - list_del(&entry->link);
> - kfree(entry);
> + list_del_rcu(&entry->link);
> + call_rcu(&entry->rcu, free_asym_cap_entry);
> }
> }
>
> @@ -1434,8 +1429,8 @@ static void asym_cpu_capacity_scan(void)
> */
> if (list_is_singular(&asym_cap_list)) {
> entry = list_first_entry(&asym_cap_list, typeof(*entry), link);
> - list_del(&entry->link);
> - kfree(entry);
> + list_del_rcu(&entry->link);
> + call_rcu(&entry->rcu, free_asym_cap_entry);
> }
> }
>
> --
> 2.34.1
>
On Mon, 22 Jan 2024 at 10:59, Dietmar Eggemann <[email protected]> wrote:
>
> On 05/01/2024 23:20, Qais Yousef wrote:
> > From: Qais Yousef <[email protected]>
> >
> > If a misfit task is affined to a subset of the possible cpus, we need to
> > verify that one of these cpus can fit it. Otherwise the load balancer
> > code will continuously trigger needlessly leading the balance_interval
> > to increase in return and eventually end up with a situation where real
> > imbalances take a long time to address because of this impossible
> > imbalance situation.
> >
> > This can happen in Android world where it's common for background tasks
> > to be restricted to little cores.
> >
> > Similarly if we can't fit the biggest core, triggering misfit is
> > pointless as it is the best we can ever get on this system.
> >
> > To be able to detect that; we use asym_cap_list to iterate through
> > capacities in the system to see if the task is able to run at a higher
> > capacity level based on its p->cpus_ptr. To do so safely, we convert the
> > list to be RCU protected.
> >
> > To be able to iterate through capacity levels, export asym_cap_list to
> > allow for fast traversal of all available capacity levels in the system.
> >
> > Test:
> > =====
> >
> > Add
> >
> > trace_printk("balance_interval = %lu\n", interval)
> >
> > in get_sd_balance_interval().
> >
> > run
> > if [ "$MASK" != "0" ]; then
> > adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
> > fi
> > sleep 10
> > // parse ftrace buffer counting the occurrence of each value
> >
> > Where MASK is either:
> >
> > * 0: no busy task running
>
> ... no busy task stands for no misfit scenario?
>
> > * 1: busy task is pinned to 1 cpu; handled today to not cause
> > misfit
> > * f: busy task pinned to little cores, simulates busy background
> > task, demonstrates the problem to be fixed
> >
>
> [...]
>
> > + /*
> > + * If the task affinity is not set to default, make sure it is not
> > + * restricted to a subset where no CPU can ever fit it. Triggering
> > + * misfit in this case is pointless as it has no where better to move
> > + * to. And it can lead to balance_interval to grow too high as we'll
> > + * continuously fail to move it anywhere.
> > + */
> > + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
>
> Shouldn't this be cpu_active_mask ?
>
> include/linux/cpumask.h
>
> * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
> * cpu_present_mask - has bit 'cpu' set iff cpu is populated
> * cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
> * cpu_active_mask - has bit 'cpu' set iff cpu available to migration
>
>
> > + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> > + bool has_fitting_cpu = false;
> > + struct asym_cap_data *entry;
> > +
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
> > + if (entry->capacity > cpu_cap) {
> > + cpumask_t *cpumask;
> > +
> > + if (clamped_util > entry->capacity)
> > + continue;
> > +
> > + cpumask = cpu_capacity_span(entry);
> > + if (!cpumask_intersects(p->cpus_ptr, cpumask))
> > + continue;
> > +
> > + has_fitting_cpu = true;
> > + break;
> > + }
> > + }
>
> What happen when we hotplug out all CPUs of one CPU capacity value?
> IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
> (partition_sched_domains_locked()).
>
> > + rcu_read_unlock();
> > +
> > + if (!has_fitting_cpu)
> > + goto out;
> > }
> >
> > /*
> > @@ -5083,6 +5127,9 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > * task_h_load() returns 0.
> > */
> > rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
> > + return;
> > +out:
> > + rq->misfit_task_load = 0;
> > }
> >
> > #else /* CONFIG_SMP */
> > @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> > */
> > static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> > {
> > - return rq->misfit_task_load &&
> > - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> > - check_cpu_capacity(rq, sd));
> > + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
>
> You removed 'arch_scale_cpu_capacity(rq->cpu) <
> rq->rd->max_cpu_capacity' here. Why? I can see that with the standard
> setup (max CPU capacity equal 1024) which is what we probably use 100%
> of the time now. It might get useful again when Vincent will introduce
> his 'user space system pressure' implementation?
That's interesting because I'm doing the opposite in the user space
system pressure series that I'm preparing:
I keep something similar to (arch_scale_cpu_capacity(rq->cpu) <
rq->rd->max_cpu_capacity) but I remove check_cpu_capacity(rq, sd), which
seems to be useless because it's already used earlier in
nohz_balancer_kick().
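For reference, such a variant could look roughly like the below (a sketch
inferred from the description above, not the actual patch being prepared):

static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
{
	/* sd would likely be dropped from the signature in a real patch */
	return rq->misfit_task_load &&
	       arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity;
}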
>
> > }
>
> [...]
>
> > @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
> >
> > list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> > if (cpumask_empty(cpu_capacity_span(entry))) {
> > - list_del(&entry->link);
> > - kfree(entry);
> > + list_del_rcu(&entry->link);
> > + call_rcu(&entry->rcu, free_asym_cap_entry);
>
> Looks like there could be brief moments in which one CPU capacity group
> of CPUs could be twice in asym_cap_list. I'm thinking about initial
> startup + max CPU frequency related adjustment of CPU capacity
> (init_cpu_capacity_callback()) for instance. Not sure if this is really
> an issue?
>
> [...]
>
On Fri, 5 Jan 2024 at 23:20, Qais Yousef <[email protected]> wrote:
>
> From: Qais Yousef <[email protected]>
>
> If a misfit task is affined to a subset of the possible cpus, we need to
> verify that one of these cpus can fit it. Otherwise the load balancer
> code will continuously trigger needlessly leading the balance_interval
> to increase in return and eventually end up with a situation where real
> imbalances take a long time to address because of this impossible
> imbalance situation.
>
> This can happen in Android world where it's common for background tasks
> to be restricted to little cores.
>
> Similarly if we can't fit the biggest core, triggering misfit is
> pointless as it is the best we can ever get on this system.
>
> To be able to detect that; we use asym_cap_list to iterate through
> capacities in the system to see if the task is able to run at a higher
> capacity level based on its p->cpus_ptr. To do so safely, we convert the
> list to be RCU protected.
>
> To be able to iterate through capacity levels, export asym_cap_list to
> allow for fast traversal of all available capacity levels in the system.
>
> Test:
> =====
>
> Add
>
> trace_printk("balance_interval = %lu\n", interval)
>
> in get_sd_balance_interval().
>
> run
> if [ "$MASK" != "0" ]; then
> adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
> fi
> sleep 10
> // parse ftrace buffer counting the occurrence of each value
>
> Where MASK is either:
>
> * 0: no busy task running
> * 1: busy task is pinned to 1 cpu; handled today to not cause
> misfit
> * f: busy task pinned to little cores, simulates busy background
> task, demonstrates the problem to be fixed
>
> Results:
> ========
>
> Note how occurrence of balance_interval = 128 overshoots for MASK = f.
>
> BEFORE
> ------
>
> MASK=0
>
> 1 balance_interval = 175
> 120 balance_interval = 128
> 846 balance_interval = 64
> 55 balance_interval = 63
> 215 balance_interval = 32
> 2 balance_interval = 31
> 2 balance_interval = 16
> 4 balance_interval = 8
> 1870 balance_interval = 4
> 65 balance_interval = 2
>
> MASK=1
>
> 27 balance_interval = 175
> 37 balance_interval = 127
> 840 balance_interval = 64
> 167 balance_interval = 63
> 449 balance_interval = 32
> 84 balance_interval = 31
> 304 balance_interval = 16
> 1156 balance_interval = 8
> 2781 balance_interval = 4
> 428 balance_interval = 2
>
> MASK=f
>
> 1 balance_interval = 175
> 1328 balance_interval = 128
> 44 balance_interval = 64
> 101 balance_interval = 63
> 25 balance_interval = 32
> 5 balance_interval = 31
> 23 balance_interval = 16
> 23 balance_interval = 8
> 4306 balance_interval = 4
> 177 balance_interval = 2
>
> AFTER
> -----
>
> Note how the high values almost disappear for all MASK values. The
> system has background tasks that could trigger the problem without
> simulate it even with MASK=0.
>
> MASK=0
>
> 103 balance_interval = 63
> 19 balance_interval = 31
> 194 balance_interval = 8
> 4827 balance_interval = 4
> 179 balance_interval = 2
>
> MASK=1
>
> 131 balance_interval = 63
> 1 balance_interval = 31
> 87 balance_interval = 8
> 3600 balance_interval = 4
> 7 balance_interval = 2
>
> MASK=f
>
> 8 balance_interval = 127
> 182 balance_interval = 63
> 3 balance_interval = 31
> 9 balance_interval = 16
> 415 balance_interval = 8
> 3415 balance_interval = 4
> 21 balance_interval = 2
>
> Signed-off-by: Qais Yousef <[email protected]>
> Signed-off-by: Qais Yousef (Google) <[email protected]>
> ---
> kernel/sched/fair.c | 65 ++++++++++++++++++++++++++++++++++-------
> kernel/sched/sched.h | 14 +++++++++
> kernel/sched/topology.c | 29 ++++++++----------
> 3 files changed, 81 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bcea3d55d95d..0830ceb7ca07 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5065,17 +5065,61 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
>
> static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> {
> + unsigned long uclamp_min, uclamp_max;
> + unsigned long util, cpu_cap;
> + int cpu = cpu_of(rq);
> +
> if (!sched_asym_cpucap_active())
> return;
>
> - if (!p || p->nr_cpus_allowed == 1) {
> - rq->misfit_task_load = 0;
> - return;
> - }
> + if (!p || p->nr_cpus_allowed == 1)
> + goto out;
>
> - if (task_fits_cpu(p, cpu_of(rq))) {
> - rq->misfit_task_load = 0;
> - return;
> + cpu_cap = arch_scale_cpu_capacity(cpu);
> +
> + /* If we can't fit the biggest CPU, that's the best we can ever get. */
> + if (cpu_cap == SCHED_CAPACITY_SCALE)
> + goto out;
> +
> + uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> + uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> + util = task_util_est(p);
> +
> + if (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0)
> + goto out;
> +
> + /*
> + * If the task affinity is not set to default, make sure it is not
> + * restricted to a subset where no CPU can ever fit it. Triggering
> + * misfit in this case is pointless as it has no where better to move
> + * to. And it can lead to balance_interval to grow too high as we'll
> + * continuously fail to move it anywhere.
> + */
> + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
> + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> + bool has_fitting_cpu = false;
> + struct asym_cap_data *entry;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
Do we really want to potentially do this loop at every pick_next_task?
> + if (entry->capacity > cpu_cap) {
> + cpumask_t *cpumask;
> +
> + if (clamped_util > entry->capacity)
> + continue;
> +
> + cpumask = cpu_capacity_span(entry);
> + if (!cpumask_intersects(p->cpus_ptr, cpumask))
> + continue;
> +
> + has_fitting_cpu = true;
> + break;
> + }
> + }
> + rcu_read_unlock();
> +
> + if (!has_fitting_cpu)
> + goto out;
> }
>
> /*
> @@ -5083,6 +5127,9 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> * task_h_load() returns 0.
> */
> rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
> + return;
> +out:
> + rq->misfit_task_load = 0;
> }
>
> #else /* CONFIG_SMP */
> @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> */
> static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> {
> - return rq->misfit_task_load &&
> - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> - check_cpu_capacity(rq, sd));
> + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
> }
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e58a54bda77d..a653017a1b9b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -109,6 +109,20 @@ extern int sysctl_sched_rt_period;
> extern int sysctl_sched_rt_runtime;
> extern int sched_rr_timeslice;
>
> +/*
> + * Asymmetric CPU capacity bits
> + */
> +struct asym_cap_data {
> + struct list_head link;
> + struct rcu_head rcu;
> + unsigned long capacity;
> + unsigned long cpus[];
> +};
> +
> +extern struct list_head asym_cap_list;
> +
> +#define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus)
> +
> /*
> * Helpers for converting nanosecond timing to jiffy resolution
> */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 10d1391e7416..ba4a0b18ae25 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1329,24 +1329,13 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
> update_group_capacity(sd, cpu);
> }
>
> -/*
> - * Asymmetric CPU capacity bits
> - */
> -struct asym_cap_data {
> - struct list_head link;
> - unsigned long capacity;
> - unsigned long cpus[];
> -};
> -
> /*
> * Set of available CPUs grouped by their corresponding capacities
> * Each list entry contains a CPU mask reflecting CPUs that share the same
> * capacity.
> * The lifespan of data is unlimited.
> */
> -static LIST_HEAD(asym_cap_list);
> -
> -#define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus)
> +LIST_HEAD(asym_cap_list);
>
> /*
> * Verify whether there is any CPU capacity asymmetry in a given sched domain.
> @@ -1386,6 +1375,12 @@ asym_cpu_capacity_classify(const struct cpumask *sd_span,
>
> }
>
> +static void free_asym_cap_entry(struct rcu_head *head)
> +{
> + struct asym_cap_data *entry = container_of(head, struct asym_cap_data, rcu);
> + kfree(entry);
> +}
> +
> static inline void asym_cpu_capacity_update_data(int cpu)
> {
> unsigned long capacity = arch_scale_cpu_capacity(cpu);
> @@ -1400,7 +1395,7 @@ static inline void asym_cpu_capacity_update_data(int cpu)
> if (WARN_ONCE(!entry, "Failed to allocate memory for asymmetry data\n"))
> return;
> entry->capacity = capacity;
> - list_add(&entry->link, &asym_cap_list);
> + list_add_rcu(&entry->link, &asym_cap_list);
> done:
> __cpumask_set_cpu(cpu, cpu_capacity_span(entry));
> }
> @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
>
> list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> if (cpumask_empty(cpu_capacity_span(entry))) {
> - list_del(&entry->link);
> - kfree(entry);
> + list_del_rcu(&entry->link);
> + call_rcu(&entry->rcu, free_asym_cap_entry);
> }
> }
>
> @@ -1434,8 +1429,8 @@ static void asym_cpu_capacity_scan(void)
> */
> if (list_is_singular(&asym_cap_list)) {
> entry = list_first_entry(&asym_cap_list, typeof(*entry), link);
> - list_del(&entry->link);
> - kfree(entry);
> + list_del_rcu(&entry->link);
> + call_rcu(&entry->rcu, free_asym_cap_entry);
> }
> }
>
> --
> 2.34.1
>
On 22/01/2024 19:02, Qais Yousef wrote:
> On 01/22/24 09:59, Dietmar Eggemann wrote:
>> On 05/01/2024 23:20, Qais Yousef wrote:
>>> From: Qais Yousef <[email protected]>
[...]
>>> + /*
>>> + * If the task affinity is not set to default, make sure it is not
>>> + * restricted to a subset where no CPU can ever fit it. Triggering
>>> + * misfit in this case is pointless as it has no where better to move
>>> + * to. And it can lead to balance_interval to grow too high as we'll
>>> + * continuously fail to move it anywhere.
>>> + */
>>> + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
>>
>> Shouldn't this be cpu_active_mask ?
>
> Hmm. So the intention was to check if the affinity was changed from default.
>
> If we hotplug all but little we could end up with the same problem, yes you're
> right.
>
> But if the affinity is set to only to littles and cpu_active_mask is only for
> littles too, then we'll also end up with the same problem as they both are
> equal.
Yes, that's true.
> Better to drop this check then? With the sorted list the common case should be
> quick to return as they'll have 1024 as a possible CPU.
Or you keep 'cpu_possible_mask' and rely on the fact that the
asym_cap_list entries are removed if those CPUs are hotplugged out. In
this case the !has_fitting_cpu path should prevent useless Misfit load
balancing approaches.
[...]
>> What happen when we hotplug out all CPUs of one CPU capacity value?
>> IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
>> (partition_sched_domains_locked()).
>
> Right. I missed that. We can add another intersection check against
> cpu_active_mask.
>
> I am assuming the skipping was done by design, not a bug that needs fixing?
> I see for suspend (cpuhp_tasks_frozen) the domains are rebuilt, but not for
> hotplug.
IMHO, it's by design. We setup asym_cap_list only when new_topology is
set (update_topology_flags_workfn() from init_cpu_capacity_callback() or
topology_init_cpu_capacity_cppc()). I.e. when the (max) CPU capacity can
change.
In all the other !new_topology cases we check `has_asym |= sd->flags &
SD_ASYM_CPUCAPACITY` and set sched_asym_cpucapacity accordingly in
build_sched_domains(). Before we always reset sched_asym_cpucapacity in
detach_destroy_domains().
But now we would have to keep asym_cap_list in sync with the active CPUs
I guess.
[...]
>>> #else /* CONFIG_SMP */
>>> @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>>> */
>>> static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>>> {
>>> - return rq->misfit_task_load &&
>>> - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
>>> - check_cpu_capacity(rq, sd));
>>> + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
>>
>> You removed 'arch_scale_cpu_capacity(rq->cpu) <
>> rq->rd->max_cpu_capacity' here. Why? I can see that with the standard
>
> Based on Pierre review since we no longer trigger misfit for big cores.
> I thought Pierre's remark was correct so did the change in v3
Ah, this is the replacement:
- if (task_fits_cpu(p, cpu_of(rq))) { <- still MF for util > 0.8 * 1024
- rq->misfit_task_load = 0;
- return;
+ cpu_cap = arch_scale_cpu_capacity(cpu);
+
+ /* If we can't fit the biggest CPU, that's the best we can ever get */
+ if (cpu_cap == SCHED_CAPACITY_SCALE)
+ goto out;
>
> https://lore.kernel.org/lkml/[email protected]/
>
>> setup (max CPU capacity equal 1024) which is what we probably use 100%
>> of the time now. It might get useful again when Vincent will introduce
>> his 'user space system pressure' implementation?
>
> I don't mind putting it back if you think it'd be required again in the near
> future. I still didn't get a chance to look at Vincent patches properly, but if
> there's a clash let's reduce the work.
Vincent already commented on this in this thread.
[...]
>>> @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
>>>
>>> list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
>>> if (cpumask_empty(cpu_capacity_span(entry))) {
>>> - list_del(&entry->link);
>>> - kfree(entry);
>>> + list_del_rcu(&entry->link);
>>> + call_rcu(&entry->rcu, free_asym_cap_entry);
>>
>> Looks like there could be brief moments in which one CPU capacity group
>> of CPUs could be twice in asym_cap_list. I'm thinking about initial
>> startup + max CPU frequency related adjustment of CPU capacity
>> (init_cpu_capacity_callback()) for instance. Not sure if this is really
>> an issue?
>
> I don't think so. As long as the reader sees a consistent value and no crashes
> ensued, a momentarily wrong decision in transient or extra work is fine IMO.
> I don't foresee a big impact.
OK.
On 01/23/24 09:26, Vincent Guittot wrote:
> On Fri, 5 Jan 2024 at 23:20, Qais Yousef <[email protected]> wrote:
> >
> > From: Qais Yousef <[email protected]>
> >
> > If a misfit task is affined to a subset of the possible cpus, we need to
> > verify that one of these cpus can fit it. Otherwise the load balancer
> > code will continuously trigger needlessly leading the balance_interval
> > to increase in return and eventually end up with a situation where real
> > imbalances take a long time to address because of this impossible
> > imbalance situation.
>
> If your problem is about increasing balance_interval, it would be
> better to not increase the interval is such case.
> I mean that we are able to detect misfit_task conditions for the
> periodic load balance so we should be able to not increase the
> interval in such cases.
>
> If I'm not wrong, your problem only happens when the system is
> overutilized and we have disable EAS
Yes and no. There are two concerns here:
1.
So this patch is a generalized form of 0ae78eec8aa6 ("sched/eas: Don't update
misfit status if the task is pinned") which is when I originally noticed the
problem and this patch was written along side it.
We have unlinked misfit from overutilized since then.
And to be honest I am not sure if flattening of topology matters too since
I first noticed this, which was on Juno which doesn't have flat topology.
FWIW I can still reproduce this, but I have a different setup now. On M1 mac
mini if I spawn a busy task affined to littles then expand the mask for
a single big core; I see big delays (>500ms) without the patch. But with the
patch it moves in few ms. The delay without the patch is too large and I can't
explain it. So the worry here is that generally misfit migration not happening
fast enough due to this fake misfit cases.
I did hit issues where with this patch I saw big delays sometimes. I have no
clue why this happens. So there are potentially more problems to chase.
My expectation is that newidle balance should be able to pull a misfit task
regardless of balance_interval. So the system has to be really busy or really
quiet to notice the delays. I think prior to the flat topology this pull was not
guaranteed, but with the flat topology it should happen.
On this system if I expand the mask to all cpus (instead of littles + single
big), the issue is not as easy to reproduce, but I captured 35+ms delays
- which is long if this task was carrying important work and needs to
upmigrate. I thought newidle balance is more likely to pull it sooner, but I am
not 100% sure.
It's a 6.6 kernel I am testing with.
2.
Here yes the concern is that when we are overutilized and load balance is
required, this unnecessarily long delay can cause potential problems.
Cheers
--
Qais Yousef
On 01/23/24 18:22, Vincent Guittot wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bcea3d55d95d..0830ceb7ca07 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5065,17 +5065,61 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> >
> > static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > {
> > + unsigned long uclamp_min, uclamp_max;
> > + unsigned long util, cpu_cap;
> > + int cpu = cpu_of(rq);
> > +
> > if (!sched_asym_cpucap_active())
> > return;
> >
> > - if (!p || p->nr_cpus_allowed == 1) {
> > - rq->misfit_task_load = 0;
> > - return;
> > - }
> > + if (!p || p->nr_cpus_allowed == 1)
> > + goto out;
> >
> > - if (task_fits_cpu(p, cpu_of(rq))) {
> > - rq->misfit_task_load = 0;
> > - return;
> > + cpu_cap = arch_scale_cpu_capacity(cpu);
> > +
> > + /* If we can't fit the biggest CPU, that's the best we can ever get. */
> > + if (cpu_cap == SCHED_CAPACITY_SCALE)
> > + goto out;
> > +
> > + uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> > + uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> > + util = task_util_est(p);
> > +
> > + if (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0)
> > + goto out;
> > +
> > + /*
> > + * If the task affinity is not set to default, make sure it is not
> > + * restricted to a subset where no CPU can ever fit it. Triggering
> > + * misfit in this case is pointless as it has no where better to move
> > + * to. And it can lead to balance_interval to grow too high as we'll
> > + * continuously fail to move it anywhere.
> > + */
> > + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
> > + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> > + bool has_fitting_cpu = false;
> > + struct asym_cap_data *entry;
> > +
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
>
> Do we really want to potentially do this loop at every pick_next task ?
The common case should return quickly, as the biggest CPU should be present in
every task's affinity mask by default. And after sorting, the biggest CPU will
be the first entry and we should return after one check.
Could we move the update to another less expensive location instead?
We could try to do better tracking for CPUs that have their affinity changed,
but I am not keen on sprinkling more complexity elsewhere to deal with this.
We could keep the status quo and just prevent the misfit load balancing from
incrementing nr_failed, similar to newidle_balance. I think this should have
a similar effect; see the sketch below. Not ideal, but if this is still
considered too expensive I can't think of other options that don't look ugly to
me FWIW.
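To illustrate, a rough (untested) sketch of that idea in load_balance(),
assuming env.migration_type is still set to migrate_misfit by the time we
account the failure:

	if (!ld_moved) {
		schedstat_inc(sd->lb_failed[idle]);
		/*
		 * Don't let newidle balance (very frequent) or a misfit-only
		 * failure pollute the failure counter and, through it,
		 * inflate balance_interval.
		 */
		if (idle != CPU_NEWLY_IDLE &&
		    env.migration_type != migrate_misfit)
			sd->nr_balance_failed++;
		/* ... rest of the failure handling unchanged ... */
	}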
Thanks
--
Qais Yousef
On 01/23/24 18:07, Dietmar Eggemann wrote:
> On 22/01/2024 19:02, Qais Yousef wrote:
> > On 01/22/24 09:59, Dietmar Eggemann wrote:
> >> On 05/01/2024 23:20, Qais Yousef wrote:
> >>> From: Qais Yousef <[email protected]>
>
> [...]
>
> >>> + /*
> >>> + * If the task affinity is not set to default, make sure it is not
> >>> + * restricted to a subset where no CPU can ever fit it. Triggering
> >>> + * misfit in this case is pointless as it has no where better to move
> >>> + * to. And it can lead to balance_interval to grow too high as we'll
> >>> + * continuously fail to move it anywhere.
> >>> + */
> >>> + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
> >>
> >> Shouldn't this be cpu_active_mask ?
> >
> > Hmm. So the intention was to check if the affinity was changed from default.
> >
> > If we hotplug all but little we could end up with the same problem, yes you're
> > right.
> >
> > But if the affinity is set to only to littles and cpu_active_mask is only for
> > littles too, then we'll also end up with the same problem as they both are
> > equal.
>
> Yes, that's true.
>
> > Better to drop this check then? With the sorted list the common case should be
> > quick to return as they'll have 1024 as a possible CPU.
>
> Or you keep 'cpu_possible_mask' and rely on the fact that the
> asym_cap_list entries are removed if those CPUs are hotplugged out. In
> this case the !has_fitting_cpu path should prevent useless Misfit load
> balancing approaches.
IIUC this removal doesn't happen today, as outlined below, but it is something
we need to do, right? That would be better, yes.
>
> [...]
>
> >> What happen when we hotplug out all CPUs of one CPU capacity value?
> >> IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
> >> (partition_sched_domains_locked()).
> >
> > Right. I missed that. We can add another intersection check against
> > cpu_active_mask.
> >
> > I am assuming the skipping was done by design, not a bug that needs fixing?
> > I see for suspend (cpuhp_tasks_frozen) the domains are rebuilt, but not for
> > hotplug.
>
> IMHO, it's by design. We setup asym_cap_list only when new_topology is
> set (update_topology_flags_workfn() from init_cpu_capacity_callback() or
> topology_init_cpu_capacity_cppc()). I.e. when the (max) CPU capacity can
> change.
> In all the other !new_topology cases we check `has_asym |= sd->flags &
> SD_ASYM_CPUCAPACITY` and set sched_asym_cpucapacity accordingly in
> build_sched_domains(). Before we always reset sched_asym_cpucapacity in
> detach_destroy_domains().
> But now we would have to keep asym_cap_list in sync with the active CPUs
> I guess.
Okay, so you suggest we need to update the code to keep it in sync. Let's see
first if Vincent is satisfied with this list traversal or we need to go another
way :-)
I think it is worth having this asym_capacity list available. It seems we have
needed it several times, and only a little work is required to make it available
for potential future users. Even if we don't merge this immediately.
>
> [...]
>
> >>> #else /* CONFIG_SMP */
> >>> @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> >>> */
> >>> static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> >>> {
> >>> - return rq->misfit_task_load &&
> >>> - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> >>> - check_cpu_capacity(rq, sd));
> >>> + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
> >>
> >> You removed 'arch_scale_cpu_capacity(rq->cpu) <
> >> rq->rd->max_cpu_capacity' here. Why? I can see that with the standard
> >
> > Based on Pierre review since we no longer trigger misfit for big cores.
> > I thought Pierre's remark was correct so did the change in v3
>
> Ah, this is the replacement:
>
> - if (task_fits_cpu(p, cpu_of(rq))) { <- still MF for util > 0.8 * 1024
> - rq->misfit_task_load = 0;
> - return;
> + cpu_cap = arch_scale_cpu_capacity(cpu);
> +
> + /* If we can't fit the biggest CPU, that's the best we can ever get */
> + if (cpu_cap == SCHED_CAPACITY_SCALE)
> + goto out;
Yep.
>
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> >> setup (max CPU capacity equal 1024) which is what we probably use 100%
> >> of the time now. It might get useful again when Vincent will introduce
> >> his 'user space system pressure' implementation?
> >
> > I don't mind putting it back if you think it'd be required again in the near
> > future. I still didn't get a chance to look at Vincent patches properly, but if
> > there's a clash let's reduce the work.
>
> Vincent did already comment on this in this thread.
>
> [...]
>
> >>> @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
> >>>
> >>> list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> >>> if (cpumask_empty(cpu_capacity_span(entry))) {
> >>> - list_del(&entry->link);
> >>> - kfree(entry);
> >>> + list_del_rcu(&entry->link);
> >>> + call_rcu(&entry->rcu, free_asym_cap_entry);
> >>
> >> Looks like there could be brief moments in which one CPU capacity group
> >> of CPUs could be twice in asym_cap_list. I'm thinking about initial
> >> startup + max CPU frequency related adjustment of CPU capacity
> >> (init_cpu_capacity_callback()) for instance. Not sure if this is really
> >> an issue?
> >
> > I don't think so. As long as the reader sees a consistent value and no crashes
> > ensued, a momentarily wrong decision in transient or extra work is fine IMO.
> > I don't foresee a big impact.
>
> OK.
Thanks!
--
Qais Yousef
On 01/23/24 09:32, Vincent Guittot wrote:
> > > @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> > > */
> > > static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> > > {
> > > - return rq->misfit_task_load &&
> > > - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> > > - check_cpu_capacity(rq, sd));
> > > + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
> >
> > You removed 'arch_scale_cpu_capacity(rq->cpu) <
> > rq->rd->max_cpu_capacity' here. Why? I can see that with the standard
> > setup (max CPU capacity equal 1024) which is what we probably use 100%
> > of the time now. It might get useful again when Vincent will introduce
> > his 'user space system pressure' implementation?
>
> That's interesting because I'm doing the opposite in the user space
> system pressure that I'm preparing:
> I keep something similar to (arch_scale_cpu_capacity(rq->cpu) <
> rq->rd->max_cpu_capacity but I remove check_cpu_capacity(rq, sd) which
> seems to be useless because it's already used earlier in
> nohz_balancer_kick()
Okay. I need to look at your patches anyway. I can potentially rebase on top of
your series.
Cheers
--
Qais Yousef
On 24/01/2024 22:43, Qais Yousef wrote:
> On 01/23/24 18:07, Dietmar Eggemann wrote:
>> On 22/01/2024 19:02, Qais Yousef wrote:
>>> On 01/22/24 09:59, Dietmar Eggemann wrote:
>>>> On 05/01/2024 23:20, Qais Yousef wrote:
>>>>> From: Qais Yousef <[email protected]>
[...]
>>>> What happen when we hotplug out all CPUs of one CPU capacity value?
>>>> IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
>>>> (partition_sched_domains_locked()).
>>>
>>> Right. I missed that. We can add another intersection check against
>>> cpu_active_mask.
>>>
>>> I am assuming the skipping was done by design, not a bug that needs fixing?
>>> I see for suspend (cpuhp_tasks_frozen) the domains are rebuilt, but not for
>>> hotplug.
>>
>> IMHO, it's by design. We setup asym_cap_list only when new_topology is
>> set (update_topology_flags_workfn() from init_cpu_capacity_callback() or
>> topology_init_cpu_capacity_cppc()). I.e. when the (max) CPU capacity can
>> change.
>> In all the other !new_topology cases we check `has_asym |= sd->flags &
>> SD_ASYM_CPUCAPACITY` and set sched_asym_cpucapacity accordingly in
>> build_sched_domains(). Before we always reset sched_asym_cpucapacity in
>> detach_destroy_domains().
>> But now we would have to keep asym_cap_list in sync with the active CPUs
>> I guess.
>
> Okay, so you suggest we need to update the code to keep it in sync. Let's see
> first if Vincent is satisfied with this list traversal or we need to go another
> way :-)
Yes, if preventing the 'increase of balance_interval' will cure this
issue as well, then this will definitely be the less invasive fix.
Can you not easily do a 'perf bench sched messaging -g X -l Y' test on
your M1 to get some numbers behind this additional list traversal in
pick_next_task_fair()?
> I think it is worth having this asym_capacity list available. It seemed several
> times we needed it and just a little work is required to make it available for
> potential future users. Even if we don't merge immediately.
I agree. It would give us this ordered (by max CPU capacity) list of
CPUs to iterate over.
[...]
On Wed, 24 Jan 2024 at 23:30, Qais Yousef <[email protected]> wrote:
>
> On 01/23/24 09:26, Vincent Guittot wrote:
> > On Fri, 5 Jan 2024 at 23:20, Qais Yousef <[email protected]> wrote:
> > >
> > > From: Qais Yousef <[email protected]>
> > >
> > > If a misfit task is affined to a subset of the possible cpus, we need to
> > > verify that one of these cpus can fit it. Otherwise the load balancer
> > > code will continuously trigger needlessly leading the balance_interval
> > > to increase in return and eventually end up with a situation where real
> > > imbalances take a long time to address because of this impossible
> > > imbalance situation.
> >
> > If your problem is about increasing balance_interval, it would be
> > better to not increase the interval is such case.
> > I mean that we are able to detect misfit_task conditions for the
> > periodic load balance so we should be able to not increase the
> > interval in such cases.
> >
> > If I'm not wrong, your problem only happens when the system is
> > overutilized and we have disable EAS
>
> Yes and no. There are two concerns here:
>
> 1.
>
> So this patch is a generalized form of 0ae78eec8aa6 ("sched/eas: Don't update
> misfit status if the task is pinned") which is when I originally noticed the
> problem and this patch was written along side it.
>
> We have unlinked misfit from overutilized since then.
>
> And to be honest I am not sure if flattening of topology matters too since
> I first noticed this, which was on Juno which doesn't have flat topology.
>
> FWIW I can still reproduce this, but I have a different setup now. On M1 mac
> mini if I spawn a busy task affined to littles then expand the mask for
> a single big core; I see big delays (>500ms) without the patch. But with the
> patch it moves in few ms. The delay without the patch is too large and I can't
> explain it. So the worry here is that generally misfit migration not happening
> fast enough due to this fake misfit cases.
I tried a similar scenario on RB5 but I don't see any difference with
your patch. And that could be me not testing it correctly...
I set the affinity of always running task to cpu[0-3] for a few
seconds then extend it to [0-3,7] and the time to migrate is almost
the same.
I'm using tip/sched/core + [0]
[0] https://lore.kernel.org/all/[email protected]/
>
> I did hit issues where with this patch I saw big delays sometimes. I have no
> clue why this happens. So there are potentially more problems to chase.
>
> My expectations that newidle balance should be able to pull misfit regardless
> of balance_interval. So the system has to be really busy or really quite to
> notice delays. I think prior to flat topology this pull was not guaranteed, but
> with flat topology it should happen.
>
> On this system if I expand the mask to all cpus (instead of littles + single
> big), the issue is not as easy to reproduce, but I captured 35+ms delays
> - which is long if this task was carrying important work and needs to
> upmigrate. I thought newidle balance is more likely to pull it sooner, but I am
> not 100% sure.
>
> It's a 6.6 kernel I am testing with.
>
> 2.
>
> Here yes the concern is that when we are overutilized and load balance is
> required, this unnecessarily long delay can cause potential problems.
>
>
> Cheers
>
> --
> Qais Yousef
On Wed, 24 Jan 2024 at 23:46, Qais Yousef <[email protected]> wrote:
>
> On 01/23/24 09:32, Vincent Guittot wrote:
>
> > > > @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> > > > */
> > > > static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> > > > {
> > > > - return rq->misfit_task_load &&
> > > > - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> > > > - check_cpu_capacity(rq, sd));
> > > > + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
Coming back to this:
With your change above, misfit can't kick an idle load balance and
must wait for the cpu capacity to be noticeably reduced by something
else.
> > >
> > > You removed 'arch_scale_cpu_capacity(rq->cpu) <
> > > rq->rd->max_cpu_capacity' here. Why? I can see that with the standard
> > > setup (max CPU capacity equal 1024) which is what we probably use 100%
> > > of the time now. It might get useful again when Vincent will introduce
> > > his 'user space system pressure' implementation?
> >
> > That's interesting because I'm doing the opposite in the user space
> > system pressure that I'm preparing:
> > I keep something similar to (arch_scale_cpu_capacity(rq->cpu) <
> > rq->rd->max_cpu_capacity but I remove check_cpu_capacity(rq, sd) which
> > seems to be useless because it's already used earlier in
> > nohz_balancer_kick()
>
> Okay. I need to look at your patches anyway. I can potentially rebase on top of
> your series.
>
>
> Cheers
>
> --
> Qais Yousef
On Wed, 24 Jan 2024 at 23:38, Qais Yousef <[email protected]> wrote:
>
> On 01/23/24 18:22, Vincent Guittot wrote:
>
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index bcea3d55d95d..0830ceb7ca07 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5065,17 +5065,61 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> > >
> > > static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > > {
> > > + unsigned long uclamp_min, uclamp_max;
> > > + unsigned long util, cpu_cap;
> > > + int cpu = cpu_of(rq);
> > > +
> > > if (!sched_asym_cpucap_active())
> > > return;
> > >
> > > - if (!p || p->nr_cpus_allowed == 1) {
> > > - rq->misfit_task_load = 0;
> > > - return;
> > > - }
> > > + if (!p || p->nr_cpus_allowed == 1)
> > > + goto out;
> > >
> > > - if (task_fits_cpu(p, cpu_of(rq))) {
> > > - rq->misfit_task_load = 0;
> > > - return;
> > > + cpu_cap = arch_scale_cpu_capacity(cpu);
> > > +
> > > + /* If we can't fit the biggest CPU, that's the best we can ever get. */
> > > + if (cpu_cap == SCHED_CAPACITY_SCALE)
> > > + goto out;
> > > +
> > > + uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> > > + uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> > > + util = task_util_est(p);
> > > +
> > > + if (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0)
> > > + goto out;
> > > +
> > > + /*
> > > + * If the task affinity is not set to default, make sure it is not
> > > + * restricted to a subset where no CPU can ever fit it. Triggering
> > > + * misfit in this case is pointless as it has no where better to move
> > > + * to. And it can lead to balance_interval to grow too high as we'll
> > > + * continuously fail to move it anywhere.
> > > + */
> > > + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
> > > + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> > > + bool has_fitting_cpu = false;
> > > + struct asym_cap_data *entry;
> > > +
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
> >
> > Do we really want to potentially do this loop at every pick_next task ?
>
> The common case should return quickly as the biggest CPU should be present
> in every task by default. And after sorting the biggest CPU will be the first
> entry and we should return after one check.
>
> Could we move the update to another less expensive location instead?
TBH, I don't know. I would need time to think about this...
Maybe when we set the new affinity of the task.
>
> We could try to do better tracking for CPUs that has their affinity changed,
> but I am not keen on sprinkling more complexity else where to deal with this.
>
> We could keep the status quouo and just prevent the misfit load balancing from
> increment nr_failed similar to newidle_balance too. I think this should have
One main advantage is that we put the complexity out of the fast path
> a similar effect. Not ideal but if this is considered too expensive still
> I can't think of other options that don't look ugly to me FWIW.
>
>
> Thanks
>
> --
> Qais Yousef
On 01/25/24 18:44, Vincent Guittot wrote:
> On Wed, 24 Jan 2024 at 23:46, Qais Yousef <[email protected]> wrote:
> >
> > On 01/23/24 09:32, Vincent Guittot wrote:
> >
> > > > > @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> > > > > */
> > > > > static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> > > > > {
> > > > > - return rq->misfit_task_load &&
> > > > > - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> > > > > - check_cpu_capacity(rq, sd));
> > > > > + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
>
> Coming back to this:
> With your change above, misfit can't kick an idle load balance and
> must wait for the cpu capacity being noticeably reduced by something
> else
Good catch, yes. It's a subtle change. We need to keep this and should add
a comment that we migrate immediately for all CPUs except the biggest one, where
we need check_cpu_capacity().
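i.e. keep the original condition and just document it; a sketch of what I have
in mind:

static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
{
	/*
	 * A misfit task on anything but the biggest CPU can be moved up
	 * immediately. On the biggest CPU it is only worth kicking an ilb
	 * if our capacity is noticeably reduced (RT/IRQ/thermal pressure).
	 */
	return rq->misfit_task_load &&
	       (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
		check_cpu_capacity(rq, sd));
}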
On 01/25/24 10:35, Dietmar Eggemann wrote:
> On 24/01/2024 22:43, Qais Yousef wrote:
> > On 01/23/24 18:07, Dietmar Eggemann wrote:
> >> On 22/01/2024 19:02, Qais Yousef wrote:
> >>> On 01/22/24 09:59, Dietmar Eggemann wrote:
> >>>> On 05/01/2024 23:20, Qais Yousef wrote:
> >>>>> From: Qais Yousef <[email protected]>
>
> [...]
>
> >>>> What happen when we hotplug out all CPUs of one CPU capacity value?
> >>>> IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
> >>>> (partition_sched_domains_locked()).
> >>>
> >>> Right. I missed that. We can add another intersection check against
> >>> cpu_active_mask.
> >>>
> >>> I am assuming the skipping was done by design, not a bug that needs fixing?
> >>> I see for suspend (cpuhp_tasks_frozen) the domains are rebuilt, but not for
> >>> hotplug.
> >>
> >> IMHO, it's by design. We setup asym_cap_list only when new_topology is
> >> set (update_topology_flags_workfn() from init_cpu_capacity_callback() or
> >> topology_init_cpu_capacity_cppc()). I.e. when the (max) CPU capacity can
> >> change.
> >> In all the other !new_topology cases we check `has_asym |= sd->flags &
> >> SD_ASYM_CPUCAPACITY` and set sched_asym_cpucapacity accordingly in
> >> build_sched_domains(). Before we always reset sched_asym_cpucapacity in
> >> detach_destroy_domains().
> >> But now we would have to keep asym_cap_list in sync with the active CPUs
> >> I guess.
> >
> > Okay, so you suggest we need to update the code to keep it in sync. Let's see
> > first if Vincent is satisfied with this list traversal or we need to go another
> > way :-)
>
> Yes, if preventing the 'increase of balance_interval' will cure this
> issue as well, then this will be definitely the less invasive fix.
>
> Can you not easily do a 'perf bench sched messaging -g X -l Y' test on
> you M1 to get some numbers behind this additional list traversal in
> pick_next_task_fair()?
I can do that. But I noticed there are sometimes unexplainable variations in
numbers when moving between kernels, and I am not 100% sure whether they are due
to random unrelated changes in caching behavior or due to something I've done;
i.e. they get better or worse in unexpected ways. I run `perf bench sched pipe`,
which is similar enough I guess? I don't know how to fill in these -g and -l
numbers sensibly.
I had issues when running perf to collect stats. But maybe I wasn't specifying
the right options. I will try again.
>
> > I think it is worth having this asym_capacity list available. It seemed several
> > times we needed it and just a little work is required to make it available for
> > potential future users. Even if we don't merge immediately.
>
> I agree. It would give us this ordered (by max CPU capacity) list of
> CPUs to iterate over.
Okay. I need to figure out how to fix this hotplug issue to keep the list in
sync.
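For reference, the interim cpu_active_mask intersection check mentioned earlier
could look roughly like the below inside the affinity block of
update_misfit_status(). This is only a sketch; the loop body is guessed from the
quoted hunk (clamped_util/has_fitting_cpu as in the patch) and relies on the
list being sorted descendingly:

	rcu_read_lock();
	list_for_each_entry_rcu(entry, &asym_cap_list, link) {
		/* Sorted descendingly: nothing further down will fit. */
		if (entry->capacity <= clamped_util)
			break;
		/* Skip capacity groups that are fully hotplugged out. */
		if (!cpumask_intersects(cpu_capacity_span(entry), cpu_active_mask))
			continue;
		if (cpumask_intersects(p->cpus_ptr, cpu_capacity_span(entry))) {
			has_fitting_cpu = true;
			break;
		}
	}
	rcu_read_unlock();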
Thanks
--
Qais Yousef
On 01/25/24 18:40, Vincent Guittot wrote:
> On Wed, 24 Jan 2024 at 23:30, Qais Yousef <[email protected]> wrote:
> >
> > On 01/23/24 09:26, Vincent Guittot wrote:
> > > On Fri, 5 Jan 2024 at 23:20, Qais Yousef <[email protected]> wrote:
> > > >
> > > > From: Qais Yousef <[email protected]>
> > > >
> > > > If a misfit task is affined to a subset of the possible cpus, we need to
> > > > verify that one of these cpus can fit it. Otherwise the load balancer
> > > > code will continuously trigger needlessly leading the balance_interval
> > > > to increase in return and eventually end up with a situation where real
> > > > imbalances take a long time to address because of this impossible
> > > > imbalance situation.
> > >
> > > If your problem is about increasing balance_interval, it would be
> > > better to not increase the interval is such case.
> > > I mean that we are able to detect misfit_task conditions for the
> > > periodic load balance so we should be able to not increase the
> > > interval in such cases.
> > >
> > > If I'm not wrong, your problem only happens when the system is
> > > overutilized and we have disable EAS
> >
> > Yes and no. There are two concerns here:
> >
> > 1.
> >
> > So this patch is a generalized form of 0ae78eec8aa6 ("sched/eas: Don't update
> > misfit status if the task is pinned") which is when I originally noticed the
> > problem and this patch was written along side it.
> >
> > We have unlinked misfit from overutilized since then.
> >
> > And to be honest I am not sure if flattening of topology matters too since
> > I first noticed this, which was on Juno which doesn't have flat topology.
> >
> > FWIW I can still reproduce this, but I have a different setup now. On M1 mac
> > mini if I spawn a busy task affined to littles then expand the mask for
> > a single big core; I see big delays (>500ms) without the patch. But with the
> > patch it moves in few ms. The delay without the patch is too large and I can't
> > explain it. So the worry here is that generally misfit migration not happening
> > fast enough due to this fake misfit cases.
>
> I tried a similar scenario on RB5 but I don't see any difference with
> your patch. And that could be me not testing it correctly...
>
> I set the affinity of always running task to cpu[0-3] for a few
> seconds then extend it to [0-3,7] and the time to migrate is almost
> the same.
That matches what I do.
I write a trace_marker when I change affinity to help see when it should move.
>
> I'm using tip/sched/core + [0]
>
> [0] https://lore.kernel.org/all/[email protected]/
I tried on a pinebook pro, which has an rk3399, and I can't reproduce it there
either.
On the M1 I get two sched domains, MC and DIE. But the pine64 has only MC. Could
this be the difference, as load balance has sched-domain dependencies?
It seems we flatten topologies but not sched domains. I see all cpus shown as
core_siblings. The DT for apple silicon sets clusters in the cpu-map, and it
seems the flat topology code detects the LLC correctly but still leaves the
sched domains unflattened. Is this a bug? I thought we would still end up with
one sched domain.
TBH I had a bit of confirmation bias that this is a problem based on the fix
(0ae78eec8aa6) that we had in the past. So for verification I looked at
balance_interval with this reproducer, which is not the same as the original one
and might be exposing another problem, and I didn't think twice about it.
The patch did help though. So maybe there is more than one problem. The delays
are longer than I expected, as I tried to highlight. I'll continue to probe.
On 01/25/24 18:50, Vincent Guittot wrote:
> On Wed, 24 Jan 2024 at 23:38, Qais Yousef <[email protected]> wrote:
> >
> > On 01/23/24 18:22, Vincent Guittot wrote:
> >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index bcea3d55d95d..0830ceb7ca07 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -5065,17 +5065,61 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> > > >
> > > > static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > > > {
> > > > + unsigned long uclamp_min, uclamp_max;
> > > > + unsigned long util, cpu_cap;
> > > > + int cpu = cpu_of(rq);
> > > > +
> > > > if (!sched_asym_cpucap_active())
> > > > return;
> > > >
> > > > - if (!p || p->nr_cpus_allowed == 1) {
> > > > - rq->misfit_task_load = 0;
> > > > - return;
> > > > - }
> > > > + if (!p || p->nr_cpus_allowed == 1)
> > > > + goto out;
> > > >
> > > > - if (task_fits_cpu(p, cpu_of(rq))) {
> > > > - rq->misfit_task_load = 0;
> > > > - return;
> > > > + cpu_cap = arch_scale_cpu_capacity(cpu);
> > > > +
> > > > + /* If we can't fit the biggest CPU, that's the best we can ever get. */
> > > > + if (cpu_cap == SCHED_CAPACITY_SCALE)
> > > > + goto out;
> > > > +
> > > > + uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> > > > + uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> > > > + util = task_util_est(p);
> > > > +
> > > > + if (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0)
> > > > + goto out;
> > > > +
> > > > + /*
> > > > + * If the task affinity is not set to default, make sure it is not
> > > > + * restricted to a subset where no CPU can ever fit it. Triggering
> > > > + * misfit in this case is pointless as it has no where better to move
> > > > + * to. And it can lead to balance_interval to grow too high as we'll
> > > > + * continuously fail to move it anywhere.
> > > > + */
> > > > + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
> > > > + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> > > > + bool has_fitting_cpu = false;
> > > > + struct asym_cap_data *entry;
> > > > +
> > > > + rcu_read_lock();
> > > > + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
> > >
> > > Do we really want to potentially do this loop at every pick_next task ?
> >
> > The common case should return quickly as the biggest CPU should be present
> > in every task by default. And after sorting the biggest CPU will be the first
> > entry and we should return after one check.
> >
> > Could we move the update to another less expensive location instead?
>
> TBH, I don't know. I would need time to think about this...
> May be when we set the new affinity of the task
I was thinking of actually calling update_misfit_status() from another less
expensive location.
We can certainly do something to make the check less expensive if we must do it
in pick_next_task(). For example, set a flag if the task is restricted to a
single capacity value, and store the highest capacity its affinity mask spans.
But with cpuset v1, v2 and hotplug I am wary that this might get messy.
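Something like the below in update_misfit_status() is what I have in mind; just
a sketch, where p->max_allowed_capacity is a hypothetical field that would be
refreshed whenever the affinity changes:

	/*
	 * Hypothetical fast path: max_allowed_capacity caches the biggest
	 * CPU capacity the task's affinity spans, so we don't need to walk
	 * asym_cap_list at every pick.
	 */
	if (clamp(util, uclamp_min, uclamp_max) > p->max_allowed_capacity)
		goto out;	/* no allowed CPU can ever fit the task */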
>
> >
> > We could try to do better tracking for CPUs that has their affinity changed,
> > but I am not keen on sprinkling more complexity else where to deal with this.
> >
> > We could keep the status quouo and just prevent the misfit load balancing from
> > increment nr_failed similar to newidle_balance too. I think this should have
>
> One main advantage is that we put the complexity out of the fast path
How about doing it in update_load_avg()? After all, it's the util that decides
whether we become misfit. So it makes sense to do the check when we update the
util for the task.
Which reminds me of another bug: we need to call update_misfit_status() when
uclamp values change too.
>
> > a similar effect. Not ideal but if this is considered too expensive still
> > I can't think of other options that don't look ugly to me FWIW.
> >
> >
> > Thanks
> >
> > --
> > Qais Yousef
On Fri, 26 Jan 2024 at 02:46, Qais Yousef <[email protected]> wrote:
>
> On 01/25/24 18:40, Vincent Guittot wrote:
> > On Wed, 24 Jan 2024 at 23:30, Qais Yousef <[email protected]> wrote:
> > >
> > > On 01/23/24 09:26, Vincent Guittot wrote:
> > > > On Fri, 5 Jan 2024 at 23:20, Qais Yousef <[email protected]> wrote:
> > > > >
> > > > > From: Qais Yousef <[email protected]>
> > > > >
> > > > > If a misfit task is affined to a subset of the possible cpus, we need to
> > > > > verify that one of these cpus can fit it. Otherwise the load balancer
> > > > > code will continuously trigger needlessly leading the balance_interval
> > > > > to increase in return and eventually end up with a situation where real
> > > > > imbalances take a long time to address because of this impossible
> > > > > imbalance situation.
> > > >
> > > > If your problem is about increasing balance_interval, it would be
> > > > better to not increase the interval is such case.
> > > > I mean that we are able to detect misfit_task conditions for the
> > > > periodic load balance so we should be able to not increase the
> > > > interval in such cases.
> > > >
> > > > If I'm not wrong, your problem only happens when the system is
> > > > overutilized and we have disable EAS
> > >
> > > Yes and no. There are two concerns here:
> > >
> > > 1.
> > >
> > > So this patch is a generalized form of 0ae78eec8aa6 ("sched/eas: Don't update
> > > misfit status if the task is pinned") which is when I originally noticed the
> > > problem and this patch was written along side it.
> > >
> > > We have unlinked misfit from overutilized since then.
> > >
> > > And to be honest I am not sure if flattening of topology matters too since
> > > I first noticed this, which was on Juno which doesn't have flat topology.
> > >
> > > FWIW I can still reproduce this, but I have a different setup now. On M1 mac
> > > mini if I spawn a busy task affined to littles then expand the mask for
> > > a single big core; I see big delays (>500ms) without the patch. But with the
> > > patch it moves in few ms. The delay without the patch is too large and I can't
> > > explain it. So the worry here is that generally misfit migration not happening
> > > fast enough due to this fake misfit cases.
> >
> > I tried a similar scenario on RB5 but I don't see any difference with
> > your patch. And that could be me not testing it correctly...
> >
> > I set the affinity of always running task to cpu[0-3] for a few
> > seconds then extend it to [0-3,7] and the time to migrate is almost
> > the same.
>
> That matches what I do.
>
> I write a trace_marker when I change affinity to help see when it should move.
same for me
>
> >
> > I'm using tip/sched/core + [0]
> >
> > [0] https://lore.kernel.org/all/[email protected]/
>
> I tried on pinebook pro which has a rk3399 and I can't reproduce there too.
>
> On the M1 I get two sched domains, MC and DIE. But on the pine64 it has only
> MC. Could this be the difference as lb has sched domains dependencies?
>
> It seems we flatten topologies but not sched domains. I see all cpus shown as
> core_siblings. The DT for apple silicon sets clusters in the cpu-map - which
> seems the flatten topology stuff detect LLC correctly but still keeps the
> sched-domains not flattened. Is this a bug? I thought we will end up with one
> sched domain still.
>
> TBH I had a bit of confirmation bias that this is a problem based on the fix
> (0ae78eec8aa6) that we had in the past. So on verification I looked at
> balance_interval and this reproducer which is a not the same as the original
> one and it might be exposing another problem and I didn't think twice about it.
I checked the behavior more deeply and I confirm that I don't see
improvement for the use case described above. I would say that it's
even worse as I can see some runs where the task stays on little
whereas a big core has been added in the affinity. Having in mind that
my system is pretty idle which means that there is almost no other
reason to trigger an ilb than the misfit task, the change in
check_misfit_status() is probably the reason for never kicking an ilb
for such case
>
> The patch did help though. So maybe there are more than one problem. The delays
> are longer than I expected as I tried to highlight. I'll continue to probe.
On Fri, 26 Jan 2024 at 03:07, Qais Yousef <[email protected]> wrote:
>
> On 01/25/24 18:50, Vincent Guittot wrote:
> > On Wed, 24 Jan 2024 at 23:38, Qais Yousef <[email protected]> wrote:
> > >
> > > On 01/23/24 18:22, Vincent Guittot wrote:
> > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index bcea3d55d95d..0830ceb7ca07 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -5065,17 +5065,61 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> > > > >
> > > > > static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > > > > {
> > > > > + unsigned long uclamp_min, uclamp_max;
> > > > > + unsigned long util, cpu_cap;
> > > > > + int cpu = cpu_of(rq);
> > > > > +
> > > > > if (!sched_asym_cpucap_active())
> > > > > return;
> > > > >
> > > > > - if (!p || p->nr_cpus_allowed == 1) {
> > > > > - rq->misfit_task_load = 0;
> > > > > - return;
> > > > > - }
> > > > > + if (!p || p->nr_cpus_allowed == 1)
> > > > > + goto out;
> > > > >
> > > > > - if (task_fits_cpu(p, cpu_of(rq))) {
> > > > > - rq->misfit_task_load = 0;
> > > > > - return;
> > > > > + cpu_cap = arch_scale_cpu_capacity(cpu);
> > > > > +
> > > > > + /* If we can't fit the biggest CPU, that's the best we can ever get. */
> > > > > + if (cpu_cap == SCHED_CAPACITY_SCALE)
> > > > > + goto out;
> > > > > +
> > > > > + uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> > > > > + uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> > > > > + util = task_util_est(p);
> > > > > +
> > > > > + if (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0)
> > > > > + goto out;
> > > > > +
> > > > > + /*
> > > > > + * If the task affinity is not set to default, make sure it is not
> > > > > + * restricted to a subset where no CPU can ever fit it. Triggering
> > > > > + * misfit in this case is pointless as it has no where better to move
> > > > > + * to. And it can lead to balance_interval to grow too high as we'll
> > > > > + * continuously fail to move it anywhere.
> > > > > + */
> > > > > + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
> > > > > + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> > > > > + bool has_fitting_cpu = false;
> > > > > + struct asym_cap_data *entry;
> > > > > +
> > > > > + rcu_read_lock();
> > > > > + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
> > > >
> > > > Do we really want to potentially do this loop at every pick_next task ?
> > >
> > > The common case should return quickly as the biggest CPU should be present
> > > in every task by default. And after sorting the biggest CPU will be the first
> > > entry and we should return after one check.
> > >
> > > Could we move the update to another less expensive location instead?
> >
> > TBH, I don't know. I would need time to think about this...
> > May be when we set the new affinity of the task
>
> I was thinking to actually call update_misfit_status() from another less
> expensive location.
>
> We can certainly do something to help the check less expensive if we must do it
> in pick_next_task(). For example set a flag if the task belongs to a single
> capacity value; and store the highest capacity its affinity belongs too. But
> with cpuset v1, v2 and hotplug I am wary that might get messy.
I think it is worth looking at such a solution as this would mean parsing
the possible max capacity for the task only once per affinity change.
>
> >
> > >
> > > We could try to do better tracking for CPUs that has their affinity changed,
> > > but I am not keen on sprinkling more complexity else where to deal with this.
> > >
> > > We could keep the status quouo and just prevent the misfit load balancing from
> > > increment nr_failed similar to newidle_balance too. I think this should have
> >
> > One main advantage is that we put the complexity out of the fast path
>
> How about when we update_load_avg()? After all it's the util the decides if we
> become misfit. So it makes sense to do the check when we update the util for
> the task.
>
> Which reminds me of another bug. We need to call update_misfit_status() when
> uclamp values change too.
>
> >
> > > a similar effect. Not ideal but if this is considered too expensive still
> > > I can't think of other options that don't look ugly to me FWIW.
> > >
> > >
> > > Thanks
> > >
> > > --
> > > Qais Yousef
On 01/26/24 15:15, Vincent Guittot wrote:
> > > TBH, I don't know. I would need time to think about this...
> > > May be when we set the new affinity of the task
> >
> > I was thinking to actually call update_misfit_status() from another less
> > expensive location.
> >
> > We can certainly do something to help the check less expensive if we must do it
> > in pick_next_task(). For example set a flag if the task belongs to a single
> > capacity value; and store the highest capacity its affinity belongs too. But
> > with cpuset v1, v2 and hotplug I am wary that might get messy.
>
> I think it worth looking at such solution as this would mean parsing
> the possible max capacity for the task only once per affinity change
Okay. It might not be that bad; we just need to do the parsing when we update
cpus_ptr, which seems to happen only in set_cpus_allowed_common(). I think I can
create a wrapper for fair where we call set_cpus_allowed_common() and then do
the checks to discover the max_allowed_capacity and whether the new affinity is
asymmetric or not.
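Roughly like the below; a sketch only, where set_task_max_allowed_capacity() and
p->max_allowed_capacity are made-up names, and the wrapper would be wired up via
fair_sched_class.set_cpus_allowed:

/* Cache the biggest CPU capacity the new affinity spans. */
static void set_task_max_allowed_capacity(struct task_struct *p)
{
	unsigned long max_cap = 0;
	int cpu;

	for_each_cpu(cpu, p->cpus_ptr)
		max_cap = max(max_cap, arch_scale_cpu_capacity(cpu));

	p->max_allowed_capacity = max_cap;
}

static void set_cpus_allowed_fair(struct task_struct *p, struct affinity_context *ctx)
{
	set_cpus_allowed_common(p, ctx);
	set_task_max_allowed_capacity(p);
}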
Cheers
--
Qais Yousef
On 01/26/24 15:08, Vincent Guittot wrote:
> > TBH I had a bit of confirmation bias that this is a problem based on the fix
> > (0ae78eec8aa6) that we had in the past. So on verification I looked at
> > balance_interval and this reproducer which is a not the same as the original
> > one and it might be exposing another problem and I didn't think twice about it.
>
> I checked the behavior more deeply and I confirm that I don't see
> improvement for the use case described above. I would say that it's
> even worse as I can see some runs where the task stays on little
> whereas a big core has been added in the affinity. Having in mind that
> my system is pretty idle which means that there is almost no other
> reason to trigger an ilb than the misfit task, the change in
> check_misfit_status() is probably the reason for never kicking an ilb
> for such case
It seems I reproduced another problem while trying to reproduce the original
issue, eh.
I dug more, and from what I see the issue is that rd->overload is not being set
correctly, which I believe is what causes the delays (see the attached picture
showing how rd.overloaded is 0 with some spikes). The migration happens only when
CPU7's newidle_balance() coincides with rd->overload being 1. With the below hack
I can see that rd->overload is 1 all the time (even after the move, as we still
trigger a misfit on the big CPU). With my patch, rd->overload is set to 1
(because of this task) only for a short period after we change affinity.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df348aa55d3c..86069fe527f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9707,8 +9707,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
}
- if (local_group)
- continue;
+ /* if (local_group) */
+ /* continue; */
if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
/* Check for a misfit task on the cpu */
I am not sure what the right fix is, but it seems this local_group check is only
required for the 2nd leg of the if condition, where we compare with load? I don't
think we should skip the misfit check.
Thanks
--
Qais Yousef
On 01/28/24 23:50, Qais Yousef wrote:
> On 01/26/24 15:08, Vincent Guittot wrote:
>
> > > TBH I had a bit of confirmation bias that this is a problem based on the fix
> > > (0ae78eec8aa6) that we had in the past. So on verification I looked at
> > > balance_interval and this reproducer which is a not the same as the original
> > > one and it might be exposing another problem and I didn't think twice about it.
> >
> > I checked the behavior more deeply and I confirm that I don't see
> > improvement for the use case described above. I would say that it's
> > even worse as I can see some runs where the task stays on little
> > whereas a big core has been added in the affinity. Having in mind that
> > my system is pretty idle which means that there is almost no other
> > reason to trigger an ilb than the misfit task, the change in
> > check_misfit_status() is probably the reason for never kicking an ilb
> > for such case
>
> It seems I reproduced another problem while trying to reproduce the original
> issue, eh.
>
> I did dig more and from what I see the issue is that the rd->overload is not
> being set correctly. Which I believe what causes the delays (see attached
> picture how rd.overloaded is 0 with some spikes). Only when CPU7
> newidle_balance() coincided with rd->overload being 1 that the migration
> happens. With the below hack I can see that rd->overload is 1 all the time
> (even after the move as we still trigger a misfit on the big CPU). With my
> patch only rd->overload is set to 1 (because of this task) only for a short
> period after we change affinity.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df348aa55d3c..86069fe527f9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9707,8 +9707,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
> }
>
> - if (local_group)
> - continue;
> + /* if (local_group) */
> + /* continue; */
>
> if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> /* Check for a misfit task on the cpu */
>
> I am not sure what the right fix is, but it seems this condition is required
> for the 2nd leg of this if condition when we compare with load? I don't think
> we should skip the misfit check.
I'm still not sure I get the original intent of why we skip for local_group. We
need to set sg_status, which operates at the root domain level, to enable a cpu
to pull a misfit task.
AFAICS newidle_balance() will return without doing anything if rd->overload is
not set. So making sure we always update this flag, and for both legs, is
necessary IIUC.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df348aa55d3c..bd2f402eac41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9707,9 +9707,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
}
- if (local_group)
- continue;
-
if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
/* Check for a misfit task on the cpu */
if (sgs->group_misfit_task_load < rq->misfit_task_load) {
@@ -9719,8 +9716,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
} else if ((env->idle != CPU_NOT_IDLE) &&
sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
- if (sgs->group_misfit_task_load < load)
+ if (sgs->group_misfit_task_load < load) {
sgs->group_misfit_task_load = load;
+ *sg_status |= SG_OVERLOAD;
+ }
}
}
I was wondering why we never pull at TICK/rebalance_domains(), where no such
check is made. But when newidle_balance() returns early it calls
update_next_balance(), which adds balance_interval (already long) to the next
balance point. So we end up delaying things further, thinking we've 'attempted'
a load balance and it wasn't necessary - but in reality we failed to see the
misfit task, and we don't allow rebalance_domains() to see it either because we
keep pushing the next balance point forward.
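For reference, this is the early-return path I mean; roughly (from memory of
a ~v6.6 newidle_balance(), slightly trimmed):

	if (!READ_ONCE(this_rq->rd->overload) ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

		if (sd)
			update_next_balance(sd, &next_balance);
		rcu_read_unlock();

		goto out;
	}

If I read it right, update_next_balance() computes sd->last_balance + interval,
and with balance_interval already inflated that point lands far in the future,
so rq->next_balance doesn't get pulled any closer.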
On Mon, 29 Jan 2024 at 00:50, Qais Yousef <[email protected]> wrote:
>
> On 01/26/24 15:08, Vincent Guittot wrote:
>
> > > TBH I had a bit of confirmation bias that this is a problem based on the fix
> > > (0ae78eec8aa6) that we had in the past. So on verification I looked at
> > > balance_interval and this reproducer which is a not the same as the original
> > > one and it might be exposing another problem and I didn't think twice about it.
> >
> > I checked the behavior more deeply and I confirm that I don't see
> > improvement for the use case described above. I would say that it's
> > even worse as I can see some runs where the task stays on little
> > whereas a big core has been added in the affinity. Having in mind that
> > my system is pretty idle which means that there is almost no other
> > reason to trigger an ilb than the misfit task, the change in
> > check_misfit_status() is probably the reason for never kicking an ilb
> > for such case
>
> It seems I reproduced another problem while trying to reproduce the original
> issue, eh.
>
> I did dig more and from what I see the issue is that the rd->overload is not
> being set correctly. Which I believe what causes the delays (see attached
> picture how rd.overloaded is 0 with some spikes). Only when CPU7
> newidle_balance() coincided with rd->overload being 1 that the migration
> happens. With the below hack I can see that rd->overload is 1 all the time
But here you rely on other activity happening on CPU7, whereas the misfit
should trigger the load balance by itself and not expect another task waking up
and then sleeping on cpu7 to trigger a newidle balance. We want a normal idle
load balance, not a newidle_balance.
> (even after the move as we still trigger a misfit on the big CPU). With my
> patch only rd->overload is set to 1 (because of this task) only for a short
> period after we change affinity.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df348aa55d3c..86069fe527f9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9707,8 +9707,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
> }
>
> - if (local_group)
> - continue;
> + /* if (local_group) */
> + /* continue; */
>
> if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> /* Check for a misfit task on the cpu */
>
> I am not sure what the right fix is, but it seems this condition is required
> for the 2nd leg of this if condition when we compare with load? I don't think
> we should skip the misfit check.
>
>
> Thanks
>
> --
> Qais Yousef
On 01/30/24 10:41, Vincent Guittot wrote:
> On Mon, 29 Jan 2024 at 00:50, Qais Yousef <[email protected]> wrote:
> >
> > On 01/26/24 15:08, Vincent Guittot wrote:
> >
> > > > TBH I had a bit of confirmation bias that this is a problem based on the fix
> > > > (0ae78eec8aa6) that we had in the past. So on verification I looked at
> > > > balance_interval and this reproducer which is a not the same as the original
> > > > one and it might be exposing another problem and I didn't think twice about it.
> > >
> > > I checked the behavior more deeply and I confirm that I don't see
> > > improvement for the use case described above. I would say that it's
> > > even worse as I can see some runs where the task stays on little
> > > whereas a big core has been added in the affinity. Having in mind that
> > > my system is pretty idle which means that there is almost no other
> > > reason to trigger an ilb than the misfit task, the change in
> > > check_misfit_status() is probably the reason for never kicking an ilb
> > > for such case
> >
> > It seems I reproduced another problem while trying to reproduce the original
> > issue, eh.
> >
> > I did dig more and from what I see the issue is that the rd->overload is not
> > being set correctly. Which I believe what causes the delays (see attached
> > picture how rd.overloaded is 0 with some spikes). Only when CPU7
> > newidle_balance() coincided with rd->overload being 1 that the migration
> > happens. With the below hack I can see that rd->overload is 1 all the time
>
> But here you rely on another activity happening in CPU7 whereas the
I don't want to rely on that. I think this is a problem too. And this is what
ends up happening from what I see, sometimes at least.
When is it expected for newidle_balance to pull anyway? I agree we shouldn't
rely on it to randomly happen, but if it happens sooner, it should pull, no?
> misfit should trigger by itself the load balance and not expect
> another task waking up then sleeping on cpu7 to trigger a newidle
> balance. We want a normal idle load balance not a newidle_balance
I think there's a terminology problem. I thought you meant newidle_balance() by
ilb. It seems you're referring to load_balance() called from
rebalance_domains() when the tick happens at idle?
I thought this was not kicking in. But I just double-checked my traces and I was
getting confused because I was looking at where run_rebalance_domains() would
happen, for example on CPU2, while the balance would actually be for CPU7.
No clue why it fails to pull. I can actually see we call load_balance() twice
for some (not all) entries into rebalance_domains(). So we don't always operate
on both domains. But that's not necessarily a problem.
I think it's a good opportunity to add some tracepoints to help break this path
down. If you have suggestions of things to record that'd be helpful.
On Wed, 31 Jan 2024 at 00:57, Qais Yousef <[email protected]> wrote:
>
> On 01/30/24 10:41, Vincent Guittot wrote:
> > On Mon, 29 Jan 2024 at 00:50, Qais Yousef <[email protected]> wrote:
> > >
> > > On 01/26/24 15:08, Vincent Guittot wrote:
> > >
> > > > > TBH I had a bit of confirmation bias that this is a problem based on the fix
> > > > > (0ae78eec8aa6) that we had in the past. So on verification I looked at
> > > > > balance_interval and this reproducer which is a not the same as the original
> > > > > one and it might be exposing another problem and I didn't think twice about it.
> > > >
> > > > I checked the behavior more deeply and I confirm that I don't see
> > > > improvement for the use case described above. I would say that it's
> > > > even worse as I can see some runs where the task stays on little
> > > > whereas a big core has been added in the affinity. Having in mind that
> > > > my system is pretty idle which means that there is almost no other
> > > > reason to trigger an ilb than the misfit task, the change in
> > > > check_misfit_status() is probably the reason for never kicking an ilb
> > > > for such case
> > >
> > > It seems I reproduced another problem while trying to reproduce the original
> > > issue, eh.
> > >
> > > I did dig more and from what I see the issue is that the rd->overload is not
> > > being set correctly. Which I believe what causes the delays (see attached
> > > picture how rd.overloaded is 0 with some spikes). Only when CPU7
> > > newidle_balance() coincided with rd->overload being 1 that the migration
> > > happens. With the below hack I can see that rd->overload is 1 all the time
> >
> > But here you rely on another activity happening in CPU7 whereas the
>
> I don't want to rely on that. I think this is a problem too. And this is what
> ends up happening from what I see, sometimes at least.
>
> When is it expected for newidle_balance to pull anyway? I agree we shouldn't
> rely on it to randomly happen, but if it happens sooner, it should pull, no?
>
> > misfit should trigger by itself the load balance and not expect
> > another task waking up then sleeping on cpu7 to trigger a newidle
> > balance. We want a normal idle load balance not a newidle_balance
>
> I think there's a terminology problems. I thought you mean newidle_balnce() by
> ilb. It seems you're referring to load_balance() called from
> rebalance_domains() when tick happens at idle?
newidle_balance is different from idle load balance. newidle_balance
happens when the cpu becomes idle whereas busy and idle load balance
happen at tick.
>
> I thought this is not kicking. But I just double checked in my traces and I was
> getting confused because I was looking at where run_rebalance_domains() would
> happen, for example, on CPU2 but the balance would actually be for CPU7.
An idle load balance happens either on the target CPU if its tick is
not stopped or we kick one idle CPU to run the idle load balance on
behalf of all idle CPUs. This is the latter case that doesn't happen
anymore with your patch and the change in check_misfit_status.
>
> No clue why it fails to pull.. I can see actually we call load_balance() twice
> for some (not all) entries to rebalance_domains(). So we don't always operate
> on the two domains. But that's not necessarily a problem.
We have 3 different reasons for kicking an idle load balance :
- to do an actual balance of tasks
- to update stats ie blocked load
- to update nohz.next_balance
You are interested in the 1st one, but it's most probably for the last 2
reasons that this happens.
>
> I think it's a good opportunity to add some tracepoints to help break this path
> down. If you have suggestions of things to record that'd be helpful.
On 01/31/24 14:55, Vincent Guittot wrote:
> On Wed, 31 Jan 2024 at 00:57, Qais Yousef <[email protected]> wrote:
> >
> > On 01/30/24 10:41, Vincent Guittot wrote:
> > > On Mon, 29 Jan 2024 at 00:50, Qais Yousef <[email protected]> wrote:
> > > >
> > > > On 01/26/24 15:08, Vincent Guittot wrote:
> > > >
> > > > > > TBH I had a bit of confirmation bias that this is a problem based on the fix
> > > > > > (0ae78eec8aa6) that we had in the past. So on verification I looked at
> > > > > > balance_interval and this reproducer which is a not the same as the original
> > > > > > one and it might be exposing another problem and I didn't think twice about it.
> > > > >
> > > > > I checked the behavior more deeply and I confirm that I don't see
> > > > > improvement for the use case described above. I would say that it's
> > > > > even worse as I can see some runs where the task stays on little
> > > > > whereas a big core has been added in the affinity. Having in mind that
> > > > > my system is pretty idle which means that there is almost no other
> > > > > reason to trigger an ilb than the misfit task, the change in
> > > > > check_misfit_status() is probably the reason for never kicking an ilb
> > > > > for such case
> > > >
> > > > It seems I reproduced another problem while trying to reproduce the original
> > > > issue, eh.
> > > >
> > > > I did dig more and from what I see the issue is that the rd->overload is not
> > > > being set correctly. Which I believe is what causes the delays (see attached
> > > > picture of how rd.overloaded is 0 with some spikes). Only when CPU7's
> > > > newidle_balance() coincided with rd->overload being 1 did the migration
> > > > happen. With the below hack I can see that rd->overload is 1 all the time
> > >
> > > But here you rely on another activity happening in CPU7 whereas the
> >
> > I don't want to rely on that. I think this is a problem too. And this is what
> > ends up happening from what I see, sometimes at least.
> >
> > When is it expected for newidle_balance to pull anyway? I agree we shouldn't
> > rely on it to randomly happen, but if it happens sooner, it should pull, no?
> >
> > > misfit should trigger by itself the load balance and not expect
> > > another task waking up then sleeping on cpu7 to trigger a newidle
> > > balance. We want a normal idle load balance not a newidle_balance
> >
> > I think there's a terminology problem. I thought you meant newidle_balance() by
> > ilb. It seems you're referring to load_balance() called from
> > rebalance_domains() when the tick happens at idle?
>
> newidle_balance is different from idle load balance. newidle_balance
> happens when the cpu becomes idle whereas busy and idle load balance
> happen at tick.
Yes. newidle_balance() is not supposed to pull a misfit task then?
>
> >
> > I thought this is not kicking. But I just double checked in my traces and I was
> > getting confused because I was looking at where run_rebalance_domains() would
> > happen, for example, on CPU2 but the balance would actually be for CPU7.
>
> An idle load balance happens either on the target CPU if its tick is
> not stopped or we kick one idle CPU to run the idle load balance on
> behalf of all idle CPUs. This is the latter case that doesn't happen
> anymore with your patch and the change in check_misfit_status.
Yes. I just got confused while looking at the log. FWIW, I'm testing without my
patch. The kernel should be 6.6 from the Asahi folks, which should contain
nothing but the not-yet-fully-upstreamed bits necessary to make the machine run.
>
> >
> > No clue why it fails to pull... I can actually see that we call load_balance()
> > twice for some (not all) entries into rebalance_domains(). So we don't always
> > operate on the two domains. But that's not necessarily a problem.
>
> We have 3 different reasons for kicking an idle load balance:
> - to do an actual balance of tasks
> - to update stats, i.e. blocked load
> - to update nohz.next_balance
>
> You are interested in the 1st one, but it's most probably for the last 2
> reasons that this happens.
Okay. Thanks for the info. I need to figure out why the 1st one fails although
there's a misfit task to pull.
>
> >
> > I think it's a good opportunity to add some tracepoints to help break this path
> > down. If you have suggestions of things to record that'd be helpful.
On 26/01/2024 02:46, Qais Yousef wrote:
> On 01/25/24 18:40, Vincent Guittot wrote:
>> On Wed, 24 Jan 2024 at 23:30, Qais Yousef <[email protected]> wrote:
>>>
>>> On 01/23/24 09:26, Vincent Guittot wrote:
>>>> On Fri, 5 Jan 2024 at 23:20, Qais Yousef <[email protected]> wrote:
>>>>>
>>>>> From: Qais Yousef <[email protected]>
[...]
>>> And to be honest I am not sure if flattening of the topology matters either,
>>> since I first noticed this on Juno, which doesn't have a flat topology.
>>>
>>> FWIW I can still reproduce this, but I have a different setup now. On M1 mac
>>> mini, if I spawn a busy task affined to the littles then expand the mask to
>>> include a single big core, I see big delays (>500ms) without the patch. But
>>> with the patch it moves in a few ms. The delay without the patch is too large
>>> and I can't explain it. So the worry here is that, in general, misfit
>>> migration is not happening fast enough due to these fake misfit cases.
>>
>> I tried a similar scenario on RB5 but I don't see any difference with
>> your patch. And that could be me not testing it correctly...
>>
>> I set the affinity of always running task to cpu[0-3] for a few
>> seconds then extend it to [0-3,7] and the time to migrate is almost
>> the same.
>
> That matches what I do.
>
> I write a trace_marker when I change affinity to help see when it should move.
>
>>
>> I'm using tip/sched/core + [0]
>>
>> [0] https://lore.kernel.org/all/[email protected]/
>
> I tried on pinebook pro, which has a rk3399, and I can't reproduce there
> either.
>
> On the M1 I get two sched domains, MC and DIE. But the pine64 has only MC.
> Could this be the difference, as lb has sched domain dependencies?
>
> It seems we flatten topologies but not sched domains. I see all cpus shown as
> core_siblings. The DT for apple silicon sets clusters in the cpu-map - it
> seems the flattened topology code detects the LLC correctly but still keeps
> the sched domains not flattened. Is this a bug? I thought we would still end
> up with one sched domain.
IMHO, if you have a cpu_map entry with > 1 cluster in your dtb, you end
up with MC and PKG (former DIE) Sched Domain (SD) level. And misfit load
balance takes potentially longer on PKG than on MC.
(1) Vanilla Juno-r0 [L b b L L L]
root@juno:~# echo 1 > /sys/kernel/debug/sched/verbose
root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
MC
PKG
root@juno:~# cat /proc/schedstat | head -5 | grep ^[cd]
cpu0 0 0 0 0 0 0 2441100800 251426780 6694
domain0 39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 3f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
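In case it helps reading these, a throwaway user-space helper (not kernel code)
to decode the hex cpumask in the domainN lines - 39 is CPUs 0,3,4,5 (the LITTLE
span containing cpu0) and 3f is CPUs 0-5:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	/* e.g. ./decode 39  ->  cpus: 0 3 4 5 */
	unsigned long mask = strtoul(argc > 1 ? argv[1] : "39", NULL, 16);

	printf("cpus:");
	for (int cpu = 0; mask; cpu++, mask >>= 1)
		if (mask & 1)
			printf(" %d", cpu);
	printf("\n");
	return 0;
}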
(2) flattened topology (including SDs):
Remove cluster1 from cpu_map and A57_L2 $ entry.
--- a/arch/arm64/boot/dts/arm/juno.dts
+++ b/arch/arm64/boot/dts/arm/juno.dts
@@ -44,19 +44,16 @@ core0 {
core1 {
cpu = <&A57_1>;
};
- };
-
- cluster1 {
- core0 {
+ core2 {
cpu = <&A53_0>;
};
- core1 {
+ core3 {
cpu = <&A53_1>;
};
- core2 {
+ core4 {
cpu = <&A53_2>;
};
- core3 {
+ core5 {
cpu = <&A53_3>;
};
};
@@ -95,7 +92,7 @@ A57_0: cpu@0 {
d-cache-size = <0x8000>;
d-cache-line-size = <64>;
d-cache-sets = <256>;
- next-level-cache = <&A57_L2>;
+ next-level-cache = <&A53_L2>;
clocks = <&scpi_dvfs 0>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <1024>;
@@ -113,7 +110,7 @@ A57_1: cpu@1 {
d-cache-size = <0x8000>;
d-cache-line-size = <64>;
d-cache-sets = <256>;
- next-level-cache = <&A57_L2>;
+ next-level-cache = <&A53_L2>;
clocks = <&scpi_dvfs 0>;
cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
capacity-dmips-mhz = <1024>;
@@ -192,15 +189,6 @@ A53_3: cpu@103 {
dynamic-power-coefficient = <140>;
};
- A57_L2: l2-cache0 {
- compatible = "cache";
- cache-unified;
- cache-size = <0x200000>;
- cache-line-size = <64>;
- cache-sets = <2048>;
- cache-level = <2>;
- };
-
A53_L2: l2-cache1 {
compatible = "cache";
cache-unified;
root@juno:~# echo 1 > /sys/kernel/debug/sched/verbose
root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
MC
root@juno:~# cat /proc/schedstat | head -4 | grep ^[cd]
cpu0 0 0 0 0 0 0 2378087600 310618500 8152
domain0 3f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[...]
On 02/05/24 20:49, Dietmar Eggemann wrote:
> On 26/01/2024 02:46, Qais Yousef wrote:
> > On 01/25/24 18:40, Vincent Guittot wrote:
> >> On Wed, 24 Jan 2024 at 23:30, Qais Yousef <[email protected]> wrote:
> >>>
> >>> On 01/23/24 09:26, Vincent Guittot wrote:
> >>>> On Fri, 5 Jan 2024 at 23:20, Qais Yousef <[email protected]> wrote:
> >>>>>
> >>>>> From: Qais Yousef <[email protected]>
>
> [...]
>
> >>> And to be honest I am not sure if flattening of the topology matters either,
> >>> since I first noticed this on Juno, which doesn't have a flat topology.
> >>>
> >>> FWIW I can still reproduce this, but I have a different setup now. On M1 mac
> >>> mini, if I spawn a busy task affined to the littles then expand the mask to
> >>> include a single big core, I see big delays (>500ms) without the patch. But
> >>> with the patch it moves in a few ms. The delay without the patch is too large
> >>> and I can't explain it. So the worry here is that, in general, misfit
> >>> migration is not happening fast enough due to these fake misfit cases.
> >>
> >> I tried a similar scenario on RB5 but I don't see any difference with
> >> your patch. And that could be me not testing it correctly...
> >>
> >> I set the affinity of always running task to cpu[0-3] for a few
> >> seconds then extend it to [0-3,7] and the time to migrate is almost
> >> the same.
> >
> > That matches what I do.
> >
> > I write a trace_marker when I change affinity to help see when it should move.
> >
> >>
> >> I'm using tip/sched/core + [0]
> >>
> >> [0] https://lore.kernel.org/all/[email protected]/
> >
> > I tried on pinebook pro, which has a rk3399, and I can't reproduce there
> > either.
> >
> > On the M1 I get two sched domains, MC and DIE. But the pine64 has only MC.
> > Could this be the difference, as lb has sched domain dependencies?
> >
> > It seems we flatten topologies but not sched domains. I see all cpus shown as
> > core_siblings. The DT for apple silicon sets clusters in the cpu-map - it
> > seems the flattened topology code detects the LLC correctly but still keeps
> > the sched domains not flattened. Is this a bug? I thought we would still end
> > up with one sched domain.
>
> IMHO, if you have a cpu_map entry with > 1 cluster in your dtb, you end
> up with MC and PKG (former DIE) Sched Domain (SD) level. And misfit load
Hmm, okay. I thought the detection of topology where we know the LLC is shared
will cause the sched domains to collapse too.
> balance takes potentially longer on PKG than on MC.
Why potentially longer? We iterate through the domains the CPU belongs to. If
the first iteration (at MC) pulled something, then once we get to PKG we're
less likely to pull again?
Anyway. I think I am hitting a bug here. The behavior doesn't look right to me
given the delays I'm seeing and the fact we do the ilb but for some reason fail
to pull
>
> (1) Vanilla Juno-r0 [L b b L L L]
>
> root@juno:~# echo 1 > /sys/kernel/debug/sched/verbose
> root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> MC
> PKG
>
> root@juno:~# cat /proc/schedstat | head -5 | grep ^[cd]
> cpu0 0 0 0 0 0 0 2441100800 251426780 6694
> domain0 39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> domain1 3f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> (2) flattened topology (including SDs):
>
> Remove cluster1 from cpu_map and A57_L2 $ entry.
>
> --- a/arch/arm64/boot/dts/arm/juno.dts
> +++ b/arch/arm64/boot/dts/arm/juno.dts
> @@ -44,19 +44,16 @@ core0 {
> core1 {
> cpu = <&A57_1>;
> };
> - };
> -
> - cluster1 {
> - core0 {
> + core2 {
> cpu = <&A53_0>;
> };
> - core1 {
> + core3 {
> cpu = <&A53_1>;
> };
> - core2 {
> + core4 {
> cpu = <&A53_2>;
> };
> - core3 {
> + core5 {
> cpu = <&A53_3>;
> };
> };
> @@ -95,7 +92,7 @@ A57_0: cpu@0 {
> d-cache-size = <0x8000>;
> d-cache-line-size = <64>;
> d-cache-sets = <256>;
> - next-level-cache = <&A57_L2>;
> + next-level-cache = <&A53_L2>;
> clocks = <&scpi_dvfs 0>;
> cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
> capacity-dmips-mhz = <1024>;
> @@ -113,7 +110,7 @@ A57_1: cpu@1 {
> d-cache-size = <0x8000>;
> d-cache-line-size = <64>;
> d-cache-sets = <256>;
> - next-level-cache = <&A57_L2>;
> + next-level-cache = <&A53_L2>;
> clocks = <&scpi_dvfs 0>;
> cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>;
> capacity-dmips-mhz = <1024>;
> @@ -192,15 +189,6 @@ A53_3: cpu@103 {
> dynamic-power-coefficient = <140>;
> };
>
> - A57_L2: l2-cache0 {
> - compatible = "cache";
> - cache-unified;
> - cache-size = <0x200000>;
> - cache-line-size = <64>;
> - cache-sets = <2048>;
> - cache-level = <2>;
> - };
> -
> A53_L2: l2-cache1 {
> compatible = "cache";
> cache-unified;
>
> root@juno:~# echo 1 > /sys/kernel/debug/sched/verbose
> root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> MC
>
> root@juno:~# cat /proc/schedstat | head -4 | grep ^[cd]
> cpu0 0 0 0 0 0 0 2378087600 310618500 8152
> domain0 3f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> [...]
>
On 06/02/2024 16:06, Qais Yousef wrote:
> On 02/05/24 20:49, Dietmar Eggemann wrote:
>> On 26/01/2024 02:46, Qais Yousef wrote:
>>> On 01/25/24 18:40, Vincent Guittot wrote:
>>>> On Wed, 24 Jan 2024 at 23:30, Qais Yousef <[email protected]> wrote:
>>>>>
>>>>> On 01/23/24 09:26, Vincent Guittot wrote:
>>>>>> On Fri, 5 Jan 2024 at 23:20, Qais Yousef <[email protected]> wrote:
>>>>>>>
>>>>>>> From: Qais Yousef <[email protected]>
[...]
>>> It seems we flatten topologies but not sched domains. I see all cpus shown as
>>> core_siblings. The DT for apple silicon sets clusters in the cpu-map - it
>>> seems the flattened topology code detects the LLC correctly but still keeps
>>> the sched domains not flattened. Is this a bug? I thought we would still end
>>> up with one sched domain.
>>
>> IMHO, if you have a cpu_map entry with > 1 cluster in your dtb, you end
>> up with MC and PKG (former DIE) Sched Domain (SD) level. And misfit load
>
> Hmm, okay. I thought the detection of topology where we know the LLC is shared
> will cause the sched domains to collapse too.
>
> >> balance takes potentially longer on PKG than on MC.
>
> > Why potentially longer? We iterate through the domains the CPU belongs to. If
> > the first iteration (at MC) pulled something, then once we get to PKG we're
> > less likely to pull again?
There are a couple of mechanisms in place to let load-balance on higher
sd levels happen less frequently, eg:
load_balance() -> should_we_balance() + continue_balancing
interval = get_sd_balance_interval(sd, busy) in rebalance_domains()
rq->avg_idle versus sd->max_newidle_lb_cost
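A toy illustration of the get_sd_balance_interval() point (user-space, with
assumed numbers rather than the kernel's exact constants): the interval roughly
starts from the domain span and is stretched by a busy factor, so PKG naturally
gets a longer period than MC:

#include <stdio.h>

/*
 * Toy model of a per-domain balance interval: it starts from the number
 * of CPUs the domain spans and is stretched by a busy factor when the
 * CPU doing the balance is busy. Numbers below are assumptions for the
 * example, not the kernel's exact values.
 */
static unsigned int balance_interval_ms(unsigned int span_weight,
					int cpu_busy,
					unsigned int busy_factor)
{
	unsigned int interval = span_weight;	/* baseline ~ domain span in ms */

	if (cpu_busy)
		interval *= busy_factor;

	return interval;
}

int main(void)
{
	/* e.g. an MC domain spanning 4 CPUs vs a PKG domain spanning 8 */
	printf("MC  (busy): %u ms\n", balance_interval_ms(4, 1, 16));
	printf("PKG (busy): %u ms\n", balance_interval_ms(8, 1, 16));
	printf("MC  (idle): %u ms\n", balance_interval_ms(4, 0, 16));
	return 0;
}

And IIRC failed attempts can double the interval further, up to the domain's
max, which stretches it even more at higher levels.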
> Anyway. I think I am hitting a bug here. The behavior doesn't look right to me
> given the delays I'm seeing and the fact we do the ilb but for some reason fail
> to pull
[...]
On 02/06/24 18:17, Dietmar Eggemann wrote:
> On 06/02/2024 16:06, Qais Yousef wrote:
> > On 02/05/24 20:49, Dietmar Eggemann wrote:
> >> On 26/01/2024 02:46, Qais Yousef wrote:
> >>> On 01/25/24 18:40, Vincent Guittot wrote:
> >>>> On Wed, 24 Jan 2024 at 23:30, Qais Yousef <[email protected]> wrote:
> >>>>>
> >>>>> On 01/23/24 09:26, Vincent Guittot wrote:
> >>>>>> On Fri, 5 Jan 2024 at 23:20, Qais Yousef <[email protected]> wrote:
> >>>>>>>
> >>>>>>> From: Qais Yousef <[email protected]>
>
> [...]
>
> >>> It seems we flatten topologies but not sched domains. I see all cpus shown as
> >>> core_siblings. The DT for apple silicon sets clusters in the cpu-map - it
> >>> seems the flattened topology code detects the LLC correctly but still keeps
> >>> the sched domains not flattened. Is this a bug? I thought we would still end
> >>> up with one sched domain.
> >>
> >> IMHO, if you have a cpu_map entry with > 1 cluster in your dtb, you end
> >> up with MC and PKG (former DIE) Sched Domain (SD) level. And misfit load
> >
> > Hmm, okay. I thought the detection of topology where we know the LLC is shared
> > will cause the sched domains to collapse too.
> >
> >> balance takes potentially longer on PKG than on MC.
> >
> > Why potentially longer? We iterate through the domains the CPU belongs to. If
> > the first iteration (at MC) pulled something, then once we get to PKG we're
> > less likely to pull again?
>
> There are a couple of mechanisms in place to let load-balance on higher
> sd levels happen less frequently, eg:
>
> load_balance() -> should_we_balance() + continue_balancing
>
> interval = get_sd_balance_interval(sd, busy) in rebalance_domains()
>
> rq->avg_idle versus sd->max_newidle_lb_cost
Okay thanks. That last one I missed.
>
> > Anyway. I think I am hitting a bug here. The behavior doesn't look right to me
> > given the delays I'm seeing and the fact we do the ilb but for some reason fail
> > to pull
>
> [...]
>