2024-03-24 00:46:28

by Qais Yousef

Subject: [PATCH v8 0/4] sched: Don't trigger misfit if affinity is restricted

There was a discussion on handling a hotplug operation that removes a capacity
level and thereby leads to unnecessary misfit load balancing triggering again.
I opted not to handle it now, but a working patch is available in [1]. I don't
feel strongly about it and would leave it up to the maintainers to push
whichever direction they prefer. Patch 4 will make sure that balance_interval
and nr_balance_failed won't grow unnecessarily due to these unnecessary misfit
load balances. It will lead to some sub-optimality, but no incorrect behavior.

After the 6.9 merge window, the dynamic Energy Model series should be merged,
which can lead to the capacities of the CPUs changing at runtime. This means
I need to post a follow-up patch to handle this situation and ensure
max_allowed_capacity is correct after an EM update. That might then make
handling the hotplug operation attractive too, as there would be some common
ground.

[1] https://lore.kernel.org/lkml/20240321122039.7gk2mc3syvkrvhjz@airbuntu/

Changes since v7:

* Remove sd arg from check_misfit_status()
* Fix a typo in the commit message of patch 2.
* Add Reviewed-by from Vincent

Changes since v6:

* Simplify update_misfit_status

Changes since v5:

* Remove redundant check to rq->rd->max_cpu_capacity
* Simplify check_misfit_status() further by removing unnecessary checks.
* Add new patch to remove no longer used rd->max_cpu_capacity
* Add new patch to prevent misfit lb from polluting balance_interval
and nr_balance_failed

Changes since v4:

* Store max_allowed_capacity in task_struct and populate it when
affinity changes to avoid iterating through the capacities list in the
fast path (Vincent)
* Use rq->rd->max_cpu_capacity which is updated after hotplug
operations to check biggest allowed capacity in the system.
* Undo the change to check_misfit_status() and improve the function to
avoid similar confusion in the future.
* Split the patches differently. Exporting and sorting the capacity list
is now patch 1; handling of affinity for misfit detection is patch 2.

Changes since v3:

* Update commit message of patch 2 to be less verbose

Changes since v2:

* Convert access of asym_cap_list to be rcu protected
* Add new patch to sort the list in descending order
* Move some declarations inside affinity check block
* Remove now redundant check against max_cpu_capacity in check_misfit_status()

Changes since v1:

* Use asym_cap_list (thanks Dietmar) to iterate instead of iterating
through every cpu, which Vincent was concerned about.
* Use uclamped util to compare with capacity instead of util_fits_cpu()
when iterating through capacities (Dietmar).
* Update commit log with test results to better demonstrate the problem

v1 discussion: https://lore.kernel.org/lkml/[email protected]/
v2 discussion: https://lore.kernel.org/lkml/[email protected]/
v3 discussion: https://lore.kernel.org/lkml/[email protected]/
v4 discussion: https://lore.kernel.org/lkml/[email protected]/
v5 discussion: https://lore.kernel.org/lkml/[email protected]/
v6, v7 discussion: https://lore.kernel.org/lkml/[email protected]/

Thanks!

--
Qais Yousef

Qais Yousef (4):
sched/topology: Export asym_capacity_list
sched/fair: Check a task has a fitting cpu when updating misfit
sched/topology: Remove max_cpu_capacity from root_domain
sched/fair: Don't double balance_interval for migrate_misfit

include/linux/sched.h | 1 +
init/init_task.c | 1 +
kernel/sched/fair.c | 79 +++++++++++++++++++++++++++++++----------
kernel/sched/sched.h | 16 +++++++--
kernel/sched/topology.c | 56 ++++++++++++++---------------
5 files changed, 104 insertions(+), 49 deletions(-)

--
2.34.1



2024-03-24 00:46:31

by Qais Yousef

Subject: [PATCH v8 1/4] sched/topology: Export asym_capacity_list

So that we can use it to iterate through the available capacities in the
system. Sort asym_cap_list in descending order, as expected users are
likely to be interested in the highest capacity first.

Make the list RCU protected to allow for cheap access in hot paths.
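
For illustration only (not part of the diff below), a reader of the
exported list is expected to look roughly like this: walk the list under
RCU and, because it is sorted in descending order, stop at the first
entry that intersects the cpumask of interest. The helper name here is
made up; patch 2 adds the real user.

static unsigned long highest_allowed_capacity(const struct cpumask *mask)
{
	struct asym_cap_data *entry;
	unsigned long capacity = SCHED_CAPACITY_SCALE;

	rcu_read_lock();
	list_for_each_entry_rcu(entry, &asym_cap_list, link) {
		/* List is sorted descendingly; the first hit is the max. */
		if (cpumask_intersects(mask, cpu_capacity_span(entry))) {
			capacity = entry->capacity;
			break;
		}
	}
	rcu_read_unlock();

	return capacity;
}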

Reviewed-by: Vincent Guittot <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/sched.h | 14 ++++++++++++++
kernel/sched/topology.c | 43 ++++++++++++++++++++++++-----------------
2 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 41024c1c49b4..f77c00dddfe1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -109,6 +109,20 @@ extern int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;
extern int sched_rr_timeslice;

+/*
+ * Asymmetric CPU capacity bits
+ */
+struct asym_cap_data {
+ struct list_head link;
+ struct rcu_head rcu;
+ unsigned long capacity;
+ unsigned long cpus[];
+};
+
+extern struct list_head asym_cap_list;
+
+#define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus)
+
/*
* Helpers for converting nanosecond timing to jiffy resolution
*/
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 99ea5986038c..44ed3d0812ab 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1329,24 +1329,13 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
update_group_capacity(sd, cpu);
}

-/*
- * Asymmetric CPU capacity bits
- */
-struct asym_cap_data {
- struct list_head link;
- unsigned long capacity;
- unsigned long cpus[];
-};
-
/*
* Set of available CPUs grouped by their corresponding capacities
* Each list entry contains a CPU mask reflecting CPUs that share the same
* capacity.
* The lifespan of data is unlimited.
*/
-static LIST_HEAD(asym_cap_list);
-
-#define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus)
+LIST_HEAD(asym_cap_list);

/*
* Verify whether there is any CPU capacity asymmetry in a given sched domain.
@@ -1386,21 +1375,39 @@ asym_cpu_capacity_classify(const struct cpumask *sd_span,

}

+static void free_asym_cap_entry(struct rcu_head *head)
+{
+ struct asym_cap_data *entry = container_of(head, struct asym_cap_data, rcu);
+ kfree(entry);
+}
+
static inline void asym_cpu_capacity_update_data(int cpu)
{
unsigned long capacity = arch_scale_cpu_capacity(cpu);
- struct asym_cap_data *entry = NULL;
+ struct asym_cap_data *insert_entry = NULL;
+ struct asym_cap_data *entry;

+ /*
+ * Search if capacity already exists. If not, track the entry where
+ * we should insert to keep the list ordered descendingly.
+ */
list_for_each_entry(entry, &asym_cap_list, link) {
if (capacity == entry->capacity)
goto done;
+ else if (!insert_entry && capacity > entry->capacity)
+ insert_entry = list_prev_entry(entry, link);
}

entry = kzalloc(sizeof(*entry) + cpumask_size(), GFP_KERNEL);
if (WARN_ONCE(!entry, "Failed to allocate memory for asymmetry data\n"))
return;
entry->capacity = capacity;
- list_add(&entry->link, &asym_cap_list);
+
+ /* If NULL then the new capacity is the smallest, add last. */
+ if (!insert_entry)
+ list_add_tail_rcu(&entry->link, &asym_cap_list);
+ else
+ list_add_rcu(&entry->link, &insert_entry->link);
done:
__cpumask_set_cpu(cpu, cpu_capacity_span(entry));
}
@@ -1423,8 +1430,8 @@ static void asym_cpu_capacity_scan(void)

list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
if (cpumask_empty(cpu_capacity_span(entry))) {
- list_del(&entry->link);
- kfree(entry);
+ list_del_rcu(&entry->link);
+ call_rcu(&entry->rcu, free_asym_cap_entry);
}
}

@@ -1434,8 +1441,8 @@ static void asym_cpu_capacity_scan(void)
*/
if (list_is_singular(&asym_cap_list)) {
entry = list_first_entry(&asym_cap_list, typeof(*entry), link);
- list_del(&entry->link);
- kfree(entry);
+ list_del_rcu(&entry->link);
+ call_rcu(&entry->rcu, free_asym_cap_entry);
}
}

--
2.34.1


2024-03-24 00:46:41

by Qais Yousef

Subject: [PATCH v8 2/4] sched/fair: Check a task has a fitting cpu when updating misfit

If a misfit task is affined to a subset of the possible cpus, we need to
verify that one of these cpus can fit it. Otherwise the load balancer
code will continuously trigger needlessly, leading the balance_interval
to increase in turn, and we eventually end up in a situation where real
imbalances take a long time to address because of this impossible
imbalance situation.

This can happen in Android world where it's common for background tasks
to be restricted to little cores.

Similarly, if the task can't fit on the biggest core, triggering misfit
is pointless as that is the best we can ever get on this system.

To be able to detect that, we use asym_cap_list to iterate through the
capacities in the system to see if the task is able to run at a higher
capacity level based on its p->cpus_ptr. We do that when the affinity
changes, when a fair task is forked, or when a task switches to the fair
policy. We store the max_allowed_capacity in task_struct to allow for
cheap comparison in the fast path.

Improve the check_misfit_status() function by removing redundant checks.
misfit_task_load will be 0 if the task can't move to a bigger CPU, and
nohz_balancer_kick() already calls check_cpu_capacity() before calling
check_misfit_status().

Test:
=====

Add

trace_printk("balance_interval = %lu\n", interval)

in get_sd_balance_interval().

run
if [ "$MASK" != "0" ]; then
adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
fi
sleep 10
// parse ftrace buffer counting the occurrences of each value

Where MASK is either:

* 0: no busy task running
* 1: busy task is pinned to 1 cpu; handled today to not cause
misfit
* f: busy task pinned to little cores, simulates busy background
task, demonstrates the problem to be fixed

Results:
========

Note how the occurrence of balance_interval = 128 overshoots for MASK = f.

BEFORE
------

MASK=0

1 balance_interval = 175
120 balance_interval = 128
846 balance_interval = 64
55 balance_interval = 63
215 balance_interval = 32
2 balance_interval = 31
2 balance_interval = 16
4 balance_interval = 8
1870 balance_interval = 4
65 balance_interval = 2

MASK=1

27 balance_interval = 175
37 balance_interval = 127
840 balance_interval = 64
167 balance_interval = 63
449 balance_interval = 32
84 balance_interval = 31
304 balance_interval = 16
1156 balance_interval = 8
2781 balance_interval = 4
428 balance_interval = 2

MASK=f

1 balance_interval = 175
1328 balance_interval = 128
44 balance_interval = 64
101 balance_interval = 63
25 balance_interval = 32
5 balance_interval = 31
23 balance_interval = 16
23 balance_interval = 8
4306 balance_interval = 4
177 balance_interval = 2

AFTER
-----

Note how the high values almost disappear for all MASK values. The
system has background tasks that could trigger the problem without
simulating it, even with MASK=0.

MASK=0

103 balance_interval = 63
19 balance_interval = 31
194 balance_interval = 8
4827 balance_interval = 4
179 balance_interval = 2

MASK=1

131 balance_interval = 63
1 balance_interval = 31
87 balance_interval = 8
3600 balance_interval = 4
7 balance_interval = 2

MASK=f

8 balance_interval = 127
182 balance_interval = 63
3 balance_interval = 31
9 balance_interval = 16
415 balance_interval = 8
3415 balance_interval = 4
21 balance_interval = 2

Reviewed-by: Vincent Guittot <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
---
include/linux/sched.h | 1 +
init/init_task.c | 1 +
kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++-----------
3 files changed, 52 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7eb7f31af796..37b95dbdb4cb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -835,6 +835,7 @@ struct task_struct {
#endif

unsigned int policy;
+ unsigned long max_allowed_capacity;
int nr_cpus_allowed;
const cpumask_t *cpus_ptr;
cpumask_t *user_cpus_ptr;
diff --git a/init/init_task.c b/init/init_task.c
index 4daee6d761c8..2558b719e053 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -77,6 +77,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.cpus_ptr = &init_task.cpus_mask,
.user_cpus_ptr = NULL,
.cpus_mask = CPU_MASK_ALL,
+ .max_allowed_capacity = SCHED_CAPACITY_SCALE,
.nr_cpus_allowed= NR_CPUS,
.mm = NULL,
.active_mm = &init_mm,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8e50fbac345..3b88cf58fb45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5092,15 +5092,19 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)

static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
{
+ int cpu = cpu_of(rq);
+
if (!sched_asym_cpucap_active())
return;

- if (!p || p->nr_cpus_allowed == 1) {
- rq->misfit_task_load = 0;
- return;
- }
+ /*
+ * Affinity allows us to go somewhere higher? Or are we on biggest
+ * available CPU already? Or do we fit into this CPU ?
+ */
+ if (!p || (p->nr_cpus_allowed == 1) ||
+ (arch_scale_cpu_capacity(cpu) == p->max_allowed_capacity) ||
+ task_fits_cpu(p, cpu)) {

- if (task_fits_cpu(p, cpu_of(rq))) {
rq->misfit_task_load = 0;
return;
}
@@ -8247,6 +8251,36 @@ static void task_dead_fair(struct task_struct *p)
remove_entity_load_avg(&p->se);
}

+/*
+ * Set the max capacity the task is allowed to run at for misfit detection.
+ */
+static void set_task_max_allowed_capacity(struct task_struct *p)
+{
+ struct asym_cap_data *entry;
+
+ if (!sched_asym_cpucap_active())
+ return;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(entry, &asym_cap_list, link) {
+ cpumask_t *cpumask;
+
+ cpumask = cpu_capacity_span(entry);
+ if (!cpumask_intersects(p->cpus_ptr, cpumask))
+ continue;
+
+ p->max_allowed_capacity = entry->capacity;
+ break;
+ }
+ rcu_read_unlock();
+}
+
+static void set_cpus_allowed_fair(struct task_struct *p, struct affinity_context *ctx)
+{
+ set_cpus_allowed_common(p, ctx);
+ set_task_max_allowed_capacity(p);
+}
+
static int
balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
@@ -8255,6 +8289,8 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

return sched_balance_newidle(rq, rf) != 0;
}
+#else
+static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
#endif /* CONFIG_SMP */

static void set_next_buddy(struct sched_entity *se)
@@ -9604,16 +9640,10 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
(arch_scale_cpu_capacity(cpu_of(rq)) * 100));
}

-/*
- * Check whether a rq has a misfit task and if it looks like we can actually
- * help that task: we can migrate the task to a CPU of higher capacity, or
- * the task's current CPU is heavily pressured.
- */
-static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
+/* Check if the rq has a misfit task */
+static inline bool check_misfit_status(struct rq *rq)
{
- return rq->misfit_task_load &&
- (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
- check_cpu_capacity(rq, sd));
+ return rq->misfit_task_load;
}

/*
@@ -11917,7 +11947,7 @@ static void nohz_balancer_kick(struct rq *rq)
* When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
* to run the misfit task on.
*/
- if (check_misfit_status(rq, sd)) {
+ if (check_misfit_status(rq)) {
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
goto unlock;
}
@@ -12642,6 +12672,8 @@ static void task_fork_fair(struct task_struct *p)
rq_lock(rq, &rf);
update_rq_clock(rq);

+ set_task_max_allowed_capacity(p);
+
cfs_rq = task_cfs_rq(current);
curr = cfs_rq->curr;
if (curr)
@@ -12765,6 +12797,8 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
attach_task_cfs_rq(p);

+ set_task_max_allowed_capacity(p);
+
if (task_on_rq_queued(p)) {
/*
* We were most likely switched from sched_rt, so
@@ -13136,7 +13170,7 @@ DEFINE_SCHED_CLASS(fair) = {
.rq_offline = rq_offline_fair,

.task_dead = task_dead_fair,
- .set_cpus_allowed = set_cpus_allowed_common,
+ .set_cpus_allowed = set_cpus_allowed_fair,
#endif

.task_tick = task_tick_fair,
--
2.34.1


2024-03-24 00:46:44

by Qais Yousef

Subject: [PATCH v8 3/4] sched/topology: Remove max_cpu_capacity from root_domain

The value is no longer used as we now keep track of max_allowed_capacity
for each task instead.

Reviewed-by: Vincent Guittot <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/sched.h | 2 --
kernel/sched/topology.c | 13 ++-----------
2 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f77c00dddfe1..4f9e952d4fad 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -917,8 +917,6 @@ struct root_domain {
cpumask_var_t rto_mask;
struct cpupri cpupri;

- unsigned long max_cpu_capacity;
-
/*
* NULL-terminated list of performance domains intersecting with the
* CPUs of the rd. Protected by RCU.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 44ed3d0812ab..63aecd2a7a9f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2514,16 +2514,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
- unsigned long capacity;
-
rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);

- capacity = arch_scale_cpu_capacity(i);
- /* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
- if (capacity > READ_ONCE(d.rd->max_cpu_capacity))
- WRITE_ONCE(d.rd->max_cpu_capacity, capacity);
-
cpu_attach_domain(sd, d.rd, i);

if (lowest_flag_domain(i, SD_CLUSTER))
@@ -2537,10 +2530,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
if (has_cluster)
static_branch_inc_cpuslocked(&sched_cluster_active);

- if (rq && sched_debug_verbose) {
- pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
- cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
- }
+ if (rq && sched_debug_verbose)
+ pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map));

ret = 0;
error:
--
2.34.1


2024-03-24 00:46:58

by Qais Yousef

Subject: [PATCH v8 4/4] sched/fair: Don't double balance_interval for migrate_misfit

A misfit migration failure is not necessarily an indication of the
system being busy and requiring a backoff of the load balancer
activities. But pushing balance_interval high could mean generally
delaying other misfit activities or other types of imbalances.

Also don't pollute nr_balance_failed because of misfit failures. The
value is used for enabling cache-hot migration and in the
migrate_util/migrate_load migration types, none of which should be
impacted (skewed) by misfit failures.

Reviewed-by: Vincent Guittot <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/fair.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b88cf58fb45..18da54da48a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11443,8 +11443,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
* We do not want newidle balance, which can be very
* frequent, pollute the failure counter causing
* excessive cache_hot migrations and active balances.
+ *
+ * Similarly for migration_misfit which is not related to
+ * load/util migration, don't pollute nr_balance_failed.
*/
- if (idle != CPU_NEWLY_IDLE)
+ if (idle != CPU_NEWLY_IDLE &&
+ env.migration_type != migrate_misfit)
sd->nr_balance_failed++;

if (need_active_balance(&env)) {
@@ -11527,8 +11531,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
* repeatedly reach this code, which would lead to balance_interval
* skyrocketing in a short amount of time. Skip the balance_interval
* increase logic to avoid that.
+ *
+ * Similarly, a misfit migration failure is not necessarily an
+ * indication of the system being busy and requiring the lb to back off
+ * to let it settle down.
*/
- if (env.idle == CPU_NEWLY_IDLE)
+ if (env.idle == CPU_NEWLY_IDLE ||
+ env.migration_type == migrate_misfit)
goto out;

/* tune up the balancing interval */
--
2.34.1


2024-03-25 14:53:30

by tip-bot2 for Tony Luck

Subject: [tip: sched/core] sched/fair: Check if a task has a fitting CPU when updating misfit

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 22d5607400c62c72da9b60e3324744be83e147a4
Gitweb: https://git.kernel.org/tip/22d5607400c62c72da9b60e3324744be83e147a4
Author: Qais Yousef <[email protected]>
AuthorDate: Sun, 24 Mar 2024 00:45:50
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 25 Mar 2024 12:09:54 +01:00

sched/fair: Check if a task has a fitting CPU when updating misfit

If a misfit task is affined to a subset of the possible CPUs, we need to
verify that one of these CPUs can fit it. Otherwise the load balancer
code will continuously trigger needlessly, leading the balance_interval
to increase in turn, and we eventually end up in a situation where real
imbalances take a long time to address because of this impossible
imbalance situation.

This can happen in Android world where it's common for background tasks
to be restricted to little cores.

Similarly, if the task can't fit on the biggest core, triggering misfit
is pointless as that is the best we can ever get on this system.

To be able to detect that, we use asym_cap_list to iterate through the
capacities in the system to see if the task is able to run at a higher
capacity level based on its p->cpus_ptr. We do that when the affinity
changes, when a fair task is forked, or when a task switches to the fair
policy. We store the max_allowed_capacity in task_struct to allow for
cheap comparison in the fast path.

Improve the check_misfit_status() function by removing redundant checks.
misfit_task_load will be 0 if the task can't move to a bigger CPU, and
nohz_balancer_kick() already calls check_cpu_capacity() before calling
check_misfit_status().

Test:
=====

Add

trace_printk("balance_interval = %lu\n", interval)

in get_sd_balance_interval().

run
if [ "$MASK" != "0" ]; then
adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
fi
sleep 10
// parse ftrace buffer counting the occurrences of each value

Where MASK is either:

* 0: no busy task running
* 1: busy task is pinned to 1 cpu; handled today to not cause
misfit
* f: busy task pinned to little cores, simulates busy background
task, demonstrates the problem to be fixed

Results:
========

Note how the occurrence of balance_interval = 128 overshoots for MASK = f.

BEFORE
------

MASK=0

1 balance_interval = 175
120 balance_interval = 128
846 balance_interval = 64
55 balance_interval = 63
215 balance_interval = 32
2 balance_interval = 31
2 balance_interval = 16
4 balance_interval = 8
1870 balance_interval = 4
65 balance_interval = 2

MASK=1

27 balance_interval = 175
37 balance_interval = 127
840 balance_interval = 64
167 balance_interval = 63
449 balance_interval = 32
84 balance_interval = 31
304 balance_interval = 16
1156 balance_interval = 8
2781 balance_interval = 4
428 balance_interval = 2

MASK=f

1 balance_interval = 175
1328 balance_interval = 128
44 balance_interval = 64
101 balance_interval = 63
25 balance_interval = 32
5 balance_interval = 31
23 balance_interval = 16
23 balance_interval = 8
4306 balance_interval = 4
177 balance_interval = 2

AFTER
-----

Note how the high values almost disappear for all MASK values. The
system has background tasks that could trigger the problem without
simulating it, even with MASK=0.

MASK=0

103 balance_interval = 63
19 balance_interval = 31
194 balance_interval = 8
4827 balance_interval = 4
179 balance_interval = 2

MASK=1

131 balance_interval = 63
1 balance_interval = 31
87 balance_interval = 8
3600 balance_interval = 4
7 balance_interval = 2

MASK=f

8 balance_interval = 127
182 balance_interval = 63
3 balance_interval = 31
9 balance_interval = 16
415 balance_interval = 8
3415 balance_interval = 4
21 balance_interval = 2

Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/sched.h | 1 +-
init/init_task.c | 1 +-
kernel/sched/fair.c | 66 +++++++++++++++++++++++++++++++-----------
3 files changed, 52 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3ed40e9..c75fd46 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -835,6 +835,7 @@ struct task_struct {
#endif

unsigned int policy;
+ unsigned long max_allowed_capacity;
int nr_cpus_allowed;
const cpumask_t *cpus_ptr;
cpumask_t *user_cpus_ptr;
diff --git a/init/init_task.c b/init/init_task.c
index 4daee6d..2558b71 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -77,6 +77,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.cpus_ptr = &init_task.cpus_mask,
.user_cpus_ptr = NULL,
.cpus_mask = CPU_MASK_ALL,
+ .max_allowed_capacity = SCHED_CAPACITY_SCALE,
.nr_cpus_allowed= NR_CPUS,
.mm = NULL,
.active_mm = &init_mm,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8270e2..c47c4f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5098,15 +5098,19 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)

static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
{
+ int cpu = cpu_of(rq);
+
if (!sched_asym_cpucap_active())
return;

- if (!p || p->nr_cpus_allowed == 1) {
- rq->misfit_task_load = 0;
- return;
- }
+ /*
+ * Affinity allows us to go somewhere higher? Or are we on biggest
+ * available CPU already? Or do we fit into this CPU ?
+ */
+ if (!p || (p->nr_cpus_allowed == 1) ||
+ (arch_scale_cpu_capacity(cpu) == p->max_allowed_capacity) ||
+ task_fits_cpu(p, cpu)) {

- if (task_fits_cpu(p, cpu_of(rq))) {
rq->misfit_task_load = 0;
return;
}
@@ -8253,6 +8257,36 @@ static void task_dead_fair(struct task_struct *p)
remove_entity_load_avg(&p->se);
}

+/*
+ * Set the max capacity the task is allowed to run at for misfit detection.
+ */
+static void set_task_max_allowed_capacity(struct task_struct *p)
+{
+ struct asym_cap_data *entry;
+
+ if (!sched_asym_cpucap_active())
+ return;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(entry, &asym_cap_list, link) {
+ cpumask_t *cpumask;
+
+ cpumask = cpu_capacity_span(entry);
+ if (!cpumask_intersects(p->cpus_ptr, cpumask))
+ continue;
+
+ p->max_allowed_capacity = entry->capacity;
+ break;
+ }
+ rcu_read_unlock();
+}
+
+static void set_cpus_allowed_fair(struct task_struct *p, struct affinity_context *ctx)
+{
+ set_cpus_allowed_common(p, ctx);
+ set_task_max_allowed_capacity(p);
+}
+
static int
balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
@@ -8261,6 +8295,8 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

return sched_balance_newidle(rq, rf) != 0;
}
+#else
+static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
#endif /* CONFIG_SMP */

static void set_next_buddy(struct sched_entity *se)
@@ -9610,16 +9646,10 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
(arch_scale_cpu_capacity(cpu_of(rq)) * 100));
}

-/*
- * Check whether a rq has a misfit task and if it looks like we can actually
- * help that task: we can migrate the task to a CPU of higher capacity, or
- * the task's current CPU is heavily pressured.
- */
-static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
+/* Check if the rq has a misfit task */
+static inline bool check_misfit_status(struct rq *rq)
{
- return rq->misfit_task_load &&
- (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
- check_cpu_capacity(rq, sd));
+ return rq->misfit_task_load;
}

/*
@@ -11923,7 +11953,7 @@ static void nohz_balancer_kick(struct rq *rq)
* When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
* to run the misfit task on.
*/
- if (check_misfit_status(rq, sd)) {
+ if (check_misfit_status(rq)) {
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
goto unlock;
}
@@ -12648,6 +12678,8 @@ static void task_fork_fair(struct task_struct *p)
rq_lock(rq, &rf);
update_rq_clock(rq);

+ set_task_max_allowed_capacity(p);
+
cfs_rq = task_cfs_rq(current);
curr = cfs_rq->curr;
if (curr)
@@ -12771,6 +12803,8 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
attach_task_cfs_rq(p);

+ set_task_max_allowed_capacity(p);
+
if (task_on_rq_queued(p)) {
/*
* We were most likely switched from sched_rt, so
@@ -13142,7 +13176,7 @@ DEFINE_SCHED_CLASS(fair) = {
.rq_offline = rq_offline_fair,

.task_dead = task_dead_fair,
- .set_cpus_allowed = set_cpus_allowed_common,
+ .set_cpus_allowed = set_cpus_allowed_fair,
#endif

.task_tick = task_tick_fair,

2024-03-25 14:54:16

by tip-bot2 for Tony Luck

Subject: [tip: sched/core] sched/fair: Don't double balance_interval for migrate_misfit

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 58eeb2d79b542c678c46e245dba6b66936368a99
Gitweb: https://git.kernel.org/tip/58eeb2d79b542c678c46e245dba6b66936368a99
Author: Qais Yousef <[email protected]>
AuthorDate: Sun, 24 Mar 2024 00:45:52
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 25 Mar 2024 12:09:57 +01:00

sched/fair: Don't double balance_interval for migrate_misfit

A misfit migration failure is not necessarily an indication of the
system being busy and requiring a backoff of the load balancer
activities. But pushing balance_interval high could mean generally
delaying other misfit activities or other types of imbalances.

Also don't pollute nr_balance_failed because of misfit failures. The
value is used for enabling cache-hot migration and in the
migrate_util/migrate_load migration types, none of which should be
impacted (skewed) by misfit failures.

Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c47c4f2..dbf4f1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11449,8 +11449,12 @@ more_balance:
* We do not want newidle balance, which can be very
* frequent, pollute the failure counter causing
* excessive cache_hot migrations and active balances.
+ *
+ * Similarly for migration_misfit which is not related to
+ * load/util migration, don't pollute nr_balance_failed.
*/
- if (idle != CPU_NEWLY_IDLE)
+ if (idle != CPU_NEWLY_IDLE &&
+ env.migration_type != migrate_misfit)
sd->nr_balance_failed++;

if (need_active_balance(&env)) {
@@ -11533,8 +11537,13 @@ out_one_pinned:
* repeatedly reach this code, which would lead to balance_interval
* skyrocketing in a short amount of time. Skip the balance_interval
* increase logic to avoid that.
+ *
+ * Similarly, a misfit migration failure is not necessarily an
+ * indication of the system being busy and requiring the lb to back off
+ * to let it settle down.
*/
- if (env.idle == CPU_NEWLY_IDLE)
+ if (env.idle == CPU_NEWLY_IDLE ||
+ env.migration_type == migrate_misfit)
goto out;

/* tune up the balancing interval */

2024-03-25 15:03:21

by tip-bot2 for Tony Luck

Subject: [tip: sched/core] sched/topology: Remove root_domain::max_cpu_capacity

The following commit has been merged into the sched/core branch of tip:

Commit-ID: fa427e8e53d8db15090af7e952a55870dc2a453f
Gitweb: https://git.kernel.org/tip/fa427e8e53d8db15090af7e952a55870dc2a453f
Author: Qais Yousef <[email protected]>
AuthorDate: Sun, 24 Mar 2024 00:45:51
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 25 Mar 2024 12:09:56 +01:00

sched/topology: Remove root_domain::max_cpu_capacity

The value is no longer used as we now keep track of max_allowed_capacity
for each task instead.

Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/sched.h | 2 --
kernel/sched/topology.c | 13 ++-----------
2 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f77c00d..4f9e952 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -917,8 +917,6 @@ struct root_domain {
cpumask_var_t rto_mask;
struct cpupri cpupri;

- unsigned long max_cpu_capacity;
-
/*
* NULL-terminated list of performance domains intersecting with the
* CPUs of the rd. Protected by RCU.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 44ed3d0..63aecd2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2514,16 +2514,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
- unsigned long capacity;
-
rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);

- capacity = arch_scale_cpu_capacity(i);
- /* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
- if (capacity > READ_ONCE(d.rd->max_cpu_capacity))
- WRITE_ONCE(d.rd->max_cpu_capacity, capacity);
-
cpu_attach_domain(sd, d.rd, i);

if (lowest_flag_domain(i, SD_CLUSTER))
@@ -2537,10 +2530,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
if (has_cluster)
static_branch_inc_cpuslocked(&sched_cluster_active);

- if (rq && sched_debug_verbose) {
- pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
- cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
- }
+ if (rq && sched_debug_verbose)
+ pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map));

ret = 0;
error:

2024-03-25 15:05:20

by tip-bot2 for Tony Luck

Subject: [tip: sched/core] sched/topology: Export asym_cap_list

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 77222b0d12e8ae6f082261842174cc2e981bf99c
Gitweb: https://git.kernel.org/tip/77222b0d12e8ae6f082261842174cc2e981bf99c
Author: Qais Yousef <[email protected]>
AuthorDate: Sun, 24 Mar 2024 00:45:49
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 25 Mar 2024 12:09:53 +01:00

sched/topology: Export asym_cap_list

So that we can use it to iterate through the available capacities in the
system. Sort asym_cap_list in descending order, as expected users are
likely to be interested in the highest capacity first.

Make the list RCU protected to allow for cheap access in hot paths.

Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/sched.h | 14 +++++++++++++-
kernel/sched/topology.c | 43 +++++++++++++++++++++++-----------------
2 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 41024c1..f77c00d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -110,6 +110,20 @@ extern int sysctl_sched_rt_runtime;
extern int sched_rr_timeslice;

/*
+ * Asymmetric CPU capacity bits
+ */
+struct asym_cap_data {
+ struct list_head link;
+ struct rcu_head rcu;
+ unsigned long capacity;
+ unsigned long cpus[];
+};
+
+extern struct list_head asym_cap_list;
+
+#define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus)
+
+/*
* Helpers for converting nanosecond timing to jiffy resolution
*/
#define NS_TO_JIFFIES(TIME) ((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 99ea598..44ed3d0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1330,23 +1330,12 @@ next:
}

/*
- * Asymmetric CPU capacity bits
- */
-struct asym_cap_data {
- struct list_head link;
- unsigned long capacity;
- unsigned long cpus[];
-};
-
-/*
* Set of available CPUs grouped by their corresponding capacities
* Each list entry contains a CPU mask reflecting CPUs that share the same
* capacity.
* The lifespan of data is unlimited.
*/
-static LIST_HEAD(asym_cap_list);
-
-#define cpu_capacity_span(asym_data) to_cpumask((asym_data)->cpus)
+LIST_HEAD(asym_cap_list);

/*
* Verify whether there is any CPU capacity asymmetry in a given sched domain.
@@ -1386,21 +1375,39 @@ asym_cpu_capacity_classify(const struct cpumask *sd_span,

}

+static void free_asym_cap_entry(struct rcu_head *head)
+{
+ struct asym_cap_data *entry = container_of(head, struct asym_cap_data, rcu);
+ kfree(entry);
+}
+
static inline void asym_cpu_capacity_update_data(int cpu)
{
unsigned long capacity = arch_scale_cpu_capacity(cpu);
- struct asym_cap_data *entry = NULL;
+ struct asym_cap_data *insert_entry = NULL;
+ struct asym_cap_data *entry;

+ /*
+ * Search if capacity already exists. If not, track the entry where
+ * we should insert to keep the list ordered descendingly.
+ */
list_for_each_entry(entry, &asym_cap_list, link) {
if (capacity == entry->capacity)
goto done;
+ else if (!insert_entry && capacity > entry->capacity)
+ insert_entry = list_prev_entry(entry, link);
}

entry = kzalloc(sizeof(*entry) + cpumask_size(), GFP_KERNEL);
if (WARN_ONCE(!entry, "Failed to allocate memory for asymmetry data\n"))
return;
entry->capacity = capacity;
- list_add(&entry->link, &asym_cap_list);
+
+ /* If NULL then the new capacity is the smallest, add last. */
+ if (!insert_entry)
+ list_add_tail_rcu(&entry->link, &asym_cap_list);
+ else
+ list_add_rcu(&entry->link, &insert_entry->link);
done:
__cpumask_set_cpu(cpu, cpu_capacity_span(entry));
}
@@ -1423,8 +1430,8 @@ static void asym_cpu_capacity_scan(void)

list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
if (cpumask_empty(cpu_capacity_span(entry))) {
- list_del(&entry->link);
- kfree(entry);
+ list_del_rcu(&entry->link);
+ call_rcu(&entry->rcu, free_asym_cap_entry);
}
}

@@ -1434,8 +1441,8 @@ static void asym_cpu_capacity_scan(void)
*/
if (list_is_singular(&asym_cap_list)) {
entry = list_first_entry(&asym_cap_list, typeof(*entry), link);
- list_del(&entry->link);
- kfree(entry);
+ list_del_rcu(&entry->link);
+ call_rcu(&entry->rcu, free_asym_cap_entry);
}
}


2024-03-29 02:10:03

by Qais Yousef

Subject: Re: [PATCH v8 0/4] sched: Don't trigger misfit if affinity is restricted

+Lukasz

On 03/24/24 00:45, Qais Yousef wrote:
> There was a discussion on handling a hotplug operation that removes a capacity
> level and thereby leads to unnecessary misfit load balancing triggering again.
> I opted not to handle it now, but a working patch is available in [1]. I don't
> feel strongly about it and would leave it up to the maintainers to push
> whichever direction they prefer. Patch 4 will make sure that balance_interval
> and nr_balance_failed won't grow unnecessarily due to these unnecessary misfit
> load balances. It will lead to some sub-optimality, but no incorrect behavior.
>
> After the 6.9 merge window, the dynamic Energy Model series should be merged,
> which can lead to the capacities of the CPUs changing at runtime. This means
> I need to post a follow-up patch to handle this situation and ensure
> max_allowed_capacity is correct after an EM update. That might then make
> handling the hotplug operation attractive too, as there would be some common
> ground.

I was trying to work on this follow-up patch now that tip has moved to 6.9-rc1,
but I can't see how the new dynamic EM logic will trigger an update to
asym_cap_list. Did I miss something? Will/should init_cpu_capacity_callback()
be triggered after the update?

How will the scheduler know the new max capacities are different? Or did
I misunderstand the new EM runtime logic, and it won't lead to new
arch_scale_cpu_capacity() values?
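
For reference, the follow-up I have in mind is roughly the below
(illustrative sketch only; the function name is made up, the open
question above is exactly where it would be called from, and
set_task_max_allowed_capacity() is the helper added in patch 2):

static void update_max_allowed_capacities(void)
{
	struct task_struct *g, *p;

	/*
	 * Refresh every task's cached max_allowed_capacity after the CPU
	 * capacities have changed (e.g. after an EM update).
	 */
	rcu_read_lock();
	for_each_process_thread(g, p)
		set_task_max_allowed_capacity(p);
	rcu_read_unlock();
}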


Thanks!

--
Qais Yousef

>
> [1] https://lore.kernel.org/lkml/20240321122039.7gk2mc3syvkrvhjz@airbuntu/
>
> Changes since v7:
>
> * Remove sd arg from check_misfit_status()
> * Fix a typo in the commit message of patch 2.
> * Add Reviewed-by from Vincent
>
> Changes since v6:
>
> * Simplify update_misfit_status
>
> Changes since v5:
>
> * Remove redundant check to rq->rd->max_cpu_capacity
> * Simplify check_misfit_status() further by removing unnecessary checks.
> * Add new patch to remove no longer used rd->max_cpu_capacity
> * Add new patch to prevent misfit lb from polluting balance_interval
> and nr_balance_failed
>
> Changes since v4:
>
> * Store max_allowed_capacity in task_struct and populate it when
> affinity changes to avoid iterating through the capacities list in the
> fast path (Vincent)
> * Use rq->rd->max_cpu_capacity which is updated after hotplug
> operations to check biggest allowed capacity in the system.
> * Undo the change to check_misfit_status() and improve the function to
> avoid similar confusion in the future.
> * Split the patches differently. Exporting and sorting the capacity list
> is now patch 1; handling of affinity for misfit detection is patch 2.
>
> Changes since v3:
>
> * Update commit message of patch 2 to be less verbose
>
> Changes since v2:
>
> * Convert access of asym_cap_list to be rcu protected
> * Add new patch to sort the list in descending order
> * Move some declarations inside affinity check block
> * Remove now redundant check against max_cpu_capacity in check_misfit_status()
>
> Changes since v1:
>
> * Use asym_cap_list (thanks Dietmar) to iterate instead of iterating
> through every cpu, which Vincent was concerned about.
> * Use uclamped util to compare with capacity instead of util_fits_cpu()
> when iterating through capacities (Dietmar).
> * Update commit log with test results to better demonstrate the problem
>
> v1 discussion: https://lore.kernel.org/lkml/[email protected]/
> v2 discussion: https://lore.kernel.org/lkml/[email protected]/
> v3 discussion: https://lore.kernel.org/lkml/[email protected]/
> v4 discussion: https://lore.kernel.org/lkml/[email protected]/
> v5 discussion: https://lore.kernel.org/lkml/[email protected]/
> v6, v7 discussion: https://lore.kernel.org/lkml/[email protected]/
>
> Thanks!
>
> --
> Qais Yousef
>
> Qais Yousef (4):
> sched/topology: Export asym_capacity_list
> sched/fair: Check a task has a fitting cpu when updating misfit
> sched/topology: Remove max_cpu_capacity from root_domain
> sched/fair: Don't double balance_interval for migrate_misfit
>
> include/linux/sched.h | 1 +
> init/init_task.c | 1 +
> kernel/sched/fair.c | 79 +++++++++++++++++++++++++++++++----------
> kernel/sched/sched.h | 16 +++++++--
> kernel/sched/topology.c | 56 ++++++++++++++---------------
> 5 files changed, 104 insertions(+), 49 deletions(-)
>
> --
> 2.34.1
>