Introduce the SELECT_BIAS_PREV scheduler feature to reduce the task
migration rate.
It needs to be used with the UTIL_FITS_CAPACITY scheduler feature to
show benchmark improvements.
For scenarios where the system is under-utilized (CPUs are partly idle),
eliminate frequent task migrations from CPUs with sufficient remaining
capacity to completely idle CPUs by introducing a bias towards the
previous CPU if it is idle or has enough capacity left.
For scenarios where the system is fully or over-utilized (CPUs are
almost never idle), favor the previous CPU (rather than the target CPU)
if all CPUs are busy to minimize migrations. (suggested by Chen Yu)
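The two cases above can be sketched as a small userspace model. This is
only an illustration under stated assumptions: struct cpu_model,
cpu_is_acceptable() and pick_cpu() are hypothetical names, the capacity
check is a plain sum rather than the kernel's
task_fits_remaining_cpu_capacity(), the asymmetric-capacity check is left
out, and the idle-CPU search of select_idle_sibling() is reduced to a
linear scan.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified per-CPU state; not the kernel's struct rq. */
struct cpu_model {
	bool idle;              /* CPU currently has nothing to run */
	unsigned long capacity; /* total capacity of the CPU */
	unsigned long util;     /* utilization already consumed */
	int llc_id;             /* equal ids share a last-level cache */
};

static bool shares_cache(const struct cpu_model *cpus, int a, int b)
{
	return cpus[a].llc_id == cpus[b].llc_id;
}

/* Assumed stand-in for "idle or has enough capacity left". */
static bool cpu_is_acceptable(const struct cpu_model *c,
			      unsigned long task_util)
{
	return c->idle || c->util + task_util <= c->capacity;
}

/* Model of the selection order with and without the prev bias. */
static int pick_cpu(const struct cpu_model *cpus, int nr_cpus,
		    int prev, int target, unsigned long task_util,
		    bool bias_prev)
{
	bool prev_affine;
	int i;

	assert(prev >= 0 && prev < nr_cpus);
	assert(target >= 0 && target < nr_cpus);

	prev_affine = prev == target || shares_cache(cpus, prev, target);

	/* Under-utilized case: stay on prev if it is usable. */
	if (bias_prev && prev_affine &&
	    cpu_is_acceptable(&cpus[prev], task_util))
		return prev;

	if (cpu_is_acceptable(&cpus[target], task_util))
		return target;

	/* Without the bias, prev is only considered after target. */
	if (!bias_prev && prev != target && prev_affine &&
	    cpu_is_acceptable(&cpus[prev], task_util))
		return prev;

	/* Simplified search: first idle CPU sharing cache with target. */
	for (i = 0; i < nr_cpus; i++)
		if (cpus[i].idle && shares_cache(cpus, i, target))
			return i;

	/* Fully utilized case: prefer prev over target with the bias. */
	if (bias_prev && prev != target && prev_affine)
		return prev;

	return target;
}
```

In this model, a prev CPU that is busy but has room wins over an idle
target only when bias_prev is true, and when every CPU is saturated the
bias again selects prev where the unbiased path selects target.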
The following benchmarks are performed on a v6.5.5 kernel with
mitigations=off.
This speeds up the following hackbench workload on a 192-core, 2-socket
system (AMD EPYC 9654 96-Core Processor):
hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
from 49 s to 26 s (47% less elapsed time).
Elapsed time comparison:
Baseline:                             48-49 s
UTIL_FITS_CAPACITY:                   45-50 s
SELECT_BIAS_PREV:                     48-50 s
UTIL_FITS_CAPACITY+SELECT_BIAS_PREV:  26-27 s
The number of migrations drops significantly (-93% with both features
enabled), which may explain the speedup:
Baseline:                             117M cpu-migrations (9.355 K/sec)
UTIL_FITS_CAPACITY:                    67M cpu-migrations (5.470 K/sec)
SELECT_BIAS_PREV:                      75M cpu-migrations (5.674 K/sec)
UTIL_FITS_CAPACITY+SELECT_BIAS_PREV:    8M cpu-migrations (0.928 K/sec)
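For reference, the percentages quoted in this message can be re-derived
from the rounded figures above. The helper below is a hypothetical
illustration, not part of the patch: it computes the relative reduction
between a baseline value and a new value.

```c
#include <assert.h>

/* Relative reduction from 'before' to 'after', in percent. */
static double percent_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Calling percent_reduction(117, 8) on the migration counts (in millions)
gives roughly 93, and percent_reduction(49, 26) on the elapsed times (in
seconds) gives roughly 47.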
CPU utilization also improves:
Baseline:                             253.275 CPUs utilized
UTIL_FITS_CAPACITY:                   223.130 CPUs utilized
SELECT_BIAS_PREV:                     276.526 CPUs utilized
UTIL_FITS_CAPACITY+SELECT_BIAS_PREV:  309.872 CPUs utilized
Interestingly, the rate of context switches increases with the patch,
but this does not appear to hurt performance:
Baseline:                             445M context-switches (35.516 K/sec)
UTIL_FITS_CAPACITY:                   581M context-switches (47.548 K/sec)
SELECT_BIAS_PREV:                     655M context-switches (49.074 K/sec)
UTIL_FITS_CAPACITY+SELECT_BIAS_PREV:  597M context-switches (35.516 K/sec)
This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench. It turned out that replacing this raw
spinlock with a loop of 10000 cpu_relax() calls within do_idle() had
similar benefits.
This patch achieves a similar effect without the busy-waiting by
allowing select_task_rq to favor the previously used CPU based on the
utilization of that CPU.
Feedback is welcome. I am especially interested to learn whether this
patch has positive or detrimental effects on performance of other
workloads.
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Swapnil Sapkal <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Chen Yu <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Gautham R . Shenoy <[email protected]>
Cc: [email protected]
---
kernel/sched/fair.c | 28 ++++++++++++++++++++++++++--
kernel/sched/features.h | 6 ++++++
2 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8058058afb11..741d53b18d23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7173,15 +7173,30 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
lockdep_assert_irqs_disabled();
+ /*
+ * With the SELECT_BIAS_PREV feature, if the previous CPU is
+ * cache affine and the task fits within the prev cpu capacity,
+ * prefer the previous CPU to the target CPU to inhibit costly
+ * task migration.
+ */
+ if (sched_feat(SELECT_BIAS_PREV) &&
+ (prev == target || cpus_share_cache(prev, target)) &&
+ (available_idle_cpu(prev) || sched_idle_cpu(prev) ||
+ task_fits_remaining_cpu_capacity(task_util, prev)) &&
+ asym_fits_cpu(task_util, util_min, util_max, prev))
+ return prev;
+
if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
task_fits_remaining_cpu_capacity(task_util, target)) &&
asym_fits_cpu(task_util, util_min, util_max, target))
return target;
/*
- * If the previous CPU is cache affine and idle, don't be stupid:
+ * Without the SELECT_BIAS_PREV feature, fall back to the previous
+ * CPU if it is cache affine and idle when the target CPU is not.
*/
- if (prev != target && cpus_share_cache(prev, target) &&
+ if (!sched_feat(SELECT_BIAS_PREV) &&
+ prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev) ||
task_fits_remaining_cpu_capacity(task_util, prev)) &&
asym_fits_cpu(task_util, util_min, util_max, prev))
@@ -7254,6 +7269,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;
+ /*
+ * With the SELECT_BIAS_PREV feature, if the previous CPU is
+ * cache affine, prefer the previous CPU when all CPUs are busy
+ * to inhibit migration.
+ */
+ if (sched_feat(SELECT_BIAS_PREV) &&
+ prev != target && cpus_share_cache(prev, target))
+ return prev;
+
return target;
}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9a84a1401123..a56d6861c553 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,6 +103,12 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
*/
SCHED_FEAT(UTIL_FITS_CAPACITY, true)
+/*
+ * Bias runqueue selection towards the previous runqueue over the target
+ * runqueue.
+ */
+SCHED_FEAT(SELECT_BIAS_PREV, true)
+
SCHED_FEAT(LATENCY_WARN, false)
SCHED_FEAT(ALT_PERIOD, true)
--
2.39.2