2023-10-19 16:06:19

by Mathieu Desnoyers

Subject: [RFC PATCH v2 0/2] sched/fair migration reduction features

Hi,

This series introduces two new scheduler features: UTIL_FITS_CAPACITY
and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
a hackbench workload which leaves some idle CPU time on a 192-core AMD
EPYC.

The main metrics which are significantly improved are:

- cpu-migrations are reduced by 80%,
- CPU utilization is increased by 17%.

Feedback is welcome. I am especially interested to learn whether this
series has positive or detrimental effects on performance of other
workloads.

The main changes since v1, based on feedback from Chen Yu, are to keep a
20% margin of unused utilization in the capacity fit and to use
scale_rt_capacity(), which is more precise.
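
For reference, the capacity fit reuses the 20% headroom convention of the
existing fits_capacity() macro in kernel/sched/fair.c. A minimal userspace
sketch of that check (illustrative values only, not kernel code):

#include <stdbool.h>
#include <stdio.h>

/* Mirror of the fits_capacity() idea: util "fits" below ~80% of capacity. */
static bool fits_capacity_sketch(unsigned long util, unsigned long capacity)
{
        return util * 1280 < capacity * 1024;   /* util * 1.25 < capacity */
}

int main(void)
{
        /* e.g. 600 units of estimated utilization vs a capacity of 800 */
        printf("%d\n", fits_capacity_sketch(600, 800)); /* 750 < 800 -> fits */
        printf("%d\n", fits_capacity_sketch(700, 800)); /* 875 >= 800 -> no */
        return 0;
}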

Thanks,

Mathieu

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Swapnil Sapkal <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Chen Yu <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: [email protected]

Mathieu Desnoyers (2):
sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)
sched/fair: Introduce SELECT_BIAS_PREV to reduce migrations

kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++++-----
kernel/sched/features.h | 12 ++++++++
kernel/sched/sched.h | 5 +++
3 files changed, 77 insertions(+), 8 deletions(-)

--
2.39.2


2023-10-19 16:06:31

by Mathieu Desnoyers

Subject: [RFC PATCH v2 2/2] sched/fair: Introduce SELECT_BIAS_PREV to reduce migrations

Introduce the SELECT_BIAS_PREV scheduler feature to reduce the task
migration rate.

It needs to be used with the UTIL_FITS_CAPACITY scheduler feature to
show benchmark improvements.

For scenarios where the system is under-utilized (CPUs are partly idle),
eliminate frequent task migrations from CPUs with sufficient remaining
capacity towards completely idle CPUs by introducing a bias towards the
previous CPU if it is idle or has enough remaining capacity.

For scenarios where the system is fully or over-utilized (CPUs are
almost never idle), favor the previous CPU (rather than the target CPU)
if all CPUs are busy to minimize migrations. (suggested by Chen Yu)

The following benchmarks are performed on a v6.5.5 kernel with
mitigations=off.

This speeds up the following hackbench workload on a 192-core AMD EPYC
9654 96-Core Processor system (2 sockets):

hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 49s to 29s. (41% speedup)

Elapsed time comparison:

Baseline: 48-49 s
UTIL_FITS_CAPACITY: 39-40 s
SELECT_BIAS_PREV: 48-50 s
UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 29-30 s

We can observe that the number of migrations is reduced significantly
(-80%) with this patch, which may explain the speedup:

Baseline: 117M cpu-migrations (9.355 K/sec)
UTIL_FITS_CAPACITY: 48M cpu-migrations (3.977 K/sec)
SELECT_BIAS_PREV: 75M cpu-migrations (5.674 K/sec)
UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 23M cpu-migrations (2.316 K/sec)

The CPU utilization is also improved:

Baseline: 253.275 CPUs utilized
UTIL_FITS_CAPACITY: 271.367 CPUs utilized
SELECT_BIAS_PREV: 276.526 CPUs utilized
UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 303.356 CPUs utilized

Interestingly, the context-switch rate increases with the patch, but
it does not appear to be an issue performance-wise:

Baseline: 445M context-switches (35.516 K/sec)
UTIL_FITS_CAPACITY: 586M context-switches (48.823 K/sec)
SELECT_BIAS_PREV: 655M context-switches (49.074 K/sec)
UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 571M context-switches (57.714 K/sec)

This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench. It turned out that replacing this raw
spinlock with a loop of 10000x cpu_relax() within do_idle() had similar
benefits.

This patch achieves a similar effect without the busy-waiting by
allowing select_task_rq to favor the previously used CPU based on its
current utilization.

Feedback is welcome. I am especially interested to learn whether this
patch has positive or detrimental effects on performance of other
workloads.

Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Swapnil Sapkal <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Chen Yu <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: [email protected]
---
kernel/sched/fair.c | 28 ++++++++++++++++++++++++++--
kernel/sched/features.h | 6 ++++++
2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc86d1ffeb27..0aae1f15cb0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7164,15 +7164,30 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
lockdep_assert_irqs_disabled();

+ /*
+ * With the SELECT_BIAS_PREV feature, if the previous CPU is
+ * cache affine and the task fits within the prev cpu capacity,
+ * prefer the previous CPU to the target CPU to inhibit costly
+ * task migration.
+ */
+ if (sched_feat(SELECT_BIAS_PREV) &&
+ (prev == target || cpus_share_cache(prev, target)) &&
+ (available_idle_cpu(prev) || sched_idle_cpu(prev) ||
+ task_fits_remaining_cpu_capacity(task_util, prev)) &&
+ asym_fits_cpu(task_util, util_min, util_max, prev))
+ return prev;
+
if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
task_fits_remaining_cpu_capacity(task_util, target)) &&
asym_fits_cpu(task_util, util_min, util_max, target))
return target;

/*
- * If the previous CPU is cache affine and idle, don't be stupid:
+ * Without the SELECT_BIAS_PREV feature, use the previous CPU if
+ * it is cache affine and idle if the target cpu is not idle.
*/
- if (prev != target && cpus_share_cache(prev, target) &&
+ if (!sched_feat(SELECT_BIAS_PREV) &&
+ prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev) ||
task_fits_remaining_cpu_capacity(task_util, prev)) &&
asym_fits_cpu(task_util, util_min, util_max, prev))
@@ -7245,6 +7260,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;

+ /*
+ * With the SELECT_BIAS_PREV feature, if the previous CPU is
+ * cache affine, prefer the previous CPU when all CPUs are busy
+ * to inhibit migration.
+ */
+ if (sched_feat(SELECT_BIAS_PREV) &&
+ prev != target && cpus_share_cache(prev, target))
+ return prev;
+
return target;
}

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9a84a1401123..a56d6861c553 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,6 +103,12 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
*/
SCHED_FEAT(UTIL_FITS_CAPACITY, true)

+/*
+ * Bias runqueue selection towards the previous runqueue over the target
+ * runqueue.
+ */
+SCHED_FEAT(SELECT_BIAS_PREV, true)
+
SCHED_FEAT(LATENCY_WARN, false)

SCHED_FEAT(ALT_PERIOD, true)
--
2.39.2

2023-10-19 16:06:32

by Mathieu Desnoyers

Subject: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

Introduce the UTIL_FITS_CAPACITY scheduler feature. The runqueue
selection picks the previous, target, or recent runqueues if they have
enough remaining capacity to enqueue the task before scanning for an
idle cpu.

This feature is introduced in preparation for the SELECT_BIAS_PREV
scheduler feature.

The following benchmarks only cover the UTIL_FITS_CAPACITY feature.
Those are performed on a v6.5.5 kernel with mitigations=off.

The following hackbench workload on a 192-core AMD EPYC 9654 96-Core
Processor system (2 sockets) improves the wall time from 49s to 40s
(18% speedup).

hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

We can observe that the number of migrations is reduced significantly
with this patch (improvement):

Baseline: 117M cpu-migrations (9.355 K/sec)
With patch: 47M cpu-migrations (3.977 K/sec)

The task-clock utilization is increased (improvement):

Baseline: 253.275 CPUs utilized
With patch: 271.367 CPUs utilized

The number of context-switches is increased (degradation):

Baseline: 445M context-switches (35.516 K/sec)
With patch: 586M context-switches (48.823 K/sec)

Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Swapnil Sapkal <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Chen Yu <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Gautham R. Shenoy <[email protected]>
Cc: [email protected]
---
Changes since v1:
- Use scale_rt_capacity(),
- Use fits_capacity which leaves 20% unused capacity to account for
metrics inaccuracy.
---
kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 6 ++++++
kernel/sched/sched.h | 5 +++++
3 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d9c2482c5a3..cc86d1ffeb27 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4497,6 +4497,28 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
trace_sched_util_est_se_tp(&p->se);
}

+static unsigned long scale_rt_capacity(int cpu);
+
+/*
+ * Returns true if adding the task utilization to the estimated
+ * utilization of the runnable tasks on @cpu does not exceed the
+ * capacity of @cpu.
+ *
+ * This considers only the utilization of _runnable_ tasks on the @cpu
+ * runqueue, excluding blocked and sleeping tasks. This is achieved by
+ * using the runqueue util_est.enqueued.
+ */
+static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
+ int cpu)
+{
+ unsigned long total_util;
+
+ if (!sched_util_fits_capacity_active())
+ return false;
+ total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
+ return fits_capacity(total_util, scale_rt_capacity(cpu));
+}
+
static inline int util_fits_cpu(unsigned long util,
unsigned long uclamp_min,
unsigned long uclamp_max,
@@ -7124,12 +7146,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
int i, recent_used_cpu;

/*
- * On asymmetric system, update task utilization because we will check
- * that the task fits with cpu's capacity.
+ * With the UTIL_FITS_CAPACITY feature and on asymmetric system,
+ * update task utilization because we will check that the task
+ * fits with cpu's capacity.
*/
- if (sched_asym_cpucap_active()) {
+ if (sched_util_fits_capacity_active() || sched_asym_cpucap_active()) {
sync_entity_load_avg(&p->se);
task_util = task_util_est(p);
+ }
+ if (sched_asym_cpucap_active()) {
util_min = uclamp_eff_value(p, UCLAMP_MIN);
util_max = uclamp_eff_value(p, UCLAMP_MAX);
}
@@ -7139,7 +7164,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
lockdep_assert_irqs_disabled();

- if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+ if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
+ task_fits_remaining_cpu_capacity(task_util, target)) &&
asym_fits_cpu(task_util, util_min, util_max, target))
return target;

@@ -7147,7 +7173,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* If the previous CPU is cache affine and idle, don't be stupid:
*/
if (prev != target && cpus_share_cache(prev, target) &&
- (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+ (available_idle_cpu(prev) || sched_idle_cpu(prev) ||
+ task_fits_remaining_cpu_capacity(task_util, prev)) &&
asym_fits_cpu(task_util, util_min, util_max, prev))
return prev;

@@ -7173,7 +7200,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
- (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
+ (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu) ||
+ task_fits_remaining_cpu_capacity(task_util, recent_used_cpu)) &&
cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
return recent_used_cpu;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..9a84a1401123 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -97,6 +97,12 @@ SCHED_FEAT(WA_BIAS, true)
SCHED_FEAT(UTIL_EST, true)
SCHED_FEAT(UTIL_EST_FASTUP, true)

+/*
+ * Select the previous, target, or recent runqueue if they have enough
+ * remaining capacity to enqueue the task. Requires UTIL_EST.
+ */
+SCHED_FEAT(UTIL_FITS_CAPACITY, true)
+
SCHED_FEAT(LATENCY_WARN, false)

SCHED_FEAT(ALT_PERIOD, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e93e006a942b..463e75084aed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2090,6 +2090,11 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =

#endif /* SCHED_DEBUG */

+static __always_inline bool sched_util_fits_capacity_active(void)
+{
+ return sched_feat(UTIL_EST) && sched_feat(UTIL_FITS_CAPACITY);
+}
+
extern struct static_key_false sched_numa_balancing;
extern struct static_key_false sched_schedstats;

--
2.39.2

2023-10-23 14:11:36

by Dietmar Eggemann

Subject: Re: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

On 19/10/2023 18:05, Mathieu Desnoyers wrote:
> Introduce the UTIL_FITS_CAPACITY scheduler feature. The runqueue
> selection picks the previous, target, or recent runqueues if they have
> enough remaining capacity to enqueue the task before scanning for an
> idle cpu.
>
> This feature is introduced in preparation for the SELECT_BIAS_PREV
> scheduler feature.
>
> The following benchmarks only cover the UTIL_FITS_CAPACITY feature.
> Those are performed on a v6.5.5 kernel with mitigations=off.
>
> The following hackbench workload on a 192 cores AMD EPYC 9654 96-Core
> Processor (over 2 sockets) improves the wall time from 49s to 40s
> (18% speedup).
>
> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>
> We can observe that the number of migrations is reduced significantly
> with this patch (improvement):
>
> Baseline: 117M cpu-migrations (9.355 K/sec)
> With patch: 47M cpu-migrations (3.977 K/sec)
>
> The task-clock utilization is increased (improvement):
>
> Baseline: 253.275 CPUs utilized
> With patch: 271.367 CPUs utilized
>
> The number of context-switches is increased (degradation):
>
> Baseline: 445M context-switches (35.516 K/sec)
> With patch: 586M context-switches (48.823 K/sec)
>

I haven't run any benchmarks yet to prove the benefit of this
prefer-packing-over-spreading or migration-avoidance algorithm.

[...]

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4497,6 +4497,28 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> trace_sched_util_est_se_tp(&p->se);
> }
>
> +static unsigned long scale_rt_capacity(int cpu);
> +
> +/*
> + * Returns true if adding the task utilization to the estimated
> + * utilization of the runnable tasks on @cpu does not exceed the
> + * capacity of @cpu.
> + *
> + * This considers only the utilization of _runnable_ tasks on the @cpu
> + * runqueue, excluding blocked and sleeping tasks. This is achieved by
> + * using the runqueue util_est.enqueued.
> + */
> +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
> + int cpu)

This is almost like the existing task_fits_cpu(p, cpu) (used in Capacity
Aware Scheduling (CAS) for Asymmetric CPU capacity systems) except the
latter only uses `util = task_util_est(p)` and deals with uclamp as well
and only tests whether p could fit on the CPU.

Or like find_energy_efficient_cpu() (feec(), used in
Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get:

max(util_avg(CPU + p), util_est(CPU + p))

feec()
    ...
    for (; pd; pd = pd->next)
        ...
        util = cpu_util(cpu, p, cpu, 0);
        ...
        fits = util_fits_cpu(util, util_min, util_max, cpu)
                                   ^^^^^^^^^^^^^^^^^^
                                  not used when uclamp is not active (1)
            ...
            capacity = capacity_of(cpu)
            fits = fits_capacity(util, capacity)
            if (!uclamp_is_used()) (1)
                return fits

So not introducing new functions like task_fits_remaining_cpu_capacity()
in this area and using existing one would be good.

> +{
> + unsigned long total_util;
> +
> + if (!sched_util_fits_capacity_active())
> + return false;
> + total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
> + return fits_capacity(total_util, scale_rt_capacity(cpu));

Why not use:

static unsigned long capacity_of(int cpu)
    return cpu_rq(cpu)->cpu_capacity;

which is maintained in update_cpu_capacity() as scale_rt_capacity(cpu)?

[...]

> @@ -7173,7 +7200,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if (recent_used_cpu != prev &&
> recent_used_cpu != target &&
> cpus_share_cache(recent_used_cpu, target) &&
> - (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> + (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu) ||
> + task_fits_remaining_cpu_capacity(task_util, recent_used_cpu)) &&
> cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
> asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
> return recent_used_cpu;
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..9a84a1401123 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -97,6 +97,12 @@ SCHED_FEAT(WA_BIAS, true)
> SCHED_FEAT(UTIL_EST, true)
> SCHED_FEAT(UTIL_EST_FASTUP, true)

IMHO, asymmetric CPU capacity systems would have to disable the sched
feature UTIL_FITS_CAPACITY. Otherwise CAS could deliver different
results. task_fits_remaining_cpu_capacity() and asym_fits_cpu() work
slightly different.

[...]

2023-10-23 15:05:49

by Mathieu Desnoyers

Subject: Re: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

On 2023-10-23 10:11, Dietmar Eggemann wrote:
> On 19/10/2023 18:05, Mathieu Desnoyers wrote:

[...]
>>
>> +static unsigned long scale_rt_capacity(int cpu);
>> +
>> +/*
>> + * Returns true if adding the task utilization to the estimated
>> + * utilization of the runnable tasks on @cpu does not exceed the
>> + * capacity of @cpu.
>> + *
>> + * This considers only the utilization of _runnable_ tasks on the @cpu
>> + * runqueue, excluding blocked and sleeping tasks. This is achieved by
>> + * using the runqueue util_est.enqueued.
>> + */
>> +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
>> + int cpu)
>
> This is almost like the existing task_fits_cpu(p, cpu) (used in Capacity
> Aware Scheduling (CAS) for Asymmetric CPU capacity systems) except the
> latter only uses `util = task_util_est(p)` and deals with uclamp as well
> and only tests whether p could fit on the CPU.

This is indeed a major difference between how asym capacity check works
and what is introduced here:

asym capacity check only checks whether the given task theoretically
fits in the cpu if that cpu was completely idle, without considering the
current cpu utilization.

My approach is to consider the current util_est of the cpu to check
whether the task fits in the remaining capacity.

I did not want to use the existing task_fits_cpu() helper because the
notions of uclamp bounds appear to be heavily tied to the fact that it
checks whether the task fits in an _idle_ runqueue, whereas the check I
am introducing here is much more restrictive: it checks that the task
fits on the runqueue within the remaining capacity.
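
To make the contrast concrete, a toy standalone sketch of the two checks
(invented numbers, not kernel code, reusing the fits_capacity() 20% margin):

#include <stdbool.h>
#include <stdio.h>

static bool fits(unsigned long util, unsigned long cap)
{
        return util * 1280 < cap * 1024;        /* 20% headroom */
}

int main(void)
{
        unsigned long cpu_cap = 1024;           /* capacity when fully idle */
        unsigned long rq_util_est = 700;        /* runnable tasks already enqueued */
        unsigned long task_util = 200;

        /* asym-capacity style: would the task fit this CPU if it were idle? */
        printf("fits idle cpu:           %d\n", fits(task_util, cpu_cap));
        /* check introduced here: does it fit the *remaining* capacity? */
        printf("fits remaining capacity: %d\n",
               fits(rq_util_est + task_util, cpu_cap));
        return 0;
}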

>
> Or like find_energy_efficient_cpu() (feec(), used in
> Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get:
>
> max(util_avg(CPU + p), util_est(CPU + p))

I've tried using cpu_util(), but unfortunately anything that considers
blocked/sleeping tasks in its utilization total does not work for my
use-case.

From cpu_util():

* CPU utilization is the sum of running time of runnable tasks plus the
* recent utilization of currently non-runnable tasks on that CPU.

>
> feec()
>     ...
>     for (; pd; pd = pd->next)
>         ...
>         util = cpu_util(cpu, p, cpu, 0);
>         ...
>         fits = util_fits_cpu(util, util_min, util_max, cpu)
>                                    ^^^^^^^^^^^^^^^^^^
>                                   not used when uclamp is not active (1)
>             ...
>             capacity = capacity_of(cpu)
>             fits = fits_capacity(util, capacity)
>             if (!uclamp_is_used()) (1)
>                 return fits
>
> So not introducing new functions like task_fits_remaining_cpu_capacity()
> in this area and using existing one would be good.

If the notion of uclamp is not tied to the way asym capacity check is
done against a theoretically idle runqueue, I'd be OK with using this,
but so far both appear to be very much tied.

When I stumbled on this fundamental difference between asym cpu capacity
check and the check introduced here, I've started wondering whether the
asym cpu capacity check would benefit from considering the target cpu
current utilization as well.

>
>> +{
>> + unsigned long total_util;
>> +
>> + if (!sched_util_fits_capacity_active())
>> + return false;
>> + total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
>> + return fits_capacity(total_util, scale_rt_capacity(cpu));
>
> Why not use:
>
> static unsigned long capacity_of(int cpu)
>     return cpu_rq(cpu)->cpu_capacity;
>
> which is maintained in update_cpu_capacity() as scale_rt_capacity(cpu)?

The reason for preferring scale_rt_capacity(cpu) over capacity_of(cpu)
is that update_cpu_capacity() only runs periodically every
balance-interval, therefore providing a coarse-grained remaining
capacity approximation with respect to irq, rt, dl, and thermal utilization.

If it turns out that being coarse-grained is good enough, we may be able
to save some cycles by using capacity_of(), but not without carefully
considering the impacts of being imprecise.
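
For what it's worth, a paraphrased sketch (not the exact kernel code) of the
quantity both helpers provide; the difference under discussion is only *when*
the pressure subtraction happens, not *what* is subtracted:

#include <stdio.h>

/*
 * scale_rt_capacity() computes this on demand from the current irq/rt/dl/
 * thermal pressure signals; capacity_of() returns the same quantity cached
 * by the last update_cpu_capacity() run during load balancing.
 */
static unsigned long remaining_cfs_capacity(unsigned long max_cap,
                                            unsigned long rt, unsigned long dl,
                                            unsigned long irq,
                                            unsigned long thermal)
{
        unsigned long used = rt + dl + irq + thermal;

        return used >= max_cap ? 1 : max_cap - used;
}

int main(void)
{
        /* invented pressure numbers, purely for illustration */
        printf("%lu\n", remaining_cfs_capacity(1024, 50, 0, 10, 0)); /* 964 */
        return 0;
}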

>
> [...]
>
>> @@ -7173,7 +7200,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> if (recent_used_cpu != prev &&
>> recent_used_cpu != target &&
>> cpus_share_cache(recent_used_cpu, target) &&
>> - (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
>> + (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu) ||
>> + task_fits_remaining_cpu_capacity(task_util, recent_used_cpu)) &&
>> cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
>> asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
>> return recent_used_cpu;
>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>> index ee7f23c76bd3..9a84a1401123 100644
>> --- a/kernel/sched/features.h
>> +++ b/kernel/sched/features.h
>> @@ -97,6 +97,12 @@ SCHED_FEAT(WA_BIAS, true)
>> SCHED_FEAT(UTIL_EST, true)
>> SCHED_FEAT(UTIL_EST_FASTUP, true)
>
> IMHO, asymmetric CPU capacity systems would have to disable the sched
> feature UTIL_FITS_CAPACITY. Otherwise CAS could deliver different
> results. task_fits_remaining_cpu_capacity() and asym_fits_cpu() work
> slightly different.

I don't think they should be mutually exclusive. We should look into the
differences between those two more closely to make them work nicely
together instead. For instance, why does asym capacity only consider
whether tasks fit in a theoretically idle runqueue, when it could use
the current utilization of the runqueue to check that the task fits in
the remaining capacity ?

Unfortunately I don't have a machine with asym cpu to test locally.

Thanks for your feedback !

Mathieu


>
> [...]
>

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-10-24 06:11:11

by Chen Yu

Subject: Re: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

On 2023-10-23 at 11:04:49 -0400, Mathieu Desnoyers wrote:
> On 2023-10-23 10:11, Dietmar Eggemann wrote:
> > On 19/10/2023 18:05, Mathieu Desnoyers wrote:
>
> [...]
> > > +static unsigned long scale_rt_capacity(int cpu);
> > > +
> > > +/*
> > > + * Returns true if adding the task utilization to the estimated
> > > + * utilization of the runnable tasks on @cpu does not exceed the
> > > + * capacity of @cpu.
> > > + *
> > > + * This considers only the utilization of _runnable_ tasks on the @cpu
> > > + * runqueue, excluding blocked and sleeping tasks. This is achieved by
> > > + * using the runqueue util_est.enqueued.
> > > + */
> > > +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
> > > + int cpu)
> >
> > Or like find_energy_efficient_cpu() (feec(), used in
> > Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get:
> >
> > max(util_avg(CPU + p), util_est(CPU + p))
>
> I've tried using cpu_util(), but unfortunately anything that considers
> blocked/sleeping tasks in its utilization total does not work for my
> use-case.
>
> From cpu_util():
>
> * CPU utilization is the sum of running time of runnable tasks plus the
> * recent utilization of currently non-runnable tasks on that CPU.
>

I thought cpu_util() indicates the decayed utilization sum of tasks that were
once "running" on this CPU, but does not sum up the "util/load" of
blocked/sleeping tasks?

accumulate_sum()
    /* only the running task's util will be sum up */
    if (running)
        sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

    WRITE_ONCE(sa->util_avg, sa->util_sum / divider);

thanks,
Chenyu

2023-10-24 14:11:06

by Dietmar Eggemann

Subject: Re: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

On 23/10/2023 17:04, Mathieu Desnoyers wrote:
> On 2023-10-23 10:11, Dietmar Eggemann wrote:
>> On 19/10/2023 18:05, Mathieu Desnoyers wrote:
>
> [...]
>>>   +static unsigned long scale_rt_capacity(int cpu);
>>> +
>>> +/*
>>> + * Returns true if adding the task utilization to the estimated
>>> + * utilization of the runnable tasks on @cpu does not exceed the
>>> + * capacity of @cpu.
>>> + *
>>> + * This considers only the utilization of _runnable_ tasks on the @cpu
>>> + * runqueue, excluding blocked and sleeping tasks. This is achieved by
>>> + * using the runqueue util_est.enqueued.
>>> + */
>>> +static inline bool task_fits_remaining_cpu_capacity(unsigned long
>>> task_util,
>>> +                            int cpu)
>>
>> This is almost like the existing task_fits_cpu(p, cpu) (used in Capacity
>> Aware Scheduling (CAS) for Asymmetric CPU capacity systems) except the
>> latter only uses `util = task_util_est(p)` and deals with uclamp as well
>> and only tests whether p could fit on the CPU.
>
> This is indeed a major difference between how asym capacity check works
> and what is introduced here:
>
> asym capacity check only checks whether the given task theoretically
> fits in the cpu if that cpu was completely idle, without considering the
> current cpu utilization.

Yeah, asymmetric CPU capacity systems have to make sure that p fits on
the idle/sched_idle CPU, hence the use of sync_entity_load_avg() and
asym_fits_cpu().

> My approach is to consider the current util_est of the cpu to check
> whether the task fits in the remaining capacity.

True.

> I did not want to use the existing task_fits_cpu() helper because the
> notions of uclamp bounds appear to be heavily tied to the fact that it
> checks whether the task fits in an _idle_ runqueue, whereas the check I
> am introducing here is much more restrictive: it checks that the task
> fits on the runqueue within the remaining capacity.

I see. Essentially what you do is

util_fits_cpu(util_est(CPU + p), 0, 1024, CPU) in !uclamp_is_used()

The uclamp_is_used() case is task-centric though. (*)

>> Or like find_energy_efficient_cpu() (feec(), used in
>> Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to
>> get:
>>
>>    max(util_avg(CPU + p), util_est(CPU + p))
>
> I've tried using cpu_util(), but unfortunately anything that considers
> blocked/sleeping tasks in its utilization total does not work for my
> use-case.
>
> From cpu_util():
>
>  * CPU utilization is the sum of running time of runnable tasks plus the
>  * recent utilization of currently non-runnable tasks on that CPU.

OK, I see. Occasions in which `util_avg(CPU + p) > util_est(CPU + p)`
would ruin it for your use-case.

>> feec()
>>      ...
>>      for (; pd; pd = pd->next)
>>          ...
>>          util = cpu_util(cpu, p, cpu, 0);
>>          ...
>>          fits = util_fits_cpu(util, util_min, util_max, cpu)
>>                                     ^^^^^^^^^^^^^^^^^^
>>                                    not used when uclamp is not active (1)
>>              ...
>>              capacity = capacity_of(cpu)
>>              fits = fits_capacity(util, capacity)
>>              if (!uclamp_is_used()) (1)
>>                  return fits
>>
>> So not introducing new functions like task_fits_remaining_cpu_capacity()
>> in this area and using existing one would be good.
>
> If the notion of uclamp is not tied to the way asym capacity check is
> done against a theoretically idle runqueue, I'd be OK with using this,
> but so far both appear to be very much tied.

Yeah, uclamp_is_used() scenarios are more complicated (see *).

> When I stumbled on this fundamental difference between asym cpu capacity
> check and the check introduced here, I've started wondering whether the
> asym cpu capacity check would benefit from considering the target cpu
> current utilization as well.

We just adapted select_idle_sibling() for asymmetric CPU capacity
systems by adding the asym_fits_cpu() to the idle/sched_idle check.

For me so far sis() is all about finding an idle CPU and not task packing.

>>> +{
>>> +    unsigned long total_util;
>>> +
>>> +    if (!sched_util_fits_capacity_active())
>>> +        return false;
>>> +    total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) +
>>> task_util;
>>> +    return fits_capacity(total_util, scale_rt_capacity(cpu));
>>
>> Why not use:
>>
>> static unsigned long capacity_of(int cpu)
>>      return cpu_rq(cpu)->cpu_capacity;
>>
>> which is maintained in update_cpu_capacity() as scale_rt_capacity(cpu)?
>
> The reason for preferring scale_rt_capacity(cpu) over capacity_of(cpu)
> is that update_cpu_capacity() only runs periodically every
> balance-interval, therefore providing a coarse-grained remaining
> capacity approximation with respect to irq, rt, dl, and thermal
> utilization.
>
> If it turns out that being coarse-grained is good enough, we may be able
> to save some cycles by using capacity_of(), but not without carefully
> considering the impacts of being imprecise.

OK, I see. We normally consider capacity_of(cpu) as accurate enough.

[...]

>>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>>> index ee7f23c76bd3..9a84a1401123 100644
>>> --- a/kernel/sched/features.h
>>> +++ b/kernel/sched/features.h
>>> @@ -97,6 +97,12 @@ SCHED_FEAT(WA_BIAS, true)
>>>   SCHED_FEAT(UTIL_EST, true)
>>>   SCHED_FEAT(UTIL_EST_FASTUP, true)
>>
>> IMHO, asymmetric CPU capacity systems would have to disable the sched
>> feature UTIL_FITS_CAPACITY. Otherwise CAS could deliver different
>> results. task_fits_remaining_cpu_capacity() and asym_fits_cpu() work
>> slightly different.
>
> I don't think they should be mutually exclusive. We should look into the
> differences between those two more closely to make them work nicely
> together instead. For instance, why does asym capacity only consider
> whether tasks fit in a theoretically idle runqueue, when it could use
> the current utilization of the runqueue to check that the task fits in
> the remaining capacity ?

We have EAS (feec()) for this on asymmetric CPU capacity systems (as our
per-performance_domain packing strategy), which only works when
!overutilized. When overutilized, we just need asym_fits_cpu()
(select_idle_capacity() -> util_fits_cpu()) to select a fitting
idle/sched_idle CPU in CAS which includes the uclamp handling.

[...]

2023-10-24 14:49:55

by Mathieu Desnoyers

Subject: Re: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

On 2023-10-24 02:10, Chen Yu wrote:
> On 2023-10-23 at 11:04:49 -0400, Mathieu Desnoyers wrote:
>> On 2023-10-23 10:11, Dietmar Eggemann wrote:
>>> On 19/10/2023 18:05, Mathieu Desnoyers wrote:
>>
>> [...]
>>>> +static unsigned long scale_rt_capacity(int cpu);
>>>> +
>>>> +/*
>>>> + * Returns true if adding the task utilization to the estimated
>>>> + * utilization of the runnable tasks on @cpu does not exceed the
>>>> + * capacity of @cpu.
>>>> + *
>>>> + * This considers only the utilization of _runnable_ tasks on the @cpu
>>>> + * runqueue, excluding blocked and sleeping tasks. This is achieved by
>>>> + * using the runqueue util_est.enqueued.
>>>> + */
>>>> +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
>>>> + int cpu)
>>>
>>> Or like find_energy_efficient_cpu() (feec(), used in
>>> Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get:
>>>
>>> max(util_avg(CPU + p), util_est(CPU + p))
>>
>> I've tried using cpu_util(), but unfortunately anything that considers
>> blocked/sleeping tasks in its utilization total does not work for my
>> use-case.
>>
>> From cpu_util():
>>
>> * CPU utilization is the sum of running time of runnable tasks plus the
>> * recent utilization of currently non-runnable tasks on that CPU.
>>
>
> I thought cpu_util() indicates the utilization decay sum of task that was once
> "running" on this CPU, but will not sum up the "util/load" of the blocked/sleeping
> task?
>
> accumulate_sum()
> /* only the running task's util will be sum up */
> if (running)
> sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
>
> WRITE_ONCE(sa->util_avg, sa->util_sum / divider);

The accumulation into the cfs_rq->avg.util_sum indeed only happens when the task
is running, which means that the task does not actively contribute to increment
the util_sum when it is blocked/sleeping.

However, when the task is blocked/sleeping, the task is still attached to the
runqueue, and therefore its historic util_sum still contributes to the cfs_rq
util_sum/util_avg. This completely differs from what happens when the task is
migrated to a different runqueue, in which case its util_sum contribution is
entirely removed from the cfs_rq util_sum:

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        [...]
        update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH)
        [...]

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        [...]
        if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
                action |= DO_DETACH;
        [...]

static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        [...]
        if (!se->avg.last_update_time && (flags & DO_ATTACH)) {

                /*
                 * DO_ATTACH means we're here from enqueue_entity().
                 * !last_update_time means we've passed through
                 * migrate_task_rq_fair() indicating we migrated.
                 *
                 * IOW we're enqueueing a task on a new CPU.
                 */
                attach_entity_load_avg(cfs_rq, se);
                update_tg_load_avg(cfs_rq);

        } else if (flags & DO_DETACH) {
                /*
                 * DO_DETACH means we're here from dequeue_entity()
                 * and we are migrating task out of the CPU.
                 */
                detach_entity_load_avg(cfs_rq, se);
                update_tg_load_avg(cfs_rq);
        [...]

In comparison, util_est_enqueue()/util_est_dequeue() are called from enqueue_task_fair()
and dequeue_task_fair(), which include blocked/sleeping tasks scenarios. Therefore, util_est
only considers runnable tasks in its cfs_rq->avg.util_est.enqueued.

The current rq utilization total used for rq selection should not include historic
utilization of all blocked/sleeping tasks, because we are taking a decision to bring
back a recently blocked/sleeping task onto a runqueue at that point. Considering
the historic util_sum from the set of other blocked/sleeping tasks still attached to that
runqueue in the current utilization mistakenly makes the rq selection think that the rq is
busier than it really is.

I suspect that cpu_util_without() is a half-successful attempt at solving this
by removing the task p from the considered utilization, but it does not take
into account scenarios where many other tasks happen to be blocked/sleeping as
well.
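
A toy numeric example of the difference (invented numbers, standalone, not
kernel code):

#include <stdio.h>

int main(void)
{
        /*
         * One runnable task (util ~300) plus two recently blocked tasks
         * (util ~200 each) still attached to the runqueue.
         */
        unsigned long util_avg_cpu = 300 + 200 + 200;   /* includes blocked */
        unsigned long util_est_enqueued = 300;          /* runnable only */
        unsigned long waking_task_util = 250;
        unsigned long capacity = 1024;

        /* a util_avg based check sees the rq as too busy for the waking task */
        printf("util_avg check: %d\n",
               (util_avg_cpu + waking_task_util) * 1280 < capacity * 1024);
        /* the util_est based check used in this series sees it as fitting */
        printf("util_est check: %d\n",
               (util_est_enqueued + waking_task_util) * 1280 < capacity * 1024);
        return 0;
}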

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-10-24 15:04:23

by Dietmar Eggemann

Subject: Re: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

On 24/10/2023 08:10, Chen Yu wrote:
> On 2023-10-23 at 11:04:49 -0400, Mathieu Desnoyers wrote:
>> On 2023-10-23 10:11, Dietmar Eggemann wrote:
>>> On 19/10/2023 18:05, Mathieu Desnoyers wrote:

[...]

>>> Or like find_energy_efficient_cpu() (feec(), used in
>>> Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get:
>>>
>>> max(util_avg(CPU + p), util_est(CPU + p))
>>
>> I've tried using cpu_util(), but unfortunately anything that considers
>> blocked/sleeping tasks in its utilization total does not work for my
>> use-case.
>>
>> From cpu_util():
>>
>> * CPU utilization is the sum of running time of runnable tasks plus the
>> * recent utilization of currently non-runnable tasks on that CPU.
>>
>
> I thought cpu_util() indicates the utilization decay sum of task that was once
> "running" on this CPU, but will not sum up the "util/load" of the blocked/sleeping
> task?

cpu_util() here refers to:

cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)

which when called with (cpu, p, cpu, 0) and task_cpu(p) != cpu returns:

max(util_avg(CPU + p), util_est(CPU + p))

The term `CPU utilization` in cpu_util()'s header stands for
cfs_rq->avg.util_avg.

It does not sum up the utilization of blocked tasks but it can contain
it. They have to be blocked tasks and not tasks which once ran on the
cfs_rq but have since migrated away, since we subtract the utilization
of tasks which are migrating away from the cfs_rq
(cfs_rq->removed.util_avg in remove_entity_load_avg() and
update_cfs_rq_load_avg()).

> accumulate_sum()
> /* only the running task's util will be sum up */
> if (running)
> sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
>
> WRITE_ONCE(sa->util_avg, sa->util_sum / divider);

__update_load_avg_cfs_rq()

    ___update_load_sum(..., cfs_rq->curr != NULL
                            ^^^^^^^^^^^^^^^^^^^^
                            running
        accumulate_sum()

            if (periods)
                /* decay _sum */
                sa->util_sum = decay_load(sa->util_sum, ...)

            if (load)
                /* decay and accrue _sum */
                contrib = __accumulate_pelt_segments(...)

When crossing periods we decay the old _sum and when additionally load
!= 0 we decay and accrue the new _sum as well.
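
For completeness, a toy illustration of the period decay (standalone, not
kernel code; PELT halves a contribution every 32 periods of 1024us, i.e.
y = 0.5^(1/32)):

#include <math.h>
#include <stdio.h>

int main(void)
{
        double y = pow(0.5, 1.0 / 32.0);        /* per-period decay factor */
        double util_sum = 1000.0;               /* some old accumulated _sum */
        int periods = 64;                       /* periods crossed with no load */

        /* only decay, no accrual, since load == 0 over these periods */
        util_sum *= pow(y, periods);
        printf("after %d periods: %.1f\n", periods, util_sum); /* ~250.0 */
        return 0;
}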

2023-10-25 07:21:20

by Chen Yu

Subject: Re: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

On 2023-10-24 at 10:49:37 -0400, Mathieu Desnoyers wrote:
> On 2023-10-24 02:10, Chen Yu wrote:
> > On 2023-10-23 at 11:04:49 -0400, Mathieu Desnoyers wrote:
> > > On 2023-10-23 10:11, Dietmar Eggemann wrote:
> > > > On 19/10/2023 18:05, Mathieu Desnoyers wrote:
> > >
> > > [...]
> > > > > +static unsigned long scale_rt_capacity(int cpu);
> > > > > +
> > > > > +/*
> > > > > + * Returns true if adding the task utilization to the estimated
> > > > > + * utilization of the runnable tasks on @cpu does not exceed the
> > > > > + * capacity of @cpu.
> > > > > + *
> > > > > + * This considers only the utilization of _runnable_ tasks on the @cpu
> > > > > + * runqueue, excluding blocked and sleeping tasks. This is achieved by
> > > > > + * using the runqueue util_est.enqueued.
> > > > > + */
> > > > > +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
> > > > > + int cpu)
> > > >
> > > > Or like find_energy_efficient_cpu() (feec(), used in
> > > > Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get:
> > > >
> > > > max(util_avg(CPU + p), util_est(CPU + p))
> > >
> > > I've tried using cpu_util(), but unfortunately anything that considers
> > > blocked/sleeping tasks in its utilization total does not work for my
> > > use-case.
> > >
> > > From cpu_util():
> > >
> > > * CPU utilization is the sum of running time of runnable tasks plus the
> > > * recent utilization of currently non-runnable tasks on that CPU.
> > >
> >
> > I thought cpu_util() indicates the utilization decay sum of task that was once
> > "running" on this CPU, but will not sum up the "util/load" of the blocked/sleeping
> > task?
> >
> > accumulate_sum()
> > /* only the running task's util will be sum up */
> > if (running)
> > sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
> >
> > WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
>
> The accumulation into the cfs_rq->avg.util_sum indeed only happens when the task
> is running, which means that the task does not actively contribute to increment
> the util_sum when it is blocked/sleeping.
>
> However, when the task is blocked/sleeping, the task is still attached to the
> runqueue, and therefore its historic util_sum still contributes to the cfs_rq
> util_sum/util_avg. This completely differs from what happens when the task is
> migrated to a different runqueue, in which case its util_sum contribution is
> entirely removed from the cfs_rq util_sum:
>
> static void
> enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> [...]
> update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH)
> [...]
>
> static void
> dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> [...]
> if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> action |= DO_DETACH;
> [...]
>
> static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> [...]
> if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>
> /*
> * DO_ATTACH means we're here from enqueue_entity().
> * !last_update_time means we've passed through
> * migrate_task_rq_fair() indicating we migrated.
> *
> * IOW we're enqueueing a task on a new CPU.
> */
> attach_entity_load_avg(cfs_rq, se);
> update_tg_load_avg(cfs_rq);
>
> } else if (flags & DO_DETACH) {
> /*
> * DO_DETACH means we're here from dequeue_entity()
> * and we are migrating task out of the CPU.
> */
> detach_entity_load_avg(cfs_rq, se);
> update_tg_load_avg(cfs_rq);
> [...]
>
> In comparison, util_est_enqueue()/util_est_dequeue() are called from enqueue_task_fair()
> and dequeue_task_fair(), which include blocked/sleeping tasks scenarios. Therefore, util_est
> only considers runnable tasks in its cfs_rq->avg.util_est.enqueued.
>
> The current rq utilization total used for rq selection should not include historic
> utilization of all blocked/sleeping tasks, because we are taking a decision to bring
> back a recently blocked/sleeping task onto a runqueue at that point. Considering
> the historic util_sum from the set of other blocked/sleeping tasks still attached to that
> runqueue in the current utilization mistakenly makes the rq selection think that the rq is
> busier than it really is.
>

Thanks for the detailed description, it is very helpful! Now I understand that
using cpu_util() could overestimate the busyness of the CPU for the
UTIL_FITS_CAPACITY check.

> I suspect that cpu_util_without() is an half-successful attempt at solving this by removing
> the task p from the considered utilization, but it does not take into account scenarios where many
> other tasks happen to be blocked/sleeping as well.

Agree, those non-migrated tasks could contribute to cfs_rq's util_avg.

thanks,
Chenyu

2023-10-25 07:46:41

by Chen Yu

Subject: Re: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

On 2023-10-24 at 17:03:25 +0200, Dietmar Eggemann wrote:
> On 24/10/2023 08:10, Chen Yu wrote:
> > On 2023-10-23 at 11:04:49 -0400, Mathieu Desnoyers wrote:
> >> On 2023-10-23 10:11, Dietmar Eggemann wrote:
> >>> On 19/10/2023 18:05, Mathieu Desnoyers wrote:
>
> [...]
>
> >>> Or like find_energy_efficient_cpu() (feec(), used in
> >>> Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get:
> >>>
> >>> max(util_avg(CPU + p), util_est(CPU + p))
> >>
> >> I've tried using cpu_util(), but unfortunately anything that considers
> >> blocked/sleeping tasks in its utilization total does not work for my
> >> use-case.
> >>
> >> From cpu_util():
> >>
> >> * CPU utilization is the sum of running time of runnable tasks plus the
> >> * recent utilization of currently non-runnable tasks on that CPU.
> >>
> >
> > I thought cpu_util() indicates the utilization decay sum of task that was once
> > "running" on this CPU, but will not sum up the "util/load" of the blocked/sleeping
> > task?
>
> cpu_util() here refers to:
>
> cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
>
> which when called with (cpu, p, cpu, 0) and task_cpu(p) != cpu returns:
>
> max(util_avg(CPU + p), util_est(CPU + p))
>
> The term `CPU utilization` in cpu_util()'s header stands for
> cfs_rq->avg.util_avg.
>
> It does not sum up the utilization of blocked tasks but it can contain
> it. They have to be a blocked tasks and not tasks which were running in
> cfs_rq since we subtract utilization of tasks which are migrating away
> from the cfs_rq (cfs_rq->removed.util_avg in remove_entity_load_avg()
> and update_cfs_rq_load_avg()).

Thanks for this detailed description, Dietmar. Yes, I just realized that,
if the blocked tasks once ran on this cfs_rq and have not been migrated
away, the cfs_rq's util_avg will contain their utilization.

thanks,
Chenyu

> > accumulate_sum()
> > /* only the running task's util will be sum up */
> > if (running)
> > sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
> >
> > WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
>
> __update_load_avg_cfs_rq()
>
>     ___update_load_sum(..., cfs_rq->curr != NULL
>                             ^^^^^^^^^^^^^^^^^^^^
>                             running
>         accumulate_sum()
>
>             if (periods)
>                 /* decay _sum */
>                 sa->util_sum = decay_load(sa->util_sum, ...)
>
>             if (load)
>                 /* decay and accrue _sum */
>                 contrib = __accumulate_pelt_segments(...)
>
> When crossing periods we decay the old _sum and when additionally load
> != 0 we decay and accrue the new _sum as well.

2023-10-25 07:57:21

by Peter Zijlstra

Subject: Re: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

On Thu, Oct 19, 2023 at 12:05:22PM -0400, Mathieu Desnoyers wrote:

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e93e006a942b..463e75084aed 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2090,6 +2090,11 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
>
> #endif /* SCHED_DEBUG */
>
> +static __always_inline bool sched_util_fits_capacity_active(void)
> +{
> + return sched_feat(UTIL_EST) && sched_feat(UTIL_FITS_CAPACITY);
> +}

This generates pretty terrible code; it cannot collapse this into a
single branch. And since sched_feat is at best a debug interface for
people who know wtf they're doing, just make this UTIL_FITS_CAPACITY
with a comment or so.

2023-10-25 14:05:42

by Mathieu Desnoyers

Subject: Re: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

On 2023-10-25 03:56, Peter Zijlstra wrote:
> On Thu, Oct 19, 2023 at 12:05:22PM -0400, Mathieu Desnoyers wrote:
>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e93e006a942b..463e75084aed 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2090,6 +2090,11 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
>>
>> #endif /* SCHED_DEBUG */
>>
>> +static __always_inline bool sched_util_fits_capacity_active(void)
>> +{
>> + return sched_feat(UTIL_EST) && sched_feat(UTIL_FITS_CAPACITY);
>> +}
>
> This generates pretty terrible code; it cannot collapse this into a
> single branch. And since sched_feat is at best a debug interface for
> people who knows wtf they're doing, just make this UTIL_FITS_CAPACITY
> with a comment or so.

OK, change applied for the next round, thanks!

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-10-27 03:27:52

by K Prateek Nayak

Subject: Re: [RFC PATCH v2 0/2] sched/fair migration reduction features

Hello Mathieu,

On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
> Hi,
>
> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
> and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
> a hackbench workload which leaves some idle CPU time on a 192-core AMD
> EPYC.
>
> The main metrics which are significantly improved are:
>
> - cpu-migrations are reduced by 80%,
> - CPU utilization is increased by 17%.
>
> Feedback is welcome. I am especially interested to learn whether this
> series has positive or detrimental effects on performance of other
> workloads.

I got a chance to test this series on a dual socket 3rd Generation EPYC
System (2 x 64C/128T). Following is a quick summary:

- stream and ycsb-mongodb don't see any changes.

- hackbench and DeathStarBench see a major improvement. Both are high
utilization workloads with CPUs being overloaded most of the time.
DeathStarBench is known to benefit from lower migration count. It was
discussed by Gautham at OSPM '23.

- tbench, netperf, and schbench regress. The former two when the
system is near fully loaded, and the latter for most cases. All these
benchmarks are client-server / messenger-worker oriented and are
known to perform better when the client-server / messenger-worker pairs
are on the same CCX (LLC domain).

Detailed results are as follows:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel Details

- tip: tip:sched/core at commit 984ffb6a4366 ("sched/fair: Remove
SIS_PROP")

- wake_prev_bias: tip + this series + Peter's suggestion to optimize
sched_util_fits_capacity_active()

I've taken the liberty of resolving the conflict with the recently added
cluster wakeup optimization by prioritizing the "SELECT_BIAS_PREV" feature.
select_idle_sibling() looks as follows:

select_idle_sibling(...)
{

        ...

        /*
         * With the SELECT_BIAS_PREV feature, if the previous CPU is
         * cache affine, prefer the previous CPU when all CPUs are busy
         * to inhibit migration.
         */
        if (sched_feat(SELECT_BIAS_PREV) &&
            prev != target && cpus_share_cache(prev, target))
                return prev;

        /*
         * For cluster machines which have lower sharing cache like L2 or
         * LLC Tag, we tend to find an idle CPU in the target's cluster
         * first. But prev_cpu or recent_used_cpu may also be a good candidate,
         * use them if possible when no idle CPU found in select_idle_cpu().
         */
        if ((unsigned int)prev_aff < nr_cpumask_bits)
                return prev_aff;
        if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
                return recent_used_cpu;

        return target;
}

Please let me know if you have a different ordering in mind.

o Benchmark results

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.88) 0.97 [ 2.88]( 1.78)
2-groups 1.00 [ -0.00]( 2.03) 0.91 [ 8.79]( 1.19)
4-groups 1.00 [ -0.00]( 1.42) 0.87 [ 13.07]( 1.77)
8-groups 1.00 [ -0.00]( 1.37) 0.86 [ 13.70]( 0.98)
16-groups 1.00 [ -0.00]( 2.54) 0.90 [ 9.74]( 1.65)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
1 1.00 [ 0.00]( 0.63) 0.99 [ -0.53]( 0.97)
2 1.00 [ 0.00]( 0.89) 1.00 [ 0.21]( 0.99)
4 1.00 [ 0.00]( 1.34) 1.01 [ 0.70]( 0.88)
8 1.00 [ 0.00]( 0.49) 1.00 [ 0.40]( 0.55)
16 1.00 [ 0.00]( 1.51) 0.99 [ -0.51]( 1.23)
32 1.00 [ 0.00]( 0.74) 0.97 [ -2.57]( 0.59)
64 1.00 [ 0.00]( 0.92) 0.95 [ -4.69]( 0.70)
128 1.00 [ 0.00]( 0.97) 0.91 [ -8.58]( 0.29)
256 1.00 [ 0.00]( 1.14) 0.90 [ -9.86]( 2.40)
512 1.00 [ 0.00]( 0.35) 0.97 [ -2.91]( 1.78)
1024 1.00 [ 0.00]( 0.07) 0.96 [ -4.15]( 1.43)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
Copy 1.00 [ 0.00]( 8.25) 1.04 [ 3.53](10.84)
Scale 1.00 [ 0.00]( 5.65) 0.99 [ -0.85]( 5.94)
Add 1.00 [ 0.00]( 5.73) 1.00 [ 0.50]( 7.68)
Triad 1.00 [ 0.00]( 3.41) 1.00 [ 0.12]( 6.25)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
Copy 1.00 [ 0.00]( 1.75) 1.01 [ 1.18]( 1.61)
Scale 1.00 [ 0.00]( 0.92) 1.00 [ -0.14]( 1.37)
Add 1.00 [ 0.00]( 0.32) 0.99 [ -0.54]( 1.34)
Triad 1.00 [ 0.00]( 5.97) 1.00 [ 0.37]( 6.34)


==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.67) 1.00 [ 0.08]( 0.15)
2-clients 1.00 [ 0.00]( 0.15) 1.00 [ 0.10]( 0.57)
4-clients 1.00 [ 0.00]( 0.58) 1.00 [ 0.10]( 0.74)
8-clients 1.00 [ 0.00]( 0.46) 1.00 [ 0.31]( 0.64)
16-clients 1.00 [ 0.00]( 0.84) 0.99 [ -0.56]( 1.78)
32-clients 1.00 [ 0.00]( 1.07) 1.00 [ 0.04]( 1.40)
64-clients 1.00 [ 0.00]( 1.53) 1.01 [ 0.68]( 2.27)
128-clients 1.00 [ 0.00]( 1.17) 0.99 [ -0.70]( 1.17)
256-clients 1.00 [ 0.00]( 5.42) 0.91 [ -9.31](10.74)
512-clients 1.00 [ 0.00](48.07) 1.00 [ -0.07](47.71)


==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) wake_prev_bias[pct imp](CV)
1 1.00 [ -0.00](12.00) 1.06 [ -5.56]( 2.99)
2 1.00 [ -0.00]( 6.96) 1.08 [ -7.69]( 2.38)
4 1.00 [ -0.00](13.57) 1.07 [ -7.32](12.95)
8 1.00 [ -0.00]( 6.45) 0.98 [ 2.08](10.86)
16 1.00 [ -0.00]( 3.45) 1.02 [ -1.72]( 1.69)
32 1.00 [ -0.00]( 3.00) 1.05 [ -5.00](10.92)
64 1.00 [ -0.00]( 2.18) 1.04 [ -4.17]( 1.15)
128 1.00 [ -0.00]( 7.15) 1.07 [ -6.65]( 8.45)
256 1.00 [ -0.00](30.20) 1.72 [-72.03](30.62)
512 1.00 [ -0.00]( 4.90) 0.97 [ 3.25]( 1.92)


==================================================================
Test : ycsb-mongodb
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
metric tip wake_prev_bias(%diff)
throughput 1.00 0.99 (%diff: -0.94%)


==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip wake_prev_bias(%diff)
1CCD 1 1.00 1.10 (%diff: 10.04%)
2CCD 2 1.00 1.06 (%diff: 5.90%)
4CCD 4 1.00 1.04 (%diff: 3.74%)
8CCD 8 1.00 1.03 (%diff: 2.98%)

--
It is a mixed bag of results, as expected. I would love to hear your
thoughts on the results. Meanwhile, I'll try to get some more data
from other benchmarks.

>
> [..snip..]
>

--
Thanks and Regards,
Prateek

2023-11-06 05:53:26

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] sched/fair migration reduction features

On 2023-10-27 at 08:57:00 +0530, K Prateek Nayak wrote:
> Hello Mathieu,
>
> On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
> > Hi,
> >
> > This series introduces two new scheduler features: UTIL_FITS_CAPACITY
> > and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
> > a hackbench workload which leaves some idle CPU time on a 192-core AMD
> > EPYC.
> >
> > The main metrics which are significantly improved are:
> >
> > - cpu-migrations are reduced by 80%,
> > - CPU utilization is increased by 17%.
> >
> > Feedback is welcome. I am especially interested to learn whether this
> > series has positive or detrimental effects on performance of other
> > workloads.
>
> I got a chance to test this series on a dual socket 3rd Generation EPYC
> System (2 x 64C/128T). Following is a quick summary:
>
> - stream and ycsb-mongodb don't see any changes.
>
> - hackbench and DeathStarBench see a major improvement. Both are high
> utilization workloads with CPUs being overloaded most of the time.
> DeathStarBench is known to benefit from lower migration count. It was
> discussed by Gautham at OSPM '23.
>
> - tbench, netperf, and schbench regress: the former two when the
> system is nearly fully loaded, and the latter in most cases.

Does it mean hackbench gets benefits when the system is overloaded, while
tbench/netperf do not get benefit when the system is underloaded?

> All these benchmarks are client-server / messenger-worker oriented and are
> known to perform better when the client and server / messenger and worker
> are on the same CCX (LLC domain).

I thought hackbench should also be client-server, because hackbench has
socket/pipe modes and exchanges data between sender/receiver.

This reminds me of your proposal to provide a user hint to the scheduler
on whether to do task consolidation vs task spreading; could this also
be applied to Mathieu's case? For a task or task group with the
"consolidate" flag set, tasks would prefer to be woken up on the
target/previous CPU if the wakee fits into that CPU. In this way we
could bring the benefit without introducing regressions.
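
Something like this hypothetical check in the wakeup path, as a rough
sketch of the idea (p->sched_consolidate is a made-up field and
task_fits_cpu() is used loosely here, so please treat both as
placeholders rather than existing interfaces):

        /*
         * Hypothetical sketch: if the task (or its group) carries a
         * "consolidate" hint and the wakee still fits on its previous
         * CPU, keep it there instead of searching for another CPU.
         * p->sched_consolidate is a made-up field for illustration.
         */
        if (p->sched_consolidate && task_fits_cpu(p, prev_cpu))
                return prev_cpu;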

thanks,
Chenyu

2023-11-06 07:06:56

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] sched/fair migration reduction features

Hello Chenyu,

On 11/6/2023 11:22 AM, Chen Yu wrote:
> On 2023-10-27 at 08:57:00 +0530, K Prateek Nayak wrote:
>> Hello Mathieu,
>>
>> On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
>>> Hi,
>>>
>>> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
>>> and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
>>> a hackbench workload which leaves some idle CPU time on a 192-core AMD
>>> EPYC.
>>>
>>> The main metrics which are significantly improved are:
>>>
>>> - cpu-migrations are reduced by 80%,
>>> - CPU utilization is increased by 17%.
>>>
>>> Feedback is welcome. I am especially interested to learn whether this
>>> series has positive or detrimental effects on performance of other
>>> workloads.
>>
>> I got a chance to test this series on a dual socket 3rd Generation EPYC
>> System (2 x 64C/128T). Following is a quick summary:
>>
>> - stream and ycsb-mongodb don't see any changes.
>>
>> - hackbench and DeathStarBench see a major improvement. Both are high
>> utilization workloads with CPUs being overloaded most of the time.
>> DeathStarBench is known to benefit from lower migration count. It was
>> discussed by Gautham at OSPM '23.
>>
>> - tbench, netperf, and schbench regress: the former two when the
>> system is nearly fully loaded, and the latter in most cases.
>
> Does it mean hackbench gets benefits when the system is overloaded, while
> tbench/netperf do not get benefit when the system is underloaded?

Yup! Seems like that from the results. From what I have seen so far,
there seems to be a work conservation aspect to hackbench: if we reduce
the time spent in the kernel, hackbench results seem to improve.
Mathieu's patch (this one) helps by reducing the time taken to decide on
the target CPU. There is also a second-order effect from another one of
Mathieu's patches that uses the wakelist but indirectly curbs the
SIS_UTIL limits, based on Aaron's observation [1], thus reducing the
time spent in select_idle_cpu().

[1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/

schbench, tbench, and netperf see that wakeups are faster when the
client and server are on the same LLC, so in the underloaded case,
consolidating them on the same LLC (as long as there is one task per
runqueue) is better than keeping them on separate LLCs.

>
>> All these benchmarks are client-server / messenger-worker oriented and are
>> known to perform better when the client and server / messenger and worker
>> are on the same CCX (LLC domain).
>
> I thought hackbench should also be client-server, because hackbench has
> socket/pipe modes and exchanges data between sender/receiver.

Yes but its N:M nature makes it slightly complicated to understand where
the cache benefits disappear and the work conservation benefits become
more prominent.

>
> This reminds me of your proposal to provide a user hint to the scheduler
> on whether to do task consolidation vs task spreading; could this also
> be applied to Mathieu's case? For a task or task group with the
> "consolidate" flag set, tasks would prefer to be woken up on the
> target/previous CPU if the wakee fits into that CPU. In this way we
> could bring the benefit without introducing regressions.

I think even a simple WF_SYNC check will help the tbench and netperf cases.
Let me get back to you with some data on different variants of hackbench
with the latest tip.
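
Something along these lines in the wakeup path, as a rough sketch
(prev_cpu_still_fits() is just a placeholder for the SELECT_BIAS_PREV +
UTIL_FITS_CAPACITY checks, not an existing helper):

        /*
         * Rough sketch: only bias towards the previous CPU for non-sync
         * wakeups, so 1:1 client/server pairs can still be pulled close
         * to the waker. prev_cpu_still_fits() is a placeholder.
         */
        if (sched_feat(SELECT_BIAS_PREV) && !(wake_flags & WF_SYNC) &&
            prev_cpu_still_fits(p, prev_cpu))
                return prev_cpu;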

>
> thanks,
> Chenyu

--
Thanks and Regards,
Prateek

2023-11-06 16:32:44

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] sched/fair migration reduction features

On 2023-10-26 23:27, K Prateek Nayak wrote:
[...]
> --
> It is a mixed bag of results, as expected. I would love to hear your
> thoughts on the results. Meanwhile, I'll try to get some more data
> from other benchmarks.

I suspect that workloads that exhibit a client-server (1:1) pairing
pattern are hurt by the bias towards leaving tasks on their prev
runqueue: they benefit from moving both client/server tasks as close as
possible so they share either the same core or a common cache.

The hackbench workload is also client-server, but there are N client and
N server threads, creating an N:N relationship which really does not work
well when trying to pull tasks on sync wakeup: tasks then bounce all
over the place.

It's tricky though. If we try to fix the "1:1" client-server pattern
with a heuristic, we may miss scenarios which are close to 1:1 but don't
exactly match.

I'm working on a rewrite of select_task_rq_fair, with the aim to tackle
the more general task placement problem taking into account the following:

- We want to converge towards a task placement that moves tasks with
most waker/wakee interactions as close as possible in the cache
topology,
- We can use the core util_est/capacity metrics to calculate whether we
have capacity left to enqueue a task in a core's runqueue (see the
sketch below).
- The underlying assumption is that work conserving [1] is not a good
characteristic to aim for, because it does not take into account the
overhead associated with migrations, and the resulting lack of cache
locality.
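
As a rough sketch of that capacity check (not an actual patch: the
helper name below is made up, I am assuming the existing
task_util_est()/cpu_util_cfs()/scale_rt_capacity() helpers, and the
margin mirrors the 20% used in this series):

/*
 * Sketch only: does the wakee still fit on @cpu while keeping roughly
 * 20% of the usable capacity unused?
 */
static bool task_fits_remaining_capacity(struct task_struct *p, int cpu)
{
        unsigned long util = cpu_util_cfs(cpu) + task_util_est(p);

        /* Keep the total utilization below 80% of the usable capacity. */
        return util * 5 < scale_rt_capacity(cpu) * 4;
}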

Thanks,

Mathieu

[1] I use the definition of "work conserving" found here:
https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-11-06 17:18:44

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] sched/fair migration reduction features

On 2023-11-06 02:06, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 11/6/2023 11:22 AM, Chen Yu wrote:
>> On 2023-10-27 at 08:57:00 +0530, K Prateek Nayak wrote:
>>> Hello Mathieu,
>>>
>>> On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
>>>> Hi,
>>>>
>>>> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
>>>> and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
>>>> a hackbench workload which leaves some idle CPU time on a 192-core AMD
>>>> EPYC.
>>>>
>>>> The main metrics which are significantly improved are:
>>>>
>>>> - cpu-migrations are reduced by 80%,
>>>> - CPU utilization is increased by 17%.
>>>>
>>>> Feedback is welcome. I am especially interested to learn whether this
>>>> series has positive or detrimental effects on performance of other
>>>> workloads.
>>>
>>> I got a chance to test this series on a dual socket 3rd Generation EPYC
>>> System (2 x 64C/128T). Following is a quick summary:
>>>
>>> - stream and ycsb-mongodb don't see any changes.
>>>
>>> - hackbench and DeathStarBench see a major improvement. Both are high
>>> utilization workloads with CPUs being overloaded most of the time.
>>> DeathStarBench is known to benefit from lower migration count. It was
>>> discussed by Gautham at OSPM '23.
>>>
>>> - tbench, netperf, and schbench regress: the former two when the
>>> system is nearly fully loaded, and the latter in most cases.
>>
>> Does it mean hackbench gets benefits when the system is overloaded, while
>> tbench/netperf do not get benefit when the system is underloaded?
>
> Yup! Seems like that from the results. From what I have seen so far,
> there seems to be a work conservation aspect to hackbench where if we
> reduce the time spent in the kernel (by reducing time to decide on the
> target which Mathieu's patch [this one] achieves,

I am confused by this comment.

Quoting Daniel Bristot, "work conserving" is defined as "in a system
with M processors, the M "highest priority" must be running (in
real-time)". This should apply to other scheduling classes as well. This
definition fits with this paper's definition [1]: "The Linux scheduler
is work-conserving, meaning that it should never leave cores idle if
there is work to do."

Do you mean something different by "work conservation" ?

Just in case, I've made the following experiment to figure out if my
patches benefit from having less time spent in select_task_rq_fair(). I
have copied the original "select_idle_sibling()" into a separate
function "select_idle_sibling_orig()", which I call at the beginning of
the new "biased" select_idle_sibling. I use its result in an empty asm
volatile, which ensures that the code is not optimized away. Then the
biased function selects the runqueue with the new biased approach.

The result with hackbench is that the speedup is still pretty much the
same with or without the added "select_idle_sibling_orig()" call.

Based on this, my understanding is that the speedup comes from
minimizing the number of migrations (and the side effects caused by
those migrations, such as runqueue locks and cache misses), rather than
from making select_idle_sibling faster.

So based on this, I suspect that we could add some overhead to
select_task_rq_fair if it means we make a better task placement
decision and minimize migrations, and that would still provide an
overall performance benefit.

> there is also a
> second order effect from another one of Mathieu's Patches that uses
> wakelist but indirectly curbs the SIS_UTIL limits based on Aaron's
> observation [1] thus reducing time spent in select_idle_cpu())
> hackbench results seem to improve.

It's possible that an indirect effect of bias towards prev runqueue is
to affect the metrics used by select_idle_cpu() as well and make it
return early.

I've tried adding a 1000 iteration barrier() loop within
select_idle_sibling_orig(), and indeed the hackbench time goes from 29s
to 31s. Therefore, slowing down the task rq selection does have some impact.
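
(The artificial delay is simply a compiler barrier loop roughly of the
following form, placed within select_idle_sibling_orig():)

        int i;

        /* Artificial delay: 1000 compiler barriers to slow down selection. */
        for (i = 0; i < 1000; i++)
                barrier();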

>
> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
>
> schbench, tbench, and netperf see that wakeups are faster when the
> client and server are on the same LLC, so in the underloaded case,
> consolidating them on the same LLC (as long as there is one task per
> runqueue) is better than keeping them on separate LLCs.

What is faster for the 1:1 client/server ping-pong scenario: having the
client and server on the same LLC, but different runqueues, or having
them share a single runqueue ? If they wait for each other, then I
suspect it's better to place them on the same runqueue as long as there
is capacity left.

>
>>
>>> All these benchmarks are client-server / messenger-worker oriented and are
>>> known to perform better when the client and server / messenger and worker
>>> are on the same CCX (LLC domain).
>>
>> I thought hackbench should also be client-server, because hackbench has
>> socket/pipe modes and exchanges data between sender/receiver.
>
> Yes but its N:M nature makes it slightly complicated to understand where
> the cache benefits disappear and the work conservation benefits become
> more prominent.

The N:M nature of hackbench AFAIU causes N-server *and* M-client tasks
to pull each other pretty much randomly, therefore thrashing cache locality.

I'm still unclear about the definition of "work conservation" in this
discussion.

>
>>
>> This reminds me of your proposal to provide a user hint to the scheduler
>> on whether to do task consolidation vs task spreading; could this also
>> be applied to Mathieu's case? For a task or task group with the
>> "consolidate" flag set, tasks would prefer to be woken up on the
>> target/previous CPU if the wakee fits into that CPU. In this way we
>> could bring the benefit without introducing regressions.
>
>> I think even a simple WF_SYNC check will help the tbench and netperf cases.
>> Let me get back to you with some data on different variants of hackbench
>> with the latest tip.

AFAIU (to be double-checked) the hackbench workload also has WF_SYNC,
which prevents us from using this flag to distinguish between 1:1
server/client and N:M scenarios. Or am I missing something ?

Thanks,

Mathieu

[1] https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-11-07 03:02:52

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/2] sched/fair migration reduction features

Hello Mathieu,

On 11/6/2023 10:48 PM, Mathieu Desnoyers wrote:
> On 2023-11-06 02:06, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 11/6/2023 11:22 AM, Chen Yu wrote:
>>> On 2023-10-27 at 08:57:00 +0530, K Prateek Nayak wrote:
>>>> Hello Mathieu,
>>>>
>>>> On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
>>>>> Hi,
>>>>>
>>>>> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
>>>>> and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
>>>>> a hackbench workload which leaves some idle CPU time on a 192-core AMD
>>>>> EPYC.
>>>>>
>>>>> The main metrics which are significantly improved are:
>>>>>
>>>>> - cpu-migrations are reduced by 80%,
>>>>> - CPU utilization is increased by 17%.
>>>>>
>>>>> Feedback is welcome. I am especially interested to learn whether this
>>>>> series has positive or detrimental effects on performance of other
>>>>> workloads.
>>>>
>>>> I got a chance to test this series on a dual socket 3rd Generation EPYC
>>>> System (2 x 64C/128T). Following is a quick summary:
>>>>
>>>> - stream and ycsb-mongodb don't see any changes.
>>>>
>>>> - hackbench and DeathStarBench see a major improvement. Both are high
>>>>    utilization workloads with CPUs being overloaded most of the time.
>>>>    DeathStarBench is known to benefit from lower migration count. It was
>>>>    discussed by Gautham at OSPM '23.
>>>>
>>>> - tbench, netperf, and schbench regress: the former two when the
>>>>    system is nearly fully loaded, and the latter in most cases.
>>>
>>> Does it mean hackbench gets benefits when the system is overloaded, while
>>> tbench/netperf do not get benefit when the system is underloaded?
>>
>> Yup! Seems like that from the results. From what I have seen so far,
>> there seems to be a work conservation aspect to hackbench where if we
>> reduce the time spent in the kernel (by reducing time to decide on the
>> target which Mathieu's patch [this one] achieves,
>
> I am confused by this comment.
>
> Quoting Daniel Bristot, "work conserving" is defined as "in a system with M processors, the M "highest priority" must be running (in real-time)". This should apply to other scheduling classes as well. This definition fits with this paper's definition [1]: "The Linux scheduler is work-conserving, meaning that it should never leave cores idle if there is work to do."
>
> Do you mean something different by "work conservation" ?

Sorry for the confusion. My interpretation of the term "work
conservation" was that when there are multiple runnable tasks in the
system, each task more or less gets the same amount of CPU time. In the
case of hackbench specifically, that is time spent in userspace.

>
> Just in case, I've made the following experiment to figure out if my patches benefit from having less time spent in select_task_rq_fair(). I have copied the original "select_idle_sibling()" into a separate function "select_idle_sibling_orig()", which I call at the beginning of the new "biased" select_idle_sibling. I use its result in an empty asm volatile, which ensures that the code is not optimized away. Then the biased function selects the runqueue with the new biased approach.

So in a way you are doing two calls to "select_idle_sibling()" each
time? Or is it more like:

select_idle_sibling(...) {
        int cpu = select_idle_sibling_orig();

        /*
         * Take full cost of select_idle_sibling_orig()
         * but return prev_cpu if it is still optimal
         * target for wakeup with the biases.
         */
        if (sched_feat(SELECT_BIAS_PREV) && prev_cpu_still_optimal(p))
                return prev_cpu;

        return cpu;
}

>
> The result with hackbench is that the speedup is still pretty much the same with or without the added "select_idle_sibling_orig()" call.
>
> Based on this, my understanding is that the speedup comes from minimizing the number of migrations (and the side effects caused by those migrations, such as runqueue locks and cache misses), rather than from making select_idle_sibling faster.
>
> So based on this, I suspect that we could add some overhead to select_task_rq_fair if it means we make a better task placement decision and minimize migrations, and that would still provide an overall performance benefit.

Some of my older experiments when SIS_NODE was proposed suggested the
opposite but things might have changed now :)

I'll get back to you on this.

>
>> there is also a
>> second order effect from another one of Mathieu's Patches that uses
>> wakelist but indirectly curbs the SIS_UTIL limits based on Aaron's
>> observation [1] thus reducing time spent in select_idle_cpu())
>> hackbench results seem to improve.
>
> It's possible that an indirect effect of bias towards prev runqueue is to affect the metrics used by select_idle_cpu() as well and make it return early.
>
> I've tried adding a 1000 iteration barrier() loop within select_idle_sibling_orig(), and indeed the hackbench time goes from 29s to 31s. Therefore, slowing down the task rq selection does have some impact.
>
>>
>> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
>>
>> schbench, tbench, and netperf see that wakeups are faster when the
>> client and server are on the same LLC, so in the underloaded case,
>> consolidating them on the same LLC (as long as there is one task per
>> runqueue) is better than keeping them on separate LLCs.
>
> What is faster for the 1:1 client/server ping-pong scenario: having the client and server on the same LLC, but different runqueues, or having them share a single runqueue ?

Client and server on the same LLC, but on different cores, gives the
best result.

> If they wait for each other, then I suspect it's better to place them on the same runqueue as long as there is capacity left.

Yup, that is correct.

>
>>
>>>
>>>>    All these benchmarks are client-server / messenger-worker oriented and are
>>>>    known to perform better when the client and server / messenger and worker
>>>>    are on the same CCX (LLC domain).
>>>
>>> I thought hackbench should also be client-server, because hackbench has
>>> socket/pipe modes and exchanges data between sender/receiver.
>>
>> Yes but its N:M nature makes it slightly complicated to understand where
>> the cache benefits disappear and the work conservation benefits become
>> more prominent.
>
> The N:M nature of hackbench AFAIU causes N-server *and* M-client tasks to pull each other pretty much randomly, therefore thrashing cache locality.
>
> I'm still unclear about the definition of "work conservation" in this discussion.

In my previous observations, if you could minimize the time spent
scheduling the wakee and return to userspace faster, the benchmark
benefited overall. But then the MM_CID observation goes against this
¯\_(ツ)_/¯ or maybe there is a higher-order effect that I might be
missing.

>
>>
>>>
>>> This reminds me of your proposal to provide a user hint to the scheduler
>>> on whether to do task consolidation vs task spreading; could this also
>>> be applied to Mathieu's case? For a task or task group with the
>>> "consolidate" flag set, tasks would prefer to be woken up on the
>>> target/previous CPU if the wakee fits into that CPU. In this way we
>>> could bring the benefit without introducing regressions.
>>
>> I think even a simple WF_SYNC check will help the tbench and netperf cases.
>> Let me get back to you with some data on different variants of hackbench
>> with the latest tip.
>
> AFAIU (to be double-checked) the hackbench workload also has WF_SYNC, which prevents us from using this flag to distinguish between 1:1 server/client and N:M scenarios. Or am I missing something ?

Yup! You are right. My bad.

>
> Thanks,
>
> Mathieu
>
> [1] https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf
>

--
Thanks and Regards,
Prateek