2022-02-17 18:50:59

by Abel Wu

Subject: [RFC PATCH 0/5] introduce sched-idle balancing

Current load balancing is mainly based on cpu capacity
and task util, which makes sense from the POV of overall
throughput. Still, there is room for improvement: the
number of overloaded cfs rqs can be reduced whenever
sched-idle or idle rqs exist.

A CFS runqueue is considered overloaded when there is
more than one pullable non-idle task on it (sched-idle
cpus are treated as idle cpus). Idle tasks are those
counted in rq->cfs.idle_h_nr_running, i.e. tasks that
are either assigned the SCHED_IDLE policy or placed
under idle cgroups.
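
Expressed as code, the check boils down to the helper
introduced in patch 1:

	static inline int cfs_rq_overloaded(struct rq *rq)
	{
		return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
	}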

Overloaded cfs rqs can cause performance issues for
both task types:

- latency-critical tasks like SCHED_NORMAL ones
spend more time waiting in the rq, which shows
up as higher p99 latency, and

- batch tasks may not be able to make full use
of cpu capacity while sched-idle rqs exist,
which hurts throughput.

So in short, the goal of sched-idle balancing is to
let the *non-idle tasks* make full use of cpu resources.
To achieve that, we mainly do two things (a rough sketch
of the pull side follows the list):

- pull non-idle tasks from the overloaded rqs to
the sched-idle or idle ones, and

- prevent pulling the last non-idle task off an rq
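
Roughly, the pull side looks like this (simplified; the
actual sched_idle_balance() is added in patch 2, and the
detach helper below is only a placeholder for the real
detach logic, with stats and corner cases omitted):

	static void sched_idle_balance(struct rq *dst_rq)
	{
		int cpu, dst_cpu = cpu_of(dst_rq);
		struct sched_domain_shared *sds;

		rcu_read_lock();
		sds = rcu_dereference(per_cpu(sd_llc_shared, dst_cpu));
		if (!sds || !atomic_read(&sds->nr_overloaded))
			goto out;

		/* walk the cpus recorded as overloaded at the LLC level */
		for_each_cpu(cpu, sdo_mask(sds)) {
			struct rq *rq = cpu_rq(cpu);
			struct task_struct *p;
			struct rq_flags rf;

			if (cpu == dst_cpu || !cfs_rq_overloaded(rq))
				continue;

			rq_lock_irqsave(rq, &rf);
			/* pick a non-idle task, but never the last one on @rq */
			p = detach_one_nonidle_task(rq, dst_cpu); /* placeholder */
			rq_unlock(rq, &rf);

			if (p) {
				attach_one_task(dst_rq, p);
				local_irq_restore(rf.flags);
				break;
			}
			local_irq_restore(rf.flags);
		}
	out:
		rcu_read_unlock();
	}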

The mask of overloaded cpus is updated in the periodic
tick and in the idle path, at the LLC domain level. This
cpumask is also used as a filter in SIS, improving idle
cpu searching.
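
In SIS the filter boils down to masking out the overloaded
cpus before searching (this is the core change of patch 4):

	if (atomic_read(&sd->shared->nr_overloaded))
		cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));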

Tests were done on an Intel Xeon E5-2650 v4 server with
2 NUMA nodes of 12 cores each and SMT2 enabled, so 48
CPUs in total. The results are listed below.

- we used the perf messaging benchmark to measure
throughput at different loads (groups).

perf bench sched messaging -g [N] -l 40000

N w/o w/ diff
1 2.897 2.834 -2.17%
3 5.156 4.904 -4.89%
5 7.850 7.617 -2.97%
10 15.140 14.574 -3.74%
20 29.387 27.602 -6.07%

the results show an approximate 2~6% improvement.

- and schbench to measure latency in two
scenarios: quiet and noisy. In the quiet test, we
run schbench in a normal cpu cgroup on an otherwise
quiet system, while the noisy test additionally runs
a perf messaging workload inside an idle cgroup
as noise.

schbench -m 2 -t 24 -i 60 -r 60
perf bench sched messaging -g 1 -l 4000000

[quiet]
w/o w/
50.0th 31 31
75.0th 45 45
90.0th 55 55
95.0th 62 61
*99.0th 85 86
99.5th 565 318
99.9th 11536 10992
max 13029 13067

[noisy]
w/o w/
50.0th 34 32
75.0th 48 45
90.0th 58 55
95.0th 65 61
*99.0th 2364 208
99.5th 6696 2068
99.9th 12688 8816
max 15209 14191

it can be seen that the quiet test results are
quite similar, while the p99 latency is greatly
improved in the noisy test.

Comments and tests are appreciated!

Abel Wu (5):
sched/fair: record overloaded cpus
sched/fair: introduce sched-idle balance
sched/fair: add stats for sched-idle balancing
sched/fair: filter out overloaded cpus in sis
sched/fair: favor cpu capacity for idle tasks

include/linux/sched/idle.h | 1 +
include/linux/sched/topology.h | 15 ++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 187 ++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 6 ++
kernel/sched/stats.c | 5 +-
kernel/sched/topology.c | 4 +-
7 files changed, 215 insertions(+), 4 deletions(-)

--
2.11.0


2022-02-17 20:21:58

by Abel Wu

Subject: [RFC PATCH 1/5] sched/fair: record overloaded cpus

A CFS runqueue is considered overloaded when there is
more than one pullable non-idle task on it (sched-idle
cpus are treated as idle cpus). Idle tasks are those
counted in rq->cfs.idle_h_nr_running, i.e. tasks that
are either assigned the SCHED_IDLE policy or placed
under idle cgroups.

Overloaded cfs rqs can cause performance issues for
both task types:

- latency-critical tasks like SCHED_NORMAL ones
spend more time waiting in the rq, which shows
up as higher p99 latency, and

- batch tasks may not be able to make full use
of cpu capacity while sched-idle rqs exist,
which hurts throughput.

The mask of overloaded cpus is updated in the periodic
tick and in the idle path, at the LLC domain level. This
cpumask is also used as a filter in SIS, improving idle
cpu searching.

Signed-off-by: Abel Wu <[email protected]>
---
include/linux/sched/topology.h | 10 ++++++++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 6 ++++++
kernel/sched/topology.c | 4 +++-
5 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..03c9c81dc886 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,16 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+
+ /*
+ * The above variables are used in the idle path and
+ * select_task_rq, and the following two are
+ * mainly updated in tick. They are all hot but
+ * for different usage, so start a new cacheline
+ * to avoid false sharing.
+ */
+ atomic_t nr_overloaded ____cacheline_aligned;
+ unsigned long overloaded[]; /* Must be last */
};

struct sched_domain {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d863d7f6ad7..a6da2998ec49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9423,6 +9423,7 @@ void __init sched_init(void)
rq->wake_stamp = jiffies;
rq->wake_avg_idle = rq->avg_idle;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+ rq->overloaded = 0;

INIT_LIST_HEAD(&rq->cfs_tasks);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c4bfffe8c2c..0a0438c3319b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6968,6 +6968,46 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

return newidle_balance(rq, rf) != 0;
}
+
+static inline int cfs_rq_overloaded(struct rq *rq)
+{
+ return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
+}
+
+/* Must be called with rq locked */
+static void update_overload_status(struct rq *rq)
+{
+ struct sched_domain_shared *sds;
+ int overloaded = cfs_rq_overloaded(rq);
+ int cpu = cpu_of(rq);
+
+ lockdep_assert_rq_held(rq);
+
+ if (rq->overloaded == overloaded)
+ return;
+
+ rcu_read_lock();
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (unlikely(!sds))
+ goto unlock;
+
+ if (overloaded) {
+ cpumask_set_cpu(cpu, sdo_mask(sds));
+ atomic_inc(&sds->nr_overloaded);
+ } else {
+ cpumask_clear_cpu(cpu, sdo_mask(sds));
+ atomic_dec(&sds->nr_overloaded);
+ }
+
+ rq->overloaded = overloaded;
+unlock:
+ rcu_read_unlock();
+}
+
+#else
+
+static inline void update_overload_status(struct rq *rq) { }
+
#endif /* CONFIG_SMP */

static unsigned long wakeup_gran(struct sched_entity *se)
@@ -7315,6 +7355,8 @@ done: __maybe_unused;
if (new_tasks > 0)
goto again;

+ update_overload_status(rq);
+
/*
* rq is about to be idle, check if we need to update the
* lost_idle_time of clock_pelt
@@ -11131,6 +11173,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);

+ update_overload_status(rq);
update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9b33ba9c3c42..c81a87082b8b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1012,6 +1012,7 @@ struct rq {

unsigned char nohz_idle_balance;
unsigned char idle_balance;
+ unsigned char overloaded;

unsigned long misfit_task_load;

@@ -1762,6 +1763,11 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
return sd;
}

+static inline struct cpumask *sdo_mask(struct sched_domain_shared *sds)
+{
+ return to_cpumask(sds->overloaded);
+}
+
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e6cd55951304..641f11415819 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1623,6 +1623,8 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+ atomic_set(&sd->shared->nr_overloaded, 0);
+ cpumask_clear(sdo_mask(sd->shared));
}

sd->private = sdd;
@@ -2050,7 +2052,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)

*per_cpu_ptr(sdd->sd, j) = sd;

- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
--
2.11.0

2022-02-17 23:01:27

by Abel Wu

Subject: [RFC PATCH 5/5] sched/fair: favor cpu capacity for idle tasks

Unlike select_idle_sibling(), where we need to find a
not-so-bad candidate ASAP, the slowpath gives us more
tolerance: ignore sched-idle cpus for idle tasks, since
they prefer cpu capacity over latency. Besides, spreading
out the idle tasks is also good for the latency of normal
tasks.

Signed-off-by: Abel Wu <[email protected]>
---
kernel/sched/fair.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d8f396e6f41..57f1d8c43228 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6007,6 +6007,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu);
static int
find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
{
+ bool ignore_si = task_h_idle(p);
unsigned long load, min_load = ULONG_MAX;
unsigned int min_exit_latency = UINT_MAX;
u64 latest_idle_timestamp = 0;
@@ -6025,7 +6026,13 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
if (!sched_core_cookie_match(rq, p))
continue;

- if (sched_idle_cpu(i))
+ /*
+ * The idle tasks prefer cpu capacity rather than
+ * latency. Spreading out idle tasks is also good for
+ * latency of normal tasks since they won't suffer
+ * high cpu wakeup delay.
+ */
+ if (!ignore_si && sched_idle_cpu(i))
return i;

if (available_idle_cpu(i)) {
--
2.11.0

2022-02-17 23:32:35

by Abel Wu

Subject: [RFC PATCH 3/5] sched/fair: add stats for sched-idle balancing

To better understand the behavior of sched-idle balancing, add
some statistics, as the other load balancing mechanisms do.

Signed-off-by: Abel Wu <[email protected]>
---
include/linux/sched/topology.h | 5 +++++
kernel/sched/fair.c | 6 +++++-
kernel/sched/stats.c | 5 +++--
3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 03c9c81dc886..4259963d3e5e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -150,6 +150,11 @@ struct sched_domain {
unsigned int ttwu_wake_remote;
unsigned int ttwu_move_affine;
unsigned int ttwu_move_balance;
+
+ /* sched-idle balancing */
+ unsigned int sib_peeked;
+ unsigned int sib_pulled;
+ unsigned int sib_failed;
#endif
#ifdef CONFIG_SCHED_DEBUG
char *name;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 070a6fb1d2bf..c83c0864e429 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10330,8 +10330,10 @@ static void sched_idle_balance(struct rq *dst_rq)
if (cpu == dst_cpu)
continue;

- if (!cfs_rq_overloaded(rq))
+ if (!cfs_rq_overloaded(rq)) {
+ schedstat_inc(sd->sib_peeked);
continue;
+ }

rq_lock_irqsave(rq, &rf);

@@ -10375,10 +10377,12 @@ static void sched_idle_balance(struct rq *dst_rq)
if (p) {
attach_one_task(dst_rq, p);
local_irq_restore(rf.flags);
+ schedstat_inc(sd->sib_pulled);
return;
}

local_irq_restore(rf.flags);
+ schedstat_inc(sd->sib_failed);
}
}

diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 07dde2928c79..3ee476c72806 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -164,12 +164,13 @@ static int show_schedstat(struct seq_file *seq, void *v)
sd->lb_nobusyg[itype]);
}
seq_printf(seq,
- " %u %u %u %u %u %u %u %u %u %u %u %u\n",
+ " %u %u %u %u %u %u %u %u %u %u %u %u %u %u %u\n",
sd->alb_count, sd->alb_failed, sd->alb_pushed,
sd->sbe_count, sd->sbe_balanced, sd->sbe_pushed,
sd->sbf_count, sd->sbf_balanced, sd->sbf_pushed,
sd->ttwu_wake_remote, sd->ttwu_move_affine,
- sd->ttwu_move_balance);
+ sd->ttwu_move_balance, sd->sib_peeked,
+ sd->sib_pulled, sd->sib_failed);
}
rcu_read_unlock();
#endif
--
2.11.0

2022-02-17 23:50:06

by Abel Wu

Subject: [RFC PATCH 4/5] sched/fair: filter out overloaded cpus in sis

Skip the overloaded cpus in SIS, if there are any. This
improves idle cpu searching, especially under the SIS_PROP
constraint that limits the search depth.

The mask of overloaded cpus might not be perfectly accurate
since it is generally updated at tick granularity, but the
overloaded cpus are unlikely to go idle shortly anyway.

Signed-off-by: Abel Wu <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c83c0864e429..1d8f396e6f41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6273,6 +6273,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool

cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

+ if (atomic_read(&sd->shared->nr_overloaded))
+ cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));
+
if (sched_feat(SIS_PROP) && !has_idle_core) {
u64 avg_cost, avg_idle, span_avg;
unsigned long now = jiffies;
--
2.11.0

2022-02-24 03:22:32

by Abel Wu

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

Ping :)

On 2/17/22 11:43 PM, Abel Wu Wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.
>
> An CFS runqueue is considered overloaded when there are
> more than one pullable non-idle tasks on it (since sched-
> idle cpus are treated as idle cpus). And idle tasks are
> counted towards rq->cfs.idle_h_nr_running, that is either
> assigned SCHED_IDLE policy or placed under idle cgroups.
>
> The overloaded cfs rqs can cause performance issues to
> both task types:
>
> - for latency critical tasks like SCHED_NORMAL,
> time of waiting in the rq will increase and
> result in higher pct99 latency, and
>
> - batch tasks may not be able to make full use
> of cpu capacity if sched-idle rq exists, thus
> presents poorer throughput.
>
> So in short, the goal of the sched-idle balancing is to
> let the *non-idle tasks* make full use of cpu resources.
> To achieve that, we mainly do two things:
>
> - pull non-idle tasks for sched-idle or idle rqs
> from the overloaded ones, and
>
> - prevent pulling the last non-idle task in an rq
>
> The mask of overloaded cpus is updated in periodic tick
> and the idle path at the LLC domain basis. This cpumask
> will also be used in SIS as a filter, improving idle cpu
> searching.
>
> Tests are done in an Intel Xeon E5-2650 v4 server with
> 2 NUMA nodes each of which has 12 cores, and with SMT2
> enabled, so 48 CPUs in total. Test results are listed
> as follows.
>
> - we used perf messaging test to test throughput
> at different load (groups).
>
> perf bench sched messaging -g [N] -l 40000
>
> N w/o w/ diff
> 1 2.897 2.834 -2.17%
> 3 5.156 4.904 -4.89%
> 5 7.850 7.617 -2.97%
> 10 15.140 14.574 -3.74%
> 20 29.387 27.602 -6.07%
>
> the result shows approximate 2~6% improvement.
>
> - and schbench to test latency performance in two
> scenarios: quiet and noisy. In quiet test, we
> run schbench in a normal cpu cgroup in a quiet
> system, while the noisy test additionally runs
> perf messaging workload inside an idle cgroup
> as nosie.
>
> schbench -m 2 -t 24 -i 60 -r 60
> perf bench sched messaging -g 1 -l 4000000
>
> [quiet]
> w/o w/
> 50.0th 31 31
> 75.0th 45 45
> 90.0th 55 55
> 95.0th 62 61
> *99.0th 85 86
> 99.5th 565 318
> 99.9th 11536 10992
> max 13029 13067
>
> [nosiy]
> w/o w/
> 50.0th 34 32
> 75.0th 48 45
> 90.0th 58 55
> 95.0th 65 61
> *99.0th 2364 208
> 99.5th 6696 2068
> 99.9th 12688 8816
> max 15209 14191
>
> it can be seen that the quiet test results are
> quite similar, but the p99 latency is greatly
> improved in the nosiy test.
>
> Comments and tests are appreciated!
>
> Abel Wu (5):
> sched/fair: record overloaded cpus
> sched/fair: introduce sched-idle balance
> sched/fair: add stats for sched-idle balancing
> sched/fair: filter out overloaded cpus in sis
> sched/fair: favor cpu capacity for idle tasks
>
> include/linux/sched/idle.h | 1 +
> include/linux/sched/topology.h | 15 ++++
> kernel/sched/core.c | 1 +
> kernel/sched/fair.c | 187 ++++++++++++++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 6 ++
> kernel/sched/stats.c | 5 +-
> kernel/sched/topology.c | 4 +-
> 7 files changed, 215 insertions(+), 4 deletions(-)
>

2022-02-24 08:13:18

by Gautham R. Shenoy

Subject: Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus

Hello Abel,

(+ Aubrey Li, Srikar)

On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
> An CFS runqueue is considered overloaded when there are
> more than one pullable non-idle tasks on it (since sched-
> idle cpus are treated as idle cpus). And idle tasks are
> counted towards rq->cfs.idle_h_nr_running, that is either
> assigned SCHED_IDLE policy or placed under idle cgroups.
>
> The overloaded cfs rqs can cause performance issues to
> both task types:
>
> - for latency critical tasks like SCHED_NORMAL,
> time of waiting in the rq will increase and
> result in higher pct99 latency, and
>
> - batch tasks may not be able to make full use
> of cpu capacity if sched-idle rq exists, thus
> presents poorer throughput.
>
> The mask of overloaded cpus is updated in periodic tick
> and the idle path at the LLC domain basis. This cpumask
> will also be used in SIS as a filter, improving idle cpu
> searching.

This is an interesting approach to minimise the tail latencies by
keeping track of the overloaded cpus in the LLC so that
idle/sched-idle CPUs can pull from them. This approach contrasts with the
following approaches that were previously tried :

1. Maintain the idle cpumask at the LLC level by Aubrey Li
https://lore.kernel.org/all/[email protected]/

2. Maintain the identity of the idle core itself at the LLC level, by Srikar :
https://lore.kernel.org/lkml/[email protected]/

There have been concerns in the past about having to update the shared
mask/counter at regular intervals. Srikar, Aubrey any thoughts on this
?



>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> include/linux/sched/topology.h | 10 ++++++++++
> kernel/sched/core.c | 1 +
> kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 6 ++++++
> kernel/sched/topology.c | 4 +++-
> 5 files changed, 63 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..03c9c81dc886 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,6 +81,16 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> +
> + /*
> + * The above varibles are used in idle path and
> + * select_task_rq, and the following two are
> + * mainly updated in tick. They are all hot but
> + * for different usage, so start a new cacheline
> + * to avoid false sharing.
> + */
> + atomic_t nr_overloaded ____cacheline_aligned;
> + unsigned long overloaded[]; /* Must be last */
> };
>
> struct sched_domain {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1d863d7f6ad7..a6da2998ec49 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9423,6 +9423,7 @@ void __init sched_init(void)
> rq->wake_stamp = jiffies;
> rq->wake_avg_idle = rq->avg_idle;
> rq->max_idle_balance_cost = sysctl_sched_migration_cost;
> + rq->overloaded = 0;
>
> INIT_LIST_HEAD(&rq->cfs_tasks);
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c4bfffe8c2c..0a0438c3319b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6968,6 +6968,46 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
> return newidle_balance(rq, rf) != 0;
> }
> +
> +static inline int cfs_rq_overloaded(struct rq *rq)
> +{
> + return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
> +}
> +
> +/* Must be called with rq locked */
> +static void update_overload_status(struct rq *rq)
> +{
> + struct sched_domain_shared *sds;
> + int overloaded = cfs_rq_overloaded(rq);
> + int cpu = cpu_of(rq);
> +
> + lockdep_assert_rq_held(rq);
> +
> + if (rq->overloaded == overloaded)
> + return;
> +
> + rcu_read_lock();
> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> + if (unlikely(!sds))
> + goto unlock;
> +
> + if (overloaded) {
> + cpumask_set_cpu(cpu, sdo_mask(sds));
> + atomic_inc(&sds->nr_overloaded);
> + } else {
> + cpumask_clear_cpu(cpu, sdo_mask(sds));
> + atomic_dec(&sds->nr_overloaded);
> + }
> +
> + rq->overloaded = overloaded;
> +unlock:
> + rcu_read_unlock();
> +}
> +
> +#else
> +
> +static inline void update_overload_status(struct rq *rq) { }
> +
> #endif /* CONFIG_SMP */
>
> static unsigned long wakeup_gran(struct sched_entity *se)
> @@ -7315,6 +7355,8 @@ done: __maybe_unused;
> if (new_tasks > 0)
> goto again;
>
> + update_overload_status(rq);
> +

So here, we are calling update_overload_status() after
newidle_balance(). If we pulled a single task as part of
newidle_balance(), then with your current code we do not
update the overload status. While this should get remedied
at the next tick, should we move update_overload_status(rq)
prior to the new_tasks > 0 check?
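
That is, roughly:

	update_overload_status(rq);

	if (new_tasks > 0)
		goto again;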


--
Thanks and Regards
gautham.

2022-02-24 16:21:28

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.

I'm much confused, there is an explicit new-idle balancer and a periodic
idle balancer already there.

2022-02-24 16:33:41

by Vincent Guittot

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

On Thu, 24 Feb 2022 at 16:20, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> > Current load balancing is mainly based on cpu capacity
> > and task util, which makes sense in the POV of overall
> > throughput. While there still might be some improvement
> > can be done by reducing number of overloaded cfs rqs if
> > sched-idle or idle rq exists.
>
> I'm much confused, there is an explicit new-idle balancer and a periodic
> idle balancer already there.

I agree, You failed to explain why newly_idle and periodic idle load
balance are not enough and we need this new one

2022-02-24 16:38:12

by Abel Wu

Subject: Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus

Hi Gautham, thanks for your comment!

On 2/24/22 3:10 PM, Gautham R. Shenoy wrote:
> Hello Abel,
>
> (+ Aubrey Li, Srikar)
>
> On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
>> The overloaded cfs rqs can cause performance issues to
>> both task types:
>>
>> - for latency critical tasks like SCHED_NORMAL,
>> time of waiting in the rq will increase and
>> result in higher pct99 latency, and
>>
>> - batch tasks may not be able to make full use
>> of cpu capacity if sched-idle rq exists, thus
>> presents poorer throughput.
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
>
> This is an interesting approach to minimise the tail latencies by
> keeping track of the overloaded cpus in the LLC so that
> idle/sched-idle CPUs can pull from them. This approach contrasts with the
> following approaches that were previously tried :
>
> 1. Maintain the idle cpumask at the LLC level by Aubrey Li
> https://lore.kernel.org/all/[email protected]/

It's a similar approach to SIS, just from a different angle. Both have
pros and cons, and I couldn't tell which one is more appropriate..
But since SIS can fail to find an idle cpu due to scaling issues,
sched-idle load balancing might be a valuable supplement that helps
consume the idle/sched-idle cpus.

>
> 2. Maintain the identity of the idle core itself at the LLC level, by Srikar :
> https://lore.kernel.org/lkml/[email protected]/

Srikar's efforts seem focused on idle core searching, which I think
has a different goal from my approach.
The case of short-running tasks pointed out by Vincent should not be
a problem for updating the overloaded cpu mask/counter, since they
are not updated when a cpu becomes busy, nor when a cpu frequently
goes idle within a tick period.

>
> There have been concerns in the past about having to update the shared
> mask/counter at regular intervals. Srikar, Aubrey any thoughts on this
> ?
>

I'm afraid I didn't fully catch up with those earlier discussions;
it would be appreciated if someone could shed some light, thanks!

>
>
>>
>> Signed-off-by: Abel Wu <[email protected]>
>> ---
>> include/linux/sched/topology.h | 10 ++++++++++
>> kernel/sched/core.c | 1 +
>> kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/sched.h | 6 ++++++
>> kernel/sched/topology.c | 4 +++-
>> 5 files changed, 63 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 56cffe42abbc..03c9c81dc886 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -81,6 +81,16 @@ struct sched_domain_shared {
>> atomic_t ref;
>> atomic_t nr_busy_cpus;
>> int has_idle_cores;
>> +
>> + /*
>> + * The above varibles are used in idle path and
>> + * select_task_rq, and the following two are
>> + * mainly updated in tick. They are all hot but
>> + * for different usage, so start a new cacheline
>> + * to avoid false sharing.
>> + */
>> + atomic_t nr_overloaded ____cacheline_aligned;
>> + unsigned long overloaded[]; /* Must be last */
>> };
>>
>> struct sched_domain {
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 1d863d7f6ad7..a6da2998ec49 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -9423,6 +9423,7 @@ void __init sched_init(void)
>> rq->wake_stamp = jiffies;
>> rq->wake_avg_idle = rq->avg_idle;
>> rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>> + rq->overloaded = 0;
>>
>> INIT_LIST_HEAD(&rq->cfs_tasks);
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5c4bfffe8c2c..0a0438c3319b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6968,6 +6968,46 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>
>> return newidle_balance(rq, rf) != 0;
>> }
>> +
>> +static inline int cfs_rq_overloaded(struct rq *rq)
>> +{
>> + return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
>> +}
>> +
>> +/* Must be called with rq locked */
>> +static void update_overload_status(struct rq *rq)
>> +{
>> + struct sched_domain_shared *sds;
>> + int overloaded = cfs_rq_overloaded(rq);
>> + int cpu = cpu_of(rq);
>> +
>> + lockdep_assert_rq_held(rq);
>> +
>> + if (rq->overloaded == overloaded)
>> + return;
>> +
>> + rcu_read_lock();
>> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> + if (unlikely(!sds))
>> + goto unlock;
>> +
>> + if (overloaded) {
>> + cpumask_set_cpu(cpu, sdo_mask(sds));
>> + atomic_inc(&sds->nr_overloaded);
>> + } else {
>> + cpumask_clear_cpu(cpu, sdo_mask(sds));
>> + atomic_dec(&sds->nr_overloaded);
>> + }
>> +
>> + rq->overloaded = overloaded;
>> +unlock:
>> + rcu_read_unlock();
>> +}
>> +
>> +#else
>> +
>> +static inline void update_overload_status(struct rq *rq) { }
>> +
>> #endif /* CONFIG_SMP */
>>
>> static unsigned long wakeup_gran(struct sched_entity *se)
>> @@ -7315,6 +7355,8 @@ done: __maybe_unused;
>> if (new_tasks > 0)
>> goto again;
>>
>> + update_overload_status(rq);
>> +
>
> So here, we are calling update_overload_status() after
> newidle_balance(). If we had pulled a single task as a part of
> newidle_balance(), in your current code, we do not update the overload
> status. While this should get remedied in the next tick, should we
> move update_overload_status(rq) prior to the new_tasks > 0 check ?

A single task won't change the overloaded status :)
And I think it would be better not to do an update even if several
tasks were pulled, because that would break the rate limiting, which
is undesired.

Best Regards,
Abel

>
>
> --
> Thanks and Regards
> gautham.

2022-02-24 16:48:20

by Mel Gorman

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.
>
> An CFS runqueue is considered overloaded when there are
> more than one pullable non-idle tasks on it (since sched-
> idle cpus are treated as idle cpus). And idle tasks are
> counted towards rq->cfs.idle_h_nr_running, that is either
> assigned SCHED_IDLE policy or placed under idle cgroups.
>

It's not clear how your tests evaluated the balancing of SCHED_IDLE tasks
versus the existing idle balancing and isolated that impact. I suspect
the tests may have primarily measured the effect of the SIS filter.

> So in short, the goal of the sched-idle balancing is to
> let the *non-idle tasks* make full use of cpu resources.
> To achieve that, we mainly do two things:
>
> - pull non-idle tasks for sched-idle or idle rqs
> from the overloaded ones, and
>
> - prevent pulling the last non-idle task in an rq
>
> The mask of overloaded cpus is updated in periodic tick
> and the idle path at the LLC domain basis. This cpumask
> will also be used in SIS as a filter, improving idle cpu
> searching.
>

As the overloaded mask may be updated on each idle, it could be a
significant source of cache misses between CPUs sharing the domain for
workloads that rapidly idle so there should be data on whether cache misses
are increased heavily. It also potentially delays the CPU reaching idle
but it may not be by much.

The filter may be out of date. It takes up to one tick to detect
overloaded and the filter to have a positive impact. As a CPU is not
guaranteed to enter idle if there is at least one CPU-bound task, it may
also be up to 1 tick before the mask is cleared. I'm not sure this is a
serious problem though as SIS would not pick the CPU with the CPU-bound
task anyway.

At minimum, the filter should be split out and considered first as it
is the most likely reason why a performance difference was measured. It
has some oddities like why nr_overloaded is really a boolean and as
it's under rq lock, it's not clear why it's atomic. The changelog
would ideally contain some comment on the impact to cache misses
if any and some sort of proof that SIS search depth is reduced which
https://lore.kernel.org/lkml/[email protected]/
may be some help.

At that point, compare the idle task balancing on top to isolate how
much it improves things if any and identify why existing balancing is
insufficient. Split out the can_migrate_task change beforehand in case it
is the main source of difference as opposed to the new balancing mechanism.

--
Mel Gorman
SUSE Labs

2022-02-25 07:17:16

by Abel Wu

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

Hi Peter,

On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>> Current load balancing is mainly based on cpu capacity
>> and task util, which makes sense in the POV of overall
>> throughput. While there still might be some improvement
>> can be done by reducing number of overloaded cfs rqs if
>> sched-idle or idle rq exists.
>
> I'm much confused, there is an explicit new-idle balancer and a periodic
> idle balancer already there.

The two balancers are triggered on the rqs that have no tasks on them,
and load_balance() doesn't seem to show a preference for non-idle tasks,
so it is possible that only idle tasks are pulled during load balance
while overloaded rqs (rq->cfs.h_nr_running > 1) still exist. As a
result the normal tasks, mostly latency-critical ones in our case, on
such an overloaded rq still suffer from waiting for each other. I
observed this through perf sched.

IOW the main difference from the POV of load_balance() between the
latency-critical tasks and the idle ones is load.

The sched-idle balancer is triggered periodically on the sched-idle rqs
and on the newly-idle ones. It does a 'fast' pull of non-idle tasks from
the overloaded rqs to the sched-idle/idle ones so that the non-idle
tasks can make full use of cpu resources.

The sched-idle balancer only focuses on non-idle tasks' performance, so
it can introduce overall load imbalance, and that's why I put it before
load_balance().

Best Regards,
Abel

2022-02-25 10:43:18

by Abel Wu

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

Hi Mel,

On 2/25/22 12:47 AM, Mel Gorman Wrote:
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>> Current load balancing is mainly based on cpu capacity
>> and task util, which makes sense in the POV of overall
>> throughput. While there still might be some improvement
>> can be done by reducing number of overloaded cfs rqs if
>> sched-idle or idle rq exists.
>>
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
>
> It's not clear how your tests evaluated the balancing of SCHED_IDLE tasks
> versus the existing idle balancing and isolated that impact. I suspect
> the tests may primarily measured the effect of the SIS filter.

The sched-idle balancing doesn't really care about the idle tasks.
It tries to improve the non-idle tasks' performance by spreading
them out to make full use of cpu capacity.

I will do some individual tests on SIS and on the sched-idle
balancer separately, and keep you informed.

>
>> So in short, the goal of the sched-idle balancing is to
>> let the *non-idle tasks* make full use of cpu resources.
>> To achieve that, we mainly do two things:
>>
>> - pull non-idle tasks for sched-idle or idle rqs
>> from the overloaded ones, and
>>
>> - prevent pulling the last non-idle task in an rq
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
>>
>
> As the overloaded mask may be updated on each idle, it could be a
> significant source of cache misses between CPUs sharing the domain for
> workloads that rapidly idle so there should be data on whether cache misses
> are increased heavily. It also potentially delays the CPU reaching idle
> but it may not be by much.

Yes, that's why I cached the overloaded status in rq->overloaded. So
in this case of short-running tasks, when cpus rapidly/frequently go
idle, the cpumask/counter are not actually updated if the cached
status is already 0 (not overloaded).
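
That is, the early return at the top of update_overload_status()
in patch 1:

	if (rq->overloaded == overloaded)
		return;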

>
> The filter may be out of date. It takes up to one tick to detect
> overloaded and the filter to have a positive impact. As a CPU is not
> guaranteed to enter idle if there is at least one CPU-bound task, it may
> also be up to 1 tick before the mask is cleared. I'm not sure this is a
> serious problem though as SIS would not pick the CPU with the CPU-bound
> task anyway.

Yes, it can be out of date, but increasing the accuracy means more
frequent updates, which would introduce the cache issues you mentioned
above. Rate-limiting the updates to the tick, at the LLC level, might
be an acceptable tradeoff I presume.

>
> At minimum, the filter should be split out and considered first as it
> is the most likely reason why a performance difference was measured. It
> has some oddities like why nr_overloaded is really a boolean and as
> it's under rq lock, it's not clear why it's atomic. The changelog
> would ideally contain some comment on the impact to cache misses
> if any and some sort of proof that SIS search depth is reduced which
> https://lore.kernel.org/lkml/[email protected]/
> may be some help.
>
> At that point, compare the idle task balancing on top to isolate how
> much it improves things if any and identify why existing balancing is
> insufficient. Split out the can_migrate_task change beforehand in case it
> is the main source of difference as opposed to the new balancing mechanism.
>

The nr_overloaded counter sits in the shared domain structure, so it
is shared across the LLC domain and needs to be atomic_t, while
rq->overloaded is a boolean updated under the rq lock. Maybe the
naming causes some confusion; please enlighten me if you have a
better idea :)

And yes, I agree it would be nice to show test results on SIS search
depth. I actually did the test, but the result didn't show a
reduction in depth, because sched-idle balancing also consumes the
sched-idle/idle cpus. I will apply your patch and run some further
tests on that, thanks.

Best Regards,
Abel

2022-02-25 12:48:12

by Abel Wu

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

On 2/24/22 11:29 PM, Vincent Guittot Wrote:
> On Thu, 24 Feb 2022 at 16:20, Peter Zijlstra <[email protected]> wrote:
>>
>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>>> Current load balancing is mainly based on cpu capacity
>>> and task util, which makes sense in the POV of overall
>>> throughput. While there still might be some improvement
>>> can be done by reducing number of overloaded cfs rqs if
>>> sched-idle or idle rq exists.
>>
>> I'm much confused, there is an explicit new-idle balancer and a periodic
>> idle balancer already there.
>
> I agree, You failed to explain why newly_idle and periodic idle load
> balance are not enough and we need this new one

Hi Vincent, sorry for not giving a clearer explanation. Please check
my previous email replying to Peter, thanks.

Best Regards,
Abel

2022-02-25 13:10:57

by Mel Gorman

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
> > As the overloaded mask may be updated on each idle, it could be a
> > significant source of cache misses between CPUs sharing the domain for
> > workloads that rapidly idle so there should be data on whether cache misses
> > are increased heavily. It also potentially delays the CPU reaching idle
> > but it may not be by much.
>
> Yes, that's why I cached overloaded status in rq->overloaded. So in
> this case of short running tasks, when cpus rapidly/frequently go
> idle, the cpumask/counter are not actually updated if the cached
> status is already 0 (not overloaded).
>

Which is a good idea in some respects. It tries to limit the number of
updates and treats it as a boolean but it's probably prone to races.

> > The filter may be out of date. It takes up to one tick to detect
> > overloaded and the filter to have a positive impact. As a CPU is not
> > guaranteed to enter idle if there is at least one CPU-bound task, it may
> > also be up to 1 tick before the mask is cleared. I'm not sure this is a
> > serious problem though as SIS would not pick the CPU with the CPU-bound
> > task anyway.
>
> Yes, it can be out of date, but increasing the accuracy means more
> frequent update which would introduce cache issues you mentioned
> above. Rate limit the updating to tick at the LLC basis might be an
> acceptable tradeoff I presume.
>
> >
> > At minimum, the filter should be split out and considered first as it
> > is the most likely reason why a performance difference was measured. It
> > has some oddities like why nr_overloaded is really a boolean and as
> > it's under rq lock, it's not clear why it's atomic. The changelog
> > would ideally contain some comment on the impact to cache misses
> > if any and some sort of proof that SIS search depth is reduced which
> > https://lore.kernel.org/lkml/[email protected]/
> > may be some help.
> >
> > At that point, compare the idle task balancing on top to isolate how
> > much it improves things if any and identify why existing balancing is
> > insufficient. Split out the can_migrate_task change beforehand in case it
> > is the main source of difference as opposed to the new balancing mechanism.
> >
>
> The nr_overloaded sits in shared domain structure thus shared in
> LLC domain and needs to be atomic_t, while rq->overloaded is a
> boolean updated under rq lock. Maybe the naming can cause some
> confusion, please lighten me up if you have better idea :)
>

The naming doesn't help because it's not really "the number of overloaded
rq's". atomic_t would be slightly safer against parallel updates but
it's race prone. I didn't think about it deeply but I suspect that two
separate rq's could disagree on what the boolean value should be if one rq
is overloaded, the other is not and they are updating via the idle path at
the same time. This probably can happen because the locking is rq based
and the cpumask is shared. On the flip side, making it an accurate count
would result in more updates and incur cache misses as well as probably
needing a cpumask check instead of a nr_overloaded comparison to determine
if the rq is already accounted for so it costs more. You are very likely
trading accuracy versus cost of update.

Whichever choice you make, add comments on the pros/cons and describe
the limitation of either approach. e.g. if overloaded is effectively a
boolean, describe in a comment the limitations.

> And yes, I agree it would be nice if test result on SIS search
> depth can be shown, and I actually did the test, but the result
> didn't show a reduction in depth due to sched-idle balancing
> will also consume sched-idle/idle cpus. I will apply your patch
> and make some further tests on that, thanks.
>

Just remember to use the patch to measure changes in SIS depth but
performance figures should not include the patch as the schedstat
overhead distorts results.

Also place the filter first and do any measurements of any change to
balancing versus the filter. I'm suggesting placing the filter first
because it's less controversial than a new balancer. Just be aware that
the filter alone is not a guarantee of merging as there have been a few
approaches to filtering and so far all of them had downsides on either cost
or accuracy. IIRC the only active approach to reducing search cost in SIS
is https://lore.kernel.org/all/[email protected]/
and it's likely to get a new version due to
https://lore.kernel.org/all/[email protected]/.
It also updates sched_domain_shared but with a single boolean instead of
an atomic+cpumask.

--
Mel Gorman
SUSE Labs

2022-02-25 13:35:58

by Abel Wu

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

On 2/25/22 4:29 PM, Vincent Guittot Wrote:
> On Fri, 25 Feb 2022 at 07:46, Abel Wu <[email protected]> wrote:
>>
>> Hi Peter,
>>
>> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
>>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>>>> Current load balancing is mainly based on cpu capacity
>>>> and task util, which makes sense in the POV of overall
>>>> throughput. While there still might be some improvement
>>>> can be done by reducing number of overloaded cfs rqs if
>>>> sched-idle or idle rq exists.
>>>
>>> I'm much confused, there is an explicit new-idle balancer and a periodic
>>> idle balancer already there.
>>
>> The two balancers are triggered on the rqs that have no tasks on them,
>> and load_balance() seems don't show a preference for non-idle tasks so
>
> The load balance will happen at the idle pace if a sched_idle task is
> running on the cpu so you will have an ILB on each cpu that run a
> sched-idle task

I'm afraid I don't quite follow you: since the sched-idle balancer
doesn't touch the ILB part, can you elaborate on this? Thanks.

>
>> there might be possibility that only idle tasks are pulled during load
>> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
>
> There is a LB_MIN feature (disable by default) that filters task with
> very low load ( < 16) which includes sched-idle task which has a max
> load of 3

This feature might not be that friendly to the situation where only
sched-idle tasks are running in the system. And this situation can
last for more than half a day in our co-location systems, in which
the training/batch tasks are placed under idle cgroups or directly
assigned SCHED_IDLE.

>
>> result the normal tasks, mostly latency-critical ones in our case, on
>> that overloaded rq still suffer waiting for each other. I observed this
>> through perf sched.
>>
>> IOW the main difference from the POV of load_balance() between the
>> latency-critical tasks and the idle ones is load.
>>
>> The sched-idle balancer is triggered on the sched-idle rqs periodically
>> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
>> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
>> make full use of cpu resources.
>>
>> The sched-idle balancer only focuses on non-idle tasks' performance, so
>> it can introduce overall load imbalance, and that's why I put it before
>> load_balance().
>
> According to the very low weight of a sched-idle task, I don't expect
> much imbalance because of sched-idle tasks. But this also depends of
> the number of sched-idle task.
>
>
>>
>> Best Regards,
>> Abel

2022-02-25 16:36:14

by Abel Wu

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

Hi Mel, thanks a lot for your review!

On 2/25/22 6:16 PM, Mel Gorman Wrote:
> On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
>>> As the overloaded mask may be updated on each idle, it could be a
>>> significant source of cache misses between CPUs sharing the domain for
>>> workloads that rapidly idle so there should be data on whether cache misses
>>> are increased heavily. It also potentially delays the CPU reaching idle
>>> but it may not be by much.
>>
>> Yes, that's why I cached overloaded status in rq->overloaded. So in
>> this case of short running tasks, when cpus rapidly/frequently go
>> idle, the cpumask/counter are not actually updated if the cached
>> status is already 0 (not overloaded).
>>
>
> Which is a good idea in some respects. It tries to limit the number of
> updates and treats it as a boolean but it's probably prone to races.
>
>>> The filter may be out of date. It takes up to one tick to detect
>>> overloaded and the filter to have a positive impact. As a CPU is not
>>> guaranteed to enter idle if there is at least one CPU-bound task, it may
>>> also be up to 1 tick before the mask is cleared. I'm not sure this is a
>>> serious problem though as SIS would not pick the CPU with the CPU-bound
>>> task anyway.
>>
>> Yes, it can be out of date, but increasing the accuracy means more
>> frequent update which would introduce cache issues you mentioned
>> above. Rate limit the updating to tick at the LLC basis might be an
>> acceptable tradeoff I presume.
>>
>>>
>>> At minimum, the filter should be split out and considered first as it
>>> is the most likely reason why a performance difference was measured. It
>>> has some oddities like why nr_overloaded is really a boolean and as
>>> it's under rq lock, it's not clear why it's atomic. The changelog
>>> would ideally contain some comment on the impact to cache misses
>>> if any and some sort of proof that SIS search depth is reduced which
>>> https://lore.kernel.org/lkml/[email protected]/
>>> may be some help.
>>>
>>> At that point, compare the idle task balancing on top to isolate how
>>> much it improves things if any and identify why existing balancing is
>>> insufficient. Split out the can_migrate_task change beforehand in case it
>>> is the main source of difference as opposed to the new balancing mechanism.
>>>
>>
>> The nr_overloaded sits in shared domain structure thus shared in
>> LLC domain and needs to be atomic_t, while rq->overloaded is a
>> boolean updated under rq lock. Maybe the naming can cause some
>> confusion, please lighten me up if you have better idea :)
>>
>
> The naming doesn't help because it's not really "the number of overloaded
> rq's". atomic_t would be slightly safer against parallel updates but
> it's race prone. I didn't think about it deeply but I suspect that two
> separate rq's could disagree on what the boolean value should be if one rq
> is overloaded, the other is not and they are updating via the idle path at
> the same time. This probably can happen because the locking is rq based
> and the cpumask is shared. On the flip side, making it an accurate count
> would result in more updates and incur cache misses as well as probably
> needing a cpumask check instead of a nr_overloaded comparison to determine
> if the rq is already accounted for so it costs more. You are very likely
> trading accuracy versus cost of update.

The boolean value (rq->overloaded) is accessed under the rq lock, and
almost always by its own rq, except for the very rare case in
sched_idle_balance() where the double check on cfs_rq_overloaded()
fails. So this value should be accurate and have good data locality.

But as you said, the nr_overloaded counter and the cpumask are race
prone in the following pattern in my patches:

	if (nr_overloaded > 0)
		/* nr_overloaded can be zero now */
		read(overloaded_mask);

Since the mask is accessed without the rq locked, the cost should not
be too high. This is quite similar to the idle_cpu() usage in SIS, I
guess.

>
> Whichever choice you make, add comments on the pros/cons and describe
> the limitation of either approach. e.g. if overloaded is effectively a
> boolean, describe in a comment the limitations.

OK, will do.

>
>> And yes, I agree it would be nice if test result on SIS search
>> depth can be shown, and I actually did the test, but the result
>> didn't show a reduction in depth due to sched-idle balancing
>> will also consume sched-idle/idle cpus. I will apply your patch
>> and make some further tests on that, thanks.
>>
>
> Just remember to use the patch to measure changes in SIS depth but
> performance figures should not include the patch as the schedstat
> overhead distorts results.

Yes, agreed.

>
> Also place the filter first and do any measurements of any change to
> balancing versus the filter. I'm suggesting placing the filter first
> because it's less controversial than a new balancer. Just be aware that
> the filter alone is not a guarantee of merging as there have been a few
> approaches to filtering and so far all of them had downsides on either cost

Yes, understood. I will adjust the patches as you suggested and send v2
together with more tests next week.

> or accuracy. IIRC the only active approach to reducing search cost in SIS
> is https://lore.kernel.org/all/[email protected]/
> and it's likely to get a new version due to
> https://lore.kernel.org/all/[email protected]/.
> It also updates sched_domain_shared but with a single boolean instead of
> an atomic+cpumask.
>

Chen Yu's patch disables idle cpu searching in SIS when the LLC domain
is overloaded (that is, 85% capacity usage), and Peter suggested that
he use this metric to replace/improve the SIS_PROP feature so that the
search depth varies more gently.

I don't think either of the two approaches conflicts with mine: they
reduce the search effort when the system is busy and cpus are unlikely
to be idle, while mine consumes the sched-idle/idle cpus themselves by
pulling non-idle tasks from the overloaded rqs, so that there will be
fewer sched-idle/idle cpus.

Thanks and best regards,
Abel

2022-02-25 16:37:29

by Vincent Guittot

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

On Fri, 25 Feb 2022 at 07:46, Abel Wu <[email protected]> wrote:
>
> Hi Peter,
>
> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> > On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> >> Current load balancing is mainly based on cpu capacity
> >> and task util, which makes sense in the POV of overall
> >> throughput. While there still might be some improvement
> >> can be done by reducing number of overloaded cfs rqs if
> >> sched-idle or idle rq exists.
> >
> > I'm much confused, there is an explicit new-idle balancer and a periodic
> > idle balancer already there.
>
> The two balancers are triggered on the rqs that have no tasks on them,
> and load_balance() seems don't show a preference for non-idle tasks so

The load balance will happen at the idle pace if a sched_idle task is
running on the cpu, so you will have an ILB on each cpu that runs a
sched-idle task.

> there might be possibility that only idle tasks are pulled during load
> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a

There is an LB_MIN feature (disabled by default) that filters out
tasks with very low load (< 16), which includes sched-idle tasks,
which have a max load of 3.
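
For reference, the check in detach_tasks() for the migrate_load case
looks roughly like this (exact form depends on the kernel version):

	load = max_t(unsigned long, task_h_load(p), 1);

	if (sched_feat(LB_MIN) &&
	    load < 16 && !env->sd->nr_balance_failed)
		goto next;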

> result the normal tasks, mostly latency-critical ones in our case, on
> that overloaded rq still suffer waiting for each other. I observed this
> through perf sched.
>
> IOW the main difference from the POV of load_balance() between the
> latency-critical tasks and the idle ones is load.
>
> The sched-idle balancer is triggered on the sched-idle rqs periodically
> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
> make full use of cpu resources.
>
> The sched-idle balancer only focuses on non-idle tasks' performance, so
> it can introduce overall load imbalance, and that's why I put it before
> load_balance().

Given the very low weight of a sched-idle task, I don't expect much
imbalance because of sched-idle tasks. But this also depends on the
number of sched-idle tasks.


>
> Best Regards,
> Abel

2022-02-26 02:20:39

by Vincent Guittot

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

On Fri, 25 Feb 2022 at 11:46, Abel Wu <[email protected]> wrote:
>
> On 2/25/22 4:29 PM, Vincent Guittot Wrote:
> > On Fri, 25 Feb 2022 at 07:46, Abel Wu <[email protected]> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> >>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> >>>> Current load balancing is mainly based on cpu capacity
> >>>> and task util, which makes sense in the POV of overall
> >>>> throughput. While there still might be some improvement
> >>>> can be done by reducing number of overloaded cfs rqs if
> >>>> sched-idle or idle rq exists.
> >>>
> >>> I'm much confused, there is an explicit new-idle balancer and a periodic
> >>> idle balancer already there.
> >>
> >> The two balancers are triggered on the rqs that have no tasks on them,
> >> and load_balance() seems don't show a preference for non-idle tasks so
> >
> > The load balance will happen at the idle pace if a sched_idle task is
> > running on the cpu so you will have an ILB on each cpu that run a
> > sched-idle task
>
> I'm afraid I don't quite follow you, since sched-idle balancer doesn't
> touch the ILB part, can you elaborate on this? Thanks.

I was referring to your sentence "The two balancers are triggered on
the rqs that have no tasks on them". When there are only sched-idle
tasks on a rq, load_balance() behaves like the idle load balance does
when there is no task at all, i.e. it runs just as often.

>
> >
> >> there might be possibility that only idle tasks are pulled during load
> >> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
> >
> > There is a LB_MIN feature (disable by default) that filters task with
> > very low load ( < 16) which includes sched-idle task which has a max
> > load of 3

but we could easily change this: if there is no sched-idle cpu, then
the LB can migrate only (non-idle) cfs tasks, otherwise it can migrate
sched-idle tasks as well, instead of creating another side channel.

>
> This feature might not that friendly to the situation that only
> sched-idle tasks are running in the system. And this situation
> can last more than half a day in our co-location systems in which
> the training/batch tasks are placed under idle groups or directly
> assigned to SCHED_IDLE.
>
> >
> >> result the normal tasks, mostly latency-critical ones in our case, on
> >> that overloaded rq still suffer waiting for each other. I observed this
> >> through perf sched.
> >>
> >> IOW the main difference from the POV of load_balance() between the
> >> latency-critical tasks and the idle ones is load.
> >>
> >> The sched-idle balancer is triggered on the sched-idle rqs periodically
> >> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
> >> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
> >> make full use of cpu resources.
> >>
> >> The sched-idle balancer only focuses on non-idle tasks' performance, so
> >> it can introduce overall load imbalance, and that's why I put it before
> >> load_balance().
> >
> > According to the very low weight of a sched-idle task, I don't expect
> > much imbalance because of sched-idle tasks. But this also depends of
> > the number of sched-idle task.
> >
> >
> >>
> >> Best Regards,
> >> Abel

2022-02-27 09:44:31

by Li, Aubrey

Subject: Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus

On 2/24/22 3:10 PM, Gautham R. Shenoy wrote:
> Hello Abel,
>
> (+ Aubrey Li, Srikar)
>
> On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
>> The overloaded cfs rqs can cause performance issues to
>> both task types:
>>
>> - for latency critical tasks like SCHED_NORMAL,
>> time of waiting in the rq will increase and
>> result in higher pct99 latency, and
>>
>> - batch tasks may not be able to make full use
>> of cpu capacity if sched-idle rq exists, thus
>> presents poorer throughput.
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
>
> This is an interesting approach to minimise the tail latencies by
> keeping track of the overloaded cpus in the LLC so that
> idle/sched-idle CPUs can pull from them. This approach contrasts with the
> following approaches that were previously tried :
>
> 1. Maintain the idle cpumask at the LLC level by Aubrey Li
> https://lore.kernel.org/all/[email protected]/
>
> 2. Maintain the identity of the idle core itself at the LLC level, by Srikar :
> https://lore.kernel.org/lkml/[email protected]/
>
> There have been concerns in the past about having to update the shared
> mask/counter at regular intervals. Srikar, Aubrey any thoughts on this
> ?
>
https://lkml.org/lkml/2022/2/7/1129

2022-03-02 10:50:49

by Josh Don

Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing

On Fri, Feb 25, 2022 at 5:36 AM Abel Wu <[email protected]> wrote:
[snip]
> > Also place the filter first and do any measurements of any change to
> > balancing versus the filter. I'm suggesting placing the filter first
> > because it's less controversial than a new balancer. Just be aware that
> > the filter alone is not a guarantee of merging as there have been a few
> > approaches to filtering and so far all of them had downsides on either cost
>
> Yes, understood. I will adjust the patches as you suggested and send v2
> together with more tests next week.

+1 to trying the filter rather than introducing a new balance path.

We've found the sched_idle_cpu() checks in the wakeup path to be
adequate in allowing non-idle tasks to fully consume cpu resources
(but that of course relies on wakeup balancing, and not periodic
balancing).

Please cc me on the next series.

Thanks,
Josh