2024-03-04 09:49:21

by Ingo Molnar

Subject: [PATCH -v3 0/9] sched/balancing: Misc updates & cleanups

Improve a handful of things I noticed while looking through the
scheduler balancing code. No change in functionality intended.

Changes in -v3:

- Fix a show_schedstat() assumption found by Shrikanth Hegde
- Rename the scheduler softirq handler

Thanks,

Ingo

======================================>

Ingo Molnar (8):
sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag
sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions
sched/balancing: Change comment formatting to not overlap Git conflict marker lines
sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK
sched/balancing: Update run_rebalance_domains() comments
sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'
sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'
sched/balancing: Rename run_rebalance_domains() => sched_balance_softirq()

Shrikanth Hegde (1):
sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat()

Documentation/scheduler/sched-domains.rst | 2 +-
Documentation/translations/zh_CN/scheduler/sched-domains.rst | 2 +-
include/linux/sched/idle.h | 3 +-
kernel/sched/fair.c | 108 +++++++++++++++++--------------
kernel/sched/stats.c | 3 +-
5 files changed, 65 insertions(+), 53 deletions(-)

--
2.40.1



2024-03-04 09:49:49

by Ingo Molnar

Subject: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

The 'balancing' spinlock added in:

08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")

... is taken when the SD_SERIALIZE flag is set in a domain, but in reality it
is a glorified global atomic flag serializing the load-balancing of
those domains.

It doesn't have any explicit locking semantics per se: we just
spin_trylock() it.

Turn it into a ... global atomic flag. This makes it more
clear what is going on here, and reduces overhead and code
size a bit:

# kernel/sched/fair.o: [x86-64 defconfig]

text data bss dec hex filename
60730 2721 104 63555 f843 fair.o.before
60718 2721 104 63543 f837 fair.o.after

Also document the flag a bit.

No change in functionality intended.
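
A minimal user-space sketch of the same serialization pattern, using C11
atomics as stand-ins for the kernel's atomic_cmpxchg_acquire() /
atomic_set_release() (illustrative only, not part of the patch):

  #include <stdatomic.h>
  #include <stdbool.h>

  /* 0: no serialized balancing pass is running, 1: one is running */
  static atomic_int balance_running;

  static bool balance_try_begin(void)
  {
          int expected = 0;

          /* acquire on success; pairs with the release in balance_end() */
          return atomic_compare_exchange_strong_explicit(&balance_running,
                          &expected, 1,
                          memory_order_acquire, memory_order_relaxed);
  }

  static void balance_end(void)
  {
          atomic_store_explicit(&balance_running, 0, memory_order_release);
  }

A contender that fails balance_try_begin() simply skips the pass rather
than spinning, matching the spin_trylock() behavior being replaced.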

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a16129f9a5c..64ae3d8dc93b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11633,7 +11633,20 @@ static int active_load_balance_cpu_stop(void *data)
return 0;
}

-static DEFINE_SPINLOCK(balancing);
+/*
+ * This flag serializes load-balancing passes over large domains
+ * (such as SD_NUMA) - only once load-balancing instance may run
+ * at a time, to reduce overhead on very large systems with lots
+ * of CPUs and large NUMA distances.
+ *
+ * - Note that load-balancing passes triggered while another one
+ * is executing are skipped and not re-tried.
+ *
+ * - Also note that this does not serialize sched_balance_domains()
+ * execution, as non-SD_SERIALIZE domains will still be
+ * load-balanced in parallel.
+ */
+static atomic_t sched_balance_running = ATOMIC_INIT(0);

/*
* Scale the max load_balance interval with the number of CPUs in the system.
@@ -11711,7 +11724,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

need_serialize = sd->flags & SD_SERIALIZE;
if (need_serialize) {
- if (!spin_trylock(&balancing))
+ if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
goto out;
}

@@ -11729,7 +11742,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
interval = get_sd_balance_interval(sd, busy);
}
if (need_serialize)
- spin_unlock(&balancing);
+ atomic_set_release(&sched_balance_running, 0);
out:
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
--
2.40.1


2024-03-04 09:49:55

by Ingo Molnar

Subject: [PATCH 2/9] sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat()

From: Shrikanth Hegde <[email protected]>

Shrikanth Hegde reported that show_schedstat() output broke when
the ordering of the definitions in 'enum cpu_idle_type' is changed,
because show_schedstat() assumed that 'CPU_IDLE' is 0.

Fix it before we change the definition order & values.

[ mingo: Added changelog. ]

Signed-off-by: Shrikanth Hegde <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/stats.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..85277953cc72 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -150,8 +150,7 @@ static int show_schedstat(struct seq_file *seq, void *v)

seq_printf(seq, "domain%d %*pb", dcount++,
cpumask_pr_args(sched_domain_span(sd)));
- for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
- itype++) {
+ for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
seq_printf(seq, " %u %u %u %u %u %u %u %u",
sd->lb_count[itype],
sd->lb_balanced[itype],
--
2.40.1


2024-03-04 09:50:21

by Ingo Molnar

Subject: [PATCH 3/9] sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions

The cpu_idle_type enum has the confusingly inverted property
that 'not idle' is 1, and 'idle' is '0'.

This resulted in a number of unnecessary complications in the code.

Reverse the order, remove the CPU_NOT_IDLE type, and convert
all code to a natural boolean form.

It's much more readable:

- enum cpu_idle_type idle = this_rq->idle_balance ?
- CPU_IDLE : CPU_NOT_IDLE;
-
+ enum cpu_idle_type idle = this_rq->idle_balance;

--------------------------------

- if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+ if (!env->idle || !busiest->sum_nr_running)

--------------------------------

And gets rid of the double negation in these usages:

- if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+ if (env->idle && env->src_rq->nr_running <= 1)

Furthermore, this makes code much more obvious where there's
differentiation between CPU_IDLE and CPU_NEWLY_IDLE.
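
As an illustrative sketch (not part of the patch): with CPU_NOT_IDLE gone
and 'not idle' implicitly zero, plain boolean tests work, while CPU_IDLE
and CPU_NEWLY_IDLE can still be told apart explicitly where that matters:

  enum cpu_idle_type { CPU_IDLE = 1, CPU_NEWLY_IDLE, CPU_MAX_IDLE_TYPES };

  static const char *idle_type_name(enum cpu_idle_type idle)
  {
          if (!idle)                      /* was: idle == CPU_NOT_IDLE */
                  return "not idle";
          if (idle == CPU_NEWLY_IDLE)     /* explicit differentiation still works */
                  return "newly idle";
          return "idle";                  /* CPU_IDLE */
  }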

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
include/linux/sched/idle.h | 3 +--
kernel/sched/fair.c | 27 ++++++++++++---------------
2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 478084f9105e..4a6423700ffc 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -5,8 +5,7 @@
#include <linux/sched.h>

enum cpu_idle_type {
- CPU_IDLE,
- CPU_NOT_IDLE,
+ CPU_IDLE = 1,
CPU_NEWLY_IDLE,
CPU_MAX_IDLE_TYPES
};
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 64ae3d8dc93b..f11fc6dd39b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9070,7 +9070,7 @@ static int detach_tasks(struct lb_env *env)
* We don't want to steal all, otherwise we may be treated likewise,
* which could at worst lead to a livelock crash.
*/
- if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+ if (env->idle && env->src_rq->nr_running <= 1)
break;

env->loop++;
@@ -9803,7 +9803,7 @@ static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
struct sched_group *group)
{
- if (env->idle == CPU_NOT_IDLE)
+ if (!env->idle)
return false;

/*
@@ -9827,7 +9827,7 @@ static inline long sibling_imbalance(struct lb_env *env,
int ncores_busiest, ncores_local;
long imbalance;

- if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+ if (!env->idle || !busiest->sum_nr_running)
return 0;

ncores_busiest = sds->busiest->cores;
@@ -9927,8 +9927,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_misfit_task_load = rq->misfit_task_load;
*sg_status |= SG_OVERLOAD;
}
- } else if ((env->idle != CPU_NOT_IDLE) &&
- sched_reduced_capacity(rq, env->sd)) {
+ } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
if (sgs->group_misfit_task_load < load)
sgs->group_misfit_task_load = load;
@@ -9940,7 +9939,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_weight = group->group_weight;

/* Check if dst CPU is idle and preferred to this group */
- if (!local_group && env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
+ if (!local_group && env->idle && sgs->sum_h_nr_running &&
sched_group_asym(env, sgs, group))
sgs->group_asym_packing = 1;

@@ -10698,7 +10697,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* waiting task in this overloaded busiest group. Let's
* try to pull it.
*/
- if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
+ if (env->idle && env->imbalance == 0) {
env->migration_type = migrate_task;
env->imbalance = 1;
}
@@ -10913,7 +10912,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;

if (busiest->group_type != group_overloaded) {
- if (env->idle == CPU_NOT_IDLE) {
+ if (!env->idle) {
/*
* If the busiest group is not overloaded (and as a
* result the local one too) but this CPU is already
@@ -11121,7 +11120,7 @@ asym_active_balance(struct lb_env *env)
* the lower priority @env::dst_cpu help it. Do not follow
* CPU priority.
*/
- return env->idle != CPU_NOT_IDLE && sched_use_asym_prio(env->sd, env->dst_cpu) &&
+ return env->idle && sched_use_asym_prio(env->sd, env->dst_cpu) &&
(sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
!sched_use_asym_prio(env->sd, env->src_cpu));
}
@@ -11159,7 +11158,7 @@ static int need_active_balance(struct lb_env *env)
* because of other sched_class or IRQs if more capacity stays
* available on dst_cpu.
*/
- if ((env->idle != CPU_NOT_IDLE) &&
+ if (env->idle &&
(env->src_rq->cfs.h_nr_running == 1)) {
if ((check_cpu_capacity(env->src_rq, sd)) &&
(capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
@@ -11735,8 +11734,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
* env->dst_cpu, so we can't know our idle
* state even if we migrated tasks. Update it.
*/
- idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
- busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
+ idle = idle_cpu(cpu);
+ busy = !idle && !sched_idle_cpu(cpu);
}
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
@@ -12416,9 +12415,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
struct rq *this_rq = this_rq();
- enum cpu_idle_type idle = this_rq->idle_balance ?
- CPU_IDLE : CPU_NOT_IDLE;
-
+ enum cpu_idle_type idle = this_rq->idle_balance;
/*
* If this CPU has a pending nohz_balance_kick, then do the
* balancing on behalf of the other idle CPUs whose ticks are
--
2.40.1


2024-03-04 09:50:22

by Ingo Molnar

Subject: [PATCH 4/9] sched/balancing: Change comment formatting to not overlap Git conflict marker lines

So the scheduler has two such comment blocks, with '=' used as a double underline:

/*
* VRUNTIME
* ========
*

'========' also happens to be a Git conflict marker, throwing off a simple
search in an editor for this pattern.

Change them to '-------' type of underline instead - it looks just as good.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f11fc6dd39b1..934ace69eb30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3679,7 +3679,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,

/*
* VRUNTIME
- * ========
+ * --------
*
* COROLLARY #1: The virtual runtime of the entity needs to be
* adjusted if re-weight at !0-lag point.
@@ -3762,7 +3762,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,

/*
* DEADLINE
- * ========
+ * --------
*
* When the weight changes, the virtual time slope changes and
* we should adjust the relative virtual deadline accordingly.
--
2.40.1


2024-03-04 09:50:44

by Ingo Molnar

Subject: [PATCH 5/9] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK

Fix two typos:

- There's no such thing as 'nohz_balancing_kick', the
flag is named 'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.

- Likewise there's no such thing as a 'pending nohz_balance_kick'
either, the NOHZ_BALANCE_KICK flag is all upper-case.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 934ace69eb30..4c46bffb6a7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12410,14 +12410,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

/*
* run_rebalance_domains is triggered when needed from the scheduler tick.
- * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
+ * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
struct rq *this_rq = this_rq();
enum cpu_idle_type idle = this_rq->idle_balance;
/*
- * If this CPU has a pending nohz_balance_kick, then do the
+ * If this CPU has a pending NOHZ_BALANCE_KICK, then do the
* balancing on behalf of the other idle CPUs whose ticks are
* stopped. Do nohz_idle_balance *before* rebalance_domains to
* give the idle CPUs a chance to load balance. Else we may
--
2.40.1


2024-03-04 09:51:14

by Ingo Molnar

Subject: [PATCH 7/9] sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'

Make them easier to read.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18b7d2999cff..fc4b2df7ddf2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9436,19 +9436,19 @@ static void update_blocked_averages(int cpu)
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
struct sg_lb_stats {
- unsigned long avg_load; /*Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long avg_load; /* Avg load across the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
unsigned long group_capacity;
- unsigned long group_util; /* Total utilization over the CPUs of the group */
- unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
- unsigned int sum_nr_running; /* Nr of tasks running in the group */
- unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
+ unsigned long group_util; /* Total utilization over the CPUs of the group */
+ unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
+ unsigned int sum_nr_running; /* Nr of tasks running in the group */
+ unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
unsigned int idle_cpus;
unsigned int group_weight;
enum group_type group_type;
- unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
- unsigned int group_smt_balance; /* Task on busy SMT be moved */
- unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+ unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
+ unsigned int group_smt_balance; /* Task on busy SMT be moved */
+ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -9460,15 +9460,15 @@ struct sg_lb_stats {
* during load balancing.
*/
struct sd_lb_stats {
- struct sched_group *busiest; /* Busiest group in this sd */
- struct sched_group *local; /* Local group in this sd */
- unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_capacity; /* Total capacity of all groups in sd */
- unsigned long avg_load; /* Average load across all groups in sd */
- unsigned int prefer_sibling; /* tasks should go to sibling first */
+ struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *local; /* Local group in this sd */
+ unsigned long total_load; /* Total load of all groups in sd */
+ unsigned long total_capacity; /* Total capacity of all groups in sd */
+ unsigned long avg_load; /* Average load across all groups in sd */
+ unsigned int prefer_sibling; /* tasks should go to sibling first */

- struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
- struct sg_lb_stats local_stat; /* Statistics of the local group */
+ struct sg_lb_stats busiest_stat; /* Statistics of the busiest group */
+ struct sg_lb_stats local_stat; /* Statistics of the local group */
};

static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
--
2.40.1


2024-03-04 09:51:30

by Ingo Molnar

Subject: [PATCH 8/9] sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'

- Align for readability
- Capitalize consistently

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc4b2df7ddf2..0eb2cafa1509 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9433,17 +9433,17 @@ static void update_blocked_averages(int cpu)
/********** Helpers for find_busiest_group ************************/

/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
+ * sg_lb_stats - stats of a sched_group required for load-balancing:
*/
struct sg_lb_stats {
- unsigned long avg_load; /* Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long group_capacity;
- unsigned long group_util; /* Total utilization over the CPUs of the group */
+ unsigned long avg_load; /* Avg load over the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long group_capacity; /* Capacity over the CPUs of the group */
+ unsigned long group_util; /* Total utilization over the CPUs of the group */
unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
- unsigned int sum_nr_running; /* Nr of tasks running in the group */
+ unsigned int sum_nr_running; /* Nr of all tasks running in the group */
unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
- unsigned int idle_cpus;
+ unsigned int idle_cpus; /* Nr of idle CPUs in the group */
unsigned int group_weight;
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
@@ -9456,8 +9456,7 @@ struct sg_lb_stats {
};

/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
+ * sd_lb_stats - stats of a sched_domain required for load-balancing:
*/
struct sd_lb_stats {
struct sched_group *busiest; /* Busiest group in this sd */
@@ -9465,7 +9464,7 @@ struct sd_lb_stats {
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
- unsigned int prefer_sibling; /* tasks should go to sibling first */
+ unsigned int prefer_sibling; /* Tasks should go to sibling first */

struct sg_lb_stats busiest_stat; /* Statistics of the busiest group */
struct sg_lb_stats local_stat; /* Statistics of the local group */
--
2.40.1


2024-03-04 09:51:45

by Ingo Molnar

Subject: [PATCH 9/9] sched/balancing: Rename run_rebalance_domains() => sched_balance_softirq()

run_rebalance_domains() is a misnomer, as it doesn't only
run rebalance_domains(), but since the introduction of the
NOHZ code it also runs nohz_idle_balance().

Rename it to sched_balance_softirq(), reflecting its more
generic purpose and that it's a softirq handler.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
Documentation/scheduler/sched-domains.rst | 2 +-
Documentation/translations/zh_CN/scheduler/sched-domains.rst | 2 +-
kernel/sched/fair.c | 6 +++---
3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/scheduler/sched-domains.rst b/Documentation/scheduler/sched-domains.rst
index e57ad28301bd..6577b068f921 100644
--- a/Documentation/scheduler/sched-domains.rst
+++ b/Documentation/scheduler/sched-domains.rst
@@ -34,7 +34,7 @@ out of balance are tasks moved between groups.
In kernel/sched/core.c, trigger_load_balance() is run periodically on each CPU
through scheduler_tick(). It raises a softirq after the next regularly scheduled
rebalancing event for the current runqueue has arrived. The actual load
-balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
+balancing workhorse, sched_balance_softirq()->rebalance_domains(), is then run
in softirq context (SCHED_SOFTIRQ).

The latter function takes two arguments: the runqueue of current CPU and whether
diff --git a/Documentation/translations/zh_CN/scheduler/sched-domains.rst b/Documentation/translations/zh_CN/scheduler/sched-domains.rst
index e814d4c01141..fbc326668e37 100644
--- a/Documentation/translations/zh_CN/scheduler/sched-domains.rst
+++ b/Documentation/translations/zh_CN/scheduler/sched-domains.rst
@@ -36,7 +36,7 @@ CPU共享。任意两个组的CPU掩码的交集不一定为空,如果是这

在kernel/sched/core.c中,trigger_load_balance()在每个CPU上通过scheduler_tick()
周期执行。在当前运行队列下一个定期调度再平衡事件到达后,它引发一个软中断。负载均衡真正
-的工作由run_rebalance_domains()->rebalance_domains()完成,在软中断上下文中执行
+的工作由sched_balance_softirq()->rebalance_domains()完成,在软中断上下文中执行
(SCHED_SOFTIRQ)。

后一个函数有两个入参:当前CPU的运行队列、它在scheduler_tick()调用时是否空闲。函数会从
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0eb2cafa1509..cdf0dbb9d3e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12408,7 +12408,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
}

/*
- * The run_rebalance_domains() softirq handler is triggered via SCHED_SOFTIRQ
+ * The sched_balance_softirq() softirq handler is triggered via SCHED_SOFTIRQ
* from two places:
*
* - the scheduler_tick(),
@@ -12416,7 +12416,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
* - from the SMP cross-call function nohz_csd_func(),
* used by NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
*/
-static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
+static __latent_entropy void sched_balance_softirq(struct softirq_action *h)
{
struct rq *this_rq = this_rq();
enum cpu_idle_type idle = this_rq->idle_balance;
@@ -13217,7 +13217,7 @@ __init void init_sched_fair_class(void)
#endif
}

- open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
+ open_softirq(SCHED_SOFTIRQ, sched_balance_softirq);

#ifdef CONFIG_NO_HZ_COMMON
nohz.next_balance = jiffies;
--
2.40.1


2024-03-04 09:54:45

by Ingo Molnar

Subject: [PATCH 6/9] sched/balancing: Update run_rebalance_domains() comments

The first sentence of the comment explaining run_rebalance_domains()
is historic and not true anymore:

* run_rebalance_domains is triggered when needed from the scheduler tick.

... contradicted/modified by the second sentence:

* Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).

Avoid that kind of confusion straight away and explain from what
places sched_balance_softirq() is triggered.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4c46bffb6a7a..18b7d2999cff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12409,8 +12409,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
}

/*
- * run_rebalance_domains is triggered when needed from the scheduler tick.
- * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
+ * The run_rebalance_domains() softirq handler is triggered via SCHED_SOFTIRQ
+ * from two places:
+ *
+ * - the scheduler_tick(),
+ *
+ * - from the SMP cross-call function nohz_csd_func(),
+ * used by NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
--
2.40.1


2024-03-04 15:06:57

by Shrikanth Hegde

Subject: Re: [PATCH 2/9] sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat()



On 3/4/24 3:18 PM, Ingo Molnar wrote:
> From: Shrikanth Hegde <[email protected]>
>
> Shrikanth Hegde reported that show_schedstat() output broke when
> the ordering of the definitions in 'enum cpu_idle_type' is changed,
> because show_schedstat() assumed that 'CPU_IDLE' is 0.
>
Hi Ingo.
Feel free to drop me from the changelog.

> @@ -150,8 +150,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
>
> seq_printf(seq, "domain%d %*pb", dcount++,
> cpumask_pr_args(sched_domain_span(sd)));
> - for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
> - itype++) {
> + for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {


It would still not be in the same order as the current schedstat documentation, no? The
documentation would need changes too. Change SCHEDSTAT_VERSION to 16?


The current documentation says this:
--------------------
The next 24 are a variety of load_balance() statistics in grouped into types
of idleness (idle, busy, and newly idle):

The above code will instead give:
(busy, idle and newly idle)


--------------------
Verified with the v3 patch as well using the previous method.
Before patch:
cpu0 0 0 4400 1485 1624 1229 301472313236 120382198 7714
[-------- idle --------][-----------busy--------][-------new-idle--]
domain0 00000000,00000000,00000055 1661 1661 0 0 0 0 0 1661 2495 2495 0 0 0 0 0 2495 67 66 1 2 0 0 0 66 0 0 0 0 0 0 0 0 0 133 38 0
domain1 ff000000,00ff0000,ffffffff 382 369 13 13 4 0 2 207 198 195 3 36 0 0 0 195 67 64 3 3 0 0 0 64 4 0 4 0 0 0 0 0 0 124 9 0
domain2 ff00ffff,00ffffff,ffffffff 586 585 1 6 0 0 0 365 118 116 2 96 0 0 0 116 67 67 0 0 0 0 0 67 0 0 0 0 0 0 0 0 0 59 0 0
domain3 ffffffff,ffffffff,ffffffff 481 479 2 58 0 0 0 387 97 97 0 0 0 0 0 96 67 67 0 0 0 0 0 67 0 0 0 0 0 0 0 0 0 79 0 0

After patch:
cpu0 0 0 3924 728 1940 1540 302019558490 425784368 8793
[-------- busy ----------][-----------idle--------][-------new-idle--]
domain0 00000000,00000000,00000055 2494 2489 3 37 2 0 0 2489 1691 1691 0 0 0 0 0 1691 21 19 0 2 2 0 0 19 0 0 0 0 0 0 0 0 0 89 2 0
domain1 ff000000,00ff0000,ffffffff 196 193 3 44 0 0 0 193 411 400 10 2060 4 1 4 260 19 16 3 1028 0 0 0 16 3 0 3 0 0 0 0 0 0 59 2 0
domain2 ff00ffff,00ffffff,ffffffff 116 116 0 0 0 0 0 116 590 588 2 3 0 0 0 447 19 18 1 2 0 0 0 18 0 0 0 0 0 0 0 0 0 192 0 0
domain3 ffffffff,ffffffff,ffffffff 97 97 0 0 0 0 0 96 457 457 0 0 0 0 0 427 19 18 1 27 0 0 0 18 0 0 0 0 0 0 0 0 0 60 0 0



> seq_printf(seq, " %u %u %u %u %u %u %u %u",
> sd->lb_count[itype],
> sd->lb_balanced[itype],

2024-03-05 10:56:26

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 3/9] sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions

On 04/03/24 10:48, Ingo Molnar wrote:
> And gets rid of the double negation in these usages:
>
> - if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
> + if (env->idle && env->src_rq->nr_running <= 1)
>

Yes please :-)

> Furthermore, this makes code much more obvious where there's
> differentiation between CPU_IDLE and CPU_NEWLY_IDLE.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>

Reviewed-by: Valentin Schneider <[email protected]>


2024-03-05 10:56:39

by Valentin Schneider

Subject: Re: [PATCH 4/9] sched/balancing: Change comment formatting to not overlap Git conflict marker lines

On 04/03/24 10:48, Ingo Molnar wrote:
> So the scheduler has two such comment blocks, with '=' used as a double underline:
>
> /*
> * VRUNTIME
> * ========
> *
>
> '========' also happens to be a Git conflict marker, throwing off a simple
> search in an editor for this pattern.
>
> Change them to '-------' type of underline instead - it looks just as good.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>

Reviewed-by: Valentin Schneider <[email protected]>


2024-03-05 10:56:53

by Valentin Schneider

Subject: Re: [PATCH 5/9] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK

On 04/03/24 10:48, Ingo Molnar wrote:
> Fix two typos:
>
> - There's no such thing as 'nohz_balancing_kick', the
> flag is named 'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.
>
> - Likewise there's no such thing as a 'pending nohz_balance_kick'
> either, the NOHZ_BALANCE_KICK flag is all upper-case.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>

Reviewed-by: Valentin Schneider <[email protected]>


2024-03-05 10:57:08

by Valentin Schneider

Subject: Re: [PATCH 6/9] sched/balancing: Update run_rebalance_domains() comments

On 04/03/24 10:48, Ingo Molnar wrote:
> The first sentence of the comment explaining run_rebalance_domains()
> is historic and not true anymore:
>
> * run_rebalance_domains is triggered when needed from the scheduler tick.
>
> ... contradicted/modified by the second sentence:
>
> * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
>
> Avoid that kind of confusion straight away and explain from what
> places sched_balance_softirq() is triggered.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4c46bffb6a7a..18b7d2999cff 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12409,8 +12409,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> }
>
> /*
> - * run_rebalance_domains is triggered when needed from the scheduler tick.
> - * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
> + * The run_rebalance_domains() softirq handler is triggered via SCHED_SOFTIRQ
> + * from two places:
> + *
> + * - the scheduler_tick(),
> + *
> + * - from the SMP cross-call function nohz_csd_func(),
> + * used by NOHZ idle balancing (with NOHZ_BALANCE_KICK set).

Bit of a nit but the CSD is also triggered via the scheduler_tick():

scheduler_tick()
`\
trigger_load_balance()
`\
raise_softirq(SCHED_SOFTIRQ)

scheduler_tick()
`\
trigger_load_balance()
`\
nohz_balance_kick()
`\
kick_ilb()
`\
smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);

I got to the below which is still somewhat confusing, thoughts?

"""
The run_rebalance_domains() softirq handler is triggered via SCHED_SOFTIRQ
from two places:

- directly from trigger_load_balance() in scheduler_tick(), for periodic
load balance

- indirectly from kick_ilb() (invoked down the scheduler_tick() too), which
issues an SMP cross-call to nohz_csd_func() which will itself raise the
softirq, for NOHZ idle balancing.
"""

> */
> static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
> {
> --
> 2.40.1


2024-03-05 10:57:21

by Valentin Schneider

Subject: Re: [PATCH 7/9] sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'

On 04/03/24 10:48, Ingo Molnar wrote:
> Make them easier to read.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>

Reviewed-by: Valentin Schneider <[email protected]>


2024-03-05 10:57:38

by Valentin Schneider

Subject: Re: [PATCH 8/9] sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'

On 04/03/24 10:48, Ingo Molnar wrote:
> - Align for readability
> - Capitalize consistently
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>

Reviewed-by: Valentin Schneider <[email protected]>


2024-03-05 11:02:09

by Valentin Schneider

Subject: Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

On 04/03/24 10:48, Ingo Molnar wrote:
> The 'balancing' spinlock added in:
>
> 08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
>
> ... is taken when the SD_SERIALIZE flag is set in a domain, but in reality it
> is a glorified global atomic flag serializing the load-balancing of
> those domains.
>
> It doesn't have any explicit locking semantics per se: we just
> spin_trylock() it.
>
> Turn it into a ... global atomic flag. This makes it more
> clear what is going on here, and reduces overhead and code
> size a bit:
>
> # kernel/sched/fair.o: [x86-64 defconfig]
>
> text data bss dec hex filename
> 60730 2721 104 63555 f843 fair.o.before
> 60718 2721 104 63543 f837 fair.o.after
>
> Also document the flag a bit.
>
> No change in functionality intended.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>

Few comment nits, otherwise:

Reviewed-by: Valentin Schneider <[email protected]>

> ---
> kernel/sched/fair.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6a16129f9a5c..64ae3d8dc93b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11633,7 +11633,20 @@ static int active_load_balance_cpu_stop(void *data)
> return 0;
> }
>
> -static DEFINE_SPINLOCK(balancing);
> +/*
> + * This flag serializes load-balancing passes over large domains
> + * (such as SD_NUMA) - only once load-balancing instance may run
^^^^
s/once/one/

Also, currently the flag is only set for domains above the NODE topology
level, sd_init() will reject an architecture that forces SD_SERIALIZE in a
topology level's ->sd_flags(), so what about:

s/(such as SD_NUMA)/(above the NODE topology level)

> + * at a time, to reduce overhead on very large systems with lots
> + * of CPUs and large NUMA distances.
> + *
> + * - Note that load-balancing passes triggered while another one
> + * is executing are skipped and not re-tried.
> + *
> + * - Also note that this does not serialize sched_balance_domains()
^^^^^^^^^^^^^^^^^^^^^
Did you mean rebalance_domains()?

> + * execution, as non-SD_SERIALIZE domains will still be
> + * load-balanced in parallel.
> + */
> +static atomic_t sched_balance_running = ATOMIC_INIT(0);
>
> /*
> * Scale the max load_balance interval with the number of CPUs in the system.
> @@ -11711,7 +11724,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>
> need_serialize = sd->flags & SD_SERIALIZE;
> if (need_serialize) {
> - if (!spin_trylock(&balancing))
> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> goto out;
> }
>
> @@ -11729,7 +11742,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> interval = get_sd_balance_interval(sd, busy);
> }
> if (need_serialize)
> - spin_unlock(&balancing);
> + atomic_set_release(&sched_balance_running, 0);
> out:
> if (time_after(next_balance, sd->last_balance + interval)) {
> next_balance = sd->last_balance + interval;
> --
> 2.40.1


2024-03-05 11:07:20

by Valentin Schneider

Subject: Re: [PATCH 9/9] sched/balancing: Rename run_rebalance_domains() => sched_balance_softirq()

On 04/03/24 10:48, Ingo Molnar wrote:
> run_rebalance_domains() is a misnomer, as it doesn't only
> run rebalance_domains(), but since the introduction of the
> NOHZ code it also runs nohz_idle_balance().
>
> Rename it to sched_balance_softirq(), reflecting its more
> generic purpose and that it's a softirq handler.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>

Acked-by: Valentin Schneider <[email protected]>


2024-03-05 11:12:33

by Shrikanth Hegde

Subject: Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag



On 3/4/24 3:18 PM, Ingo Molnar wrote:
> The 'balancing' spinlock added in:
>
> 08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
>

[...]

>
>
> -static DEFINE_SPINLOCK(balancing);
> +/*
> + * This flag serializes load-balancing passes over large domains
> + * (such as SD_NUMA) - only once load-balancing instance may run
> + * at a time, to reduce overhead on very large systems with lots
> + * of CPUs and large NUMA distances.
> + *
> + * - Note that load-balancing passes triggered while another one
> + * is executing are skipped and not re-tried.
> + *
> + * - Also note that this does not serialize sched_balance_domains()
> + * execution, as non-SD_SERIALIZE domains will still be
> + * load-balanced in parallel.
> + */
> +static atomic_t sched_balance_running = ATOMIC_INIT(0);
>
> /*

Continuing the discussion about whether this balancing lock is
contended or not.


It was observed on a large system (1920 CPUs, 16 NUMA nodes) that the cacheline containing
the balancing trylock was contended, and rebalance_domains() showed up in the traces.

So I did some experiments on a smaller system. This system has 224 CPUs and 6 NUMA nodes.
I added probe points in rebalance_domains(): if the lock is not contended, the trylock
should succeed and both probe point counts should match. If not, there is contention.
Below are the system details and 'perf probe -L rebalance_domains'.

NUMA:
NUMA node(s): 6
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-71
NUMA node4 CPU(s): 72-111
NUMA node5 CPU(s): 112-151
NUMA node6 CPU(s): 152-183
NUMA node7 CPU(s): 184-223


------------------------------------------------------------------------------------------------------------------
#perf probe -L rebalance_domains
<rebalance_domains@/shrikanth/sched_tip/kernel/sched/fair.c:0>
0 static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
{
2 int continue_balancing = 1;
3 int cpu = rq->cpu;
[...]


33 interval = get_sd_balance_interval(sd, busy);

need_serialize = sd->flags & SD_SERIALIZE;
36 if (need_serialize) {
37 if (!spin_trylock(&balancing))
goto out;
}

41 if (time_after_eq(jiffies, sd->last_balance + interval)) {
42 if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
/*
* The LBF_DST_PINNED logic could have changed
* env->dst_cpu, so we can't know our idle
* state even if we migrated tasks. Update it.
*/
48 idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
49 busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
}
51 sd->last_balance = jiffies;
52 interval = get_sd_balance_interval(sd, busy);
}
54 if (need_serialize)
55 spin_unlock(&balancing);
out:
57 if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
update_next_balance = 1;
}
}

perf probe --list
probe:rebalance_domains_L37 (on rebalance_domains+856)
probe:rebalance_domains_L55 (on rebalance_domains+904)
------------------------------------------------------------------------------------------------------------------

Perf records are collected for 10 seconds under different system loads. Load is created using stress-ng.
Contention is calculated as (1 - L55/L37) * 100.
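(For example, the 100% load figures below give (1 - 8K/64K) * 100 = 87.5% contention.)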

system is idle: <-- No contention
1K probe:rebalance_domains_L37
1K probe:rebalance_domains_L55


system is at 25% load: <-- 4.4% contention
223K probe:rebalance_domains_L37: 1 chunks LOST!
213K probe:rebalance_domains_L55: 1 chunks LOST!



system is at 50% load <-- 12.5% contention
168K probe:rebalance_domains_L37
147K probe:rebalance_domains_L55


system is at 75% load <-- 25.6% contention
113K probe:rebalance_domains_L37
84K probe:rebalance_domains_L55

system is at 100% load <-- 87.5% contention.
64K probe:rebalance_domains_L37
8K probe:rebalance_domains_L55


A few reasons for the contention could be:
1. Idle load balancing is running while some other CPU is becoming idle and tries newidle_balance().
2. When the system is busy, every CPU does busy balancing and contends for the lock. It then does not
actually balance, since should_we_balance() says this CPU need not balance; it bails out and releases the lock.

2024-03-06 15:43:46

by Vincent Guittot

Subject: Re: [PATCH 5/9] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK

On Mon, 4 Mar 2024 at 10:48, Ingo Molnar <[email protected]> wrote:
>
> Fix two typos:
>
> - There's no such thing as 'nohz_balancing_kick', the
> flag is named 'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.
>
> - Likewise there's no such thing as a 'pending nohz_balance_kick'
> either, the NOHZ_BALANCE_KICK flag is all upper-case.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 934ace69eb30..4c46bffb6a7a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12410,14 +12410,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> /*
> * run_rebalance_domains is triggered when needed from the scheduler tick.
> - * Also triggered for nohz idle balancing (with nohz_balancing_kick set).

I think it should be nohz_balancer_kick() instead of nohz_balancing_kick


> + * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).

In fact, this can be triggered even without NOHZ_BALANCE_KICK. Any of
the below can trigger this
- NOHZ_BALANCE_KICK
- NOHZ_STATS_KICK
- NOHZ_NEXT_KICK

> */
> static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
> {
> struct rq *this_rq = this_rq();
> enum cpu_idle_type idle = this_rq->idle_balance;
> /*
> - * If this CPU has a pending nohz_balance_kick, then do the
> + * If this CPU has a pending NOHZ_BALANCE_KICK, then do the
> * balancing on behalf of the other idle CPUs whose ticks are
> * stopped. Do nohz_idle_balance *before* rebalance_domains to
> * give the idle CPUs a chance to load balance. Else we may
> --
> 2.40.1
>

2024-03-06 15:45:01

by Vincent Guittot

Subject: Re: [PATCH 4/9] sched/balancing: Change comment formatting to not overlap Git conflict marker lines

On Mon, 4 Mar 2024 at 10:48, Ingo Molnar <[email protected]> wrote:
>
> So the scheduler has two such comment blocks, with '=' used as a double underline:
>
> /*
> * VRUNTIME
> * ========
> *
>
> '========' also happens to be a Git conflict marker, throwing off a simple
> search in an editor for this pattern.
>
> Change them to '-------' type of underline instead - it looks just as good.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f11fc6dd39b1..934ace69eb30 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3679,7 +3679,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
>
> /*
> * VRUNTIME
> - * ========
> + * --------
> *
> * COROLLARY #1: The virtual runtime of the entity needs to be
> * adjusted if re-weight at !0-lag point.
> @@ -3762,7 +3762,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
>
> /*
> * DEADLINE
> - * ========
> + * --------
> *
> * When the weight changes, the virtual time slope changes and
> * we should adjust the relative virtual deadline accordingly.
> --
> 2.40.1
>

2024-03-06 15:49:06

by Vincent Guittot

Subject: Re: [PATCH 3/9] sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions

On Mon, 4 Mar 2024 at 10:48, Ingo Molnar <[email protected]> wrote:
>
> The cpu_idle_type enum has the confusingly inverted property
> that 'not idle' is 1, and 'idle' is '0'.
>
> This resulted in a number of unnecessary complications in the code.
>
> Reverse the order, remove the CPU_NOT_IDLE type, and convert
> all code to a natural boolean form.
>
> It's much more readable:
>
> - enum cpu_idle_type idle = this_rq->idle_balance ?
> - CPU_IDLE : CPU_NOT_IDLE;
> -
> + enum cpu_idle_type idle = this_rq->idle_balance;
>
> --------------------------------
>
> - if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
> + if (!env->idle || !busiest->sum_nr_running)
>
> --------------------------------
>
> And gets rid of the double negation in these usages:
>
> - if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
> + if (env->idle && env->src_rq->nr_running <= 1)
>
> Furthermore, this makes code much more obvious where there's
> differentiation between CPU_IDLE and CPU_NEWLY_IDLE.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> ---
> include/linux/sched/idle.h | 3 +--
> kernel/sched/fair.c | 27 ++++++++++++---------------
> 2 files changed, 13 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
> index 478084f9105e..4a6423700ffc 100644
> --- a/include/linux/sched/idle.h
> +++ b/include/linux/sched/idle.h
> @@ -5,8 +5,7 @@
> #include <linux/sched.h>
>
> enum cpu_idle_type {
> - CPU_IDLE,
> - CPU_NOT_IDLE,

Could we set CPU_NOT_IDLE = 0, to help keep in mind that 0 means the
CPU is not idle, even if we don't use it?

> + CPU_IDLE = 1,
> CPU_NEWLY_IDLE,
> CPU_MAX_IDLE_TYPES
> };
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 64ae3d8dc93b..f11fc6dd39b1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9070,7 +9070,7 @@ static int detach_tasks(struct lb_env *env)
> * We don't want to steal all, otherwise we may be treated likewise,
> * which could at worst lead to a livelock crash.
> */
> - if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
> + if (env->idle && env->src_rq->nr_running <= 1)
> break;
>
> env->loop++;
> @@ -9803,7 +9803,7 @@ static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
> static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> struct sched_group *group)
> {
> - if (env->idle == CPU_NOT_IDLE)
> + if (!env->idle)
> return false;
>
> /*
> @@ -9827,7 +9827,7 @@ static inline long sibling_imbalance(struct lb_env *env,
> int ncores_busiest, ncores_local;
> long imbalance;
>
> - if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
> + if (!env->idle || !busiest->sum_nr_running)
> return 0;
>
> ncores_busiest = sds->busiest->cores;
> @@ -9927,8 +9927,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> sgs->group_misfit_task_load = rq->misfit_task_load;
> *sg_status |= SG_OVERLOAD;
> }
> - } else if ((env->idle != CPU_NOT_IDLE) &&
> - sched_reduced_capacity(rq, env->sd)) {
> + } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
> /* Check for a task running on a CPU with reduced capacity */
> if (sgs->group_misfit_task_load < load)
> sgs->group_misfit_task_load = load;
> @@ -9940,7 +9939,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> sgs->group_weight = group->group_weight;
>
> /* Check if dst CPU is idle and preferred to this group */
> - if (!local_group && env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
> + if (!local_group && env->idle && sgs->sum_h_nr_running &&
> sched_group_asym(env, sgs, group))
> sgs->group_asym_packing = 1;
>
> @@ -10698,7 +10697,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> * waiting task in this overloaded busiest group. Let's
> * try to pull it.
> */
> - if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
> + if (env->idle && env->imbalance == 0) {
> env->migration_type = migrate_task;
> env->imbalance = 1;
> }
> @@ -10913,7 +10912,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> goto force_balance;
>
> if (busiest->group_type != group_overloaded) {
> - if (env->idle == CPU_NOT_IDLE) {
> + if (!env->idle) {
> /*
> * If the busiest group is not overloaded (and as a
> * result the local one too) but this CPU is already
> @@ -11121,7 +11120,7 @@ asym_active_balance(struct lb_env *env)
> * the lower priority @env::dst_cpu help it. Do not follow
> * CPU priority.
> */
> - return env->idle != CPU_NOT_IDLE && sched_use_asym_prio(env->sd, env->dst_cpu) &&
> + return env->idle && sched_use_asym_prio(env->sd, env->dst_cpu) &&
> (sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
> !sched_use_asym_prio(env->sd, env->src_cpu));
> }
> @@ -11159,7 +11158,7 @@ static int need_active_balance(struct lb_env *env)
> * because of other sched_class or IRQs if more capacity stays
> * available on dst_cpu.
> */
> - if ((env->idle != CPU_NOT_IDLE) &&
> + if (env->idle &&
> (env->src_rq->cfs.h_nr_running == 1)) {
> if ((check_cpu_capacity(env->src_rq, sd)) &&
> (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
> @@ -11735,8 +11734,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> * env->dst_cpu, so we can't know our idle
> * state even if we migrated tasks. Update it.
> */
> - idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
> - busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> + idle = idle_cpu(cpu);
> + busy = !idle && !sched_idle_cpu(cpu);
> }
> sd->last_balance = jiffies;
> interval = get_sd_balance_interval(sd, busy);
> @@ -12416,9 +12415,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
> {
> struct rq *this_rq = this_rq();
> - enum cpu_idle_type idle = this_rq->idle_balance ?
> - CPU_IDLE : CPU_NOT_IDLE;
> -
> + enum cpu_idle_type idle = this_rq->idle_balance;
> /*
> * If this CPU has a pending nohz_balance_kick, then do the
> * balancing on behalf of the other idle CPUs whose ticks are
> --
> 2.40.1
>

2024-03-06 16:17:50

by Vincent Guittot

Subject: Re: [PATCH 6/9] sched/balancing: Update run_rebalance_domains() comments

On Tue, 5 Mar 2024 at 11:50, Valentin Schneider <[email protected]> wrote:
>
> On 04/03/24 10:48, Ingo Molnar wrote:
> > The first sentence of the comment explaining run_rebalance_domains()
> > is historic and not true anymore:
> >
> > * run_rebalance_domains is triggered when needed from the scheduler tick.
> >
> > ... contradicted/modified by the second sentence:
> >
> > * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
> >
> > Avoid that kind of confusion straight away and explain from what
> > places sched_balance_softirq() is triggered.
> >
> > Signed-off-by: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > ---
> > kernel/sched/fair.c | 9 +++++++--
> > 1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4c46bffb6a7a..18b7d2999cff 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12409,8 +12409,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > }
> >
> > /*
> > - * run_rebalance_domains is triggered when needed from the scheduler tick.
> > - * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
> > + * The run_rebalance_domains() softirq handler is triggered via SCHED_SOFTIRQ
> > + * from two places:
> > + *
> > + * - the scheduler_tick(),
> > + *
> > + * - from the SMP cross-call function nohz_csd_func(),
> > + * used by NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
>
> Bit of a nit but the CSD is also triggered via the scheduler_tick():
>
> scheduler_tick()
> `\
> trigger_load_balance()
> `\
> raise_softirq(SCHED_SOFTIRQ)
>
> scheduler_tick()
> `\
> trigger_load_balance()
> `\
> nohz_balance_kick()
> `\
> kick_ilb()
> `\
> smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
>
> I got to the below which is still somewhat confusing, thoughts?
>




> """
> The run_rebalance_domains() softirq handler is triggered via SCHED_SOFTIRQ
> from two places:
>
> - directly from trigger_load_balance() in scheduler_tick(), for periodic
> load balance
>
> - indirectly from kick_ilb() (invoked down the scheduler_tick() too), which
> issues an SMP cross-call to nohz_csd_func() which will itself raise the
> softirq, for NOHZ idle balancing.

I'm not sure that we should provide too many details of the path as
this might change in the future. What about the below?

- directly from the local scheduler_tick() for periodic load balance

- indirectly from a remote scheduler_tick() for NOHZ idle balancing
through the SMP cross-call nohz_csd_func()

> """
>
> > */
> > static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
> > {
> > --
> > 2.40.1
>

2024-03-08 09:50:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag


* Valentin Schneider <[email protected]> wrote:

> On 04/03/24 10:48, Ingo Molnar wrote:
> > The 'balancing' spinlock added in:
> >
> > 08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
> >
> > ... is taken when the SD_SERIALIZE flag is set in a domain, but in reality it
> > is a glorified global atomic flag serializing the load-balancing of
> > those domains.
> >
> > It doesn't have any explicit locking semantics per se: we just
> > spin_trylock() it.
> >
> > Turn it into a ... global atomic flag. This makes it more
> > clear what is going on here, and reduces overhead and code
> > size a bit:
> >
> > # kernel/sched/fair.o: [x86-64 defconfig]
> >
> > text data bss dec hex filename
> > 60730 2721 104 63555 f843 fair.o.before
> > 60718 2721 104 63543 f837 fair.o.after
> >
> > Also document the flag a bit.
> >
> > No change in functionality intended.
> >
> > Signed-off-by: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
>
> Few comment nits, otherwise:
>
> Reviewed-by: Valentin Schneider <[email protected]>

Thanks!

> > -static DEFINE_SPINLOCK(balancing);
> > +/*
> > + * This flag serializes load-balancing passes over large domains
> > + * (such as SD_NUMA) - only once load-balancing instance may run
> ^^^^
> s/once/one/
>
> Also, currently the flag is only set for domains above the NODE topology
> level, sd_init() will reject an architecture that forces SD_SERIALIZE in a
> topology level's ->sd_flags(), so what about:
>
> s/(such as SD_NUMA)/(above the NODE topology level)

Agreed & done.

> > + * at a time, to reduce overhead on very large systems with lots
> > + * of CPUs and large NUMA distances.
> > + *
> > + * - Note that load-balancing passes triggered while another one
> > + * is executing are skipped and not re-tried.
> > + *
> > + * - Also note that this does not serialize sched_balance_domains()
> ^^^^^^^^^^^^^^^^^^^^^
> Did you mean rebalance_domains()?

Correct - a later rename that unifies the nomenclature of all the
rebalancing functions along the sched_balance_*() prefix (which I have not
posted yet) crept back into this comment block; obviously this patch
should refer to the current namespace. Fixed.

Thanks,

Ingo

2024-03-08 09:56:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/9] sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat()


* Shrikanth Hegde <[email protected]> wrote:

>
>
> On 3/4/24 3:18 PM, Ingo Molnar wrote:
> > From: Shrikanth Hegde <[email protected]>
> >
> > Shrikanth Hegde reported that show_schedstat() output broke when
> > the ordering of the definitions in 'enum cpu_idle_type' is changed,
> > because show_schedstat() assumed that 'CPU_IDLE' is 0.
> >
> Hi Ingo.
> Feel free to drop me from the changelog.

Yeah - I made you the author of the commit, and indeed it should not refer
to you in the third person. :-) Fixed.

>
> > @@ -150,8 +150,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
> >
> > seq_printf(seq, "domain%d %*pb", dcount++,
> > cpumask_pr_args(sched_domain_span(sd)));
> > - for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
> > - itype++) {
> > + for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
>
>
> It would still not be same order as current documentation of schedstat.
> no? The documentation would need changes too. Change SCHEDSTAT_VERSION to
> 16?

Correct. I've bumped SCHEDSTAT_VERSION up to 16 now, but since it hasn't
been changed for the last 10+ years, I'm wondering whether that's the right
thing to do, or whether we should add a quirk to maintain the v15 ordering.

I think we should also output the actual symbolic cpu_idle_type names into
schedstat, so that tooling (and observant kernel developers) can see the
actual ordering of the [CPU_MAX_IDLE_TYPES] columns.

A new line like this (mockup):

cpu0 0 0 4400 1485 1624 1229 301472313236 120382198 7714
+ cpu_idle_type CPU_IDLE 0 CPU_NOT_IDLE 1 CPU_NEWLY_IDLE 2 CPU_MAX_IDLE_TYPES 3
domain0 00000000,00000000,00000055 1661 1661 0 0 0 0 0 1661 2495 2495 0 0 0 0 0 2495 67 66 1 2 0 0 0 66 0 0 0 0 0 0 0 0 0 133 38 0

.. and after the change this would become:

cpu_idle_type CPU_NOT_IDLE 0 CPU_IDLE 1 CPU_NEWLY_IDLE 2 CPU_MAX_IDLE_TYPES 3

or so?

This gives tooling (that cares) a way to enumerate the idle types, without
having to rely on their numeric values. Adding a new line to schedstat
shouldn't break existing tooling - and if it does, we've increased
SCHEDSTAT_VERSION to 16 anyway. ;-)
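
For illustration, a minimal sketch of how such a legend line could be
emitted (assuming the kernel/sched/stats.c context; the idle_type_names[]
array and the helper name are hypothetical, and the array order would have
to track 'enum cpu_idle_type'):

/*
 * Sketch only: print the symbolic cpu_idle_type names together with their
 * numeric values, so tooling does not have to hard-code the column order.
 */
static const char * const idle_type_names[CPU_MAX_IDLE_TYPES] = {
	"CPU_NOT_IDLE", "CPU_IDLE", "CPU_NEWLY_IDLE",
};

static void show_idle_type_legend(struct seq_file *seq)
{
	int itype;

	seq_printf(seq, "cpu_idle_type");
	for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++)
		seq_printf(seq, " %s %d", idle_type_names[itype], itype);
	seq_printf(seq, " CPU_MAX_IDLE_TYPES %d\n", CPU_MAX_IDLE_TYPES);
}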

Thanks,

Ingo

2024-03-08 10:00:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/9] sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions


* Vincent Guittot <[email protected]> wrote:

> On Mon, 4 Mar 2024 at 10:48, Ingo Molnar <[email protected]> wrote:
> >
> > The cpu_idle_type enum has the confusingly inverted property
> > that 'not idle' is 1, and 'idle' is '0'.
> >
> > This resulted in a number of unnecessary complications in the code.
> >
> > Reverse the order, remove the CPU_NOT_IDLE type, and convert
> > all code to a natural boolean form.
> >
> > It's much more readable:
> >
> > - enum cpu_idle_type idle = this_rq->idle_balance ?
> > - CPU_IDLE : CPU_NOT_IDLE;
> > -
> > + enum cpu_idle_type idle = this_rq->idle_balance;
> >
> > --------------------------------
> >
> > - if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
> > + if (!env->idle || !busiest->sum_nr_running)
> >
> > --------------------------------
> >
> > And gets rid of the double negation in these usages:
> >
> > - if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
> > + if (env->idle && env->src_rq->nr_running <= 1)
> >
> > Furthermore, this makes code much more obvious where there's
> > differentiation between CPU_IDLE and CPU_NEWLY_IDLE.
> >
> > Signed-off-by: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > ---
> > include/linux/sched/idle.h | 3 +--
> > kernel/sched/fair.c | 27 ++++++++++++---------------
> > 2 files changed, 13 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
> > index 478084f9105e..4a6423700ffc 100644
> > --- a/include/linux/sched/idle.h
> > +++ b/include/linux/sched/idle.h
> > @@ -5,8 +5,7 @@
> > #include <linux/sched.h>
> >
> > enum cpu_idle_type {
> > - CPU_IDLE,
> > - CPU_NOT_IDLE,
>
> Could be set CPU_NOT_IDLE = 0 to help keeping in mind that 0 means
> cpu is not idle even if we don't use it ?

Yeah, makes sense. I've added back __CPU_NOT_IDLE = 0 (with the underscore
to make sure some pending patch isn't accidentally relying on something it
shouldn't), and added your Reviewed-by.
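
For reference, a minimal sketch of what the resulting enum could look like
after that change (an assumption based on this discussion, not the final
committed form):

enum cpu_idle_type {
	__CPU_NOT_IDLE = 0,	/* underscored: not meant to be used directly */
	CPU_IDLE,
	CPU_NEWLY_IDLE,
	CPU_MAX_IDLE_TYPES
};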

Thanks,

Ingo

2024-03-08 10:12:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/9] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK


* Vincent Guittot <[email protected]> wrote:

> On Mon, 4 Mar 2024 at 10:48, Ingo Molnar <[email protected]> wrote:
> >
> > Fix two typos:
> >
> > - There's no such thing as 'nohz_balancing_kick', the
> > flag is named 'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.
> >
> > - Likewise there's no such thing as a 'pending nohz_balance_kick'
> > either, the NOHZ_BALANCE_KICK flag is all upper-case.
> >
> > Signed-off-by: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > ---
> > kernel/sched/fair.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 934ace69eb30..4c46bffb6a7a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12410,14 +12410,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> > /*
> > * run_rebalance_domains is triggered when needed from the scheduler tick.
> > - * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
>
> I think it should be nohz_balancer_kick() instead of nohz_balancing_kick
>
>
> > + * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
>
> In fact, this can be triggered even without NOHZ_BALANCE_KICK. Any of
> the below can trigger this
> - NOHZ_BALANCE_KICK
> - NOHZ_STATS_KICK
> - NOHZ_NEXT_KICK

Yeah, indeed any of the flags in NOHZ_KICK_MASK can trigger nohz_csd and
indirectly trigger the balancing softirq - so I changed the text to:

/*
* This softirq may be triggered from the scheduler tick, or by
* any of the flags in NOHZ_KICK_MASK: NOHZ_BALANCE_KICK,
* NOHZ_STATS_KICK or NOHZ_NEXT_KICK.
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)

Thanks,

Ingo

2024-03-08 10:15:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 6/9] sched/balancing: Update run_rebalance_domains() comments


* Vincent Guittot <[email protected]> wrote:

> > """
> > The run_rebalance_domains() softirq handler is triggered via SCHED_SOFTIRQ
> > from two places:
> >
> > - directly from trigger_load_balance() in scheduler_tick(), for periodic
> > load balance
> >
> > - indirectly from kick_ilb() (invoked down the scheduler_tick() too), which
> > issues an SMP cross-call to nohz_csd_func() which will itself raise the
> > softirq, for NOHZ idle balancing.
>
> I'm not sure that we should provide too many details of the path as
> this might change in the future. What about the below ?
>
> - directly from the local scheduler_tick() for periodic load balance
>
> - indirectly from a remote scheduler_tick() for NOHZ idle balancing
> through the SMP cross-call nohz_csd_func()

Okay - I updated it to:

/*
* This softirq handler is triggered via SCHED_SOFTIRQ from two places:
*
* - directly from the local scheduler_tick() for periodic load balancing
*
* - indirectly from a remote scheduler_tick() for NOHZ idle balancing
* through the SMP cross-call nohz_csd_func()
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)

Does this work with everyone?

Thanks,

Ingo

2024-03-08 11:23:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag


* Shrikanth Hegde <[email protected]> wrote:

> system is at 75% load <-- 25.6% contention
> 113K probe:rebalance_domains_L37
> 84K probe:rebalance_domains_L55
>
> system is at 100% load <-- 87.5% contention.
> 64K probe:rebalance_domains_L37
> 8K probe:rebalance_domains_L55
>
>
> A few reasons for contentions could be:
>
> 1. idle load balance is running and some other cpu is becoming idle, and
> tries newidle_balance.
>
> 2. when system is busy, every CPU would do busy balancing, it would
> contend for the lock. It will not do balance as should_we_balance says
> this CPU need not balance. It bails out and release the lock.

Thanks, these measurements are really useful!

Would it be possible to disambiguate these two cases?

I think we should probably do something about this contention on this large
system: especially if #2 'no work to be done' bailout is the common case.

Thanks,

Ingo

2024-03-08 11:58:13

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 6/9] sched/balancing: Update run_rebalance_domains() comments

On Fri, 8 Mar 2024 at 11:15, Ingo Molnar <[email protected]> wrote:
>
>
> * Vincent Guittot <[email protected]> wrote:
>
> > > """
> > > The run_rebalance_domains() softirq handler is triggered via SCHED_SOFTIRQ
> > > from two places:
> > >
> > > - directly from trigger_load_balance() in scheduler_tick(), for periodic
> > > load balance
> > >
> > > - indirectly from kick_ilb() (invoked down the scheduler_tick() too), which
> > > issues an SMP cross-call to nohz_csd_func() which will itself raise the
> > > softirq, for NOHZ idle balancing.
> >
> > I'm not sure that we should provide too many details of the path as
> > this might change in the future. What about the below ?
> >
> > - directly from the local scheduler_tick() for periodic load balance
> >
> > - indirectly from a remote scheduler_tick() for NOHZ idle balancing
> > through the SMP cross-call nohz_csd_func()
>
> Okay - I updated it to:
>
> /*
> * This softirq handler is triggered via SCHED_SOFTIRQ from two places:
> *
> * - directly from the local scheduler_tick() for periodic load balancing
> *
> * - indirectly from a remote scheduler_tick() for NOHZ idle balancing
> * through the SMP cross-call nohz_csd_func()
> */
> static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
>
> Does this work with everyone?

yes looks good for me

>
> Thanks,
>
> Ingo

2024-03-08 14:51:41

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag



On 3/8/24 4:53 PM, Ingo Molnar wrote:
>
> * Shrikanth Hegde <[email protected]> wrote:
>
>> system is at 75% load <-- 25.6% contention
>> 113K probe:rebalance_domains_L37
>> 84K probe:rebalance_domains_L55
>>
>> system is at 100% load <-- 87.5% contention.
>> 64K probe:rebalance_domains_L37
>> 8K probe:rebalance_domains_L55
>>
>>
>> A few reasons for contentions could be:
>>
>> 1. idle load balance is running and some other cpu is becoming idle, and
>> tries newidle_balance.
>>
>> 2. when system is busy, every CPU would do busy balancing, it would
>> contend for the lock. It will not do balance as should_we_balance says
>> this CPU need not balance. It bails out and release the lock.
>
> Thanks, these measurements are really useful!
>
> Would it be possible to disambiguate these two cases?

I think it's not case 1, since newidle_balance doesn't even take that lock. So
it's likely case 2.

>
> I think we should probably do something about this contention on this large
> system: especially if #2 'no work to be done' bailout is the common case.
>


I have been thinking whether it would be right to move this balancing
trylock/atomic to after should_we_balance() (swb). This reduces the number of
times the flag is checked/updated significantly. Contention is still present
though: that's possible at higher utilization when there are multiple NUMA
domains, where one CPU in each NUMA domain can contend if their invocations
are aligned.

That makes sense since, right now, a CPU takes the lock, checks if it can
balance, does the balance if yes, and then releases the lock. If the lock is
taken after swb, the CPU still checks if it can balance, takes the lock and
releases it when done. If the lock is contended, it bails out of
load_balance. That is the current behaviour as well - or am I completely wrong?

Not sure in which scenarios that would hurt. We could do this after this
series. This may need wider functional testing to make sure we don't regress
badly in some cases. This is only an *idea* as of now.

Perf probes at spin_trylock and spin_unlock codepoints on the same 224CPU, 6 NUMA node system.
6.8-rc6
-----------------------------------------
idle system:
449 probe:rebalance_domains_L37
377 probe:rebalance_domains_L55
stress-ng --cpu=$(nproc) -l 51 << 51% load
88K probe:rebalance_domains_L37
77K probe:rebalance_domains_L55
stress-ng --cpu=$(nproc) -l 100 << 100% load
41K probe:rebalance_domains_L37
10K probe:rebalance_domains_L55

+below patch
----------------------------------------
idle system:
462 probe:load_balance_L35
394 probe:load_balance_L274
stress-ng --cpu=$(nproc) -l 51 << 51% load
5K probe:load_balance_L35 <<-- almost 15x less
4K probe:load_balance_L274
stress-ng --cpu=$(nproc) -l 100 << 100% load
8K probe:load_balance_L35
3K probe:load_balance_L274 <<-- almost 4x less


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 62f247bdec86..3a8de7454377 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11272,6 +11272,7 @@ static int should_we_balance(struct lb_env *env)
return group_balance_cpu(sg) == env->dst_cpu;
}

+static DEFINE_SPINLOCK(balancing);
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
* tasks if there is an imbalance.
@@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
struct rq *busiest;
struct rq_flags rf;
struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+ int need_serialize;
struct lb_env env = {
.sd = sd,
.dst_cpu = this_cpu,
@@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
goto out_balanced;
}

+ need_serialize = sd->flags & SD_SERIALIZE;
+ if (need_serialize) {
+ if (!spin_trylock(&balancing))
+ goto lockout;
+ }
+
group = find_busiest_group(&env);
if (!group) {
schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11434,6 +11442,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
+ if (need_serialize)
+ spin_unlock(&balancing);
goto redo;
}
goto out_all_pinned;
@@ -11540,7 +11550,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
sd->balance_interval < MAX_PINNED_INTERVAL) ||
sd->balance_interval < sd->max_interval)
sd->balance_interval *= 2;
+
out:
+ if (need_serialize)
+ spin_unlock(&balancing);
+
+lockout:
return ld_moved;
}

@@ -11665,7 +11680,6 @@ static int active_load_balance_cpu_stop(void *data)
return 0;
}

-static DEFINE_SPINLOCK(balancing);

/*
* Scale the max load_balance interval with the number of CPUs in the system.
@@ -11716,7 +11730,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize, need_decay = 0;
+ int need_decay = 0;
u64 max_cost = 0;

rcu_read_lock();
@@ -11741,12 +11755,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

interval = get_sd_balance_interval(sd, busy);

- need_serialize = sd->flags & SD_SERIALIZE;
- if (need_serialize) {
- if (!spin_trylock(&balancing))
- goto out;
- }
-
if (time_after_eq(jiffies, sd->last_balance + interval)) {
if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
/*
@@ -11760,9 +11768,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
}
- if (need_serialize)
- spin_unlock(&balancing);
-out:
+
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
update_next_balance = 1;




2024-03-08 16:45:41

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 6/9] sched/balancing: Update run_rebalance_domains() comments

On 08/03/24 12:57, Vincent Guittot wrote:
> On Fri, 8 Mar 2024 at 11:15, Ingo Molnar <[email protected]> wrote:
>>
>> Okay - I updated it to:
>>
>> /*
>> * This softirq handler is triggered via SCHED_SOFTIRQ from two places:
>> *
>> * - directly from the local scheduler_tick() for periodic load balancing
>> *
>> * - indirectly from a remote scheduler_tick() for NOHZ idle balancing
>> * through the SMP cross-call nohz_csd_func()
>> */
>> static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
>>
>> Does this work with everyone?
>
> yes looks good for me
>

Ditto!


2024-03-12 10:58:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag


* Shrikanth Hegde <[email protected]> wrote:

> > I think we should probably do something about this contention on this
> > large system: especially if #2 'no work to be done' bailout is the
> > common case.
>
>
> I have been thinking would it be right to move this balancing
> trylock/atomic after should_we_balance(swb). This does reduce the number
> of times this checked/updated significantly. Contention is still present.
> That's possible at higher utilization when there are multiple NUMA
> domains. one CPU in each NUMA domain can contend if their invocation is
> aligned.

Note that it's not true contention: it simply means there's overlapping
requests for the highest domains to be balanced, for which we only have a
single thread of execution at a time, system-wide.

> That makes sense since, Right now a CPU takes lock, checks if it can
> balance, do balance if yes and then releases the lock. If the lock is
> taken after swb then also, CPU checks if it can balance,
> tries to take the lock and releases the lock if it did. If lock is
> contended, it bails out of load_balance. That is the current behaviour as
> well, or I am completely wrong.
>
> Not sure in which scenarios that would hurt. we could do this after this
> series. This may need wider functional testing to make sure we don't
> regress badly in some cases. This is only an *idea* as of now.
>
> Perf probes at spin_trylock and spin_unlock codepoints on the same 224CPU, 6 NUMA node system.
> 6.8-rc6
> -----------------------------------------
> idle system:
> 449 probe:rebalance_domains_L37
> 377 probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 51 << 51% load
> 88K probe:rebalance_domains_L37
> 77K probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 100 << 100% load
> 41K probe:rebalance_domains_L37
> 10K probe:rebalance_domains_L55
>
> +below patch
> ----------------------------------------
> idle system:
> 462 probe:load_balance_L35
> 394 probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 51 << 51% load
> 5K probe:load_balance_L35 <<-- almost 15x less
> 4K probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 100 << 100% load
> 8K probe:load_balance_L35
> 3K probe:load_balance_L274 <<-- almost 4x less

That's nice.

> +static DEFINE_SPINLOCK(balancing);
> /*
> * Check this_cpu to ensure it is balanced within domain. Attempt to move
> * tasks if there is an imbalance.
> @@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> struct rq *busiest;
> struct rq_flags rf;
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> + int need_serialize;
> struct lb_env env = {
> .sd = sd,
> .dst_cpu = this_cpu,
> @@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> goto out_balanced;
> }
>
> + need_serialize = sd->flags & SD_SERIALIZE;
> + if (need_serialize) {
> + if (!spin_trylock(&balancing))
> + goto lockout;
> + }
> +
> group = find_busiest_group(&env);

So if I'm reading your patch right, the main difference appears to be that
it allows the should_we_balance() check to be executed in parallel, and
will only try to take the NUMA-balancing flag if that function indicates an
imbalance.

Since should_we_balance() isn't taking any locks AFAICS, this might be a
valid approach. What might make sense is to instrument the percentage of
NUMA-balancing flag-taking 'failures' vs. successful attempts - not
necessarily the 'contention percentage'.
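
As a purely illustrative sketch of such instrumentation (the counters and
helper name below are made up, not kernel code; sched_balance_running is the
flag introduced earlier in this series):

/*
 * Sketch only: count successful vs. failed attempts to take the
 * serialization flag, so a 'failure percentage' can be derived from
 * e.g. a debugfs file or trace output.
 */
static atomic_t sbr_success = ATOMIC_INIT(0);
static atomic_t sbr_failure = ATOMIC_INIT(0);

static bool sched_balance_running_tryget(void)
{
	if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
		atomic_inc(&sbr_failure);
		return false;
	}
	atomic_inc(&sbr_success);
	return true;
}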

But another question is, why do we get here so frequently, so that the
cumulative execution time of these SD_SERIAL rebalance passes exceeds that
of 100% of single CPU time? Ie. a single CPU is basically continuously
scanning the scheduler data structures for imbalances, right? That doesn't
seem natural even with just ~224 CPUs.

Alternatively, is perhaps the execution time of the SD_SERIAL pass so large
that we exceed 100% CPU time?

Thanks,

Ingo

2024-03-21 12:13:11

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag



On 3/12/24 4:27 PM, Ingo Molnar wrote:
>
> * Shrikanth Hegde <[email protected]> wrote:
>
>> I have been thinking would it be right to move this balancing
>> trylock/atomic after should_we_balance(swb). This does reduce the number
>> of times this checked/updated significantly. Contention is still present.
>> That's possible at higher utilization when there are multiple NUMA
>> domains. one CPU in each NUMA domain can contend if their invocation is
>> aligned.
>
> Note that it's not true contention: it simply means there's overlapping
> requests for the highest domains to be balanced, for which we only have a
> single thread of execution at a time, system-wide.
>

Hi Ingo, Vincent, Peter,

I did some more experiments and collected quite a bit of data.
Some interesting observations.

tl;dr

1. For BUSY and IDLE type load balancing, contention arises because multiple CPUs
are trying to set the flag at the same time. Only one succeeds and tries to do the
load balance; whether it can actually do the balance is decided later in should_we_balance() (swb).
Contention increases with system utilization. For BUSY the cause is simple: CPUs contend if
their softirqs are aligned. At low utilization, the scenario is a BUSY balance being invoked
(though an idle CPU may be preferred) while an IDLE CPU is also trying.

2. NEWIDLE does not even check sched_balance_running, and it does not honor the
continue_balancing flag either. should_we_balance() for NEWIDLE mostly returns true anyway.
So NEWIDLE balancing doesn't honor the SD_SERIALIZE flag. It often goes through all the
runqueues even at the highest NUMA domains, and pulls tasks opportunistically. Balancing at
the NUMA level is costly, as it needs to fetch all the runqueues.

>> That makes sense since, Right now a CPU takes lock, checks if it can
>> balance, do balance if yes and then releases the lock. If the lock is
>> taken after swb then also, CPU checks if it can balance,
>> tries to take the lock and releases the lock if it did. If lock is
>> contended, it bails out of load_balance. That is the current behaviour as
>> well, or I am completely wrong.
>>
>> Perf probes at spin_trylock and spin_unlock codepoints on the same 224CPU, 6 NUMA node system.
>> 6.8-rc6
>> -----------------------------------------
>> idle system:
>> 449 probe:rebalance_domains_L37
>> 377 probe:rebalance_domains_L55
>> stress-ng --cpu=$(nproc) -l 51 << 51% load
>> 88K probe:rebalance_domains_L37
>> 77K probe:rebalance_domains_L55
>> stress-ng --cpu=$(nproc) -l 100 << 100% load
>> 41K probe:rebalance_domains_L37
>> 10K probe:rebalance_domains_L55
>>
>> +below patch
>> ----------------------------------------
>> idle system:
>> 462 probe:load_balance_L35
>> 394 probe:load_balance_L274
>> stress-ng --cpu=$(nproc) -l 51 << 51% load
>> 5K probe:load_balance_L35 <<-- almost 15x less
>> 4K probe:load_balance_L274
>> stress-ng --cpu=$(nproc) -l 100 << 100% load
>> 8K probe:load_balance_L35
>> 3K probe:load_balance_L274 <<-- almost 4x less
>
> That's nice.
>

These numbers go completely crazy for schbench or hackbench if we move
sched_balance_running to after should_we_balance(), due to NEWIDLE. We could add
an idle type check before doing operations on sched_balance_running.
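
For illustration, a minimal sketch of such a check, assuming the flag is taken
inside the balance function after should_we_balance() as in the draft quoted
below ('need_serialize' and the 'lockout' label follow that draft, with the
atomic flag from this series in place of the draft's spinlock; none of this is
committed code):

	/*
	 * Sketch only: skip serialization for NEWIDLE balancing so it
	 * neither contends on nor takes sched_balance_running.
	 */
	need_serialize = (sd->flags & SD_SERIALIZE) && (idle != CPU_NEWLY_IDLE);
	if (need_serialize) {
		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
			goto lockout;
	}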


>> +static DEFINE_SPINLOCK(balancing);
>> /*
>> * Check this_cpu to ensure it is balanced within domain. Attempt to move
>> * tasks if there is an imbalance.
>> @@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>> struct rq *busiest;
>> struct rq_flags rf;
>> struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
>> + int need_serialize;
>> struct lb_env env = {
>> .sd = sd,
>> .dst_cpu = this_cpu,
>> @@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>> goto out_balanced;
>> }
>>
>> + need_serialize = sd->flags & SD_SERIALIZE;
>> + if (need_serialize) {
>> + if (!spin_trylock(&balancing))
>> + goto lockout;
>> + }
>> +
>> group = find_busiest_group(&env);
>
> So if I'm reading your patch right, the main difference appears to be that
> it allows the should_we_balance() check to be executed in parallel, and
> will only try to take the NUMA-balancing flag if that function indicates an
> imbalance.

Yes for BUSY and IDLE balancing.

>
> Since should_we_balance() isn't taking any locks AFAICS, this might be a
> valid approach. What might make sense is to instrument the percentage of
> NUMA-balancing flag-taking 'failures' vs. successful attempts - not
> necessarily the 'contention percentage'.

So I put in a bunch of trace_printk() calls and added a hacky check, similar to swb, for when taking the flag fails.
Below are the numbers for schbench, hackbench and stress-ng -l 100 for initial reference.

I ran these on a slightly larger system this time. It has two NUMA levels; domain 3 has two groups, 0-159 and 160-319.
Sorry in case the data isn't easy to read - I couldn't find a better way. Some rounding errors are there, but the
sums, where applicable, are close. The trace is collected for 5 seconds.

NUMA node(s): 4
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s): 80-159
NUMA node6 CPU(s): 160-239
NUMA node7 CPU(s): 240-319

more domain*/flags | cat
::::::::::::::
domain0/flags
::::::::::::::
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
::::::::::::::
domain1/flags
::::::::::::::
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
::::::::::::::
domain2/flags
::::::::::::::
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
::::::::::::::
domain3/flags
::::::::::::::
SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA
::::::::::::::
domain4/flags
::::::::::::::
SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA

================================================================= BUSY balancing ===============================================
tried_access_to | failed | acquired(number of sched_balance_rq) <<-- acquired need not call sched_balance_rq due to time check.
sched_balance_running | to_acquire |
| swb true? swb_false? | swb_true? swb_false?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| |
hackbench 254727 | 236268 | 18508(170)
10 groups | 38 <--|--> 236230 | 0 <--|--> 170
| |
schbench 217794 | 207117 | 10683(659)
320 groups | 1059 <--|--> 206058 | 1 <--|--> 658
| |
stress-ng 31646 | 27430 | 4216(523)
-l100 | 272 <--|--> 27158 | 6 <--|--> 517
| |
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Observations:
* When the flag was not acquired, it was mostly the case that the CPU wouldn't have done the balancing anyway. Good.
* When the flag was acquired, that CPU mostly still didn't do the load balancing: it bailed out since swb was false. Bad.


=============================================================== IDLE Balancing ==================================================
tried_access_to | failed | acquired(number of sched_balance_rq) <<-- acquired need not call sched_balance_rq due to time check.
sched_balance_running | to_acquire |
| swb true? swb_false? | swb_true? swb_false?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

hackbench 11022 | 7309 | 3664(2635)
10 groups | 1322 <--|--> 5987 | 753 <--|--> 1882
| |
schbench 3018 | 2305 | 707(632)
320 groups | 1036 <--|--> 1269 | 268 <--|--> 364
| |
stress-ng 0 | 0 | 0
-l 100 | |
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Observations:
* When the flag was not acquired, there was a good chance that the CPU would have done the balance. Bad.
* When the flag was taken, it often didn't do the balance since swb was false. Bad.

>
> But another question is, why do we get here so frequently, so that the
> cumulative execution time of these SD_SERIAL rebalance passes exceeds that
> of 100% of single CPU time? Ie. a single CPU is basically continuously
> scanning the scheduler data structures for imbalances, right? That doesn't
> seem natural even with just ~224 CPUs.
>

SD_SERIALIZE passes are costly. A pass through 0-319 takes 20-30us and a
pass through 0-159 takes around 10-20us. NEWIDLE can be called very often; it
doesn't serialize and loops through all domains aggressively. NEWIDLE balancing
successfully pulls a task around 3-5% of the time.

For reference, while running the same hackbench, the combined cost adds up to
about 12 seconds on the 320-CPU system over the 5 seconds the trace is
collected. That's approx 0.7% of cycles.
domain_cost avg: 33031.77 times: 172991
domain_cost avg: 15615.21 times: 211783
domain_cost avg: 15359.40 times: 208164

Similarly while running schbench, it takes approx 0.12% of total time.

> Alternatively, is perhaps the execution time of the SD_SERIAL pass so large
> that we exceed 100% CPU time?
>

So the main questions I have are:
1. Should NEWIDLE honor SD_SERIALIZE? This would make SD_SERIALIZE well defined,
with all three types honoring it. Currently only BUSY and IDLE honor it.
2. Should we move taking sched_balance_running to after swb, at least for BUSY and IDLE?

================================================================================================


debug patch used:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8e50fbac345..cb824c327ab6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11172,6 +11172,56 @@ static int need_active_balance(struct lb_env *env)

static int active_load_balance_cpu_stop(void *data);

+static int hack_check_if_i_can_balance(struct sched_domain *sd, int this_cpu) {
+ struct cpumask *swb_cpus = this_cpu_cpumask_var_ptr(should_we_balance_tmpmask);
+ struct sched_group *sg = sd->groups;
+ int cpu, idle_smt = -1;
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+
+ cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+ if (!cpumask_test_cpu(this_cpu, cpus))
+ return 0;
+
+ cpumask_copy(swb_cpus, group_balance_mask(sg));
+ for_each_cpu_and(cpu, swb_cpus, cpus) {
+ if (!idle_cpu(cpu))
+ continue;
+
+ /*
+ * Don't balance to idle SMT in busy core right away when
+ * balancing cores, but remember the first idle SMT CPU for
+ * later consideration. Find CPU on an idle core first.
+ */
+ if (!(sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
+ if (idle_smt == -1)
+ idle_smt = cpu;
+ /*
+ * If the core is not idle, and first SMT sibling which is
+ * idle has been found, then its not needed to check other
+ * SMT siblings for idleness:
+ */
+#ifdef CONFIG_SCHED_SMT
+ cpumask_andnot(swb_cpus, swb_cpus, cpu_smt_mask(cpu));
+#endif
+ continue;
+ }
+
+ /*
+ * Are we the first idle core in a non-SMT domain or higher,
+ * or the first idle CPU in a SMT domain?
+ */
+ return cpu == this_cpu;
+ }
+
+ /* Are we the first idle CPU with busy siblings? */
+ if (idle_smt != -1)
+ return idle_smt == this_cpu;
+
+ /* Are we the first CPU of this group ? */
+ return group_balance_cpu(sg) == this_cpu;
+}
+
+
static int should_we_balance(struct lb_env *env)
{
struct cpumask *swb_cpus = this_cpu_cpumask_var_ptr(should_we_balance_tmpmask);
@@ -11267,13 +11317,22 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);

schedstat_inc(sd->lb_count[idle]);
+ if (sd->flags & SD_SERIALIZE)
+ trace_printk("Trying load_balance_rq for idle: %d span: %*pbl\n", idle,cpumask_pr_args(sched_domain_span(sd)));

redo:
if (!should_we_balance(&env)) {
*continue_balancing = 0;
+ if (sd->flags & SD_SERIALIZE) {
+ trace_printk("swb says this cpu doesnt need to balance idle: %d span: %*pbl\n", idle, cpumask_pr_args(sched_domain_span(sd)));
+ }
goto out_balanced;
}

+ if (sd->flags & SD_SERIALIZE) {
+ trace_printk("This cpu would try balance idle: %d span: %*pbl\n", idle, cpumask_pr_args(sched_domain_span(sd)));
+ }
+
group = sched_balance_find_src_group(&env);
if (!group) {
schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11288,6 +11347,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,

WARN_ON_ONCE(busiest == env.dst_rq);

+ if (sd->flags & SD_SERIALIZE) {
+ trace_printk("There is some imbalance idle: %d span: %*pbl\n", idle, cpumask_pr_args(sched_domain_span(sd)));
+ }
schedstat_add(sd->lb_imbalance[idle], env.imbalance);

env.src_cpu = busiest->cpu;
@@ -11507,6 +11569,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
sd->balance_interval < sd->max_interval)
sd->balance_interval *= 2;
out:
+ if ((sd->flags & SD_SERIALIZE) && ld_moved)
+ trace_printk("load balance was successful idle: %d span: %*pbl\n", idle, cpumask_pr_args(sched_domain_span(sd)));
return ld_moved;
}

@@ -11722,8 +11786,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)

need_serialize = sd->flags & SD_SERIALIZE;
if (need_serialize) {
- if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
+ int result;
+ trace_printk("try to set sched_balance_running idle: %d\n", idle);
+ if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
+ result = hack_check_if_i_can_balance(sd, cpu);
+ trace_printk("sched_balance_running was not aquird idle: %d was_swb: %d\n", idle, result);
goto out;
+ }
}

if (time_after_eq(jiffies, sd->last_balance + interval)) {
@@ -11739,8 +11808,10 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
}
- if (need_serialize)
+ if (need_serialize) {
+ trace_printk("Release sched_balance_running idle: %d\n", idle);
atomic_set_release(&sched_balance_running, 0);
+ }
out:
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
@@ -12347,6 +12418,10 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
u64 domain_cost;

update_next_balance(sd, &next_balance);
+ if (sd->flags & SD_SERIALIZE) {
+ trace_printk("newidle balance called span: %*pbl, rq_avg: %llu, curr_cost: %llu, max_idle_cost: %llu\n",
+ cpumask_pr_args(sched_domain_span(sd)), this_rq->avg_idle, curr_cost, sd->max_newidle_lb_cost);
+ }

if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
break;
@@ -12360,6 +12435,9 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
t1 = sched_clock_cpu(this_cpu);
domain_cost = t1 - t0;
update_newidle_cost(sd, domain_cost);
+ if (sd->flags & SD_SERIALIZE)
+ trace_printk("cost of load balance,span: %*pbl, pulled_task: %d, domain_cost: %llu\n",
+ cpumask_pr_args(sched_domain_span(sd)), pulled_task, domain_cost);

curr_cost += domain_cost;
t0 = t1;