2024-03-01 11:10:06

by Ingo Molnar

Subject: [PATCH 0/7] sched/balancing: Misc updates & cleanups

Improve a handful of things I noticed while looking through the
scheduler balancing code. No change in functionality intended.

Ingo Molnar (7):
sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag
sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions
sched/balancing: Change comment formatting to not overlap Git conflict marker lines
sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK
sched/balancing: Update run_rebalance_domains() comments
sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'
sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'

include/linux/sched/idle.h | 3 +-
kernel/sched/fair.c | 104 +++++++++++++++++++++++++++++++++++++----------------------------
2 files changed, 60 insertions(+), 47 deletions(-)

--
2.40.1



2024-03-01 11:10:14

by Ingo Molnar

Subject: [PATCH 1/7] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

The 'balancing' spinlock added in:

08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")

.. is taken when the SD_SERIALIZE flag is set in a domain, but in reality it
is a glorified global atomic flag serializing the load-balancing of
those domains.

It doesn't have any explicit locking semantics per se: we just
spin_trylock() it.

Turn it into a ... global atomic flag. This makes it clearer
what is going on here, and reduces overhead and code size a bit:

# kernel/sched/fair.o: [x86-64 defconfig]

  text   data   bss     dec    hex   filename
 60730   2721   104   63555   f843   fair.o.before
 60718   2721   104   63543   f837   fair.o.after

Also document the flag a bit.

No change in functionality intended.
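
For reference, here is the trylock-style flag pattern this switches to,
sketched standalone with C11 atomics (an illustrative sketch only: the
helper names below are made up, and the kernel code uses
atomic_cmpxchg_acquire()/atomic_set_release() directly):

  #include <stdatomic.h>
  #include <stdbool.h>

  /* 0: no serialized balancing pass is running, 1: one is in flight. */
  static atomic_int balance_running;

  /* Try to become the single serialized balancer; false if one already runs. */
  static bool try_start_balance(void)
  {
          int expected = 0;

          /* Acquire on success pairs with the release store in end_balance(). */
          return atomic_compare_exchange_strong_explicit(&balance_running,
                          &expected, 1,
                          memory_order_acquire, memory_order_relaxed);
  }

  static void end_balance(void)
  {
          atomic_store_explicit(&balance_running, 0, memory_order_release);
  }

A failed try_start_balance() is simply skipped, not re-tried - the same
policy the spin_trylock() version had.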

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a16129f9a5c..64ae3d8dc93b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11633,7 +11633,20 @@ static int active_load_balance_cpu_stop(void *data)
return 0;
}

-static DEFINE_SPINLOCK(balancing);
+/*
+ * This flag serializes load-balancing passes over large domains
+ * (such as SD_NUMA) - only one load-balancing instance may run
+ * at a time, to reduce overhead on very large systems with lots
+ * of CPUs and large NUMA distances.
+ *
+ * - Note that load-balancing passes triggered while another one
+ * is executing are skipped and not re-tried.
+ *
+ * - Also note that this does not serialize sched_balance_domains()
+ * execution, as non-SD_SERIALIZE domains will still be
+ * load-balanced in parallel.
+ */
+static atomic_t sched_balance_running = ATOMIC_INIT(0);

/*
* Scale the max load_balance interval with the number of CPUs in the system.
@@ -11711,7 +11724,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

need_serialize = sd->flags & SD_SERIALIZE;
if (need_serialize) {
- if (!spin_trylock(&balancing))
+ if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
goto out;
}

@@ -11729,7 +11742,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
interval = get_sd_balance_interval(sd, busy);
}
if (need_serialize)
- spin_unlock(&balancing);
+ atomic_set_release(&sched_balance_running, 0);
out:
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
--
2.40.1


2024-03-01 11:10:35

by Ingo Molnar

Subject: [PATCH 3/7] sched/balancing: Change comment formatting to not overlap Git conflict marker lines

So the scheduler has two comment blocks with '=' used as a double underline:

/*
* VRUNTIME
* ========
*

'========' also happens to be a Git conflict marker, throwing off a simple
search in an editor for this pattern.

Change them to a '-------' style of underline instead - it looks just as good.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f11fc6dd39b1..934ace69eb30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3679,7 +3679,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,

/*
* VRUNTIME
- * ========
+ * --------
*
* COROLLARY #1: The virtual runtime of the entity needs to be
* adjusted if re-weight at !0-lag point.
@@ -3762,7 +3762,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,

/*
* DEADLINE
- * ========
+ * --------
*
* When the weight changes, the virtual time slope changes and
* we should adjust the relative virtual deadline accordingly.
--
2.40.1


2024-03-01 11:10:57

by Ingo Molnar

Subject: [PATCH 5/7] sched/balancing: Update run_rebalance_domains() comments

The first sentence of the comment explaining run_rebalance_domains()
is historic and not true anymore:

* run_rebalance_domains is triggered when needed from the scheduler tick.

.. contradicted/modified by the second sentence:

* Also triggered for NOHZ idle balancing (with NOHZ_BALANCING_KICK set).

Avoid that kind of confusion straight away and explain from what
places the run_rebalance_domains() softirq handler is triggered.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 31838b9dcf36..0e13b0016f33 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12409,8 +12409,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
}

/*
- * run_rebalance_domains is triggered when needed from the scheduler tick.
- * Also triggered for NOHZ idle balancing (with NOHZ_BALANCING_KICK set).
+ * The run_rebalance_domains() softirq handler is triggered via SCHED_SOFTIRQ
+ * from two places:
+ *
+ * - the scheduler_tick(),
+ *
+ * - from the SMP cross-call function nohz_csd_func(),
+ * used by NOHZ idle balancing (with NOHZ_BALANCING_KICK set).
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
--
2.40.1


2024-03-01 11:11:16

by Ingo Molnar

Subject: [PATCH 2/7] sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions

The cpu_idle_type enum has the confusingly inverted property
that 'not idle' is 1, and 'idle' is '0'.

This resulted in a number of unnecessary complications in the code.

Reverse the order, remove the CPU_NOT_IDLE type, and convert
all code to a natural boolean form.

It's much more readable:

- enum cpu_idle_type idle = this_rq->idle_balance ?
- CPU_IDLE : CPU_NOT_IDLE;
-
+ enum cpu_idle_type idle = this_rq->idle_balance;

--------------------------------

- if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+ if (!env->idle || !busiest->sum_nr_running)

--------------------------------

And gets rid of the double negation in these usages:

- if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+ if (env->idle && env->src_rq->nr_running <= 1)

Furthermore, this makes it much more obvious where the code differentiates
between CPU_IDLE and CPU_NEWLY_IDLE.
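
Schematically, the three cases now read like this (an illustrative helper,
not code from fair.c; the enum is repeated here only to keep the snippet
self-contained):

  enum cpu_idle_type { CPU_IDLE = 1, CPU_NEWLY_IDLE, CPU_MAX_IDLE_TYPES };

  static const char *balance_kind(enum cpu_idle_type idle)
  {
          if (!idle)                      /* 0: the old CPU_NOT_IDLE case */
                  return "busy balancing";
          if (idle == CPU_NEWLY_IDLE)     /* the explicit case still stands out */
                  return "newly-idle balancing";
          return "idle balancing";        /* CPU_IDLE */
  }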

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
include/linux/sched/idle.h | 3 +--
kernel/sched/fair.c | 27 ++++++++++++---------------
2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 478084f9105e..4a6423700ffc 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -5,8 +5,7 @@
#include <linux/sched.h>

enum cpu_idle_type {
- CPU_IDLE,
- CPU_NOT_IDLE,
+ CPU_IDLE = 1,
CPU_NEWLY_IDLE,
CPU_MAX_IDLE_TYPES
};
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 64ae3d8dc93b..f11fc6dd39b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9070,7 +9070,7 @@ static int detach_tasks(struct lb_env *env)
* We don't want to steal all, otherwise we may be treated likewise,
* which could at worst lead to a livelock crash.
*/
- if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+ if (env->idle && env->src_rq->nr_running <= 1)
break;

env->loop++;
@@ -9803,7 +9803,7 @@ static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
struct sched_group *group)
{
- if (env->idle == CPU_NOT_IDLE)
+ if (!env->idle)
return false;

/*
@@ -9827,7 +9827,7 @@ static inline long sibling_imbalance(struct lb_env *env,
int ncores_busiest, ncores_local;
long imbalance;

- if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+ if (!env->idle || !busiest->sum_nr_running)
return 0;

ncores_busiest = sds->busiest->cores;
@@ -9927,8 +9927,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_misfit_task_load = rq->misfit_task_load;
*sg_status |= SG_OVERLOAD;
}
- } else if ((env->idle != CPU_NOT_IDLE) &&
- sched_reduced_capacity(rq, env->sd)) {
+ } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
if (sgs->group_misfit_task_load < load)
sgs->group_misfit_task_load = load;
@@ -9940,7 +9939,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_weight = group->group_weight;

/* Check if dst CPU is idle and preferred to this group */
- if (!local_group && env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
+ if (!local_group && env->idle && sgs->sum_h_nr_running &&
sched_group_asym(env, sgs, group))
sgs->group_asym_packing = 1;

@@ -10698,7 +10697,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* waiting task in this overloaded busiest group. Let's
* try to pull it.
*/
- if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
+ if (env->idle && env->imbalance == 0) {
env->migration_type = migrate_task;
env->imbalance = 1;
}
@@ -10913,7 +10912,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;

if (busiest->group_type != group_overloaded) {
- if (env->idle == CPU_NOT_IDLE) {
+ if (!env->idle) {
/*
* If the busiest group is not overloaded (and as a
* result the local one too) but this CPU is already
@@ -11121,7 +11120,7 @@ asym_active_balance(struct lb_env *env)
* the lower priority @env::dst_cpu help it. Do not follow
* CPU priority.
*/
- return env->idle != CPU_NOT_IDLE && sched_use_asym_prio(env->sd, env->dst_cpu) &&
+ return env->idle && sched_use_asym_prio(env->sd, env->dst_cpu) &&
(sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
!sched_use_asym_prio(env->sd, env->src_cpu));
}
@@ -11159,7 +11158,7 @@ static int need_active_balance(struct lb_env *env)
* because of other sched_class or IRQs if more capacity stays
* available on dst_cpu.
*/
- if ((env->idle != CPU_NOT_IDLE) &&
+ if (env->idle &&
(env->src_rq->cfs.h_nr_running == 1)) {
if ((check_cpu_capacity(env->src_rq, sd)) &&
(capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
@@ -11735,8 +11734,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
* env->dst_cpu, so we can't know our idle
* state even if we migrated tasks. Update it.
*/
- idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
- busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
+ idle = idle_cpu(cpu);
+ busy = !idle && !sched_idle_cpu(cpu);
}
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
@@ -12416,9 +12415,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
struct rq *this_rq = this_rq();
- enum cpu_idle_type idle = this_rq->idle_balance ?
- CPU_IDLE : CPU_NOT_IDLE;
-
+ enum cpu_idle_type idle = this_rq->idle_balance;
/*
* If this CPU has a pending nohz_balance_kick, then do the
* balancing on behalf of the other idle CPUs whose ticks are
--
2.40.1


2024-03-01 11:11:17

by Ingo Molnar

Subject: [PATCH 6/7] sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'

Make them easier to read.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0e13b0016f33..99495c93e094 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9436,19 +9436,19 @@ static void update_blocked_averages(int cpu)
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
struct sg_lb_stats {
- unsigned long avg_load; /*Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long avg_load; /* Avg load across the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
unsigned long group_capacity;
- unsigned long group_util; /* Total utilization over the CPUs of the group */
- unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
- unsigned int sum_nr_running; /* Nr of tasks running in the group */
- unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
+ unsigned long group_util; /* Total utilization over the CPUs of the group */
+ unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
+ unsigned int sum_nr_running; /* Nr of tasks running in the group */
+ unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
unsigned int idle_cpus;
unsigned int group_weight;
enum group_type group_type;
- unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
- unsigned int group_smt_balance; /* Task on busy SMT be moved */
- unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+ unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
+ unsigned int group_smt_balance; /* Task on busy SMT be moved */
+ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -9460,15 +9460,15 @@ struct sg_lb_stats {
* during load balancing.
*/
struct sd_lb_stats {
- struct sched_group *busiest; /* Busiest group in this sd */
- struct sched_group *local; /* Local group in this sd */
- unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_capacity; /* Total capacity of all groups in sd */
- unsigned long avg_load; /* Average load across all groups in sd */
- unsigned int prefer_sibling; /* tasks should go to sibling first */
+ struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *local; /* Local group in this sd */
+ unsigned long total_load; /* Total load of all groups in sd */
+ unsigned long total_capacity; /* Total capacity of all groups in sd */
+ unsigned long avg_load; /* Average load across all groups in sd */
+ unsigned int prefer_sibling; /* tasks should go to sibling first */

- struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
- struct sg_lb_stats local_stat; /* Statistics of the local group */
+ struct sg_lb_stats busiest_stat; /* Statistics of the busiest group */
+ struct sg_lb_stats local_stat; /* Statistics of the local group */
};

static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
--
2.40.1


2024-03-01 11:11:25

by Ingo Molnar

Subject: [PATCH 7/7] sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'

- Align for readability
- Capitalize consistently

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99495c93e094..4e58b9334d88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9433,17 +9433,17 @@ static void update_blocked_averages(int cpu)
/********** Helpers for find_busiest_group ************************/

/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
+ * sg_lb_stats - stats of a sched_group required for load-balancing:
*/
struct sg_lb_stats {
- unsigned long avg_load; /* Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long group_capacity;
- unsigned long group_util; /* Total utilization over the CPUs of the group */
+ unsigned long avg_load; /* Avg load over the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long group_capacity; /* Capacity over the CPUs of the group */
+ unsigned long group_util; /* Total utilization over the CPUs of the group */
unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
- unsigned int sum_nr_running; /* Nr of tasks running in the group */
+ unsigned int sum_nr_running; /* Nr of all tasks running in the group */
unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
- unsigned int idle_cpus;
+ unsigned int idle_cpus; /* Nr of idle CPUs in the group */
unsigned int group_weight;
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
@@ -9456,8 +9456,7 @@ struct sg_lb_stats {
};

/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
+ * sd_lb_stats - stats of a sched_domain required for load-balancing:
*/
struct sd_lb_stats {
struct sched_group *busiest; /* Busiest group in this sd */
@@ -9465,7 +9464,7 @@ struct sd_lb_stats {
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
- unsigned int prefer_sibling; /* tasks should go to sibling first */
+ unsigned int prefer_sibling; /* Tasks should go to sibling first */

struct sg_lb_stats busiest_stat; /* Statistics of the busiest group */
struct sg_lb_stats local_stat; /* Statistics of the local group */
--
2.40.1


2024-03-01 11:11:30

by Ingo Molnar

Subject: [PATCH 4/7] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK

Fix two typos:

- There's no such thing as 'nohz_balancing_kick', the
flag is named 'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.

- Likewise there's no such thing as a 'pending nohz_balance_kick'
either, the NOHZ_BALANCE_KICK flag is all upper-case.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 934ace69eb30..31838b9dcf36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12410,14 +12410,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

/*
* run_rebalance_domains is triggered when needed from the scheduler tick.
- * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
+ * Also triggered for NOHZ idle balancing (with NOHZ_BALANCING_KICK set).
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
struct rq *this_rq = this_rq();
enum cpu_idle_type idle = this_rq->idle_balance;
/*
- * If this CPU has a pending nohz_balance_kick, then do the
+ * If this CPU has a pending NOHZ_BALANCE_KICK, then do the
* balancing on behalf of the other idle CPUs whose ticks are
* stopped. Do nohz_idle_balance *before* rebalance_domains to
* give the idle CPUs a chance to load balance. Else we may
--
2.40.1


2024-03-01 11:23:56

by Ingo Molnar

Subject: [PATCH 4/7 v2] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK


* Ingo Molnar <[email protected]> wrote:

> Fix two typos:
>
> - There's no such thing as 'nohz_balancing_kick', the
> flag is named 'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.
>
> - Likewise there's no such thing as a 'pending nohz_balance_kick'
> either, the NOHZ_BALANCE_KICK flag is all upper-case.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 934ace69eb30..31838b9dcf36 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12410,14 +12410,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> /*
> * run_rebalance_domains is triggered when needed from the scheduler tick.
> - * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
> + * Also triggered for NOHZ idle balancing (with NOHZ_BALANCING_KICK set).

s/NOHZ_BALANCING_KICK
/NOHZ_BALANCE_KICK

As the changelog says.

This change got squashed into the wrong patch in a later part of this series
that I haven't posted yet. Find -v2 below.

Thanks,

Ingo

===============>
From: Ingo Molnar <[email protected]>
Date: Fri, 1 Mar 2024 11:30:42 +0100
Subject: [PATCH] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK

Fix two typos:

- There's no such thing as 'nohz_balancing_kick', the
flag is named 'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.

- Likewise there's no such thing as a 'pending nohz_balance_kick'
either, the NOHZ_BALANCE_KICK flag is all upper-case.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 934ace69eb30..4c46bffb6a7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12410,14 +12410,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

/*
* run_rebalance_domains is triggered when needed from the scheduler tick.
- * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
+ * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
struct rq *this_rq = this_rq();
enum cpu_idle_type idle = this_rq->idle_balance;
/*
- * If this CPU has a pending nohz_balance_kick, then do the
+ * If this CPU has a pending NOHZ_BALANCE_KICK, then do the
* balancing on behalf of the other idle CPUs whose ticks are
* stopped. Do nohz_idle_balance *before* rebalance_domains to
* give the idle CPUs a chance to load balance. Else we may

2024-03-01 11:27:12

by Ingo Molnar

Subject: [PATCH 5/7 v2] sched/balancing: Update run_rebalance_domains() comments


The first sentence of the comment explaining run_rebalance_domains()
is historic and not true anymore:

* run_rebalance_domains is triggered when needed from the scheduler tick.

.. contradicted/modified by the second sentence:

* Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).

Avoid that kind of confusion straight away and explain from what
places the run_rebalance_domains() softirq handler is triggered.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4c46bffb6a7a..18b7d2999cff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12409,8 +12409,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
}

/*
- * run_rebalance_domains is triggered when needed from the scheduler tick.
- * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
+ * The run_rebalance_domains() softirq handler is triggered via SCHED_SOFTIRQ
+ * from two places:
+ *
+ * - the scheduler_tick(),
+ *
+ * - from the SMP cross-call function nohz_csd_func(),
+ * used by NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{

2024-03-01 15:38:13

by Shrikanth Hegde

Subject: Re: [PATCH 1/7] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag



On 3/1/24 4:39 PM, Ingo Molnar wrote:
> The 'balancing' spinlock added in:

Hi Ingo.

>
> 08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
>


[...]

>
> need_serialize = sd->flags & SD_SERIALIZE;
> if (need_serialize) {
> - if (!spin_trylock(&balancing))
> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))

Thinking from very little I know, I may be completely wrong.

Is it possible that arch_spin_trylock(), which would be called from
spin_trylock(), is faster on some architectures? Maybe in the contended case?

For example, on powerpc, queued_spin_trylock() uses more optimal ll/sc-style
access patterns rather than cmpxchg.
https://lore.kernel.org/all/[email protected]/

+nick


> goto out;
> }
>
> @@ -11729,7 +11742,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> interval = get_sd_balance_interval(sd, busy);
> }
> if (need_serialize)
> - spin_unlock(&balancing);
> + atomic_set_release(&sched_balance_running, 0);
> out:
> if (time_after(next_balance, sd->last_balance + interval)) {
> next_balance = sd->last_balance + interval;

2024-03-02 09:06:51

by Shrikanth Hegde

Subject: Re: [PATCH 2/7] sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions



On 3/1/24 4:39 PM, Ingo Molnar wrote:
> The cpu_idle_type enum has the confusingly inverted property
> that 'not idle' is 1, and 'idle' is '0'.
>
> This resulted in a number of unnecessary complications in the code.
>
> Reverse the order, remove the CPU_NOT_IDLE type, and convert
> all code to a natural boolean form.
>
> It's much more readable:
>
> - enum cpu_idle_type idle = this_rq->idle_balance ?
> - CPU_IDLE : CPU_NOT_IDLE;
> -
> + enum cpu_idle_type idle = this_rq->idle_balance;
>
> --------------------------------
>
> - if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
> + if (!env->idle || !busiest->sum_nr_running)
>
> --------------------------------
>
> And gets rid of the double negation in these usages:
>
> - if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
> + if (env->idle && env->src_rq->nr_running <= 1)
>
> Furthermore, this makes it much more obvious where the code differentiates
> between CPU_IDLE and CPU_NEWLY_IDLE.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> ---
> include/linux/sched/idle.h | 3 +--
> kernel/sched/fair.c | 27 ++++++++++++---------------
> 2 files changed, 13 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
> index 478084f9105e..4a6423700ffc 100644
> --- a/include/linux/sched/idle.h
> +++ b/include/linux/sched/idle.h
> @@ -5,8 +5,7 @@
> #include <linux/sched.h>
>
> enum cpu_idle_type {
> - CPU_IDLE,
> - CPU_NOT_IDLE,
> + CPU_IDLE = 1,
> CPU_NEWLY_IDLE,
> CPU_MAX_IDLE_TYPES
> };

[...]

> struct rq *this_rq = this_rq();
> - enum cpu_idle_type idle = this_rq->idle_balance ?
> - CPU_IDLE : CPU_NOT_IDLE;
> -
> + enum cpu_idle_type idle = this_rq->idle_balance;
> /*
> * If this CPU has a pending nohz_balance_kick, then do the
> * balancing on behalf of the other idle CPUs whose ticks are

Hi Ingo.
This is more readable code indeed.

But the schedstat area also needs a fix. I applied the patch, and I see it
displays stats only for CPU_IDLE and CPU_NEWLY_IDLE, since stats.c starts
printing from CPU_IDLE.


Did the below test:
echo 1 > /proc/sys/kernel/sched_schedstats
sleep 300
stress-ng --cpu=$(nproc) -t 300
cat /proc/schedstat


6.8-rc5 -- There are 36 stat fields per domain line.
cpu0 0 0 4400 1485 1624 1229 301472313236 120382198 7714
domain0 00000000,00000000,00000055 1661 1661 0 0 0 0 0 1661 2495 2495 0 0 0 0 0 2495 67 66 1 2 0 0 0 66 0 0 0 0 0 0 0 0 0 133 38 0
domain1 ff000000,00ff0000,ffffffff 382 369 13 13 4 0 2 207 198 195 3 36 0 0 0 195 67 64 3 3 0 0 0 64 4 0 4 0 0 0 0 0 0 124 9 0
domain2 ff00ffff,00ffffff,ffffffff 586 585 1 6 0 0 0 365 118 116 2 96 0 0 0 116 67 67 0 0 0 0 0 67 0 0 0 0 0 0 0 0 0 59 0 0
domain3 ffffffff,ffffffff,ffffffff 481 479 2 58 0 0 0 387 97 97 0 0 0 0 0 96 67 67 0 0 0 0 0 67 0 0 0 0 0 0 0 0 0 79 0 0

+Patch -- There are 28 stat fields per domain line.
cpu0 0 0 2520 1244 1283 974 2273868054 212337506 6911
domain0 00000000,00000000,00000055 1975 1975 0 0 0 0 0 1975 45 45 0 0 0 0 0 45 0 0 0 0 0 0 0 0 0 65 3 0
domain1 ff000000,00ff0000,ffffffff 441 438 3 3 1 0 0 242 45 43 2 2 0 0 0 43 1 0 1 0 0 0 0 0 0 48 0 0
domain2 ff00ffff,00ffffff,ffffffff 655 654 1 2 0 0 0 468 45 45 0 0 0 0 0 45 0 0 0 0 0 0 0 0 0 152 0 0
domain3 ffffffff,ffffffff,ffffffff 521 521 0 0 0 0 0 472 44 44 0 0 0 0 0 44 0 0 0 0 0 0 0 0 0 44 0 0


I think it's all getting accounted, just not printed. With the change below,
I'm able to see all three types, albeit not in the right order as per the
schedstat documentation.

diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..f36b54bdd9fa 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -150,7 +150,7 @@ static int show_schedstat(struct seq_file *seq, void *v)

seq_printf(seq, "domain%d %*pb", dcount++,
cpumask_pr_args(sched_domain_span(sd)));
- for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
+ for (itype = 0; itype < CPU_MAX_IDLE_TYPES;
itype++) {
seq_printf(seq, " %u %u %u %u %u %u %u %u",
sd->lb_count[itype],

2024-03-04 09:29:36

by Ingo Molnar

Subject: Re: [PATCH 1/7] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag


* Shrikanth Hegde <[email protected]> wrote:

>
>
> On 3/1/24 4:39 PM, Ingo Molnar wrote:
> > The 'balancing' spinlock added in:
>
> Hi Ingo.
>
> >
> > 08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
> >
>
>
> [...]
>
> >
> > need_serialize = sd->flags & SD_SERIALIZE;
> > if (need_serialize) {
> > - if (!spin_trylock(&balancing))
> > + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>
> Thinking from very little I know, I may be completely wrong.
>
> Is it possible that arch_spin_trylock(), which would be called from
> spin_trylock(), is faster on some architectures? Maybe in the contended case?

This code path should never really be 'contended': SD_SERIALIZE is only set
for the outermost, largest domains (NUMA and up) that get balanced
infrequently. This change is more for the sake of readability: a flag
disguised as a spinlock.
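
A quick way to double-check where the flag gets defined and set, in case of
doubt:

  $ git grep -n SD_SERIALIZE include/linux/sched/ kernel/sched/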

Thanks,

Ingo

2024-03-04 09:29:59

by Ingo Molnar

Subject: Re: [PATCH 2/7] sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions


* Shrikanth Hegde <[email protected]> wrote:

> I think it's all getting accounted, just not printed. With the change
> below, I'm able to see all three types, albeit not in the right order as
> per the schedstat documentation.
>
> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
> index 857f837f52cb..f36b54bdd9fa 100644
> --- a/kernel/sched/stats.c
> +++ b/kernel/sched/stats.c
> @@ -150,7 +150,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
>
> seq_printf(seq, "domain%d %*pb", dcount++,
> cpumask_pr_args(sched_domain_span(sd)));
> - for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
> + for (itype = 0; itype < CPU_MAX_IDLE_TYPES;
> itype++) {
> seq_printf(seq, " %u %u %u %u %u %u %u %u",
> sd->lb_count[itype],

Indeed, good catch - there was an implicit reliance on CPU_IDLE's position
in the loop above that my patch didn't take into account.

Fixed.

Thanks,

Ingo