2024-03-08 11:00:17

by Ingo Molnar

Subject: [PATCH -v4 00/10] sched/balancing: Misc updates & cleanups

Improve a handful of things I noticed while looking through the
scheduler balancing code. No change in functionality intended.

Changes in -v4:

- Incorporate review feedback
- Bump SCHEDSTAT_VERSION to 16, due to the changed order of columns
- Allow CONFIG_SCHEDSTAT builds more widely

Thanks,

Ingo

==============================>
Ingo Molnar (9):
sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag
sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions
sched/debug: Increase SCHEDSTAT_VERSION to 16
sched/debug: Allow CONFIG_SCHEDSTATS even on !KERNEL_DEBUG kernels
sched/balancing: Change comment formatting to not overlap Git conflict marker lines
sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK
sched/balancing: Update run_rebalance_domains() comments
sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'
sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'

Shrikanth Hegde (1):
sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat()

include/linux/sched/idle.h | 2 +-
kernel/sched/fair.c | 103 +++++++++++++++++++++++++++++++++++++----------------------------
kernel/sched/stats.c | 5 ++--
lib/Kconfig.debug | 2 +-
4 files changed, 62 insertions(+), 50 deletions(-)

--
2.40.1



2024-03-08 11:00:40

by Ingo Molnar

Subject: [PATCH 04/10] sched/debug: Increase SCHEDSTAT_VERSION to 16

We changed the order of definitions within 'enum cpu_idle_type',
which changed the order of [CPU_MAX_IDLE_TYPES] columns in
show_schedstat().

Suggested-by: Shrikanth Hegde <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Vincent Guittot <[email protected]>
---
kernel/sched/stats.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 85277953cc72..78e48f5426ee 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -113,7 +113,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
* Bump this up when changing the output format or the meaning of an existing
* format, so that tools can adapt (or abort)
*/
-#define SCHEDSTAT_VERSION 15
+#define SCHEDSTAT_VERSION 16

static int show_schedstat(struct seq_file *seq, void *v)
{
--
2.40.1


2024-03-08 11:00:43

by Ingo Molnar

Subject: [PATCH 05/10] sched/debug: Allow CONFIG_SCHEDSTATS even on !KERNEL_DEBUG kernels

All major Linux distributions enable CONFIG_SCHEDSTATS,
so make it more widely available.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Vincent Guittot <[email protected]>
---
lib/Kconfig.debug | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ef36b829ae1f..0d6700f23ca4 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1250,7 +1250,7 @@ config SCHED_INFO

config SCHEDSTATS
bool "Collect scheduler statistics"
- depends on DEBUG_KERNEL && PROC_FS
+ depends on PROC_FS
select SCHED_INFO
help
If you say Y here, additional code will be inserted into the
--
2.40.1


2024-03-08 11:00:59

by Ingo Molnar

Subject: [PATCH 06/10] sched/balancing: Change comment formatting to not overlap Git conflict marker lines

So the scheduler has two such comment blocks, with '=' used as a double underline:

/*
* VRUNTIME
* ========
*

'========' also happens to be a Git conflict marker, throwing off a simple
search in an editor for this pattern.

Change them to '-------' type of underline instead - it looks just as good.

Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a510cf1fb00..84d4791cf628 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3679,7 +3679,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,

/*
* VRUNTIME
- * ========
+ * --------
*
* COROLLARY #1: The virtual runtime of the entity needs to be
* adjusted if re-weight at !0-lag point.
@@ -3762,7 +3762,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,

/*
* DEADLINE
- * ========
+ * --------
*
* When the weight changes, the virtual time slope changes and
* we should adjust the relative virtual deadline accordingly.
--
2.40.1


2024-03-08 11:01:21

by Ingo Molnar

Subject: [PATCH 07/10] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK

Fix two typos:

- There's no such thing as 'nohz_balancing_kick', the
flag is named 'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.

- Likewise there's no such thing as a 'pending nohz_balance_kick'
either, the NOHZ_BALANCE_KICK flag is all upper-case.

Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84d4791cf628..f3c03c6db3c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12409,15 +12409,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
}

/*
- * run_rebalance_domains is triggered when needed from the scheduler tick.
- * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
+ * This softirq may be triggered from the scheduler tick, or by
+ * any of the flags in NOHZ_KICK_MASK: NOHZ_BALANCE_KICK,
+ * NOHZ_STATS_KICK or NOHZ_NEXT_KICK.
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
struct rq *this_rq = this_rq();
enum cpu_idle_type idle = this_rq->idle_balance;
/*
- * If this CPU has a pending nohz_balance_kick, then do the
+ * If this CPU has a pending NOHZ_BALANCE_KICK, then do the
* balancing on behalf of the other idle CPUs whose ticks are
* stopped. Do nohz_idle_balance *before* rebalance_domains to
* give the idle CPUs a chance to load balance. Else we may
--
2.40.1


2024-03-08 11:01:37

by Ingo Molnar

Subject: [PATCH 02/10] sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat()

From: Shrikanth Hegde <[email protected]>

show_schedstat() output breaks and doesn't print all entries
if the ordering of the definitions in 'enum cpu_idle_type' is changed,
because show_schedstat() assumes that 'CPU_IDLE' is 0.

Fix it before we change the definition order & values.

[ mingo: Added changelog. ]

Signed-off-by: Shrikanth Hegde <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/sched/stats.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..85277953cc72 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -150,8 +150,7 @@ static int show_schedstat(struct seq_file *seq, void *v)

seq_printf(seq, "domain%d %*pb", dcount++,
cpumask_pr_args(sched_domain_span(sd)));
- for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
- itype++) {
+ for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
seq_printf(seq, " %u %u %u %u %u %u %u %u",
sd->lb_count[itype],
sd->lb_balanced[itype],
--
2.40.1


2024-03-08 11:02:06

by Ingo Molnar

Subject: [PATCH 10/10] sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'

- Align for readability
- Capitalize consistently

Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40b98e43d794..116a640534b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9433,17 +9433,17 @@ static void update_blocked_averages(int cpu)
/********** Helpers for find_busiest_group ************************/

/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
+ * sg_lb_stats - stats of a sched_group required for load-balancing:
*/
struct sg_lb_stats {
- unsigned long avg_load; /* Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long group_capacity;
- unsigned long group_util; /* Total utilization over the CPUs of the group */
+ unsigned long avg_load; /* Avg load over the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long group_capacity; /* Capacity over the CPUs of the group */
+ unsigned long group_util; /* Total utilization over the CPUs of the group */
unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
- unsigned int sum_nr_running; /* Nr of tasks running in the group */
+ unsigned int sum_nr_running; /* Nr of all tasks running in the group */
unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
- unsigned int idle_cpus;
+ unsigned int idle_cpus; /* Nr of idle CPUs in the group */
unsigned int group_weight;
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
@@ -9456,8 +9456,7 @@ struct sg_lb_stats {
};

/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
+ * sd_lb_stats - stats of a sched_domain required for load-balancing:
*/
struct sd_lb_stats {
struct sched_group *busiest; /* Busiest group in this sd */
@@ -9465,7 +9464,7 @@ struct sd_lb_stats {
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
- unsigned int prefer_sibling; /* tasks should go to sibling first */
+ unsigned int prefer_sibling; /* Tasks should go to sibling first */

struct sg_lb_stats busiest_stat; /* Statistics of the busiest group */
struct sg_lb_stats local_stat; /* Statistics of the local group */
--
2.40.1


2024-03-08 11:02:14

by Ingo Molnar

Subject: [PATCH 03/10] sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions

The cpu_idle_type enum has the confusingly inverted property
that 'not idle' is 1, and 'idle' is '0'.

This resulted in a number of unnecessary complications in the code.

Reverse the order, remove the CPU_NOT_IDLE type, and convert
all code to a natural boolean form.

It's much more readable:

- enum cpu_idle_type idle = this_rq->idle_balance ?
- CPU_IDLE : CPU_NOT_IDLE;
-
+ enum cpu_idle_type idle = this_rq->idle_balance;

--------------------------------

- if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+ if (!env->idle || !busiest->sum_nr_running)

--------------------------------

And gets rid of the double negation in these usages:

- if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+ if (env->idle && env->src_rq->nr_running <= 1)

Furthermore, this makes code much more obvious where there's
differentiation between CPU_IDLE and CPU_NEWLY_IDLE.

Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
---
include/linux/sched/idle.h | 2 +-
kernel/sched/fair.c | 27 ++++++++++++---------------
2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 478084f9105e..e670ac282333 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -5,8 +5,8 @@
#include <linux/sched.h>

enum cpu_idle_type {
+ __CPU_NOT_IDLE = 0,
CPU_IDLE,
- CPU_NOT_IDLE,
CPU_NEWLY_IDLE,
CPU_MAX_IDLE_TYPES
};
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ef89b36aed1..3a510cf1fb00 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9070,7 +9070,7 @@ static int detach_tasks(struct lb_env *env)
* We don't want to steal all, otherwise we may be treated likewise,
* which could at worst lead to a livelock crash.
*/
- if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+ if (env->idle && env->src_rq->nr_running <= 1)
break;

env->loop++;
@@ -9803,7 +9803,7 @@ static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
struct sched_group *group)
{
- if (env->idle == CPU_NOT_IDLE)
+ if (!env->idle)
return false;

/*
@@ -9827,7 +9827,7 @@ static inline long sibling_imbalance(struct lb_env *env,
int ncores_busiest, ncores_local;
long imbalance;

- if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+ if (!env->idle || !busiest->sum_nr_running)
return 0;

ncores_busiest = sds->busiest->cores;
@@ -9927,8 +9927,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_misfit_task_load = rq->misfit_task_load;
*sg_status |= SG_OVERLOAD;
}
- } else if ((env->idle != CPU_NOT_IDLE) &&
- sched_reduced_capacity(rq, env->sd)) {
+ } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
if (sgs->group_misfit_task_load < load)
sgs->group_misfit_task_load = load;
@@ -9940,7 +9939,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_weight = group->group_weight;

/* Check if dst CPU is idle and preferred to this group */
- if (!local_group && env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
+ if (!local_group && env->idle && sgs->sum_h_nr_running &&
sched_group_asym(env, sgs, group))
sgs->group_asym_packing = 1;

@@ -10698,7 +10697,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* waiting task in this overloaded busiest group. Let's
* try to pull it.
*/
- if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
+ if (env->idle && env->imbalance == 0) {
env->migration_type = migrate_task;
env->imbalance = 1;
}
@@ -10913,7 +10912,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;

if (busiest->group_type != group_overloaded) {
- if (env->idle == CPU_NOT_IDLE) {
+ if (!env->idle) {
/*
* If the busiest group is not overloaded (and as a
* result the local one too) but this CPU is already
@@ -11121,7 +11120,7 @@ asym_active_balance(struct lb_env *env)
* the lower priority @env::dst_cpu help it. Do not follow
* CPU priority.
*/
- return env->idle != CPU_NOT_IDLE && sched_use_asym_prio(env->sd, env->dst_cpu) &&
+ return env->idle && sched_use_asym_prio(env->sd, env->dst_cpu) &&
(sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
!sched_use_asym_prio(env->sd, env->src_cpu));
}
@@ -11159,7 +11158,7 @@ static int need_active_balance(struct lb_env *env)
* because of other sched_class or IRQs if more capacity stays
* available on dst_cpu.
*/
- if ((env->idle != CPU_NOT_IDLE) &&
+ if (env->idle &&
(env->src_rq->cfs.h_nr_running == 1)) {
if ((check_cpu_capacity(env->src_rq, sd)) &&
(capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
@@ -11735,8 +11734,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
* env->dst_cpu, so we can't know our idle
* state even if we migrated tasks. Update it.
*/
- idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
- busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
+ idle = idle_cpu(cpu);
+ busy = !idle && !sched_idle_cpu(cpu);
}
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
@@ -12416,9 +12415,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
struct rq *this_rq = this_rq();
- enum cpu_idle_type idle = this_rq->idle_balance ?
- CPU_IDLE : CPU_NOT_IDLE;
-
+ enum cpu_idle_type idle = this_rq->idle_balance;
/*
* If this CPU has a pending nohz_balance_kick, then do the
* balancing on behalf of the other idle CPUs whose ticks are
--
2.40.1


2024-03-08 11:03:45

by Ingo Molnar

Subject: [PATCH 08/10] sched/balancing: Update run_rebalance_domains() comments

The first sentence of the comment explaining run_rebalance_domains()
is historic and not true anymore:

* run_rebalance_domains is triggered when needed from the scheduler tick.

... contradicted/modified by the second sentence:

* Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).

Avoid that kind of confusion straight away and explain from what
places sched_balance_softirq() is triggered.

Signed-off-by: Ingo Molnar <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f3c03c6db3c8..b567c0790f44 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12409,9 +12409,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
}

/*
- * This softirq may be triggered from the scheduler tick, or by
- * any of the flags in NOHZ_KICK_MASK: NOHZ_BALANCE_KICK,
- * NOHZ_STATS_KICK or NOHZ_NEXT_KICK.
+ * This softirq handler is triggered via SCHED_SOFTIRQ from two places:
+ *
+ * - directly from the local scheduler_tick() for periodic load balancing
+ *
+ * - indirectly from a remote scheduler_tick() for NOHZ idle balancing
+ * through the SMP cross-call nohz_csd_func()
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
--
2.40.1


2024-03-08 11:04:12

by Ingo Molnar

Subject: [PATCH 09/10] sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'

Make them easier to read.

Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Shrikanth Hegde <[email protected]>
Cc: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b567c0790f44..40b98e43d794 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9436,19 +9436,19 @@ static void update_blocked_averages(int cpu)
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
struct sg_lb_stats {
- unsigned long avg_load; /*Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long avg_load; /* Avg load across the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
unsigned long group_capacity;
- unsigned long group_util; /* Total utilization over the CPUs of the group */
- unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
- unsigned int sum_nr_running; /* Nr of tasks running in the group */
- unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
+ unsigned long group_util; /* Total utilization over the CPUs of the group */
+ unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
+ unsigned int sum_nr_running; /* Nr of tasks running in the group */
+ unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
unsigned int idle_cpus;
unsigned int group_weight;
enum group_type group_type;
- unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
- unsigned int group_smt_balance; /* Task on busy SMT be moved */
- unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+ unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
+ unsigned int group_smt_balance; /* Task on busy SMT be moved */
+ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -9460,15 +9460,15 @@ struct sg_lb_stats {
* during load balancing.
*/
struct sd_lb_stats {
- struct sched_group *busiest; /* Busiest group in this sd */
- struct sched_group *local; /* Local group in this sd */
- unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_capacity; /* Total capacity of all groups in sd */
- unsigned long avg_load; /* Average load across all groups in sd */
- unsigned int prefer_sibling; /* tasks should go to sibling first */
+ struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *local; /* Local group in this sd */
+ unsigned long total_load; /* Total load of all groups in sd */
+ unsigned long total_capacity; /* Total capacity of all groups in sd */
+ unsigned long avg_load; /* Average load across all groups in sd */
+ unsigned int prefer_sibling; /* tasks should go to sibling first */

- struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
- struct sg_lb_stats local_stat; /* Statistics of the local group */
+ struct sg_lb_stats busiest_stat; /* Statistics of the busiest group */
+ struct sg_lb_stats local_stat; /* Statistics of the local group */
};

static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
--
2.40.1


2024-03-08 12:01:55

by Vincent Guittot

Subject: Re: [PATCH 10/10] sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'

On Fri, 8 Mar 2024 at 11:59, Ingo Molnar <[email protected]> wrote:
>
> - Align for readability
> - Capitalize consistently
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Reviewed-by: Valentin Schneider <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Shrikanth Hegde <[email protected]>
> Cc: Vincent Guittot <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 19 +++++++++----------
> 1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 40b98e43d794..116a640534b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9433,17 +9433,17 @@ static void update_blocked_averages(int cpu)
> /********** Helpers for find_busiest_group ************************/
>
> /*
> - * sg_lb_stats - stats of a sched_group required for load_balancing
> + * sg_lb_stats - stats of a sched_group required for load-balancing:
> */
> struct sg_lb_stats {
> - unsigned long avg_load; /* Avg load across the CPUs of the group */
> - unsigned long group_load; /* Total load over the CPUs of the group */
> - unsigned long group_capacity;
> - unsigned long group_util; /* Total utilization over the CPUs of the group */
> + unsigned long avg_load; /* Avg load over the CPUs of the group */
> + unsigned long group_load; /* Total load over the CPUs of the group */
> + unsigned long group_capacity; /* Capacity over the CPUs of the group */
> + unsigned long group_util; /* Total utilization over the CPUs of the group */
> unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
> - unsigned int sum_nr_running; /* Nr of tasks running in the group */
> + unsigned int sum_nr_running; /* Nr of all tasks running in the group */
> unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
> - unsigned int idle_cpus;
> + unsigned int idle_cpus; /* Nr of idle CPUs in the group */
> unsigned int group_weight;
> enum group_type group_type;
> unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> @@ -9456,8 +9456,7 @@ struct sg_lb_stats {
> };
>
> /*
> - * sd_lb_stats - Structure to store the statistics of a sched_domain
> - * during load balancing.
> + * sd_lb_stats - stats of a sched_domain required for load-balancing:
> */
> struct sd_lb_stats {
> struct sched_group *busiest; /* Busiest group in this sd */
> @@ -9465,7 +9464,7 @@ struct sd_lb_stats {
> unsigned long total_load; /* Total load of all groups in sd */
> unsigned long total_capacity; /* Total capacity of all groups in sd */
> unsigned long avg_load; /* Average load across all groups in sd */
> - unsigned int prefer_sibling; /* tasks should go to sibling first */
> + unsigned int prefer_sibling; /* Tasks should go to sibling first */
>
> struct sg_lb_stats busiest_stat; /* Statistics of the busiest group */
> struct sg_lb_stats local_stat; /* Statistics of the local group */
> --
> 2.40.1
>

2024-03-08 12:02:07

by Vincent Guittot

Subject: Re: [PATCH 09/10] sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'

On Fri, 8 Mar 2024 at 11:59, Ingo Molnar <[email protected]> wrote:
>
> Make them easier to read.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Reviewed-by: Valentin Schneider <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Shrikanth Hegde <[email protected]>
> Cc: Vincent Guittot <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 34 +++++++++++++++++-----------------
> 1 file changed, 17 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b567c0790f44..40b98e43d794 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9436,19 +9436,19 @@ static void update_blocked_averages(int cpu)
> * sg_lb_stats - stats of a sched_group required for load_balancing
> */
> struct sg_lb_stats {
> - unsigned long avg_load; /*Avg load across the CPUs of the group */
> - unsigned long group_load; /* Total load over the CPUs of the group */
> + unsigned long avg_load; /* Avg load across the CPUs of the group */
> + unsigned long group_load; /* Total load over the CPUs of the group */
> unsigned long group_capacity;
> - unsigned long group_util; /* Total utilization over the CPUs of the group */
> - unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
> - unsigned int sum_nr_running; /* Nr of tasks running in the group */
> - unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
> + unsigned long group_util; /* Total utilization over the CPUs of the group */
> + unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
> + unsigned int sum_nr_running; /* Nr of tasks running in the group */
> + unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
> unsigned int idle_cpus;
> unsigned int group_weight;
> enum group_type group_type;
> - unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> - unsigned int group_smt_balance; /* Task on busy SMT be moved */
> - unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> + unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> + unsigned int group_smt_balance; /* Task on busy SMT be moved */
> + unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> unsigned int nr_preferred_running;
> @@ -9460,15 +9460,15 @@ struct sg_lb_stats {
> * during load balancing.
> */
> struct sd_lb_stats {
> - struct sched_group *busiest; /* Busiest group in this sd */
> - struct sched_group *local; /* Local group in this sd */
> - unsigned long total_load; /* Total load of all groups in sd */
> - unsigned long total_capacity; /* Total capacity of all groups in sd */
> - unsigned long avg_load; /* Average load across all groups in sd */
> - unsigned int prefer_sibling; /* tasks should go to sibling first */
> + struct sched_group *busiest; /* Busiest group in this sd */
> + struct sched_group *local; /* Local group in this sd */
> + unsigned long total_load; /* Total load of all groups in sd */
> + unsigned long total_capacity; /* Total capacity of all groups in sd */
> + unsigned long avg_load; /* Average load across all groups in sd */
> + unsigned int prefer_sibling; /* tasks should go to sibling first */
>
> - struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
> - struct sg_lb_stats local_stat; /* Statistics of the local group */
> + struct sg_lb_stats busiest_stat; /* Statistics of the busiest group */
> + struct sg_lb_stats local_stat; /* Statistics of the local group */
> };
>
> static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> --
> 2.40.1
>

2024-03-08 12:02:29

by Vincent Guittot

Subject: Re: [PATCH 08/10] sched/balancing: Update run_rebalance_domains() comments

On Fri, 8 Mar 2024 at 11:59, Ingo Molnar <[email protected]> wrote:
>
> The first sentence of the comment explaining run_rebalance_domains()
> is historic and not true anymore:
>
> * run_rebalance_domains is triggered when needed from the scheduler tick.
>
> ... contradicted/modified by the second sentence:
>
> * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
>
> Avoid that kind of confusion straight away and explain from what
> places sched_balance_softirq() is triggered.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Shrikanth Hegde <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Vincent Guittot <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f3c03c6db3c8..b567c0790f44 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12409,9 +12409,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> }
>
> /*
> - * This softirq may be triggered from the scheduler tick, or by
> - * any of the flags in NOHZ_KICK_MASK: NOHZ_BALANCE_KICK,
> - * NOHZ_STATS_KICK or NOHZ_NEXT_KICK.
> + * This softirq handler is triggered via SCHED_SOFTIRQ from two places:
> + *
> + * - directly from the local scheduler_tick() for periodic load balancing
> + *
> + * - indirectly from a remote scheduler_tick() for NOHZ idle balancing
> + * through the SMP cross-call nohz_csd_func()
> */
> static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
> {
> --
> 2.40.1
>

2024-03-08 12:03:33

by Vincent Guittot

Subject: Re: [PATCH 07/10] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK

On Fri, 8 Mar 2024 at 11:59, Ingo Molnar <[email protected]> wrote:
>
> Fix two typos:
>
> - There's no such thing as 'nohz_balancing_kick', the
> flag is named 'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.
>
> - Likewise there's no such thing as a 'pending nohz_balance_kick'
> either, the NOHZ_BALANCE_KICK flag is all upper-case.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Reviewed-by: Valentin Schneider <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Shrikanth Hegde <[email protected]>
> Cc: Vincent Guittot <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 84d4791cf628..f3c03c6db3c8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12409,15 +12409,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> }
>
> /*
> - * run_rebalance_domains is triggered when needed from the scheduler tick.
> - * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
> + * This softirq may be triggered from the scheduler tick, or by
> + * any of the flags in NOHZ_KICK_MASK: NOHZ_BALANCE_KICK,
> + * NOHZ_STATS_KICK or NOHZ_NEXT_KICK.
> */
> static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
> {
> struct rq *this_rq = this_rq();
> enum cpu_idle_type idle = this_rq->idle_balance;
> /*
> - * If this CPU has a pending nohz_balance_kick, then do the
> + * If this CPU has a pending NOHZ_BALANCE_KICK, then do the
> * balancing on behalf of the other idle CPUs whose ticks are
> * stopped. Do nohz_idle_balance *before* rebalance_domains to
> * give the idle CPUs a chance to load balance. Else we may
> --
> 2.40.1
>

2024-03-08 13:36:55

by Vincent Guittot

Subject: Re: [PATCH 02/10] sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat()

On Fri, 8 Mar 2024 at 11:59, Ingo Molnar <[email protected]> wrote:
>
> From: Shrikanth Hegde <[email protected]>
>
> show_schedstat() output breaks and doesn't print all entries
> if the ordering of the definitions in 'enum cpu_idle_type' is changed,
> because show_schedstat() assumes that 'CPU_IDLE' is 0.
>
> Fix it before we change the definition order & values.
>
> [ mingo: Added changelog. ]
>
> Signed-off-by: Shrikanth Hegde <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Valentin Schneider <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/stats.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
> index 857f837f52cb..85277953cc72 100644
> --- a/kernel/sched/stats.c
> +++ b/kernel/sched/stats.c
> @@ -150,8 +150,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
>
> seq_printf(seq, "domain%d %*pb", dcount++,
> cpumask_pr_args(sched_domain_span(sd)));
> - for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
> - itype++) {
> + for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
> seq_printf(seq, " %u %u %u %u %u %u %u %u",
> sd->lb_count[itype],
> sd->lb_balanced[itype],
> --
> 2.40.1
>

2024-03-08 16:40:38

by Shrikanth Hegde

Subject: Re: [PATCH 04/10] sched/debug: Increase SCHEDSTAT_VERSION to 16



On 3/8/24 4:28 PM, Ingo Molnar wrote:
> We changed the order of definitions within 'enum cpu_idle_type',
> which changed the order of [CPU_MAX_IDLE_TYPES] columns in
> show_schedstat().
>

> +++ b/kernel/sched/stats.c
> @@ -113,7 +113,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
> * Bump this up when changing the output format or the meaning of an existing
> * format, so that tools can adapt (or abort)
> */
> -#define SCHEDSTAT_VERSION 15
> +#define SCHEDSTAT_VERSION 16

Please add the info about version, and change of the order
briefly in Documentation/scheduler/sched-stats.rst as well.

One recent user that I recollect is the sched scoreboard. The version number
should be able to help users when it breaks their scripts.

+Gautham, Any thoughts?

>
> static int show_schedstat(struct seq_file *seq, void *v)
> {

2024-03-12 10:02:20

by Ingo Molnar

Subject: Re: [PATCH 04/10] sched/debug: Increase SCHEDSTAT_VERSION to 16


* Shrikanth Hegde <[email protected]> wrote:

>
>
> On 3/8/24 4:28 PM, Ingo Molnar wrote:
> > We changed the order of definitions within 'enum cpu_idle_type',
> > which changed the order of [CPU_MAX_IDLE_TYPES] columns in
> > show_schedstat().
> >
>
> > +++ b/kernel/sched/stats.c
> > @@ -113,7 +113,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
> > * Bump this up when changing the output format or the meaning of an existing
> > * format, so that tools can adapt (or abort)
> > */
> > -#define SCHEDSTAT_VERSION 15
> > +#define SCHEDSTAT_VERSION 16
>
> Please add the info about version, and change of the order
> briefly in Documentation/scheduler/sched-stats.rst as well.

Good point, I've added this paragraph to sched-stats.rst:

+Version 16 of schedstats changed the order of definitions within
+'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
+columns in show_schedstat(). In particular the position of CPU_IDLE
+and __CPU_NOT_IDLE changed places. The size of the array is unchanged.

> One recent user that I recollect is the sched scoreboard. The version number
> should be able to help users when it breaks their scripts.
>
> +Gautham, Any thoughts?

If it's really a problem, I suspect we could maintain the v15 order of
output.

Thanks,

Ingo
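
For tool authors, the 'version' line at the top of /proc/schedstat is the
natural guard against format changes like this one. A minimal userspace sketch
of such a check follows - illustrative only, not part of this series; the
accepted-version set and the error handling are assumptions made for the
example:

#include <stdio.h>

int main(void)
{
	FILE *fp = fopen("/proc/schedstat", "r");
	char line[4096];
	unsigned int version;

	if (!fp) {
		perror("fopen /proc/schedstat");
		return 1;
	}

	/* The first line of /proc/schedstat is "version <N>". */
	if (!fgets(line, sizeof(line), fp) ||
	    sscanf(line, "version %u", &version) != 1) {
		fprintf(stderr, "unexpected /proc/schedstat header\n");
		fclose(fp);
		return 1;
	}

	/*
	 * Only parse formats this tool knows about: v16 reordered the
	 * per-domain idle-type columns relative to v15.
	 */
	if (version != 15 && version != 16) {
		fprintf(stderr, "unsupported schedstat version %u\n", version);
		fclose(fp);
		return 1;
	}

	/* Real field parsing would branch on 'version' here. */
	while (fgets(line, sizeof(line), fp))
		fputs(line, stdout);

	fclose(fp);
	return 0;
}

Anything more elaborate (such as the scoreboard scripts mentioned above) would
branch on the parsed version before interpreting the per-domain columns.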

Subject: [tip: sched/core] sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'

The following commit has been merged into the sched/core branch of tip:

Commit-ID: e492e1b0e0721f3929ef9d9708d029144b396dd7
Gitweb: https://git.kernel.org/tip/e492e1b0e0721f3929ef9d9708d029144b396dd7
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 08 Mar 2024 11:59:00 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 12 Mar 2024 11:59:59 +01:00

sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'

Make them easier to read.

Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 36 ++++++++++++++++++------------------
1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b567c07..40b98e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9436,19 +9436,19 @@ static void update_blocked_averages(int cpu)
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
struct sg_lb_stats {
- unsigned long avg_load; /*Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long avg_load; /* Avg load across the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
unsigned long group_capacity;
- unsigned long group_util; /* Total utilization over the CPUs of the group */
- unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
- unsigned int sum_nr_running; /* Nr of tasks running in the group */
- unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
+ unsigned long group_util; /* Total utilization over the CPUs of the group */
+ unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
+ unsigned int sum_nr_running; /* Nr of tasks running in the group */
+ unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
unsigned int idle_cpus;
unsigned int group_weight;
enum group_type group_type;
- unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
- unsigned int group_smt_balance; /* Task on busy SMT be moved */
- unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+ unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
+ unsigned int group_smt_balance; /* Task on busy SMT be moved */
+ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -9460,15 +9460,15 @@ struct sg_lb_stats {
* during load balancing.
*/
struct sd_lb_stats {
- struct sched_group *busiest; /* Busiest group in this sd */
- struct sched_group *local; /* Local group in this sd */
- unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_capacity; /* Total capacity of all groups in sd */
- unsigned long avg_load; /* Average load across all groups in sd */
- unsigned int prefer_sibling; /* tasks should go to sibling first */
-
- struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
- struct sg_lb_stats local_stat; /* Statistics of the local group */
+ struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *local; /* Local group in this sd */
+ unsigned long total_load; /* Total load of all groups in sd */
+ unsigned long total_capacity; /* Total capacity of all groups in sd */
+ unsigned long avg_load; /* Average load across all groups in sd */
+ unsigned int prefer_sibling; /* tasks should go to sibling first */
+
+ struct sg_lb_stats busiest_stat; /* Statistics of the busiest group */
+ struct sg_lb_stats local_stat; /* Statistics of the local group */
};

static inline void init_sd_lb_stats(struct sd_lb_stats *sds)

Subject: [tip: sched/core] sched/balancing: Update run_rebalance_domains() comments

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 3dc6f6c8efe2d9148865664fffbe9dd89fb0d157
Gitweb: https://git.kernel.org/tip/3dc6f6c8efe2d9148865664fffbe9dd89fb0d157
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 08 Mar 2024 11:58:59 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 12 Mar 2024 11:59:43 +01:00

sched/balancing: Update run_rebalance_domains() comments

The first sentence of the comment explaining run_rebalance_domains()
is historic and not true anymore:

* run_rebalance_domains is triggered when needed from the scheduler tick.

... contradicted/modified by the second sentence:

* Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).

Avoid that kind of confusion straight away and explain from what
places sched_balance_softirq() is triggered.

Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Acked-by: Valentin Schneider <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f3c03c6..b567c07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12409,9 +12409,12 @@ out:
}

/*
- * This softirq may be triggered from the scheduler tick, or by
- * any of the flags in NOHZ_KICK_MASK: NOHZ_BALANCE_KICK,
- * NOHZ_STATS_KICK or NOHZ_NEXT_KICK.
+ * This softirq handler is triggered via SCHED_SOFTIRQ from two places:
+ *
+ * - directly from the local scheduler_tick() for periodic load balancing
+ *
+ * - indirectly from a remote scheduler_tick() for NOHZ idle balancing
+ * through the SMP cross-call nohz_csd_func()
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{

Subject: [tip: sched/core] sched/debug: Allow CONFIG_SCHEDSTATS even on !KERNEL_DEBUG kernels

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 9ab121d65e03b4dc38f207871070eb353b396b05
Gitweb: https://git.kernel.org/tip/9ab121d65e03b4dc38f207871070eb353b396b05
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 08 Mar 2024 11:58:56 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 12 Mar 2024 11:03:41 +01:00

sched/debug: Allow CONFIG_SCHEDSTATS even on !KERNEL_DEBUG kernels

All major Linux distributions enable CONFIG_SCHEDSTATS,
so make it more widely available.

Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
lib/Kconfig.debug | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 6c596e6..ed2ad1a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1250,7 +1250,7 @@ config SCHED_INFO

config SCHEDSTATS
bool "Collect scheduler statistics"
- depends on DEBUG_KERNEL && PROC_FS
+ depends on PROC_FS
select SCHED_INFO
help
If you say Y here, additional code will be inserted into the

Subject: [tip: sched/core] sched/balancing: Change comment formatting to not overlap Git conflict marker lines

The following commit has been merged into the sched/core branch of tip:

Commit-ID: be8858dba9a2c3aec454a6b382671101fd0dc3b7
Gitweb: https://git.kernel.org/tip/be8858dba9a2c3aec454a6b382671101fd0dc3b7
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 08 Mar 2024 11:58:57 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 12 Mar 2024 11:03:41 +01:00

sched/balancing: Change comment formatting to not overlap Git conflict marker lines

So the scheduler has two such comment blocks, with '=' used as a double underline:

/*
* VRUNTIME
* ========
*

'========' also happens to be a Git conflict marker, throwing off a simple
search in an editor for this pattern.

Change them to '-------' type of underline instead - it looks just as good.

Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a510cf..84d4791 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3679,7 +3679,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,

/*
* VRUNTIME
- * ========
+ * --------
*
* COROLLARY #1: The virtual runtime of the entity needs to be
* adjusted if re-weight at !0-lag point.
@@ -3762,7 +3762,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,

/*
* DEADLINE
- * ========
+ * --------
*
* When the weight changes, the virtual time slope changes and
* we should adjust the relative virtual deadline accordingly.

Subject: [tip: sched/core] sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 38d707c54df4ca58cd9ceae2ddcbd6f606b99e9f
Gitweb: https://git.kernel.org/tip/38d707c54df4ca58cd9ceae2ddcbd6f606b99e9f
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 08 Mar 2024 11:58:54 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 12 Mar 2024 11:03:40 +01:00

sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions

The cpu_idle_type enum has the confusingly inverted property
that 'not idle' is 1, and 'idle' is '0'.

This resulted in a number of unnecessary complications in the code.

Reverse the order, remove the CPU_NOT_IDLE type, and convert
all code to a natural boolean form.

It's much more readable:

- enum cpu_idle_type idle = this_rq->idle_balance ?
- CPU_IDLE : CPU_NOT_IDLE;
-
+ enum cpu_idle_type idle = this_rq->idle_balance;

--------------------------------

- if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+ if (!env->idle || !busiest->sum_nr_running)

--------------------------------

And gets rid of the double negation in these usages:

- if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+ if (env->idle && env->src_rq->nr_running <= 1)

Furthermore, this makes code much more obvious where there's
differentiation between CPU_IDLE and CPU_NEWLY_IDLE.

Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Cc: "Gautham R. Shenoy" <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/sched/idle.h | 2 +-
kernel/sched/fair.c | 27 ++++++++++++---------------
2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 478084f..e670ac2 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -5,8 +5,8 @@
#include <linux/sched.h>

enum cpu_idle_type {
+ __CPU_NOT_IDLE = 0,
CPU_IDLE,
- CPU_NOT_IDLE,
CPU_NEWLY_IDLE,
CPU_MAX_IDLE_TYPES
};
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ef89b3..3a510cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9070,7 +9070,7 @@ static int detach_tasks(struct lb_env *env)
* We don't want to steal all, otherwise we may be treated likewise,
* which could at worst lead to a livelock crash.
*/
- if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+ if (env->idle && env->src_rq->nr_running <= 1)
break;

env->loop++;
@@ -9803,7 +9803,7 @@ static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
struct sched_group *group)
{
- if (env->idle == CPU_NOT_IDLE)
+ if (!env->idle)
return false;

/*
@@ -9827,7 +9827,7 @@ static inline long sibling_imbalance(struct lb_env *env,
int ncores_busiest, ncores_local;
long imbalance;

- if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+ if (!env->idle || !busiest->sum_nr_running)
return 0;

ncores_busiest = sds->busiest->cores;
@@ -9927,8 +9927,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_misfit_task_load = rq->misfit_task_load;
*sg_status |= SG_OVERLOAD;
}
- } else if ((env->idle != CPU_NOT_IDLE) &&
- sched_reduced_capacity(rq, env->sd)) {
+ } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
if (sgs->group_misfit_task_load < load)
sgs->group_misfit_task_load = load;
@@ -9940,7 +9939,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_weight = group->group_weight;

/* Check if dst CPU is idle and preferred to this group */
- if (!local_group && env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
+ if (!local_group && env->idle && sgs->sum_h_nr_running &&
sched_group_asym(env, sgs, group))
sgs->group_asym_packing = 1;

@@ -10698,7 +10697,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* waiting task in this overloaded busiest group. Let's
* try to pull it.
*/
- if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
+ if (env->idle && env->imbalance == 0) {
env->migration_type = migrate_task;
env->imbalance = 1;
}
@@ -10913,7 +10912,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;

if (busiest->group_type != group_overloaded) {
- if (env->idle == CPU_NOT_IDLE) {
+ if (!env->idle) {
/*
* If the busiest group is not overloaded (and as a
* result the local one too) but this CPU is already
@@ -11121,7 +11120,7 @@ asym_active_balance(struct lb_env *env)
* the lower priority @env::dst_cpu help it. Do not follow
* CPU priority.
*/
- return env->idle != CPU_NOT_IDLE && sched_use_asym_prio(env->sd, env->dst_cpu) &&
+ return env->idle && sched_use_asym_prio(env->sd, env->dst_cpu) &&
(sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
!sched_use_asym_prio(env->sd, env->src_cpu));
}
@@ -11159,7 +11158,7 @@ static int need_active_balance(struct lb_env *env)
* because of other sched_class or IRQs if more capacity stays
* available on dst_cpu.
*/
- if ((env->idle != CPU_NOT_IDLE) &&
+ if (env->idle &&
(env->src_rq->cfs.h_nr_running == 1)) {
if ((check_cpu_capacity(env->src_rq, sd)) &&
(capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
@@ -11735,8 +11734,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
* env->dst_cpu, so we can't know our idle
* state even if we migrated tasks. Update it.
*/
- idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
- busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
+ idle = idle_cpu(cpu);
+ busy = !idle && !sched_idle_cpu(cpu);
}
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
@@ -12416,9 +12415,7 @@ out:
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
struct rq *this_rq = this_rq();
- enum cpu_idle_type idle = this_rq->idle_balance ?
- CPU_IDLE : CPU_NOT_IDLE;
-
+ enum cpu_idle_type idle = this_rq->idle_balance;
/*
* If this CPU has a pending nohz_balance_kick, then do the
* balancing on behalf of the other idle CPUs whose ticks are

Subject: [tip: sched/core] sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 02a61f325a8e62a7c76479c5f2f7ddcba16877e8
Gitweb: https://git.kernel.org/tip/02a61f325a8e62a7c76479c5f2f7ddcba16877e8
Author: Shrikanth Hegde <[email protected]>
AuthorDate: Fri, 08 Mar 2024 11:58:53 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 12 Mar 2024 11:03:40 +01:00

sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat()

show_schedstat() output breaks and doesn't print all entries
if the ordering of the definitions in 'enum cpu_idle_type' is changed,
because show_schedstat() assumes that 'CPU_IDLE' is 0.

Fix it before we change the definition order & values.

[ mingo: Added changelog. ]

Signed-off-by: Shrikanth Hegde <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/stats.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837..8527795 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -150,8 +150,7 @@ static int show_schedstat(struct seq_file *seq, void *v)

seq_printf(seq, "domain%d %*pb", dcount++,
cpumask_pr_args(sched_domain_span(sd)));
- for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
- itype++) {
+ for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
seq_printf(seq, " %u %u %u %u %u %u %u %u",
sd->lb_count[itype],
sd->lb_balanced[itype],

Subject: [tip: sched/core] sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 33928ed8bde066316dc3cb6ccf7f74400073652f
Gitweb: https://git.kernel.org/tip/33928ed8bde066316dc3cb6ccf7f74400073652f
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 08 Mar 2024 11:59:01 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 12 Mar 2024 11:59:59 +01:00

sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'

- Align for readability
- Capitalize consistently

Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40b98e4..116a640 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9433,17 +9433,17 @@ static void update_blocked_averages(int cpu)
/********** Helpers for find_busiest_group ************************/

/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
+ * sg_lb_stats - stats of a sched_group required for load-balancing:
*/
struct sg_lb_stats {
- unsigned long avg_load; /* Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long group_capacity;
- unsigned long group_util; /* Total utilization over the CPUs of the group */
+ unsigned long avg_load; /* Avg load over the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long group_capacity; /* Capacity over the CPUs of the group */
+ unsigned long group_util; /* Total utilization over the CPUs of the group */
unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
- unsigned int sum_nr_running; /* Nr of tasks running in the group */
+ unsigned int sum_nr_running; /* Nr of all tasks running in the group */
unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
- unsigned int idle_cpus;
+ unsigned int idle_cpus; /* Nr of idle CPUs in the group */
unsigned int group_weight;
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
@@ -9456,8 +9456,7 @@ struct sg_lb_stats {
};

/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
+ * sd_lb_stats - stats of a sched_domain required for load-balancing:
*/
struct sd_lb_stats {
struct sched_group *busiest; /* Busiest group in this sd */
@@ -9465,7 +9464,7 @@ struct sd_lb_stats {
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
- unsigned int prefer_sibling; /* tasks should go to sibling first */
+ unsigned int prefer_sibling; /* Tasks should go to sibling first */

struct sg_lb_stats busiest_stat; /* Statistics of the busiest group */
struct sg_lb_stats local_stat; /* Statistics of the local group */

Subject: [tip: sched/core] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 3a5fe9305719c680ccf63216781a4d4068c8e3f3
Gitweb: https://git.kernel.org/tip/3a5fe9305719c680ccf63216781a4d4068c8e3f3
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 08 Mar 2024 11:58:58 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 12 Mar 2024 11:03:42 +01:00

sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK

Fix two typos:

- There's no such thing as 'nohz_balancing_kick', the
flag is named 'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.

- Likewise there's no such thing as a 'pending nohz_balance_kick'
either, the NOHZ_BALANCE_KICK flag is all upper-case.

Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84d4791..f3c03c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12409,15 +12409,16 @@ out:
}

/*
- * run_rebalance_domains is triggered when needed from the scheduler tick.
- * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
+ * This softirq may be triggered from the scheduler tick, or by
+ * any of the flags in NOHZ_KICK_MASK: NOHZ_BALANCE_KICK,
+ * NOHZ_STATS_KICK or NOHZ_NEXT_KICK.
*/
static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
struct rq *this_rq = this_rq();
enum cpu_idle_type idle = this_rq->idle_balance;
/*
- * If this CPU has a pending nohz_balance_kick, then do the
+ * If this CPU has a pending NOHZ_BALANCE_KICK, then do the
* balancing on behalf of the other idle CPUs whose ticks are
* stopped. Do nohz_idle_balance *before* rebalance_domains to
* give the idle CPUs a chance to load balance. Else we may

Subject: [tip: sched/core] sched/debug: Increase SCHEDSTAT_VERSION to 16

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 11b0bfa5d463b17cac5bf6b94fea4921713530c3
Gitweb: https://git.kernel.org/tip/11b0bfa5d463b17cac5bf6b94fea4921713530c3
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 08 Mar 2024 11:58:55 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 12 Mar 2024 11:03:40 +01:00

sched/debug: Increase SCHEDSTAT_VERSION to 16

We changed the order of definitions within 'enum cpu_idle_type',
which changed the order of [CPU_MAX_IDLE_TYPES] columns in
show_schedstat().

Suggested-by: Shrikanth Hegde <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: "Gautham R. Shenoy" <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
Documentation/scheduler/sched-stats.rst | 5 +++++
kernel/sched/stats.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/Documentation/scheduler/sched-stats.rst b/Documentation/scheduler/sched-stats.rst
index 03c0629..73c4126 100644
--- a/Documentation/scheduler/sched-stats.rst
+++ b/Documentation/scheduler/sched-stats.rst
@@ -2,6 +2,11 @@
Scheduler Statistics
====================

+Version 16 of schedstats changed the order of definitions within
+'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
+columns in show_schedstat(). In particular the position of CPU_IDLE
+and __CPU_NOT_IDLE changed places. The size of the array is unchanged.
+
Version 15 of schedstats dropped counters for some sched_yield:
yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
identical to version 14.
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 8527795..78e48f5 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -113,7 +113,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
* Bump this up when changing the output format or the meaning of an existing
* format, so that tools can adapt (or abort)
*/
-#define SCHEDSTAT_VERSION 15
+#define SCHEDSTAT_VERSION 16

static int show_schedstat(struct seq_file *seq, void *v)
{

2024-03-14 05:22:11

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [PATCH 04/10] sched/debug: Increase SCHEDSTAT_VERSION to 16

Hello Ingo, Shrikanth,

On Tue, Mar 12, 2024 at 11:01:54AM +0100, Ingo Molnar wrote:
>
> * Shrikanth Hegde <[email protected]> wrote:
>
> >
> >
> > On 3/8/24 4:28 PM, Ingo Molnar wrote:
> > > We changed the order of definitions within 'enum cpu_idle_type',
> > > which changed the order of [CPU_MAX_IDLE_TYPES] columns in
> > > show_schedstat().
> > >
> >
> > > +++ b/kernel/sched/stats.c
> > > @@ -113,7 +113,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
> > > * Bump this up when changing the output format or the meaning of an existing
> > > * format, so that tools can adapt (or abort)
> > > */
> > > -#define SCHEDSTAT_VERSION 15
> > > +#define SCHEDSTAT_VERSION 16
> >
> > Please add the info about version, and change of the order
> > briefly in Documentation/scheduler/sched-stats.rst as well.
>
> Good point, I've added this paragraph to sched-stats.rst:
>
> +Version 16 of schedstats changed the order of definitions within
> +'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
> +columns in show_schedstat(). In particular the position of CPU_IDLE
> +and __CPU_NOT_IDLE changed places. The size of the array is unchanged.

Thanks for this change!

Since we are considering bumping up the version for this change, it
might be good to get rid of "lb_imbalance" which is currently a sum of
all the different kinds of imbalances (due to load, utilization,
number of tasks, misfit tasks). This is not useful.

Swapnil had a patch to print the different imbalances, which will
change the number of fields. Can we include this change into v16 as
well?

Swapnil, could you please rebase your patch on tip/sched/core and post
it ?

>
> > One recent user that I recollect is sched scoreboard. Version number should
> > be able to help the users when it breaks their scripts.
> >

> > +Gautham, Any thoughts?

Yes, the scoreboard looks at the version. If it doesn't understand a
version, it will not parse it. It can be extended to understand the
new version, since that's only about adding a new json.
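
As an illustration of such a version gate, here is a minimal user-space sketch;
the /proc/schedstat path and its leading "version <N>" line follow the format
documented above, while the rest (names, error handling, the parsing stub) is an
assumption and not the scoreboard's actual code:

#include <stdio.h>

int main(void)
{
	FILE *fp = fopen("/proc/schedstat", "r");
	char line[256];
	unsigned int version;

	if (!fp) {
		perror("fopen /proc/schedstat");
		return 1;
	}
	if (!fgets(line, sizeof(line), fp) ||
	    sscanf(line, "version %u", &version) != 1) {
		fprintf(stderr, "unrecognized schedstat header\n");
		fclose(fp);
		return 1;
	}
	if (version != 16) {
		fprintf(stderr, "unsupported schedstat version %u, not parsing\n", version);
		fclose(fp);
		return 1;
	}
	/* ... parse the version-16 'cpuN' and 'domainN' lines here ... */
	fclose(fp);
	return 0;
}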

>
> If it's really a problem, I suspect we could maintain the v15 order of
> output.

I don't think moving to v16 is an issue.

>
> Thanks,
>
> Ingo

--
Thanks and Regards
gautham.

2024-03-14 06:10:30

by Swapnil Sapkal

[permalink] [raw]
Subject: Re: [PATCH 04/10] sched/debug: Increase SCHEDSTAT_VERSION to 16



On 3/14/2024 10:51 AM, Gautham R. Shenoy wrote:
> Hello Ingo, Shrikanth,
>
> On Tue, Mar 12, 2024 at 11:01:54AM +0100, Ingo Molnar wrote:
>>
>> * Shrikanth Hegde <[email protected]> wrote:
>>
>>>
>>>
>>> On 3/8/24 4:28 PM, Ingo Molnar wrote:
>>>> We changed the order of definitions within 'enum cpu_idle_type',
>>>> which changed the order of [CPU_MAX_IDLE_TYPES] columns in
>>>> show_schedstat().
>>>>
>>>
>>>> +++ b/kernel/sched/stats.c
>>>> @@ -113,7 +113,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
>>>> * Bump this up when changing the output format or the meaning of an existing
>>>> * format, so that tools can adapt (or abort)
>>>> */
>>>> -#define SCHEDSTAT_VERSION 15
>>>> +#define SCHEDSTAT_VERSION 16
>>>
>>> Please add the info about version, and change of the order
>>> briefly in Documentation/scheduler/sched-stats.rst as well.
>>
>> Good point, I've added this paragraph to sched-stats.rst:
>>
>> +Version 16 of schedstats changed the order of definitions within
>> +'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
>> +columns in show_schedstat(). In particular the position of CPU_IDLE
>> +and __CPU_NOT_IDLE changed places. The size of the array is unchanged.
>
> Thanks for this change!
>
> Since we are considering bumping up the version for this change, it
> might be good to get rid of "lb_imbalance" which is currently a sum of
> all the different kinds of imbalances (due to load, utilization,
> number of tasks, misfit tasks). This is not useful.
>
> Swapnil had a patch to print the different imbalances, which will
> change the number of fields. Can we include this change into v16 as
> well?
>
> Swapnil, could you please rebase your patch on tip/sched/core and post
> it ?
Please find the patch below. This is based on tip/sched/core.

---
From deeed5bf937bddf227deb1cdb9e2e6c164c48053 Mon Sep 17 00:00:00 2001
From: Swapnil Sapkal <[email protected]>
Date: Thu, 15 Jun 2023 04:55:09 +0000
Subject: [PATCH] sched: Report the different kinds of imbalances in /proc/schedstat

In /proc/schedstat, lb_imbalance reports the sum of imbalances
discovered in sched domains with each call to sched_balance_rq(), which is
not very useful because lb_imbalance is an aggregate of the imbalances
due to load, utilization, nr_tasks and misfit_tasks. Remove this field
from /proc/schedstat.

Currently there are no fields in /proc/schedstat to report different types
of imbalances. Introduce new fields in /proc/schedstat to report the
total imbalances in load, utilization, nr_tasks or misfit_tasks.

Added fields to /proc/schedstat:
- lb_imbalance_load: Total imbalance due to load.
- lb_imbalance_util: Total imbalance due to utilization.
- lb_imbalance_task: Total imbalance due to number of tasks.
- lb_imbalance_misfit: Total imbalance due to misfit tasks.

Signed-off-by: Swapnil Sapkal <[email protected]>
---
Documentation/scheduler/sched-stats.rst | 104 +++++++++++++-----------
include/linux/sched/topology.h | 5 +-
kernel/sched/fair.c | 15 +++-
kernel/sched/stats.c | 7 +-
4 files changed, 80 insertions(+), 51 deletions(-)

diff --git a/Documentation/scheduler/sched-stats.rst b/Documentation/scheduler/sched-stats.rst
index 7c2b16c4729d..d6e9a8a5619c 100644
--- a/Documentation/scheduler/sched-stats.rst
+++ b/Documentation/scheduler/sched-stats.rst
@@ -6,6 +6,9 @@ Version 16 of schedstats changed the order of definitions within
'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
columns in show_schedstat(). In particular the position of CPU_IDLE
and __CPU_NOT_IDLE changed places. The size of the array is unchanged.
+Also stop reporting 'lb_imbalance' as it has no significance anymore
+and instead add more relevant fields namely 'lb_imbalance_load',
+'lb_imbalance_util', 'lb_imbalance_task' and 'lb_imbalance_misfit'.

Version 15 of schedstats dropped counters for some sched_yield:
yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
@@ -73,86 +76,93 @@ One of these is produced per domain for each cpu described. (Note that if
CONFIG_SMP is not defined, *no* domains are utilized and these lines
will not appear in the output.)

-domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
+domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

The first field is a bit mask indicating what cpus this domain operates over.

-The next 24 are a variety of sched_balance_rq() statistics in grouped into types
+The next 33 are a variety of sched_balance_rq() statistics in grouped into types
of idleness (idle, busy, and newly idle):

1) # of times in this domain sched_balance_rq() was called when the
+ cpu was busy
+ 2) # of times in this domain sched_balance_rq() checked but found the
+ load did not require balancing when busy
+ 3) # of times in this domain sched_balance_rq() tried to move one or
+ more tasks and failed, when the cpu was busy
+ 4) Total imbalance in load when the cpu was busy
+ 5) Total imbalance in utilization when the cpu was busy
+ 6) Total imbalance in number of tasks when the cpu was busy
+ 7) Total imbalance due to misfit tasks when the cpu was busy
+ 8) # of times in this domain pull_task() was called when busy
+ 9) # of times in this domain pull_task() was called even though the
+ target task was cache-hot when busy
+ 10) # of times in this domain sched_balance_rq() was called but did not
+ find a busier queue while the cpu was busy
+ 11) # of times in this domain a busier queue was found while the cpu
+ was busy but no busier group was found
+
+ 12) # of times in this domain sched_balance_rq() was called when the
cpu was idle
- 2) # of times in this domain sched_balance_rq() checked but found
+ 13) # of times in this domain sched_balance_rq() checked but found
the load did not require balancing when the cpu was idle
- 3) # of times in this domain sched_balance_rq() tried to move one or
+ 14) # of times in this domain sched_balance_rq() tried to move one or
more tasks and failed, when the cpu was idle
- 4) sum of imbalances discovered (if any) with each call to
- sched_balance_rq() in this domain when the cpu was idle
- 5) # of times in this domain pull_task() was called when the cpu
+ 15) Total imbalance in load when the cpu was idle
+ 16) Total imbalance in utilization when the cpu was idle
+ 17) Total imbalance in number of tasks when the cpu was idle
+ 18) Total imbalance due to misfit tasks when the cpu was idle
+ 19) # of times in this domain pull_task() was called when the cpu
was idle
- 6) # of times in this domain pull_task() was called even though
+ 20) # of times in this domain pull_task() was called even though
the target task was cache-hot when idle
- 7) # of times in this domain sched_balance_rq() was called but did
+ 21) # of times in this domain sched_balance_rq() was called but did
not find a busier queue while the cpu was idle
- 8) # of times in this domain a busier queue was found while the
+ 22) # of times in this domain a busier queue was found while the
cpu was idle but no busier group was found
- 9) # of times in this domain sched_balance_rq() was called when the
- cpu was busy
- 10) # of times in this domain sched_balance_rq() checked but found the
- load did not require balancing when busy
- 11) # of times in this domain sched_balance_rq() tried to move one or
- more tasks and failed, when the cpu was busy
- 12) sum of imbalances discovered (if any) with each call to
- sched_balance_rq() in this domain when the cpu was busy
- 13) # of times in this domain pull_task() was called when busy
- 14) # of times in this domain pull_task() was called even though the
- target task was cache-hot when busy
- 15) # of times in this domain sched_balance_rq() was called but did not
- find a busier queue while the cpu was busy
- 16) # of times in this domain a busier queue was found while the cpu
- was busy but no busier group was found

- 17) # of times in this domain sched_balance_rq() was called when the
- cpu was just becoming idle
- 18) # of times in this domain sched_balance_rq() checked but found the
+ 23) # of times in this domain sched_balance_rq() was called when the
+ was just becoming idle
+ 24) # of times in this domain sched_balance_rq() checked but found the
load did not require balancing when the cpu was just becoming idle
- 19) # of times in this domain sched_balance_rq() tried to move one or more
+ 25) # of times in this domain sched_balance_rq() tried to move one or more
tasks and failed, when the cpu was just becoming idle
- 20) sum of imbalances discovered (if any) with each call to
- sched_balance_rq() in this domain when the cpu was just becoming idle
- 21) # of times in this domain pull_task() was called when newly idle
- 22) # of times in this domain pull_task() was called even though the
+ 26) Total imbalance in load when the cpu was just becoming idle
+ 27) Total imbalance in utilization when the cpu was just becoming idle
+ 28) Total imbalance in number of tasks when the cpu was just becoming idle
+ 29) Total imbalance due to misfit tasks when the cpu was just becoming idle
+ 30) # of times in this domain pull_task() was called when newly idle
+ 31) # of times in this domain pull_task() was called even though the
target task was cache-hot when just becoming idle
- 23) # of times in this domain sched_balance_rq() was called but did not
+ 32) # of times in this domain sched_balance_rq() was called but did not
find a busier queue while the cpu was just becoming idle
- 24) # of times in this domain a busier queue was found while the cpu
+ 33) # of times in this domain a busier queue was found while the cpu
was just becoming idle but no busier group was found

Next three are active_load_balance() statistics:

- 25) # of times active_load_balance() was called
- 26) # of times active_load_balance() tried to move a task and failed
- 27) # of times active_load_balance() successfully moved a task
+ 34) # of times active_load_balance() was called
+ 35) # of times active_load_balance() tried to move a task and failed
+ 36) # of times active_load_balance() successfully moved a task

Next three are sched_balance_exec() statistics:

- 28) sbe_cnt is not used
- 29) sbe_balanced is not used
- 30) sbe_pushed is not used
+ 37) sbe_cnt is not used
+ 38) sbe_balanced is not used
+ 39) sbe_pushed is not used

Next three are sched_balance_fork() statistics:

- 31) sbf_cnt is not used
- 32) sbf_balanced is not used
- 33) sbf_pushed is not used
+ 40) sbf_cnt is not used
+ 41) sbf_balanced is not used
+ 42) sbf_pushed is not used

Next three are try_to_wake_up() statistics:

- 34) # of times in this domain try_to_wake_up() awoke a task that
+ 43) # of times in this domain try_to_wake_up() awoke a task that
last ran on a different cpu in this domain
- 35) # of times in this domain try_to_wake_up() moved a task to the
+ 44) # of times in this domain try_to_wake_up() moved a task to the
waking cpu because it was cache-cold on its own cpu anyway
- 36) # of times in this domain try_to_wake_up() started passive balancing
+ 45) # of times in this domain try_to_wake_up() started passive balancing

/proc/<pid>/schedstat
---------------------
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c8fe9bab981b..15685c40a713 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -114,7 +114,10 @@ struct sched_domain {
unsigned int lb_count[CPU_MAX_IDLE_TYPES];
unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
- unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_imbalance_load[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_imbalance_util[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_imbalance_task[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_imbalance_misfit[CPU_MAX_IDLE_TYPES];
unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a19ea290b790..515258f97ba3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11288,7 +11288,20 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,

WARN_ON_ONCE(busiest == env.dst_rq);

- schedstat_add(sd->lb_imbalance[idle], env.imbalance);
+ switch (env.migration_type) {
+ case migrate_load:
+ schedstat_add(sd->lb_imbalance_load[idle], env.imbalance);
+ break;
+ case migrate_util:
+ schedstat_add(sd->lb_imbalance_util[idle], env.imbalance);
+ break;
+ case migrate_task:
+ schedstat_add(sd->lb_imbalance_task[idle], env.imbalance);
+ break;
+ case migrate_misfit:
+ schedstat_add(sd->lb_imbalance_misfit[idle], env.imbalance);
+ break;
+ }

env.src_cpu = busiest->cpu;
env.src_rq = busiest;
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 78e48f5426ee..a02bc9db2f1c 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -151,11 +151,14 @@ static int show_schedstat(struct seq_file *seq, void *v)
seq_printf(seq, "domain%d %*pb", dcount++,
cpumask_pr_args(sched_domain_span(sd)));
for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
- seq_printf(seq, " %u %u %u %u %u %u %u %u",
+ seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
sd->lb_count[itype],
sd->lb_balanced[itype],
sd->lb_failed[itype],
- sd->lb_imbalance[itype],
+ sd->lb_imbalance_load[itype],
+ sd->lb_imbalance_util[itype],
+ sd->lb_imbalance_task[itype],
+ sd->lb_imbalance_misfit[itype],
sd->lb_gained[itype],
sd->lb_hot_gained[itype],
sd->lb_nobusyq[itype],
--
2.34.1


>
>>
>>> One recent user that I recollect is sched scoreboard. Version number should
>>> be able to help the users when it breaks their scripts.
>>>
>
>>> +Gautham, Any thoughts?
>
> Yes, the scoreboard looks at the version. If it doesn't understand a
> version, it will not parse it. It can be extended to understand the
> new version, since that's only about adding a new json.
>
>>
>> If it's really a problem, I suspect we could maintain the v15 order of
>> output.
>
> I don't think moving to v16 is an issue.
>
>>
>> Thanks,
>>
>> Ingo
>
> --
> Thanks and Regards
> gautham.

--
Thanks and Regards,
Swapnil

2024-03-15 04:38:05

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH 04/10] sched/debug: Increase SCHEDSTAT_VERSION to 16



On 3/14/24 11:34 AM, Swapnil Sapkal wrote:
>
>
> Please find the patch below. This is based on tip/sched/core.
>
> ---
> From deeed5bf937bddf227deb1cdb9e2e6c164c48053 Mon Sep 17 00:00:00 2001
> From: Swapnil Sapkal <[email protected]>
> Date: Thu, 15 Jun 2023 04:55:09 +0000
> Subject: [PATCH] sched: Report the different kinds of imbalances in
> /proc/schedstat
>
> In /proc/schedstat, lb_imbalance reports the sum of imbalances
> discovered in sched domains with each call to sched_balance_rq(), which is
> not very useful because lb_imbalance is an aggregate of the imbalances
> due to load, utilization, nr_tasks and misfit_tasks. Remove this field
> from /proc/schedstat.
>

Yes. This makes sense. Any one of them can skew the value.

> Currently there are no fields in /proc/schedstat to report different types
> of imbalances. Introduce new fields in /proc/schedstat to report the
> total imbalances in load, utilization, nr_tasks or misfit_tasks.
>
> Added fields to /proc/schedstat:
>      - lb_imbalance_load: Total imbalance due to load.
>     - lb_imbalance_util: Total imbalance due to utilization.
>     - lb_imbalance_task: Total imbalance due to number of tasks.
>     - lb_imbalance_misfit: Total imbalance due to misfit tasks.
>
> Signed-off-by: Swapnil Sapkal <[email protected]>
> ---
>  Documentation/scheduler/sched-stats.rst | 104 +++++++++++++-----------
>  include/linux/sched/topology.h          |   5 +-
>  kernel/sched/fair.c                     |  15 +++-
>  kernel/sched/stats.c                    |   7 +-
>  4 files changed, 80 insertions(+), 51 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-stats.rst
> b/Documentation/scheduler/sched-stats.rst
> index 7c2b16c4729d..d6e9a8a5619c 100644
> --- a/Documentation/scheduler/sched-stats.rst
> +++ b/Documentation/scheduler/sched-stats.rst
> @@ -6,6 +6,9 @@ Version 16 of schedstats changed the order of
> definitions within
>  'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
>  columns in show_schedstat(). In particular the position of CPU_IDLE
>  and __CPU_NOT_IDLE changed places. The size of the array is unchanged.
> +Also stop reporting 'lb_imbalance' as it has no significance anymore
> +and instead add more relevant fields namely 'lb_imbalance_load',
> +'lb_imbalance_util', 'lb_imbalance_task' and 'lb_imbalance_misfit'.
>  
>  Version 15 of schedstats dropped counters for some sched_yield:
>  yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
> @@ -73,86 +76,93 @@ One of these is produced per domain for each cpu
> described. (Note that if
>  CONFIG_SMP is not defined, *no* domains are utilized and these lines
>  will not appear in the output.)
>  
> -domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
> +domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
>  
>  The first field is a bit mask indicating what cpus this domain operates
> over.
>  

IIUC, this is editing the content of Version 15, but the changes would happen in Version 16.
Instead can we add the section for Version 16 and not modify for 15?

> -The next 24 are a variety of sched_balance_rq() statistics in grouped
> into types
> +The next 33 are a variety of sched_balance_rq() statistics in grouped
> into types
>  of idleness (idle, busy, and newly idle):
>  
>      1)  # of times in this domain sched_balance_rq() was called when the
> +        cpu was busy
> +    2)  # of times in this domain sched_balance_rq() checked but found the
> +        load did not require balancing when busy
> +    3)  # of times in this domain sched_balance_rq() tried to move one or
> +        more tasks and failed, when the cpu was busy
> +    4)  Total imbalance in load when the cpu was busy
> +    5)  Total imbalance in utilization when the cpu was busy
> +    6)  Total imbalance in number of tasks when the cpu was busy
> +    7)  Total imbalance due to misfit tasks when the cpu was busy
> +    8)  # of times in this domain pull_task() was called when busy
> +    9)  # of times in this domain pull_task() was called even though the
> +        target task was cache-hot when busy
> +    10) # of times in this domain sched_balance_rq() was called but did
> not
> +        find a busier queue while the cpu was busy
> +    11) # of times in this domain a busier queue was found while the cpu
> +        was busy but no busier group was found
> +
> +    12) # of times in this domain sched_balance_rq() was called when the
>          cpu was idle
> -    2)  # of times in this domain sched_balance_rq() checked but found
> +    13) # of times in this domain sched_balance_rq() checked but found
>          the load did not require balancing when the cpu was idle
> -    3)  # of times in this domain sched_balance_rq() tried to move one or
> +    14) # of times in this domain sched_balance_rq() tried to move one or
>          more tasks and failed, when the cpu was idle
> -    4)  sum of imbalances discovered (if any) with each call to
> -        sched_balance_rq() in this domain when the cpu was idle
> -    5)  # of times in this domain pull_task() was called when the cpu
> +    15) Total imbalance in load when the cpu was idle
> +    16) Total imbalance in utilization when the cpu was idle
> +    17) Total imbalance in number of tasks when the cpu was idle
> +    18) Total imbalance due to misfit tasks when the cpu was idle
> +    19) # of times in this domain pull_task() was called when the cpu
>          was idle
> -    6)  # of times in this domain pull_task() was called even though
> +    20) # of times in this domain pull_task() was called even though
>          the target task was cache-hot when idle
> -    7)  # of times in this domain sched_balance_rq() was called but did
> +    21) # of times in this domain sched_balance_rq() was called but did
>          not find a busier queue while the cpu was idle
> -    8)  # of times in this domain a busier queue was found while the
> +    22) # of times in this domain a busier queue was found while the
>          cpu was idle but no busier group was found
> -    9)  # of times in this domain sched_balance_rq() was called when the
> -        cpu was busy
> -    10) # of times in this domain sched_balance_rq() checked but found the
> -        load did not require balancing when busy
> -    11) # of times in this domain sched_balance_rq() tried to move one or
> -        more tasks and failed, when the cpu was busy
> -    12) sum of imbalances discovered (if any) with each call to
> -        sched_balance_rq() in this domain when the cpu was busy
> -    13) # of times in this domain pull_task() was called when busy
> -    14) # of times in this domain pull_task() was called even though the
> -        target task was cache-hot when busy
> -    15) # of times in this domain sched_balance_rq() was called but did
> not
> -        find a busier queue while the cpu was busy
> -    16) # of times in this domain a busier queue was found while the cpu
> -        was busy but no busier group was found
>  
> -    17) # of times in this domain sched_balance_rq() was called when the
> -        cpu was just becoming idle
> -    18) # of times in this domain sched_balance_rq() checked but found the
> +    23) # of times in this domain sched_balance_rq() was called when the
> +        was just becoming idle
> +    24) # of times in this domain sched_balance_rq() checked but found the
>          load did not require balancing when the cpu was just becoming idle
> -    19) # of times in this domain sched_balance_rq() tried to move one
> or more
> +    25) # of times in this domain sched_balance_rq() tried to move one
> or more
>          tasks and failed, when the cpu was just becoming idle
> -    20) sum of imbalances discovered (if any) with each call to
> -        sched_balance_rq() in this domain when the cpu was just
> becoming idle
> -    21) # of times in this domain pull_task() was called when newly idle
> -    22) # of times in this domain pull_task() was called even though the
> +    26) Total imbalance in load when the cpu was just becoming idle
> +    27) Total imbalance in utilization when the cpu was just becoming idle
> +    28) Total imbalance in number of tasks when the cpu was just
> becoming idle
> +    29) Total imbalance due to misfit tasks when the cpu was just
> becoming idle
> +    30) # of times in this domain pull_task() was called when newly idle
> +    31) # of times in this domain pull_task() was called even though the
>          target task was cache-hot when just becoming idle
> -    23) # of times in this domain sched_balance_rq() was called but did
> not
> +    32) # of times in this domain sched_balance_rq() was called but did
> not
>          find a busier queue while the cpu was just becoming idle
> -    24) # of times in this domain a busier queue was found while the cpu
> +    33) # of times in this domain a busier queue was found while the cpu
>          was just becoming idle but no busier group was found
>  
>     Next three are active_load_balance() statistics:
>  
> -    25) # of times active_load_balance() was called
> -    26) # of times active_load_balance() tried to move a task and failed
> -    27) # of times active_load_balance() successfully moved a task
> +    34) # of times active_load_balance() was called
> +    35) # of times active_load_balance() tried to move a task and failed
> +    36) # of times active_load_balance() successfully moved a task
>  
>     Next three are sched_balance_exec() statistics:
>  
> -    28) sbe_cnt is not used
> -    29) sbe_balanced is not used
> -    30) sbe_pushed is not used
> +    37) sbe_cnt is not used
> +    38) sbe_balanced is not used
> +    39) sbe_pushed is not used
>  
>     Next three are sched_balance_fork() statistics:
>  
> -    31) sbf_cnt is not used
> -    32) sbf_balanced is not used
> -    33) sbf_pushed is not used
> +    40) sbf_cnt is not used
> +    41) sbf_balanced is not used
> +    42) sbf_pushed is not used
>  
>     Next three are try_to_wake_up() statistics:
>  
> -    34) # of times in this domain try_to_wake_up() awoke a task that
> +    43) # of times in this domain try_to_wake_up() awoke a task that
>          last ran on a different cpu in this domain
> -    35) # of times in this domain try_to_wake_up() moved a task to the
> +    44) # of times in this domain try_to_wake_up() moved a task to the
>          waking cpu because it was cache-cold on its own cpu anyway
> -    36) # of times in this domain try_to_wake_up() started passive
> balancing
> +    45) # of times in this domain try_to_wake_up() started passive
> balancing
>  
>  /proc/<pid>/schedstat
>  ---------------------
> diff --git a/include/linux/sched/topology.h
> b/include/linux/sched/topology.h
> index c8fe9bab981b..15685c40a713 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -114,7 +114,10 @@ struct sched_domain {
>      unsigned int lb_count[CPU_MAX_IDLE_TYPES];
>      unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
>      unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
> -    unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
> +    unsigned int lb_imbalance_load[CPU_MAX_IDLE_TYPES];
> +    unsigned int lb_imbalance_util[CPU_MAX_IDLE_TYPES];
> +    unsigned int lb_imbalance_task[CPU_MAX_IDLE_TYPES];
> +    unsigned int lb_imbalance_misfit[CPU_MAX_IDLE_TYPES];
>      unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
>      unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
>      unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a19ea290b790..515258f97ba3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11288,7 +11288,20 @@ static int sched_balance_rq(int this_cpu,
> struct rq *this_rq,
>  
>      WARN_ON_ONCE(busiest == env.dst_rq);
>  
> -    schedstat_add(sd->lb_imbalance[idle], env.imbalance);
> +    switch (env.migration_type) {
> +    case migrate_load:
> +        schedstat_add(sd->lb_imbalance_load[idle], env.imbalance);
> +        break;
> +    case migrate_util:
> +        schedstat_add(sd->lb_imbalance_util[idle], env.imbalance);
> +        break;
> +    case migrate_task:
> +        schedstat_add(sd->lb_imbalance_task[idle], env.imbalance);
> +        break;
> +    case migrate_misfit:
> +        schedstat_add(sd->lb_imbalance_misfit[idle], env.imbalance);
> +        break;
> +    }
>  

This switch statement could use a helper function.
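
One possible shape for such a helper, sketched from the switch above; the helper
name and exact signature are assumptions, not taken from the posted patch:

static void update_lb_imbalance_stat(struct lb_env *env,
				     struct sched_domain *sd,
				     enum cpu_idle_type idle)
{
	switch (env->migration_type) {
	case migrate_load:
		schedstat_add(sd->lb_imbalance_load[idle], env->imbalance);
		break;
	case migrate_util:
		schedstat_add(sd->lb_imbalance_util[idle], env->imbalance);
		break;
	case migrate_task:
		schedstat_add(sd->lb_imbalance_task[idle], env->imbalance);
		break;
	case migrate_misfit:
		schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
		break;
	}
}

The call site in sched_balance_rq() would then reduce to a single
update_lb_imbalance_stat(&env, sd, idle) call.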

>      env.src_cpu = busiest->cpu;
>      env.src_rq = busiest;
> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
> index 78e48f5426ee..a02bc9db2f1c 100644
> --- a/kernel/sched/stats.c
> +++ b/kernel/sched/stats.c
> @@ -151,11 +151,14 @@ static int show_schedstat(struct seq_file *seq,
> void *v)
>              seq_printf(seq, "domain%d %*pb", dcount++,
>                     cpumask_pr_args(sched_domain_span(sd)));
>              for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
> -                seq_printf(seq, " %u %u %u %u %u %u %u %u",
> +                seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
>                      sd->lb_count[itype],
>                      sd->lb_balanced[itype],
>                      sd->lb_failed[itype],
> -                    sd->lb_imbalance[itype],
> +                    sd->lb_imbalance_load[itype],
> +                    sd->lb_imbalance_util[itype],
> +                    sd->lb_imbalance_task[itype],
> +                    sd->lb_imbalance_misfit[itype],
>                      sd->lb_gained[itype],
>                      sd->lb_hot_gained[itype],
>                      sd->lb_nobusyq[itype],

2024-03-15 09:07:07

by Swapnil Sapkal

[permalink] [raw]
Subject: Re: [PATCH 04/10] sched/debug: Increase SCHEDSTAT_VERSION to 16

Hello Shrikanth,

Thank you for reviewing.

On 3/15/2024 10:07 AM, Shrikanth Hegde wrote:
>
> On 3/14/24 11:34 AM, Swapnil Sapkal wrote:
>>
>> Please find the patch below. This is based on tip/sched/core.
>>
>> ---
>> From deeed5bf937bddf227deb1cdb9e2e6c164c48053 Mon Sep 17 00:00:00 2001
>> From: Swapnil Sapkal <[email protected]>
>> Date: Thu, 15 Jun 2023 04:55:09 +0000
>> Subject: [PATCH] sched: Report the different kinds of imbalances in
>> /proc/schedstat
>>
>> In /proc/schedstat, lb_imbalance reports the sum of imbalances
>> discovered in sched domains with each call to sched_balance_rq(), which is
>> not very useful because lb_imbalance is an aggregate of the imbalances
>> due to load, utilization, nr_tasks and misfit_tasks. Remove this field
>> from /proc/schedstat.
>>
> Yes. This makes sense. any one of them can skew the value.

Yes.

>> Currently there are no fields in /proc/schedstat to report different types
>> of imbalances. Introduce new fields in /proc/schedstat to report the
>> total imbalances in load, utilization, nr_tasks or misfit_tasks.
>>
>> Added fields to /proc/schedstat:
>>      - lb_imbalance_load: Total imbalance due to load.
>>     - lb_imbalance_util: Total imbalance due to utilization.
>>     - lb_imbalance_task: Total imbalance due to number of tasks.
>>     - lb_imbalance_misfit: Total imbalance due to misfit tasks.
>>
>> Signed-off-by: Swapnil Sapkal <[email protected]>
>> ---
>>  Documentation/scheduler/sched-stats.rst | 104 +++++++++++++-----------
>>  include/linux/sched/topology.h          |   5 +-
>>  kernel/sched/fair.c                     |  15 +++-
>>  kernel/sched/stats.c                    |   7 +-
>>  4 files changed, 80 insertions(+), 51 deletions(-)
>>
>> diff --git a/Documentation/scheduler/sched-stats.rst
>> b/Documentation/scheduler/sched-stats.rst
>> index 7c2b16c4729d..d6e9a8a5619c 100644
>> --- a/Documentation/scheduler/sched-stats.rst
>> +++ b/Documentation/scheduler/sched-stats.rst
>> @@ -6,6 +6,9 @@ Version 16 of schedstats changed the order of
>> definitions within
>>  'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
>>  columns in show_schedstat(). In particular the position of CPU_IDLE
>>  and __CPU_NOT_IDLE changed places. The size of the array is unchanged.
>> +Also stop reporting 'lb_imbalance' as it has no significance anymore
>> +and instead add more relevant fields namely 'lb_imbalance_load',
>> +'lb_imbalance_util', 'lb_imbalance_task' and 'lb_imbalance_misfit'.
>>
>>  Version 15 of schedstats dropped counters for some sched_yield:
>>  yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
>> @@ -73,86 +76,93 @@ One of these is produced per domain for each cpu
>> described. (Note that if
>>  CONFIG_SMP is not defined, *no* domains are utilized and these lines
>>  will not appear in the output.)
>>
>> -domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
>> +domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
>>
>>  The first field is a bit mask indicating what cpus this domain operates
>> over.
>>
> IIUC, this is editing the content of Version 15, But changes would happen on Version 16.
> Instead can we add the section for Version 16 and not modify for 15?

This file contains the field details of the current schedstat version and a short
summary of what changed across versions. Maintaining the details of every version
would increase the file size. Instead I will add links to the previous versions'
documentation.

Thoughts on this?

>> -The next 24 are a variety of sched_balance_rq() statistics in grouped
>> into types
>> +The next 33 are a variety of sched_balance_rq() statistics in grouped
>> into types
>>  of idleness (idle, busy, and newly idle):
>>
>>      1)  # of times in this domain sched_balance_rq() was called when the
>> +        cpu was busy
>> +    2)  # of times in this domain sched_balance_rq() checked but found the
>> +        load did not require balancing when busy
>> +    3)  # of times in this domain sched_balance_rq() tried to move one or
>> +        more tasks and failed, when the cpu was busy
>> +    4)  Total imbalance in load when the cpu was busy
>> +    5)  Total imbalance in utilization when the cpu was busy
>> +    6)  Total imbalance in number of tasks when the cpu was busy
>> +    7)  Total imbalance due to misfit tasks when the cpu was busy
>> +    8)  # of times in this domain pull_task() was called when busy
>> +    9)  # of times in this domain pull_task() was called even though the
>> +        target task was cache-hot when busy
>> +    10) # of times in this domain sched_balance_rq() was called but did
>> not
>> +        find a busier queue while the cpu was busy
>> +    11) # of times in this domain a busier queue was found while the cpu
>> +        was busy but no busier group was found
>> +
>> +    12) # of times in this domain sched_balance_rq() was called when the
>>          cpu was idle
>> -    2)  # of times in this domain sched_balance_rq() checked but found
>> +    13) # of times in this domain sched_balance_rq() checked but found
>>          the load did not require balancing when the cpu was idle
>> -    3)  # of times in this domain sched_balance_rq() tried to move one or
>> +    14) # of times in this domain sched_balance_rq() tried to move one or
>>          more tasks and failed, when the cpu was idle
>> -    4)  sum of imbalances discovered (if any) with each call to
>> -        sched_balance_rq() in this domain when the cpu was idle
>> -    5)  # of times in this domain pull_task() was called when the cpu
>> +    15) Total imbalance in load when the cpu was idle
>> +    16) Total imbalance in utilization when the cpu was idle
>> +    17) Total imbalance in number of tasks when the cpu was idle
>> +    18) Total imbalance due to misfit tasks when the cpu was idle
>> +    19) # of times in this domain pull_task() was called when the cpu
>>          was idle
>> -    6)  # of times in this domain pull_task() was called even though
>> +    20) # of times in this domain pull_task() was called even though
>>          the target task was cache-hot when idle
>> -    7)  # of times in this domain sched_balance_rq() was called but did
>> +    21) # of times in this domain sched_balance_rq() was called but did
>>          not find a busier queue while the cpu was idle
>> -    8)  # of times in this domain a busier queue was found while the
>> +    22) # of times in this domain a busier queue was found while the
>>          cpu was idle but no busier group was found
>> -    9)  # of times in this domain sched_balance_rq() was called when the
>> -        cpu was busy
>> -    10) # of times in this domain sched_balance_rq() checked but found the
>> -        load did not require balancing when busy
>> -    11) # of times in this domain sched_balance_rq() tried to move one or
>> -        more tasks and failed, when the cpu was busy
>> -    12) sum of imbalances discovered (if any) with each call to
>> -        sched_balance_rq() in this domain when the cpu was busy
>> -    13) # of times in this domain pull_task() was called when busy
>> -    14) # of times in this domain pull_task() was called even though the
>> -        target task was cache-hot when busy
>> -    15) # of times in this domain sched_balance_rq() was called but did
>> not
>> -        find a busier queue while the cpu was busy
>> -    16) # of times in this domain a busier queue was found while the cpu
>> -        was busy but no busier group was found
>>
>> -    17) # of times in this domain sched_balance_rq() was called when the
>> -        cpu was just becoming idle
>> -    18) # of times in this domain sched_balance_rq() checked but found the
>> +    23) # of times in this domain sched_balance_rq() was called when the
>> +        was just becoming idle
>> +    24) # of times in this domain sched_balance_rq() checked but found the
>>          load did not require balancing when the cpu was just becoming idle
>> -    19) # of times in this domain sched_balance_rq() tried to move one
>> or more
>> +    25) # of times in this domain sched_balance_rq() tried to move one
>> or more
>>          tasks and failed, when the cpu was just becoming idle
>> -    20) sum of imbalances discovered (if any) with each call to
>> -        sched_balance_rq() in this domain when the cpu was just
>> becoming idle
>> -    21) # of times in this domain pull_task() was called when newly idle
>> -    22) # of times in this domain pull_task() was called even though the
>> +    26) Total imbalance in load when the cpu was just becoming idle
>> +    27) Total imbalance in utilization when the cpu was just becoming idle
>> +    28) Total imbalance in number of tasks when the cpu was just
>> becoming idle
>> +    29) Total imbalance due to misfit tasks when the cpu was just
>> becoming idle
>> +    30) # of times in this domain pull_task() was called when newly idle
>> +    31) # of times in this domain pull_task() was called even though the
>>          target task was cache-hot when just becoming idle
>> -    23) # of times in this domain sched_balance_rq() was called but did
>> not
>> +    32) # of times in this domain sched_balance_rq() was called but did
>> not
>>          find a busier queue while the cpu was just becoming idle
>> -    24) # of times in this domain a busier queue was found while the cpu
>> +    33) # of times in this domain a busier queue was found while the cpu
>>          was just becoming idle but no busier group was found
>>
>>     Next three are active_load_balance() statistics:
>>
>> -    25) # of times active_load_balance() was called
>> -    26) # of times active_load_balance() tried to move a task and failed
>> -    27) # of times active_load_balance() successfully moved a task
>> +    34) # of times active_load_balance() was called
>> +    35) # of times active_load_balance() tried to move a task and failed
>> +    36) # of times active_load_balance() successfully moved a task
>>
>>     Next three are sched_balance_exec() statistics:
>>
>> -    28) sbe_cnt is not used
>> -    29) sbe_balanced is not used
>> -    30) sbe_pushed is not used
>> +    37) sbe_cnt is not used
>> +    38) sbe_balanced is not used
>> +    39) sbe_pushed is not used
>>
>>     Next three are sched_balance_fork() statistics:
>>
>> -    31) sbf_cnt is not used
>> -    32) sbf_balanced is not used
>> -    33) sbf_pushed is not used
>> +    40) sbf_cnt is not used
>> +    41) sbf_balanced is not used
>> +    42) sbf_pushed is not used
>>
>>     Next three are try_to_wake_up() statistics:
>>
>> -    34) # of times in this domain try_to_wake_up() awoke a task that
>> +    43) # of times in this domain try_to_wake_up() awoke a task that
>>          last ran on a different cpu in this domain
>> -    35) # of times in this domain try_to_wake_up() moved a task to the
>> +    44) # of times in this domain try_to_wake_up() moved a task to the
>>          waking cpu because it was cache-cold on its own cpu anyway
>> -    36) # of times in this domain try_to_wake_up() started passive
>> balancing
>> +    45) # of times in this domain try_to_wake_up() started passive
>> balancing
>>
>>  /proc/<pid>/schedstat
>>  ---------------------
>> diff --git a/include/linux/sched/topology.h
>> b/include/linux/sched/topology.h
>> index c8fe9bab981b..15685c40a713 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -114,7 +114,10 @@ struct sched_domain {
>>      unsigned int lb_count[CPU_MAX_IDLE_TYPES];
>>      unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
>>      unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
>> -    unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
>> +    unsigned int lb_imbalance_load[CPU_MAX_IDLE_TYPES];
>> +    unsigned int lb_imbalance_util[CPU_MAX_IDLE_TYPES];
>> +    unsigned int lb_imbalance_task[CPU_MAX_IDLE_TYPES];
>> +    unsigned int lb_imbalance_misfit[CPU_MAX_IDLE_TYPES];
>>      unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
>>      unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
>>      unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a19ea290b790..515258f97ba3 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11288,7 +11288,20 @@ static int sched_balance_rq(int this_cpu,
>> struct rq *this_rq,
>>
>>      WARN_ON_ONCE(busiest == env.dst_rq);
>>
>> -    schedstat_add(sd->lb_imbalance[idle], env.imbalance);
>> +    switch (env.migration_type) {
>> +    case migrate_load:
>> +        schedstat_add(sd->lb_imbalance_load[idle], env.imbalance);
>> +        break;
>> +    case migrate_util:
>> +        schedstat_add(sd->lb_imbalance_util[idle], env.imbalance);
>> +        break;
>> +    case migrate_task:
>> +        schedstat_add(sd->lb_imbalance_task[idle], env.imbalance);
>> +        break;
>> +    case migrate_misfit:
>> +        schedstat_add(sd->lb_imbalance_misfit[idle], env.imbalance);
>> +        break;
>> +    }
>>
> This switch statement could use a helper function.

Sure, will update this.

>
>
>>      env.src_cpu = busiest->cpu;
>>      env.src_rq = busiest;
>> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
>> index 78e48f5426ee..a02bc9db2f1c 100644
>> --- a/kernel/sched/stats.c
>> +++ b/kernel/sched/stats.c
>> @@ -151,11 +151,14 @@ static int show_schedstat(struct seq_file *seq,
>> void *v)
>>              seq_printf(seq, "domain%d %*pb", dcount++,
>>                     cpumask_pr_args(sched_domain_span(sd)));
>>              for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
>> -                seq_printf(seq, " %u %u %u %u %u %u %u %u",
>> +                seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
>>                      sd->lb_count[itype],
>>                      sd->lb_balanced[itype],
>>                      sd->lb_failed[itype],
>> -                    sd->lb_imbalance[itype],
>> +                    sd->lb_imbalance_load[itype],
>> +                    sd->lb_imbalance_util[itype],
>> +                    sd->lb_imbalance_task[itype],
>> +                    sd->lb_imbalance_misfit[itype],
>>                      sd->lb_gained[itype],
>>                      sd->lb_hot_gained[itype],
>>                      sd->lb_nobusyq[itype],

--
Thanks and Regards,
Swapnil


2024-03-15 09:21:16

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH 04/10] sched/debug: Increase SCHEDSTAT_VERSION to 16



On 3/15/24 2:36 PM, Swapnil Sapkal wrote:
> Hello Shrikanth,
>
> Thank you for reviewing.
>
> On 3/15/2024 10:07 AM, Shrikanth Hegde wrote:
>>

>>> 19 20
>>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
>>> +domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
>>> 44 45
>>>     The first field is a bit mask indicating what cpus this domain
>>> operates
>>> over.
>>>  
>> IIUC, this is editing the content of Version 15, But changes would
>> happen on Version 16.
>> Instead can we add the section for Version 16 and not modify for 15?
>
> This file contains the field details of current schedstat version and
> the short
> summary of what change across the versions. Maintaining each versions
> details will
> increase the file size. Instead I will add the links to previous version's
> documentation.
>  

I hadn't seen that. What you are saying is right. It would bloat up.

> Thoughts on this?
>
>>> -The next 24 are a variety of sched_balance_rq() statistics in grouped
>>> into types
>>> +The next 33 are a variety of sched_balance_rq() statistics in grouped
>>> into types
>>>   of idleness (idle, busy, and newly idle):

Please change this to
of idleness (busy, idle, and newly idle):

>>>         1)  # of times in this domain sched_balance_rq() was called
>>> when the
>>> +        cpu was busy
>>> +    2)  # of times in this domain sched_balance_rq() checked but
>>> found the
>>> +        load did not require balancing when busy

2024-03-15 14:04:05

by Swapnil Sapkal

[permalink] [raw]
Subject: Re: [PATCH 04/10] sched/debug: Increase SCHEDSTAT_VERSION to 16


On 3/15/2024 2:50 PM, Shrikanth Hegde wrote:
> On 3/15/24 2:36 PM, Swapnil Sapkal wrote:
>> Hello Shrikanth,
>>
>> Thank you for reviewing.
>>
>> On 3/15/2024 10:07 AM, Shrikanth Hegde wrote:
>>>> 19 20
>>>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
>>>> +domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>>>> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
>>>> 44 45
>>>>     The first field is a bit mask indicating what cpus this domain
>>>> operates
>>>> over.
>>>>
>>> IIUC, this is editing the content of Version 15, But changes would
>>> happen on Version 16.
>>> Instead can we add the section for Version 16 and not modify for 15?
>> This file contains the field details of current schedstat version and
>> the short
>> summary of what change across the versions. Maintaining each versions
>> details will
>> increase the file size. Instead I will add the links to previous version's
>> documentation.
>>
> I hadn't seen that. what you are saying is right. it would bloat up.
>
>> Thoughts on this?
>>
>>>> -The next 24 are a variety of sched_balance_rq() statistics in grouped
>>>> into types
>>>> +The next 33 are a variety of sched_balance_rq() statistics in grouped
>>>> into types
>>>>   of idleness (idle, busy, and newly idle):
> Please change this to
> of idleness (busy, idle, and newly idle):

Thanks for catching this. I have updated this in v2. v2 is available at:
https://lore.kernel.org/all/[email protected]/

>>>>         1)  # of times in this domain sched_balance_rq() was called
>>>> when the
>>>> +        cpu was busy
>>>> +    2)  # of times in this domain sched_balance_rq() checked but
>>>> found the
>>>> +        load did not require balancing when busy

--
Thanks and Regards,
Swapnil