2024-04-03 13:31:44

by Vitalii Bursov

Subject: [PATCH v3 0/3] sched/fair: allow disabling sched_balance_newidle with sched_relax_domain_level

Changes in v3:
- Remove levels table change from the documentation patch
- Link to v2: https://lore.kernel.org/lkml/[email protected]/
Changes in v2:
- Split debug.c change in a separate commit and move new "level"
after "groups_flags"
- Added "Fixes" tag and updated commit message
- Update domain levels cgroup-v1/cpusets.rst documentation
- Link to v1: https://lore.kernel.org/all/[email protected]/

During the upgrade from Linux 5.4 we found a small (around 3%)
performance regression, which we tracked down to commit
c5b0a7eefc70150caf23e37bc9d639c68c87a097:

sched/fair: Remove sysctl_sched_migration_cost condition

With a default value of 500us, sysctl_sched_migration_cost is
significantly higher than the cost of load_balance. Remove the
condition and rely on the sd->max_newidle_lb_cost to abort
newidle_balance.

It looks like "newidle" balancing is beneficial for a lot of workloads,
just not for this specific one. The workload is video encoding; there
are hundreds to thousands of threads, some of which are synchronized with
mutexes and condition variables. The process aims to keep a portion of the
CPU idle, so no CPU core is 100% busy. Perhaps the performance impact we
see comes from the additional processing in the scheduler and additional
costs, such as more cache misses, rather than from incorrect balancing.
See the perf output below.

My understanding is that the "sched_relax_domain_level" cgroup parameter
should control whether sched_balance_newidle() is called and what the scope
of the balancing is, but it does not fully work for this case.
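
For context, the newidle path walks the idle CPU's sched domains and only
balances within domains that still have SD_BALANCE_NEWIDLE set, so clearing
that flag per level is what limits (or disables) the search. A rough sketch
of the loop in newidle_balance() (now sched_balance_newidle()) in
kernel/sched/fair.c, paraphrased rather than quoted verbatim:

        for_each_domain(this_cpu, sd) {
                /* Abort once the expected cost exceeds the remaining idle time. */
                if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
                        break;

                /* Only domains that still have SD_BALANCE_NEWIDLE are searched. */
                if (sd->flags & SD_BALANCE_NEWIDLE)
                        pulled_task = load_balance(this_cpu, this_rq, sd,
                                                   CPU_NEWLY_IDLE,
                                                   &continue_balancing);
                ...
        }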

cpusets.rst documentation:
> The 'cpuset.sched_relax_domain_level' file allows you to request changing
> this searching range as you like. This file takes int value which
> indicates size of searching range in levels ideally as follows,
> otherwise initial value -1 that indicates the cpuset has no request.
>
> ====== ===========================================================
> -1 no request. use system default or follow request of others.
> 0 no search.
> 1 search siblings (hyperthreads in a core).
> 2 search cores in a package.
> 3 search cpus in a node [= system wide on non-NUMA system]
> 4 search nodes in a chunk of node [on NUMA system]
> 5 search system wide [on NUMA system]
> ====== ===========================================================

Setting cpuset.sched_relax_domain_level to 0 behaves the same as setting it to 1.

On a dual-CPU server, domains and levels are as follows:
domain 0: level 0, SMT
domain 1: level 2, MC
domain 2: level 5, NUMA
(Domain indices do not match levels because levels that are degenerate on
this machine, such as CLS, are collapsed, while each remaining domain keeps
its original level number.)

So, to support "0 no search", the value in
cpuset.sched_relax_domain_level should disable SD_BALANCE_NEWIDLE at the
specified level and above while keeping it enabled at the lower levels. For
example, the SMT level is 0, so sched_relax_domain_level=0 should exclude
all levels >= 0.

Instead, cpuset.sched_relax_domain_level enables the specified level,
which effectively removes the "no search" option. See below for the domain
flags at every cpuset.sched_relax_domain_level value.

The proposed patch allows clearing the SD_BALANCE_NEWIDLE flag on all
levels when cpuset.sched_relax_domain_level is set to 0, and extends the
maximum value validation range beyond sched_domain_level_max. The latter
allows setting SD_BALANCE_NEWIDLE on all levels and overriding the platform
default if it does not include all levels.
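
For reference, the check that implements the cpuset request is in
set_domain_attribute() in kernel/sched/topology.c (see patch 1/3 below for
the exact diff). A minimal sketch of how the comparison plays out on the
dual-CPU server above, where "request" is the effective relax_domain_level:

        /*
         * Before the patch: if (sd->level > request)
         *   request=0 clears SD_BALANCE_NEWIDLE on MC (level 2) and NUMA
         *   (level 5) but keeps it on SMT (level 0), i.e. 0 acts as 1.
         *
         * After the patch: if (sd->level >= request)
         *   request=0 clears the flag on all levels ("no search") and
         *   request=level_max+1 keeps it on all levels.
         */
        if (sd->level >= request) {
                /* Turn off idle balance on this domain: */
                sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
        }

The validation change in update_relax_domain_level() is what makes
level_max+1 acceptable: sched_domain_level_max is 5 on this server, so
without the patch "echo 5" is rejected, while with the patch values up to 6
are accepted and 7 is rejected, as shown in the sessions below.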

Thanks

=========================
Perf output for a similar workload/test case shows that newidle_balance
(now renamed to sched_balance_newidle) is called when handling futex and
nanosleep syscalls:
8.74% 0.40% a.out [kernel.vmlinux] [k] entry_SYSCALL_64
8.34% entry_SYSCALL_64
- do_syscall_64
- 5.50% __x64_sys_futex
- 5.42% do_futex
- 3.79% futex_wait
- 3.74% __futex_wait
- 3.53% futex_wait_queue
- 3.45% schedule
- 3.43% __schedule
- 2.06% pick_next_task
- 1.93% pick_next_task_fair
- 1.87% newidle_balance
- 1.52% load_balance
- 1.16% find_busiest_group
- 1.13% update_sd_lb_stats.constprop.0
1.01% update_sg_lb_stats
- 0.83% dequeue_task_fair
0.66% dequeue_entity
- 1.57% futex_wake
- 1.22% wake_up_q
- 1.20% try_to_wake_up
0.58% select_task_rq_fair
- 2.44% __x64_sys_nanosleep
- 2.36% hrtimer_nanosleep
- 2.33% do_nanosleep
- 2.05% schedule
- 2.03% __schedule
- 1.23% pick_next_task
- 1.15% pick_next_task_fair
- 1.12% newidle_balance
- 0.90% load_balance
- 0.68% find_busiest_group
- 0.66% update_sd_lb_stats.constprop.0
0.59% update_sg_lb_stats
0.52% dequeue_task_fair

When newidle_balance is disabled (or when using older kernels), perf
output is:
6.37% 0.41% a.out [kernel.vmlinux] [k] entry_SYSCALL_64
5.96% entry_SYSCALL_64
- do_syscall_64
- 3.97% __x64_sys_futex
- 3.89% do_futex
- 2.32% futex_wait
- 2.27% __futex_wait
- 2.05% futex_wait_queue
- 1.98% schedule
- 1.96% __schedule
- 0.81% dequeue_task_fair
0.66% dequeue_entity
- 0.64% pick_next_task
0.51% pick_next_task_fair
- 1.52% futex_wake
- 1.15% wake_up_q
- try_to_wake_up
0.59% select_task_rq_fair
- 1.58% __x64_sys_nanosleep
- 1.52% hrtimer_nanosleep
- 1.48% do_nanosleep
- 1.20% schedule
- 1.19% __schedule
0.53% dequeue_task_fair


Without a patch:
=========================
CPUs: 2 Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

# uname -r
6.8.1

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 63962 MB
node 0 free: 59961 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64446 MB
node 1 free: 63338 MB
node distances:
node 0 1
0: 10 21
1: 21 10

# head /proc/schedstat
version 15
timestamp 4295347219
cpu0 0 0 0 0 0 0 3035466036 858375615 67578
domain0 0000,01000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
domain1 000f,ff000fff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
domain2 ffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...

# cd /sys/kernel/debug/sched/domains
# echo -1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{name,flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/name:SMT
cpu0/domain1/name:MC
cpu0/domain2/name:NUMA

cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP
SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC
SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC
SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:2236
cpu0/domain1/max_newidle_lb_cost:3444
cpu0/domain2/max_newidle_lb_cost:4590

# echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC
SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:0
cpu0/domain1/max_newidle_lb_cost:0
cpu0/domain2/max_newidle_lb_cost:0

# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}

cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC
SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:309
cpu0/domain1/max_newidle_lb_cost:0
cpu0/domain2/max_newidle_lb_cost:0

# echo 2 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}

cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC
SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC
SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:276
cpu0/domain1/max_newidle_lb_cost:2776
cpu0/domain2/max_newidle_lb_cost:0

# echo 3 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC
SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC
SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:289
cpu0/domain1/max_newidle_lb_cost:3192
cpu0/domain2/max_newidle_lb_cost:0

# echo 4 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SERIALIZE SD_OVERLAP SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC
SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES
SD_PREFER_SIBLING
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC
SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain0/max_newidle_lb_cost:1306
cpu0/domain1/max_newidle_lb_cost:1999
cpu0/domain2/max_newidle_lb_cost:0

# echo 5 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
bash: echo: write error: Invalid argument
=========================


The same system with the patch applied:
=========================
# cd /sys/kernel/debug/sched/domains
# echo -1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{name,level,flags,groups_flags}
cpu0/domain0/name:SMT
cpu0/domain1/name:MC
cpu0/domain2/name:NUMA
cpu0/domain0/level:0
cpu0/domain1/level:2
cpu0/domain2/level:5
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...

# echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_EXEC ...

# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_EXEC ...

[skip 2, same as 1]

# echo 3 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...

[skip 4 and 5, same as 3]

# echo 6 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...

# echo 7 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
bash: echo: write error: Invalid argument
=========================

Vitalii Bursov (3):
sched/fair: allow disabling sched_balance_newidle with
sched_relax_domain_level
sched/debug: dump domains' level
docs: cgroup-v1: clarify that domain levels are system-specific

Documentation/admin-guide/cgroup-v1/cpusets.rst | 7 ++++++-
kernel/cgroup/cpuset.c | 2 +-
kernel/sched/debug.c | 1 +
kernel/sched/topology.c | 2 +-
4 files changed, 9 insertions(+), 3 deletions(-)

--
2.20.1



2024-04-03 13:35:21

by Vitalii Bursov

Subject: [PATCH v3 1/3] sched/fair: allow disabling sched_balance_newidle with sched_relax_domain_level

Change relax_domain_level checks so that it would be possible
to include or exclude all domains from newidle balancing.

This matches the behavior described in the documentation:
-1 no request. use system default or follow request of others.
0 no search.
1 search siblings (hyperthreads in a core).

"2" enables levels 0 and 1, level_max excludes the last (level_max)
level, and level_max+1 includes all levels.

Fixes: 9ae7ab20b483 ("sched/topology: Don't set SD_BALANCE_WAKE on cpuset domain relax")
Signed-off-by: Vitalii Bursov <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
---
kernel/cgroup/cpuset.c | 2 +-
kernel/sched/topology.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4237c8748715..da24187c4e02 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2948,7 +2948,7 @@ bool current_cpuset_is_being_rebound(void)
 static int update_relax_domain_level(struct cpuset *cs, s64 val)
 {
 #ifdef CONFIG_SMP
-        if (val < -1 || val >= sched_domain_level_max)
+        if (val < -1 || val > sched_domain_level_max + 1)
                 return -EINVAL;
 #endif

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 63aecd2a7a9f..67a777b31743 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1475,7 +1475,7 @@ static void set_domain_attribute(struct sched_domain *sd,
         } else
                 request = attr->relax_domain_level;

-        if (sd->level > request) {
+        if (sd->level >= request) {
                 /* Turn off idle balance on this domain: */
                 sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
         }
--
2.20.1


2024-04-03 13:46:44

by Vitalii Bursov

Subject: [PATCH v3 2/3] sched/debug: dump domains' level

Knowing the domain's exact level can be useful when setting
relax_domain_level or cpuset.sched_relax_domain_level.

Usage:
cat /debug/sched/domains/cpu0/domain1/level
to dump cpu0 domain1's level.

Signed-off-by: Vitalii Bursov <[email protected]>
---
kernel/sched/debug.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d5d98a5834d..c1eb9a1afd13 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -425,6 +425,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)

debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
+ debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);
}

void update_sched_domain_debugfs(void)
--
2.20.1


2024-04-04 12:28:39

by Vincent Guittot

Subject: Re: [PATCH v3 2/3] sched/debug: dump domains' level

On Wed, 3 Apr 2024 at 15:28, Vitalii Bursov <[email protected]> wrote:
>
> Knowing domain's level exactly can be useful when setting
> relax_domain_level or cpuset.sched_relax_domain_level
>
> Usage:
> cat /debug/sched/domains/cpu0/domain1/level
> to dump cpu0 domain1's level.
>
> Signed-off-by: Vitalii Bursov <[email protected]>

Acked-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/debug.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 8d5d98a5834d..c1eb9a1afd13 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -425,6 +425,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
>
> debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
> debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
> + debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);
> }
>
> void update_sched_domain_debugfs(void)
> --
> 2.20.1
>

2024-04-04 14:16:39

by Valentin Schneider

Subject: Re: [PATCH v3 1/3] sched/fair: allow disabling sched_balance_newidle with sched_relax_domain_level

On 03/04/24 16:28, Vitalii Bursov wrote:
> Change relax_domain_level checks so that it would be possible
> to include or exclude all domains from newidle balancing.
>
> This matches the behavior described in the documentation:
> -1 no request. use system default or follow request of others.
> 0 no search.
> 1 search siblings (hyperthreads in a core).
>
> "2" enables levels 0 and 1, level_max excludes the last (level_max)
> level, and level_max+1 includes all levels.
>
> Fixes: 9ae7ab20b483 ("sched/topology: Don't set SD_BALANCE_WAKE on cpuset domain relax")

Not that it matters too much, but wasn't the behaviour the same back then?
i.e.

        if (request < sd->level)
                sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);

So if relax_domain_level=0 we wouldn't clear the flags on e.g. SMT
(level=0)

AFAICT the docs & the code have always been misaligned:

4d5f35533fb9 ("sched, cpuset: customize sched domains, docs") [2008]
1d3504fcf560 ("sched, cpuset: customize sched domains, core") [2008]

History nitpicking aside, I think this makes sense, but existing users are
going to get a surprise...


2024-04-04 14:28:08

by Valentin Schneider

Subject: Re: [PATCH v3 2/3] sched/debug: dump domains' level

On 03/04/24 16:28, Vitalii Bursov wrote:
> Knowing domain's level exactly can be useful when setting
> relax_domain_level or cpuset.sched_relax_domain_level
>
> Usage:
> cat /debug/sched/domains/cpu0/domain1/level
> to dump cpu0 domain1's level.
>
> Signed-off-by: Vitalii Bursov <[email protected]>
> ---
> kernel/sched/debug.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 8d5d98a5834d..c1eb9a1afd13 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -425,6 +425,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
>
> debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
> debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
> + debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);

How about reusing the SDM macro? ->flags and ->groups_flags get special
treatment for pretty printing, but the others don't need that.
---
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c1eb9a1afd13e..f97902208b34d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -419,13 +419,13 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
         SDM(u32, 0644, busy_factor);
         SDM(u32, 0644, imbalance_pct);
         SDM(u32, 0644, cache_nice_tries);
+        SDM(u32, 0444, level);
         SDM(str, 0444, name);

 #undef SDM

         debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
         debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
-        debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);
 }

 void update_sched_domain_debugfs(void)


2024-04-04 15:10:56

by Vincent Guittot

Subject: Re: [PATCH v3 1/3] sched/fair: allow disabling sched_balance_newidle with sched_relax_domain_level

On Thu, 4 Apr 2024 at 16:14, Valentin Schneider <[email protected]> wrote:
>
> On 03/04/24 16:28, Vitalii Bursov wrote:
> > Change relax_domain_level checks so that it would be possible
> > to include or exclude all domains from newidle balancing.
> >
> > This matches the behavior described in the documentation:
> > -1 no request. use system default or follow request of others.
> > 0 no search.
> > 1 search siblings (hyperthreads in a core).
> >
> > "2" enables levels 0 and 1, level_max excludes the last (level_max)
> > level, and level_max+1 includes all levels.
> >
> > Fixes: 9ae7ab20b483 ("sched/topology: Don't set SD_BALANCE_WAKE on cpuset domain relax")
>
> Not that it matters too much, but wasn't the behaviour the same back then?
> i.e.
>
> if (request < sd->level)
> sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
>
> So if relax_domain_level=0 we wouldn't clear the flags on e.g. SMT
> (level=0)

Yes, I was too quick: this patch [2019] was quite "old" and was the
last one that changed the condition, so I assumed it was the culprit.
>
> AFAICT the docs & the code have always been misaligned:
>
> 4d5f35533fb9 ("sched, cpuset: customize sched domains, docs") [2008]
> 1d3504fcf560 ("sched, cpuset: customize sched domains, core") [2008]
>
> History nitpicking aside, I think this makes sense, but existing users are
> going to get a surprise...
>

2024-04-04 15:11:57

by Vitalii Bursov

Subject: Re: [PATCH v3 2/3] sched/debug: dump domains' level



On 04.04.24 17:21, Valentin Schneider wrote:
> On 03/04/24 16:28, Vitalii Bursov wrote:
>> Knowing domain's level exactly can be useful when setting
>> relax_domain_level or cpuset.sched_relax_domain_level
>>
>> Usage:
>> cat /debug/sched/domains/cpu0/domain1/level
>> to dump cpu0 domain1's level.
>>
>> Signed-off-by: Vitalii Bursov <[email protected]>
>> ---
>> kernel/sched/debug.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 8d5d98a5834d..c1eb9a1afd13 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -425,6 +425,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
>>
>> debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
>> debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
>> + debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);
>
> How about reusing the SDM macro? ->flags and ->groups_flags get special
> treatment for pretty printing, but the others don't need that.
> ---
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index c1eb9a1afd13e..f97902208b34d 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -419,13 +419,13 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
> SDM(u32, 0644, busy_factor);
> SDM(u32, 0644, imbalance_pct);
> SDM(u32, 0644, cache_nice_tries);
> + SDM(u32, 0444, level);
> SDM(str, 0444, name);
>
> #undef SDM
>
> debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
> debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
> - debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);
> }
>
> void update_sched_domain_debugfs(void)
>

This worked when I tried it. The reason I chose an explicit implementation
with debugfs_create_u32() is that "level" is an "int" and there is no
debugfs_create_{s32,int}(). While casting is not the best option either, it
hints that the types do not match.

In a few other cases where the types do not match, a cast is usually used,
e.g. the mod_debug_add_ulong macro in kernel/module/stats.c:
#define mod_debug_add_ulong(name) debugfs_create_ulong(#name, 0400, mod_debugfs_root, (unsigned long *) &name.counter)
where "counter" can be an s64 from an atomic64_t.

Thanks

2024-04-05 09:21:11

by Dietmar Eggemann

Subject: Re: [PATCH v3 0/3] sched/fair: allow disabling sched_balance_newidle with sched_relax_domain_level

On 03/04/2024 15:28, Vitalii Bursov wrote:
> Changes in v3:
> - Remove levels table change from the documentation patch
> - Link to v2: https://lore.kernel.org/lkml/[email protected]/
> Changes in v2:
> - Split debug.c change in a separate commit and move new "level"
> after "groups_flags"
> - Added "Fixes" tag and updated commit message
> - Update domain levels cgroup-v1/cpusets.rst documentation
> - Link to v1: https://lore.kernel.org/all/[email protected]/
>
> During the upgrade from Linux 5.4 we found a small (around 3%)
> performance regression which was tracked to commit
> c5b0a7eefc70150caf23e37bc9d639c68c87a097
>
> sched/fair: Remove sysctl_sched_migration_cost condition
>
> With a default value of 500us, sysctl_sched_migration_cost is
> significanlty higher than the cost of load_balance. Remove the
> condition and rely on the sd->max_newidle_lb_cost to abort
> newidle_balance.
>
> Looks like "newidle" balancing is beneficial for a lot of workloads,
> just not for this specific one. The workload is video encoding, there
> are 100s-1000s of threads, some are synchronized with mutexes and
> conditional variables. The process aims to have a portion of CPU idle,
> so no CPU cores are 100% busy. Perhaps, the performance impact we see
> comes from additional processing in the scheduler and additional cost
> like more cache misses, and not from an incorrect balancing. See
> perf output below.
>
> My understanding is that "sched_relax_domain_level" cgroup parameter
> should control if sched_balance_newidle() is called and what's the scope
> of the balancing is, but it doesn't fully work for this case.
>
> cpusets.rst documentation:
>> The 'cpuset.sched_relax_domain_level' file allows you to request changing
>> this searching range as you like. This file takes int value which
>> indicates size of searching range in levels ideally as follows,
>> otherwise initial value -1 that indicates the cpuset has no request.
>>
>> ====== ===========================================================
>> -1 no request. use system default or follow request of others.
>> 0 no search.
>> 1 search siblings (hyperthreads in a core).
>> 2 search cores in a package.
>> 3 search cpus in a node [= system wide on non-NUMA system]
>> 4 search nodes in a chunk of node [on NUMA system]
>> 5 search system wide [on NUMA system]
>> ====== ===========================================================

IMHO, this list misses:

2 search cores in a cluster.

Related to CONFIG_SCHED_CLUSTER.
Like you mentioned, if CONFIG_SCHED_CLUSTER is not configured MC becomes
level=1.

I ran this on an Arm64 TaiShan 2280 v2, Kunpeng 920 - 4826 server:

$ numactl -H | tail -6
node distances:
node 0 1 2 3
0: 10 12 20 22
1: 12 10 22 24
2: 20 22 10 12
3: 22 24 12 10

$ head -8 /proc/schedstat | awk '{ print $1 " " $2 }' | tail -5
domain0 00000000,00000000,0000000f
domain1 00000000,00000000,00ffffff
domain2 00000000,0000ffff,ffffffff
domain3 000000ff,ffffffff,ffffffff
domain4 ffffffff,ffffffff,ffffffff

with additional debug:

[ 18.196484] build_sched_domain() cpu=0 name=SMT level=0
[ 18.202308] build_sched_domain() cpu=0 name=CLS level=1
[ 18.208188] build_sched_domain() cpu=0 name=MC level=2
[ 18.222550] build_sched_domain() cpu=0 name=PKG level=3
[ 18.228371] build_sched_domain() cpu=0 name=NODE level=4
[ 18.234515] build_sched_domain() cpu=0 name=NUMA level=5
[ 18.246400] build_sched_domain() cpu=0 name=NUMA level=6
[ 18.258841] build_sched_domain() cpu=0 name=NUMA level=7

/* search cores in a cluster */
# echo 2 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level

# grep . /sys/kernel/debug/sched/domains/cpu0/*/{name,flags,level}
/sys/kernel/debug/sched/domains/cpu0/domain0/name:CLS
/sys/kernel/debug/sched/domains/cpu0/domain1/name:MC
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA
/sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA
/sys/kernel/debug/sched/domains/cpu0/domain4/name:NUMA
/sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_CLUSTER SD_SHARE_LLC SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
/sys/kernel/debug/sched/domains/cpu0/domain3/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
/sys/kernel/debug/sched/domains/cpu0/domain4/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
/sys/kernel/debug/sched/domains/cpu0/domain0/level:1
/sys/kernel/debug/sched/domains/cpu0/domain1/level:2
/sys/kernel/debug/sched/domains/cpu0/domain2/level:5
/sys/kernel/debug/sched/domains/cpu0/domain3/level:6
/sys/kernel/debug/sched/domains/cpu0/domain4/level:7

LGTM.

Tested-by: Dietmar Eggemann <[email protected]>

> Setting cpuset.sched_relax_domain_level to 0 works as 1.
>
> On a dual-CPU server, domains and levels are as follows:
> domain 0: level 0, SMT
> domain 1: level 2, MC

This is with CONFIG_SCHED_CLUSTER=y ?

[...]

2024-04-05 10:25:58

by Vitalii Bursov

Subject: Re: [PATCH v3 0/3] sched/fair: allow disabling sched_balance_newidle with sched_relax_domain_level



On 05.04.24 12:17, Dietmar Eggemann wrote:
> On 03/04/2024 15:28, Vitalii Bursov wrote:
>> Changes in v3:
>> - Remove levels table change from the documentation patch
>> - Link to v2: https://lore.kernel.org/lkml/[email protected]/
>> Changes in v2:
>> - Split debug.c change in a separate commit and move new "level"
>> after "groups_flags"
>> - Added "Fixes" tag and updated commit message
>> - Update domain levels cgroup-v1/cpusets.rst documentation
>> - Link to v1: https://lore.kernel.org/all/[email protected]/
>>
>> During the upgrade from Linux 5.4 we found a small (around 3%)
>> performance regression which was tracked to commit
>> c5b0a7eefc70150caf23e37bc9d639c68c87a097
>>
>> sched/fair: Remove sysctl_sched_migration_cost condition
>>
>> With a default value of 500us, sysctl_sched_migration_cost is
>> significanlty higher than the cost of load_balance. Remove the
>> condition and rely on the sd->max_newidle_lb_cost to abort
>> newidle_balance.
>>
>> Looks like "newidle" balancing is beneficial for a lot of workloads,
>> just not for this specific one. The workload is video encoding, there
>> are 100s-1000s of threads, some are synchronized with mutexes and
>> conditional variables. The process aims to have a portion of CPU idle,
>> so no CPU cores are 100% busy. Perhaps, the performance impact we see
>> comes from additional processing in the scheduler and additional cost
>> like more cache misses, and not from an incorrect balancing. See
>> perf output below.
>>
>> My understanding is that "sched_relax_domain_level" cgroup parameter
>> should control if sched_balance_newidle() is called and what's the scope
>> of the balancing is, but it doesn't fully work for this case.
>>
>> cpusets.rst documentation:
>>> The 'cpuset.sched_relax_domain_level' file allows you to request changing
>>> this searching range as you like. This file takes int value which
>>> indicates size of searching range in levels ideally as follows,
>>> otherwise initial value -1 that indicates the cpuset has no request.
>>>
>>> ====== ===========================================================
>>> -1 no request. use system default or follow request of others.
>>> 0 no search.
>>> 1 search siblings (hyperthreads in a core).
>>> 2 search cores in a package.
>>> 3 search cpus in a node [= system wide on non-NUMA system]
>>> 4 search nodes in a chunk of node [on NUMA system]
>>> 5 search system wide [on NUMA system]
>>> ====== ===========================================================
>
> IMHO, this list misses:
>
> 2 search cores in a cluster.
>
> Related to CONFIG_SCHED_CLUSTER.
> Like you mentioned, if CONFIG_SCHED_CLUSTER is not configured MC becomes
> level=1.

Previous discussion in v2 on this topic:
https://lore.kernel.org/linux-kernel/[email protected]/T/#maf4ad0ef3b8c18c8bb3e3524c683b6459c6f7f64

The table certainly depends on the kernel configuration, and describing this
dependency in detail probably isn't worth it, so how the table should look
in the documentation is debatable...

> I ran this on an Arm64 TaiShan 2280 v2, Kunpeng 920 - 4826 server:
>
> $ numactl -H | tail -6
> node distances:
> node 0 1 2 3
> 0: 10 12 20 22
> 1: 12 10 22 24
> 2: 20 22 10 12
> 3: 22 24 12 10
>
> $ head -8 /proc/schedstat | awk '{ print $1 " " $2 }' | tail -5
> domain0 00000000,00000000,0000000f
> domain1 00000000,00000000,00ffffff
> domain2 00000000,0000ffff,ffffffff
> domain3 000000ff,ffffffff,ffffffff
> domain4 ffffffff,ffffffff,ffffffff
>
> with additional debug:
>
> [ 18.196484] build_sched_domain() cpu=0 name=SMT level=0
> [ 18.202308] build_sched_domain() cpu=0 name=CLS level=1
> [ 18.208188] build_sched_domain() cpu=0 name=MC level=2
> [ 18.222550] build_sched_domain() cpu=0 name=PKG level=3
> [ 18.228371] build_sched_domain() cpu=0 name=NODE level=4
> [ 18.234515] build_sched_domain() cpu=0 name=NUMA level=5
> [ 18.246400] build_sched_domain() cpu=0 name=NUMA level=6
> [ 18.258841] build_sched_domain() cpu=0 name=NUMA level=7
>
> /* search cores in a cluster */
> # echo 2 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
>
> # grep . /sys/kernel/debug/sched/domains/cpu0/*/{name,flags,level}
> /sys/kernel/debug/sched/domains/cpu0/domain0/name:CLS
> /sys/kernel/debug/sched/domains/cpu0/domain1/name:MC
> /sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA
> /sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA
> /sys/kernel/debug/sched/domains/cpu0/domain4/name:NUMA
> /sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_CLUSTER SD_SHARE_LLC SD_PREFER_SIBLING
> /sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
> /sys/kernel/debug/sched/domains/cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
> /sys/kernel/debug/sched/domains/cpu0/domain3/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
> /sys/kernel/debug/sched/domains/cpu0/domain4/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
> /sys/kernel/debug/sched/domains/cpu0/domain0/level:1
> /sys/kernel/debug/sched/domains/cpu0/domain1/level:2
> /sys/kernel/debug/sched/domains/cpu0/domain2/level:5
> /sys/kernel/debug/sched/domains/cpu0/domain3/level:6
> /sys/kernel/debug/sched/domains/cpu0/domain4/level:7
>
> LGTM.
>
> Tested-by: Dietmar Eggemann <[email protected]>
>
>> Setting cpuset.sched_relax_domain_level to 0 works as 1.
>>
>> On a dual-CPU server, domains and levels are as follows:
>> domain 0: level 0, SMT
>> domain 1: level 2, MC
>
> This is with CONFIG_SCHED_CLUSTER=y ?
>
Yes, I tested mostly with RHEL9 and Debian12 configs on (some) x86-64
and those have CONFIG_SCHED_CLUSTER=y, but no separate CLS domain.

Thanks


2024-04-05 10:59:43

by Dietmar Eggemann

Subject: Re: [PATCH v3 0/3] sched/fair: allow disabling sched_balance_newidle with sched_relax_domain_level

On 05/04/2024 12:25, Vitalii Bursov wrote:
>
>
> On 05.04.24 12:17, Dietmar Eggemann wrote:
>> On 03/04/2024 15:28, Vitalii Bursov wrote:

[...]

>>>> ====== ===========================================================
>>>> -1 no request. use system default or follow request of others.
>>>> 0 no search.
>>>> 1 search siblings (hyperthreads in a core).
>>>> 2 search cores in a package.
>>>> 3 search cpus in a node [= system wide on non-NUMA system]
>>>> 4 search nodes in a chunk of node [on NUMA system]
>>>> 5 search system wide [on NUMA system]
>>>> ====== ===========================================================
>>
>> IMHO, this list misses:
>>
>> 2 search cores in a cluster.
>>
>> Related to CONFIG_SCHED_CLUSTER.
>> Like you mentioned, if CONFIG_SCHED_CLUSTER is not configured MC becomes
>> level=1.
>
> Previous discussion in v2 on this topic:
> https://lore.kernel.org/linux-kernel/[email protected]/T/#maf4ad0ef3b8c18c8bb3e3524c683b6459c6f7f64

Sorry, I missed this discussion.

I thought that SCHED_CLUSTER is based on shared L3 tags (Arm64
kunpeng920) or L2 cache (X86 Jacobsville) so it's similar to SCHED_MC
just one level down?

> The table certainly depends on the kernel configuration, and describing this
> dependency in detail probably isn't worth it, so how the table should look
> in the documentation is debatable...

[...]