2023-07-27 06:52:58

by Chen Yu

Subject: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance

Hi,

This is the second version of the newidle balance optimization[1].
It aims to reduce the cost of newidle balance, which is found to
occupy a noticeable amount of CPU cycles on some high core count systems.

For example, when running sqlite on Intel Sapphire Rapids, which has
2 x 56C/112T = 224 CPUs:

6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats

To mitigate this cost, the optimization is inspired by the question
raised by Tim:
Do we always have to find the busiest group and pull from it? Would
a relatively busy group be enough?

There are two proposals in this patch set.
The first one is ILB_UTIL. It limits the scan depth in
update_sd_lb_stats() based on the overall utilization of the sched
domain: the higher the utilization, the fewer groups
update_sd_lb_stats() scans, and vice versa.

The second one is ILB_FAST. Instead of always finding the busiest
group in update_sd_lb_stats(), it lowers the bar and tries to find a
relatively busy group. ILB_FAST takes effect when the local group
is group_has_spare, because when many CPUs run newidle_balance()
concurrently, the sched groups usually have a high idle percentage.

Comparing ILB_UTIL and ILB_FAST: the former inhibits the sched group
scan when the system is busy, while the latter settles for a
reasonably busy group when the system is not busy. So they are
complementary to each other and work independently.
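
Below is a simplified sketch of where the two features are intended to
act in the update_sd_lb_stats() group scan loop. It is an illustration
only; details such as the snapshot fallback and the newidle-only gating
are omitted, see the individual patches for the real code:

    do {
        update_sg_lb_stats(...);    /* per-group statistics */

        if (update_sd_pick_busiest(env, sds, sg, sgs)) {
            sds->busiest = sg;
            sds->busiest_stat = *sgs;
            /*
             * ILB_FAST: the candidate is already busier than the
             * (mostly idle) local group, stop the scan and pull
             * from it.
             */
            if (sched_feat(ILB_FAST) && can_pull_busiest(local, sgs))
                break;
        }

        /*
         * ILB_UTIL: scan budget computed by the previous periodic
         * balance from the overall domain utilization.
         */
        if (sched_feat(ILB_UTIL) && --nr_sg_scan <= 0)
            break;

        sg = sg->next;
    } while (sg != env->sd->groups);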

patch 1/7 and patch 2/7 are preparation for ILB_UTIL.

patch 3/7 is preparation for both ILB_UTIL and ILB_FAST.

patch 4/7 is part of ILB_UTIL. It calculates the scan depth
of sched groups, which will later be used by
update_sd_lb_stats(). The depth is calculated during the
periodic load balance.

patch 5/7 introduces the ILB_UTIL.

patch 6/7 introduces the ILB_FAST.

patch 7/7 is a debug patch to print more sched statistics, inspired
by Prateek's test report.

In the previous version, Prateek found some regressions[2].
They were probably caused by:
1. Cross-NUMA access to sched_domain_shared. This version therefore
   removes the sched_domain_shared for the NUMA domain.
2. newidle balance not trying as hard to scan for the busiest
   group. This version still keeps the linear scan function. If
   the regression is still there, we can try to leverage the result
   of SIS_UTIL, because SIS_UTIL is a quadratic function which
   could help scan the domain harder when the system is not
   overloaded.

Changes since the previous version:
1. For all levels except NUMA, connect a sched_domain_shared
   instance. This makes the newidle balance optimization more
   generic, rather than only for the LLC domain. (Peter, Gautham)
2. Introduce ILB_FAST, which terminates the sched group scan
   earlier if it finds a suitable group rather than the busiest
   one. (Tim)


Peter has suggested reusing the statistics of the sched group
if multiple CPUs trigger newidle balance concurrently[3]. I created
a prototype[4] based on this direction. According to the tests, there
are some regressions. The bottlenecks are a spin_trylock() and the
memory load from the 'cached' shared region. It is still under
investigation, so I did not include that change in this patch set.

Any comments would be appreciated.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/
[4] https://github.com/chen-yu-surf/linux/commit/a6b33df883b972d6aaab5fceeddb11c34cc59059.patch

Chen Yu (7):
sched/topology: Assign sd_share for all non NUMA sched domains
sched/topology: Introduce nr_groups in sched_domain to indicate the
number of groups
sched/fair: Save a snapshot of sched domain total_load and
total_capacity
sched/fair: Calculate the scan depth for idle balance based on system
utilization
sched/fair: Adjust the busiest group scanning depth in idle load
balance
sched/fair: Pull from a relatively busy group during newidle balance
sched/stats: Track the scan number of groups during load balance

include/linux/sched/topology.h | 5 ++
kernel/sched/fair.c | 114 ++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 4 ++
kernel/sched/stats.c | 5 +-
kernel/sched/topology.c | 14 ++--
5 files changed, 135 insertions(+), 7 deletions(-)

--
2.25.1



2023-07-27 07:04:35

by Chen Yu

Subject: [RFC PATCH 7/7] sched/stats: Track the scan number of groups during load balance

This metric could be used to evaluate the load balance cost and
efficiency.
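
With this patch, each idle type's block in the per-domain line of
/proc/schedstat gains a ninth field, lb_sg_scan, appended after
lb_nobusyg. As one possible way to consume it (an illustration, not
part of this patch), the average number of groups scanned per balance
attempt can be estimated as lb_sg_scan / lb_count for a given idle
type; the schedstat deltas shown in patch 6/7 are read this way.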

Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 2 ++
kernel/sched/stats.c | 5 +++--
3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index af2261308529..fa8fc6a497fd 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -124,6 +124,7 @@ struct sched_domain {
unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
unsigned int lb_nobusyq[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_sg_scan[CPU_MAX_IDLE_TYPES];

/* Active load balancing */
unsigned int alb_count;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9af57b5a24dc..96df7c5706d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10253,6 +10253,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
goto next_group;


+ schedstat_inc(env->sd->lb_sg_scan[env->idle]);
+
if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
sds->busiest_stat = *sgs;
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..38608f791363 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -152,7 +152,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
cpumask_pr_args(sched_domain_span(sd)));
for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
itype++) {
- seq_printf(seq, " %u %u %u %u %u %u %u %u",
+ seq_printf(seq, " %u %u %u %u %u %u %u %u %u",
sd->lb_count[itype],
sd->lb_balanced[itype],
sd->lb_failed[itype],
@@ -160,7 +160,8 @@ static int show_schedstat(struct seq_file *seq, void *v)
sd->lb_gained[itype],
sd->lb_hot_gained[itype],
sd->lb_nobusyq[itype],
- sd->lb_nobusyg[itype]);
+ sd->lb_nobusyg[itype],
+ sd->lb_sg_scan[itype]);
}
seq_printf(seq,
" %u %u %u %u %u %u %u %u %u %u %u %u\n",
--
2.25.1


2023-07-27 07:04:45

by Chen Yu

Subject: [RFC PATCH 4/7] sched/fair: Calculate the scan depth for idle balance based on system utilization

When a CPU is about to enter idle, it invokes newidle_balance()
to pull some tasks from other runqueues. Although there is a per-domain
max_newidle_lb_cost to throttle newidle_balance(), it would be good
to further limit the scan based on overall system utilization. The
reason is that there is no limit on how many CPUs can launch
newidle_balance() simultaneously. Since each newidle_balance() has to
traverse all the groups to calculate the statistics one by one, the
total time cost of newidle_balance() could be O(n^2), where n is the
number of groups. This issue is more severe if there are many groups
within one domain, for example, on a system with a large number of
cores in an LLC domain. This is not good for performance or
power saving.

sqlite spends quite some time in newidle_balance() on Intel
Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats

Based on this observation, limit the scan depth of newidle_balance()
by considering the utilization of the sched domain. Let the number of
scanned groups be a linear function of the utilization ratio:

nr_groups_to_scan = nr_groups * (1 - util_ratio)
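
For example (hypothetical numbers): on a sched domain with 56 groups
whose CPUs are about 75% utilized on average, util_ratio = 0.75 and

nr_groups_to_scan = 56 * (1 - 0.75) = 14

so a newidle balance stops after scanning 14 groups, while on a mostly
idle domain it would still be allowed to scan all 56 groups.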

Suggested-by: Tim Chen <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
3 files changed, 32 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d6a64a2c92aa..af2261308529 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -84,6 +84,7 @@ struct sched_domain_shared {
int nr_idle_scan;
unsigned long total_load;
unsigned long total_capacity;
+ int nr_sg_scan;
};

struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index edcfee9965cd..6925813db59b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10153,6 +10153,35 @@ static void ilb_save_stats(struct lb_env *env,
WRITE_ONCE(sd_share->total_capacity, sds->total_capacity);
}

+static void update_ilb_group_scan(struct lb_env *env,
+ unsigned long sum_util,
+ struct sched_domain_shared *sd_share)
+{
+ u64 tmp, nr_scan;
+
+ if (!sched_feat(ILB_UTIL))
+ return;
+
+ if (!sd_share)
+ return;
+
+ if (env->idle == CPU_NEWLY_IDLE)
+ return;
+
+ /*
+ * Limit the newidle balance scan depth based on overall system
+ * utilization:
+ * nr_groups_scan = nr_groups * (1 - util_ratio)
+ * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
+ */
+ nr_scan = env->sd->nr_groups * sum_util;
+ tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
+ do_div(nr_scan, tmp);
+ nr_scan = env->sd->nr_groups - nr_scan;
+ if ((int)nr_scan != sd_share->nr_sg_scan)
+ WRITE_ONCE(sd_share->nr_sg_scan, (int)nr_scan);
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -10231,6 +10260,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}

update_idle_cpu_scan(env, sum_util);
+ update_ilb_group_scan(env, sum_util, sd_share);

/* save a snapshot of stats during periodic load balance */
ilb_save_stats(env, sd_share, sds);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3cb71c8cddc0..30f6d1a2f235 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,3 +103,4 @@ SCHED_FEAT(ALT_PERIOD, true)
SCHED_FEAT(BASE_SLICE, true)

SCHED_FEAT(ILB_SNAPSHOT, true)
+SCHED_FEAT(ILB_UTIL, true)
--
2.25.1


2023-07-27 07:04:45

by Chen Yu

Subject: [RFC PATCH 1/7] sched/topology: Assign sd_share for all non NUMA sched domains

Currently, only a domain with the SD_SHARE_PKG_RESOURCES flag
shares one sd_share instance among all CPUs in that domain. Remove
this restriction and extend it to the other sched domains below the
NUMA level.

This shared field will be used by a later patch which optimizes
newidle balancing.

Suggested-by: "Gautham R. Shenoy" <[email protected]>
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/topology.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d3a3b2646ec4..64212f514765 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,10 +1641,10 @@ sd_init(struct sched_domain_topology_level *tl,
}

/*
- * For all levels sharing cache; connect a sched_domain_shared
+ * For all levels except for NUMA; connect a sched_domain_shared
* instance.
*/
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+ if (!(sd->flags & SD_NUMA)) {
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
--
2.25.1


2023-07-27 07:05:38

by Chen Yu

Subject: [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance

Scanning the whole sched domain to find the busiest group is costly
during newidle_balance(). And when a CPU becomes idle, it would be good
if this idle CPU could pull some tasks from other CPUs as quickly as
possible.

Limit the scan depth of newidle_balance() so that it only scans a
limited number of sched groups to find a relatively busy group, and
pulls from it. In summary, the more spare capacity there is in the
domain, the more groups each newidle balance is allowed to scan when
looking for a busy group. Although the newidle balance has a per-domain
max_newidle_lb_cost to decide whether to launch the balance or not,
ILB_UTIL provides a finer granularity to decide how many groups each
newidle balance can scan.

The scan depth is calculated by the previous periodic load balance
based on the domain's overall utilization.
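
For example (hypothetical numbers): if the previous periodic load
balance observed the 56-group domain running at about half of its
capacity, it writes nr_sg_scan = 28 into sd_share. A subsequent
newidle balance on that domain then stops after scanning 28 groups and
borrows the total_load/total_capacity snapshot saved by the periodic
balance for the rest of the domain.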

Tested on top of v6.5-rc2 on Sapphire Rapids with 2 x 56C/112T = 224
CPUs, with the cpufreq governor set to performance and C6 disabled.

First, tested with an extreme synthetic test[1], which launches 224
processes. Each process is a loop of nanosleep(1 us), which is supposed
to trigger newidle balance as much as possible:

i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;

NO_ILB_UTIL + ILB_SNAPSHOT:
9.38% 0.45% [kernel.kallsyms] [k] newidle_balance
6.84% 5.32% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0

ILB_UTIL + ILB_SNAPSHOT:
3.35% 0.38% [kernel.kallsyms] [k] newidle_balance
2.30% 1.81% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0

With ILB_UTIL enabled, the total number of newidle_balance() and
update_sd_lb_stats() calls drops. The reason why there are fewer
newidle balances has not been investigated yet. Given the low util_avg
values in /sys/kernel/debug/sched/debug, there should not be much
impact on the nanosleep stress test.

Test in a wider range:

[netperf]
Launches nr instances of:
netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &

nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR UDP_RR

throughput
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 5.15) -3.96 ( 2.17)
TCP_RR 112-threads 1.00 ( 2.84) -0.82 ( 2.24)
TCP_RR 168-threads 1.00 ( 2.11) -0.03 ( 2.31)
TCP_RR 224-threads 1.00 ( 1.76) +0.01 ( 2.12)
TCP_RR 280-threads 1.00 ( 62.46) +56.56 ( 56.91)
TCP_RR 336-threads 1.00 ( 19.81) +0.27 ( 17.90)
TCP_RR 392-threads 1.00 ( 30.85) +0.13 ( 29.09)
TCP_RR 448-threads 1.00 ( 39.71) -18.82 ( 45.93)
UDP_RR 56-threads 1.00 ( 2.08) -0.31 ( 7.89)
UDP_RR 112-threads 1.00 ( 3.22) -0.50 ( 15.19)
UDP_RR 168-threads 1.00 ( 11.77) +0.37 ( 10.30)
UDP_RR 224-threads 1.00 ( 14.03) +0.25 ( 12.88)
UDP_RR 280-threads 1.00 ( 16.83) -0.57 ( 15.34)
UDP_RR 336-threads 1.00 ( 22.57) +0.01 ( 24.68)
UDP_RR 392-threads 1.00 ( 33.89) +2.65 ( 33.89)
UDP_RR 448-threads 1.00 ( 44.18) +0.81 ( 41.28)

Considering the std%, there is not much difference for netperf.

[tbench]
tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448

throughput
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 2.20) -0.09 ( 2.05)
loopback 112-threads 1.00 ( 0.29) -0.88 ( 0.10)
loopback 168-threads 1.00 ( 0.02) +62.92 ( 54.57)
loopback 224-threads 1.00 ( 0.05) +234.30 ( 1.81)
loopback 280-threads 1.00 ( 0.08) -0.11 ( 0.21)
loopback 336-threads 1.00 ( 0.17) -0.17 ( 0.08)
loopback 392-threads 1.00 ( 0.14) -0.09 ( 0.18)
loopback 448-threads 1.00 ( 0.24) -0.53 ( 0.55)

There is an improvement for tbench in the 224-threads case.

[hackbench]

hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28

job: 1, 2, 4, 8
work_type: process threads

throughput
==========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.20) +1.57 ( 0.58)
process-pipe 2-groups 1.00 ( 3.53) +2.99 ( 2.03)
process-pipe 4-groups 1.00 ( 1.07) +0.17 ( 1.64)
process-sockets 1-groups 1.00 ( 0.36) -0.04 ( 1.44)
process-sockets 2-groups 1.00 ( 0.84) +0.65 ( 1.65)
process-sockets 4-groups 1.00 ( 0.04) +0.89 ( 0.08)
threads-pipe 1-groups 1.00 ( 3.62) -0.53 ( 1.67)
threads-pipe 2-groups 1.00 ( 4.17) -4.79 ( 0.53)
threads-pipe 4-groups 1.00 ( 5.30) +5.06 ( 1.95)
threads-sockets 1-groups 1.00 ( 0.40) +1.44 ( 0.53)
threads-sockets 2-groups 1.00 ( 2.54) +2.21 ( 2.51)
threads-sockets 4-groups 1.00 ( 0.05) +1.29 ( 0.05)

Not much difference for hackbench.

[schbench(old)]
schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations

99.0th latency
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.56) -0.91 ( 0.32)
normal 2-mthreads 1.00 ( 0.95) -4.05 ( 3.63)
normal 4-mthreads 1.00 ( 4.04) -0.30 ( 2.35)

Not much difference for schbench.

[Limitation]
In the previous version, Prateek reported a regression. That could be
due to concurrent access across NUMA nodes, or because ILB_UTIL did not
scan hard enough to pull from the busiest group. The former issue is
addressed by not enabling ILB_UTIL for the NUMA domain. If there is
still a regression in this version, we can leverage the result of
SIS_UTIL to provide a quadratic function rather than a linear one, so
as to scan harder when the system is idle.

Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
Suggested-by: Tim Chen <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/fair.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6925813db59b..4e360ed16e14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10195,7 +10195,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
unsigned long sum_util = 0;
- int sg_status = 0;
+ int sg_status = 0, nr_sg_scan;
+ /* only newidle CPU can load the snapshot */
+ bool ilb_can_load = env->idle == CPU_NEWLY_IDLE &&
+ sd_share && READ_ONCE(sd_share->total_capacity);
+
+ if (sched_feat(ILB_UTIL) && ilb_can_load)
+ nr_sg_scan = sd_share->nr_sg_scan;

do {
struct sg_lb_stats *sgs = &tmp_sgs;
@@ -10222,6 +10228,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sds->busiest_stat = *sgs;
}

+ if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
+ goto load_snapshot;
+
next_group:
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
@@ -10231,6 +10240,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sg = sg->next;
} while (sg != env->sd->groups);

+ ilb_can_load = false;
+
+load_snapshot:
+ if (ilb_can_load) {
+ /* borrow the statistic of previous periodic load balance */
+ sds->total_load = READ_ONCE(sd_share->total_load);
+ sds->total_capacity = READ_ONCE(sd_share->total_capacity);
+ }
+
/*
* Indicate that the child domain of the busiest group prefers tasks
* go to a child's sibling domains first. NB the flags of a sched group
--
2.25.1


2023-07-27 07:07:42

by Chen Yu

Subject: [RFC PATCH 6/7] sched/fair: Pull from a relatively busy group during newidle balance

Scanning the whole sched domain to find the busiest group is costly
during newidle_balance() on a high core count system.

Introduce ILB_FAST to lower the bar during the busiest group
scanning. If the target sched group is relatively busier than the
local group, terminate the scan and try to pull from that group
directly.

Comparing ILB_UTIL and ILB_FAST: the former inhibits the sched group
scan when the system is busy, while the latter settles for a
reasonably busy group when the system is not busy. So they are
complementary to each other and work independently.

Tested on top of v6.5-rc2 on Sapphire Rapids with 2 x 56C/112T = 224
CPUs, with the cpufreq governor set to performance and C6 disabled.

First, tested with an extreme synthetic test[1] borrowed from
Tianyou. It launches 224 processes. Each process is a loop of
nanosleep(1 us), which is supposed to trigger newidle balance
frequently:

i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;

[ILB_SNAPSHOT + NO_ILB_UTIL + NO_ILB_FAST]
Check the /proc/schedstat delta on CPU8 within 5 seconds using
the following script[2] by running: schedstat.py -i 5 -c 8

Mon Jul 24 23:43:43 2023 cpu8
.domain0.CPU_IDLE.lb_balanced 843
.domain0.CPU_IDLE.lb_count 843
.domain0.CPU_IDLE.lb_nobusyg 843
.domain0.CPU_IDLE.lb_sg_scan 843
.domain0.CPU_NEWLY_IDLE.lb_balanced 836
.domain0.CPU_NEWLY_IDLE.lb_count 837
.domain0.CPU_NEWLY_IDLE.lb_gained 1
.domain0.CPU_NEWLY_IDLE.lb_imbalance 1
.domain0.CPU_NEWLY_IDLE.lb_nobusyg 836
.domain0.CPU_NEWLY_IDLE.lb_sg_scan 837
.domain1.CPU_IDLE.lb_balanced 41
.domain1.CPU_IDLE.lb_count 41
.domain1.CPU_IDLE.lb_nobusyg 39
.domain1.CPU_IDLE.lb_sg_scan 2145
.domain1.CPU_NEWLY_IDLE.lb_balanced 732 <-----
.domain1.CPU_NEWLY_IDLE.lb_count 822 <-----
.domain1.CPU_NEWLY_IDLE.lb_failed 90
.domain1.CPU_NEWLY_IDLE.lb_imbalance 90
.domain1.CPU_NEWLY_IDLE.lb_nobusyg 497
.domain1.CPU_NEWLY_IDLE.lb_nobusyq 235
.domain1.CPU_NEWLY_IDLE.lb_sg_scan 45210 <-----
.domain1.ttwu_wake_remote 626
.domain2.CPU_IDLE.lb_balanced 15
.domain2.CPU_IDLE.lb_count 15
.domain2.CPU_NEWLY_IDLE.lb_balanced 635
.domain2.CPU_NEWLY_IDLE.lb_count 655
.domain2.CPU_NEWLY_IDLE.lb_failed 20
.domain2.CPU_NEWLY_IDLE.lb_imbalance 40
.domain2.CPU_NEWLY_IDLE.lb_nobusyg 633
.domain2.CPU_NEWLY_IDLE.lb_nobusyq 2
.domain2.CPU_NEWLY_IDLE.lb_sg_scan 655
.stats.rq_cpu_time 227910772
.stats.rq_sched_info.pcount 89393
.stats.rq_sched_info.run_delay 2145671
.stats.sched_count 178783
.stats.sched_goidle 89390
.stats.ttwu_count 89392
.stats.ttwu_local 88766

For domain1, there are 822 newidle balance attempts, and
the total number of groups scanned is 45210, so each
balance scans about 55 groups. Among these 822 attempts,
732 become (or already are) balanced, so the effective
balance success ratio is (822 - 732) / 822 = 10.94%.

The perf:
9.38% 0.45% [kernel.kallsyms] [k] newidle_balance
6.84% 5.32% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0

[ILB_SNAPSHOT + NO_ILB_UTIL + ILB_FAST]
Mon Jul 24 23:43:50 2023 cpu8
.domain0.CPU_IDLE.lb_balanced 918
.domain0.CPU_IDLE.lb_count 918
.domain0.CPU_IDLE.lb_nobusyg 918
.domain0.CPU_IDLE.lb_sg_scan 918
.domain0.CPU_NEWLY_IDLE.lb_balanced 1536
.domain0.CPU_NEWLY_IDLE.lb_count 1545
.domain0.CPU_NEWLY_IDLE.lb_failed 1
.domain0.CPU_NEWLY_IDLE.lb_gained 8
.domain0.CPU_NEWLY_IDLE.lb_imbalance 9
.domain0.CPU_NEWLY_IDLE.lb_nobusyg 1536
.domain0.CPU_NEWLY_IDLE.lb_sg_scan 1545
.domain1.CPU_IDLE.lb_balanced 45
.domain1.CPU_IDLE.lb_count 45
.domain1.CPU_IDLE.lb_nobusyg 43
.domain1.CPU_IDLE.lb_sg_scan 2365
.domain1.CPU_NEWLY_IDLE.lb_balanced 1196 <------
.domain1.CPU_NEWLY_IDLE.lb_count 1496 <------
.domain1.CPU_NEWLY_IDLE.lb_failed 296
.domain1.CPU_NEWLY_IDLE.lb_gained 4
.domain1.CPU_NEWLY_IDLE.lb_imbalance 301
.domain1.CPU_NEWLY_IDLE.lb_nobusyg 1182
.domain1.CPU_NEWLY_IDLE.lb_nobusyq 14
.domain1.CPU_NEWLY_IDLE.lb_sg_scan 30127 <------
.domain1.ttwu_wake_remote 2688
.domain2.CPU_IDLE.lb_balanced 13
.domain2.CPU_IDLE.lb_count 13
.domain2.CPU_NEWLY_IDLE.lb_balanced 898
.domain2.CPU_NEWLY_IDLE.lb_count 904
.domain2.CPU_NEWLY_IDLE.lb_failed 6
.domain2.CPU_NEWLY_IDLE.lb_imbalance 11
.domain2.CPU_NEWLY_IDLE.lb_nobusyg 896
.domain2.CPU_NEWLY_IDLE.lb_nobusyq 2
.domain2.CPU_NEWLY_IDLE.lb_sg_scan 904
.stats.rq_cpu_time 239830575
.stats.rq_sched_info.pcount 90879
.stats.rq_sched_info.run_delay 2436461
.stats.sched_count 181732
.stats.sched_goidle 90853
.stats.ttwu_count 90880
.stats.ttwu_local 88192

With ILB_FAST enabled, the CPU_NEWLY_IDLE balance count in domain1 on
CPU8 is 1496, and the total number of groups scanned is 30127. Each
load balance scans about 20 groups, which is less than half of the 56
groups in the domain. Among these 1496 attempts, 1196 are balanced, so
the effective balance success ratio is (1496 - 1196) / 1496 = 20.05%,
which is higher than the 10.94% when ILB_FAST is disabled.

perf profile:

2.95% 0.38% [kernel.kallsyms] [k] newidle_balance
2.00% 1.51% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0

With ILB_FAST enabled, the total cost of update_sd_lb_stats() has dropped a lot.

More benchmark results are shown below. The baseline is
ILB_SNAPSHOT + NO_ILB_UTIL, compared with
ILB_SNAPSHOT + NO_ILB_UTIL + ILB_FAST.

[netperf]
Launches nr instances of:
netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &

nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR UDP_RR

throughput
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 1.83) +4.25 ( 5.15)
TCP_RR 112-threads 1.00 ( 2.19) +0.96 ( 2.84)
TCP_RR 168-threads 1.00 ( 1.92) -0.04 ( 2.11)
TCP_RR 224-threads 1.00 ( 1.98) -0.03 ( 1.76)
TCP_RR 280-threads 1.00 ( 63.11) -7.59 ( 62.46)
TCP_RR 336-threads 1.00 ( 18.44) -0.45 ( 19.81)
TCP_RR 392-threads 1.00 ( 26.49) -0.09 ( 30.85)
TCP_RR 448-threads 1.00 ( 40.47) -0.28 ( 39.71)
UDP_RR 56-threads 1.00 ( 1.83) -0.31 ( 2.08)
UDP_RR 112-threads 1.00 ( 13.77) +3.58 ( 3.22)
UDP_RR 168-threads 1.00 ( 10.97) -0.08 ( 11.77)
UDP_RR 224-threads 1.00 ( 12.83) -0.04 ( 14.03)
UDP_RR 280-threads 1.00 ( 13.89) +0.35 ( 16.83)
UDP_RR 336-threads 1.00 ( 24.91) +1.38 ( 22.57)
UDP_RR 392-threads 1.00 ( 34.86) -0.91 ( 33.89)
UDP_RR 448-threads 1.00 ( 40.63) +0.70 ( 44.18)

[tbench]
tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448

throughput
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 0.89) +1.51 ( 2.20)
loopback 112-threads 1.00 ( 0.03) +1.15 ( 0.29)
loopback 168-threads 1.00 ( 53.55) -37.92 ( 0.02)
loopback 224-threads 1.00 ( 61.24) -43.18 ( 0.01)
loopback 280-threads 1.00 ( 0.04) +0.33 ( 0.08)
loopback 336-threads 1.00 ( 0.35) +0.40 ( 0.17)
loopback 392-threads 1.00 ( 0.61) +0.49 ( 0.14)
loopback 448-threads 1.00 ( 0.08) +0.01 ( 0.24)

[schbench]
schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations

99.0th latency
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.56) -0.45 ( 0.32)
normal 2-mthreads 1.00 ( 0.95) +1.01 ( 3.45)
normal 4-mthreads 1.00 ( 4.04) -0.60 ( 1.26)

[hackbench]
hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28

job: 1, 2, 4, 8
work_type: process threads

throughput
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.20) +2.30 ( 0.26)
process-pipe 2-groups 1.00 ( 3.53) +6.14 ( 2.45)
process-pipe 4-groups 1.00 ( 1.07) -4.58 ( 2.58)
process-sockets 1-groups 1.00 ( 0.36) +0.75 ( 1.22)
process-sockets 2-groups 1.00 ( 0.84) +1.26 ( 1.11)
process-sockets 4-groups 1.00 ( 0.04) +0.97 ( 0.11)
threads-pipe 1-groups 1.00 ( 3.62) +3.22 ( 2.64)
threads-pipe 2-groups 1.00 ( 4.17) +5.85 ( 7.53)
threads-pipe 4-groups 1.00 ( 5.30) -4.14 ( 5.39)
threads-sockets 1-groups 1.00 ( 0.40) +3.50 ( 3.13)
threads-sockets 2-groups 1.00 ( 2.54) +1.79 ( 0.80)
threads-sockets 4-groups 1.00 ( 0.05) +1.33 ( 0.03)

Considering the std%, there is not much score difference noticed.
This probably indicates that ILB_FAST reduces the cost of newidle
balance without hurting performance.

Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/schedstat.py #2
Suggested-by: Tim Chen <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e360ed16e14..9af57b5a24dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10182,6 +10182,36 @@ static void update_ilb_group_scan(struct lb_env *env,
WRITE_ONCE(sd_share->nr_sg_scan, (int)nr_scan);
}

+static bool can_pull_busiest(struct sg_lb_stats *local,
+ struct sg_lb_stats *busiest)
+{
+ /*
+ * Check if the local group can pull from the 'busiest'
+ * group directly. When reaching here, update_sd_pick_busiest()
+ * has already filtered a candidate.
+ * The scan in newidle load balance on high core count system
+ * is costly, thus provide this shortcut to find a relative busy
+ * group rather than the busiest one.
+ *
+ * Only enable this shortcut when the local group is quite
+ * idle. This is because the total cost of newidle_balance()
+ * becomes severe when multiple CPUs fall into idle and launch
+ * newidle_balance() concurrently. And that usually indicates
+ * a group_has_spare status.
+ */
+ if (local->group_type != group_has_spare)
+ return false;
+
+ if (busiest->idle_cpus > local->idle_cpus)
+ return false;
+
+ if (busiest->idle_cpus == local->idle_cpus &&
+ busiest->sum_nr_running <= local->sum_nr_running)
+ return false;
+
+ return true;
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -10226,6 +10256,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
sds->busiest_stat = *sgs;
+ /*
+ * Check if this busiest group can be pulled by the
+ * local group directly.
+ */
+ if (sched_feat(ILB_FAST) && ilb_can_load &&
+ can_pull_busiest(local, sgs))
+ goto load_snapshot;
}

if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 30f6d1a2f235..4d67e0abb78c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -104,3 +104,4 @@ SCHED_FEAT(BASE_SLICE, true)

SCHED_FEAT(ILB_SNAPSHOT, true)
SCHED_FEAT(ILB_UTIL, true)
+SCHED_FEAT(ILB_FAST, true)
--
2.25.1


2023-07-27 07:40:13

by Chen Yu

Subject: [RFC PATCH 3/7] sched/fair: Save a snapshot of sched domain total_load and total_capacity

Save the total_load and total_capacity of the current sched domain in
each periodic load balance. These statistics can be used later by a
CPU_NEWLY_IDLE load balance if it quits the scan early. Introduce a
sched feature, ILB_SNAPSHOT, to control this. Code can check whether
sd_share->total_capacity is non-zero to verify that the snapshot is
valid.

In theory, if the system has reached a stable state, the total_capacity
and total_load should not change dramatically.
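
As an illustration, a consumer can validate and load the snapshot the
way patch 5/7 does:

    /* only a newly idle CPU loads the snapshot, and only if it is valid */
    bool ilb_can_load = env->idle == CPU_NEWLY_IDLE &&
                        sd_share && READ_ONCE(sd_share->total_capacity);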

Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched/topology.h | 2 ++
kernel/sched/fair.c | 25 +++++++++++++++++++++++++
kernel/sched/features.h | 2 ++
3 files changed, 29 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07f2f00317a..d6a64a2c92aa 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,8 @@ struct sched_domain_shared {
atomic_t nr_busy_cpus;
int has_idle_cores;
int nr_idle_scan;
+ unsigned long total_load;
+ unsigned long total_capacity;
};

struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..edcfee9965cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10132,6 +10132,27 @@ static void update_idle_cpu_scan(struct lb_env *env,
WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
}

+static void ilb_save_stats(struct lb_env *env,
+ struct sched_domain_shared *sd_share,
+ struct sd_lb_stats *sds)
+{
+ if (!sched_feat(ILB_SNAPSHOT))
+ return;
+
+ if (!sd_share)
+ return;
+
+ /* newidle balance is too frequent */
+ if (env->idle == CPU_NEWLY_IDLE)
+ return;
+
+ if (sds->total_load != sd_share->total_load)
+ WRITE_ONCE(sd_share->total_load, sds->total_load);
+
+ if (sds->total_capacity != sd_share->total_capacity)
+ WRITE_ONCE(sd_share->total_capacity, sds->total_capacity);
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -10140,6 +10161,7 @@ static void update_idle_cpu_scan(struct lb_env *env,

static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
+ struct sched_domain_shared *sd_share = env->sd->shared;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
@@ -10209,6 +10231,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}

update_idle_cpu_scan(env, sum_util);
+
+ /* save a snapshot of stats during periodic load balance */
+ ilb_save_stats(env, sd_share, sds);
}

/**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..3cb71c8cddc0 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)

SCHED_FEAT(ALT_PERIOD, true)
SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(ILB_SNAPSHOT, true)
--
2.25.1


2023-07-27 08:20:46

by Chen Yu

[permalink] [raw]
Subject: [RFC PATCH 2/7] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups

Record the number of sched groups within each sched domain. This
prepares for the newidle_balance() scan depth calculation introduced
by ILB_UTIL.

Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched/topology.h | 1 +
kernel/sched/topology.c | 10 ++++++++--
2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 67b573d5bf28..c07f2f00317a 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -152,6 +152,7 @@ struct sched_domain {
struct sched_domain_shared *shared;

unsigned int span_weight;
+ unsigned int nr_groups;
/*
* Span of all CPUs in this domain.
*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 64212f514765..56dc564fc9a3 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
struct cpumask *covered = sched_domains_tmpmask;
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
- int i;
+ int i, nr_groups = 0;

cpumask_clear(covered);

@@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
if (!sg)
goto fail;

+ nr_groups++;
+
sg_span = sched_group_span(sg);
cpumask_or(covered, covered, sg_span);

@@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
last->next = first;
}
sd->groups = first;
+ sd->nr_groups = nr_groups;

return 0;

@@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
struct sd_data *sdd = sd->private;
const struct cpumask *span = sched_domain_span(sd);
struct cpumask *covered;
- int i;
+ int i, nr_groups = 0;

lockdep_assert_held(&sched_domains_mutex);
covered = sched_domains_tmpmask;
@@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)

sg = get_group(i, sdd);

+ nr_groups++;
+
cpumask_or(covered, covered, sched_group_span(sg));

if (!first)
@@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
}
last->next = first;
sd->groups = first;
+ sd->nr_groups = nr_groups;

return 0;
}
--
2.25.1