[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded.
The following histogram shows the time spent in select_idle_cpu()
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:
@usecs:
[0] 533 | |
[1] 5495 | |
[2, 4) 12008 | |
[4, 8) 239252 | |
[8, 16) 4041924 |@@@@@@@@@@@@@@ |
[16, 32) 12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128) 13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128, 256) 8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 4507667 |@@@@@@@@@@@@@@@ |
[512, 1K) 2600472 |@@@@@@@@@ |
[1K, 2K) 927912 |@@@ |
[2K, 4K) 218720 | |
[4K, 8K) 98161 | |
[8K, 16K) 37722 | |
[16K, 32K) 6715 | |
[32K, 64K) 477 | |
[64K, 128K) 7 | |
netperf latency usecs:
=======
case load Lat_99th std%
TCP_RR thread-224 257.39 ( 0.21)
The time spent in select_idle_cpu() is visible to netperf and might have a negative
impact.
[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. The indicators are copied here:
SIS Search Efficiency(se_eff%):
A ratio expressed as a percentage of runqueues scanned versus
idle CPUs found. A 100% efficiency indicates that the target,
prev or recent CPU of a task was idle at wakeup. The lower the
efficiency, the more runqueues were scanned before an idle CPU
was found.
SIS Domain Search Efficiency(dom_eff%):
Similar, except only for the slower SIS
path.
SIS Fast Success Rate(fast_rate%):
Percentage of SIS that used target, prev or
recent CPUs.
SIS Success rate(success_rate%):
Percentage of scans that found an idle CPU.
The test is based on Aubrey's schedtests tool: netperf, hackbench,
schbench and tbench were launched with 25%, 50%, 75%, 100%, 125%, 150%,
175% and 200% of the CPU number respectively. Each test lasts for 100
seconds and repeats 3 times. The system reboots into a fresh environment
for each test.
Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case load se_eff% dom_eff% fast_rate% success_rate%
TCP_RR 28 threads 99.978 18.535 99.995 100.000
TCP_RR 56 threads 99.397 5.671 99.964 100.000
TCP_RR 84 threads 21.721 6.818 73.632 100.000
TCP_RR 112 threads 12.500 5.533 59.000 100.000
TCP_RR 140 threads 8.524 4.535 49.020 100.000
TCP_RR 168 threads 6.438 3.945 40.309 99.999
TCP_RR 196 threads 5.397 3.718 32.320 99.982
TCP_RR 224 threads 4.874 3.661 25.775 99.767
UDP_RR 28 threads 99.988 17.704 99.997 100.000
UDP_RR 56 threads 99.528 5.977 99.970 100.000
UDP_RR 84 threads 24.219 6.992 76.479 100.000
UDP_RR 112 threads 13.907 5.706 62.538 100.000
UDP_RR 140 threads 9.408 4.699 52.519 100.000
UDP_RR 168 threads 7.095 4.077 44.352 100.000
UDP_RR 196 threads 5.757 3.775 35.764 99.991
UDP_RR 224 threads 5.124 3.704 28.748 99.860
schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case load se_eff% dom_eff% fast_rate% success_rate%
normal 1 mthread 99.152 6.400 99.941 100.000
normal 2 mthreads 97.844 4.003 99.908 100.000
normal 3 mthreads 96.395 2.118 99.917 99.998
normal 4 mthreads 55.288 1.451 98.615 99.804
normal 5 mthreads 7.004 1.870 45.597 61.036
normal 6 mthreads 3.354 1.346 20.777 34.230
normal 7 mthreads 2.183 1.028 11.257 21.055
normal 8 mthreads 1.653 0.825 7.849 15.549
schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case load se_eff% dom_eff% fast_rate% success_rate%
process-pipe 1 group 99.991 7.692 99.999 100.000
process-pipe 2 groups 99.934 4.615 99.997 100.000
process-pipe 3 groups 99.597 3.198 99.987 100.000
process-pipe 4 groups 98.378 2.464 99.958 100.000
process-pipe 5 groups 27.474 3.653 89.811 99.800
process-pipe 6 groups 20.201 4.098 82.763 99.570
process-pipe 7 groups 16.423 4.156 77.398 99.316
process-pipe 8 groups 13.165 3.920 72.232 98.828
process-sockets 1 group 99.977 5.882 99.999 100.000
process-sockets 2 groups 99.927 5.505 99.996 100.000
process-sockets 3 groups 99.397 3.250 99.980 100.000
process-sockets 4 groups 79.680 4.258 98.864 99.998
process-sockets 5 groups 7.673 2.503 63.659 92.115
process-sockets 6 groups 4.642 1.584 58.946 88.048
process-sockets 7 groups 3.493 1.379 49.816 81.164
process-sockets 8 groups 3.015 1.407 40.845 75.500
threads-pipe 1 group 99.997 0.000 100.000 100.000
threads-pipe 2 groups 99.894 2.932 99.997 100.000
threads-pipe 3 groups 99.611 4.117 99.983 100.000
threads-pipe 4 groups 97.703 2.624 99.937 100.000
threads-pipe 5 groups 22.919 3.623 87.150 99.764
threads-pipe 6 groups 18.016 4.038 80.491 99.557
threads-pipe 7 groups 14.663 3.991 75.239 99.247
threads-pipe 8 groups 12.242 3.808 70.651 98.644
threads-sockets 1 group 99.990 6.667 99.999 100.000
threads-sockets 2 groups 99.940 5.114 99.997 100.000
threads-sockets 3 groups 99.469 4.115 99.977 100.000
threads-sockets 4 groups 87.528 4.038 99.400 100.000
threads-sockets 5 groups 6.942 2.398 59.244 88.337
threads-sockets 6 groups 4.359 1.954 49.448 87.860
threads-sockets 7 groups 2.845 1.345 41.198 77.102
threads-sockets 8 groups 2.871 1.404 38.512 74.312
schedstat_parse.py -f tbench_vanilla.log
case load se_eff% dom_eff% fast_rate% success_rate%
loopback 28 threads 99.976 18.369 99.995 100.000
loopback 56 threads 99.222 7.799 99.934 100.000
loopback 84 threads 19.723 6.819 70.215 100.000
loopback 112 threads 11.283 5.371 55.371 99.999
loopback 140 threads 0.000 0.000 0.000 0.000
loopback 168 threads 0.000 0.000 0.000 0.000
loopback 196 threads 0.000 0.000 0.000 0.000
loopback 224 threads 0.000 0.000 0.000 0.000
According to the tests above, when the system becomes busy, the
SIS Search Efficiency (se_eff%) drops significantly. Although some
benchmarks eventually find an idle CPU (success_rate% = 100%), it is
doubtful whether searching the whole LLC domain is worth it.
[Proposal]
It would be ideal to have a crystal ball to answer this question:
How many CPUs must a wakeup path walk down, before it can find an idle
CPU? Many potential metrics could be used to predict the number.
One candidate is the sum of util_avg in this LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics
(such as rq->nr_running). Besides, choosing the sum of util_avg
would help predict the load of the LLC domain more precisely, because
SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
time. As Peter suggested[2], the lower the util_avg is, the
more select_idle_cpu() should scan for an idle CPU, and vice versa.
Introduce the quadratic function:
y = a - bx^2
x is the sum_util ratio [0, 100] of this LLC domain, and y is the percentage
of CPUs to be scanned in the LLC domain. The number of CPUs to search drops
as sum_util increases. When sum_util hits 85% or above, the scan stops.
85% is chosen because it is the threshold of an overloaded LLC sched group
(imbalance_pct = 117). A quadratic function is chosen because:
[1] Compared to a linear function, it scans more aggressively when the
sum_util is low.
[2] Compared to an exponential function, it is easier to calculate.
[3] There seems to be no accurate mapping between the sum of util_avg
and the number of CPUs to be scanned, so use a heuristic scan for now.
The steps to calculate scan_nr are as follows:
[1] scan_percent = 100 - (x/8.5)^2
    when utilization reaches 85%, scan_percent becomes 0.
[2] scan_nr = nr_llc * scan_percent / 100
[3] scan_nr = max(scan_nr, 0)
For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25   35   45   55   65   75   85 ...
scan_nr   112  112  108  103   92   80   64   47   24    0 ...
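The table can be reproduced with a few lines of user-space C; this is
only a sketch of the integer math above (the truncated 72 stands in for
8.5^2 = 72.25, as in the patch), not kernel code:

#include <stdio.h>

int main(void)
{
	int nr_llc = 112;	/* CPUs per LLC on the test platform */
	int pcts[] = { 0, 5, 15, 25, 35, 45, 55, 65, 75, 85 };

	for (int i = 0; i < 10; i++) {
		int x = pcts[i];
		int scan_percent = 100 - x * x / 72;		/* step [1] */
		int scan_nr = nr_llc * scan_percent / 100;	/* step [2] */

		if (scan_nr < 0)	/* step [3]: scan_nr = max(scan_nr, 0) */
			scan_nr = 0;
		printf("sum_util=%3d%% scan_nr=%d\n", x, scan_nr);
	}
	return 0;
}

This prints exactly the scan_nr row above: 112 112 108 103 92 80 64 47 24 0.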
Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112ms would decay
to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75% (three full 32ms PELT half-lives
plus roughly half of a fourth, since 0.5^0.5 ~= 0.7), thus bringing a
delay in reflecting the latest utilization. But it is a trade-off.
Checking the util_avg in newidle load balance would be more frequent,
but it brings overhead - multiple CPUs writing/reading the per-LLC shared
variable introduces cache false sharing. And Tim also mentioned that
it is acceptable to be non-optimal for short-term variations; if there
is a long-term trend in the load behavior, the scheduler can adjust
for that.
SIS_UTIL is disabled by default. When it is enabled, the select_idle_cpu()
will use the nr_scan calculated by SIS_UTIL instead of the one from
SIS_PROP. Later SIS_UTIL and SIS_PROP could be made mutually exclusive.
[Test result]
The following is the benchmark result comparison between
baseline:vanilla and compare:patched kernel. Positive compare%
indicates better performance.
netperf.throughput
each thread: netperf -4 -H 127.0.0.1 -t TCP/UDP_RR -c -C -l 100
=======
case load baseline(std%) compare%( std%)
TCP_RR 28 threads 1.00 ( 0.40) +1.14 ( 0.37)
TCP_RR 56 threads 1.00 ( 0.49) +0.62 ( 0.31)
TCP_RR 84 threads 1.00 ( 0.50) +0.26 ( 0.55)
TCP_RR 112 threads 1.00 ( 0.27) +0.29 ( 0.28)
TCP_RR 140 threads 1.00 ( 0.22) +0.14 ( 0.23)
TCP_RR 168 threads 1.00 ( 0.21) +0.40 ( 0.19)
TCP_RR 196 threads 1.00 ( 0.18) +183.40 ( 16.43)
TCP_RR 224 threads 1.00 ( 0.16) +188.44 ( 9.29)
UDP_RR 28 threads 1.00 ( 0.47) +1.45 ( 0.47)
UDP_RR 56 threads 1.00 ( 0.28) -0.22 ( 0.30)
UDP_RR 84 threads 1.00 ( 0.38) +1.72 ( 27.10)
UDP_RR 112 threads 1.00 ( 0.16) +0.01 ( 0.18)
UDP_RR 140 threads 1.00 ( 14.10) +0.32 ( 11.15)
UDP_RR 168 threads 1.00 ( 12.75) +0.91 ( 11.62)
UDP_RR 196 threads 1.00 ( 14.41) +191.97 ( 19.34)
UDP_RR 224 threads 1.00 ( 15.34) +194.88 ( 17.06)
Take the 224-thread case as an example; the SIS search metric changes are
illustrated below:
vanilla patched
4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg
38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg
128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg
5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg
There are 87.9% fewer CPU scans after the patch is applied, which indicates
lower overhead. Besides, with this patch applied, there is 13% less rq lock
contention in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This could help explain the performance improvement: the patch allows the
waking task to remain on its previous CPU, rather than grabbing another
CPU's lock.
Other benchmarks:
hackbench.throughput
=========
case load baseline(std%) compare%( std%)
process-pipe 1 group 1.00 ( 0.09) -0.54 ( 0.82)
process-pipe 2 groups 1.00 ( 0.47) +0.89 ( 0.61)
process-pipe 4 groups 1.00 ( 0.83) +0.90 ( 0.15)
process-pipe 8 groups 1.00 ( 0.09) +0.31 ( 0.07)
process-sockets 1 group 1.00 ( 0.13) -0.58 ( 0.49)
process-sockets 2 groups 1.00 ( 0.41) -0.58 ( 0.52)
process-sockets 4 groups 1.00 ( 0.61) -0.37 ( 0.50)
process-sockets 8 groups 1.00 ( 0.22) +1.15 ( 0.10)
threads-pipe 1 group 1.00 ( 0.35) -0.28 ( 0.78)
threads-pipe 2 groups 1.00 ( 0.65) +0.03 ( 0.96)
threads-pipe 4 groups 1.00 ( 0.43) +0.81 ( 0.38)
threads-pipe 8 groups 1.00 ( 0.11) -1.56 ( 0.07)
threads-sockets 1 group 1.00 ( 0.30) -0.39 ( 0.41)
threads-sockets 2 groups 1.00 ( 0.21) -0.23 ( 0.27)
threads-sockets 4 groups 1.00 ( 0.23) +0.36 ( 0.19)
threads-sockets 8 groups 1.00 ( 0.13) +1.57 ( 0.06)
tbench.throughput
======
case load baseline(std%) compare%( std%)
loopback 28 threads 1.00 ( 0.15) +1.05 ( 0.08)
loopback 56 threads 1.00 ( 0.09) +0.36 ( 0.04)
loopback 84 threads 1.00 ( 0.12) +0.26 ( 0.06)
loopback 112 threads 1.00 ( 0.12) +0.04 ( 0.09)
loopback 140 threads 1.00 ( 0.04) +2.98 ( 0.18)
loopback 168 threads 1.00 ( 0.10) +2.88 ( 0.30)
loopback 196 threads 1.00 ( 0.06) +2.63 ( 0.03)
loopback 224 threads 1.00 ( 0.08) +2.60 ( 0.06)
schbench.latency_90%_us
========
case load baseline compare%
normal 1 mthread 1.00 -1.7%
normal 2 mthreads 1.00 +1.6%
normal 4 mthreads 1.00 +1.4%
normal 8 mthreads 1.00 +21.0%
Limitations:
[1]
This patch is based on the util_avg, which is very sensitive to the CPU
frequency invariance. The util_avg would decay quite fast when the
CPU is idle, if the max frequency has been limited by the user.
Patch [3] should be applied if turbo is disabled manually on Intel
platforms.
[2]
There may be unbalanced tasks among CPUs due to CPU affinity. For example,
suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to
CPU0~CPU6, while CPU7 is idle:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
util_avg 1024 1024 1024 1024 1024 1024 1024 0
Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
select_idle_cpu() will not scan, and CPU7 thus goes undetected.
A possible workaround is to increase nr_scan if idle CPUs are detected
during periodic load balance. The problem could also be mitigated by idle
load balance, in which CPU7 might pull some tasks to itself. (A numeric
sketch of this blind spot follows limitation [3] below.)
[3]
Prateek mentioned that we should scan aggressively in an LLC domain
with 16 CPUs, because the cost of searching for an idle CPU among 16 is
negligible. The current patch aims to propose a generic solution and only
considers the util_avg. A follow-up change could enhance the scan policy
to adjust the scan_percent according to the number of CPUs in the LLC.
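To make the blind spot in limitation [2] concrete, the following is a
small user-space sketch (plain C with the same integer arithmetic as the
patch; nothing here is kernel code) that plugs the 8-CPU affinity example
into the scan formula:

#include <stdio.h>

int main(void)
{
	int nr_llc = 8;
	unsigned long sum_util = 7 * 1024;	/* CPU0..CPU6 fully busy, CPU7 idle */
	/* utilization ratio of the LLC: 7168 * 100 / 8192 = 87 (87.5 truncated) */
	int pct = sum_util * 100 / (nr_llc * 1024);
	int scan_percent = 100 - pct * pct / 72;	/* 100 - 105 = -5 */
	int nr_scan = nr_llc * scan_percent / 100;

	if (nr_scan < 0)	/* the scan_nr = max(scan_nr, 0) step */
		nr_scan = 0;
	/* prints "pct=87 nr_scan=0": the idle CPU7 is never reached */
	printf("pct=%d nr_scan=%d\n", pct, nr_scan);
	return 0;
}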
v2->v3:
- Use 85% as the threshold again, because the CPU frequency invariance issue
has been fixed and the patch is queued for 5.19.
- Stop the scan if 85% is reached, rather than scanning for at least 4 CPUs.
According to the feedback from Yicong, it might be better to stop scanning
entirely when the LLC is overloaded.
- Replace the linear scan with a quadratic function scan, to let SIS scan
aggressively when the LLC is not busy. Prateek mentioned there was a slight
regression in ycsb-mongodb with v2, which might be due to fewer CPUs being
scanned when the utilization is around 20%.
- Add back the logic to stop the CPU scan even if has_idle_core is true.
It might be a waste of time to search for an idle core if the LLC is
overloaded. Besides, according to the tbench result from Prateek, stopping
the idle core scan brings extra performance improvement.
- Provide the SIS search statistics in the commit log, based on Mel Gorman's
patch, as suggested by Abel.
- Introduce the SIS_UTIL sched feature rather than changing the logic of
SIS_PROP directly, so it can be reviewed more easily.
v1->v2:
- As suggested by Peter, introduce an idle CPU scan strategy based on the
util_avg metric. When util_avg is very low it scans more, and when util_avg
hits the threshold the scan stops entirely. The threshold has been decreased
from 85% to 50%, because 50% is roughly what util_avg reports when the CPU
is nearly 100% busy but turbo is disabled. Scan for at least 4 CPUs even
when the LLC is overloaded, to keep it consistent with the current logic of
select_idle_cpu().
v1:
- Stop scanning for an idle CPU in select_idle_cpu() if the sum of util_avg
in the LLC domain has reached 85%.
[Resend to include the missing mailing list, sorry for any inconvenience.]
Link: https://lore.kernel.org/lkml/[email protected] #1
Link: https://lore.kernel.org/lkml/[email protected] #2
Link: https://lore.kernel.org/lkml/[email protected] #3
Suggested-by: Tim Chen <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
3 files changed, 58 insertions(+)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..816df6cc444e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,7 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ int nr_idle_scan;
};
struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23c7d0f617ee..50c9d5b2b338 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6327,6 +6327,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
+ struct sched_domain_shared *sd_share;
struct rq *this_rq = this_rq();
int this = smp_processor_id();
struct sched_domain *this_sd;
@@ -6366,6 +6367,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
time = cpu_clock(this);
}
+ if (sched_feat(SIS_UTIL)) {
+ sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
+ if (sd_share) {
+ /* because !--nr is the condition to stop scan */
+ nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
+ /* overloaded LLC is unlikely to have idle cpu/core */
+ if (nr == 1)
+ return -1;
+ }
+ }
+
for_each_cpu_wrap(cpu, cpus, target + 1) {
if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -9267,6 +9279,46 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
return idlest;
}
+static inline void update_idle_cpu_scan(struct lb_env *env,
+ unsigned long sum_util)
+{
+ struct sched_domain_shared *sd_share;
+ int nr_scan, nr_llc, llc_util_pct;
+
+ if (!sched_feat(SIS_UTIL))
+ return;
+ /*
+ * Update the number of CPUs to scan in LLC domain, which could
+ * be used as a hint in select_idle_cpu(). The update of this hint
+ * occurs during periodic load balancing, rather than frequent
+ * newidle balance.
+ */
+ nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
+ if (env->idle == CPU_NEWLY_IDLE ||
+ env->sd->span_weight != nr_llc)
+ return;
+
+ sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
+ if (!sd_share)
+ return;
+
+ /*
+ * The number of CPUs to search drops as sum_util increases, when
+ * sum_util hits 85% or above, the scan stops.
+ * The reason to choose 85% as the threshold is because this is the
+ * imbalance_pct when a LLC sched group is overloaded.
+ * let y = 100 - (x/8.5)^2 = 100 - x^2/72
+ * y is the percentage of CPUs to be scanned in the LLC
+ * domain, x is the ratio of sum_util compared to the
+ * CPU capacity, which ranges in [0, 100], thus
+ * nr_scan = nr_llc * y / 100
+ */
+ llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
+ nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
+ nr_scan = max(nr_scan, 0);
+ WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -9279,6 +9331,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
+ unsigned long sum_util = 0;
int sg_status = 0;
do {
@@ -9311,6 +9364,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
+ sum_util += sgs->group_util;
sg = sg->next;
} while (sg != env->sd->groups);
@@ -9336,6 +9390,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
}
+
+ update_idle_cpu_scan(env, sum_util);
}
#define NUMA_IMBALANCE_MIN 2
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1cf435bbcd9c..69be099019f4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
* When doing wakeups, attempt to limit superfluous scans of the LLC domain.
*/
SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_UTIL, false)
/*
* Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1
On Fri, May 06, 2022 at 10:38:39AM +0200, Peter Zijlstra wrote:
>
> So in general I'm liking this.
>
Thanks Peter for the review.
> But we appear to be in the position where there's two patches touching
> this code, the other by Abel Wu.
>
> However, I'm thinking that this patch is the simpler one, and parts of
> Wu's patch can be done on top of this.
>
> Specifically, Wu's patch reminds me of some of my earlier experiments
> that added an additional cpumask to the SIS path, and there I found that
> the cache-miss from accessing an additional mask was hurting enough to
> negate a lot of the benefit it brought.
>
I'll look at Abel Wu's proposal to see if we can collaborate on that.
> On Fri, Apr 29, 2022 at 02:24:42AM +0800, Chen Yu wrote:
>
> > [..snip..]
> >
> > SIS_UTIL is disabled by default. When it is enabled, the select_idle_cpu()
> > will use the nr_scan calculated by SIS_UTIL instead of the one from
> > SIS_PROP. Later SIS_UTIL and SIS_PROP could be made mutually exclusive.
>
> This; I'm thinking that it should be default-enabled, otherwise what's
> the point. And ideally we'd remove SIS_PROP after a few releases if this
> works out.
>
Ok, will change it to enabled by default.
> A few more notes below.. mostly minor nits.
>
> > [..snip..]
> >
> > +static inline void update_idle_cpu_scan(struct lb_env *env,
> > + unsigned long sum_util)
> > +{
> > + struct sched_domain_shared *sd_share;
> > + int nr_scan, nr_llc, llc_util_pct;
> > +
> > + if (!sched_feat(SIS_UTIL))
> > + return;
> > + /*
> > + * Update the number of CPUs to scan in LLC domain, which could
> > + * be used as a hint in select_idle_cpu(). The update of this hint
> > + * occurs during periodic load balancing, rather than frequent
> > + * newidle balance.
> > + */
> > + nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
> > + if (env->idle == CPU_NEWLY_IDLE ||
> > + env->sd->span_weight != nr_llc)
> > + return;
> > +
> > + sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
> > + if (!sd_share)
> > + return;
> > +
> > + /*
> > + * The number of CPUs to search drops as sum_util increases, when
> > + * sum_util hits 85% or above, the scan stops.
> > + * The reason to choose 85% as the threshold is because this is the
> > + * imbalance_pct when a LLC sched group is overloaded.
> > + * let y = 100 - (x/8.5)^2 = 100 - x^2/72
> > + * y is the percentage of CPUs to be scanned in the LLC
> > + * domain, x is the ratio of sum_util compared to the
> > + * CPU capacity, which ranges in [0, 100], thus
> > + * nr_scan = nr_llc * y / 100
> > + */
> > + llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> > + nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> > + nr_scan = max(nr_scan, 0);
>
> The comment and code don't nicely match up, which makes it much harder
> to read than is needed. Given the comment, I would expect the code to
> look a little like:
>
> x = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> y = 100 - (x*x)/72;
> nr_scan = max(0, (nr_llc * y) / 100);
>
> Now a pet peeve of mine, computers *SUCK* at '100', that's just not a
> good number for them, and AFAICT there's no reason what*so*ever to use
> 100 here, we're free to pick any random number. So lets be silly and
> pick SCHED_CAPACITY_SCALE :-)
>
> x = sum_util / nr_llc;
> y = SCHED_CAPACITY_SCALE - (x*x)/737;
> nr_scan = max(0, (nr_llc * y) / SCHED_CAPACITY_SCALE);
>
> How's that?
>
Ok, will change it to this. I recalculated the parameter from scratch:
using 72.25 instead of 72, the parameter becomes 740. I'll run some tests
based on this formula.
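As a quick user-space sanity check of the constant (a sketch of the
suggested formula, not the final kernel code): requiring y = 0 at the 85%
cutoff x0 = 0.85 * 1024 ~= 870 gives c = x0^2 / SCHED_CAPACITY_SCALE =
0.7225 * 1024 ~= 740, while carrying over the truncated 72 from the
percent version gives ~737:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024

int main(void)
{
	int nr_llc = 112;
	int pcts[] = { 0, 5, 15, 25, 35, 45, 55, 65, 75, 85 };

	for (int i = 0; i < 10; i++) {
		long sum_util = (long)pcts[i] * nr_llc * SCHED_CAPACITY_SCALE / 100;
		long x = sum_util / nr_llc;	/* average util per CPU, in [0, 1024] */
		long y = SCHED_CAPACITY_SCALE - x * x / 740;
		long nr_scan = nr_llc * y / SCHED_CAPACITY_SCALE;

		if (nr_scan < 0)
			nr_scan = 0;
		printf("sum_util=%3d%% nr_scan=%ld\n", pcts[i], nr_scan);
	}
	return 0;
}

For nr_llc = 112 this prints 112 111 108 102 93 80 65 46 24 0, within one
CPU of the percent-based table in the changelog (the truncation points
differ slightly).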
> (In general, forget about percentages; that's just a daft human
> convention because humans like 100 due to having 10 digits or something,
> fractions work on any base.)
>
Thanks for the guidance.
> > + WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> > +}
> > [..snip..]
> >
> > SCHED_FEAT(SIS_PROP, true)
> > +SCHED_FEAT(SIS_UTIL, false)
>
> As said above, flip those. Otherwise there's no point in merging this,
> nobody will use it.
Ok, will do in next version.
Thanks,
Chenyu
On Tue, May 10, 2022 at 08:41:57PM +0800, Yicong Yang wrote:
> On 2022/4/29 2:24, Chen Yu wrote:
> > @@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> > * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> > */
> > SCHED_FEAT(SIS_PROP, true)
> > +SCHED_FEAT(SIS_UTIL, false)
> >
>
> I see you mentioned they're mutually exclusive in the commit, worth a comment here?
>
Yes, previously I thought they could be made mutually exclusive, but Peter
has suggested that we make SIS_UTIL enabled by default, so that later we
could remove SIS_PROP if SIS_UTIL behaves stably. So I assume there is no
need to add a comment in the next version.
> One minor question: nr is updated in load balance, so there may be a delay
> because of the interval of load balancing.
Yes, this is a good question. The default interval between two load
balances is sd_weight ms, which is 112ms in my case. This interval is a
trade-off to reduce cache contention. Besides, every first idle CPU, or
the balance CPU of a sched group within the LLC domain, has the chance to
launch a periodic load balance; for example, although CPU0's and CPU1's
periodic load balances are both triggered every 112ms, CPU1 could help
launch the load balance when CPU0 is not doing the load balance work.
Considering there are many CPUs in an LLC domain, the 'interval' to launch
the periodic load balance becomes smaller.
> Furthermore, the LLC domain may not be balanced every time
> if the lowest domain is not the LLC, like CLS->LLC. So maybe a bit more
> delay is included.
>
I thought every domain has its chance to launch a load balance; the
difference is that different domains have different intervals. No?
> The test results are fine and as expected. There is an improvement for
> netperf at a heavy load condition, compared to your v2 version.
>
Thanks for the test, would you mind if I add a Tested-by tag?
thanks,
Chenyu
> Thanks,
> Yicong
>
> TCP_RR node 0-1
> threads
> 16 57559.56667 57930.03333 (+0.64%)
> 32 56373 57754.53333 (+2.45%)
> 64 18831.4 46234.76667 (+145.52%)
> 128 15658.9 19620.26667 (+25.30%)
> 256 7959.896667 8869.013333 (+11.42%)
>
> TCP_RR node 0
> threads
> 16 58389.43333 59026.03333 (+1.09%)
> 32 23779.6 51563.33333 (+116.84%)
> 64 20514.56667 23485.63333 (+14.48%)
> 128 8202.49 9205.483333 (+12.23%)
> 256 3843.163333 4304.8 (+12.01%)
>
> tbench4 node 0-1
> 5.18-rc1 patched
> Hmean 1 299.02 ( 0.00%) 307.73 * 2.91%*
> Hmean 2 597.88 ( 0.00%) 619.10 * 3.55%*
> Hmean 4 1207.11 ( 0.00%) 1239.57 * 2.69%*
> Hmean 8 2406.67 ( 0.00%) 2463.63 * 2.37%*
> Hmean 16 4755.52 ( 0.00%) 4979.46 * 4.71%*
> Hmean 32 9449.01 ( 0.00%) 9709.59 * 2.76%*
> Hmean 64 10538.89 ( 0.00%) 10727.86 * 1.79%*
> Hmean 128 13333.84 ( 0.00%) 14580.63 * 9.35%*
> Hmean 256 11735.24 ( 0.00%) 11737.16 ( 0.02%)
>
> tbench4 node 0
> 5.18-rc1 patched
> Hmean 1 302.26 ( 0.00%) 313.43 * 3.70%*
> Hmean 2 603.87 ( 0.00%) 618.56 * 2.43%*
> Hmean 4 1213.91 ( 0.00%) 1249.63 * 2.94%*
> Hmean 8 2469.72 ( 0.00%) 2527.48 * 2.34%*
> Hmean 16 4980.70 ( 0.00%) 5099.62 * 2.39%*
> Hmean 32 9001.88 ( 0.00%) 9730.27 * 8.09%*
> Hmean 64 7032.07 ( 0.00%) 7691.56 * 9.38%*
> Hmean 128 6037.76 ( 0.00%) 6712.86 * 11.18%*
> Hmean 256 8513.83 ( 0.00%) 9117.79 * 7.09%*
>
Hello Chenyu,
Sorry for the delay with analysis.
On 4/28/2022 11:54 PM, Chen Yu wrote:
> [Problem Statement]
> select_idle_cpu() might spend too much time searching for an idle CPU,
> when the system is overloaded.
>
> [..snip..]
>
> [Test result]
>
> The following is the benchmark result comparison between
> baseline:vanilla and compare:patched kernel. Positive compare%
> indicates better performance.
>
> netperf.throughput
> each thread: netperf -4 -H 127.0.0.1 -t TCP/UDP_RR -c -C -l 100
> =======
> case load baseline(std%) compare%( std%)
> TCP_RR 28 threads 1.00 ( 0.40) +1.14 ( 0.37)
> TCP_RR 56 threads 1.00 ( 0.49) +0.62 ( 0.31)
> TCP_RR 84 threads 1.00 ( 0.50) +0.26 ( 0.55)
> TCP_RR 112 threads 1.00 ( 0.27) +0.29 ( 0.28)
> TCP_RR 140 threads 1.00 ( 0.22) +0.14 ( 0.23)
> TCP_RR 168 threads 1.00 ( 0.21) +0.40 ( 0.19)
> TCP_RR 196 threads 1.00 ( 0.18) +183.40 ( 16.43)
> TCP_RR 224 threads 1.00 ( 0.16) +188.44 ( 9.29)
> UDP_RR 28 threads 1.00 ( 0.47) +1.45 ( 0.47)
> UDP_RR 56 threads 1.00 ( 0.28) -0.22 ( 0.30)
> UDP_RR 84 threads 1.00 ( 0.38) +1.72 ( 27.10)
> UDP_RR 112 threads 1.00 ( 0.16) +0.01 ( 0.18)
> UDP_RR 140 threads 1.00 ( 14.10) +0.32 ( 11.15)
> UDP_RR 168 threads 1.00 ( 12.75) +0.91 ( 11.62)
> UDP_RR 196 threads 1.00 ( 14.41) +191.97 ( 19.34)
> UDP_RR 224 threads 1.00 ( 15.34) +194.88 ( 17.06)
>
> Take the 224-thread case as an example; the SIS search metric changes are
> illustrated below:
>
> vanilla patched
> 4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg
> 38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg
> 128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg
> 5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg
>
> There are 87.9% fewer CPU scans after the patch is applied, which indicates
> lower overhead. Besides, with this patch applied, there is 13% less rq lock
> contention in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
> .try_to_wake_up.default_wake_function.woken_wake_function.
> This could help explain the performance improvement: the patch allows the
> waking task to remain on its previous CPU, rather than grabbing another
> CPU's lock.
>
> Other benchmarks:
>
> hackbench.throughput
> =========
> case load baseline(std%) compare%( std%)
> process-pipe 1 group 1.00 ( 0.09) -0.54 ( 0.82)
> process-pipe 2 groups 1.00 ( 0.47) +0.89 ( 0.61)
> process-pipe 4 groups 1.00 ( 0.83) +0.90 ( 0.15)
> process-pipe 8 groups 1.00 ( 0.09) +0.31 ( 0.07)
> process-sockets 1 group 1.00 ( 0.13) -0.58 ( 0.49)
> process-sockets 2 groups 1.00 ( 0.41) -0.58 ( 0.52)
> process-sockets 4 groups 1.00 ( 0.61) -0.37 ( 0.50)
> process-sockets 8 groups 1.00 ( 0.22) +1.15 ( 0.10)
> threads-pipe 1 group 1.00 ( 0.35) -0.28 ( 0.78)
> threads-pipe 2 groups 1.00 ( 0.65) +0.03 ( 0.96)
> threads-pipe 4 groups 1.00 ( 0.43) +0.81 ( 0.38)
> threads-pipe 8 groups 1.00 ( 0.11) -1.56 ( 0.07)
> threads-sockets 1 group 1.00 ( 0.30) -0.39 ( 0.41)
> threads-sockets 2 groups 1.00 ( 0.21) -0.23 ( 0.27)
> threads-sockets 4 groups 1.00 ( 0.23) +0.36 ( 0.19)
> threads-sockets 8 groups 1.00 ( 0.13) +1.57 ( 0.06)
>
> tbench.throughput
> ======
> case load baseline(std%) compare%( std%)
> loopback 28 threads 1.00 ( 0.15) +1.05 ( 0.08)
> loopback 56 threads 1.00 ( 0.09) +0.36 ( 0.04)
> loopback 84 threads 1.00 ( 0.12) +0.26 ( 0.06)
> loopback 112 threads 1.00 ( 0.12) +0.04 ( 0.09)
> loopback 140 threads 1.00 ( 0.04) +2.98 ( 0.18)
> loopback 168 threads 1.00 ( 0.10) +2.88 ( 0.30)
> loopback 196 threads 1.00 ( 0.06) +2.63 ( 0.03)
> loopback 224 threads 1.00 ( 0.08) +2.60 ( 0.06)
>
> schbench.latency_90%_us
> ========
> case load baseline compare%
> normal 1 mthread 1.00 -1.7%
> normal 2 mthreads 1.00 +1.6%
> normal 4 mthreads 1.00 +1.4%
> normal 8 mthreads 1.00 +21.0%
>
> Limitations:
> [1]
> This patch is based on the util_avg, which is very sensitive to the CPU
> frequency invariance. The util_avg would decay quite fast when the
> CPU is idle, if the max frequency has been limited by the user.
> Patch [3] should be applied if turbo is disabled manually on Intel
> platforms.
>
> [2]
> There may be unbalanced tasks among CPUs due to CPU affinity. For example,
> suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to
> CPU0~CPU6, while CPU7 is idle:
>
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
> util_avg 1024 1024 1024 1024 1024 1024 1024 0
>
> Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
> select_idle_cpu() will not scan, and CPU7 thus goes undetected.
Following are the results from a dual socket Zen3 platform (2 x 64C/128T)
running with various NPS configurations:
Following is the NUMA configuration for each NPS mode on the system:
NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.
Node 0: 0-63, 128-191
Node 1: 64-127, 192-255
NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 sockets.
Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255
NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 sockets.
Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255
Kernel versions:
- tip: 5.18-rc1 tip sched/core
- SIS_UTIL: 5.18-rc1 tip sched/core + this patch
When we began testing, tip was at:
commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"
Following are the results from the benchmark:
* - Data points of concern
~~~~~~~~~
hackbench
~~~~~~~~~
NPS1
Test: tip SIS_UTIL
1-groups: 4.64 (0.00 pct) 4.70 (-1.29 pct)
2-groups: 5.38 (0.00 pct) 5.45 (-1.30 pct)
4-groups: 6.15 (0.00 pct) 6.10 (0.81 pct)
8-groups: 7.42 (0.00 pct) 7.42 (0.00 pct)
16-groups: 10.70 (0.00 pct) 11.69 (-9.25 pct) *
NPS2
Test: tip SIS_UTIL
1-groups: 4.70 (0.00 pct) 4.70 (0.00 pct)
2-groups: 5.45 (0.00 pct) 5.46 (-0.18 pct)
4-groups: 6.13 (0.00 pct) 6.05 (1.30 pct)
8-groups: 7.30 (0.00 pct) 7.05 (3.42 pct)
16-groups: 10.30 (0.00 pct) 10.12 (1.74 pct)
NPS4
Test: tip SIS_UTIL
1-groups: 4.60 (0.00 pct) 4.75 (-3.26 pct) *
2-groups: 5.41 (0.00 pct) 5.42 (-0.18 pct)
4-groups: 6.12 (0.00 pct) 6.00 (1.96 pct)
8-groups: 7.22 (0.00 pct) 7.10 (1.66 pct)
16-groups: 10.24 (0.00 pct) 10.11 (1.26 pct)
~~~~~~~~
schbench
~~~~~~~~
NPS 1
#workers: tip SIS_UTIL
1: 29.00 (0.00 pct) 21.00 (27.58 pct)
2: 28.00 (0.00 pct) 28.00 (0.00 pct)
4: 31.50 (0.00 pct) 31.00 (1.58 pct)
8: 42.00 (0.00 pct) 39.00 (7.14 pct)
16: 56.50 (0.00 pct) 54.50 (3.53 pct)
32: 94.50 (0.00 pct) 94.00 (0.52 pct)
64: 176.00 (0.00 pct) 175.00 (0.56 pct)
128: 404.00 (0.00 pct) 394.00 (2.47 pct)
256: 869.00 (0.00 pct) 863.00 (0.69 pct)
512: 58432.00 (0.00 pct) 55424.00 (5.14 pct)
NPS2
#workers: tip SIS_UTIL
1: 26.50 (0.00 pct) 25.00 (5.66 pct)
2: 26.50 (0.00 pct) 25.50 (3.77 pct)
4: 34.50 (0.00 pct) 34.00 (1.44 pct)
8: 45.00 (0.00 pct) 46.00 (-2.22 pct)
16: 56.50 (0.00 pct) 60.50 (-7.07 pct) *
32: 95.50 (0.00 pct) 93.00 (2.61 pct)
64: 179.00 (0.00 pct) 179.00 (0.00 pct)
128: 369.00 (0.00 pct) 376.00 (-1.89 pct)
256: 898.00 (0.00 pct) 903.00 (-0.55 pct)
512: 56256.00 (0.00 pct) 57088.00 (-1.47 pct)
NPS4
#workers: tip SIS_UTIL
1: 25.00 (0.00 pct) 21.00 (16.00 pct)
2: 28.00 (0.00 pct) 24.00 (14.28 pct)
4: 29.50 (0.00 pct) 29.50 (0.00 pct)
8: 41.00 (0.00 pct) 37.50 (8.53 pct)
16: 65.50 (0.00 pct) 64.00 (2.29 pct)
32: 93.00 (0.00 pct) 94.50 (-1.61 pct)
64: 170.50 (0.00 pct) 175.50 (-2.93 pct)
128: 377.00 (0.00 pct) 368.50 (2.25 pct)
256: 867.00 (0.00 pct) 902.00 (-4.03 pct)
512: 58048.00 (0.00 pct) 55488.00 (4.41 pct)
~~~~~~
tbench
~~~~~~
NPS 1
Clients: tip SIS_UTIL
1 443.31 (0.00 pct) 456.19 (2.90 pct)
2 877.32 (0.00 pct) 875.24 (-0.23 pct)
4 1665.11 (0.00 pct) 1647.31 (-1.06 pct)
8 3016.68 (0.00 pct) 2993.23 (-0.77 pct)
16 5374.30 (0.00 pct) 5246.93 (-2.36 pct)
32 8763.86 (0.00 pct) 7878.18 (-10.10 pct) *
64 15786.93 (0.00 pct) 12958.47 (-17.91 pct) *
128 26826.08 (0.00 pct) 26741.14 (-0.31 pct)
256 24207.35 (0.00 pct) 52041.89 (114.98 pct)
512 51740.58 (0.00 pct) 52084.44 (0.66 pct)
1024 51177.82 (0.00 pct) 53126.29 (3.80 pct)
NPS 2
Clients: tip SIS_UTIL
1 449.49 (0.00 pct) 447.96 (-0.34 pct)
2 867.28 (0.00 pct) 869.52 (0.25 pct)
4 1643.60 (0.00 pct) 1625.91 (-1.07 pct)
8 3047.35 (0.00 pct) 2952.82 (-3.10 pct)
16 5340.77 (0.00 pct) 5251.41 (-1.67 pct)
32 10536.85 (0.00 pct) 8843.49 (-16.07 pct) *
64 16543.23 (0.00 pct) 14265.35 (-13.76 pct) *
128 26400.40 (0.00 pct) 25595.42 (-3.04 pct)
256 23436.75 (0.00 pct) 47090.03 (100.92 pct)
512 50902.60 (0.00 pct) 50036.58 (-1.70 pct)
1024 50216.10 (0.00 pct) 50639.74 (0.84 pct)
NPS 4
Clients: tip SIS_UTIL
1 443.82 (0.00 pct) 459.93 (3.62 pct)
2 849.14 (0.00 pct) 882.17 (3.88 pct)
4 1603.26 (0.00 pct) 1629.64 (1.64 pct)
8 2972.37 (0.00 pct) 3003.09 (1.03 pct)
16 5277.13 (0.00 pct) 5234.07 (-0.81 pct)
32 9744.73 (0.00 pct) 9347.90 (-4.07 pct) *
64 15854.80 (0.00 pct) 14180.27 (-10.56 pct) *
128 26116.97 (0.00 pct) 24597.45 (-5.81 pct) *
256 22403.25 (0.00 pct) 47385.09 (111.50 pct)
512 48317.20 (0.00 pct) 49781.02 (3.02 pct)
1024 50445.41 (0.00 pct) 51607.53 (2.30 pct)
~~~~~~
Stream
~~~~~~
- 10 runs
NPS1
tip SIS_UTIL
Copy: 189113.11 (0.00 pct) 188490.27 (-0.32 pct)
Scale: 201190.61 (0.00 pct) 204526.15 (1.65 pct)
Add: 232654.21 (0.00 pct) 234948.01 (0.98 pct)
Triad: 226583.57 (0.00 pct) 228844.43 (0.99 pct)
NPS2
Test: tip SIS_UTIL
Copy: 155347.14 (0.00 pct) 169386.29 (9.03 pct)
Scale: 191701.53 (0.00 pct) 196110.51 (2.29 pct)
Add: 210013.97 (0.00 pct) 221088.45 (5.27 pct)
Triad: 207602.00 (0.00 pct) 218072.52 (5.04 pct)
NPS4
Test: tip SIS_UTIL
Copy: 136421.15 (0.00 pct) 140894.11 (3.27 pct)
Scale: 191217.59 (0.00 pct) 190554.17 (-0.34 pct)
Add: 189229.52 (0.00 pct) 190871.88 (0.86 pct)
Triad: 188052.99 (0.00 pct) 188417.63 (0.19 pct)
- 100 runs
NPS1
Test: tip SIS_UTIL
Copy: 244693.32 (0.00 pct) 232328.05 (-5.05 pct)
Scale: 221874.99 (0.00 pct) 216858.39 (-2.26 pct)
Add: 268363.89 (0.00 pct) 265449.16 (-1.08 pct)
Triad: 260945.24 (0.00 pct) 252240.56 (-3.33 pct)
NPS2
Test: tip SIS_UTIL
Copy: 211262.00 (0.00 pct) 225240.03 (6.61 pct)
Scale: 222493.34 (0.00 pct) 219094.65 (-1.52 pct)
Add: 280277.17 (0.00 pct) 275677.73 (-1.64 pct)
Triad: 265860.49 (0.00 pct) 262584.22 (-1.23 pct)
NPS4
Test: tip SIS_UTIL
Copy: 250171.40 (0.00 pct) 230983.60 (-7.66 pct)
Scale: 222293.56 (0.00 pct) 215984.34 (-2.83 pct)
Add: 279222.16 (0.00 pct) 270402.64 (-3.15 pct)
Triad: 262013.92 (0.00 pct) 254820.60 (-2.74 pct)
~~~~~~~~~~~~
ycsb-mongodb
~~~~~~~~~~~~
NPS1
sched-tip: 303718.33 (var: 1.31)
SIS_UTIL: 303529.33 (var: 0.67) (-0.06%)
NPS2
sched-tip: 304536.33 (var: 2.46)
SIS_UTIL: 303730.33 (var: 1.57) (-0.26%)
NPS4
sched-tip: 301192.33 (var: 1.81)
SIS_UTIL: 300101.33 (var: 0.35) (-0.36%)
~~~~~~~~~~~~~~~~~~
Notes:
- There seems to be some noticeable regression for hackbench
with 16 groups in NPS1 mode.
- There seems to be a regression in tbench for cases with the number
of workers in the range 32-128 (12.5% loaded to 50% loaded).
- tbench reaches saturation early when the system is fully loaded.
This probably shows that the strategy in the initial v1 RFC
works better with our LLC, where the number of CPUs per LLC
is low compared to systems with a unified LLC. Given this is
showing great results for a unified LLC, maybe SIS_PROP and SIS_UTIL
can be enabled based on the size of the LLC.
> [..snip..]
>
> [3]
> Prateek mentioned that we should scan aggressively in an LLC domain
> with 16 CPUs, because the cost of searching for an idle CPU among 16 is
> negligible. The current patch aims to propose a generic solution and only
> considers the util_avg. A follow-up change could enhance the scan policy
> to adjust the scan_percent according to the number of CPUs in the LLC.
Following are some additional numbers I would like to share comparing SIS_PROP and
SIS_UTIL:
o Hackbench with 1 group
With 1 group, following are the chances of SIS_PROP
and SIS_UTIL finding an idle CPU when an idle CPU
exists in LLC:
+-----------------+---------------------------+---------------------------+--------+
| Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU | Count |
+-----------------+---------------------------+---------------------------+--------+
| 1 | 0 | 0 | 66444 |
| 1 | 0 | 1 | 34153 |
| 1 | 1 | 0 | 57204 |
| 1 | 1 | 1 | 119263 |
+-----------------+---------------------------+---------------------------+--------+
SIS_PROP vs no SIS_PROP CPU search stats:
Total time without SIS_PROP: 90653653
Total time with SIS_PROP: 53558942 (-40.92 pct)
Total time saved: 37094711
Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
+--------------+-------+
| CPU Searched | Count |
+--------------+-------+
| 0 | 10520 |
| 1 | 7770 |
| 2 | 11976 |
| 3 | 17554 |
| 4 | 13932 |
| 5 | 15051 |
| 6 | 8398 |
| 7 | 4544 |
| 8 | 3712 |
| 9 | 2337 |
| 10 | 4541 |
| 11 | 1947 |
| 12 | 3846 |
| 13 | 3645 |
| 14 | 2686 |
| 15 | 8390 |
| 16 | 26157 |
+--------------+-------+
- SIS_UTIL might be bailing out too early in some of these cases.
o Hackbench with 16 groups
The success rate looks as follows:
+-----------------+---------------------------+---------------------------+---------+
| Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU | Count |
+-----------------+---------------------------+---------------------------+---------+
| 1 | 0 | 0 | 1313745 |
| 1 | 0 | 1 | 694132 |
| 1 | 1 | 0 | 2888450 |
| 1 | 1 | 1 | 5343065 |
+-----------------+---------------------------+---------------------------+---------+
SIS_PROP vs no SIS_PROP CPU search stats:
Total time without SIS_PROP: 5227299388
Total time with SIS_PROP: 3866575188 (-26.03 pct)
Total time saved: 1360724200
Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
+--------------+---------+
| CPU Searched | Count |
+--------------+---------+
| 0 | 150351 |
| 1 | 105116 |
| 2 | 214291 |
| 3 | 440053 |
| 4 | 914116 |
| 5 | 1757984 |
| 6 | 2410484 |
| 7 | 1867668 |
| 8 | 379888 |
| 9 | 84055 |
| 10 | 55389 |
| 11 | 26795 |
| 12 | 43113 |
| 13 | 24579 |
| 14 | 32896 |
| 15 | 70059 |
| 16 | 150858 |
+--------------+---------+
- SIS_UTIL might be bailing out too early in most of these cases
o tbench with 256 workers
For tbench with 256 threads, SIS_UTIL works great as we have drastically cut down the number
of CPUs to search.
SIS_PROP vs no SIS_PROP CPU search stats:
Total time without SIS_PROP: 64004752959
Total time with SIS_PROP: 34695004390 (-45.79 pct)
Total time saved: 29309748569
Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
+--------------+----------+
| CPU Searched | Count |
+--------------+----------+
| 0 | 500077 |
| 1 | 543865 |
| 2 | 4257684 |
| 3 | 27457498 |
| 4 | 40208673 |
| 5 | 3264358 |
| 6 | 191631 |
| 7 | 24658 |
| 8 | 2469 |
| 9 | 1374 |
| 10 | 2008 |
| 11 | 1300 |
| 12 | 1226 |
| 13 | 1179 |
| 14 | 1631 |
| 15 | 11678 |
| 16 | 7793 |
+--------------+----------+
- This is where SIS_UTIL shines for the tbench case with 256 workers, as it
  is effective at restricting the search space.
o Observations
SIS_PROP seems to have a higher chance of finding an idle CPU than SIS_UTIL
in the case of hackbench with 16 groups. The gap between SIS_PROP and
SIS_UTIL is wider with 16 groups than with 1 group.
Also, SIS_PROP is more aggressive at saving time in the 1-group case than in
the 16-group case.
The bailout from SIS_UTIL is fruitful for tbench with 256 workers, leading to
a massive performance gain in a fully loaded system.
Note: There might be some inaccuracies in the numbers presented for metrics
that directly compare SIS_PROP and SIS_UTIL, as both SIS_PROP and SIS_UTIL
were enabled when gathering these data points and the results from SIS_PROP
were returned from select_idle_cpu(). All the numbers for the above analysis
were gathered in NPS1 mode.
> [..snip..]
--
Thanks and Regards,
Prateek
Hi Prateek,
On Fri, May 13, 2022 at 12:07:00PM +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> Sorry for the delay with analysis.
>
Thanks very much for the detailed test and analysis.
>
> Following are the results from dual socket Zen3 platform (2 x 64C/128T) running with
> various NPS configuration:
May I know if, in all NPS modes, all LLC domains have 16 CPUs?
>
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
>
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over 2 sockets.
>
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over 2 sockets.
>
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  224-239
>     Node 7: 112-127, 240-255
>
> Kernel versions:
> - tip:      5.18-rc1 tip sched/core
> - SIS_UTIL: 5.18-rc1 tip sched/core + this patch
>
> When we began testing, tip was at:
>
> commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"
>
> Following are the results from the benchmark:
>
> * - Data points of concern
>
> ~~~~~~~~~
> hackbench
> ~~~~~~~~~
>
> NPS1
>
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.64 (0.00 pct)         4.70 (-1.29 pct)
>  2-groups:         5.38 (0.00 pct)         5.45 (-1.30 pct)
>  4-groups:         6.15 (0.00 pct)         6.10 (0.81 pct)
>  8-groups:         7.42 (0.00 pct)         7.42 (0.00 pct)
> 16-groups:        10.70 (0.00 pct)        11.69 (-9.25 pct)  *
>
> NPS2
>
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.70 (0.00 pct)         4.70 (0.00 pct)
>  2-groups:         5.45 (0.00 pct)         5.46 (-0.18 pct)
>  4-groups:         6.13 (0.00 pct)         6.05 (1.30 pct)
>  8-groups:         7.30 (0.00 pct)         7.05 (3.42 pct)
> 16-groups:        10.30 (0.00 pct)        10.12 (1.74 pct)
>
> NPS4
>
> Test:                   tip                     SIS_UTIL
>  1-groups:         4.60 (0.00 pct)         4.75 (-3.26 pct)  *
>  2-groups:         5.41 (0.00 pct)         5.42 (-0.18 pct)
>  4-groups:         6.12 (0.00 pct)         6.00 (1.96 pct)
>  8-groups:         7.22 (0.00 pct)         7.10 (1.66 pct)
> 16-groups:        10.24 (0.00 pct)        10.11 (1.26 pct)
>
> ~~~~~~~~
> schbench
> ~~~~~~~~
>
> NPS 1
>
> #workers:        tip                     SIS_UTIL
>   1:         29.00 (0.00 pct)        21.00 (27.58 pct)
>   2:         28.00 (0.00 pct)        28.00 (0.00 pct)
>   4:         31.50 (0.00 pct)        31.00 (1.58 pct)
>   8:         42.00 (0.00 pct)        39.00 (7.14 pct)
>  16:         56.50 (0.00 pct)        54.50 (3.53 pct)
>  32:         94.50 (0.00 pct)        94.00 (0.52 pct)
>  64:        176.00 (0.00 pct)       175.00 (0.56 pct)
> 128:        404.00 (0.00 pct)       394.00 (2.47 pct)
> 256:        869.00 (0.00 pct)       863.00 (0.69 pct)
> 512:      58432.00 (0.00 pct)     55424.00 (5.14 pct)
>
> NPS2
>
> #workers:        tip                     SIS_UTIL
>   1:         26.50 (0.00 pct)        25.00 (5.66 pct)
>   2:         26.50 (0.00 pct)        25.50 (3.77 pct)
>   4:         34.50 (0.00 pct)        34.00 (1.44 pct)
>   8:         45.00 (0.00 pct)        46.00 (-2.22 pct)
>  16:         56.50 (0.00 pct)        60.50 (-7.07 pct)  *
>  32:         95.50 (0.00 pct)        93.00 (2.61 pct)
>  64:        179.00 (0.00 pct)       179.00 (0.00 pct)
> 128:        369.00 (0.00 pct)       376.00 (-1.89 pct)
> 256:        898.00 (0.00 pct)       903.00 (-0.55 pct)
> 512:      56256.00 (0.00 pct)     57088.00 (-1.47 pct)
>
> NPS4
>
> #workers:        tip                     SIS_UTIL
>   1:         25.00 (0.00 pct)        21.00 (16.00 pct)
>   2:         28.00 (0.00 pct)        24.00 (14.28 pct)
>   4:         29.50 (0.00 pct)        29.50 (0.00 pct)
>   8:         41.00 (0.00 pct)        37.50 (8.53 pct)
>  16:         65.50 (0.00 pct)        64.00 (2.29 pct)
>  32:         93.00 (0.00 pct)        94.50 (-1.61 pct)
>  64:        170.50 (0.00 pct)       175.50 (-2.93 pct)
> 128:        377.00 (0.00 pct)       368.50 (2.25 pct)
> 256:        867.00 (0.00 pct)       902.00 (-4.03 pct)
> 512:      58048.00 (0.00 pct)     55488.00 (4.41 pct)
>
> ~~~~~~
> tbench
> ~~~~~~
>
> NPS 1
>
> Clients:         tip                     SIS_UTIL
>    1        443.31 (0.00 pct)       456.19 (2.90 pct)
>    2        877.32 (0.00 pct)       875.24 (-0.23 pct)
>    4       1665.11 (0.00 pct)      1647.31 (-1.06 pct)
>    8       3016.68 (0.00 pct)      2993.23 (-0.77 pct)
>   16       5374.30 (0.00 pct)      5246.93 (-2.36 pct)
>   32       8763.86 (0.00 pct)      7878.18 (-10.10 pct)  *
>   64      15786.93 (0.00 pct)     12958.47 (-17.91 pct)  *
>  128      26826.08 (0.00 pct)     26741.14 (-0.31 pct)
>  256      24207.35 (0.00 pct)     52041.89 (114.98 pct)
>  512      51740.58 (0.00 pct)     52084.44 (0.66 pct)
> 1024      51177.82 (0.00 pct)     53126.29 (3.80 pct)
>
> NPS 2
>
> Clients:         tip                     SIS_UTIL
>    1        449.49 (0.00 pct)       447.96 (-0.34 pct)
>    2        867.28 (0.00 pct)       869.52 (0.25 pct)
>    4       1643.60 (0.00 pct)      1625.91 (-1.07 pct)
>    8       3047.35 (0.00 pct)      2952.82 (-3.10 pct)
>   16       5340.77 (0.00 pct)      5251.41 (-1.67 pct)
>   32      10536.85 (0.00 pct)      8843.49 (-16.07 pct)  *
>   64      16543.23 (0.00 pct)     14265.35 (-13.76 pct)  *
>  128      26400.40 (0.00 pct)     25595.42 (-3.04 pct)
>  256      23436.75 (0.00 pct)     47090.03 (100.92 pct)
>  512      50902.60 (0.00 pct)     50036.58 (-1.70 pct)
> 1024      50216.10 (0.00 pct)     50639.74 (0.84 pct)
>
> NPS 4
>
> Clients:         tip                     SIS_UTIL
>    1        443.82 (0.00 pct)       459.93 (3.62 pct)
>    2        849.14 (0.00 pct)       882.17 (3.88 pct)
>    4       1603.26 (0.00 pct)      1629.64 (1.64 pct)
>    8       2972.37 (0.00 pct)      3003.09 (1.03 pct)
>   16       5277.13 (0.00 pct)      5234.07 (-0.81 pct)
>   32       9744.73 (0.00 pct)      9347.90 (-4.07 pct)  *
>   64      15854.80 (0.00 pct)     14180.27 (-10.56 pct)  *
>  128      26116.97 (0.00 pct)     24597.45 (-5.81 pct)  *
>  256      22403.25 (0.00 pct)     47385.09 (111.50 pct)
>  512      48317.20 (0.00 pct)     49781.02 (3.02 pct)
> 1024      50445.41 (0.00 pct)     51607.53 (2.30 pct)
>
> ~~~~~~
> Stream
> ~~~~~~
>
> - 10 runs
>
> NPS1
>
> Test:            tip                     SIS_UTIL
>  Copy:   189113.11 (0.00 pct)    188490.27 (-0.32 pct)
> Scale:   201190.61 (0.00 pct)    204526.15 (1.65 pct)
>   Add:   232654.21 (0.00 pct)    234948.01 (0.98 pct)
> Triad:   226583.57 (0.00 pct)    228844.43 (0.99 pct)
>
> NPS2
>
> Test:            tip                     SIS_UTIL
>  Copy:   155347.14 (0.00 pct)    169386.29 (9.03 pct)
> Scale:   191701.53 (0.00 pct)    196110.51 (2.29 pct)
>   Add:   210013.97 (0.00 pct)    221088.45 (5.27 pct)
> Triad:   207602.00 (0.00 pct)    218072.52 (5.04 pct)
>
> NPS4
>
> Test:            tip                     SIS_UTIL
>  Copy:   136421.15 (0.00 pct)    140894.11 (3.27 pct)
> Scale:   191217.59 (0.00 pct)    190554.17 (-0.34 pct)
>   Add:   189229.52 (0.00 pct)    190871.88 (0.86 pct)
> Triad:   188052.99 (0.00 pct)    188417.63 (0.19 pct)
>
> - 100 runs
>
> NPS1
>
> Test:            tip                     SIS_UTIL
>  Copy:   244693.32 (0.00 pct)    232328.05 (-5.05 pct)
> Scale:   221874.99 (0.00 pct)    216858.39 (-2.26 pct)
>   Add:   268363.89 (0.00 pct)    265449.16 (-1.08 pct)
> Triad:   260945.24 (0.00 pct)    252240.56 (-3.33 pct)
>
> NPS2
>
> Test:            tip                     SIS_UTIL
>  Copy:   211262.00 (0.00 pct)    225240.03 (6.61 pct)
> Scale:   222493.34 (0.00 pct)    219094.65 (-1.52 pct)
>   Add:   280277.17 (0.00 pct)    275677.73 (-1.64 pct)
> Triad:   265860.49 (0.00 pct)    262584.22 (-1.23 pct)
>
> NPS4
>
> Test:            tip                     SIS_UTIL
>  Copy:   250171.40 (0.00 pct)    230983.60 (-7.66 pct)
> Scale:   222293.56 (0.00 pct)    215984.34 (-2.83 pct)
>   Add:   279222.16 (0.00 pct)    270402.64 (-3.15 pct)
> Triad:   262013.92 (0.00 pct)    254820.60 (-2.74 pct)
>
> ~~~~~~~~~~~~
> ycsb-mongodb
> ~~~~~~~~~~~~
>
> NPS1
>
> sched-tip:      303718.33 (var: 1.31)
> SIS_UTIL:       303529.33 (var: 0.67)    (-0.06%)
>
> NPS2
>
> sched-tip:      304536.33 (var: 2.46)
> SIS_UTIL:       303730.33 (var: 1.57)    (-0.26%)
>
> NPS4
>
> sched-tip:      301192.33 (var: 1.81)
> SIS_UTIL:       300101.33 (var: 0.35)    (-0.36%)
>
> ~~~~~~~~~~~~~~~~~~
>
> Notes:
>
> - There seems to be some noticeable regression for hackbench
>   with 16 groups in NPS1 mode.
Did hackbench use the default fd number (20) in every group? If
this is the case, then there are 16 * 20 * 2 = 640 threads in the
system, which should be overloaded: the search depth might be 4 with
SIS_PROP and 0 with SIS_UTIL, and it is also very likely that SIS_PROP
will not find an idle CPU after searching 4 CPUs. So in theory there
should not be much performance difference with vs. without the patch
applied. But if the fd number is set to a smaller value, the regression
could be explained as you mentioned: SIS_PROP searches more
aggressively.
> - There seems to be regression in tbench for case with number
>   of workers in range 32-128 (12.5% loaded to 50% loaded)
> - tbench reaches saturation early when system is fully loaded
>
> This probably shows that the strategy in the initial v1 RFC
> works better with our LLC, where the number of CPUs per LLC
> is low compared to systems with a unified LLC. Given this is
> showing great results for a unified LLC, maybe SIS_PROP and SIS_UTIL
> can be enabled based on the size of the LLC.
>
Yes, SIS_PROP searches more aggressively, but we are attempting to
replace SIS_PROP with a more accurate policy.
> > [..snip..]
> >
> > [3]
> > Prateek mentioned that we should scan aggressively in an LLC domain
> > with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is
> > negligible. The current patch aims to propose a generic solution and only
> > considers the util_avg. A follow-up change could enhance the scan policy
> > to adjust the scan_percent according to the CPU number in LLC.
>
> Following are some additional numbers I would like to share comparing SIS_PROP and
> SIS_UTIL:
>
Nice analysis.
> o Hackbench with 1 group
>
> With 1 group, following are the chances of SIS_PROP
> and SIS_UTIL finding an idle CPU when an idle CPU
> exists in LLC:
>
> +-----------------+---------------------------+---------------------------+--------+
> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU | Count? |
> +-----------------+---------------------------+---------------------------+--------+
> |??????? 1??????? |???????????? 0???????????? |???????????? 0???????????? | 66444? |
> |??????? 1??????? |???????????? 0???????????? |???????????? 1???????????? | 34153? |
> |??????? 1??????? |???????????? 1???????????? |???????????? 0???????????? | 57204? |
> |??????? 1??????? |???????????? 1???????????? |???????????? 1???????????? | 119263 |
> +-----------------+---------------------------+---------------------------+--------+
>
So SIS_PROP searches more, and gets a higher chance of finding an idle CPU
in an LLC with 16 CPUs.
> SIS_PROP vs no SIS_PROP CPU search stats:
>
> Total time without SIS_PROP: 90653653
> Total time with SIS_PROP: 53558942 (-40.92 pct)
> Total time saved: 37094711
>
What does no SIS_PROP mean? Is it with SIS_PROP disabled and
SIS_UTIL enabled? Or with both SIS_PROP and SIS_UTIL disabled?
If it is the latter, is there any performance difference between
the two?
> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>
> +--------------+-------+
> | CPU Searched | Count |
> +--------------+-------+
> |????? 0?????? | 10520 |
> |????? 1?????? |? 7770 |
> |????? 2?????? | 11976 |
> |????? 3?????? | 17554 |
> |????? 4?????? | 13932 |
> |????? 5?????? | 15051 |
> |????? 6?????? |? 8398 |
> |????? 7?????? |? 4544 |
> |????? 8?????? |? 3712 |
> |????? 9?????? |? 2337 |
> |????? 10????? |? 4541 |
> |????? 11????? |? 1947 |
> |????? 12????? |? 3846 |
> |????? 13????? |? 3645 |
> |????? 14????? |? 2686 |
> |????? 15????? |? 8390 |
> |????? 16????? | 26157 |
> +--------------+-------+
>
> - SIS_UTIL might be bailing out too early in some of these cases.
>
Right.
> o Hackbench with 16 groups
>
> The success rate looks as follows:
>
> +-----------------+---------------------------+---------------------------+---------+
> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU |? Count? |
> +-----------------+---------------------------+---------------------------+---------+
> |??????? 1??????? |???????????? 0???????????? |???????????? 0???????????? | 1313745 |
> |??????? 1??????? |???????????? 0???????????? |???????????? 1???????????? |? 694132 |
> |??????? 1??????? |???????????? 1???????????? |???????????? 0???????????? | 2888450 |
> |??????? 1??????? |???????????? 1???????????? |???????????? 1???????????? | 5343065 |
> +-----------------+---------------------------+---------------------------+---------+
>
> SIS_PROP vs no SIS_PROP CPU search stats:
>
> Total time without SIS_PROP: 5227299388
> Total time with SIS_PROP: 3866575188 (-26.03 pct)
> Total time saved: 1360724200
>
> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>
> +--------------+---------+
> | CPU Searched |? Count? |
> +--------------+---------+
> |????? 0?????? |? 150351 |
> |????? 1?????? |? 105116 |
> |????? 2?????? |? 214291 |
> |????? 3?????? |? 440053 |
> |????? 4?????? |? 914116 |
> |????? 5?????? | 1757984 |
> |????? 6?????? | 2410484 |
> |????? 7?????? | 1867668 |
> |????? 8?????? |? 379888 |
> |????? 9?????? |? 84055? |
> |????? 10????? |? 55389? |
> |????? 11????? |? 26795? |
> |????? 12????? |? 43113? |
> |????? 13????? |? 24579? |
> |????? 14????? |? 32896? |
> |????? 15????? |? 70059? |
> |????? 16????? |? 150858 |
> +--------------+---------+
>
> - SIS_UTIL might be bailing out too early in most of these cases
>
It might be interesting to see what the current sum of util_avg is; this
suggests that, even if util_avg is a little high, it might still be
worthwhile to search more CPUs.
> o tbench with 256 workers
>
> For tbench with 256 threads, SIS_UTIL works great as we have drastically cut down the number
> of CPUs to search.
>
> SIS_PROP vs no SIS_PROP CPU search stats:
>
> Total time without SIS_PROP: 64004752959
> Total time with SIS_PROP: 34695004390 (-45.79 pct)
> Total time saved: 29309748569
>
> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>
> +--------------+----------+
> | CPU Searched |? Count?? |
> +--------------+----------+
> |????? 0?????? |? 500077? |
> |????? 1?????? |? 543865? |
> |????? 2?????? | 4257684? |
> |????? 3?????? | 27457498 |
> |????? 4?????? | 40208673 |
> |????? 5?????? | 3264358? |
> |????? 6?????? |? 191631? |
> |????? 7?????? |? 24658?? |
> |????? 8?????? |?? 2469?? |
> |????? 9?????? |?? 1374?? |
> |????? 10????? |?? 2008?? |
> |????? 11????? |?? 1300?? |
> |????? 12????? |?? 1226?? |
> |????? 13????? |?? 1179?? |
> |????? 14????? |?? 1631?? |
> |????? 15????? |? 11678?? |
> |????? 16????? |?? 7793?? |
> +--------------+----------+
>
> - This is where SIS_UTIL shines for the tbench case with 256 workers, as it
>   is effective at restricting the search space.
>
> o Observations
>
> SIS_PROP seems to have a higher chance of finding an idle CPU than SIS_UTIL
> in the case of hackbench with 16 groups. The gap between SIS_PROP and
> SIS_UTIL is wider with 16 groups than with 1 group.
> Also, SIS_PROP is more aggressive at saving time in the 1-group case than
> in the 16-group case.
>
> The bailout from SIS_UTIL is fruitful for tbench with 256 workers, leading
> to a massive performance gain in a fully loaded system.
>
> Note: There might be some inaccuracies in the numbers presented for metrics
> that directly compare SIS_PROP and SIS_UTIL, as both SIS_PROP and SIS_UTIL
> were enabled when gathering these data points and the results from SIS_PROP
> were returned from select_idle_cpu().
Do you mean the 'CPU Searched' calculated by SIS_PROP was collected with both SIS_UTIL
and SIS_PROP enabled?
> All the numbers for the above analysis were gathered in NPS1 mode.
>
I'm thinking of taking nr_llc into consideration to adjust the search depth,
something like:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd52fc5a034b..39b914599dce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9302,6 +9302,9 @@ static inline void update_idle_cpu_scan(struct lb_env *env,
llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
nr_scan = max(nr_scan, 0);
+ if (nr_llc <= 16 && nr_scan)
+ nr_scan = nr_llc;
+
WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
}
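As a quick illustration of what the clamp changes, using the integer math
from the hunk above (nr_llc = 16 and SCHED_CAPACITY_SCALE = 1024 assumed;
the utilization value below is made up for illustration):

	int nr_llc = 16;
	unsigned long sum_util = 12 * 1024;	/* LLC ~75% utilized */
	int llc_util_pct = sum_util * 100 / (nr_llc * 1024);	/* 75 */
	int nr_scan = (100 - llc_util_pct * llc_util_pct / 72) * nr_llc / 100;
	/* 100 - 5625/72 = 22, so nr_scan = 22 * 16 / 100 = 3 */
	if (nr_llc <= 16 && nr_scan)
		nr_scan = nr_llc;	/* the proposed clamp raises it to 16 */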
I'll offline the CPUs to make it 16 CPUs per LLC, and check how hackbench behaves.
thanks,
Chenyu
Hello Chenyu,
Thank you for taking a look at the results.
On 5/14/2022 4:25 PM, Chen Yu wrote:
> [..snip..]
> May I know if, in all NPS modes, all LLC domains have 16 CPUs?
Yes. The number of CPUs in the LLC domain is always 16 irrespective of the NPS mode.
>> Following is the NUMA configuration for each NPS mode on the system:
>>
>> NPS1: Each socket is a NUMA node.
>> Total 2 NUMA nodes in the dual socket machine.
>>
>> Node 0: 0-63, 128-191
>> Node 1: 64-127, 192-255
>>
>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>> Total 4 NUMA nodes exist over 2 sockets.
>>
>> Node 0: 0-31, 128-159
>> Node 1: 32-63, 160-191
>> Node 2: 64-95, 192-223
>> Node 3: 96-127, 224-255
>>
>> NPS4: Each socket is logically divided into 4 NUMA regions.
>> Total 8 NUMA nodes exist over 2 sockets.
>>
>> Node 0: 0-15, 128-143
>> Node 1: 16-31, 144-159
>> Node 2: 32-47, 160-175
>> Node 3: 48-63, 176-191
>> Node 4: 64-79, 192-207
>> Node 5: 80-95, 208-223
>> Node 6: 96-111, 224-239
>> Node 7: 112-127, 240-255
>>
>> Kernel versions:
>> - tip: 5.18-rc1 tip sched/core
>> - SIS_UTIL: 5.18-rc1 tip sched/core + this patch
>>
>> When we began testing, tip was at:
>>
>> commit: a658353167bf "sched/fair: Revise comment about lb decision matrix"
>>
>> Following are the results from the benchmark:
>>
>> * - Data points of concern
>>
>> ~~~~~~~~~
>> hackbench
>> ~~~~~~~~~
>>
>> NPS1
>>
>> Test: tip SIS_UTIL
>> 1-groups: 4.64 (0.00 pct) 4.70 (-1.29 pct)
>> 2-groups: 5.38 (0.00 pct) 5.45 (-1.30 pct)
>> 4-groups: 6.15 (0.00 pct) 6.10 (0.81 pct)
>> 8-groups: 7.42 (0.00 pct) 7.42 (0.00 pct)
>> 16-groups: 10.70 (0.00 pct) 11.69 (-9.25 pct) *
>>
>> NPS2
>>
>> Test: tip SIS_UTIL
>> 1-groups: 4.70 (0.00 pct) 4.70 (0.00 pct)
>> 2-groups: 5.45 (0.00 pct) 5.46 (-0.18 pct)
>> 4-groups: 6.13 (0.00 pct) 6.05 (1.30 pct)
>> 8-groups: 7.30 (0.00 pct) 7.05 (3.42 pct)
>> 16-groups: 10.30 (0.00 pct) 10.12 (1.74 pct)
>>
>> NPS4
>>
>> Test: tip SIS_UTIL
>> 1-groups: 4.60 (0.00 pct) 4.75 (-3.26 pct) *
>> 2-groups: 5.41 (0.00 pct) 5.42 (-0.18 pct)
>> 4-groups: 6.12 (0.00 pct) 6.00 (1.96 pct)
>> 8-groups: 7.22 (0.00 pct) 7.10 (1.66 pct)
>> 16-groups: 10.24 (0.00 pct) 10.11 (1.26 pct)
>>
>> ~~~~~~~~
>> schbench
>> ~~~~~~~~
>>
>> NPS 1
>>
>> #workers: tip SIS_UTIL
>> 1: 29.00 (0.00 pct) 21.00 (27.58 pct)
>> 2: 28.00 (0.00 pct) 28.00 (0.00 pct)
>> 4: 31.50 (0.00 pct) 31.00 (1.58 pct)
>> 8: 42.00 (0.00 pct) 39.00 (7.14 pct)
>> 16: 56.50 (0.00 pct) 54.50 (3.53 pct)
>> 32: 94.50 (0.00 pct) 94.00 (0.52 pct)
>> 64: 176.00 (0.00 pct) 175.00 (0.56 pct)
>> 128: 404.00 (0.00 pct) 394.00 (2.47 pct)
>> 256: 869.00 (0.00 pct) 863.00 (0.69 pct)
>> 512: 58432.00 (0.00 pct) 55424.00 (5.14 pct)
>>
>> NPS2
>>
>> #workers: tip SIS_UTIL
>> 1: 26.50 (0.00 pct) 25.00 (5.66 pct)
>> 2: 26.50 (0.00 pct) 25.50 (3.77 pct)
>> 4: 34.50 (0.00 pct) 34.00 (1.44 pct)
>> 8: 45.00 (0.00 pct) 46.00 (-2.22 pct)
>> 16: 56.50 (0.00 pct) 60.50 (-7.07 pct) *
>> 32: 95.50 (0.00 pct) 93.00 (2.61 pct)
>> 64: 179.00 (0.00 pct) 179.00 (0.00 pct)
>> 128: 369.00 (0.00 pct) 376.00 (-1.89 pct)
>> 256: 898.00 (0.00 pct) 903.00 (-0.55 pct)
>> 512: 56256.00 (0.00 pct) 57088.00 (-1.47 pct)
>>
>> NPS4
>>
>> #workers: tip SIS_UTIL
>> 1: 25.00 (0.00 pct) 21.00 (16.00 pct)
>> 2: 28.00 (0.00 pct) 24.00 (14.28 pct)
>> 4: 29.50 (0.00 pct) 29.50 (0.00 pct)
>> 8: 41.00 (0.00 pct) 37.50 (8.53 pct)
>> 16: 65.50 (0.00 pct) 64.00 (2.29 pct)
>> 32: 93.00 (0.00 pct) 94.50 (-1.61 pct)
>> 64: 170.50 (0.00 pct) 175.50 (-2.93 pct)
>> 128: 377.00 (0.00 pct) 368.50 (2.25 pct)
>> 256: 867.00 (0.00 pct) 902.00 (-4.03 pct)
>> 512: 58048.00 (0.00 pct) 55488.00 (4.41 pct)
>>
>> ~~~~~~
>> tbench
>> ~~~~~~
>>
>> NPS 1
>>
>> Clients: tip SIS_UTIL
>> 1 443.31 (0.00 pct) 456.19 (2.90 pct)
>> 2 877.32 (0.00 pct) 875.24 (-0.23 pct)
>> 4 1665.11 (0.00 pct) 1647.31 (-1.06 pct)
>> 8 3016.68 (0.00 pct) 2993.23 (-0.77 pct)
>> 16 5374.30 (0.00 pct) 5246.93 (-2.36 pct)
>> 32 8763.86 (0.00 pct) 7878.18 (-10.10 pct) *
>> 64 15786.93 (0.00 pct) 12958.47 (-17.91 pct) *
>> 128 26826.08 (0.00 pct) 26741.14 (-0.31 pct)
>> 256 24207.35 (0.00 pct) 52041.89 (114.98 pct)
>> 512 51740.58 (0.00 pct) 52084.44 (0.66 pct)
>> 1024 51177.82 (0.00 pct) 53126.29 (3.80 pct)
>>
>> NPS 2
>>
>> Clients: tip SIS_UTIL
>> 1 449.49 (0.00 pct) 447.96 (-0.34 pct)
>> 2 867.28 (0.00 pct) 869.52 (0.25 pct)
>> 4 1643.60 (0.00 pct) 1625.91 (-1.07 pct)
>> 8 3047.35 (0.00 pct) 2952.82 (-3.10 pct)
>> 16 5340.77 (0.00 pct) 5251.41 (-1.67 pct)
>> 32 10536.85 (0.00 pct) 8843.49 (-16.07 pct) *
>> 64 16543.23 (0.00 pct) 14265.35 (-13.76 pct) *
>> 128 26400.40 (0.00 pct) 25595.42 (-3.04 pct)
>> 256 23436.75 (0.00 pct) 47090.03 (100.92 pct)
>> 512 50902.60 (0.00 pct) 50036.58 (-1.70 pct)
>> 1024 50216.10 (0.00 pct) 50639.74 (0.84 pct)
>>
>> NPS 4
>>
>> Clients: tip SIS_UTIL
>> 1 443.82 (0.00 pct) 459.93 (3.62 pct)
>> 2 849.14 (0.00 pct) 882.17 (3.88 pct)
>> 4 1603.26 (0.00 pct) 1629.64 (1.64 pct)
>> 8 2972.37 (0.00 pct) 3003.09 (1.03 pct)
>> 16 5277.13 (0.00 pct) 5234.07 (-0.81 pct)
>> 32 9744.73 (0.00 pct) 9347.90 (-4.07 pct) *
>> 64 15854.80 (0.00 pct) 14180.27 (-10.56 pct) *
>> 128 26116.97 (0.00 pct) 24597.45 (-5.81 pct) *
>> 256 22403.25 (0.00 pct) 47385.09 (111.50 pct)
>> 512 48317.20 (0.00 pct) 49781.02 (3.02 pct)
>> 1024 50445.41 (0.00 pct) 51607.53 (2.30 pct)
>>
>> ~~~~~~
>> Stream
>> ~~~~~~
>>
>> - 10 runs
>>
>> NPS1
>>
>> tip SIS_UTIL
>> Copy: 189113.11 (0.00 pct) 188490.27 (-0.32 pct)
>> Scale: 201190.61 (0.00 pct) 204526.15 (1.65 pct)
>> Add: 232654.21 (0.00 pct) 234948.01 (0.98 pct)
>> Triad: 226583.57 (0.00 pct) 228844.43 (0.99 pct)
>>
>> NPS2
>>
>> Test: tip SIS_UTIL
>> Copy: 155347.14 (0.00 pct) 169386.29 (9.03 pct)
>> Scale: 191701.53 (0.00 pct) 196110.51 (2.29 pct)
>> Add: 210013.97 (0.00 pct) 221088.45 (5.27 pct)
>> Triad: 207602.00 (0.00 pct) 218072.52 (5.04 pct)
>>
>> NPS4
>>
>> Test: tip SIS_UTIL
>> Copy: 136421.15 (0.00 pct) 140894.11 (3.27 pct)
>> Scale: 191217.59 (0.00 pct) 190554.17 (-0.34 pct)
>> Add: 189229.52 (0.00 pct) 190871.88 (0.86 pct)
>> Triad: 188052.99 (0.00 pct) 188417.63 (0.19 pct)
>>
>> - 100 runs
>>
>> NPS1
>>
>> Test: tip SIS_UTIL
>> Copy: 244693.32 (0.00 pct) 232328.05 (-5.05 pct)
>> Scale: 221874.99 (0.00 pct) 216858.39 (-2.26 pct)
>> Add: 268363.89 (0.00 pct) 265449.16 (-1.08 pct)
>> Triad: 260945.24 (0.00 pct) 252240.56 (-3.33 pct)
>>
>> NPS2
>>
>> Test: tip SIS_UTIL
>> Copy: 211262.00 (0.00 pct) 225240.03 (6.61 pct)
>> Scale: 222493.34 (0.00 pct) 219094.65 (-1.52 pct)
>> Add: 280277.17 (0.00 pct) 275677.73 (-1.64 pct)
>> Triad: 265860.49 (0.00 pct) 262584.22 (-1.23 pct)
>>
>> NPS4
>>
>> Test: tip SIS_UTIL
>> Copy: 250171.40 (0.00 pct) 230983.60 (-7.66 pct)
>> Scale: 222293.56 (0.00 pct) 215984.34 (-2.83 pct)
>> Add: 279222.16 (0.00 pct) 270402.64 (-3.15 pct)
>> Triad: 262013.92 (0.00 pct) 254820.60 (-2.74 pct)
>>
>> ~~~~~~~~~~~~
>> ycsb-mongodb
>> ~~~~~~~~~~~~
>>
>> NPS1
>>
>> sched-tip: 303718.33 (var: 1.31)
>> SIS_UTIL: 303529.33 (var: 0.67) (-0.06%)
>>
>> NPS2
>>
>> sched-tip: 304536.33 (var: 2.46)
>> SIS_UTIL: 303730.33 (var: 1.57) (-0.26%)
>>
>> NPS4
>>
>> sched-tip: 301192.33 (var: 1.81)
>> SIS_UTIL: 300101.33 (var: 0.35) (-0.36%)
>>
>> ~~~~~~~~~~~~~~~~~~
>>
>> Notes:
>>
>> - There seems to be some noticeable regression for hackbench
>> with 16 groups in NPS1 mode.
> Did hackbench use the default fd number (20) in every group? If
> this is the case, then there are 16 * 20 * 2 = 640 threads in the
> system, which should be overloaded: the search depth might be 4 with
> SIS_PROP and 0 with SIS_UTIL, and it is also very likely that SIS_PROP
> will not find an idle CPU after searching 4 CPUs. So in theory there
> should not be much performance difference with vs. without the patch
> applied. But if the fd number is set to a smaller value, the regression
> could be explained as you mentioned: SIS_PROP searches more
> aggressively.
Yes, I'm using the default fd number (20). The logs from the hackbench run
show that it is indeed running 640 threads with 16 groups:
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 16 groups == 640 threads run
This is indeed counterintuitive and I don't have
a good explanation for it other than that SIS_PROP
is probably having slightly greater success at finding
an idle CPU even in this overloaded environment.
I've run the benchmark in two sets of 3 runs, rebooting
in between, on each kernel version:
- tip
Test: tip-r0 tip-r1 tip-r2
1-groups: 4.64 (0.00 pct) 4.90 (-5.60 pct) 4.99 (-7.54 pct)
2-groups: 5.54 (0.00 pct) 5.56 (-0.36 pct) 5.58 (-0.72 pct)
4-groups: 6.24 (0.00 pct) 6.18 (0.96 pct) 6.20 (0.64 pct)
8-groups: 7.54 (0.00 pct) 7.50 (0.53 pct) 7.54 (0.00 pct)
16-groups: 10.85 (0.00 pct) 11.17 (-2.94 pct) 10.91 (-0.55 pct)
Test: tip-r3 tip-r4 tip-r5
1-groups: 4.68 (0.00 pct) 4.97 (-6.19 pct) 4.98 (-6.41 pct)
2-groups: 5.60 (0.00 pct) 5.62 (-0.35 pct) 5.66 (-1.07 pct)
4-groups: 6.24 (0.00 pct) 6.23 (0.16 pct) 6.24 (0.00 pct)
8-groups: 7.54 (0.00 pct) 7.50 (0.53 pct) 7.46 (1.06 pct)
16-groups: 10.81 (0.00 pct) 10.84 (-0.27 pct) 10.81 (0.00 pct)
- SIS_UTIL
Test: SIS_UTIL-r0 SIS_UTIL-r1 SIS_UTIL-r2
1-groups: 4.68 (0.00 pct) 5.03 (-7.47 pct) 4.96 (-5.98 pct)
2-groups: 5.45 (0.00 pct) 5.48 (-0.55 pct) 5.50 (-0.91 pct)
4-groups: 6.10 (0.00 pct) 6.07 (0.49 pct) 6.14 (-0.65 pct)
8-groups: 7.52 (0.00 pct) 7.51 (0.13 pct) 7.52 (0.00 pct)
16-groups: 11.63 (0.00 pct) 11.48 (1.28 pct) 11.51 (1.03 pct)
Test: SIS_UTIL-r3 SIS_UTIL-r4 SIS_UTIL-r5
1-groups: 4.80 (0.00 pct) 5.00 (-4.16 pct) 5.06 (-5.41 pct)
2-groups: 5.51 (0.00 pct) 5.58 (-1.27 pct) 5.58 (-1.27 pct)
4-groups: 6.14 (0.00 pct) 6.11 (0.48 pct) 6.06 (1.30 pct)
8-groups: 7.35 (0.00 pct) 7.38 (-0.40 pct) 7.40 (-0.68 pct)
16-groups: 11.03 (0.00 pct) 11.29 (-2.35 pct) 11.14 (-0.99 pct)
- Comparing the good and bad data points for the 16-groups case on each
  kernel version:
Test: tip-good SIS_UTIL-good
1-groups: 4.68 (0.00 pct) 4.80 (-2.56 pct)
2-groups: 5.60 (0.00 pct) 5.51 (1.60 pct)
4-groups: 6.24 (0.00 pct) 6.14 (1.60 pct)
8-groups: 7.54 (0.00 pct) 7.35 (2.51 pct)
16-groups: 10.81 (0.00 pct) 11.03 (-2.03 pct)
Test: tip-good SIS_UTIL-bad
1-groups: 4.68 (0.00 pct) 4.68 (0.00 pct)
2-groups: 5.60 (0.00 pct) 5.45 (2.67 pct)
4-groups: 6.24 (0.00 pct) 6.10 (2.24 pct)
8-groups: 7.54 (0.00 pct) 7.52 (0.26 pct)
16-groups: 10.81 (0.00 pct) 11.63 (-7.58 pct)
Test: tip-bad SIS_UTIL-good
1-groups: 4.90 (0.00 pct) 4.80 (2.04 pct)
2-groups: 5.56 (0.00 pct) 5.51 (0.89 pct)
4-groups: 6.18 (0.00 pct) 6.14 (0.64 pct)
8-groups: 7.50 (0.00 pct) 7.35 (2.00 pct)
16-groups: 11.17 (0.00 pct) 11.03 (1.25 pct)
Test: tip-bad SIS_UTIL-bad
1-groups: 4.90 (0.00 pct) 4.68 (4.48 pct)
2-groups: 5.56 (0.00 pct) 5.45 (1.97 pct)
4-groups: 6.18 (0.00 pct) 6.10 (1.29 pct)
8-groups: 7.50 (0.00 pct) 7.52 (-0.26 pct)
16-groups: 11.17 (0.00 pct) 11.63 (-4.11 pct)
Hackbench consistently reports > 11 for the 16-group
case with SIS_UTIL, however only once with SIS_PROP.
>> - There seems to be regression in tbench for case with number
>> of workers in range 32-128 (12.5% loaded to 50% loaded)
>> - tbench reaches saturation early when system is fully loaded
>>
>> This probably shows that the strategy in the initial v1 RFC
>> works better with our LLC, where the number of CPUs per LLC
>> is low compared to systems with a unified LLC. Given this is
>> showing great results for a unified LLC, maybe SIS_PROP and SIS_UTIL
>> can be enabled based on the size of the LLC.
>>
> Yes, SIS_PROP searches more aggressively, but we are attempting to
> replace SIS_PROP with a more accurate policy.
>>> [..snip..]
>>>
>>> [3]
>>> Prateek mentioned that we should scan aggressively in an LLC domain
>>> with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is
>>> negligible. The current patch aims to propose a generic solution and only
>>> considers the util_avg. A follow-up change could enhance the scan policy
>>> to adjust the scan_percent according to the CPU number in LLC.
>> Following are some additional numbers I would like to share comparing SIS_PROP and
>> SIS_UTIL:
>>
> Nice analysis.
>> o Hackbench with 1 group
>>
>> With 1 group, following are the chances of SIS_PROP
>> and SIS_UTIL finding an idle CPU when an idle CPU
>> exists in LLC:
>>
>> +-----------------+---------------------------+---------------------------+--------+
>> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU | Count |
>> +-----------------+---------------------------+---------------------------+--------+
>> | 1 | 0 | 0 | 66444 |
>> | 1 | 0 | 1 | 34153 |
>> | 1 | 1 | 0 | 57204 |
>> | 1 | 1 | 1 | 119263 |
>> +-----------------+---------------------------+---------------------------+--------+
>>
> So SIS_PROP searches more, and gets a higher chance of finding an idle CPU
> in an LLC with 16 CPUs.
Yes!
>> SIS_PROP vs no SIS_PROP CPU search stats:
>>
>> Total time without SIS_PROP: 90653653
>> Total time with SIS_PROP: 53558942 (-40.92 pct)
>> Total time saved: 37094711
>>
> What does no SIS_PROP mean? Is it with SIS_PROP disabled and
> SIS_UTIL enabled? Or with both SIS_PROP and SIS_UTIL disabled?
> If it is the latter, is there any performance difference between
> the two?
Sorry for not being clear here. No SIS_PROP means we are searching the
entire LLC for an idle CPU all the time. This data aims to find how much
time SIS_PROP is saving compared to the case where it is disabled.
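(In code terms: select_idle_cpu() initializes its scan budget as

	int i, cpu, idle_cpu = -1, nr = INT_MAX;

and only SIS_PROP/SIS_UTIL ever lower nr, so with the limit disabled the
scan may visit every CPU in the LLC.)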
>> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>>
>> +--------------+-------+
>> | CPU Searched | Count |
>> +--------------+-------+
>> | 0 | 10520 |
>> | 1 | 7770 |
>> | 2 | 11976 |
>> | 3 | 17554 |
>> | 4 | 13932 |
>> | 5 | 15051 |
>> | 6 | 8398 |
>> | 7 | 4544 |
>> | 8 | 3712 |
>> | 9 | 2337 |
>> | 10 | 4541 |
>> | 11 | 1947 |
>> | 12 | 3846 |
>> | 13 | 3645 |
>> | 14 | 2686 |
>> | 15 | 8390 |
>> | 16 | 26157 |
>> +--------------+-------+
>>
>> - SIS_UTIL might be bailing out too early in some of these cases.
>>
> Right.
>> o Hackbench with 16 groups
>>
>> The success rate looks as follows:
>>
>> +-----------------+---------------------------+---------------------------+---------+
>> | Idle CPU in LLC | SIS_PROP able to find CPU | SIS_UTIL able to find CPU | Count |
>> +-----------------+---------------------------+---------------------------+---------+
>> | 1 | 0 | 0 | 1313745 |
>> | 1 | 0 | 1 | 694132 |
>> | 1 | 1 | 0 | 2888450 |
>> | 1 | 1 | 1 | 5343065 |
>> +-----------------+---------------------------+---------------------------+---------+
>>
>> SIS_PROP vs no SIS_PROP CPU search stats:
>>
>> Total time without SIS_PROP: 5227299388
>> Total time with SIS_PROP: 3866575188 (-26.03 pct)
>> Total time saved: 1360724200
>>
>> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>>
>> +--------------+---------+
>> | CPU Searched | Count |
>> +--------------+---------+
>> | 0 | 150351 |
>> | 1 | 105116 |
>> | 2 | 214291 |
>> | 3 | 440053 |
>> | 4 | 914116 |
>> | 5 | 1757984 |
>> | 6 | 2410484 |
>> | 7 | 1867668 |
>> | 8 | 379888 |
>> | 9 | 84055 |
>> | 10 | 55389 |
>> | 11 | 26795 |
>> | 12 | 43113 |
>> | 13 | 24579 |
>> | 14 | 32896 |
>> | 15 | 70059 |
>> | 16 | 150858 |
>> +--------------+---------+
>>
>> - SIS_UTIL might be bailing out too early in most of these cases
>>
> It might be interesting to see what the current sum of util_avg is; this
> suggests that, even if util_avg is a little high, it might still be
> worthwhile to search more CPUs.
I agree. Let me know if there is any data you would like me to collect wrt this.
>> o tbench with 256 workers
>>
>> For tbench with 256 threads, SIS_UTIL works great as we have drastically cut down the number
>> of CPUs to search.
>>
>> SIS_PROP vs no SIS_PROP CPU search stats:
>>
>> Total time without SIS_PROP: 64004752959
>> Total time with SIS_PROP: 34695004390 (-45.79 pct)
>> Total time saved: 29309748569
>>
>> Following are number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size):
>>
>> +--------------+----------+
>> | CPU Searched | Count |
>> +--------------+----------+
>> | 0 | 500077 |
>> | 1 | 543865 |
>> | 2 | 4257684 |
>> | 3 | 27457498 |
>> | 4 | 40208673 |
>> | 5 | 3264358 |
>> | 6 | 191631 |
>> | 7 | 24658 |
>> | 8 | 2469 |
>> | 9 | 1374 |
>> | 10 | 2008 |
>> | 11 | 1300 |
>> | 12 | 1226 |
>> | 13 | 1179 |
>> | 14 | 1631 |
>> | 15 | 11678 |
>> | 16 | 7793 |
>> +--------------+----------+
>>
>> - This is where SIS_UTIL shines for the tbench case with 256 workers, as
>>   it is effective at restricting the search space.
>>
>> o Observations
>>
>> SIS_PROP seems to have a higher chance of finding an idle CPU than
>> SIS_UTIL in the case of hackbench with 16 groups. The gap between SIS_PROP
>> and SIS_UTIL is wider with 16 groups than with 1 group.
>> Also, SIS_PROP is more aggressive at saving time in the 1-group case than
>> in the 16-group case.
>>
>> The bailout from SIS_UTIL is fruitful for tbench with 256 workers, leading
>> to a massive performance gain in a fully loaded system.
>>
>> Note: There might be some inaccuracies in the numbers presented for
>> metrics that directly compare SIS_PROP and SIS_UTIL, as both SIS_PROP and
>> SIS_UTIL were enabled when gathering these data points and the results
>> from SIS_PROP were returned from select_idle_cpu().
> Do you mean the 'CPU Searched' calculated by SIS_PROP was collected with both SIS_UTIL
> and SIS_PROP enabled?
Yes, the table
"Number of CPUs SIS_UTIL will search when SIS_PROP limit >= 16 (LLC size)"
was obtained by enabling both features - SIS_PROP and SIS_UTIL - and
comparing the nr values suggested by SIS_UTIL when SIS_PROP allowed
searching the entire LLC.
>> All the numbers for the above analysis were gathered in NPS1 mode.
>>
> I'm thinking of taking nr_llc into consideration to adjust the search depth,
> something like:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd52fc5a034b..39b914599dce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9302,6 +9302,9 @@ static inline void update_idle_cpu_scan(struct lb_env *env,
> llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> nr_scan = max(nr_scan, 0);
> + if (nr_llc <= 16 && nr_scan)
> + nr_scan = nr_llc;
> +
This will behave closer to the initial RFC on systems with smaller LLCs.
I can do some preliminary testing with this and get back to you.
> WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> }
>
> I'll offline the CPUs to make it 16 CPUs per LLC, and check how hackbench behaves.
Thank you for looking into this.
--
Thanks and Regards,
Prateek
On Fri, Apr 29, 2022 at 02:24:42AM +0800, Chen Yu wrote:
> Introduce the quadratic function:
>
> y = a - bx^2
>
> x is the sum_util ratio [0, 100] of this LLC domain, and y is the percentage
> of CPUs to be scanned in the LLC domain. The number of CPUs to search drops
> as sum_util increases. When sum_util hits 85% or above, the scan stops.
> 85% was chosen because it is the threshold of an overloaded LLC sched group
> (imbalance_pct = 117). A quadratic function was chosen because:
>
> [1] Compared to the linear function, it scans more aggressively when the
> sum_util is low.
> [2] Compared to the exponential function, it is easier to calculate.
> [3] It seems that there is no accurate mapping between the sum of util_avg
> and the number of CPUs to be scanned. Use heuristic scan for now.
>
> The steps to calculate scan_nr are as follows:
> [1] scan_percent = 100 - (x/8.5)^2
> when utilization reaches 85%, scan_percent becomes 0.
> [2] scan_nr = nr_llc * scan_percent / 100
> [3] scan_nr = max(scan_nr, 0)
>
> For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
> sum_util% 0 5 15 25 35 45 55 65 75 85 ...
> scan_nr 112 112 108 103 92 80 64 47 24 0 ...
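As a quick spot-check of one row above, using the same integer math as the
patch and nr_llc = 112:

	int x = 45;				/* sum_util ratio in percent */
	int scan_percent = 100 - x * x / 72;	/* 100 - 2025/72 = 72 */
	int scan_nr = 112 * scan_percent / 100;	/* 80, matching the table */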
>
> Furthermore, to minimize the overhead of calculating the metrics in
> select_idle_cpu(), borrow the statistics from periodic load balance.
> As mentioned by Abel, on a platform with 112 CPUs per LLC, the
> sum_util calculated by periodic load balance after 112ms would decay
> to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in
> reflecting the latest utilization. But it is a trade-off.
> Checking the util_avg in newidle load balance would be more frequent,
> but it brings overhead - multiple CPUs write/read the per-LLC shared
> variable and introduces cache false sharing. And Tim also mentioned
> that, it is allowed to be non-optimal in terms of scheduling for the
> short term variations, but if there is a long term trend in the load
> behavior, the scheduler can adjust for that.
>
Seems fair other than the 85% is hardcoded based on an imbalance_pct of
117. If that value should ever change, it'll drift apart.
> SIS_UTIL is disabled by default. When it is enabled, the select_idle_cpu()
> will use the nr_scan calculated by SIS_UTIL instead of the one from
> SIS_PROP. Later SIS_UTIL and SIS_PROP could be made mutually exclusive.
>
I agree with Peter, this should be enabled by default. I am somewhat
embarrassed that I initially queued this patch blindly for testing when it
was mailed thinking "I would like to have some results before reviewing"
and then when I sit down to do the review, the results are all useless
because the feature was disabled. My initial thinking starting the review
was "Weird, none of these results are conclusive in any way."
I don't think you need to explicitly check for both being enabled given
that it's a sched_feat. Someone poking in there is meant to be debugging
something but the vast majority of people will never go looking.
> Limitations:
> [1]
> This patch is based on the util_avg, which is very sensitive to the CPU
> frequency invariance. The util_avg would decay quite fast when the
> CPU is idle, if the max frequency has been limited by the user.
> Patch [3] should be applied if turbo is disabled manually on Intel
> platforms.
>
It is worth mentioning in the changelog if there is a patch that this
implicitly depends upon. It affects the order in which patches should be merged,
or bisections for a regression may unfairly identify your patch as the
source of the problem.
At least then if they merge in the wrong order there will something
obvious to check.
> [2]
> There may be unbalanced tasks among CPUs due to CPU affinity. For example,
> suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to
> CPU0~CPU6, while CPU7 is idle:
>
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
> util_avg 1024 1024 1024 1024 1024 1024 1024 0
>
> Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%,
> select_idle_cpu() will not scan, thus CPU7 is undetected.
>
> A possible workaround to mitigate this problem is that the nr_scan should
> be increased if idle CPUs are detected during periodic load balance. And the
> problem could be mitigated by idle load balance that, CPU7 might pull some
> tasks on it.
>
While a valid concern, it's no worse than what is there now and I think
this case is unlikely. It could naturally happen if there was 6 busy tasks
running entirely in userspace which is harmless. Normal load balancing
would use CPU7 if there was any stacking on CPU0 to CPU6 or NEWLY_IDLE
ILB on CPU7 would pull something if there was any other activity on CPU7.
> [3]
> Prateek mentioned that we should scan aggressively in an LLC domain
> with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is
> negligible. The current patch aims to propose a generic solution and only
> considers the util_avg. A follow-up change could enhance the scan policy
> to adjust the scan_percent according to the CPU number in LLC.
>
Yes, anything along those lines is a separate patch.
> v2->v3:
> - Use 85% as the threshold again, because the CPU frequency invariance issue
> has been fixed and the patch is queued for 5.19.
>
Note in changelog exactly what this fix is in case the patches go in the
wrong order.
> - Stop the scan if 85% is reached, rather than scanning for at least 4 CPUs.
> According to the feedback from Yicong, it might be better to stop scanning
> entirely when the LLC is overloaded.
>
I think only workloads like hackbench benefit from the "search 4 CPUs"
heuristic because the machine is heavily overloaded with tasks that are
rapidly idling. Be wary of a patch that sets it back to 4 with hackbench
being the only justification.
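For reference, the difference under discussion is just the floor of the
clamp (a sketch; the exact v2 form may have differed):

	nr_scan = max(nr_scan, 4);	/* v2: probe at least 4 CPUs */
	nr_scan = max(nr_scan, 0);	/* v3: stop entirely when overloaded */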
> Link: https://lore.kernel.org/lkml/[email protected] #1
> Link: https://lore.kernel.org/lkml/[email protected] #2
> Link: https://lore.kernel.org/lkml/[email protected] #3
> Suggested-by: Tim Chen <[email protected]>
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Chen Yu <[email protected]>
> ---
> include/linux/sched/topology.h | 1 +
> kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 3 files changed, 58 insertions(+)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..816df6cc444e 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,6 +81,7 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> + int nr_idle_scan;
> };
>
> struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 23c7d0f617ee..50c9d5b2b338 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6327,6 +6327,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> {
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
> + struct sched_domain_shared *sd_share;
> struct rq *this_rq = this_rq();
> int this = smp_processor_id();
> struct sched_domain *this_sd;
> @@ -6366,6 +6367,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> time = cpu_clock(this);
> }
>
> + if (sched_feat(SIS_UTIL)) {
> + sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> + if (sd_share) {
> + /* because !--nr is the condition to stop scan */
> + nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> + /* overloaded LLC is unlikely to have idle cpu/core */
> + if (nr == 1)
> + return -1;
> + }
> + }
> +
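As an aside, a standalone illustration of the !--nr idiom referenced in the
comment above (plain C, not kernel code): storing nr_idle_scan + 1 makes the
loop inspect exactly nr_idle_scan CPUs.

	int scan(int nr_idle_scan, int ncpus)
	{
		int nr = nr_idle_scan + 1, inspected = 0;

		if (nr == 1)		/* nr_idle_scan == 0: skip the scan */
			return -1;
		for (int cpu = 0; cpu < ncpus; cpu++) {
			if (!--nr)	/* budget exhausted */
				break;
			inspected++;	/* stand-in for the idle check */
		}
		return inspected;	/* min(nr_idle_scan, ncpus) */
	}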
> for_each_cpu_wrap(cpu, cpus, target + 1) {
> if (has_idle_core) {
> i = select_idle_core(p, cpu, cpus, &idle_cpu);
> @@ -9267,6 +9279,46 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> return idlest;
> }
>
> +static inline void update_idle_cpu_scan(struct lb_env *env,
> + unsigned long sum_util)
> +{
Don't inline. This is a long function with one callsite, the compiler
should be able to figure it out.
> + struct sched_domain_shared *sd_share;
> + int nr_scan, nr_llc, llc_util_pct;
> +
> + if (!sched_feat(SIS_UTIL))
> + return;
Move the CPU_NEWLY_IDLE check here because it's essentially a free check
and avoids the per_cpu lookup.
Also comment on why CPU_NEWLY_IDLE is ignored. I assume it's because you
are updating sd_shared which is a shared cache line write and
CPU_NEWLY_IDLE can fire way more frequently than periodic load
balancing.
> + /*
> + * Update the number of CPUs to scan in LLC domain, which could
> + * be used as a hint in select_idle_cpu(). The update of this hint
> + * occurs during periodic load balancing, rather than frequent
> + * newidle balance.
> + */
> + nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
> + if (env->idle == CPU_NEWLY_IDLE ||
> + env->sd->span_weight != nr_llc)
> + return;
> +
This caught me because nr_llc made me think it was "the number
of last level caches in a NUMA node" because that's what it means
in kernel/sched/topology.c. This is the number of CPUs sharing an
LLC, so llc_weight would be appropriate given that it's compared to
span_weight. It's not mandatory for me but it's preferable.
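Putting the last two review points together, the entry of the function might
look something like this (a sketch only, with the rename applied):

	static void update_idle_cpu_scan(struct lb_env *env,
					 unsigned long sum_util)
	{
		struct sched_domain_shared *sd_share;
		int nr_scan, llc_weight, llc_util_pct;

		/*
		 * Cheap checks first: CPU_NEWLY_IDLE fires far more often
		 * than periodic balance, so bail out before the per_cpu
		 * lookup and the shared cache line write.
		 */
		if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE)
			return;

		llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
		if (env->sd->span_weight != llc_weight)
			return;
		/* ... rest as in the patch, with nr_llc renamed ... */
	}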
> + sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
> + if (!sd_share)
> + return;
> +
> + /*
> + * The number of CPUs to search drops as sum_util increases, when
> + * sum_util hits 85% or above, the scan stops.
> + * The reason to choose 85% as the threshold is because this is the
> + * imbalance_pct when a LLC sched group is overloaded.
> + * let y = 100 - (x/8.5)^2 = 100 - x^2/72
> + * y is the percentage of CPUs to be scanned in the LLC
> + * domain, x is the ratio of sum_util compared to the
> + * CPU capacity, which ranges in [0, 100], thus
> + * nr_scan = nr_llc * y / 100
> + */
> + llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> + nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> + nr_scan = max(nr_scan, 0);
Peter commented on this already.
> + WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> +}
To avoid unnecessary cache line bounces:
if (nr_scan != sd_share->nr_idle_scan)
WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
I expect the two very common values represent "machine is mostly idle
scan everything" and "machine is heavily overloaded, scan nothing" and
I'm betting the cost of the branch is less than the cost of rewriting
the same values.
--
Mel Gorman
SUSE Labs
On Mon, May 16, 2022 at 04:22:34PM +0530, K Prateek Nayak wrote:
[snip]
> I've run the benchmark in two sets of 3 runs, rebooting
> in between, on each kernel version:
>
> - tip
>
> Test:                   tip-r0                  tip-r1                  tip-r2
>  1-groups:         4.64 (0.00 pct)         4.90 (-5.60 pct)        4.99 (-7.54 pct)
>  2-groups:         5.54 (0.00 pct)         5.56 (-0.36 pct)        5.58 (-0.72 pct)
>  4-groups:         6.24 (0.00 pct)         6.18 (0.96 pct)         6.20 (0.64 pct)
>  8-groups:         7.54 (0.00 pct)         7.50 (0.53 pct)         7.54 (0.00 pct)
> 16-groups:        10.85 (0.00 pct)        11.17 (-2.94 pct)       10.91 (-0.55 pct)
>
> Test:                   tip-r3                  tip-r4                  tip-r5
>  1-groups:         4.68 (0.00 pct)         4.97 (-6.19 pct)        4.98 (-6.41 pct)
>  2-groups:         5.60 (0.00 pct)         5.62 (-0.35 pct)        5.66 (-1.07 pct)
>  4-groups:         6.24 (0.00 pct)         6.23 (0.16 pct)         6.24 (0.00 pct)
>  8-groups:         7.54 (0.00 pct)         7.50 (0.53 pct)         7.46 (1.06 pct)
> 16-groups:        10.81 (0.00 pct)        10.84 (-0.27 pct)       10.81 (0.00 pct)
>
> - SIS_UTIL
>
> Test:                SIS_UTIL-r0             SIS_UTIL-r1             SIS_UTIL-r2
>  1-groups:         4.68 (0.00 pct)         5.03 (-7.47 pct)        4.96 (-5.98 pct)
>  2-groups:         5.45 (0.00 pct)         5.48 (-0.55 pct)        5.50 (-0.91 pct)
>  4-groups:         6.10 (0.00 pct)         6.07 (0.49 pct)         6.14 (-0.65 pct)
>  8-groups:         7.52 (0.00 pct)         7.51 (0.13 pct)         7.52 (0.00 pct)
> 16-groups:        11.63 (0.00 pct)        11.48 (1.28 pct)        11.51 (1.03 pct)
>
> Test:                SIS_UTIL-r3             SIS_UTIL-r4             SIS_UTIL-r5
>  1-groups:         4.80 (0.00 pct)         5.00 (-4.16 pct)        5.06 (-5.41 pct)
>  2-groups:         5.51 (0.00 pct)         5.58 (-1.27 pct)        5.58 (-1.27 pct)
>  4-groups:         6.14 (0.00 pct)         6.11 (0.48 pct)         6.06 (1.30 pct)
>  8-groups:         7.35 (0.00 pct)         7.38 (-0.40 pct)        7.40 (-0.68 pct)
> 16-groups:        11.03 (0.00 pct)        11.29 (-2.35 pct)       11.14 (-0.99 pct)
>
> - Comparing the good and bad data points for the 16-groups case on each
>   kernel version:
>
> Test:                   tip-good                SIS_UTIL-good
>  1-groups:         4.68 (0.00 pct)         4.80 (-2.56 pct)
>  2-groups:         5.60 (0.00 pct)         5.51 (1.60 pct)
>  4-groups:         6.24 (0.00 pct)         6.14 (1.60 pct)
>  8-groups:         7.54 (0.00 pct)         7.35 (2.51 pct)
> 16-groups:        10.81 (0.00 pct)        11.03 (-2.03 pct)
>
> Test:                   tip-good                SIS_UTIL-bad
>  1-groups:         4.68 (0.00 pct)         4.68 (0.00 pct)
>  2-groups:         5.60 (0.00 pct)         5.45 (2.67 pct)
>  4-groups:         6.24 (0.00 pct)         6.10 (2.24 pct)
>  8-groups:         7.54 (0.00 pct)         7.52 (0.26 pct)
> 16-groups:        10.81 (0.00 pct)        11.63 (-7.58 pct)
>
> Test:                   tip-bad                 SIS_UTIL-good
>  1-groups:         4.90 (0.00 pct)         4.80 (2.04 pct)
>  2-groups:         5.56 (0.00 pct)         5.51 (0.89 pct)
>  4-groups:         6.18 (0.00 pct)         6.14 (0.64 pct)
>  8-groups:         7.50 (0.00 pct)         7.35 (2.00 pct)
> 16-groups:        11.17 (0.00 pct)        11.03 (1.25 pct)
>
> Test:                   tip-bad                 SIS_UTIL-bad
>  1-groups:         4.90 (0.00 pct)         4.68 (4.48 pct)
>  2-groups:         5.56 (0.00 pct)         5.45 (1.97 pct)
>  4-groups:         6.18 (0.00 pct)         6.10 (1.29 pct)
>  8-groups:         7.50 (0.00 pct)         7.52 (-0.26 pct)
> 16-groups:        11.17 (0.00 pct)        11.63 (-4.11 pct)
>
> Hackbench consistently reports > 11 for the 16-group
> case with SIS_UTIL, however only once with SIS_PROP.
>
Mel has mentioned that, although it is 'overloaded', the nature of hackbench
would benefit from scanning more CPUs: there is frequent context switching,
so there is more chance of finding an idle CPU.
> > I'm thinking of taking nr_llc into consideration to adjust the search depth,
> > something like:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index dd52fc5a034b..39b914599dce 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9302,6 +9302,9 @@ static inline void update_idle_cpu_scan(struct lb_env *env,
> > llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> > nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> > nr_scan = max(nr_scan, 0);
> > + if (nr_llc <= 16 && nr_scan)
> > + nr_scan = nr_llc;
> > +
> This will behave closer to the initial RFC on systems with smaller LLC.
> I can do some preliminary testing with this and get back to you.
> > WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> > }
> >
> > I'll offline the CPUs to make it 16 CPUs per LLC, and check how hackbench behaves.
> Thank you for looking into this.
>
OK, I've done some tests and recorded the number of CPUs SIS_PROP and SIS_UTIL
want to scan (the nr).
1. 16 CPUs online, 16 groups of hackbench (total 64 threads)
Most of the time SIS_PROP would scan for 4 CPUs, while SIS_UTIL scans for 2 CPUs.
2. 112 CPUs online, 16 groups of hackbench (total 448 threads)
Most of the time SIS_PROP would scan for 4 CPUs, but for a small fraction of
the time SIS_PROP would scan the entire LLC, which could be time costly.
The number of CPUs scanned by SIS_UTIL ranges over [2, 20] and looks like a
normal distribution.
(I'll send you the plot offline so you can see what these data look like.)
This means that, for the 16-CPU case, the extremely overloaded hackbench would
benefit from scanning more CPUs. But for the 112-CPU platform, using SIS_UTIL
seems more reasonable because it does not have 'jitters' that scan the whole
LLC.
Let me revise the patch according to Peter and Mel's suggestions, and send
a v4 (then cook the complementary patch to deal with systems having 16
CPUs per LLC domain).
thanks,
Chenyu
> --
> Thanks and Regards,
> Prateek
>
Hi Mel,
On Tue, May 17, 2022 at 01:50:47PM +0100, Mel Gorman wrote:
> On Fri, Apr 29, 2022 at 02:24:42AM +0800, Chen Yu wrote:
> > Introduce the quadratic function:
> >
> > y = a - bx^2
> >
> > x is the sum_util ratio [0, 100] of this LLC domain, and y is the percentage
> > of CPUs to be scanned in the LLC domain. The number of CPUs to search drops
> > as sum_util increases. When sum_util hits 85% or above, the scan stops.
> > 85% was chosen because it is the threshold of an overloaded LLC sched group
> > (imbalance_pct = 117). A quadratic function was chosen because:
> >
> > [1] Compared to the linear function, it scans more aggressively when the
> > sum_util is low.
> > [2] Compared to the exponential function, it is easier to calculate.
> > [3] It seems that there is no accurate mapping between the sum of util_avg
> > and the number of CPUs to be scanned. Use heuristic scan for now.
> >
> > The steps to calculate scan_nr are as follows:
> > [1] scan_percent = 100 - (x/8.5)^2
> > when utilization reaches 85%, scan_percent becomes 0.
> > [2] scan_nr = nr_llc * scan_percent / 100
> > [3] scan_nr = max(scan_nr, 0)
> >
> > For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
> > sum_util% 0 5 15 25 35 45 55 65 75 85 ...
> > scan_nr 112 112 108 103 92 80 64 47 24 0 ...
> >
> > Furthermore, to minimize the overhead of calculating the metrics in
> > select_idle_cpu(), borrow the statistics from periodic load balance.
> > As mentioned by Abel, on a platform with 112 CPUs per LLC, the
> > sum_util calculated by periodic load balance after 112ms would decay
> > to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in
> > reflecting the latest utilization. But it is a trade-off.
> > Checking the util_avg in newidle load balance would be more frequent,
> > but it brings overhead - multiple CPUs write/read the per-LLC shared
> > variable and introduces cache false sharing. And Tim also mentioned
> > that, it is allowed to be non-optimal in terms of scheduling for the
> > short term variations, but if there is a long term trend in the load
> > behavior, the scheduler can adjust for that.
> >
>
> Seems fair other than the 85% is hardcoded based on an imbalance_pct of
> 117. If that value should ever change, it'll drift apart.
>
OK, I can change the calculation to be based on imbalance_pct rather than
hardcoding 85%.
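One way to derive the cut-off instead of hardcoding it (a sketch; it assumes
the overload test is util * imbalance_pct > capacity * 100, i.e.
util/capacity > 100/imbalance_pct):

	/* 100 * 100 / 117 = 85, so imbalance_pct = 117 gives ~85% */
	int cutoff_pct = 100 * 100 / imbalance_pct;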
> > SIS_UTIL is disabled by default. When it is enabled, the select_idle_cpu()
> > will use the nr_scan calculated by SIS_UTIL instead of the one from
> > SIS_PROP. Later SIS_UTIL and SIS_PROP could be made mutually exclusive.
> >
>
> I agree with Peter, this should be enabled by default. I am somewhat
> embarrassed that I initially queued this patch blindly for testing when it
> was mailed thinking "I would like to have some results before reviewing"
> and then when I sit down to do the review, the results are all useless
> because the feature was disabled. My initial thinking starting the review
> was "Weird, none of these results are conclusive in any way."
>
Thank you for reviewing and testing this change, and sorry for the inconvenience.
> I don't think you need to explicitly check for both being enabled given
> that it's a sched_feat. Someone poking in there is meant to be debugging
> something but the vast majority of people will never go looking.
>
OK, I'll change it to enabled by default.
> > Limitations:
> > [1]
> > This patch is based on the util_avg, which is very sensitive to the CPU
> > frequency invariance. The util_avg would decay quite fast when the
> > CPU is idle, if the max frequency has been limited by the user.
> > Patch [3] should be applied if turbo is disabled manually on Intel
> > platforms.
> >
>
> It is worth mentioning in the changelog if there is a patch that this
> implicitly depends upon. It affects the ordering patches should be merged
> or bisections for a regression may unfairly identify your patch as the
> source of the problem.
>
> At least then if they merge in the wrong order there will be something
> obvious to check.
>
OK, I'll keep this information in the changelog. The patch mentioned above
has already been queued for 5.19.
> > [2]
> > There may be unbalanced tasks among CPUs due to CPU affinity. For example,
> > suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to
> > CPU0~CPU6, while CPU7 is idle:
> >
> > CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
> > util_avg 1024 1024 1024 1024 1024 1024 1024 0
> >
> > Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
> > select_idle_cpu() will not scan, thus CPU7 is undetected.
> >
> > A possible workaround is to increase nr_scan if idle CPUs are detected
> > during periodic load balance. The problem could also be mitigated by
> > idle load balance: CPU7 might pull some tasks onto itself.
> >
>
> While a valid concern, it's no worse than what is there now and I think
> this case is unlikely. It could naturally happen if there were 6 busy tasks
> running entirely in userspace, which is harmless. Normal load balancing
> would use CPU7 if there was any stacking on CPU0 to CPU6 or NEWLY_IDLE
> ILB on CPU7 would pull something if there was any other activity on CPU7.
>
Yes agreed.
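For the record, plugging the example into the scan formula with integer
math: llc_util_pct = (7 * 1024 * 100) / (8 * 1024) = 87, and
100 - 87 * 87 / 72 = 100 - 105 = -5, so nr_scan clamps to 0 and
select_idle_cpu() skips the scan entirely until load balancing moves
something onto CPU7.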
> > [3]
> > Prateek mentioned that we should scan aggressively in an LLC domain
> > with 16 CPUs, because the cost of searching for an idle one among 16
> > CPUs is negligible. The current patch aims to propose a generic
> > solution and only
> > considers the util_avg. A follow-up change could enhance the scan policy
> > to adjust the scan_percent according to the CPU number in LLC.
> >
>
> Yes, anything along those lines is a separate patch.
>
OK, I'm working with Prateek on this, collecting test data to make the
search policy friendlier to platforms with a small LLC; one possible shape
is sketched below.
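Purely hypothetical at this point (the cutoff value is made up), the
follow-up could simply skip the throttling on small domains:

	/*
	 * Hypothetical sketch: searching a small LLC is cheap, so keep
	 * scanning the whole domain there and only throttle the search
	 * on larger LLCs.
	 */
	if (nr_llc <= 16)
		nr_scan = nr_llc;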
> > v2->v3:
> > - Use 85% as the threshold again, because the CPU frequency invariance issue
> > has been fixed and the patch is queued for 5.19.
> >
>
> Note in changelog exactly what this fix is in case the patches go in the
> wrong order.
>
OK, will do.
> > - Stop the scan if 85% is reached, rather than scanning for at least 4 CPUs.
> > According to the feedback from Yicong, it might be better to stop scanning
> > entirely when the LLC is overloaded.
> >
>
> I think only workloads like hackbench benefit from "search 4 CPUs"
> heuristic because the machine is heavily overloaded with tasks that are
> rapidly idling. Be wary of a patch that sets it back to 4 with hackbench
> being the only justification.
>
Yes, this is also my concern. As Prateek mentioned, in an LLC with 16 CPUs,
when the system is extremely overloaded, there is a slight regression on
hackbench, because with this patch applied SIS gives up scanning CPUs
entirely, while SIS_PROP would still scan at least 4.
> > }
> >
[snip]
> > +static inline void update_idle_cpu_scan(struct lb_env *env,
> > + unsigned long sum_util)
> > +{
>
> Don't inline. This is a long function with one callsite, the compiler
> should be able to figure it out.
>
OK, will do in next version.
> > + struct sched_domain_shared *sd_share;
> > + int nr_scan, nr_llc, llc_util_pct;
> > +
> > + if (!sched_feat(SIS_UTIL))
> > + return;
>
> Move the CPU_NEWLY_IDLE check here because it's essentially a free check
> and avoids the per_cpu lookup.
>
OK.
> Also comment on why CPU_NEWLY_IDLE is ignored. I assume it's because you
> are updating sd_shared which is a shared cache line write and
> CPU_NEWLY_IDLE can fire way more frequently than periodic load
> balancing.
>
Yes, I'll add the comment here.
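So the head of update_idle_cpu_scan() would look something like this in
the next version (untested sketch):

	if (!sched_feat(SIS_UTIL))
		return;

	/*
	 * Skip updates from CPU_NEWLY_IDLE: nr_idle_scan lives in the
	 * shared sd_share, and newidle balance can fire far more often
	 * than periodic load balance, so writing from there would cause
	 * excessive cache line bouncing.
	 */
	if (env->idle == CPU_NEWLY_IDLE)
		return;

	nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
	if (env->sd->span_weight != nr_llc)
		return;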
> > + /*
> > + * Update the number of CPUs to scan in LLC domain, which could
> > + * be used as a hint in select_idle_cpu(). The update of this hint
> > + * occurs during periodic load balancing, rather than frequent
> > + * newidle balance.
> > + */
> > + nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
> > + if (env->idle == CPU_NEWLY_IDLE ||
> > + env->sd->span_weight != nr_llc)
> > + return;
> > +
>
> This caught me because nr_llc made me think it was "the number
> of last level caches in a NUMA node" because that's what it means
> in kernel/sched/topology.c. This is the number of CPUs sharing an
> LLC, so llc_weight would be appropriate given that it's compared to
> span_weight. It's not mandatory for me but it's preferable.
>
OK, I'll change it to llc_weight.
> > + sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
> > + if (!sd_share)
> > + return;
> > +
> > + /*
> > + * The number of CPUs to search drops as sum_util increases, when
> > + * sum_util hits 85% or above, the scan stops.
> > + * The reason to choose 85% as the threshold is because this is the
> > + * imbalance_pct when a LLC sched group is overloaded.
> > + * let y = 100 - (x/8.5)^2 = 100 - x^2/72
> > + * y is the percentage of CPUs to be scanned in the LLC
> > + * domain, x is the ratio of sum_util compared to the
> > + * CPU capacity, which ranges in [0, 100], thus
> > + * nr_scan = nr_llc * y / 100
> > + */
> > + llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
> > + nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
> > + nr_scan = max(nr_scan, 0);
>
> Peter commented on this already.
>
> > + WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
> > +}
>
> To avoid excessive unnecessary cache line bounces;
>
> if (nr_scan != sd_share->nr_idle_scan)
> WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
>
> I expect the two very common values represent "machine is mostly idle
> scan everything" and "machine is heavily overloaded, scan nothing" and
> I'm betting the cost of the branch is less than the cost of rewriting
> the same values.
>
OK, will do in next version.
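i.e. the tail of update_idle_cpu_scan() would end up as the following
(untested sketch, with a READ_ONCE() added on the compare side):

	/*
	 * Only write when the value changes, so the two common steady
	 * states ("mostly idle, scan everything" and "overloaded, scan
	 * nothing") do not keep re-dirtying the shared cache line.
	 */
	if (nr_scan != READ_ONCE(sd_share->nr_idle_scan))
		WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);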
thanks,
Chenyu