The wakeup fastpath (select_idle_sibling(), or SIS) plays an important role
in maximizing the usage of cpu resources and can greatly affect the overall
performance of the system.
SIS tries to find an idle cpu inside the target's LLC to place the woken-up
task. The cache-hot cpus will be checked first, then the other cpus of that
LLC (the domain scan) if the hot ones are not idle.
The domain scan works well under light load by simply traversing the cpus
of the LLC, since plenty of idle cpus are usually available. But this
doesn't scale well once the LLC gets bigger and the load increases, so
SIS_PROP was introduced to limit the scan cost. For now SIS_PROP only
limits the number of cpus to be scanned; the way the scan is performed is
not changed.
This patchset introduces the SIS filter to help improve scan efficiency
when the scan depth is limited. The filter only contains the unoccupied
cpus, and is updated during SMT-level load balancing. The expectation is
that the more overloaded the system is, the fewer cpus will be scanned.
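As a rough illustration of the idea, here is a minimal userspace sketch
(not kernel code; the helper names and the single 64-bit filter word are
assumptions of this sketch): the wakeup-side scan is restricted to a
per-LLC mask of unoccupied cpus, while still re-checking each candidate
since the mask may be stale.

#include <stdbool.h>
#include <stdint.h>

struct llc_filter {
        uint64_t idle_cpus;     /* bit i set => cpu i believed unoccupied */
};

/* Scan at most @nr candidates taken from the filter; return a cpu or -1. */
static int scan_filter(const struct llc_filter *f, bool (*cpu_is_idle)(int), int nr)
{
        uint64_t mask = f->idle_cpus;

        while (mask && nr-- > 0) {
                int cpu = __builtin_ctzll(mask);        /* lowest set bit */

                mask &= mask - 1;                       /* clear that bit */
                if (cpu_is_idle(cpu))                   /* mask may be stale, re-check */
                        return cpu;
        }
        return -1;
}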
SIS_PROP
========
The filter should eventually be put under SIS_PROP, which is not done in
this version because that sched-feature is being reworked or replaced (by
SIS_UTIL [1]). This will be adjusted once that work settles.
[1] https://lore.kernel.org/all/[email protected]/
Entrance: Load balance vs. Tick
===============================
The filter is now updated during load balancing at the SMT level, rather
than at the tick as in patches v1-v3. There are several reasons:
 - Periodic load balancing has an admission control in which only the
   first idle cpu (if any) or the designated balance cpu can trigger a
   balance (see the sketch after this section), so the complexity is
   reduced by O(n) in the domain scope. The complexity of per-cpu
   updating at the tick, on the other hand, is not reduced and can
   cause heavy traffic on the LLC cache.
 - In order to keep the filter relatively fresh, the cpu idle path
   needs to be taken into account as well. Since load_balance() is the
   common path for both periodic and newly-idle balancing, we can
   simplify things and concentrate on load_balance() itself, whereas
   with the tick approach both the tick and the idle path need to be
   modified.
 - The calculation of the SMT state can reuse the intermediate results
   of load balancing, so it is cheap without sacrificing accuracy.
But the tick approach does have an advantage on the idle path: the cpus it
sets in the filter are guaranteed to be idle, while in load_balance() the
filter is updated before task migrations (so the idle dst_cpu can still be
fed with tasks). This is addressed by the false positive correction
discussed later.
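A minimal model of the admission control mentioned in the first point
above (illustrative names only; in the kernel this check lives in
should_we_balance()):

#include <stdbool.h>

/*
 * In a periodic balance only the first idle cpu of the domain (if any),
 * or the designated balance cpu when none is idle, is allowed to run
 * the balance. Piggybacking the filter update on it therefore costs
 * O(1) per domain instead of per-cpu work at every tick.
 */
static bool may_run_balance(int this_cpu, const int *idle_cpus, int nr_idle,
                            int balance_cpu)
{
        if (nr_idle > 0)
                return this_cpu == idle_cpus[0];        /* first idle cpu wins */
        return this_cpu == balance_cpu;
}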
Filter: Busy vs. Overloaded
===========================
In previous patch versions, only the overloaded cpus, rather than the busy
or unoccupied ones, were set into the filter because of their stability.
For example, the state of an unoccupied cpu (non-idle tasks == 0) or a
busy cpu (== 1) can easily be changed by a short running workload. Besides,
wakeups won't pollute the filter, since adding tasks to already overloaded
cpus can't change their states.
But it's different in this patchset, in which the load balance cpu is
responsible for updating its core state. The state will be set to
has_icpus if unoccupied cpus exist, no matter what the other cpu states
are, so the overloaded state of a core is also unstable. Given this, the
busy semantics is preferred over overloaded due to its accuracy.
The only problem, again, is the cache. To avoid frequent updates caused by
this variance, false positive bits are allowed to stay in the filter for a
while, which will be discussed later.
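For clarity, the per-cpu and per-core states discussed above, as a tiny
illustrative sketch (plain C; the names are not the kernel's):

#include <stdbool.h>

/*
 * A cpu is "unoccupied" with no non-idle task, "busy" with exactly one,
 * and "overloaded" with more; a core "has_icpus" if any SMT sibling is
 * unoccupied, regardless of the other siblings' states.
 */
enum cpu_state { CPU_UNOCCUPIED, CPU_BUSY, CPU_OVERLOADED };

static enum cpu_state classify_cpu(unsigned int nr_non_idle_tasks)
{
        if (!nr_non_idle_tasks)
                return CPU_UNOCCUPIED;
        return nr_non_idle_tasks == 1 ? CPU_BUSY : CPU_OVERLOADED;
}

static bool core_has_icpus(const unsigned int *nr_non_idle, int nr_siblings)
{
        for (int i = 0; i < nr_siblings; i++) {
                if (classify_cpu(nr_non_idle[i]) == CPU_UNOCCUPIED)
                        return true;
        }
        return false;
}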
Incremental Propagation
=======================
By design, the unoccupied cpus are set (propagated) into the filter. But
updating the whole SMT state every time is heavy. What we actually do is
propagate only one unoccupied cpu of a core to the filter at a time, hence
the name incremental propagation. It's not just a tradeoff: it also helps
spread load across different cores.
The cpu to be propagated needs to be selected carefully. If the load
balance cpu is chosen, the filter might be polluted because that cpu could
be fed with tasks soon. But it's not wise either to wait until the end of
load balancing, as we need to update as early as possible to keep the
filter fresh. The solution, again, is the false positive correction.
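A rough sketch of the propagation step under the same assumptions as
before (a 64-bit filter word and a hypothetical helper name; the actual
update is sd_update_icpus() in patch 6):

#include <stdint.h>

/*
 * Publish at most one unoccupied cpu of a core per balancing pass, or
 * clear the whole core from the filter when none was found (icpu < 0).
 */
static void propagate_core(uint64_t *filter, uint64_t core_smt_mask, int icpu)
{
        if (icpu < 0)
                *filter &= ~core_smt_mask;      /* drop all siblings of the core */
        else
                *filter |= 1ULL << icpu;        /* add a single candidate cpu */
}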
Accuracy: False Positive Correction
===================================
The filter is supposed to contain unoccupied cpus. There are two main
cases that will degrade the filter:
 - False negative: there are unoccupied cpus not represented in the
   filter. This is the case we should try to eliminate, because these
   cpus will be out of reach during wakeup, leading to insufficient use
   of cpu resources. So in practice the unoccupied cpus are aggressively
   propagated to the filter.
 - False positive: some of the cpus in the filter are not unoccupied
   any more. We allow these cpus to stay in the filter for some time to
   relieve cache pressure. It's a tradeoff.
A correction is needed when the false positive case is continuously
detected. See patch 7 for details.
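Below is a hedged sketch of the correction logic, simplified from patch 7
(the enum mirrors the patch, the transition function itself is only
illustrative):

#include <stdbool.h>

enum sd_state { sd_has_icores, sd_has_icpus, sd_may_idle, sd_is_busy };

/*
 * If a balancing pass finds no unoccupied cpu while the previous pass
 * claimed some, tolerate the false positives once (sd_may_idle); if it
 * happens again, fall back to sd_is_busy and correct the filter.
 */
static enum sd_state next_state(enum sd_state old, bool found_unoccupied)
{
        if (found_unoccupied)
                return sd_has_icpus;
        if (old == sd_has_icpus)
                return sd_may_idle;     /* let false positives live a bit */
        return sd_is_busy;              /* repeated miss: force a correction */
}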
SMT 2/4/8
=========
The incremental propagation is relatively friendly to SMT2, but less so to
wider cores. As the core grows bigger, more false negative cpus can exist.
It would be appreciated if anyone could help testing on SMT4/8 machines.
Benchmark
=========
Tests are done on an Intel Xeon(R) Platinum 8260 CPU @ 2.40GHz machine
with 2 NUMA nodes, each of which has 24 cores with SMT2 enabled, so 96
CPUs in total.
All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled.
Based on tip sched/core f3dd3f674555 (v5.19-rc2).
Results
=======
hackbench-process-pipes
vanilla filter
Amean 1 0.2357 ( 0.00%) 0.2540 ( -7.78%)
Amean 4 0.6403 ( 0.00%) 0.6363 ( 0.62%)
Amean 7 0.7583 ( 0.00%) 0.7367 ( 2.86%)
Amean 12 1.0703 ( 0.00%) 1.0520 ( 1.71%)
Amean 21 2.0363 ( 0.00%) 1.9610 * 3.70%*
Amean 30 3.2487 ( 0.00%) 2.9083 * 10.48%*
Amean 48 6.3620 ( 0.00%) 4.8543 * 23.70%*
Amean 79 8.3653 ( 0.00%) 7.1077 * 15.03%*
Amean 110 9.8370 ( 0.00%) 8.5740 * 12.84%*
Amean 141 11.4667 ( 0.00%) 10.8750 * 5.16%*
Amean 172 13.4433 ( 0.00%) 12.6443 * 5.94%*
Amean 203 15.8970 ( 0.00%) 14.9143 * 6.18%*
Amean 234 17.9643 ( 0.00%) 16.9123 * 5.86%*
Amean 265 20.3910 ( 0.00%) 19.2040 * 5.82%*
Amean 296 22.5253 ( 0.00%) 21.2547 * 5.64%*
hackbench-process-sockets
vanilla filter
Amean 1 0.4177 ( 0.00%) 0.4133 ( 1.04%)
Amean 4 1.4397 ( 0.00%) 1.4240 * 1.09%*
Amean 7 2.4720 ( 0.00%) 2.4310 * 1.66%*
Amean 12 4.1407 ( 0.00%) 4.0683 * 1.75%*
Amean 21 7.0550 ( 0.00%) 6.8830 * 2.44%*
Amean 30 9.9633 ( 0.00%) 9.7750 * 1.89%*
Amean 48 15.9837 ( 0.00%) 15.5313 * 2.83%*
Amean 79 26.7740 ( 0.00%) 26.2703 * 1.88%*
Amean 110 37.2913 ( 0.00%) 36.5433 * 2.01%*
Amean 141 47.8937 ( 0.00%) 46.5300 * 2.85%*
Amean 172 58.0273 ( 0.00%) 56.4530 * 2.71%*
Amean 203 68.2530 ( 0.00%) 66.3320 * 2.81%*
Amean 234 78.8987 ( 0.00%) 76.8497 * 2.60%*
Amean 265 89.1520 ( 0.00%) 86.8213 * 2.61%*
Amean 296 99.6920 ( 0.00%) 96.9707 * 2.73%*
hackbench-thread-pipes
vanilla filter
Amean 1 0.2647 ( 0.00%) 0.2633 ( 0.50%)
Amean 4 0.6290 ( 0.00%) 0.6607 ( -5.03%)
Amean 7 0.7850 ( 0.00%) 0.7870 ( -0.25%)
Amean 12 1.3347 ( 0.00%) 1.2577 ( 5.77%)
Amean 21 3.1233 ( 0.00%) 2.4613 * 21.20%*
Amean 30 5.7120 ( 0.00%) 3.6847 * 35.49%*
Amean 48 8.1947 ( 0.00%) 6.2670 * 23.52%*
Amean 79 9.1750 ( 0.00%) 8.0640 * 12.11%*
Amean 110 10.6300 ( 0.00%) 9.5583 * 10.08%*
Amean 141 12.7490 ( 0.00%) 12.0167 * 5.74%*
Amean 172 15.1567 ( 0.00%) 14.1570 * 6.60%*
Amean 203 17.5160 ( 0.00%) 16.7883 ( 4.15%)
Amean 234 19.8710 ( 0.00%) 19.5370 ( 1.68%)
Amean 265 23.2700 ( 0.00%) 21.4017 * 8.03%*
Amean 296 25.4093 ( 0.00%) 23.9943 * 5.57%*
hackbench-thread-sockets
vanilla filter
Amean 1 0.4467 ( 0.00%) 0.4347 ( 2.69%)
Amean 4 1.4757 ( 0.00%) 1.4533 * 1.51%*
Amean 7 2.5320 ( 0.00%) 2.4993 * 1.29%*
Amean 12 4.2617 ( 0.00%) 4.1780 * 1.96%*
Amean 21 7.2397 ( 0.00%) 7.0660 * 2.40%*
Amean 30 10.2200 ( 0.00%) 9.9810 * 2.34%*
Amean 48 16.2623 ( 0.00%) 16.0483 * 1.32%*
Amean 79 27.4307 ( 0.00%) 26.8410 * 2.15%*
Amean 110 37.8993 ( 0.00%) 37.3223 * 1.52%*
Amean 141 48.3890 ( 0.00%) 47.4823 * 1.87%*
Amean 172 58.9887 ( 0.00%) 57.7753 * 2.06%*
Amean 203 69.5853 ( 0.00%) 68.0563 * 2.20%*
Amean 234 80.0743 ( 0.00%) 78.4857 * 1.98%*
Amean 265 90.5473 ( 0.00%) 89.3363 * 1.34%*
Amean 296 101.3857 ( 0.00%) 99.7717 * 1.59%*
schbench
vanilla filter
Lat 50.0th-qrtle-1 6.00 ( 0.00%) 6.00 ( 0.00%)
Lat 75.0th-qrtle-1 8.00 ( 0.00%) 8.00 ( 0.00%)
Lat 90.0th-qrtle-1 9.00 ( 0.00%) 8.00 ( 11.11%)
Lat 95.0th-qrtle-1 9.00 ( 0.00%) 8.00 ( 11.11%)
Lat 99.0th-qrtle-1 10.00 ( 0.00%) 9.00 ( 10.00%)
Lat 99.5th-qrtle-1 11.00 ( 0.00%) 9.00 ( 18.18%)
Lat 99.9th-qrtle-1 11.00 ( 0.00%) 9.00 ( 18.18%)
Lat 50.0th-qrtle-2 6.00 ( 0.00%) 7.00 ( -16.67%)
Lat 75.0th-qrtle-2 7.00 ( 0.00%) 8.00 ( -14.29%)
Lat 90.0th-qrtle-2 8.00 ( 0.00%) 9.00 ( -12.50%)
Lat 95.0th-qrtle-2 8.00 ( 0.00%) 10.00 ( -25.00%)
Lat 99.0th-qrtle-2 9.00 ( 0.00%) 11.00 ( -22.22%)
Lat 99.5th-qrtle-2 9.00 ( 0.00%) 11.00 ( -22.22%)
Lat 99.9th-qrtle-2 9.00 ( 0.00%) 11.00 ( -22.22%)
Lat 50.0th-qrtle-4 9.00 ( 0.00%) 8.00 ( 11.11%)
Lat 75.0th-qrtle-4 10.00 ( 0.00%) 10.00 ( 0.00%)
Lat 90.0th-qrtle-4 11.00 ( 0.00%) 11.00 ( 0.00%)
Lat 95.0th-qrtle-4 12.00 ( 0.00%) 11.00 ( 8.33%)
Lat 99.0th-qrtle-4 13.00 ( 0.00%) 13.00 ( 0.00%)
Lat 99.5th-qrtle-4 13.00 ( 0.00%) 16.00 ( -23.08%)
Lat 99.9th-qrtle-4 13.00 ( 0.00%) 19.00 ( -46.15%)
Lat 50.0th-qrtle-8 11.00 ( 0.00%) 11.00 ( 0.00%)
Lat 75.0th-qrtle-8 14.00 ( 0.00%) 14.00 ( 0.00%)
Lat 90.0th-qrtle-8 16.00 ( 0.00%) 16.00 ( 0.00%)
Lat 95.0th-qrtle-8 17.00 ( 0.00%) 17.00 ( 0.00%)
Lat 99.0th-qrtle-8 22.00 ( 0.00%) 19.00 ( 13.64%)
Lat 99.5th-qrtle-8 28.00 ( 0.00%) 23.00 ( 17.86%)
Lat 99.9th-qrtle-8 31.00 ( 0.00%) 42.00 ( -35.48%)
Lat 50.0th-qrtle-16 17.00 ( 0.00%) 17.00 ( 0.00%)
Lat 75.0th-qrtle-16 23.00 ( 0.00%) 23.00 ( 0.00%)
Lat 90.0th-qrtle-16 26.00 ( 0.00%) 27.00 ( -3.85%)
Lat 95.0th-qrtle-16 28.00 ( 0.00%) 29.00 ( -3.57%)
Lat 99.0th-qrtle-16 32.00 ( 0.00%) 33.00 ( -3.12%)
Lat 99.5th-qrtle-16 37.00 ( 0.00%) 35.00 ( 5.41%)
Lat 99.9th-qrtle-16 54.00 ( 0.00%) 46.00 ( 14.81%)
Lat 50.0th-qrtle-32 30.00 ( 0.00%) 29.00 ( 3.33%)
Lat 75.0th-qrtle-32 43.00 ( 0.00%) 42.00 ( 2.33%)
Lat 90.0th-qrtle-32 51.00 ( 0.00%) 49.00 ( 3.92%)
Lat 95.0th-qrtle-32 54.00 ( 0.00%) 51.00 ( 5.56%)
Lat 99.0th-qrtle-32 61.00 ( 0.00%) 57.00 ( 6.56%)
Lat 99.5th-qrtle-32 64.00 ( 0.00%) 60.00 ( 6.25%)
Lat 99.9th-qrtle-32 72.00 ( 0.00%) 82.00 ( -13.89%)
Lat 50.0th-qrtle-47 44.00 ( 0.00%) 45.00 ( -2.27%)
Lat 75.0th-qrtle-47 64.00 ( 0.00%) 65.00 ( -1.56%)
Lat 90.0th-qrtle-47 75.00 ( 0.00%) 77.00 ( -2.67%)
Lat 95.0th-qrtle-47 81.00 ( 0.00%) 82.00 ( -1.23%)
Lat 99.0th-qrtle-47 92.00 ( 0.00%) 98.00 ( -6.52%)
Lat 99.5th-qrtle-47 101.00 ( 0.00%) 114.00 ( -12.87%)
Lat 99.9th-qrtle-47 271.00 ( 0.00%) 167.00 ( 38.38%)
netperf-udp
vanilla filter
Hmean send-64 199.12 ( 0.00%) 201.32 ( 1.11%)
Hmean send-128 396.22 ( 0.00%) 397.01 ( 0.20%)
Hmean send-256 777.80 ( 0.00%) 783.96 ( 0.79%)
Hmean send-1024 2972.62 ( 0.00%) 3011.87 * 1.32%*
Hmean send-2048 5600.64 ( 0.00%) 5730.50 * 2.32%*
Hmean send-3312 8757.45 ( 0.00%) 8703.62 ( -0.61%)
Hmean send-4096 10578.90 ( 0.00%) 10590.93 ( 0.11%)
Hmean send-8192 17051.22 ( 0.00%) 17189.62 * 0.81%*
Hmean send-16384 27915.16 ( 0.00%) 27816.01 ( -0.36%)
Hmean recv-64 199.12 ( 0.00%) 201.32 ( 1.11%)
Hmean recv-128 396.22 ( 0.00%) 397.01 ( 0.20%)
Hmean recv-256 777.80 ( 0.00%) 783.96 ( 0.79%)
Hmean recv-1024 2972.62 ( 0.00%) 3011.87 * 1.32%*
Hmean recv-2048 5600.64 ( 0.00%) 5730.49 * 2.32%*
Hmean recv-3312 8757.45 ( 0.00%) 8703.61 ( -0.61%)
Hmean recv-4096 10578.90 ( 0.00%) 10590.93 ( 0.11%)
Hmean recv-8192 17051.21 ( 0.00%) 17189.57 * 0.81%*
Hmean recv-16384 27915.08 ( 0.00%) 27815.86 ( -0.36%)
netperf-tcp
vanilla filter
Hmean 64 811.07 ( 0.00%) 835.46 * 3.01%*
Hmean 128 1614.86 ( 0.00%) 1652.27 * 2.32%*
Hmean 256 3131.16 ( 0.00%) 3119.01 ( -0.39%)
Hmean 1024 10286.12 ( 0.00%) 10333.64 ( 0.46%)
Hmean 2048 16231.88 ( 0.00%) 17141.88 * 5.61%*
Hmean 3312 20705.91 ( 0.00%) 21703.49 * 4.82%*
Hmean 4096 22650.75 ( 0.00%) 23904.09 * 5.53%*
Hmean 8192 27984.06 ( 0.00%) 29170.57 * 4.24%*
Hmean 16384 32816.85 ( 0.00%) 33351.41 * 1.63%*
tbench4 Throughput
vanilla filter
Hmean 1 300.07 ( 0.00%) 302.52 * 0.82%*
Hmean 2 617.72 ( 0.00%) 598.45 * -3.12%*
Hmean 4 1213.99 ( 0.00%) 1206.36 * -0.63%*
Hmean 8 2373.78 ( 0.00%) 2372.28 * -0.06%*
Hmean 16 4777.82 ( 0.00%) 4711.44 * -1.39%*
Hmean 32 7182.50 ( 0.00%) 7718.15 * 7.46%*
Hmean 64 8611.44 ( 0.00%) 9409.29 * 9.27%*
Hmean 128 18102.63 ( 0.00%) 20650.23 * 14.07%*
Hmean 256 18029.28 ( 0.00%) 20611.03 * 14.32%*
Hmean 384 17986.44 ( 0.00%) 19361.29 * 7.64%*
Conclusion
==========
There seem to be no obvious regressions except for the hackbench pipe
tests with small group counts like 1/4, and some of the benchmarks show
great improvement.
As said previously, the domain scan works well under light load by simply
traversing the cpus of the LLC, since plenty of idle cpus are available.
In this case the cost of maintaining the filter can hardly contribute to
scan efficiency; it merely adds scheduling overhead that slows down the
system.
One reasonable way to fix it is to disable the filter when the system is
not loaded. Three possible solutions come to my mind:
 a. Disable the filter when the number of unoccupied cpus in the LLC is
    above a threshold. This is straightforward, but it's hard to define
    a proper threshold for different topologies or workloads, so I
    quickly gave up on this.
 b. Enable the filter when the average idle time decreases, which is
    what SIS_PROP relies on. I did a test enabling the filter when nr=4,
    but the result is not convincing (a rough sketch follows this list).
 c. Enable the filter when scan efficiency drops below a threshold.
    Like a, the threshold is hard to define.
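A rough illustration of option b under stated assumptions (nr is the scan
depth computed by SIS_PROP; the threshold 4 is just the value from the
experiment above, not a tuned constant):

#include <stdint.h>

/*
 * Pick the mask to scan: the whole LLC span while the system is lightly
 * loaded (large SIS_PROP depth), the filter otherwise.
 */
static uint64_t pick_scan_mask(uint64_t llc_span, uint64_t filter, int nr)
{
        return nr > 4 ? llc_span : filter;
}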
I am still working on this and open to discussion and suggestions.
Comments and tests are appreciated!
v3 -> v4:
- update filter in load_balance rather than in the tick
- now the filter contains unoccupied cpus rather than overloaded ones
- added mechanisms to deal with the false positive cases
v2 -> v3:
- removed sched-idle balance feature and focus on SIS
- take non-CFS tasks into consideration
- several fixes/improvement suggested by Josh Don
v1 -> v2:
- several optimizations on sched-idle balancing
- ignore asym topos in can_migrate_task
- add more benchmarks including SIS efficiency
- re-organize patch as suggested by Mel
v3: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/lkml/[email protected]/
v1: https://lore.kernel.org/lkml/[email protected]/
Abel Wu (7):
sched/fair: default to false in test_idle_cores
sched/fair: remove redundant check in select_idle_smt
sched/fair: avoid double search on same cpu
sched/fair: remove useless check in select_idle_core
sched/fair: skip SIS domain search if fully busy
sched/fair: skip busy cores in SIS search
sched/fair: de-entropy for SIS filter
include/linux/sched/topology.h | 62 ++++++++-
kernel/sched/fair.c | 233 +++++++++++++++++++++++++++++----
kernel/sched/topology.c | 12 +-
3 files changed, 277 insertions(+), 30 deletions(-)
--
2.31.1
If two cpus share the LLC cache, then the cores they belong to are also in
the same LLC domain.
Signed-off-by: Abel Wu <[email protected]>
---
kernel/sched/fair.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5e8453e3ffb..e57d7cdf46ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6285,14 +6285,11 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
/*
* Scan the local SMT mask for idle CPUs.
*/
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_smt(struct task_struct *p, int target)
{
int cpu;
- for_each_cpu(cpu, cpu_smt_mask(target)) {
- if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
- !cpumask_test_cpu(cpu, sched_domain_span(sd)))
- continue;
+ for_each_cpu_and(cpu, cpu_smt_mask(target), p->cpus_ptr) {
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
return cpu;
}
@@ -6316,7 +6313,7 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
return __select_idle_cpu(core, p);
}
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+static inline int select_idle_smt(struct task_struct *p, int target)
{
return -1;
}
@@ -6538,7 +6535,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
has_idle_core = test_idle_cores(target);
if (!has_idle_core && cpus_share_cache(prev, target)) {
- i = select_idle_smt(p, sd, prev);
+ i = select_idle_smt(p, prev);
if ((unsigned int)i < nr_cpumask_bits)
return i;
}
--
2.31.1
Try to improve efficiency of SIS domain search by filtering out busy
cores, and as a result the more overloaded the system is, the less
cpus will be scanned.
The filter is supposed to contain unoccupied cpus of the LLC. And we
propagate these cpus to the filter one at a time at core granule.
This can help spreading load to different cores given that the search
depth is limited. The chosen cpu to be propagated is guaranteed to be
unoccupied at that time.
When idle cpu exists, the last one is preferred in order not to
conflict with periodic load balancing during which the first idle
cpu (if any) is chosen to be fed with tasks.
Signed-off-by: Abel Wu <[email protected]>
---
include/linux/sched/topology.h | 20 ++++++++
kernel/sched/fair.c | 90 +++++++++++++++++++++++++++++++---
kernel/sched/topology.c | 12 ++++-
3 files changed, 115 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 3e99ac98d766..b93edf587d84 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -103,6 +103,10 @@ struct sched_group;
* load balancing on each SMT domain inside the LLC, the state will be
* re-evaluated and switch from sd_is_busy to sd_has_icpus if idle cpus
* exist.
+ *
+ * For SMT domains, the state is updated during load balancing at SMT
+ * level. Upper levels are ignored due to the long intervals that make
+ * information out-of-date quickly.
*/
enum sd_state {
sd_has_icores,
@@ -113,7 +117,18 @@ enum sd_state {
struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
+
+ int updating;
int state; /* see enum sd_state */
+
+ /*
+ * Record unoccupied cpus for SIS domain search.
+ *
+ * NOTE: this field is variable length. (Allocated dynamically
+ * by attaching extra space to the end of the structure,
+ * depending on how many CPUs the kernel has booted up with)
+ */
+ unsigned long idle_cpus[];
};
struct sched_domain {
@@ -199,6 +214,11 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
return to_cpumask(sd->span);
}
+static inline struct cpumask *sched_domain_icpus(struct sched_domain_shared *sds)
+{
+ return to_cpumask(sds->idle_cpus);
+}
+
extern void partition_sched_domains_locked(int ndoms_new,
cpumask_var_t doms_new[],
struct sched_domain_attr *dattr_new);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ca37fdc6c4d..d55fdcedf2c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6241,6 +6241,28 @@ static inline bool test_idle_cpus(int cpu)
return sd_get_state(cpu) != sd_is_busy;
}
+static void sd_update_icpus(int core, int icpu)
+{
+ struct sched_domain_shared *sds;
+ struct cpumask *icpus;
+
+ sds = rcu_dereference(per_cpu(sd_llc_shared, core));
+ if (!sds)
+ return;
+
+ icpus = sched_domain_icpus(sds);
+
+ /*
+ * XXX: The update is racy between different cores.
+ * The non-atomic ops here is a tradeoff of accuracy
+ * for easing the cache traffic.
+ */
+ if (icpu == -1)
+ cpumask_andnot(icpus, icpus, cpu_smt_mask(core));
+ else if (!cpumask_test_cpu(icpu, icpus))
+ __cpumask_set_cpu(icpu, icpus);
+}
+
/*
* Scans the local SMT mask to see if the entire core is idle, and records this
* information in sd_llc_shared->has_idle_cores.
@@ -6340,6 +6362,10 @@ static inline bool test_idle_cpus(int cpu)
return true;
}
+static inline void sd_update_icpus(int core, int icpu)
+{
+}
+
static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
{
return __select_idle_cpu(core, p);
@@ -6370,7 +6396,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
if (!this_sd)
return -1;
- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ cpumask_and(cpus, has_idle_core ? sched_domain_span(sd) :
+ sched_domain_icpus(sd->shared), p->cpus_ptr);
if (sched_feat(SIS_PROP) && !has_idle_core) {
u64 avg_cost, avg_idle, span_avg;
@@ -8342,6 +8369,7 @@ struct sd_lb_stats {
unsigned int prefer_sibling; /* tasks should go to sibling first */
int sd_state;
+ int idle_cpu;
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
struct sg_lb_stats local_stat; /* Statistics of the local group */
@@ -8362,6 +8390,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.total_load = 0UL,
.total_capacity = 0UL,
.sd_state = sd_is_busy,
+ .idle_cpu = -1,
.busiest_stat = {
.idle_cpus = UINT_MAX,
.group_type = group_has_spare,
@@ -8702,10 +8731,18 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
}
-static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq)
+static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq, int cpu)
{
- if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq))
+ if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq)) {
+ /*
+ * Prefer idle cpus over unoccupied ones. This
+ * is achieved by only allowing the idle ones
+ * to unconditionally overwrite the previous
+ * record while the occupied ones can't.
+ */
+ sds->idle_cpu = cpu;
sds->sd_state = sd_has_icpus;
+ }
}
/**
@@ -8741,7 +8778,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->sum_nr_running += nr_running;
if (update_core)
- sd_classify(sds, rq);
+ sd_classify(sds, rq, i);
if (nr_running > 1)
*sg_status |= SG_OVERLOAD;
@@ -8757,7 +8794,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
* No need to call idle_cpu() if nr_running is not 0
*/
if (!nr_running && idle_cpu(i)) {
+ /*
+ * Prefer the last idle cpu by overwriting
+ * previous one. The first idle cpu in this
+ * domain (if any) can trigger balancing
+ * and fed with tasks, so we'd better choose
+ * a candidate in an opposite way.
+ */
+ sds->idle_cpu = i;
sgs->idle_cpus++;
+
/* Idle cpu can't have misfit task */
continue;
}
@@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
{
- if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
- set_idle_cpus(env->dst_cpu, true);
+ struct sched_domain_shared *sd_smt_shared = env->sd->shared;
+ enum sd_state new = sds->sd_state;
+ int this = env->dst_cpu;
+
+ /*
+ * Parallel updating can hardly contribute accuracy to
+ * the filter, besides it can be one of the burdens on
+ * cache traffic.
+ */
+ if (cmpxchg(&sd_smt_shared->updating, 0, 1))
+ return;
+
+ /*
+ * There is at least one unoccupied cpu available, so
+ * propagate it to the filter to avoid false negative
+ * issue which could result in lost tracking of some
+ * idle cpus thus throughput downgraded.
+ */
+ if (new != sd_is_busy) {
+ if (!test_idle_cpus(this))
+ set_idle_cpus(this, true);
+ } else {
+ /*
+ * Nothing changes so nothing to update or
+ * propagate.
+ */
+ if (sd_smt_shared->state == sd_is_busy)
+ goto out;
+ }
+
+ sd_update_icpus(this, sds->idle_cpu);
+ sd_smt_shared->state = new;
+out:
+ xchg(&sd_smt_shared->updating, 0);
}
/**
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..d3cd7cf5a136 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,6 +1641,16 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+
+ /*
+ * Initialize SMT domains to be busy, so that we don't
+ * need to propagate idle cpus to LLC domains which are
+ * default to fully busy (no cpus set). This will be
+ * updated in the first load balancing on SMT domains
+ * if necessary.
+ */
+ if (sd->flags & SD_SHARE_CPUCAPACITY)
+ WRITE_ONCE(sd->shared->state, sd_is_busy);
}
sd->private = sdd;
@@ -2106,7 +2116,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sd, j) = sd;
- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
--
2.31.1
If a full scan of the SIS domain fails, then no unoccupied cpus are
available and the LLC is fully busy. In this case we'd better spend the
time on something more useful, rather than wasting it trying to find an
idle cpu that probably doesn't exist.
The fully busy status will be re-evaluated when any core of this LLC
domain enters load balancing, and cleared once idle cpus are found.
Signed-off-by: Abel Wu <[email protected]>
---
include/linux/sched/topology.h | 35 ++++++++++++++-
kernel/sched/fair.c | 82 +++++++++++++++++++++++++++++-----
2 files changed, 104 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..3e99ac98d766 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -77,10 +77,43 @@ extern int sched_domain_level_max;
struct sched_group;
+/*
+ * States of the sched-domain
+ *
+ * - sd_has_icores
+ * This state is only used in LLC domains to indicate worthy
+ * of a full scan in SIS due to idle cores available.
+ *
+ * - sd_has_icpus
+ * This state indicates that unoccupied (sched-idle/idle) cpus
+ * might exist in this domain. For the LLC domains it is the
+ * default state since these cpus are the main targets of SIS
+ * search, and is also used as a fallback state of the other
+ * states.
+ *
+ * - sd_is_busy
+ * This state indicates there are no unoccupied cpus in this
+ * domain. So for LLC domains, it gives the hint on whether
+ * we should put efforts on the SIS search or not.
+ *
+ * For LLC domains, sd_has_icores is set when the last non-idle cpu of
+ * a core becomes idle. After a full SIS scan and if no idle cores found,
+ * sd_has_icores must be cleared and the state will be set to sd_has_icpus
+ * or sd_is_busy depending on whether there is any idle cpu. And during
+ * load balancing on each SMT domain inside the LLC, the state will be
+ * re-evaluated and switch from sd_is_busy to sd_has_icpus if idle cpus
+ * exist.
+ */
+enum sd_state {
+ sd_has_icores,
+ sd_has_icpus,
+ sd_is_busy
+};
+
struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
- int has_idle_cores;
+ int state; /* see enum sd_state */
};
struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1cc86e76e38e..2ca37fdc6c4d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5642,11 +5642,15 @@ static inline void update_overutilized_status(struct rq *rq)
static inline void update_overutilized_status(struct rq *rq) { }
#endif
+static int unoccupied_rq(struct rq *rq)
+{
+ return rq->nr_running == rq->cfs.idle_h_nr_running;
+}
+
/* Runqueue only has SCHED_IDLE tasks enqueued */
static int sched_idle_rq(struct rq *rq)
{
- return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
- rq->nr_running);
+ return unlikely(rq->nr_running && unoccupied_rq(rq));
}
/*
@@ -6197,24 +6201,44 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
DEFINE_STATIC_KEY_FALSE(sched_smt_present);
EXPORT_SYMBOL_GPL(sched_smt_present);
-static inline void set_idle_cores(int cpu, int val)
+static inline void sd_set_state(int cpu, enum sd_state state)
{
struct sched_domain_shared *sds;
sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
if (sds)
- WRITE_ONCE(sds->has_idle_cores, val);
+ WRITE_ONCE(sds->state, state);
}
-static inline bool test_idle_cores(int cpu)
+static inline enum sd_state sd_get_state(int cpu)
{
struct sched_domain_shared *sds;
sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
if (sds)
- return READ_ONCE(sds->has_idle_cores);
+ return READ_ONCE(sds->state);
- return false;
+ return sd_has_icpus;
+}
+
+static inline void set_idle_cores(int cpu, int idle)
+{
+ sd_set_state(cpu, idle ? sd_has_icores : sd_has_icpus);
+}
+
+static inline bool test_idle_cores(int cpu)
+{
+ return sd_get_state(cpu) == sd_has_icores;
+}
+
+static inline void set_idle_cpus(int cpu, int idle)
+{
+ sd_set_state(cpu, idle ? sd_has_icpus : sd_is_busy);
+}
+
+static inline bool test_idle_cpus(int cpu)
+{
+ return sd_get_state(cpu) != sd_is_busy;
}
/*
@@ -6298,7 +6322,7 @@ static int select_idle_smt(struct task_struct *p, int target)
#else /* CONFIG_SCHED_SMT */
-static inline void set_idle_cores(int cpu, int val)
+static inline void set_idle_cores(int cpu, int idle)
{
}
@@ -6307,6 +6331,15 @@ static inline bool test_idle_cores(int cpu)
return false;
}
+static inline void set_idle_cpus(int cpu, int idle)
+{
+}
+
+static inline bool test_idle_cpus(int cpu)
+{
+ return true;
+}
+
static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
{
return __select_idle_cpu(core, p);
@@ -6382,7 +6415,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
}
}
- if (has_idle_core)
+ if (idle_cpu == -1)
+ set_idle_cpus(target, false);
+ else if (has_idle_core)
set_idle_cores(target, false);
if (sched_feat(SIS_PROP) && !has_idle_core) {
@@ -6538,6 +6573,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned int)i < nr_cpumask_bits)
return i;
}
+
+ if (!has_idle_core && !test_idle_cpus(target))
+ return target;
}
i = select_idle_cpu(p, sd, has_idle_core, target);
@@ -8303,6 +8341,8 @@ struct sd_lb_stats {
unsigned long avg_load; /* Average load across all groups in sd */
unsigned int prefer_sibling; /* tasks should go to sibling first */
+ int sd_state;
+
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
struct sg_lb_stats local_stat; /* Statistics of the local group */
};
@@ -8321,6 +8361,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.local = NULL,
.total_load = 0UL,
.total_capacity = 0UL,
+ .sd_state = sd_is_busy,
.busiest_stat = {
.idle_cpus = UINT_MAX,
.group_type = group_has_spare,
@@ -8661,6 +8702,12 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
}
+static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq)
+{
+ if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq))
+ sds->sd_state = sd_has_icpus;
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
@@ -8675,11 +8722,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
struct sg_lb_stats *sgs,
int *sg_status)
{
- int i, nr_running, local_group;
+ int i, nr_running, local_group, update_core;
memset(sgs, 0, sizeof(*sgs));
local_group = group == sds->local;
+ update_core = env->sd->flags & SD_SHARE_CPUCAPACITY;
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
struct rq *rq = cpu_rq(i);
@@ -8692,6 +8740,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
nr_running = rq->nr_running;
sgs->sum_nr_running += nr_running;
+ if (update_core)
+ sd_classify(sds, rq);
+
if (nr_running > 1)
*sg_status |= SG_OVERLOAD;
@@ -9220,6 +9271,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
return idlest;
}
+static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
+ set_idle_cpus(env->dst_cpu, true);
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -9270,8 +9327,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* Tag domain that child domain prefers tasks go to siblings first */
sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
-
- if (env->sd->flags & SD_NUMA)
+ if (env->sd->flags & SD_SHARE_CPUCAPACITY)
+ sd_update_state(env, sds);
+ else if (env->sd->flags & SD_NUMA)
env->fbq_type = fbq_classify_group(&sds->busiest_stat);
if (!env->sd->parent) {
--
2.31.1
The function only gets called when sds->has_idle_cores is true, which is
possible only when sched_smt_present is enabled.
Signed-off-by: Abel Wu <[email protected]>
---
kernel/sched/fair.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aba1dad19574..1cc86e76e38e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6256,9 +6256,6 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
bool idle = true;
int cpu;
- if (!static_branch_likely(&sched_smt_present))
- return __select_idle_cpu(core, p);
-
for_each_cpu(cpu, cpu_smt_mask(core)) {
if (!available_idle_cpu(cpu)) {
idle = false;
--
2.31.1
Now when updating the core state, there are two main problems that could
pollute the SIS filter:
 - The update happens before task migration, so if dst_cpu is
   selected to be propagated, and it might be fed with tasks
   soon, the effort we paid is no more than setting a busy
   cpu into the SIS filter. On the other hand it is
   important that we update as early as possible to keep the
   filter fresh, so it's not wise to delay the update to the
   end of load balancing.
 - False negative propagation hurts performance since some
   idle cpus could be out of reach. So in general we will
   aggressively propagate idle cpus but allow false positives
   to continue to exist for a while, which may lead to the
   filter being fully polluted.
The pain can be relieved by a forced correction when false positives
are continuously detected.
Signed-off-by: Abel Wu <[email protected]>
---
include/linux/sched/topology.h | 7 +++++
kernel/sched/fair.c | 51 ++++++++++++++++++++++++++++++++--
2 files changed, 55 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index b93edf587d84..e3552ce192a9 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -91,6 +91,12 @@ struct sched_group;
* search, and is also used as a fallback state of the other
* states.
*
+ * - sd_may_idle
+ * This state implies the unstableness of the SIS filter, and
+ * some bits of it may out of date. This state is only used in
+ * SMT domains as an intermediate state between sd_has_icpus
+ * and sd_is_busy.
+ *
* - sd_is_busy
* This state indicates there are no unoccupied cpus in this
* domain. So for LLC domains, it gives the hint on whether
@@ -111,6 +117,7 @@ struct sched_group;
enum sd_state {
sd_has_icores,
sd_has_icpus,
+ sd_may_idle,
sd_is_busy
};
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d55fdcedf2c0..9713d183d35e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8768,6 +8768,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
struct rq *rq = cpu_rq(i);
+ bool update = update_core && (env->dst_cpu != i);
sgs->group_load += cpu_load(rq);
sgs->group_util += cpu_util_cfs(i);
@@ -8777,7 +8778,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
nr_running = rq->nr_running;
sgs->sum_nr_running += nr_running;
- if (update_core)
+ /*
+ * The dst_cpu is not preferred since it might
+ * be fed with tasks soon.
+ */
+ if (update)
sd_classify(sds, rq, i);
if (nr_running > 1)
@@ -8801,7 +8806,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
* and fed with tasks, so we'd better choose
* a candidate in an opposite way.
*/
- sds->idle_cpu = i;
+ if (update)
+ sds->idle_cpu = i;
sgs->idle_cpus++;
/* Idle cpu can't have misfit task */
@@ -9321,7 +9327,7 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
{
struct sched_domain_shared *sd_smt_shared = env->sd->shared;
enum sd_state new = sds->sd_state;
- int this = env->dst_cpu;
+ int icpu = sds->idle_cpu, this = env->dst_cpu;
/*
* Parallel updating can hardly contribute accuracy to
@@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
if (cmpxchg(&sd_smt_shared->updating, 0, 1))
return;
+ /*
+ * The dst_cpu is likely to be fed with tasks soon.
+ * If it is the only unoccupied cpu in this domain,
+ * we still handle it the same way as sd_has_icpus
+ * but turn the SMT domain into the unstable state,
+ * rather than waiting until the end of load balancing,
+ * since it's also important to update the filter as
+ * early as possible to keep it fresh.
+ */
+ if (new == sd_is_busy) {
+ if (idle_cpu(this) || sched_idle_cpu(this)) {
+ new = sd_may_idle;
+ icpu = this;
+ }
+ }
+
/*
* There is at least one unoccupied cpu available, so
* propagate it to the filter to avoid false negative
@@ -9338,6 +9360,12 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
* idle cpus thus throughput downgraded.
*/
if (new != sd_is_busy) {
+ /*
+ * The sd_may_idle state is taken into
+ * consideration as well, because at this
+ * point we can't actually know whether task
+ * migrations will happen or not.
+ */
if (!test_idle_cpus(this))
set_idle_cpus(this, true);
} else {
@@ -9347,9 +9375,26 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
*/
if (sd_smt_shared->state == sd_is_busy)
goto out;
+
+ /*
+ * Allow false positive to exist for some time
+ * to make a tradeoff of accuracy of the filter
+ * for relieving cache traffic.
+ */
+ if (sd_smt_shared->state == sd_has_icpus) {
+ new = sd_may_idle;
+ goto update;
+ }
+
+ /*
+ * If the false positive issue has already been
+ * there for a while, a correction of the filter
+ * is needed.
+ */
}
sd_update_icpus(this, sds->idle_cpu);
+update:
sd_smt_shared->state = new;
out:
xchg(&sd_smt_shared->updating, 0);
--
2.31.1
The prev cpu is checked at the beginning of SIS, and it's unlikely to have
become idle by the time of the second check in select_idle_smt(). So we'd
better focus only on its SMT siblings.
Signed-off-by: Abel Wu <[email protected]>
---
kernel/sched/fair.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e57d7cdf46ce..aba1dad19574 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6290,6 +6290,8 @@ static int select_idle_smt(struct task_struct *p, int target)
int cpu;
for_each_cpu_and(cpu, cpu_smt_mask(target), p->cpus_ptr) {
+ if (cpu == target)
+ continue;
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
return cpu;
}
--
2.31.1
Hi Abel,
On Sun, Jun 19, 2022 at 08:04:50PM +0800, Abel Wu wrote:
> Try to improve efficiency of SIS domain search by filtering out busy
> cores, and as a result the more overloaded the system is, the less
> cpus will be scanned.
>
> The filter is supposed to contain unoccupied cpus of the LLC. And we
> propagate these cpus to the filter one at a time at core granule.
> This can help spreading load to different cores given that the search
> depth is limited. The chosen cpu to be propagated is guaranteed to be
> unoccupied at that time.
>
> When idle cpu exists, the last one is preferred in order not to
> conflict with periodic load balancing during which the first idle
> cpu (if any) is chosen to be fed with tasks.
>
Maybe the content in the cover letter could be copied into this git log
for future reference?
> Signed-off-by: Abel Wu <[email protected]>
> ---
> include/linux/sched/topology.h | 20 ++++++++
> kernel/sched/fair.c | 90 +++++++++++++++++++++++++++++++---
> kernel/sched/topology.c | 12 ++++-
> 3 files changed, 115 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 3e99ac98d766..b93edf587d84 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -103,6 +103,10 @@ struct sched_group;
> * load balancing on each SMT domain inside the LLC, the state will be
> * re-evaluated and switch from sd_is_busy to sd_has_icpus if idle cpus
> * exist.
> + *
> + * For SMT domains, the state is updated during load balancing at SMT
> + * level. Upper levels are ignored due to the long intervals that make
> + * information out-of-date quickly.
Reusing the data from load balance to select the unoccupied candidate
is applicable IMO, and it is also aligned with the SIS_UTIL path. I have
a question regarding the update frequency. In the v3 patch, the update was
based on the periodic tick, which is triggered at most every 1ms (CONFIG_HZ_1000).
The periodic SMT load balance is launched every smt_weight ms, usually 2ms.
I expect the 2ms to be of the same scale as the 1ms, and since task-tick-based
update is acceptable, would excluding the CPU_NEWLY_IDLE balance from this
patch reduce the overhead while not bringing too much inaccuracy?
> */
> enum sd_state {
> sd_has_icores,
> @@ -113,7 +117,18 @@ enum sd_state {
> struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> +
> + int updating;
> int state; /* see enum sd_state */
> +
> + /*
> + * Record unoccupied cpus for SIS domain search.
> + *
> + * NOTE: this field is variable length. (Allocated dynamically
> + * by attaching extra space to the end of the structure,
> + * depending on how many CPUs the kernel has booted up with)
> + */
> + unsigned long idle_cpus[];
> };
>
> struct sched_domain {
> @@ -199,6 +214,11 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
> return to_cpumask(sd->span);
> }
>
> +static inline struct cpumask *sched_domain_icpus(struct sched_domain_shared *sds)
> +{
> + return to_cpumask(sds->idle_cpus);
> +}
> +
> extern void partition_sched_domains_locked(int ndoms_new,
> cpumask_var_t doms_new[],
> struct sched_domain_attr *dattr_new);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2ca37fdc6c4d..d55fdcedf2c0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6241,6 +6241,28 @@ static inline bool test_idle_cpus(int cpu)
> return sd_get_state(cpu) != sd_is_busy;
> }
>
> +static void sd_update_icpus(int core, int icpu)
> +{
> + struct sched_domain_shared *sds;
> + struct cpumask *icpus;
> +
> + sds = rcu_dereference(per_cpu(sd_llc_shared, core));
> + if (!sds)
> + return;
> +
> + icpus = sched_domain_icpus(sds);
> +
> + /*
> + * XXX: The update is racy between different cores.
> + * The non-atomic ops here is a tradeoff of accuracy
> + * for easing the cache traffic.
> + */
> + if (icpu == -1)
> + cpumask_andnot(icpus, icpus, cpu_smt_mask(core));
> + else if (!cpumask_test_cpu(icpu, icpus))
> + __cpumask_set_cpu(icpu, icpus);
> +}
> +
> /*
> * Scans the local SMT mask to see if the entire core is idle, and records this
> * information in sd_llc_shared->has_idle_cores.
> @@ -6340,6 +6362,10 @@ static inline bool test_idle_cpus(int cpu)
> return true;
> }
>
> +static inline void sd_update_icpus(int core, int icpu)
> +{
> +}
> +
> static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> {
> return __select_idle_cpu(core, p);
> @@ -6370,7 +6396,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> if (!this_sd)
> return -1;
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + cpumask_and(cpus, has_idle_core ? sched_domain_span(sd) :
> + sched_domain_icpus(sd->shared), p->cpus_ptr);
>
> if (sched_feat(SIS_PROP) && !has_idle_core) {
> u64 avg_cost, avg_idle, span_avg;
> @@ -8342,6 +8369,7 @@ struct sd_lb_stats {
> unsigned int prefer_sibling; /* tasks should go to sibling first */
>
> int sd_state;
> + int idle_cpu;
>
> struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
> struct sg_lb_stats local_stat; /* Statistics of the local group */
> @@ -8362,6 +8390,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> .total_load = 0UL,
> .total_capacity = 0UL,
> .sd_state = sd_is_busy,
> + .idle_cpu = -1,
> .busiest_stat = {
> .idle_cpus = UINT_MAX,
> .group_type = group_has_spare,
> @@ -8702,10 +8731,18 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> }
>
> -static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq)
> +static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq, int cpu)
> {
> - if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq))
> + if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq)) {
> + /*
> + * Prefer idle cpus than unoccupied ones. This
> + * is achieved by only allowing the idle ones
> + * unconditionally overwrite the preious record
> + * while the occupied ones can't.
> + */
> + sds->idle_cpu = cpu;
> sds->sd_state = sd_has_icpus;
> + }
> }
>
> /**
> @@ -8741,7 +8778,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> sgs->sum_nr_running += nr_running;
>
> if (update_core)
> - sd_classify(sds, rq);
> + sd_classify(sds, rq, i);
>
> if (nr_running > 1)
> *sg_status |= SG_OVERLOAD;
> @@ -8757,7 +8794,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> * No need to call idle_cpu() if nr_running is not 0
> */
> if (!nr_running && idle_cpu(i)) {
> + /*
> + * Prefer the last idle cpu by overwriting
> + * preious one. The first idle cpu in this
> + * domain (if any) can trigger balancing
> + * and fed with tasks, so we'd better choose
> + * a candidate in an opposite way.
> + */
> + sds->idle_cpu = i;
Does it mean only 1 idle CPU in the smt domain would be set in the
idle cpu mask at a time? For SMT4/8 we might lose track of the
idle siblings.
> sgs->idle_cpus++;
> +
> /* Idle cpu can't have misfit task */
> continue;
> }
> @@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>
> static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> {
> - if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
> - set_idle_cpus(env->dst_cpu, true);
> + struct sched_domain_shared *sd_smt_shared = env->sd->shared;
> + enum sd_state new = sds->sd_state;
> + int this = env->dst_cpu;
> +
> + /*
> + * Parallel updating can hardly contribute accuracy to
> + * the filter, besides it can be one of the burdens on
> + * cache traffic.
> + */
> + if (cmpxchg(&sd_smt_shared->updating, 0, 1))
> + return;
> +
> + /*
> + * There is at least one unoccupied cpu available, so
> + * propagate it to the filter to avoid false negative
> + * issue which could result in lost tracking of some
> + * idle cpus thus throughupt downgraded.
> + */
> + if (new != sd_is_busy) {
> + if (!test_idle_cpus(this))
> + set_idle_cpus(this, true);
> + } else {
> + /*
> + * Nothing changes so nothing to update or
> + * propagate.
> + */
> + if (sd_smt_shared->state == sd_is_busy)
> + goto out;
> + }
> +
> + sd_update_icpus(this, sds->idle_cpu);
I wonder if we could further enhance it to facilitate idle CPU scanning.
For example, could we propagate the idle CPUs in the smt domain to its parent
domain in a hierarchical sequence, and finally to the LLC domain? If there is
a cluster domain between the SMT and LLC domains, the cluster domain idle CPU
filter could benefit from this mechanism.
https://lore.kernel.org/lkml/[email protected]/
Furthermore, even if there is no cluster domain, would a 'virtual' middle
sched domain help reduce the contention?
Core0(CPU0,CPU1), Core1(CPU2,CPU3), Core2(CPU4,CPU5), Core3(CPU6,CPU7)
We can create cpumask1, which is composed of Core0 and Core1, and cpumask2,
which is composed of Core2 and Core3. The SIS would first scan cpumask1 and,
if no idle cpu is found, then scan cpumask2. In this way, the CPUs in Core0
and Core1 only update cpumask1, without competing with Core2 and Core3 on
cpumask2.
thanks,
Chenyu
> + sd_smt_shared->state = new;
> +out:
> + xchg(&sd_smt_shared->updating, 0);
> }
>
> /**
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..d3cd7cf5a136 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1641,6 +1641,16 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> atomic_inc(&sd->shared->ref);
> atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> +
> + /*
> + * Initialize SMT domains to be busy, so that we don't
> + * need to propagate idle cpus to LLC domains which are
> + * default to fully busy (no cpus set). This will be
> + * updated in the first load balancing on SMT domains
> + * if necessary.
> + */
> + if (sd->flags & SD_SHARE_CPUCAPACITY)
> + WRITE_ONCE(sd->shared->state, sd_is_busy);
> }
>
> sd->private = sdd;
> @@ -2106,7 +2116,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>
> *per_cpu_ptr(sdd->sd, j) = sd;
>
> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> GFP_KERNEL, cpu_to_node(j));
> if (!sds)
> return -ENOMEM;
> --
> 2.31.1
>
On Sun, Jun 19, 2022 at 08:04:51PM +0800, Abel Wu wrote:
> Now when updating core state, there are two main problems that could
> pollute the SIS filter:
>
> - The updating is before task migration, so if dst_cpu is
> selected to be propagated which might be fed with tasks
> soon, the efforts we paid is no more than setting a busy
> cpu into the SIS filter. While on the other hand it is
> important that we update as early as possible to keep the
> filter fresh, so it's not wise to delay the update to the
> end of load balancing.
>
> - False negative propagation hurts performance since some
> idle cpus could be out of reach. So in general we will
> aggressively propagate idle cpus but allow false positive
> continue to exist for a while, which may lead to filter
> being fully polluted.
>
> Pains can be relieved by a force correction when false positive
> continuously detected.
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> include/linux/sched/topology.h | 7 +++++
> kernel/sched/fair.c | 51 ++++++++++++++++++++++++++++++++--
> 2 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index b93edf587d84..e3552ce192a9 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -91,6 +91,12 @@ struct sched_group;
> * search, and is also used as a fallback state of the other
> * states.
> *
> + * - sd_may_idle
> + * This state implies the unstableness of the SIS filter, and
> + * some bits of it may out of date. This state is only used in
> + * SMT domains as an intermediate state between sd_has_icpus
> + * and sd_is_busy.
> + *
> * - sd_is_busy
> * This state indicates there are no unoccupied cpus in this
> * domain. So for LLC domains, it gives the hint on whether
> @@ -111,6 +117,7 @@ struct sched_group;
> enum sd_state {
> sd_has_icores,
> sd_has_icpus,
> + sd_may_idle,
> sd_is_busy
> };
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d55fdcedf2c0..9713d183d35e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8768,6 +8768,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>
> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> struct rq *rq = cpu_rq(i);
> + bool update = update_core && (env->dst_cpu != i);
>
> sgs->group_load += cpu_load(rq);
> sgs->group_util += cpu_util_cfs(i);
> @@ -8777,7 +8778,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> nr_running = rq->nr_running;
> sgs->sum_nr_running += nr_running;
>
> - if (update_core)
> + /*
> + * The dst_cpu is not preferred since it might
> + * be fed with tasks soon.
> + */
> + if (update)
maybe if (update_core && (env->dst_cpu != i)) so that the comment would
be near the code logic, and meanwhile without introducing an update variable?
> sd_classify(sds, rq, i);
>
> if (nr_running > 1)
> @@ -8801,7 +8806,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> * and fed with tasks, so we'd better choose
> * a candidate in an opposite way.
> */
> - sds->idle_cpu = i;
> + if (update)
> + sds->idle_cpu = i;
> sgs->idle_cpus++;
>
> /* Idle cpu can't have misfit task */
> @@ -9321,7 +9327,7 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> {
> struct sched_domain_shared *sd_smt_shared = env->sd->shared;
> enum sd_state new = sds->sd_state;
> - int this = env->dst_cpu;
> + int icpu = sds->idle_cpu, this = env->dst_cpu;
>
> /*
> * Parallel updating can hardly contribute accuracy to
> @@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> if (cmpxchg(&sd_smt_shared->updating, 0, 1))
> return;
>
> + /*
> + * The dst_cpu is likely to be fed with tasks soon.
> + * If it is the only unoccupied cpu in this domain,
> + * we still handle it the same way as as_has_icpus
> + * but turn the SMT into the unstable state, rather
> + * than waiting to the end of load balancing since
> + * it's also important that update the filter as
> + * early as possible to keep it fresh.
> + */
> + if (new == sd_is_busy) {
> + if (idle_cpu(this) || sched_idle_cpu(this)) {
available_idle_cpu()?
thanks,
Chenyu
> + new = sd_may_idle;
> + icpu = this;
> + }
> + }
> +
> /*
> * There is at least one unoccupied cpu available, so
> * propagate it to the filter to avoid false negative
> @@ -9338,6 +9360,12 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> * idle cpus thus throughupt downgraded.
> */
> if (new != sd_is_busy) {
> + /*
> + * The sd_may_idle state is taken into
> + * consideration as well because from
> + * here we couldn't actually know task
> + * migrations would happen or not.
> + */
> if (!test_idle_cpus(this))
> set_idle_cpus(this, true);
> } else {
> @@ -9347,9 +9375,26 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> */
> if (sd_smt_shared->state == sd_is_busy)
> goto out;
> +
> + /*
> + * Allow false positive to exist for some time
> + * to make a tradeoff of accuracy of the filter
> + * for relieving cache traffic.
> + */
> + if (sd_smt_shared->state == sd_has_icpus) {
> + new = sd_may_idle;
> + goto update;
> + }
> +
> + /*
> + * If the false positive issue has already been
> + * there for a while, a correction of the filter
> + * is needed.
> + */
> }
>
> sd_update_icpus(this, sds->idle_cpu);
> +update:
> sd_smt_shared->state = new;
> out:
> xchg(&sd_smt_shared->updating, 0);
> --
> 2.31.1
>
Hi Chen, thanks for your comments!
On 6/22/22 2:14 AM, Chen Yu Wrote:
>> ...
> Reuse the data from load balance to select the unoccupied candidate
> is applicable IMO, which is also aligned with SIS_UTIL path. And I have
> a question regarding the update frequency. In v3 patch, the update is
> based on periodic tick, which is triggered at most every 1ms(CONFIG_HZ_1000).
> Then the periodic SMT load balance is launched every smt_weight ms, usually 2ms.
> I expect the 2ms is of the same scale unit as 1ms, and since task tick based
> update is acceptable, would excluding the CPU_NEWLY_IDLE balance from this patch
> reduce the overhead meanwhile not bring too much inaccuracy?
I doubt the periodic balancing entry is enough. The ms scale could still
be too large for the wakeup path. Say one cpu becomes newly idle just after
an update; then it stays out of reach until the next update, which is nearly
2ms away (even worse in the SMT4/8 case), wasting time slices doing nothing.
So the newly-idle update is important to keep the filter fresh. And the false
positive correction is there to avoid excessive updates due to newly-idle
balancing, by allowing the false positive cpus to stay in the filter for a
little longer.
>> @@ -8757,7 +8794,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> * No need to call idle_cpu() if nr_running is not 0
>> */
>> if (!nr_running && idle_cpu(i)) {
>> + /*
>> + * Prefer the last idle cpu by overwriting
>> + * preious one. The first idle cpu in this
>> + * domain (if any) can trigger balancing
>> + * and fed with tasks, so we'd better choose
>> + * a candidate in an opposite way.
>> + */
>> + sds->idle_cpu = i;
> Does it mean, only 1 idle CPU in the smt domain would be set to the
> idle cpu mask at one time? For SMT4/8 we might lose track of the
> idle siblings.
Yes. The intention of one-at-a-time propagation is
 1) to help spread load out to different cores
 2) to reduce some update overhead
In this way, if the filter contains several cpus of a core, ideally we
can be sure that at least one of them is actually unoccupied.
For SMT4/8 we still have the newly-idle balance to make things right.
>> sgs->idle_cpus++;
>> +
>> /* Idle cpu can't have misfit task */
>> continue;
>> }
>> @@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>
>> static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> {
>> - if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
>> - set_idle_cpus(env->dst_cpu, true);
>> + struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>> + enum sd_state new = sds->sd_state;
>> + int this = env->dst_cpu;
>> +
>> + /*
>> + * Parallel updating can hardly contribute accuracy to
>> + * the filter, besides it can be one of the burdens on
>> + * cache traffic.
>> + */
>> + if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>> + return;
>> +
>> + /*
>> + * There is at least one unoccupied cpu available, so
>> + * propagate it to the filter to avoid false negative
>> + * issue which could result in lost tracking of some
>> + * idle cpus thus throughupt downgraded.
>> + */
>> + if (new != sd_is_busy) {
>> + if (!test_idle_cpus(this))
>> + set_idle_cpus(this, true);
>> + } else {
>> + /*
>> + * Nothing changes so nothing to update or
>> + * propagate.
>> + */
>> + if (sd_smt_shared->state == sd_is_busy)
>> + goto out;
>> + }
>> +
>> + sd_update_icpus(this, sds->idle_cpu);
> I wonder if we could further enhance it to facilitate idle CPU scan.
> For example, can we propagate the idle CPUs in smt domain, to its parent
> domain in a hierarchic sequence, and finally to the LLC domain. If there is
In fact, it was my first try to cache the unoccupied cpus in SMT
shared domain, but the overhead of cpumask ops seems like a major
stumbling block.
> a cluster domain between SMT and LLC domain, the cluster domain idle CPU filter
> could benefit from this mechanism.
> https://lore.kernel.org/lkml/[email protected]/
Putting SIS into a hierarchical pattern is good for cache locality.
But I don't think multi-level filter is appropriate since it could
bring too much cache traffic in SIS, and it could be expected to be
a disaster for netperf/tbench or the workloads suffering frequent
context switches.
>
> Furthermore, even if there is no cluster domain, would a 'virtual' middle
> sched domain would help reduce the contention?
> Core0(CPU0,CPU1),Core1(CPU2,CPU3) Core2(CPU4,CPU5) Core3(CPU6,CPU7)
> We can create cpumask1, which is composed of Core0 and Core1, and cpumask2
> which is composed of Core2 and Core3. The SIS would first scan in cpumask1,
> if idle cpu is not found, scan cpumask2. In this way, the CPUs in Core0 and
> Core1 only updates cpumask1, without competing with Core2 and Core3 on cpumask2.
>
Yes, this is the best case, but the worst case is something that
we probably can't afford.
Thanks & BR,
Abel
On 6/22/22 2:23 AM, Chen Yu Wrote:
> On Sun, Jun 19, 2022 at 08:04:51PM +0800, Abel Wu wrote:
>> ...
>> @@ -8777,7 +8778,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> nr_running = rq->nr_running;
>> sgs->sum_nr_running += nr_running;
>>
>> - if (update_core)
>> + /*
>> + * The dst_cpu is not preferred since it might
>> + * be fed with tasks soon.
>> + */
>> + if (update)
> maybe if (update_core && (env->dst_cpu != i)) so that the comment would
> be near the code logic and meanwhile without introducing a update variable?
Makes sense.
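For illustration, folding the check as suggested might look like the
fragment below, taking the sd_classify() call site as it appears earlier
in this series (a sketch only, not the actual next revision):

	/*
	 * The dst_cpu is not preferred since it might
	 * be fed with tasks soon.
	 */
	if (update_core && env->dst_cpu != i)
		sd_classify(sds, rq);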
>> ...
>> @@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>> return;
>>
>> + /*
>> + * The dst_cpu is likely to be fed with tasks soon.
>> + * If it is the only unoccupied cpu in this domain,
>> + * we still handle it the same way as as_has_icpus
>> + * but turn the SMT into the unstable state, rather
>> + * than waiting to the end of load balancing since
>> + * it's also important that update the filter as
>> + * early as possible to keep it fresh.
>> + */
>> + if (new == sd_is_busy) {
>> + if (idle_cpu(this) || sched_idle_cpu(this)) {
> available_idle_cpu()?
>
It is used for choosing an idle cpu that will be used immediately, so it
is generally called inside the wakeup path. But here we just want to know
the idle state of the cpus (and later, inside the wakeup path, these cpus
will still be re-checked to see if they are preempted).
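For reference, available_idle_cpu() only adds a vcpu-preemption check on
top of idle_cpu(); roughly like below (paraphrased from kernel/sched/core.c,
exact details may vary between kernel versions):

int available_idle_cpu(int cpu)
{
	if (!idle_cpu(cpu))
		return 0;

	if (vcpu_is_preempted(cpu))
		return 0;

	return 1;
}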
Thanks & BR,
Abel
On 6/22/22 2:14 AM, Chen Yu Wrote:
> Hi Abel,
> On Sun, Jun 19, 2022 at 08:04:50PM +0800, Abel Wu wrote:
>> Try to improve efficiency of SIS domain search by filtering out busy
>> cores, and as a result the more overloaded the system is, the less
>> cpus will be scanned.
>>
>> The filter is supposed to contain unoccupied cpus of the LLC. And we
>> propagate these cpus to the filter one at a time at core granule.
>> This can help spreading load to different cores given that the search
>> depth is limited. The chosen cpu to be propagated is guaranteed to be
>> unoccupied at that time.
>>
>> When idle cpu exists, the last one is preferred in order not to
>> conflict with periodic load balancing during which the first idle
>> cpu (if any) is chosen to be fed with tasks.
>>
> Maybe the content in cover letter could be copied into this git log for
> future reference?
Definitely. Will fix this in next version.
Thanks.
On Wed, Jun 22, 2022 at 11:52:19AM +0800, Abel Wu wrote:
> Hi Chen, thanks for your comments!
>
> On 6/22/22 2:14 AM, Chen Yu Wrote:
> > > ...
> > Reuse the data from load balance to select the unoccupied candidate
> > is applicable IMO, which is also aligned with SIS_UTIL path. And I have
> > a question regarding the update frequency. In v3 patch, the update is
> > based on periodic tick, which is triggered at most every 1ms(CONFIG_HZ_1000).
> > Then the periodic SMT load balance is launched every smt_weight ms, usually 2ms.
> > I expect the 2ms is of the same scale unit as 1ms, and since task tick based
> > update is acceptable, would excluding the CPU_NEWLY_IDLE balance from this patch
> > reduce the overhead meanwhile not bring too much inaccuracy?
>
> I doubt the periodic balancing entry alone is enough. The ms-scale could
> still be too large for the wakeup path. Say one cpu becomes newly idle just
> after an update; it then stays invisible to the filter until the next
> update, which is nearly 2ms away (even worse in the SMT4/8 case), wasting
> time slices doing nothing. So the newly-idle update is important to keep
> the filter fresh. And the false positive correction is there to avoid
> excessive updates due to newly idle, by allowing the false positive cpus
> to stay in the filter for a little longer.
>
> > > @@ -8757,7 +8794,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > > * No need to call idle_cpu() if nr_running is not 0
> > > */
> > > if (!nr_running && idle_cpu(i)) {
> > > + /*
> > > + * Prefer the last idle cpu by overwriting
> > > + * preious one. The first idle cpu in this
> > > + * domain (if any) can trigger balancing
> > > + * and fed with tasks, so we'd better choose
> > > + * a candidate in an opposite way.
> > > + */
> > > + sds->idle_cpu = i;
> > Does it mean, only 1 idle CPU in the smt domain would be set to the
> > idle cpu mask at one time? For SMT4/8 we might lose track of the
> > idle siblings.
>
> Yes. The intention of one-at-a-time propagation is
> 1) help spreading out load to different cores
> 2) reduce some update overhead
> In this way, if the filter contains several cpus of a core, ideally
> we can be sure that at least one of them is actually unoccupied.
> For SMT4/8 we still have newly idle balance to make things right.
>
> > > sgs->idle_cpus++;
> > > +
> > > /* Idle cpu can't have misfit task */
> > > continue;
> > > }
> > > @@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > > static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> > > {
> > > - if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
> > > - set_idle_cpus(env->dst_cpu, true);
> > > + struct sched_domain_shared *sd_smt_shared = env->sd->shared;
> > > + enum sd_state new = sds->sd_state;
> > > + int this = env->dst_cpu;
> > > +
> > > + /*
> > > + * Parallel updating can hardly contribute accuracy to
> > > + * the filter, besides it can be one of the burdens on
> > > + * cache traffic.
> > > + */
> > > + if (cmpxchg(&sd_smt_shared->updating, 0, 1))
> > > + return;
> > > +
> > > + /*
> > > + * There is at least one unoccupied cpu available, so
> > > + * propagate it to the filter to avoid false negative
> > > + * issue which could result in lost tracking of some
> > > + * idle cpus thus throughupt downgraded.
> > > + */
> > > + if (new != sd_is_busy) {
> > > + if (!test_idle_cpus(this))
> > > + set_idle_cpus(this, true);
> > > + } else {
> > > + /*
> > > + * Nothing changes so nothing to update or
> > > + * propagate.
> > > + */
> > > + if (sd_smt_shared->state == sd_is_busy)
> > > + goto out;
> > > + }
> > > +
> > > + sd_update_icpus(this, sds->idle_cpu);
> > I wonder if we could further enhance it to facilitate idle CPU scan.
> > For example, can we propagate the idle CPUs in smt domain, to its parent
> > domain in a hierarchic sequence, and finally to the LLC domain. If there is
>
> In fact, it was my first try to cache the unoccupied cpus in SMT
> shared domain, but the overhead of cpumask ops seems like a major
> stumbling block.
>
> > a cluster domain between SMT and LLC domain, the cluster domain idle CPU filter
> > could benefit from this mechanism.
> > https://lore.kernel.org/lkml/[email protected]/
>
> Putting SIS into a hierarchical pattern is good for cache locality.
> But I don't think multi-level filter is appropriate since it could
> bring too much cache traffic in SIS,
Could you please elaborate a little more about the cache traffic? I thought we
don't save the unoccupied cpus in SMT shared domain, but to store it in middle
layer shared domain, say, cluster->idle_cpus, this would reduce cache write
contention compared to writing to llc->idle_cpus directly, because a smaller
set of CPUs share the idle_cpus filter. Similarly, SIS can only scan the cluster->idle_cpus
first, without having to query the llc->idle_cpus. This looks like splitting
a big lock into fine grain small lock.
> and it could be expected to be
> a disaster for netperf/tbench or the workloads suffering frequent
> context switches.
>
So this overhead comes from the NEWLY_IDLE case?
thanks,
Chenyu
> >
> > Furthermore, even if there is no cluster domain, would a 'virtual' middle
> > sched domain would help reduce the contention?
> > Core0(CPU0,CPU1),Core1(CPU2,CPU3) Core2(CPU4,CPU5) Core3(CPU6,CPU7)
> > We can create cpumask1, which is composed of Core0 and Core1, and cpumask2
> > which is composed of Core2 and Core3. The SIS would first scan in cpumask1,
> > if idle cpu is not found, scan cpumask2. In this way, the CPUs in Core0 and
> > Core1 only updates cpumask1, without competing with Core2 and Core3 on cpumask2.
> >
> Yes, this is the best case, but the worst case is something that
> we probably can't afford.
>
> Thanks & BR,
> Abel
On 6/24/22 11:30 AM, Chen Yu Wrote:
>> ...
>>>> @@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>> static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>>>> {
>>>> - if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
>>>> - set_idle_cpus(env->dst_cpu, true);
>>>> + struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>>>> + enum sd_state new = sds->sd_state;
>>>> + int this = env->dst_cpu;
>>>> +
>>>> + /*
>>>> + * Parallel updating can hardly contribute accuracy to
>>>> + * the filter, besides it can be one of the burdens on
>>>> + * cache traffic.
>>>> + */
>>>> + if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>>>> + return;
>>>> +
>>>> + /*
>>>> + * There is at least one unoccupied cpu available, so
>>>> + * propagate it to the filter to avoid false negative
>>>> + * issue which could result in lost tracking of some
>>>> + * idle cpus thus throughupt downgraded.
>>>> + */
>>>> + if (new != sd_is_busy) {
>>>> + if (!test_idle_cpus(this))
>>>> + set_idle_cpus(this, true);
>>>> + } else {
>>>> + /*
>>>> + * Nothing changes so nothing to update or
>>>> + * propagate.
>>>> + */
>>>> + if (sd_smt_shared->state == sd_is_busy)
>>>> + goto out;
>>>> + }
>>>> +
>>>> + sd_update_icpus(this, sds->idle_cpu);
>>> I wonder if we could further enhance it to facilitate idle CPU scan.
>>> For example, can we propagate the idle CPUs in smt domain, to its parent
>>> domain in a hierarchic sequence, and finally to the LLC domain. If there is
>>
>> In fact, it was my first try to cache the unoccupied cpus in SMT
>> shared domain, but the overhead of cpumask ops seems like a major
>> stumbling block.
>>
>>> a cluster domain between SMT and LLC domain, the cluster domain idle CPU filter
>>> could benefit from this mechanism.
>>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Putting SIS into a hierarchical pattern is good for cache locality.
>> But I don't think multi-level filter is appropriate since it could
>> bring too much cache traffic in SIS,
> Could you please elaborate a little more about the cache traffic? I thought we
> don't save the unoccupied cpus in SMT shared domain, but to store it in middle
> layer shared domain, say, cluster->idle_cpus, this would reduce cache write
> contention compared to writing to llc->idle_cpus directly, because a smaller
> set of CPUs share the idle_cpus filter. Similarly, SIS can only scan the cluster->idle_cpus
> first, without having to query the llc->idle_cpus. This looks like splitting
> a big lock into fine grain small lock.
I'm afraid I didn't quite follow.. Did you mean replace the LLC filter
with multiple cluster filters? Then I agree with what you suggested
that the contention would be reduced. But there are other concerns:
  a. Is it appropriate to fake an intermediate sched_domain if the
     cluster level isn't available? How to identify the proper size
     of the faked sched_domain?
  b. The SIS path might touch more cachelines (multiple cluster
     filters). I'm not sure how much impact that has.
Whatever, this seems worth a try. :)
>> and it could be expected to be
>> a disaster for netperf/tbench or the workloads suffering frequent
>> context switches.
>>
> So this overhead comes from the NEWLY_IDLE case?
>
Yes, I think it's the main cause of raising the contention to new heights.
But it's also important to keep the filter fresh.
Thanks & BR,
Abel
On Sun, Jun 19, 2022 at 5:05 AM Abel Wu <[email protected]> wrote:
>
> The prev cpu is checked at the beginning of SIS, and it's unlikely to be
> idle before the second check in select_idle_smt(). So we'd better focus
> only on its SMT siblings.
>
> Signed-off-by: Abel Wu <[email protected]>
Reviewed-by: Josh Don <[email protected]>
On Sun, Jun 19, 2022 at 5:05 AM Abel Wu <[email protected]> wrote:
>
> If two cpus share LLC cache, then the cores they belong to are also in
> the same LLC domain.
>
> Signed-off-by: Abel Wu <[email protected]>
Reviewed-by: Josh Don <[email protected]>
On Sun, Jun 19, 2022 at 5:05 AM Abel Wu <[email protected]> wrote:
>
> The function only gets called when sds->has_idle_cores is true which can
> be possible only when sched_smt_present is enabled.
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> kernel/sched/fair.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aba1dad19574..1cc86e76e38e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6256,9 +6256,6 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> bool idle = true;
> int cpu;
>
> - if (!static_branch_likely(&sched_smt_present))
> - return __select_idle_cpu(core, p);
> -
The static branch is basically free; although you're right that we
currently don't take !smt_present branch direction here, it doesn't
seem harmful to leave this check in case assumptions change about when
we call select_idle_core().
> for_each_cpu(cpu, cpu_smt_mask(core)) {
> if (!available_idle_cpu(cpu)) {
> idle = false;
> --
> 2.31.1
>
On Sun, Jun 19, 2022 at 5:05 AM Abel Wu <[email protected]> wrote:
>
> If a full scan on SIS domain failed, then no unoccupied cpus available
> and the LLC is fully busy. In this case we'd better spend the time on
> something more useful, rather than wasting it trying to find an idle
> cpu that probably not exist.
>
> The fully busy status will be re-evaluated when any core of this LLC
> domain enters load balancing, and cleared once idle cpus found.
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> include/linux/sched/topology.h | 35 ++++++++++++++-
> kernel/sched/fair.c | 82 +++++++++++++++++++++++++++++-----
> 2 files changed, 104 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..3e99ac98d766 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -77,10 +77,43 @@ extern int sched_domain_level_max;
>
> struct sched_group;
>
> +/*
> + * States of the sched-domain
> + *
> + * - sd_has_icores
> + * This state is only used in LLC domains to indicate worthy
> + * of a full scan in SIS due to idle cores available.
> + *
> + * - sd_has_icpus
> + * This state indicates that unoccupied (sched-idle/idle) cpus
> + * might exist in this domain. For the LLC domains it is the
> + * default state since these cpus are the main targets of SIS
> + * search, and is also used as a fallback state of the other
> + * states.
> + *
> + * - sd_is_busy
> + * This state indicates there are no unoccupied cpus in this
> + * domain. So for LLC domains, it gives the hint on whether
> + * we should put efforts on the SIS search or not.
> + *
> + * For LLC domains, sd_has_icores is set when the last non-idle cpu of
> + * a core becomes idle. After a full SIS scan and if no idle cores found,
> + * sd_has_icores must be cleared and the state will be set to sd_has_icpus
> + * or sd_is_busy depending on whether there is any idle cpu. And during
> + * load balancing on each SMT domain inside the LLC, the state will be
> + * re-evaluated and switch from sd_is_busy to sd_has_icpus if idle cpus
> + * exist.
> + */
> +enum sd_state {
> + sd_has_icores,
> + sd_has_icpus,
> + sd_is_busy
> +};
> +
> struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> - int has_idle_cores;
> + int state; /* see enum sd_state */
nit: s/int/enum sd_state
> };
>
> struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1cc86e76e38e..2ca37fdc6c4d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5642,11 +5642,15 @@ static inline void update_overutilized_status(struct rq *rq)
> static inline void update_overutilized_status(struct rq *rq) { }
> #endif
>
> +static int unoccupied_rq(struct rq *rq)
> +{
> + return rq->nr_running == rq->cfs.idle_h_nr_running;
> +}
nit: static inline int
> +
> /* Runqueue only has SCHED_IDLE tasks enqueued */
> static int sched_idle_rq(struct rq *rq)
> {
> - return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
> - rq->nr_running);
> + return unlikely(rq->nr_running && unoccupied_rq(rq));
> }
>
> /*
> @@ -6197,24 +6201,44 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> EXPORT_SYMBOL_GPL(sched_smt_present);
>
> -static inline void set_idle_cores(int cpu, int val)
> +static inline void sd_set_state(int cpu, enum sd_state state)
> {
> struct sched_domain_shared *sds;
>
> sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> if (sds)
> - WRITE_ONCE(sds->has_idle_cores, val);
> + WRITE_ONCE(sds->state, state);
> }
>
> -static inline bool test_idle_cores(int cpu)
> +static inline enum sd_state sd_get_state(int cpu)
> {
> struct sched_domain_shared *sds;
>
> sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> if (sds)
> - return READ_ONCE(sds->has_idle_cores);
> + return READ_ONCE(sds->state);
>
> - return false;
> + return sd_has_icpus;
> +}
Why is default not sd_is_busy?
> +
> +static inline void set_idle_cores(int cpu, int idle)
nit: Slightly confusing to call the param 'idle', since in the case it
is false we still mark icpus. Consider possibly 'core_idle'.
> +{
> + sd_set_state(cpu, idle ? sd_has_icores : sd_has_icpus);
> +}
> +
> +static inline bool test_idle_cores(int cpu)
> +{
> + return sd_get_state(cpu) == sd_has_icores;
> +}
> +
> +static inline void set_idle_cpus(int cpu, int idle)
> +{
> + sd_set_state(cpu, idle ? sd_has_icpus : sd_is_busy);
> +}
> +
> +static inline bool test_idle_cpus(int cpu)
> +{
> + return sd_get_state(cpu) != sd_is_busy;
> }
>
> /*
> @@ -6298,7 +6322,7 @@ static int select_idle_smt(struct task_struct *p, int target)
>
> #else /* CONFIG_SCHED_SMT */
>
> -static inline void set_idle_cores(int cpu, int val)
> +static inline void set_idle_cores(int cpu, int idle)
> {
> }
>
> @@ -6307,6 +6331,15 @@ static inline bool test_idle_cores(int cpu)
> return false;
> }
>
> +static inline void set_idle_cpus(int cpu, int idle)
> +{
> +}
> +
> +static inline bool test_idle_cpus(int cpu)
> +{
> + return true;
> +}
> +
> static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> {
> return __select_idle_cpu(core, p);
> @@ -6382,7 +6415,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> }
> }
>
> - if (has_idle_core)
> + if (idle_cpu == -1)
> + set_idle_cpus(target, false);
> + else if (has_idle_core)
> set_idle_cores(target, false);
>
> if (sched_feat(SIS_PROP) && !has_idle_core) {
> @@ -6538,6 +6573,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if ((unsigned int)i < nr_cpumask_bits)
> return i;
> }
> +
> + if (!has_idle_core && !test_idle_cpus(target))
> + return target;
> }
>
> i = select_idle_cpu(p, sd, has_idle_core, target);
> @@ -8303,6 +8341,8 @@ struct sd_lb_stats {
> unsigned long avg_load; /* Average load across all groups in sd */
> unsigned int prefer_sibling; /* tasks should go to sibling first */
>
> + int sd_state;
> +
> struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
> struct sg_lb_stats local_stat; /* Statistics of the local group */
> };
> @@ -8321,6 +8361,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> .local = NULL,
> .total_load = 0UL,
> .total_capacity = 0UL,
> + .sd_state = sd_is_busy,
> .busiest_stat = {
> .idle_cpus = UINT_MAX,
> .group_type = group_has_spare,
> @@ -8661,6 +8702,12 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> }
>
> +static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq)
> +{
> + if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq))
> + sds->sd_state = sd_has_icpus;
> +}
> +
> /**
> * update_sg_lb_stats - Update sched_group's statistics for load balancing.
> * @env: The load balancing environment.
> @@ -8675,11 +8722,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> struct sg_lb_stats *sgs,
> int *sg_status)
> {
> - int i, nr_running, local_group;
> + int i, nr_running, local_group, update_core;
>
> memset(sgs, 0, sizeof(*sgs));
>
> local_group = group == sds->local;
> + update_core = env->sd->flags & SD_SHARE_CPUCAPACITY;
Nothing special about SD_SHARE_CPUCAPACITY here other than you want to
do the update early on at the lowest domain level during balancing
right?
> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> struct rq *rq = cpu_rq(i);
> @@ -8692,6 +8740,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> nr_running = rq->nr_running;
> sgs->sum_nr_running += nr_running;
>
> + if (update_core)
> + sd_classify(sds, rq);
> +
> if (nr_running > 1)
> *sg_status |= SG_OVERLOAD;
>
> @@ -9220,6 +9271,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> return idlest;
> }
>
> +static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> +{
> + if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
> + set_idle_cpus(env->dst_cpu, true);
> +}
We're only setting state to has_icpus here in sd_update_state. That
doesn't feel good enough, since we're only updating state for
env->dst_cpu; all the other per-cpu state will remain stale (ie.
falsely sd_is_busy).
I think you also want a case in __update_idle_core() to call
set_idle_cores(core, 0) in the case where we have a non-idle sibling,
since we want to at least mark has_icpus even if the entire core isn't
idle.
Still, that doesn't feel quite good enough, since we're only updating
the per_cpu sd state for the given cpu. That seems like it will
frequently leave us with idle cpus, and select_idle_sibling() skipping
select_idle_cpu due to a false negative from test_idle_cpus(). Am I
missing something there?
> +
> /**
> * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
> * @env: The load balancing environment.
> @@ -9270,8 +9327,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> /* Tag domain that child domain prefers tasks go to siblings first */
> sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>
> -
> - if (env->sd->flags & SD_NUMA)
> + if (env->sd->flags & SD_SHARE_CPUCAPACITY)
> + sd_update_state(env, sds);
> + else if (env->sd->flags & SD_NUMA)
> env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>
> if (!env->sd->parent) {
> --
> 2.31.1
>
On 6/28/22 7:42 AM, Josh Don Wrote:
> On Sun, Jun 19, 2022 at 5:05 AM Abel Wu <[email protected]> wrote:
>>
>> The function only gets called when sds->has_idle_cores is true which can
>> be possible only when sched_smt_present is enabled.
>>
>> Signed-off-by: Abel Wu <[email protected]>
>> ---
>> kernel/sched/fair.c | 3 ---
>> 1 file changed, 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index aba1dad19574..1cc86e76e38e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6256,9 +6256,6 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>> bool idle = true;
>> int cpu;
>>
>> - if (!static_branch_likely(&sched_smt_present))
>> - return __select_idle_cpu(core, p);
>> -
>
> The static branch is basically free; although you're right that we
> currently don't take !smt_present branch direction here, it doesn't
> seem harmful to leave this check in case assumptions change about when
> we call select_idle_core().
I was thinking it would be better to align with select_idle_smt(), where
the caller does the check if necessary.
>
>> for_each_cpu(cpu, cpu_smt_mask(core)) {
>> if (!available_idle_cpu(cpu)) {
>> idle = false;
>> --
>> 2.31.1
>>
On 6/28/22 8:28 AM, Josh Don Wrote:
> On Sun, Jun 19, 2022 at 5:05 AM Abel Wu <[email protected]> wrote:
>>
>> If a full scan on SIS domain failed, then no unoccupied cpus available
>> and the LLC is fully busy. In this case we'd better spend the time on
>> something more useful, rather than wasting it trying to find an idle
>> cpu that probably not exist.
>>
>> The fully busy status will be re-evaluated when any core of this LLC
>> domain enters load balancing, and cleared once idle cpus found.
>>
>> Signed-off-by: Abel Wu <[email protected]>
>> ---
>> include/linux/sched/topology.h | 35 ++++++++++++++-
>> kernel/sched/fair.c | 82 +++++++++++++++++++++++++++++-----
>> 2 files changed, 104 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 56cffe42abbc..3e99ac98d766 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -77,10 +77,43 @@ extern int sched_domain_level_max;
>>
>> struct sched_group;
>>
>> +/*
>> + * States of the sched-domain
>> + *
>> + * - sd_has_icores
>> + * This state is only used in LLC domains to indicate worthy
>> + * of a full scan in SIS due to idle cores available.
>> + *
>> + * - sd_has_icpus
>> + * This state indicates that unoccupied (sched-idle/idle) cpus
>> + * might exist in this domain. For the LLC domains it is the
>> + * default state since these cpus are the main targets of SIS
>> + * search, and is also used as a fallback state of the other
>> + * states.
>> + *
>> + * - sd_is_busy
>> + * This state indicates there are no unoccupied cpus in this
>> + * domain. So for LLC domains, it gives the hint on whether
>> + * we should put efforts on the SIS search or not.
>> + *
>> + * For LLC domains, sd_has_icores is set when the last non-idle cpu of
>> + * a core becomes idle. After a full SIS scan and if no idle cores found,
>> + * sd_has_icores must be cleared and the state will be set to sd_has_icpus
>> + * or sd_is_busy depending on whether there is any idle cpu. And during
>> + * load balancing on each SMT domain inside the LLC, the state will be
>> + * re-evaluated and switch from sd_is_busy to sd_has_icpus if idle cpus
>> + * exist.
>> + */
>> +enum sd_state {
>> + sd_has_icores,
>> + sd_has_icpus,
>> + sd_is_busy
>> +};
>> +
>> struct sched_domain_shared {
>> atomic_t ref;
>> atomic_t nr_busy_cpus;
>> - int has_idle_cores;
>> + int state; /* see enum sd_state */
>
> nit: s/int/enum sd_state
Will fix in next version.
>
>> };
>>
>> struct sched_domain {
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1cc86e76e38e..2ca37fdc6c4d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5642,11 +5642,15 @@ static inline void update_overutilized_status(struct rq *rq)
>> static inline void update_overutilized_status(struct rq *rq) { }
>> #endif
>>
>> +static int unoccupied_rq(struct rq *rq)
>> +{
>> + return rq->nr_running == rq->cfs.idle_h_nr_running;
>> +}
>
> nit: static inline int
Will fix.
>
>> +
>> /* Runqueue only has SCHED_IDLE tasks enqueued */
>> static int sched_idle_rq(struct rq *rq)
>> {
>> - return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
>> - rq->nr_running);
>> + return unlikely(rq->nr_running && unoccupied_rq(rq));
>> }
>>
>> /*
>> @@ -6197,24 +6201,44 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>> DEFINE_STATIC_KEY_FALSE(sched_smt_present);
>> EXPORT_SYMBOL_GPL(sched_smt_present);
>>
>> -static inline void set_idle_cores(int cpu, int val)
>> +static inline void sd_set_state(int cpu, enum sd_state state)
>> {
>> struct sched_domain_shared *sds;
>>
>> sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> if (sds)
>> - WRITE_ONCE(sds->has_idle_cores, val);
>> + WRITE_ONCE(sds->state, state);
>> }
>>
>> -static inline bool test_idle_cores(int cpu)
>> +static inline enum sd_state sd_get_state(int cpu)
>> {
>> struct sched_domain_shared *sds;
>>
>> sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> if (sds)
>> - return READ_ONCE(sds->has_idle_cores);
>> + return READ_ONCE(sds->state);
>>
>> - return false;
>> + return sd_has_icpus;
>> +}
>
> Why is default not sd_is_busy?
The state of sd_is_busy will prevent us from searching the LLC. By
design, both sd_has_icores and sd_is_busy indicate deterministic
status: has idle cores / no idle cpu exists. While sd_has_icpus is
not deterministic, it means there could be unoccupied cpus.
The naming seems misleading, it would be nice to have other options.
>
>> +
>> +static inline void set_idle_cores(int cpu, int idle)
>
> nit: Slightly confusing to call the param 'idle', since in the case it
> is false we still mark icpus. Consider possibly 'core_idle'.
What about changing the param 'cpu' to 'core'?
>
>> +{
>> + sd_set_state(cpu, idle ? sd_has_icores : sd_has_icpus);
>> +}
>> +
>> +static inline bool test_idle_cores(int cpu)
>> +{
>> + return sd_get_state(cpu) == sd_has_icores;
>> +}
>> +
>> +static inline void set_idle_cpus(int cpu, int idle)
>> +{
>> + sd_set_state(cpu, idle ? sd_has_icpus : sd_is_busy);
>> +}
>> +
>> +static inline bool test_idle_cpus(int cpu)
>> +{
>> + return sd_get_state(cpu) != sd_is_busy;
>> }
>>
>> /*
>> @@ -6298,7 +6322,7 @@ static int select_idle_smt(struct task_struct *p, int target)
>>
>> #else /* CONFIG_SCHED_SMT */
>>
>> -static inline void set_idle_cores(int cpu, int val)
>> +static inline void set_idle_cores(int cpu, int idle)
>> {
>> }
>>
>> @@ -6307,6 +6331,15 @@ static inline bool test_idle_cores(int cpu)
>> return false;
>> }
>>
>> +static inline void set_idle_cpus(int cpu, int idle)
>> +{
>> +}
>> +
>> +static inline bool test_idle_cpus(int cpu)
>> +{
>> + return true;
>> +}
>> +
>> static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
>> {
>> return __select_idle_cpu(core, p);
>> @@ -6382,7 +6415,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> }
>> }
>>
>> - if (has_idle_core)
>> + if (idle_cpu == -1)
>> + set_idle_cpus(target, false);
>> + else if (has_idle_core)
>> set_idle_cores(target, false);
>>
>> if (sched_feat(SIS_PROP) && !has_idle_core) {
>> @@ -6538,6 +6573,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> if ((unsigned int)i < nr_cpumask_bits)
>> return i;
>> }
>> +
>> + if (!has_idle_core && !test_idle_cpus(target))
>> + return target;
>> }
>>
>> i = select_idle_cpu(p, sd, has_idle_core, target);
>> @@ -8303,6 +8341,8 @@ struct sd_lb_stats {
>> unsigned long avg_load; /* Average load across all groups in sd */
>> unsigned int prefer_sibling; /* tasks should go to sibling first */
>>
>> + int sd_state;
>> +
>> struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
>> struct sg_lb_stats local_stat; /* Statistics of the local group */
>> };
>> @@ -8321,6 +8361,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>> .local = NULL,
>> .total_load = 0UL,
>> .total_capacity = 0UL,
>> + .sd_state = sd_is_busy,
>> .busiest_stat = {
>> .idle_cpus = UINT_MAX,
>> .group_type = group_has_spare,
>> @@ -8661,6 +8702,12 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
>> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
>> }
>>
>> +static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq)
>> +{
>> + if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq))
>> + sds->sd_state = sd_has_icpus;
>> +}
>> +
>> /**
>> * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>> * @env: The load balancing environment.
>> @@ -8675,11 +8722,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> struct sg_lb_stats *sgs,
>> int *sg_status)
>> {
>> - int i, nr_running, local_group;
>> + int i, nr_running, local_group, update_core;
>>
>> memset(sgs, 0, sizeof(*sgs));
>>
>> local_group = group == sds->local;
>> + update_core = env->sd->flags & SD_SHARE_CPUCAPACITY;
>
> Nothing special about SD_SHARE_CPUCAPACITY here other than you want to
> do the update early on at the lowest domain level during balancing
> right?
I'm not sure what you are suggesting; the only domain of interest here
is the SMT domain. It contains all the information we need without
irrelevant data.
>
>
>> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>> struct rq *rq = cpu_rq(i);
>> @@ -8692,6 +8740,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> nr_running = rq->nr_running;
>> sgs->sum_nr_running += nr_running;
>>
>> + if (update_core)
>> + sd_classify(sds, rq);
>> +
>> if (nr_running > 1)
>> *sg_status |= SG_OVERLOAD;
>>
>> @@ -9220,6 +9271,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>> return idlest;
>> }
>>
>> +static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> +{
>> + if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
>> + set_idle_cpus(env->dst_cpu, true);
>> +}
>
> We're only setting state to has_icpus here in sd_update_state. That
> doesn't feel good enough, since we're only updating state for
> env->dst_cpu; all the other per-cpu state will remain stale (ie.
> falsely sd_is_busy).
It's LLC-wide shared :)
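That is, sds->state lives in the LLC-wide sched_domain_shared that every
cpu's sd_llc_shared points at, so one write is visible to the whole LLC.
A minimal sketch (the helper name is invented just for illustration):

static bool same_llc_shared(int cpu_a, int cpu_b)
{
	struct sched_domain_shared *a, *b;
	bool same;

	rcu_read_lock();
	a = rcu_dereference(per_cpu(sd_llc_shared, cpu_a));
	b = rcu_dereference(per_cpu(sd_llc_shared, cpu_b));
	same = a && a == b;	/* true when both cpus hang off the same LLC */
	rcu_read_unlock();

	return same;
}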
>
> I think you also want a case in __update_idle_core() to call
> set_idle_cores(core, 0) in the case where we have a non-idle sibling,
> since we want to at least mark has_icpus even if the entire core isn't
> idle.
>
> Still, that doesn't feel quite good enough, since we're only updating
> the per_cpu sd state for the given cpu. That seems like it will
> frequently leave us with idle cpus, and select_idle_sibling() skipping
> select_idle_cpu due to a false negative from test_idle_cpus(). Am I
> missing something there?
Until this patch, the sd_smt_shared->state is not used.
>
>> +
>> /**
>> * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
>> * @env: The load balancing environment.
>> @@ -9270,8 +9327,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>> /* Tag domain that child domain prefers tasks go to siblings first */
>> sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>>
>> -
>> - if (env->sd->flags & SD_NUMA)
>> + if (env->sd->flags & SD_SHARE_CPUCAPACITY)
>> + sd_update_state(env, sds);
>> + else if (env->sd->flags & SD_NUMA)
>> env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>>
>> if (!env->sd->parent) {
>> --
>> 2.31.1
>>
On 6/27/22 6:13 PM, Abel Wu Wrote:
>
> On 6/24/22 11:30 AM, Chen Yu Wrote:
>>> ...
>>>>> @@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd,
>>>>> struct task_struct *p, int this_cpu)
>>>>> static void sd_update_state(struct lb_env *env, struct
>>>>> sd_lb_stats *sds)
>>>>> {
>>>>> - if (sds->sd_state == sd_has_icpus &&
>>>>> !test_idle_cpus(env->dst_cpu))
>>>>> - set_idle_cpus(env->dst_cpu, true);
>>>>> + struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>>>>> + enum sd_state new = sds->sd_state;
>>>>> + int this = env->dst_cpu;
>>>>> +
>>>>> + /*
>>>>> + * Parallel updating can hardly contribute accuracy to
>>>>> + * the filter, besides it can be one of the burdens on
>>>>> + * cache traffic.
>>>>> + */
>>>>> + if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>>>>> + return;
>>>>> +
>>>>> + /*
>>>>> + * There is at least one unoccupied cpu available, so
>>>>> + * propagate it to the filter to avoid false negative
>>>>> + * issue which could result in lost tracking of some
>>>>> + * idle cpus thus throughupt downgraded.
>>>>> + */
>>>>> + if (new != sd_is_busy) {
>>>>> + if (!test_idle_cpus(this))
>>>>> + set_idle_cpus(this, true);
>>>>> + } else {
>>>>> + /*
>>>>> + * Nothing changes so nothing to update or
>>>>> + * propagate.
>>>>> + */
>>>>> + if (sd_smt_shared->state == sd_is_busy)
>>>>> + goto out;
>>>>> + }
>>>>> +
>>>>> + sd_update_icpus(this, sds->idle_cpu);
>>>> I wonder if we could further enhance it to facilitate idle CPU scan.
>>>> For example, can we propagate the idle CPUs in smt domain, to its
>>>> parent
>>>> domain in a hierarchic sequence, and finally to the LLC domain. If
>>>> there is
>>>
>>> In fact, it was my first try to cache the unoccupied cpus in SMT
>>> shared domain, but the overhead of cpumask ops seems like a major
>>> stumbling block.
>>>
>>>> a cluster domain between SMT and LLC domain, the cluster domain idle
>>>> CPU filter
>>>> could benefit from this mechanism.
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>
>>> Putting SIS into a hierarchical pattern is good for cache locality.
>>> But I don't think multi-level filter is appropriate since it could
>>> bring too much cache traffic in SIS,
>> Could you please elaborate a little more about the cache traffic? I
>> thought we
>> don't save the unoccupied cpus in SMT shared domain, but to store it
>> in middle
>> layer shared domain, say, cluster->idle_cpus, this would reduce cache
>> write
>> contention compared to writing to llc->idle_cpus directly, because a
>> smaller
>> set of CPUs share the idle_cpus filter. Similarly, SIS can only scan
>> the cluster->idle_cpus
>> first, without having to query the llc->idle_cpus. This looks like
>> splitting
>> a big lock into fine grain small lock.
>
> I'm afraid I didn't quite follow.. Did you mean replace the LLC filter
> with multiple cluster filters? Then I agree with what you suggested
> that the contention would be reduced. But there are other concerns:
>
>   a. Is it appropriate to fake an intermediate sched_domain if the
>      cluster level isn't available? How to identify the proper size
>      of the faked sched_domain?
>
>   b. The SIS path might touch more cachelines (multiple cluster
>      filters). I'm not sure how much impact that has.
>
> Whatever, this seems worth a try. :)
>
On second thought, maybe it's a similar case to enabling SNC?
I benchmarked with SNC disabled, so the LLC is relatively big. This
time I enabled SNC on the same machine mentioned in the cover letter, to
make the filter more fine-grained. Please see the following results.
a) hackbench-process-pipes
Amean 1 0.4380 ( 0.00%) 0.4250 * 2.97%*
Amean 4 0.6123 ( 0.00%) 0.6153 ( -0.49%)
Amean 7 0.7693 ( 0.00%) 0.7217 * 6.20%*
Amean 12 1.0730 ( 0.00%) 1.0723 ( 0.06%)
Amean 21 1.8540 ( 0.00%) 1.8817 ( -1.49%)
Amean 30 2.8147 ( 0.00%) 2.7297 ( 3.02%)
Amean 48 4.6280 ( 0.00%) 4.4923 * 2.93%*
Amean 79 8.0897 ( 0.00%) 7.8773 ( 2.62%)
Amean 110 10.5320 ( 0.00%) 10.1737 ( 3.40%)
Amean 141 13.0260 ( 0.00%) 12.4953 ( 4.07%)
Amean 172 15.5093 ( 0.00%) 14.3697 * 7.35%*
Amean 203 17.9633 ( 0.00%) 16.7853 * 6.56%*
Amean 234 20.2327 ( 0.00%) 19.2020 * 5.09%*
Amean 265 22.1203 ( 0.00%) 21.3353 ( 3.55%)
Amean 296 24.9337 ( 0.00%) 23.8967 ( 4.16%)
b) hackbench-process-sockets
Amean 1 0.6990 ( 0.00%) 0.6520 * 6.72%*
Amean 4 1.6513 ( 0.00%) 1.6080 * 2.62%*
Amean 7 2.5103 ( 0.00%) 2.5020 ( 0.33%)
Amean 12 4.1470 ( 0.00%) 4.0957 * 1.24%*
Amean 21 7.0823 ( 0.00%) 6.9237 * 2.24%*
Amean 30 9.9510 ( 0.00%) 9.7937 * 1.58%*
Amean 48 15.8853 ( 0.00%) 15.5410 * 2.17%*
Amean 79 26.3313 ( 0.00%) 26.0363 * 1.12%*
Amean 110 36.6647 ( 0.00%) 36.2657 * 1.09%*
Amean 141 47.0590 ( 0.00%) 46.4010 * 1.40%*
Amean 172 57.5020 ( 0.00%) 56.9897 ( 0.89%)
Amean 203 67.9277 ( 0.00%) 66.8273 * 1.62%*
Amean 234 78.3967 ( 0.00%) 77.2137 * 1.51%*
Amean 265 88.5817 ( 0.00%) 87.6143 * 1.09%*
Amean 296 99.4397 ( 0.00%) 98.0233 * 1.42%*
c) hackbench-thread-pipes
Amean 1 0.4437 ( 0.00%) 0.4373 ( 1.43%)
Amean 4 0.6667 ( 0.00%) 0.6340 ( 4.90%)
Amean 7 0.7813 ( 0.00%) 0.8177 * -4.65%*
Amean 12 1.2747 ( 0.00%) 1.3113 ( -2.88%)
Amean 21 2.4703 ( 0.00%) 2.3637 * 4.32%*
Amean 30 3.6547 ( 0.00%) 3.2377 * 11.41%*
Amean 48 5.7580 ( 0.00%) 5.3140 * 7.71%*
Amean 79 9.1770 ( 0.00%) 8.3717 * 8.78%*
Amean 110 11.7167 ( 0.00%) 11.3867 * 2.82%*
Amean 141 14.1490 ( 0.00%) 13.9017 ( 1.75%)
Amean 172 17.3880 ( 0.00%) 16.4897 ( 5.17%)
Amean 203 19.3760 ( 0.00%) 18.8807 ( 2.56%)
Amean 234 22.7477 ( 0.00%) 21.7420 * 4.42%*
Amean 265 25.8940 ( 0.00%) 23.6173 * 8.79%*
Amean 296 27.8677 ( 0.00%) 26.5053 * 4.89%*
d) hackbench-thread-sockets
Amean 1 0.7303 ( 0.00%) 0.6817 * 6.66%*
Amean 4 1.6820 ( 0.00%) 1.6343 * 2.83%*
Amean 7 2.6060 ( 0.00%) 2.5393 * 2.56%*
Amean 12 4.2663 ( 0.00%) 4.1810 * 2.00%*
Amean 21 7.2110 ( 0.00%) 7.0873 * 1.71%*
Amean 30 10.1453 ( 0.00%) 10.0320 * 1.12%*
Amean 48 16.2787 ( 0.00%) 15.9040 * 2.30%*
Amean 79 27.0090 ( 0.00%) 26.5803 * 1.59%*
Amean 110 37.5397 ( 0.00%) 37.1200 * 1.12%*
Amean 141 48.0853 ( 0.00%) 47.7613 * 0.67%*
Amean 172 58.7967 ( 0.00%) 58.2570 * 0.92%*
Amean 203 69.5303 ( 0.00%) 68.8930 * 0.92%*
Amean 234 79.9943 ( 0.00%) 79.5347 * 0.57%*
Amean 265 90.5877 ( 0.00%) 90.1223 ( 0.51%)
Amean 296 101.2390 ( 0.00%) 101.1687 ( 0.07%)
e) netperf-udp
Hmean send-64 202.37 ( 0.00%) 202.46 ( 0.05%)
Hmean send-128 407.01 ( 0.00%) 402.86 * -1.02%*
Hmean send-256 788.50 ( 0.00%) 789.87 ( 0.17%)
Hmean send-1024 3047.98 ( 0.00%) 3036.19 ( -0.39%)
Hmean send-2048 5820.33 ( 0.00%) 5776.30 ( -0.76%)
Hmean send-3312 8941.40 ( 0.00%) 8809.25 * -1.48%*
Hmean send-4096 10804.41 ( 0.00%) 10686.95 * -1.09%*
Hmean send-8192 17105.63 ( 0.00%) 17323.44 * 1.27%*
Hmean send-16384 28166.17 ( 0.00%) 28191.05 ( 0.09%)
Hmean recv-64 202.37 ( 0.00%) 202.46 ( 0.05%)
Hmean recv-128 407.01 ( 0.00%) 402.86 * -1.02%*
Hmean recv-256 788.50 ( 0.00%) 789.87 ( 0.17%)
Hmean recv-1024 3047.98 ( 0.00%) 3036.19 ( -0.39%)
Hmean recv-2048 5820.33 ( 0.00%) 5776.30 ( -0.76%)
Hmean recv-3312 8941.40 ( 0.00%) 8809.23 * -1.48%*
Hmean recv-4096 10804.41 ( 0.00%) 10686.95 * -1.09%*
Hmean recv-8192 17105.55 ( 0.00%) 17323.44 * 1.27%*
Hmean recv-16384 28166.03 ( 0.00%) 28191.04 ( 0.09%)
f) netperf-tcp
Hmean 64 838.30 ( 0.00%) 837.61 ( -0.08%)
Hmean 128 1633.25 ( 0.00%) 1653.50 * 1.24%*
Hmean 256 3107.89 ( 0.00%) 3148.10 ( 1.29%)
Hmean 1024 10435.39 ( 0.00%) 10503.81 ( 0.66%)
Hmean 2048 17152.34 ( 0.00%) 17314.40 ( 0.94%)
Hmean 3312 21928.05 ( 0.00%) 21995.97 ( 0.31%)
Hmean 4096 23990.44 ( 0.00%) 24008.97 ( 0.08%)
Hmean 8192 29445.84 ( 0.00%) 29245.31 * -0.68%*
Hmean 16384 33592.90 ( 0.00%) 34096.68 * 1.50%*
g) tbench4 Throughput
Hmean 1 311.15 ( 0.00%) 306.76 * -1.41%*
Hmean 2 619.24 ( 0.00%) 615.00 * -0.68%*
Hmean 4 1220.45 ( 0.00%) 1222.08 * 0.13%*
Hmean 8 2410.93 ( 0.00%) 2413.59 * 0.11%*
Hmean 16 4652.09 ( 0.00%) 4766.12 * 2.45%*
Hmean 32 7809.03 ( 0.00%) 7831.88 * 0.29%*
Hmean 64 9116.92 ( 0.00%) 9171.25 * 0.60%*
Hmean 128 17732.63 ( 0.00%) 20209.26 * 13.97%*
Hmean 256 19603.22 ( 0.00%) 19007.72 * -3.04%*
Hmean 384 19796.37 ( 0.00%) 17396.64 * -12.12%*
There doesn't seem to be much difference except for the hackbench pipe
test at certain group counts (30~110). My intention is to provide better
scalability by applying the filter, which will be enabled when:
  - the LLC is large enough that simply traversing it becomes
    insufficient, and/or
  - the LLC is loaded enough that unoccupied cpus are a minority.
But it would be very nice if a more fine-grained pattern worked well so
we can drop the above constraints.
>
> Thanks & BR,
> Abel
On Mon, Jun 27, 2022 at 8:51 PM Abel Wu <[email protected]> wrote:
>
>
> On 6/28/22 7:42 AM, Josh Don Wrote:
> > On Sun, Jun 19, 2022 at 5:05 AM Abel Wu <[email protected]> wrote:
> >>
> >> The function only gets called when sds->has_idle_cores is true which can
> >> be possible only when sched_smt_present is enabled.
> >>
> >> Signed-off-by: Abel Wu <[email protected]>
> >> ---
> >> kernel/sched/fair.c | 3 ---
> >> 1 file changed, 3 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index aba1dad19574..1cc86e76e38e 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6256,9 +6256,6 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> >> bool idle = true;
> >> int cpu;
> >>
> >> - if (!static_branch_likely(&sched_smt_present))
> >> - return __select_idle_cpu(core, p);
> >> -
> >
> > The static branch is basically free; although you're right that we
> > currently don't take !smt_present branch direction here, it doesn't
> > seem harmful to leave this check in case assumptions change about when
> > we call select_idle_core().
>
> I was thinking it would be better to align with select_idle_smt(), where
> the caller does the check if necessary.
The difference there though is that select_idle_smt() is called
directly under the sched_smt_active() check, whereas the
select_idle_core() is a few nested function calls away (and relies on
has_idle_core rather than sched_smt_active() directly). So it is a bit
harder to codify this expectation. Since we're using a static_branch
here, I don't see a strong reason to remove it.
> >
> >> for_each_cpu(cpu, cpu_smt_mask(core)) {
> >> if (!available_idle_cpu(cpu)) {
> >> idle = false;
> >> --
> >> 2.31.1
> >>
On Mon, Jun 27, 2022 at 11:53 PM Abel Wu <[email protected]> wrote:
>
> >>
> >> -static inline bool test_idle_cores(int cpu)
> >> +static inline enum sd_state sd_get_state(int cpu)
> >> {
> >> struct sched_domain_shared *sds;
> >>
> >> sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> >> if (sds)
> >> - return READ_ONCE(sds->has_idle_cores);
> >> + return READ_ONCE(sds->state);
> >>
> >> - return false;
> >> + return sd_has_icpus;
> >> +}
> >
> > Why is default not sd_is_busy?
>
> The state of sd_is_busy will prevent us from searching the LLC. By
> design, both sd_has_icores and sd_is_busy indicate deterministic
> status: has idle cores / no idle cpu exists. While sd_has_icpus is
> not deterministic, it means there could be unoccupied cpus.
>
> The naming seems misleading, it would be nice to have other options.
sd_has_icores isn't deterministic; when the last fully idle core gets
an occupied sibling, it will take until the next select_idle_cpu() to
mark the state as sd_has_icpus instead.
A comment here and also at the enum definitions would be helpful I think.
> >
> >> +
> >> +static inline void set_idle_cores(int cpu, int idle)
> >
> > nit: Slightly confusing to call the param 'idle', since in the case it
> > is false we still mark icpus. Consider possibly 'core_idle'.
>
> What about changing the param 'cpu' to 'core'?
I think keeping it as "cpu" is fine, since as "core" that would imply
some per-core state (when we're still setting this per-cpu).
> >> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> >> struct rq *rq = cpu_rq(i);
> >> @@ -8692,6 +8740,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >> nr_running = rq->nr_running;
> >> sgs->sum_nr_running += nr_running;
> >>
> >> + if (update_core)
> >> + sd_classify(sds, rq);
> >> +
> >> if (nr_running > 1)
> >> *sg_status |= SG_OVERLOAD;
> >>
> >> @@ -9220,6 +9271,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >> return idlest;
> >> }
> >>
> >> +static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> >> +{
> >> + if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
> >> + set_idle_cpus(env->dst_cpu, true);
> >> +}
> >
> > We're only setting state to has_icpus here in sd_update_state. That
> > doesn't feel good enough, since we're only updating state for
> > env->dst_cpu; all the other per-cpu state will remain stale (ie.
> > falsely sd_is_busy).
>
> It's LLC-wide shared :)
Oh wow, yea that's the thing I missed... Thanks.
> > I think you also want a case in __update_idle_core() to call
> > set_idle_cores(core, 0) in the case where we have a non-idle sibling,
> > since we want to at least mark has_icpus even if the entire core isn't
> > idle.
More specifically, in the __update_idle_core() function, if the
sibling is still busy and the sd_state is sd_is_busy, we should
instead mark it as sd_has_icpus, since the current cpu is guaranteed
to be going idle.
Additionally, to be consistent with what we're calling "idle"
elsewhere, I think you mean to have __update_idle_core() check either
available_idle_cpu() or sched_idle_cpu()?
On 6/29/22 9:11 AM, Josh Don Wrote:
> On Mon, Jun 27, 2022 at 11:53 PM Abel Wu <[email protected]> wrote:
>>
>>>>
>>>> -static inline bool test_idle_cores(int cpu)
>>>> +static inline enum sd_state sd_get_state(int cpu)
>>>> {
>>>> struct sched_domain_shared *sds;
>>>>
>>>> sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>>>> if (sds)
>>>> - return READ_ONCE(sds->has_idle_cores);
>>>> + return READ_ONCE(sds->state);
>>>>
>>>> - return false;
>>>> + return sd_has_icpus;
>>>> +}
>>>
>>> Why is default not sd_is_busy?
>>
>> The state of sd_is_busy will prevent us from searching the LLC. By
>> design, both sd_has_icores and sd_is_busy indicate deterministic
>> status: has idle cores / no idle cpu exists. While sd_has_icpus is
>> not deterministic, it means there could be unoccupied cpus.
>>
>> The naming seems misleading, it would be nice to have other options.
>
> sd_has_icores isn't deterministic; when the last fully idle core gets
> an occupied sibling, it will take until the next select_idle_cpu() to
> mark the state as sd_has_icpus instead.
Yes, it's not deterministic in nature, but we treat it as deterministic.
As long as the state is sd_has_icores, a full scan will be fired no matter
whether there are any idle cores or not.
>
> A comment here and also at the enum definitions would be helpful I think.
Agreed. I will add some comments here. The state descriptions are already
above their definitions; please let me know if any modification is needed.
>
>>>
>>>> +
>>>> +static inline void set_idle_cores(int cpu, int idle)
>>>
>>> nit: Slightly confusing to call the param 'idle', since in the case it
>>> is false we still mark icpus. Consider possibly 'core_idle'.
>>
>> What about changing the param 'cpu' to 'core'?
>
> I think keeping it as "cpu" is fine, since as "core" that would imply
> some per-core state (when we're still setting this per-cpu).
The function has already been there for a long time, and I haven't
changed its semantics, so maybe it isn't that confusing. Does the
following naming make things clearer?
static inline void set_idle_cores(int cpu, int has_icores);
static inline void set_idle_cpus(int cpu, int has_icpus);
>
>>>> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>>>> struct rq *rq = cpu_rq(i);
>>>> @@ -8692,6 +8740,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>>> nr_running = rq->nr_running;
>>>> sgs->sum_nr_running += nr_running;
>>>>
>>>> + if (update_core)
>>>> + sd_classify(sds, rq);
>>>> +
>>>> if (nr_running > 1)
>>>> *sg_status |= SG_OVERLOAD;
>>>>
>>>> @@ -9220,6 +9271,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>> return idlest;
>>>> }
>>>>
>>>> +static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>>>> +{
>>>> + if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
>>>> + set_idle_cpus(env->dst_cpu, true);
>>>> +}
>>>
>>> We're only setting state to has_icpus here in sd_update_state. That
>>> doesn't feel good enough, since we're only updating state for
>>> env->dst_cpu; all the other per-cpu state will remain stale (ie.
>>> falsely sd_is_busy).
>>
>> It's LLC-wide shared :)
>
> Oh wow, yea that's the thing I missed... Thanks.
>
>>> I think you also want a case in __update_idle_core() to call
>>> set_idle_cores(core, 0) in the case where we have a non-idle sibling,
>>> since we want to at least mark has_icpus even if the entire core isn't
>>> idle.
>
> More specifically, in the __update_idle_core() function, if the
> sibling is still busy and the sd_state is sd_is_busy, we should
> instead mark it as sd_has_icpus, since the current cpu is guaranteed
> to be going idle.
The sd_is_busy state will be cleared during newidle balance. And the
state should not be set back to sd_is_busy in the gap before the cpu
actually goes idle.
>
> Additionally, to be consistent with what we're calling "idle"
> elsewhere, I think you mean to have __update_idle_core() check either
> available_idle_cpu() or sched_idle_cpu()?
I think the condition should be aligned with SIS, where an unoccupied
cpu satisfies "idle_cpu(cpu) || sched_idle_cpu(cpu)". The function
available_idle_cpu() is not used in load balancing because the cpu
won't be used immediately there.
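Put as a tiny sketch (the wrapper name below is made up for illustration,
it is not from the patchset):

static inline bool cpu_is_unoccupied(int cpu)
{
	/* fully idle, or running nothing but SCHED_IDLE tasks */
	return idle_cpu(cpu) || sched_idle_cpu(cpu);
}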
Thanks & BR,
Abel
On Tue, Jun 28, 2022 at 03:58:55PM +0800, Abel Wu wrote:
>
> On 6/27/22 6:13 PM, Abel Wu Wrote:
> >
> > On 6/24/22 11:30 AM, Chen Yu Wrote:
> > > > ...
> > > > > > @@ -9273,8 +9319,40 @@ find_idlest_group(struct
> > > > > > sched_domain *sd, struct task_struct *p, int this_cpu)
> > > > > > static void sd_update_state(struct lb_env *env,
> > > > > > struct sd_lb_stats *sds)
> > > > > > {
> > > > > > - if (sds->sd_state == sd_has_icpus &&
> > > > > > !test_idle_cpus(env->dst_cpu))
> > > > > > - set_idle_cpus(env->dst_cpu, true);
> > > > > > + struct sched_domain_shared *sd_smt_shared = env->sd->shared;
> > > > > > + enum sd_state new = sds->sd_state;
> > > > > > + int this = env->dst_cpu;
> > > > > > +
> > > > > > + /*
> > > > > > + * Parallel updating can hardly contribute accuracy to
> > > > > > + * the filter, besides it can be one of the burdens on
> > > > > > + * cache traffic.
> > > > > > + */
> > > > > > + if (cmpxchg(&sd_smt_shared->updating, 0, 1))
> > > > > > + return;
> > > > > > +
> > > > > > + /*
> > > > > > + * There is at least one unoccupied cpu available, so
> > > > > > + * propagate it to the filter to avoid false negative
> > > > > > + * issue which could result in lost tracking of some
> > > > > > + * idle cpus thus throughupt downgraded.
> > > > > > + */
> > > > > > + if (new != sd_is_busy) {
> > > > > > + if (!test_idle_cpus(this))
> > > > > > + set_idle_cpus(this, true);
> > > > > > + } else {
> > > > > > + /*
> > > > > > + * Nothing changes so nothing to update or
> > > > > > + * propagate.
> > > > > > + */
> > > > > > + if (sd_smt_shared->state == sd_is_busy)
> > > > > > + goto out;
> > > > > > + }
> > > > > > +
> > > > > > + sd_update_icpus(this, sds->idle_cpu);
> > > > > I wonder if we could further enhance it to facilitate idle CPU scan.
> > > > > For example, can we propagate the idle CPUs in smt domain,
> > > > > to its parent
> > > > > domain in a hierarchic sequence, and finally to the LLC
> > > > > domain. If there is
> > > >
> > > > In fact, it was my first try to cache the unoccupied cpus in SMT
> > > > shared domain, but the overhead of cpumask ops seems like a major
> > > > stumbling block.
> > > >
> > > > > a cluster domain between SMT and LLC domain, the cluster
> > > > > domain idle CPU filter
> > > > > could benefit from this mechanism.
> > > > > https://lore.kernel.org/lkml/[email protected]/
> > > > >
> > > >
> > > > Putting SIS into a hierarchical pattern is good for cache locality.
> > > > But I don't think multi-level filter is appropriate since it could
> > > > bring too much cache traffic in SIS,
> > > Could you please elaborate a little more on the cache traffic? I thought
> > > we wouldn't save the unoccupied cpus in the SMT shared domain, but instead
> > > store them in a middle-layer shared domain, say, cluster->idle_cpus. This
> > > would reduce cache write contention compared to writing to llc->idle_cpus
> > > directly, because a smaller set of CPUs shares the idle_cpus filter.
> > > Similarly, SIS can scan only cluster->idle_cpus first, without having to
> > > query llc->idle_cpus. This looks like splitting a big lock into
> > > fine-grained small locks.
> >
> > I'm afraid I didn't quite follow... Did you mean replacing the LLC filter
> > with multiple cluster filters? Then I agree with what you suggested,
> > that the contention would be reduced. But there are other concerns:
> >
> >   a. Is it appropriate to fake an intermediate sched_domain if the
> >      cluster level isn't available? How do we identify the proper
> >      size of the faked sched_domain?
> >
> >   b. The SIS path might touch more cachelines (multiple cluster
> >      filters). I'm not sure how much the impact would be.
> >
> > Whatever, this seems worth a try. :)
> >
>
> On second thought, maybe it's a similar case to enabling SNC?
> I benchmarked with SNC disabled, so the LLC is relatively big. This
> time I enabled SNC on the same machine mentioned in the cover letter, to
> make the filter more fine-grained. Please see the following result.
>
> a) hackbench-process-pipes
>
> Amean 1 0.4380 ( 0.00%) 0.4250 * 2.97%*
> Amean 4 0.6123 ( 0.00%) 0.6153 ( -0.49%)
> Amean 7 0.7693 ( 0.00%) 0.7217 * 6.20%*
> Amean 12 1.0730 ( 0.00%) 1.0723 ( 0.06%)
> Amean 21 1.8540 ( 0.00%) 1.8817 ( -1.49%)
> Amean 30 2.8147 ( 0.00%) 2.7297 ( 3.02%)
> Amean 48 4.6280 ( 0.00%) 4.4923 * 2.93%*
> Amean 79 8.0897 ( 0.00%) 7.8773 ( 2.62%)
> Amean 110 10.5320 ( 0.00%) 10.1737 ( 3.40%)
> Amean 141 13.0260 ( 0.00%) 12.4953 ( 4.07%)
> Amean 172 15.5093 ( 0.00%) 14.3697 * 7.35%*
> Amean 203 17.9633 ( 0.00%) 16.7853 * 6.56%*
> Amean 234 20.2327 ( 0.00%) 19.2020 * 5.09%*
> Amean 265 22.1203 ( 0.00%) 21.3353 ( 3.55%)
> Amean 296 24.9337 ( 0.00%) 23.8967 ( 4.16%)
>
> b) hackbench-process-sockets
>
> Amean 1 0.6990 ( 0.00%) 0.6520 * 6.72%*
> Amean 4 1.6513 ( 0.00%) 1.6080 * 2.62%*
> Amean 7 2.5103 ( 0.00%) 2.5020 ( 0.33%)
> Amean 12 4.1470 ( 0.00%) 4.0957 * 1.24%*
> Amean 21 7.0823 ( 0.00%) 6.9237 * 2.24%*
> Amean 30 9.9510 ( 0.00%) 9.7937 * 1.58%*
> Amean 48 15.8853 ( 0.00%) 15.5410 * 2.17%*
> Amean 79 26.3313 ( 0.00%) 26.0363 * 1.12%*
> Amean 110 36.6647 ( 0.00%) 36.2657 * 1.09%*
> Amean 141 47.0590 ( 0.00%) 46.4010 * 1.40%*
> Amean 172 57.5020 ( 0.00%) 56.9897 ( 0.89%)
> Amean 203 67.9277 ( 0.00%) 66.8273 * 1.62%*
> Amean 234 78.3967 ( 0.00%) 77.2137 * 1.51%*
> Amean 265 88.5817 ( 0.00%) 87.6143 * 1.09%*
> Amean 296 99.4397 ( 0.00%) 98.0233 * 1.42%*
>
> c) hackbench-thread-pipes
>
> Amean 1 0.4437 ( 0.00%) 0.4373 ( 1.43%)
> Amean 4 0.6667 ( 0.00%) 0.6340 ( 4.90%)
> Amean 7 0.7813 ( 0.00%) 0.8177 * -4.65%*
> Amean 12 1.2747 ( 0.00%) 1.3113 ( -2.88%)
> Amean 21 2.4703 ( 0.00%) 2.3637 * 4.32%*
> Amean 30 3.6547 ( 0.00%) 3.2377 * 11.41%*
> Amean 48 5.7580 ( 0.00%) 5.3140 * 7.71%*
> Amean 79 9.1770 ( 0.00%) 8.3717 * 8.78%*
> Amean 110 11.7167 ( 0.00%) 11.3867 * 2.82%*
> Amean 141 14.1490 ( 0.00%) 13.9017 ( 1.75%)
> Amean 172 17.3880 ( 0.00%) 16.4897 ( 5.17%)
> Amean 203 19.3760 ( 0.00%) 18.8807 ( 2.56%)
> Amean 234 22.7477 ( 0.00%) 21.7420 * 4.42%*
> Amean 265 25.8940 ( 0.00%) 23.6173 * 8.79%*
> Amean 296 27.8677 ( 0.00%) 26.5053 * 4.89%*
>
> d) hackbench-thread-sockets
>
> Amean 1 0.7303 ( 0.00%) 0.6817 * 6.66%*
> Amean 4 1.6820 ( 0.00%) 1.6343 * 2.83%*
> Amean 7 2.6060 ( 0.00%) 2.5393 * 2.56%*
> Amean 12 4.2663 ( 0.00%) 4.1810 * 2.00%*
> Amean 21 7.2110 ( 0.00%) 7.0873 * 1.71%*
> Amean 30 10.1453 ( 0.00%) 10.0320 * 1.12%*
> Amean 48 16.2787 ( 0.00%) 15.9040 * 2.30%*
> Amean 79 27.0090 ( 0.00%) 26.5803 * 1.59%*
> Amean 110 37.5397 ( 0.00%) 37.1200 * 1.12%*
> Amean 141 48.0853 ( 0.00%) 47.7613 * 0.67%*
> Amean 172 58.7967 ( 0.00%) 58.2570 * 0.92%*
> Amean 203 69.5303 ( 0.00%) 68.8930 * 0.92%*
> Amean 234 79.9943 ( 0.00%) 79.5347 * 0.57%*
> Amean 265 90.5877 ( 0.00%) 90.1223 ( 0.51%)
> Amean 296 101.2390 ( 0.00%) 101.1687 ( 0.07%)
>
> e) netperf-udp
>
> Hmean send-64 202.37 ( 0.00%) 202.46 ( 0.05%)
> Hmean send-128 407.01 ( 0.00%) 402.86 * -1.02%*
> Hmean send-256 788.50 ( 0.00%) 789.87 ( 0.17%)
> Hmean send-1024 3047.98 ( 0.00%) 3036.19 ( -0.39%)
> Hmean send-2048 5820.33 ( 0.00%) 5776.30 ( -0.76%)
> Hmean send-3312 8941.40 ( 0.00%) 8809.25 * -1.48%*
> Hmean send-4096 10804.41 ( 0.00%) 10686.95 * -1.09%*
> Hmean send-8192 17105.63 ( 0.00%) 17323.44 * 1.27%*
> Hmean send-16384 28166.17 ( 0.00%) 28191.05 ( 0.09%)
> Hmean recv-64 202.37 ( 0.00%) 202.46 ( 0.05%)
> Hmean recv-128 407.01 ( 0.00%) 402.86 * -1.02%*
> Hmean recv-256 788.50 ( 0.00%) 789.87 ( 0.17%)
> Hmean recv-1024 3047.98 ( 0.00%) 3036.19 ( -0.39%)
> Hmean recv-2048 5820.33 ( 0.00%) 5776.30 ( -0.76%)
> Hmean recv-3312 8941.40 ( 0.00%) 8809.23 * -1.48%*
> Hmean recv-4096 10804.41 ( 0.00%) 10686.95 * -1.09%*
> Hmean recv-8192 17105.55 ( 0.00%) 17323.44 * 1.27%*
> Hmean recv-16384 28166.03 ( 0.00%) 28191.04 ( 0.09%)
>
> f) netperf-tcp
>
> Hmean 64 838.30 ( 0.00%) 837.61 ( -0.08%)
> Hmean 128 1633.25 ( 0.00%) 1653.50 * 1.24%*
> Hmean 256 3107.89 ( 0.00%) 3148.10 ( 1.29%)
> Hmean 1024 10435.39 ( 0.00%) 10503.81 ( 0.66%)
> Hmean 2048 17152.34 ( 0.00%) 17314.40 ( 0.94%)
> Hmean 3312 21928.05 ( 0.00%) 21995.97 ( 0.31%)
> Hmean 4096 23990.44 ( 0.00%) 24008.97 ( 0.08%)
> Hmean 8192 29445.84 ( 0.00%) 29245.31 * -0.68%*
> Hmean 16384 33592.90 ( 0.00%) 34096.68 * 1.50%*
>
> g) tbench4 Throughput
>
> Hmean 1 311.15 ( 0.00%) 306.76 * -1.41%*
> Hmean 2 619.24 ( 0.00%) 615.00 * -0.68%*
> Hmean 4 1220.45 ( 0.00%) 1222.08 * 0.13%*
> Hmean 8 2410.93 ( 0.00%) 2413.59 * 0.11%*
> Hmean 16 4652.09 ( 0.00%) 4766.12 * 2.45%*
> Hmean 32 7809.03 ( 0.00%) 7831.88 * 0.29%*
> Hmean 64 9116.92 ( 0.00%) 9171.25 * 0.60%*
> Hmean 128 17732.63 ( 0.00%) 20209.26 * 13.97%*
> Hmean 256 19603.22 ( 0.00%) 19007.72 * -3.04%*
> Hmean 384 19796.37 ( 0.00%) 17396.64 * -12.12%*
>
>
> There doesn't seem to be much difference except for the hackbench pipe test
> at certain group counts (30~110).
OK, a smaller LLC domain does not seem to make much difference, which might
suggest that by leveraging the load balance code path, the read/write
to the LLC shared mask might not be the bottleneck. I have a vague
impression that during Aubrey's cpumask searching for idle CPUs
work[1], there was a concern that updating the shared mask in a large LLC
introduced cache contention and performance degradation. Maybe we
can find that regressed test case to verify.
[1] https://lore.kernel.org/all/[email protected]/
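
As an aside, the usual way to keep a shared mask like this from bouncing is
to read before writing, which is essentially what the SIS_FILTER draft further
below does. A minimal sketch, with hypothetical mask and helper names:

/*
 * Only dirty the LLC-shared cacheline when the bit actually changes,
 * so readers on other cpus can keep a shared copy.
 */
static void mark_cpu_overloaded(struct cpumask *llc_overloaded, int cpu, bool overloaded)
{
        if (overloaded) {
                if (!cpumask_test_cpu(cpu, llc_overloaded))
                        cpumask_set_cpu(cpu, llc_overloaded);
        } else if (cpumask_test_cpu(cpu, llc_overloaded)) {
                cpumask_clear_cpu(cpu, llc_overloaded);
        }
}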
> I intend to provide better scalability
> by applying the filter, which will be enabled when:
>
> - The LLC is large enough that simply traversing it becomes
> insufficient, and/or
>
> - The LLC is loaded enough that unoccupied cpus are a minority.
>
> But it would be very nice if a more fine-grained pattern works well
> so we can drop the above constraints.
>
We can first try to push a simple version, and later optimize it.
One concern about v4 is that we changed the logic from v3, which recorded
the overloaded CPUs, while v4 tracks unoccupied CPUs. An overloaded CPU is
more "stable" because there is more than one running task on that runqueue.
It is more likely to remain "occupied" for a while. That is to say,
nr_task = 1, 2, 3... will all be regarded as occupied, while only nr_task = 0
is unoccupied. The former would bring fewer false negatives/positives.
So far I have tested hackbench/schbench/netperf on top of Peter's sched/core branch,
with SIS_UTIL enabled. Overall it looks good, and netperf shows an especially
significant improvement when the load approaches overloaded (which is aligned
with your comment above). I'll re-run netperf for several cycles to check the
standard deviation. And I'm also curious about v3's performance since it
tracks overloaded CPUs, so I'll also test v3 with small modifications.
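
For illustration, the two policies being compared can be reduced to the
following hypothetical predicates (ignoring SCHED_IDLE accounting for brevity;
neither helper exists verbatim in the patches):

/* v4 tracks these: flips as soon as a single task starts running. */
static inline bool cpu_unoccupied_sketch(struct rq *rq)
{
        return rq->nr_running == 0;
}

/* v3 tracked these: tends to stay set while more tasks are queued. */
static inline bool cpu_overloaded_sketch(struct rq *rq)
{
        return rq->nr_running > 1;
}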
hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe group-1 1.00 ( 0.00) -0.16 ( 0.00)
process-pipe group-2 1.00 ( 0.00) +0.47 ( 0.00)
process-pipe group-4 1.00 ( 0.00) -0.56 ( 0.00)
process-pipe group-8 1.00 ( 0.00) +3.29 ( 0.00)
process-sockets group-1 1.00 ( 0.00) -1.85 ( 0.00)
process-sockets group-2 1.00 ( 0.00) -5.67 ( 0.00)
process-sockets group-4 1.00 ( 0.00) -0.14 ( 0.00)
process-sockets group-8 1.00 ( 0.00) -0.29 ( 0.00)
threads-pipe group-1 1.00 ( 0.00) +2.17 ( 0.00)
threads-pipe group-2 1.00 ( 0.00) +3.26 ( 0.00)
threads-pipe group-4 1.00 ( 0.00) -0.32 ( 0.00)
threads-pipe group-8 1.00 ( 0.00) +3.36 ( 0.00)
threads-sockets group-1 1.00 ( 0.00) -0.91 ( 0.00)
threads-sockets group-2 1.00 ( 0.00) -0.91 ( 0.00)
threads-sockets group-4 1.00 ( 0.00) +0.27 ( 0.00)
threads-sockets group-8 1.00 ( 0.00) -0.55 ( 0.00)
schbench
========
case load baseline(std%) compare%( std%)
normal mthread-1 1.00 ( 0.00) -3.12 ( 0.00)
normal mthread-2 1.00 ( 0.00) +0.00 ( 0.00)
normal mthread-4 1.00 ( 0.00) -2.63 ( 0.00)
normal mthread-8 1.00 ( 0.00) -7.22 ( 0.00)
thanks,
Chenyu
> >
> > Thanks & BR,
> > Abel
On 6/19/22 8:04 PM, Abel Wu Wrote:
> Now when updating core state, there are two main problems that could
> pollute the SIS filter:
>
> - The update happens before task migration, so if dst_cpu is
> selected to be propagated, it might be fed with tasks
> soon, and the effort we paid is no more than setting a busy
> cpu into the SIS filter. On the other hand it is
> important that we update as early as possible to keep the
> filter fresh, so it's not wise to delay the update to the
> end of load balancing.
>
> - False negative propagation hurts performance since some
> idle cpus could be out of reach. So in general we will
> aggressively propagate idle cpus but allow false positives
> to continue to exist for a while, which may lead to the filter
> being fully polluted.
>
> Pains can be relieved by a forced correction when false positives
> are continuously detected.
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> include/linux/sched/topology.h | 7 +++++
> kernel/sched/fair.c | 51 ++++++++++++++++++++++++++++++++--
> 2 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index b93edf587d84..e3552ce192a9 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -91,6 +91,12 @@ struct sched_group;
> * search, and is also used as a fallback state of the other
> * states.
> *
> + * - sd_may_idle
> + * This state implies instability of the SIS filter: some
> + * bits of it may be out of date. This state is only used in
> + * SMT domains as an intermediate state between sd_has_icpus
> + * and sd_is_busy.
> + *
> * - sd_is_busy
> * This state indicates there are no unoccupied cpus in this
> * domain. So for LLC domains, it gives the hint on whether
> @@ -111,6 +117,7 @@ struct sched_group;
> enum sd_state {
> sd_has_icores,
> sd_has_icpus,
> + sd_may_idle,
> sd_is_busy
> };
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d55fdcedf2c0..9713d183d35e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8768,6 +8768,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>
> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> struct rq *rq = cpu_rq(i);
> + bool update = update_core && (env->dst_cpu != i);
>
> sgs->group_load += cpu_load(rq);
> sgs->group_util += cpu_util_cfs(i);
> @@ -8777,7 +8778,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> nr_running = rq->nr_running;
> sgs->sum_nr_running += nr_running;
>
> - if (update_core)
> + /*
> + * The dst_cpu is not preferred since it might
> + * be fed with tasks soon.
> + */
> + if (update)
> sd_classify(sds, rq, i);
>
> if (nr_running > 1)
> @@ -8801,7 +8806,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> * and fed with tasks, so we'd better choose
> * a candidate in an opposite way.
> */
> - sds->idle_cpu = i;
> + if (update)
> + sds->idle_cpu = i;
> sgs->idle_cpus++;
>
> /* Idle cpu can't have misfit task */
> @@ -9321,7 +9327,7 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> {
> struct sched_domain_shared *sd_smt_shared = env->sd->shared;
> enum sd_state new = sds->sd_state;
> - int this = env->dst_cpu;
> + int icpu = sds->idle_cpu, this = env->dst_cpu;
>
> /*
> * Parallel updating can hardly contribute accuracy to
> @@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> if (cmpxchg(&sd_smt_shared->updating, 0, 1))
> return;
>
> + /*
> + * The dst_cpu is likely to be fed with tasks soon.
> + * If it is the only unoccupied cpu in this domain,
> + * we still handle it the same way as sd_has_icpus
> + * but turn the SMT domain into the unstable state, rather
> + * than waiting until the end of load balancing, since
> + * it's also important to update the filter as
> + * early as possible to keep it fresh.
> + */
> + if (new == sd_is_busy) {
> + if (idle_cpu(this) || sched_idle_cpu(this)) {
> + new = sd_may_idle;
> + icpu = this;
> + }
> + }
> +
> /*
> * There is at least one unoccupied cpu available, so
> * propagate it to the filter to avoid false negative
> @@ -9338,6 +9360,12 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> + * idle cpus thus throughput downgraded.
> */
> if (new != sd_is_busy) {
> + /*
> + * The sd_may_idle state is taken into
> + * consideration as well because from
> + * here we can't actually know whether task
> + * migrations will happen or not.
> + */
> if (!test_idle_cpus(this))
> set_idle_cpus(this, true);
> } else {
> @@ -9347,9 +9375,26 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> */
> if (sd_smt_shared->state == sd_is_busy)
> goto out;
> +
> + /*
> + * Allow false positives to exist for some time
> + * to trade the accuracy of the filter for
> + * relieved cache traffic.
> + */
> + if (sd_smt_shared->state == sd_has_icpus) {
> + new = sd_may_idle;
> + goto update;
> + }
> +
> + /*
> + * If the false positive issue has already been
> + * there for a while, a correction of the filter
> + * is needed.
> + */
> }
>
> sd_update_icpus(this, sds->idle_cpu);
The @icpu should be used here rather than 'sds->idle_cpu'.
Will be fixed in the next version.
> +update:
> sd_smt_shared->state = new;
> out:
> xchg(&sd_smt_shared->updating, 0);
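
Presumably the fix mentioned above is just passing the locally chosen
candidate instead, along the lines of:

-        sd_update_icpus(this, sds->idle_cpu);
+        sd_update_icpus(this, icpu);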
On 6/30/22 12:16 PM, Chen Yu Wrote:
> On Tue, Jun 28, 2022 at 03:58:55PM +0800, Abel Wu wrote:
>>
>> On 6/27/22 6:13 PM, Abel Wu Wrote:
>> There doesn't seem to be much difference except for the hackbench pipe test
>> at certain group counts (30~110).
> OK, a smaller LLC domain does not seem to make much difference, which might
> suggest that by leveraging the load balance code path, the read/write
> to the LLC shared mask might not be the bottleneck. I have a vague
> impression that during Aubrey's cpumask searching for idle CPUs
> work[1], there was a concern that updating the shared mask in a large LLC
> introduced cache contention and performance degradation. Maybe we
> can find that regressed test case to verify.
> [1] https://lore.kernel.org/all/[email protected]/
I just went through Aubrey's v1-v11 patches and didn't find any
particular tests other than hackbench/tbench/uperf. Please let
me know if I missed something, thanks!
>> I intend to provide better scalability
>> by applying the filter, which will be enabled when:
>>
>> - The LLC is large enough that simply traversing it becomes
>> insufficient, and/or
>>
>> - The LLC is loaded enough that unoccupied cpus are a minority.
>>
>> But it would be very nice if a more fine-grained pattern works well
>> so we can drop the above constraints.
>>
> We can first try to push a simple version, and later optimize it.
> One concern about v4 is that we changed the logic from v3, which recorded
> the overloaded CPUs, while v4 tracks unoccupied CPUs. An overloaded CPU is
> more "stable" because there is more than one running task on that runqueue.
> It is more likely to remain "occupied" for a while. That is to say,
> nr_task = 1, 2, 3... will all be regarded as occupied, while only nr_task = 0
> is unoccupied. The former would bring fewer false negatives/positives.
Yes, I like the 'overloaded mask' too, but the downside is the extra
cpumask op needed in the SIS path (the added cpumask_andnot).
Besides, in this patch, the 'overloaded mask' is also unstable because
the state is maintained at core level rather than per-cpu; some
more thoughts are in the cover letter.
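
(For reference, the extra op in question is a single additional mask
operation per scan, like the line used in the SIS_FILTER draft further below:

        cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));

where sdo_mask() returns that draft's per-LLC overloaded mask.)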
>
> So far I have tested hackbench/schbench/netperf on top of Peter's sched/core branch,
> with SIS_UTIL enabled. Overall it looks good, and netperf shows an especially
> significant improvement when the load approaches overloaded (which is aligned
> with your comment above). I'll re-run netperf for several cycles to check the
> standard deviation. And I'm also curious about v3's performance since it
> tracks overloaded CPUs, so I'll also test v3 with small modifications.
Thanks very much for your reviewing and testing.
Abel
Gentle ping
On 6/19/22 8:04 PM, Abel Wu Wrote:
> The wakeup fastpath (select_idle_sibling or SIS) plays an important role
> in maximizing the usage of cpu resources and can greatly affect overall
> performance of the system.
>
> The SIS tries to find an idle cpu inside that LLC to place the woken-up
> task. The cache hot cpus will be checked first, then other cpus of that
> LLC (domain scan) if the hot ones are not idle.
>
> The domain scan works well under light workload by simply traversing
> the cpus of the LLC due to lots of idle cpus can be available. But this
> doesn’t scale well once the LLC gets bigger and the load increases, so
> SIS_PROP was born to limit the scan cost. For now SIS_PROP just limits
> the number of cpus to be scanned, but the way of how it scans is not
> changed.
>
> This patchset introduces the SIS filter to help improving scan efficiency
> when scan depth is limited. The filter only contains the unoccupied cpus,
> and is updated during SMT level load balancing. It is expected that the
> more overloaded the system is, the less cpus will be scanned.
>
> ...
>
On Thu, Jun 30, 2022 at 06:46:08PM +0800, Abel Wu wrote:
>
> On 6/30/22 12:16 PM, Chen Yu Wrote:
> > On Tue, Jun 28, 2022 at 03:58:55PM +0800, Abel Wu wrote:
> > >
> > > On 6/27/22 6:13 PM, Abel Wu Wrote:
> > > There doesn't seem to be much difference except for the hackbench pipe test
> > > at certain group counts (30~110).
> > OK, a smaller LLC domain does not seem to make much difference, which might
> > suggest that by leveraging the load balance code path, the read/write
> > to the LLC shared mask might not be the bottleneck. I have a vague
> > impression that during Aubrey's cpumask searching for idle CPUs
> > work[1], there was a concern that updating the shared mask in a large LLC
> > introduced cache contention and performance degradation. Maybe we
> > can find that regressed test case to verify.
> > [1] https://lore.kernel.org/all/[email protected]/
>
> I just went through Aubrey's v1-v11 patches and didn't find any
> particular tests other than hackbench/tbench/uperf. Please let
> me know if I missed something, thanks!
>
I haven't found any testcase that could trigger the cache contention
issue. I think we can stick with these testcases for now; especially
tbench, which has detected a cache issue described in
https://lore.kernel.org/lkml/[email protected]
if I understand correctly.
> > > I intend to provide better scalability
> > > by applying the filter, which will be enabled when:
> > >
> > > - The LLC is large enough that simply traversing it becomes
> > > insufficient, and/or
> > >
> > > - The LLC is loaded enough that unoccupied cpus are a minority.
> > >
> > > But it would be very nice if a more fine-grained pattern works well
> > > so we can drop the above constraints.
> > >
> > We can first try to push a simple version, and later optimize it.
> > One concern about v4 is that we changed the logic from v3, which recorded
> > the overloaded CPUs, while v4 tracks unoccupied CPUs. An overloaded CPU is
> > more "stable" because there is more than one running task on that runqueue.
> > It is more likely to remain "occupied" for a while. That is to say,
> > nr_task = 1, 2, 3... will all be regarded as occupied, while only nr_task = 0
> > is unoccupied. The former would bring fewer false negatives/positives.
>
> Yes, I like the 'overloaded mask' too, but the downside is the extra
> cpumask op needed in the SIS path (the added cpumask_andnot).
> Besides, in this patch, the 'overloaded mask' is also unstable because
> the state is maintained at core level rather than per-cpu; some
> more thoughts are in the cover letter.
>
I see.
> >
> > So far I have tested hackbench/schbench/netperf on top of Peter's sched/core branch,
> > with SIS_UTIL enabled. Overall it looks good, and netperf shows an especially
> > significant improvement when the load approaches overloaded (which is aligned
> > with your comment above). I'll re-run netperf for several cycles to check the
> > standard deviation. And I'm also curious about v3's performance since it
> > tracks overloaded CPUs, so I'll also test v3 with small modifications.
>
> Thanks very much for your reviewing and testing.
>
I modified your v3 patch a little bit, and the test results show good improvement
on netperf and no significant regression on schbench/tbench/hackbench with this draft
patch. I would like to vote for your v3 version as it seems to be more straightforward.
What do you think of the following change:
From 277b60b7cd055d5be93188a552da50fdfe53214c Mon Sep 17 00:00:00 2001
From: Abel Wu <[email protected]>
Date: Fri, 8 Jul 2022 02:16:47 +0800
Subject: [PATCH] sched/fair: Introduce SIS_FILTER to skip overloaded CPUs
during SIS
Currently SIS_UTIL is used to limit the scan depth of idle CPUs in
select_idle_cpu(). There could be another optimization to filter out
the overloaded CPUs so as to further speed up select_idle_cpu().
Launch the CPU overload check in the periodic tick, and take the
nr_running, avg_util and runnable_avg of that CPU into consideration. If the CPU is
overloaded, add it into the per-LLC overload cpumask, so select_idle_cpu()
could skip those overloaded CPUs. Although this detection happens in the periodic
tick, checking the pelt signals of the CPU makes the 'overloaded' state
more stable and reduces the frequency of updates to the LLC shared mask,
so as to mitigate cache contention in the LLC.
The following results were tested on top of the latest sched/core tip.
The baseline is with SIS_UTIL enabled, and it is compared with both SIS_FILTER
and SIS_UTIL enabled. A positive %compare stands for better performance.
hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1 group 1.00 ( 0.59) -1.35 ( 0.88)
process-pipe 2 groups 1.00 ( 0.38) -1.49 ( 0.04)
process-pipe 4 groups 1.00 ( 0.45) +0.10 ( 0.91)
process-pipe 8 groups 1.00 ( 0.11) +0.03 ( 0.38)
process-sockets 1 group 1.00 ( 3.48) +2.88 ( 7.07)
process-sockets 2 groups 1.00 ( 2.38) -3.78 ( 2.81)
process-sockets 4 groups 1.00 ( 0.26) -1.79 ( 0.82)
process-sockets 8 groups 1.00 ( 0.07) -0.35 ( 0.07)
threads-pipe 1 group 1.00 ( 0.87) -0.21 ( 0.71)
threads-pipe 2 groups 1.00 ( 0.63) +0.34 ( 0.45)
threads-pipe 4 groups 1.00 ( 0.18) -0.02 ( 0.50)
threads-pipe 8 groups 1.00 ( 0.08) +0.46 ( 0.05)
threads-sockets 1 group 1.00 ( 0.80) -0.08 ( 1.06)
threads-sockets 2 groups 1.00 ( 0.55) +0.06 ( 0.85)
threads-sockets 4 groups 1.00 ( 1.00) -2.13 ( 0.18)
threads-sockets 8 groups 1.00 ( 0.07) -0.41 ( 0.08)
netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 28 threads 1.00 ( 0.50) +0.19 ( 0.53)
TCP_RR 56 threads 1.00 ( 0.33) +0.31 ( 0.35)
TCP_RR 84 threads 1.00 ( 0.23) +0.15 ( 0.28)
TCP_RR 112 threads 1.00 ( 0.20) +0.03 ( 0.21)
TCP_RR 140 threads 1.00 ( 0.17) +0.20 ( 0.18)
TCP_RR 168 threads 1.00 ( 0.17) +112.84 ( 40.35)
TCP_RR 196 threads 1.00 ( 16.66) +0.39 ( 15.72)
TCP_RR 224 threads 1.00 ( 10.28) +0.05 ( 9.97)
UDP_RR 28 threads 1.00 ( 16.15) -0.13 ( 0.93)
UDP_RR 56 threads 1.00 ( 7.76) +1.24 ( 0.44)
UDP_RR 84 threads 1.00 ( 11.68) -0.49 ( 6.33)
UDP_RR 112 threads 1.00 ( 8.49) -0.21 ( 7.77)
UDP_RR 140 threads 1.00 ( 8.49) +2.05 ( 19.88)
UDP_RR 168 threads 1.00 ( 8.91) +1.67 ( 11.74)
UDP_RR 196 threads 1.00 ( 19.96) +4.35 ( 21.37)
UDP_RR 224 threads 1.00 ( 19.44) +4.38 ( 16.61)
tbench
======
case load baseline(std%) compare%( std%)
loopback 28 threads 1.00 ( 0.12) +0.57 ( 0.12)
loopback 56 threads 1.00 ( 0.11) +0.42 ( 0.11)
loopback 84 threads 1.00 ( 0.09) +0.71 ( 0.03)
loopback 112 threads 1.00 ( 0.03) -0.13 ( 0.08)
loopback 140 threads 1.00 ( 0.29) +0.59 ( 0.01)
loopback 168 threads 1.00 ( 0.01) +0.86 ( 0.03)
loopback 196 threads 1.00 ( 0.02) +0.97 ( 0.21)
loopback 224 threads 1.00 ( 0.04) +0.83 ( 0.22)
schbench
========
case load baseline(std%) compare%( std%)
normal 1 mthread 1.00 ( 0.00) -8.82 ( 0.00)
normal 2 mthreads 1.00 ( 0.00) +0.00 ( 0.00)
normal 4 mthreads 1.00 ( 0.00) +17.02 ( 0.00)
normal 8 mthreads 1.00 ( 0.00) -4.84 ( 0.00)
Signed-off-by: Abel Wu <[email protected]>
---
include/linux/sched/topology.h | 6 +++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 2 ++
kernel/sched/topology.c | 3 ++-
6 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..c03076850a67 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,8 +82,14 @@ struct sched_domain_shared {
atomic_t nr_busy_cpus;
int has_idle_cores;
int nr_idle_scan;
+ unsigned long overloaded_cpus[];
};
+static inline struct cpumask *sdo_mask(struct sched_domain_shared *sds)
+{
+ return to_cpumask(sds->overloaded_cpus);
+}
+
struct sched_domain {
/* These fields must be setup */
struct sched_domain __rcu *parent; /* top domain must be null terminated */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d3e2c5a7c1b7..452eb63ee6f6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5395,6 +5395,7 @@ void scheduler_tick(void)
resched_latency = cpu_resched_latency(rq);
calc_global_load_tick(rq);
sched_core_tick(rq);
+ update_overloaded_rq(rq);
rq_unlock(rq, &rf);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f80ae86bb404..34b1650f85f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6323,6 +6323,50 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
#endif /* CONFIG_SCHED_SMT */
+/* derived from group_is_overloaded() */
+static inline bool rq_overloaded(struct rq *rq, int cpu, unsigned int imbalance_pct)
+{
+ if (rq->nr_running - rq->cfs.idle_h_nr_running <= 1)
+ return false;
+
+ if ((SCHED_CAPACITY_SCALE * 100) <
+ (cpu_util_cfs(cpu) * imbalance_pct))
+ return true;
+
+ if ((SCHED_CAPACITY_SCALE * imbalance_pct) <
+ (cpu_runnable(rq) * 100))
+ return true;
+
+ return false;
+}
+
+void update_overloaded_rq(struct rq *rq)
+{
+ struct sched_domain_shared *sds;
+ struct sched_domain *sd;
+ int cpu;
+
+ if (!sched_feat(SIS_FILTER))
+ return;
+
+ cpu = cpu_of(rq);
+ sd = rcu_dereference(per_cpu(sd_llc, cpu));
+ if (unlikely(!sd))
+ return;
+
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (unlikely(!sds))
+ return;
+
+ if (rq_overloaded(rq, cpu, sd->imbalance_pct)) {
+ /* avoid duplicated write, mitigate cache contention */
+ if (!cpumask_test_cpu(cpu, sdo_mask(sds)))
+ cpumask_set_cpu(cpu, sdo_mask(sds));
+ } else {
+ if (cpumask_test_cpu(cpu, sdo_mask(sds)))
+ cpumask_clear_cpu(cpu, sdo_mask(sds));
+ }
+}
/*
* Scan the LLC domain for idle CPUs; this is dynamically regulated by
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6383,6 +6427,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
}
}
+ if (sched_feat(SIS_FILTER) && !has_idle_core && sd->shared)
+ cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));
+
for_each_cpu_wrap(cpu, cpus, target + 1) {
if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..1bebdb87c2f4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_FILTER, true)
/*
* Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 02c970501295..316127ab1ec7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1812,6 +1812,8 @@ static inline struct cpumask *group_balance_mask(struct sched_group *sg)
extern int group_balance_cpu(struct sched_group *sg);
+void update_overloaded_rq(struct rq *rq);
+
#ifdef CONFIG_SCHED_DEBUG
void update_sched_domain_debugfs(void);
void dirty_sched_domain_sysctl(int cpu);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..0d149e76a3b3 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,6 +1641,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+ cpumask_clear(sdo_mask(sd->shared));
}
sd->private = sdd;
@@ -2106,7 +2107,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sd, j) = sd;
- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
--
2.25.1
Greeting,
FYI, we noticed a -11.7% regression of phoronix-test-suite.fio.SequentialWrite.IO_uring.Yes.No.4KB.DefaultTestDirectory.mb_s due to commit:
commit: 32fe13cd7aa184ed349d698ebf6f420fa426dd73 ("[PATCH v4 7/7] sched/fair: de-entropy for SIS filter")
url: https://github.com/intel-lab-lkp/linux/commits/Abel-Wu/sched-fair-improve-scan-efficiency-of-SIS/20220619-200743
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git f3dd3f674555bd9455c5ae7fafce0696bd9931b3
patch link: https://lore.kernel.org/lkml/[email protected]
in testcase: phoronix-test-suite
on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 512G memory
with following parameters:
test: fio-1.14.1
option_a: Sequential Write
option_b: IO_uring
option_c: Yes
option_d: No
option_e: 4KB
option_f: Default Test Directory
cpufreq_governor: performance
ucode: 0x500320a
test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/
In addition to that, the commit also has significant impact on the following tests:
+------------------+-------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.vm-rw.ops_per_sec 113.5% improvement |
| test machine | 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory |
| test parameters | class=memory |
| | cpufreq_governor=performance |
| | nr_threads=100% |
| | test=vm-rw |
| | testtime=60s |
| | ucode=0xd000331 |
+------------------+-------------------------------------------------------------------------------------+
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file
# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/option_b/option_c/option_d/option_e/option_f/rootfs/tbox_group/test/testcase/ucode:
gcc-11/performance/x86_64-rhel-8.3/Sequential Write/IO_uring/Yes/No/4KB/Default Test Directory/debian-x86_64-phoronix/lkp-csl-2sp7/fio-1.14.1/phoronix-test-suite/0x500320a
commit:
fcc108377a ("sched/fair: skip busy cores in SIS search")
32fe13cd7a ("sched/fair: de-entropy for SIS filter")
fcc108377a7cf79c 32fe13cd7aa184ed349d698ebf6
---------------- ---------------------------
%stddev %change %stddev
\ | \
166666 -11.6% 147277 ? 3% phoronix-test-suite.fio.SequentialWrite.IO_uring.Yes.No.4KB.DefaultTestDirectory.iops
651.00 -11.7% 574.83 ? 3% phoronix-test-suite.fio.SequentialWrite.IO_uring.Yes.No.4KB.DefaultTestDirectory.mb_s
3145 ? 5% -18.4% 2565 ? 12% meminfo.Writeback
0.19 ? 4% -0.0 0.17 ? 2% mpstat.cpu.all.iowait%
2228 ? 33% -37.5% 1392 ? 21% numa-meminfo.node0.Writeback
553.33 ? 37% -35.9% 354.83 ? 18% numa-vmstat.node0.nr_writeback
445604 ? 4% -12.5% 390116 ? 4% vmstat.io.bo
14697101 ? 3% -11.0% 13074497 ? 4% perf-stat.i.cache-misses
9447 ? 8% -37.6% 5890 ? 5% perf-stat.i.cpu-migrations
5125 ? 6% +12.9% 5786 ? 6% perf-stat.i.instructions-per-iTLB-miss
2330431 ? 4% -11.4% 2064845 ? 4% perf-stat.i.node-loads
2.55 ?104% -1.6 0.96 ? 14% perf-profile.calltrace.cycles-pp.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
2.62 ?102% -1.6 0.99 ? 14% perf-profile.children.cycles-pp.poll_idle
0.82 ? 23% -0.3 0.53 ? 23% perf-profile.children.cycles-pp.asm_sysvec_call_function_single
0.74 ? 23% -0.3 0.46 ? 23% perf-profile.children.cycles-pp.sysvec_call_function_single
0.69 ? 24% -0.3 0.44 ? 24% perf-profile.children.cycles-pp.__sysvec_call_function_single
0.38 ? 10% -0.1 0.28 ? 18% perf-profile.children.cycles-pp.__perf_event_header__init_id
0.16 ? 13% -0.0 0.11 ? 22% perf-profile.children.cycles-pp.__task_pid_nr_ns
2.10 ?108% -1.3 0.79 ? 11% perf-profile.self.cycles-pp.poll_idle
0.16 ? 13% -0.0 0.11 ? 22% perf-profile.self.cycles-pp.__task_pid_nr_ns
***************************************************************************************************
lkp-icl-2sp6: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime/ucode:
memory/gcc-11/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp6/vm-rw/stress-ng/60s/0xd000331
commit:
fcc108377a ("sched/fair: skip busy cores in SIS search")
32fe13cd7a ("sched/fair: de-entropy for SIS filter")
fcc108377a7cf79c 32fe13cd7aa184ed349d698ebf6
---------------- ---------------------------
%stddev %change %stddev
\ | \
7328835 ? 17% +3441.0% 2.595e+08 ? 12% stress-ng.time.involuntary_context_switches
123165 ? 3% -14.1% 105742 ? 2% stress-ng.time.minor_page_faults
8940 +32.8% 11872 ? 2% stress-ng.time.percent_of_cpu_this_job_got
5268 +33.4% 7027 ? 2% stress-ng.time.system_time
278.70 +21.5% 338.70 ? 2% stress-ng.time.user_time
2.554e+08 +13.3% 2.894e+08 stress-ng.time.voluntary_context_switches
1.283e+08 +113.5% 2.74e+08 ? 6% stress-ng.vm-rw.ops
2139049 +113.5% 4567054 ? 6% stress-ng.vm-rw.ops_per_sec
39411 ? 34% +56.3% 61612 ? 24% numa-meminfo.node1.Mapped
5013 -22.5% 3883 ? 4% uptime.idle
1.798e+09 -60.3% 7.135e+08 ? 21% cpuidle..time
1.701e+08 -87.3% 21598951 ? 90% cpuidle..usage
75821 ? 2% -11.6% 67063 ? 5% meminfo.Active
75821 ? 2% -11.6% 67063 ? 5% meminfo.Active(anon)
81710 ? 2% +20.1% 98158 ? 3% meminfo.Mapped
26.00 -59.6% 10.50 ? 18% vmstat.cpu.id
112.00 +10.9% 124.17 vmstat.procs.r
6561639 +31.6% 8634043 ? 2% vmstat.system.cs
990604 -62.4% 372118 ? 18% vmstat.system.in
24.13 -16.1 8.03 ? 23% mpstat.cpu.all.idle%
2.71 -1.6 1.11 ? 10% mpstat.cpu.all.irq%
0.17 ? 6% -0.1 0.06 ? 30% mpstat.cpu.all.soft%
69.33 +17.4 86.71 ? 2% mpstat.cpu.all.sys%
3.66 +0.4 4.09 mpstat.cpu.all.usr%
2.024e+09 +93.3% 3.912e+09 ? 16% numa-vmstat.node0.nr_foll_pin_acquired
2.024e+09 +93.3% 3.912e+09 ? 16% numa-vmstat.node0.nr_foll_pin_released
2.043e+09 ? 2% +119.0% 4.473e+09 numa-vmstat.node1.nr_foll_pin_acquired
2.043e+09 ? 2% +119.0% 4.473e+09 numa-vmstat.node1.nr_foll_pin_released
9865 ? 34% +54.1% 15201 ? 23% numa-vmstat.node1.nr_mapped
18954 ? 2% -11.5% 16767 ? 5% proc-vmstat.nr_active_anon
4.062e+09 +107.3% 8.419e+09 ? 7% proc-vmstat.nr_foll_pin_acquired
4.062e+09 +107.3% 8.419e+09 ? 7% proc-vmstat.nr_foll_pin_released
87380 +5.3% 92039 proc-vmstat.nr_inactive_anon
24453 -3.2% 23658 proc-vmstat.nr_kernel_stack
20437 ? 2% +19.6% 24443 ? 3% proc-vmstat.nr_mapped
18954 ? 2% -11.5% 16767 ? 5% proc-vmstat.nr_zone_active_anon
87380 +5.3% 92039 proc-vmstat.nr_zone_inactive_anon
108777 ? 4% -17.2% 90014 proc-vmstat.numa_hint_faults
96756 ? 6% -17.6% 79691 ? 2% proc-vmstat.numa_hint_faults_local
490607 -4.4% 469155 proc-vmstat.pgfault
80.85 +10.9 91.75 turbostat.Busy%
3221 -5.0% 3060 turbostat.Bzy_MHz
77259218 ? 3% -87.0% 10057388 ? 92% turbostat.C1
6.74 ? 2% -5.9 0.85 ? 90% turbostat.C1%
92212921 -87.8% 11243535 ? 91% turbostat.C1E
12.00 ? 22% -6.6 5.42 ? 57% turbostat.C1E%
16.39 ? 16% -62.0% 6.24 ? 55% turbostat.CPU%c1
0.16 ? 3% +74.7% 0.29 ? 6% turbostat.IPC
65322725 -62.5% 24502370 ? 18% turbostat.IRQ
339708 -86.5% 45941 ? 88% turbostat.POLL
0.05 -0.0 0.01 ? 82% turbostat.POLL%
165121 ? 23% -100.0% 39.19 ?101% sched_debug.cfs_rq:/.MIN_vruntime.avg
2462709 -99.9% 3407 ?102% sched_debug.cfs_rq:/.MIN_vruntime.max
607348 ? 11% -99.9% 348.57 ?100% sched_debug.cfs_rq:/.MIN_vruntime.stddev
0.56 ? 4% +11.8% 0.62 ? 3% sched_debug.cfs_rq:/.h_nr_running.avg
2.58 ? 13% -38.7% 1.58 ? 11% sched_debug.cfs_rq:/.h_nr_running.max
0.54 ? 9% -39.7% 0.33 ? 6% sched_debug.cfs_rq:/.h_nr_running.stddev
165121 ? 23% -100.0% 39.19 ?101% sched_debug.cfs_rq:/.max_vruntime.avg
2462709 -99.9% 3407 ?102% sched_debug.cfs_rq:/.max_vruntime.max
607348 ? 11% -99.9% 348.57 ?100% sched_debug.cfs_rq:/.max_vruntime.stddev
2439879 +43.2% 3493834 ? 4% sched_debug.cfs_rq:/.min_vruntime.avg
2485561 +49.1% 3705888 sched_debug.cfs_rq:/.min_vruntime.max
2129935 +34.5% 2865147 ? 2% sched_debug.cfs_rq:/.min_vruntime.min
35480 ? 17% +324.2% 150497 ? 59% sched_debug.cfs_rq:/.min_vruntime.stddev
0.43 ? 3% +27.9% 0.55 sched_debug.cfs_rq:/.nr_running.avg
0.35 ? 5% -57.2% 0.15 ? 4% sched_debug.cfs_rq:/.nr_running.stddev
2186 ? 15% -27.9% 1575 ? 11% sched_debug.cfs_rq:/.runnable_avg.max
152.08 ? 6% +134.5% 356.58 ? 31% sched_debug.cfs_rq:/.runnable_avg.min
399.32 ? 4% -50.5% 197.69 ? 8% sched_debug.cfs_rq:/.runnable_avg.stddev
25106 ? 50% +1121.1% 306577 ? 66% sched_debug.cfs_rq:/.spread0.max
35510 ? 17% +323.3% 150305 ? 59% sched_debug.cfs_rq:/.spread0.stddev
545.95 ? 3% +16.4% 635.59 sched_debug.cfs_rq:/.util_avg.avg
1726 ? 15% -26.7% 1266 ? 14% sched_debug.cfs_rq:/.util_avg.max
154.67 ? 2% +112.9% 329.33 ? 30% sched_debug.cfs_rq:/.util_avg.min
317.35 ? 4% -43.1% 180.53 ? 10% sched_debug.cfs_rq:/.util_avg.stddev
192.70 ? 6% +104.5% 393.98 ? 7% sched_debug.cfs_rq:/.util_est_enqueued.avg
5359 ? 4% -26.1% 3958 ? 8% sched_debug.cpu.avg_idle.min
4.69 ? 7% +136.0% 11.07 ? 5% sched_debug.cpu.clock.stddev
2380 ? 4% +31.0% 3117 sched_debug.cpu.curr->pid.avg
1818 ? 3% -65.9% 620.26 ? 8% sched_debug.cpu.curr->pid.stddev
0.00 ? 8% +59.7% 0.00 ? 10% sched_debug.cpu.next_balance.stddev
2.58 ? 17% -41.9% 1.50 sched_debug.cpu.nr_running.max
0.52 ? 9% -43.2% 0.29 ? 5% sched_debug.cpu.nr_running.stddev
1610935 +31.3% 2115112 ? 2% sched_debug.cpu.nr_switches.avg
1661619 +34.5% 2234069 sched_debug.cpu.nr_switches.max
1415677 ? 3% +20.3% 1702445 sched_debug.cpu.nr_switches.min
30576 ? 26% +151.6% 76923 ? 37% sched_debug.cpu.nr_switches.stddev
25.47 -91.3% 2.21 ? 69% perf-stat.i.MPKI
3.342e+10 +84.7% 6.172e+10 ? 5% perf-stat.i.branch-instructions
0.58 -0.3 0.33 ? 5% perf-stat.i.branch-miss-rate%
1.667e+08 -13.2% 1.448e+08 ? 2% perf-stat.i.branch-misses
0.63 ? 17% +4.8 5.42 ? 39% perf-stat.i.cache-miss-rate%
18939524 ? 4% -46.6% 10109353 ? 18% perf-stat.i.cache-misses
4.422e+09 -87.1% 5.724e+08 ? 77% perf-stat.i.cache-references
6897069 +30.8% 9023752 ? 2% perf-stat.i.context-switches
2.04 -43.3% 1.16 ? 5% perf-stat.i.cpi
3.523e+11 +3.8% 3.656e+11 perf-stat.i.cpu-cycles
2322589 -86.6% 310934 ? 93% perf-stat.i.cpu-migrations
18560 ? 4% +113.2% 39578 ? 15% perf-stat.i.cycles-between-cache-misses
0.20 -0.2 0.02 ? 70% perf-stat.i.dTLB-load-miss-rate%
85472762 -87.2% 10962661 ? 82% perf-stat.i.dTLB-load-misses
4.266e+10 +83.8% 7.841e+10 ? 5% perf-stat.i.dTLB-loads
0.10 ? 4% -0.1 0.01 ? 72% perf-stat.i.dTLB-store-miss-rate%
25396322 ? 4% -86.5% 3437369 ? 90% perf-stat.i.dTLB-store-misses
2.483e+10 +85.2% 4.598e+10 ? 5% perf-stat.i.dTLB-stores
1.699e+11 +85.8% 3.157e+11 ? 5% perf-stat.i.instructions
0.50 +73.4% 0.87 ? 4% perf-stat.i.ipc
2.75 +3.8% 2.86 perf-stat.i.metric.GHz
822.90 +77.2% 1458 ? 5% perf-stat.i.metric.M/sec
5691 -3.4% 5500 perf-stat.i.minor-faults
91.09 +4.6 95.71 perf-stat.i.node-load-miss-rate%
334087 ? 17% -67.4% 109033 ? 18% perf-stat.i.node-loads
70.09 +17.6 87.68 ? 6% perf-stat.i.node-store-miss-rate%
1559730 ? 5% -64.9% 548115 ? 56% perf-stat.i.node-stores
5704 -3.3% 5513 perf-stat.i.page-faults
26.03 -92.7% 1.89 ? 83% perf-stat.overall.MPKI
0.50 -0.3 0.24 ? 8% perf-stat.overall.branch-miss-rate%
0.43 ? 3% +2.5 2.91 ? 60% perf-stat.overall.cache-miss-rate%
2.08 -44.0% 1.16 ? 5% perf-stat.overall.cpi
18664 ? 4% +100.4% 37402 ? 16% perf-stat.overall.cycles-between-cache-misses
0.20 -0.2 0.01 ? 87% perf-stat.overall.dTLB-load-miss-rate%
0.10 ? 4% -0.1 0.01 ? 96% perf-stat.overall.dTLB-store-miss-rate%
0.48 +79.1% 0.86 ? 4% perf-stat.overall.ipc
91.02 +5.1 96.07 perf-stat.overall.node-load-miss-rate%
70.91 +17.6 88.54 ? 6% perf-stat.overall.node-store-miss-rate%
3.289e+10 +85.0% 6.085e+10 ? 5% perf-stat.ps.branch-instructions
1.641e+08 -13.1% 1.425e+08 ? 2% perf-stat.ps.branch-misses
18633656 ? 4% -46.7% 9931368 ? 18% perf-stat.ps.cache-misses
4.354e+09 -87.1% 5.613e+08 ? 77% perf-stat.ps.cache-references
6788892 +31.0% 8894592 ? 2% perf-stat.ps.context-switches
3.47e+11 +3.9% 3.604e+11 perf-stat.ps.cpu-cycles
2286778 -86.7% 304327 ? 94% perf-stat.ps.cpu-migrations
84173329 -87.2% 10770448 ? 82% perf-stat.ps.dTLB-load-misses
4.198e+10 +84.1% 7.73e+10 ? 5% perf-stat.ps.dTLB-loads
25001705 ? 4% -86.5% 3364501 ? 91% perf-stat.ps.dTLB-store-misses
2.444e+10 +85.5% 4.533e+10 ? 5% perf-stat.ps.dTLB-stores
1.673e+11 +86.1% 3.112e+11 ? 5% perf-stat.ps.instructions
12.40 -1.5% 12.22 perf-stat.ps.major-faults
5543 -3.9% 5329 perf-stat.ps.minor-faults
332272 ? 17% -66.0% 112911 ? 16% perf-stat.ps.node-loads
1533930 ? 5% -65.2% 534337 ? 57% perf-stat.ps.node-stores
5556 -3.9% 5341 perf-stat.ps.page-faults
1.065e+13 +86.7% 1.988e+13 ? 5% perf-stat.total.instructions
18.10 -16.2 1.91 ?142% perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
17.94 -16.1 1.88 ?142% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
17.93 -16.1 1.88 ?142% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
17.90 -16.0 1.88 ?142% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
15.83 -8.0 7.86 ? 20% perf-profile.calltrace.cycles-pp.read
13.06 -8.0 5.11 ? 30% perf-profile.calltrace.cycles-pp.pipe_read.new_sync_read.vfs_read.ksys_read.do_syscall_64
13.21 -7.9 5.30 ? 29% perf-profile.calltrace.cycles-pp.new_sync_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
13.58 -7.9 5.68 ? 27% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
14.72 -7.9 6.86 ? 22% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read
14.51 -7.8 6.73 ? 22% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
13.77 -7.5 6.23 ? 23% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
9.90 -7.1 2.83 ? 47% perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_read.new_sync_read.vfs_read
9.96 -7.0 2.92 ? 45% perf-profile.calltrace.cycles-pp.schedule.pipe_read.new_sync_read.vfs_read.ksys_read
7.84 -6.9 0.94 ?142% perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
7.44 -6.8 0.63 ?142% perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
7.07 -6.2 0.85 ?142% perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
7.03 -6.2 0.84 ?142% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
8.60 -5.3 3.30 ? 44% perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_write.new_sync_write.vfs_write
8.75 -5.3 3.49 ? 41% perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_write.new_sync_write.vfs_write.ksys_write
10.84 -5.3 5.58 ? 27% perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
10.53 -5.2 5.29 ? 28% perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
6.70 -5.2 1.49 ? 62% perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.pipe_read.new_sync_read
10.02 -5.2 4.83 ? 30% perf-profile.calltrace.cycles-pp.pipe_write.new_sync_write.vfs_write.ksys_write.do_syscall_64
5.64 -5.2 0.48 ?142% perf-profile.calltrace.cycles-pp.sched_ttwu_pending.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
10.08 -5.1 4.98 ? 29% perf-profile.calltrace.cycles-pp.new_sync_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
8.21 -5.1 3.14 ? 44% perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write
8.25 -5.1 3.20 ? 44% perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write.new_sync_write
5.11 -4.7 0.40 ?141% perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue.do_idle.cpu_startup_entry
5.07 -4.7 0.40 ?141% perf-profile.calltrace.cycles-pp.enqueue_task_fair.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue.do_idle
5.19 -4.6 0.55 ?141% perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.pipe_read
20.28 -4.6 15.69 ? 5% perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string.copyout.copy_page_to_iter.process_vm_rw_single_vec.process_vm_rw_core
20.55 -4.3 16.22 ? 5% perf-profile.calltrace.cycles-pp.copyout.copy_page_to_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
10.91 -3.7 7.17 ? 17% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
24.78 -3.7 21.12 ? 2% perf-profile.calltrace.cycles-pp.copy_page_to_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_readv
10.95 -3.6 7.30 ? 16% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.write
11.15 -3.0 8.18 ? 13% perf-profile.calltrace.cycles-pp.write
1.03 ? 4% -0.4 0.62 ? 14% perf-profile.calltrace.cycles-pp.stress_vm_child
0.76 ? 4% +0.5 1.22 ? 10% perf-profile.calltrace.cycles-pp.stress_vm_rw
0.63 +0.5 1.15 ? 23% perf-profile.calltrace.cycles-pp.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common
0.53 ? 2% +0.5 1.06 ? 9% perf-profile.calltrace.cycles-pp.__might_fault.copy_page_to_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
0.64 +0.5 1.18 ? 22% perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
0.00 +0.6 0.60 ? 7% perf-profile.calltrace.cycles-pp.__might_resched.copy_page_to_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
0.00 +1.0 0.96 ? 15% perf-profile.calltrace.cycles-pp.__schedule.schedule.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode
0.00 +1.0 1.02 ? 17% perf-profile.calltrace.cycles-pp.mod_node_page_state.gup_put_folio.unpin_user_pages.process_vm_rw_single_vec.process_vm_rw_core
0.00 +1.0 1.02 ? 10% perf-profile.calltrace.cycles-pp.__might_fault.copy_page_from_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
0.00 +1.0 1.03 ? 15% perf-profile.calltrace.cycles-pp.schedule.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
0.00 +1.0 1.04 ? 17% perf-profile.calltrace.cycles-pp.mod_node_page_state.gup_put_folio.unpin_user_pages_dirty_lock.process_vm_rw_single_vec.process_vm_rw_core
0.00 +1.1 1.10 ? 16% perf-profile.calltrace.cycles-pp.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.24 ? 2% +1.4 2.60 ? 11% perf-profile.calltrace.cycles-pp._raw_spin_lock.follow_page_pte.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec
0.00 +1.4 1.40 ? 16% perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
0.00 +1.5 1.46 ? 16% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
1.04 +1.5 2.54 ? 14% perf-profile.calltrace.cycles-pp.gup_put_folio.unpin_user_pages.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
0.99 ? 2% +1.6 2.57 ? 14% perf-profile.calltrace.cycles-pp.gup_put_folio.unpin_user_pages_dirty_lock.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
0.00 +1.7 1.69 ? 11% perf-profile.calltrace.cycles-pp.follow_pud_mask.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core
1.35 +1.8 3.20 ? 14% perf-profile.calltrace.cycles-pp.unpin_user_pages.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_readv
0.00 +2.0 1.97 ? 10% perf-profile.calltrace.cycles-pp.follow_page_mask.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core
1.27 ? 2% +2.0 3.30 ? 14% perf-profile.calltrace.cycles-pp.unpin_user_pages_dirty_lock.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_writev
0.00 +2.1 2.12 ? 18% perf-profile.calltrace.cycles-pp.mod_node_page_state.try_grab_page.follow_page_pte.__get_user_pages.__get_user_pages_remote
0.00 +2.3 2.30 ? 11% perf-profile.calltrace.cycles-pp.follow_pmd_mask.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core
3.16 ? 2% +2.4 5.51 ? 11% perf-profile.calltrace.cycles-pp.try_grab_page.follow_page_pte.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec
32.57 +5.2 37.78 ? 3% perf-profile.calltrace.cycles-pp.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_readv.do_syscall_64
33.67 +5.6 39.24 ? 3% perf-profile.calltrace.cycles-pp.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_readv.do_syscall_64.entry_SYSCALL_64_after_hwframe
6.24 +6.1 12.34 ? 10% perf-profile.calltrace.cycles-pp.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_readv
34.23 +6.2 40.39 ? 3% perf-profile.calltrace.cycles-pp.process_vm_rw.__x64_sys_process_vm_readv.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_readv
34.22 +6.2 40.42 ? 3% perf-profile.calltrace.cycles-pp.__x64_sys_process_vm_readv.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_readv
34.39 +6.3 40.68 ? 3% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_readv
34.49 +6.4 40.88 ? 3% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.process_vm_readv
34.87 +6.6 41.43 ? 3% perf-profile.calltrace.cycles-pp.process_vm_readv
6.26 +6.6 12.83 ? 11% perf-profile.calltrace.cycles-pp.follow_page_pte.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core
7.43 +6.6 14.02 ? 8% perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string.copyin.copy_page_from_iter.process_vm_rw_single_vec.process_vm_rw_core
7.70 +6.9 14.64 ? 8% perf-profile.calltrace.cycles-pp.copyin.copy_page_from_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
4.80 ? 2% +7.2 11.95 ? 12% perf-profile.calltrace.cycles-pp.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_writev
9.52 +9.3 18.86 ? 8% perf-profile.calltrace.cycles-pp.copy_page_from_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_writev
10.80 +12.9 23.71 ? 11% perf-profile.calltrace.cycles-pp.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
16.02 ? 2% +19.0 35.05 ? 10% perf-profile.calltrace.cycles-pp.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_writev.do_syscall_64
16.70 ? 2% +19.7 36.44 ? 10% perf-profile.calltrace.cycles-pp.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_writev.do_syscall_64.entry_SYSCALL_64_after_hwframe
17.15 ? 2% +20.4 37.51 ? 10% perf-profile.calltrace.cycles-pp.process_vm_rw.__x64_sys_process_vm_writev.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_writev
17.17 ? 2% +20.4 37.55 ? 10% perf-profile.calltrace.cycles-pp.__x64_sys_process_vm_writev.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_writev
17.26 ? 2% +20.5 37.72 ? 10% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_writev
17.31 ? 2% +20.5 37.83 ? 10% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.process_vm_writev
17.60 ? 2% +20.8 38.42 ? 10% perf-profile.calltrace.cycles-pp.process_vm_writev
18.10 -16.2 1.92 ?141% perf-profile.children.cycles-pp.secondary_startup_64_no_verify
18.10 -16.2 1.92 ?141% perf-profile.children.cycles-pp.cpu_startup_entry
18.08 -16.2 1.92 ?141% perf-profile.children.cycles-pp.do_idle
17.94 -16.0 1.89 ?141% perf-profile.children.cycles-pp.start_secondary
16.00 -8.0 8.02 ? 19% perf-profile.children.cycles-pp.read
12.02 -7.9 4.10 ? 37% perf-profile.children.cycles-pp.__schedule
13.10 -7.9 5.18 ? 30% perf-profile.children.cycles-pp.pipe_read
13.22 -7.9 5.31 ? 29% perf-profile.children.cycles-pp.new_sync_read
13.60 -7.9 5.70 ? 27% perf-profile.children.cycles-pp.vfs_read
13.78 -7.5 6.25 ? 23% perf-profile.children.cycles-pp.ksys_read
7.62 -7.0 0.65 ?142% perf-profile.children.cycles-pp.flush_smp_call_function_queue
7.92 -7.0 0.96 ?142% perf-profile.children.cycles-pp.cpuidle_idle_call
7.18 -6.6 0.59 ?142% perf-profile.children.cycles-pp.sched_ttwu_pending
7.14 -6.3 0.86 ?142% perf-profile.children.cycles-pp.cpuidle_enter
7.24 -6.3 0.98 ?124% perf-profile.children.cycles-pp.update_cfs_group
7.12 -6.3 0.86 ?142% perf-profile.children.cycles-pp.cpuidle_enter_state
9.97 -6.0 3.96 ? 29% perf-profile.children.cycles-pp.schedule
6.52 -5.7 0.79 ?142% perf-profile.children.cycles-pp.mwait_idle_with_hints
7.11 -5.4 1.69 ? 57% perf-profile.children.cycles-pp.ttwu_do_activate
7.07 -5.4 1.66 ? 58% perf-profile.children.cycles-pp.enqueue_task_fair
8.61 -5.3 3.30 ? 44% perf-profile.children.cycles-pp.__wake_up_common
10.86 -5.2 5.61 ? 26% perf-profile.children.cycles-pp.ksys_write
8.76 -5.2 3.51 ? 41% perf-profile.children.cycles-pp.__wake_up_common_lock
10.55 -5.2 5.32 ? 28% perf-profile.children.cycles-pp.vfs_write
6.72 -5.2 1.50 ? 61% perf-profile.children.cycles-pp.dequeue_task_fair
10.04 -5.2 4.88 ? 30% perf-profile.children.cycles-pp.pipe_write
10.10 -5.1 5.00 ? 29% perf-profile.children.cycles-pp.new_sync_write
8.26 -5.1 3.20 ? 44% perf-profile.children.cycles-pp.autoremove_wake_function
8.22 -5.0 3.17 ? 44% perf-profile.children.cycles-pp.try_to_wake_up
5.62 -4.7 0.94 ? 62% perf-profile.children.cycles-pp.enqueue_entity
21.11 -4.4 16.69 ? 5% perf-profile.children.cycles-pp.copyout
5.21 -4.4 0.84 ? 70% perf-profile.children.cycles-pp.dequeue_entity
5.50 -4.3 1.25 ? 48% perf-profile.children.cycles-pp.update_load_avg
25.20 -3.0 22.21 ? 2% perf-profile.children.cycles-pp.copy_page_to_iter
11.21 -2.9 8.33 ? 13% perf-profile.children.cycles-pp.write
3.29 -2.5 0.83 ? 78% perf-profile.children.cycles-pp.select_task_rq
3.22 -2.4 0.78 ? 83% perf-profile.children.cycles-pp.select_task_rq_fair
2.78 -2.2 0.62 ? 93% perf-profile.children.cycles-pp.select_idle_sibling
1.80 ? 2% -1.5 0.29 ?137% perf-profile.children.cycles-pp.available_idle_cpu
0.87 -0.6 0.22 ? 57% perf-profile.children.cycles-pp.finish_task_switch
0.98 -0.5 0.45 ? 23% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.94 -0.5 0.41 ? 27% perf-profile.children.cycles-pp.prepare_to_wait_event
0.60 ? 2% -0.5 0.13 ? 76% perf-profile.children.cycles-pp.switch_mm_irqs_off
1.04 ? 4% -0.4 0.63 ? 14% perf-profile.children.cycles-pp.stress_vm_child
0.58 -0.3 0.29 ? 28% perf-profile.children.cycles-pp.update_rq_clock
0.69 -0.3 0.40 ? 19% perf-profile.children.cycles-pp.prepare_task_switch
0.89 ? 3% -0.3 0.62 ? 13% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.58 -0.3 0.31 ? 21% perf-profile.children.cycles-pp.__switch_to_asm
0.52 -0.2 0.28 ? 21% perf-profile.children.cycles-pp.___perf_sw_event
0.76 ? 4% -0.2 0.56 ? 12% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.28 ? 3% -0.2 0.09 ? 52% perf-profile.children.cycles-pp._find_next_bit
0.50 -0.2 0.33 ? 13% perf-profile.children.cycles-pp.security_file_permission
0.26 ? 2% -0.2 0.09 ? 41% perf-profile.children.cycles-pp.task_tick_fair
0.24 ? 6% -0.2 0.08 ? 57% perf-profile.children.cycles-pp.__irq_exit_rcu
0.48 -0.2 0.32 ? 5% perf-profile.children.cycles-pp.set_next_entity
0.42 ? 4% -0.1 0.27 ? 13% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.30 ? 3% -0.1 0.16 ? 18% perf-profile.children.cycles-pp.scheduler_tick
0.37 ? 4% -0.1 0.24 ? 13% perf-profile.children.cycles-pp.tick_sched_timer
0.35 ? 4% -0.1 0.22 ? 14% perf-profile.children.cycles-pp.tick_sched_handle
0.34 ? 4% -0.1 0.22 ? 15% perf-profile.children.cycles-pp.update_process_times
0.25 ? 3% -0.1 0.12 ? 40% perf-profile.children.cycles-pp.find_vma
0.19 ? 7% -0.1 0.07 ? 56% perf-profile.children.cycles-pp.__softirqentry_text_start
0.68 -0.1 0.56 ? 5% perf-profile.children.cycles-pp.mutex_lock
0.26 ? 3% -0.1 0.14 ? 34% perf-profile.children.cycles-pp.find_extend_vma
0.38 ? 2% -0.1 0.26 ? 10% perf-profile.children.cycles-pp.apparmor_file_permission
0.37 -0.1 0.27 ? 9% perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
0.34 -0.1 0.25 ? 10% perf-profile.children.cycles-pp.perf_trace_sched_wakeup_template
0.24 ? 3% -0.1 0.16 ? 9% perf-profile.children.cycles-pp.sched_clock_cpu
0.17 ? 4% -0.1 0.09 ? 33% perf-profile.children.cycles-pp.vmacache_find
0.64 -0.1 0.56 ? 6% perf-profile.children.cycles-pp.switch_fpu_return
0.53 ? 4% -0.1 0.46 ? 8% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.53 ? 4% -0.1 0.46 ? 8% perf-profile.children.cycles-pp.hrtimer_interrupt
0.19 ? 2% -0.1 0.13 ? 9% perf-profile.children.cycles-pp.native_sched_clock
0.57 -0.1 0.50 ? 5% perf-profile.children.cycles-pp.__switch_to
0.27 -0.0 0.22 ? 2% perf-profile.children.cycles-pp.mutex_unlock
0.10 ? 5% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.anon_pipe_buf_release
0.13 ? 2% -0.0 0.10 ? 9% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
0.17 ? 2% -0.0 0.14 ? 8% perf-profile.children.cycles-pp.perf_tp_event
0.08 ? 4% -0.0 0.05 ? 8% perf-profile.children.cycles-pp.__list_add_valid
0.10 -0.0 0.09 perf-profile.children.cycles-pp.finish_wait
0.14 +0.0 0.16 ? 3% perf-profile.children.cycles-pp.atime_needs_update
0.12 ? 4% +0.0 0.15 ? 2% perf-profile.children.cycles-pp.file_update_time
0.06 ? 6% +0.0 0.09 ? 7% perf-profile.children.cycles-pp.__rdgsbase_inactive
0.10 ? 6% +0.0 0.13 ? 5% perf-profile.children.cycles-pp.__wrgsbase_inactive
0.75 +0.0 0.79 perf-profile.children.cycles-pp.pick_next_task_fair
0.05 +0.0 0.09 ? 5% perf-profile.children.cycles-pp.pick_next_entity
0.19 ? 2% +0.0 0.23 ? 5% perf-profile.children.cycles-pp.down_read_killable
0.02 ?141% +0.0 0.06 ? 11% perf-profile.children.cycles-pp.perf_trace_sched_switch
0.05 +0.0 0.10 ? 6% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.02 ? 99% +0.1 0.08 ? 6% perf-profile.children.cycles-pp.kmalloc_slab
0.14 ? 2% +0.1 0.20 perf-profile.children.cycles-pp.down_read
0.00 +0.1 0.06 ? 8% perf-profile.children.cycles-pp.resched_curr
0.02 ? 99% +0.1 0.08 ? 8% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
0.16 ? 3% +0.1 0.21 ? 2% perf-profile.children.cycles-pp.mmput
0.00 +0.1 0.06 ? 9% perf-profile.children.cycles-pp.ktime_get_coarse_real_ts64
0.13 ? 2% +0.1 0.19 ? 6% perf-profile.children.cycles-pp.get_task_mm
0.00 +0.1 0.07 ? 7% perf-profile.children.cycles-pp.idr_find
0.24 ? 2% +0.1 0.31 ? 4% perf-profile.children.cycles-pp.__update_load_avg_se
0.02 ?141% +0.1 0.09 ? 10% perf-profile.children.cycles-pp.__calc_delta
0.66 +0.1 0.73 ? 2% perf-profile.children.cycles-pp.update_curr
0.04 ? 44% +0.1 0.12 ? 9% perf-profile.children.cycles-pp.memcg_slab_free_hook
0.00 +0.1 0.07 ? 18% perf-profile.children.cycles-pp.cpumask_next_and
0.00 +0.1 0.08 ? 10% perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
0.08 ? 4% +0.1 0.16 ? 10% perf-profile.children.cycles-pp.up_read
0.07 ? 11% +0.1 0.15 ? 8% perf-profile.children.cycles-pp.clockevents_program_event
0.13 +0.1 0.22 ? 3% perf-profile.children.cycles-pp.ttwu_do_wakeup
0.06 ? 6% +0.1 0.15 ? 9% perf-profile.children.cycles-pp.current_time
0.00 +0.1 0.09 ? 10% perf-profile.children.cycles-pp.check_stack_object
0.11 ? 4% +0.1 0.20 ? 4% perf-profile.children.cycles-pp.check_preempt_curr
0.15 ? 4% +0.1 0.26 ? 6% perf-profile.children.cycles-pp.os_xsave
0.12 ? 4% +0.1 0.23 ? 9% perf-profile.children.cycles-pp.syscall_enter_from_user_mode
0.16 ? 3% +0.1 0.27 ? 9% perf-profile.children.cycles-pp.reweight_entity
0.61 ? 2% +0.1 0.75 ? 2% perf-profile.children.cycles-pp.find_get_task_by_vpid
0.19 ? 5% +0.1 0.33 ? 6% perf-profile.children.cycles-pp.__radix_tree_lookup
0.10 ? 3% +0.2 0.25 ? 10% perf-profile.children.cycles-pp.__check_object_size
0.13 ? 3% +0.2 0.29 ? 9% perf-profile.children.cycles-pp.syscall_return_via_sysret
0.11 ? 6% +0.2 0.26 ? 11% perf-profile.children.cycles-pp.follow_huge_addr
0.37 ? 2% +0.2 0.53 ? 6% perf-profile.children.cycles-pp.mm_access
0.00 +0.2 0.17 ? 9% perf-profile.children.cycles-pp.check_preempt_wakeup
0.00 +0.2 0.17 ? 14% perf-profile.children.cycles-pp.put_prev_entity
0.15 ? 3% +0.2 0.39 ? 12% perf-profile.children.cycles-pp.pud_huge
0.14 ? 3% +0.2 0.39 ? 11% perf-profile.children.cycles-pp.mark_page_accessed
0.19 ? 3% +0.3 0.44 ? 9% perf-profile.children.cycles-pp.kfree
0.19 ? 3% +0.3 0.50 ? 12% perf-profile.children.cycles-pp.pmd_huge
0.31 ? 3% +0.3 0.65 ? 9% perf-profile.children.cycles-pp.__kmalloc
0.28 ? 3% +0.4 0.67 ? 10% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.26 ? 3% +0.4 0.70 ? 12% perf-profile.children.cycles-pp.vm_normal_page
0.28 ? 2% +0.5 0.73 ? 11% perf-profile.children.cycles-pp.folio_mark_accessed
0.40 ? 2% +0.5 0.86 ? 10% perf-profile.children.cycles-pp.__entry_text_start
0.77 ? 4% +0.5 1.24 ? 10% perf-profile.children.cycles-pp.stress_vm_rw
2.54 +0.6 3.10 ? 4% perf-profile.children.cycles-pp._raw_spin_lock
0.44 ? 3% +0.6 1.02 ? 10% perf-profile.children.cycles-pp.__import_iovec
0.46 ? 3% +0.6 1.05 ? 10% perf-profile.children.cycles-pp.import_iovec
0.57 ? 2% +0.7 1.26 ? 9% perf-profile.children.cycles-pp._copy_from_user
0.47 ? 3% +0.7 1.18 ? 11% perf-profile.children.cycles-pp.rcu_all_qs
1.72 +0.9 2.58 ? 6% perf-profile.children.cycles-pp.__cond_resched
0.75 ? 2% +1.0 1.71 ? 9% perf-profile.children.cycles-pp.iovec_from_user
0.00 +1.1 1.11 ? 16% perf-profile.children.cycles-pp.exit_to_user_mode_loop
0.98 ? 2% +1.1 2.10 ? 10% perf-profile.children.cycles-pp.__might_sleep
0.70 +1.1 1.84 ? 9% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.76 ? 3% +1.2 1.95 ? 11% perf-profile.children.cycles-pp.follow_pud_mask
0.79 +1.3 2.06 ? 9% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.93 ? 2% +1.3 2.22 ? 10% perf-profile.children.cycles-pp.follow_page_mask
1.30 ? 2% +1.5 2.80 ? 9% perf-profile.children.cycles-pp.__might_fault
1.03 ? 2% +1.6 2.67 ? 11% perf-profile.children.cycles-pp.follow_pmd_mask
2.83 +1.7 4.55 ? 7% perf-profile.children.cycles-pp.__might_resched
1.37 +1.9 3.26 ? 14% perf-profile.children.cycles-pp.unpin_user_pages
1.30 ? 2% +2.1 3.38 ? 14% perf-profile.children.cycles-pp.unpin_user_pages_dirty_lock
3.29 ? 2% +2.6 5.87 ? 11% perf-profile.children.cycles-pp.try_grab_page
29.19 +2.6 31.78 perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
1.40 +3.0 4.42 ? 17% perf-profile.children.cycles-pp.mod_node_page_state
2.13 ? 2% +3.2 5.35 ? 14% perf-profile.children.cycles-pp.gup_put_folio
34.26 +6.2 40.45 ? 3% perf-profile.children.cycles-pp.__x64_sys_process_vm_readv
34.96 +6.7 41.64 ? 3% perf-profile.children.cycles-pp.process_vm_readv
8.05 +7.0 15.10 ? 8% perf-profile.children.cycles-pp.copyin
6.54 +7.1 13.60 ? 11% perf-profile.children.cycles-pp.follow_page_pte
10.06 +9.7 19.72 ? 8% perf-profile.children.cycles-pp.copy_page_from_iter
10.99 +13.2 24.22 ? 11% perf-profile.children.cycles-pp.__get_user_pages
11.04 +13.3 24.31 ? 11% perf-profile.children.cycles-pp.__get_user_pages_remote
77.15 +15.3 92.46 ? 2% perf-profile.children.cycles-pp.do_syscall_64
77.51 +15.4 92.94 ? 2% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
17.18 ? 2% +20.4 37.56 ? 10% perf-profile.children.cycles-pp.__x64_sys_process_vm_writev
17.69 ? 2% +20.9 38.64 ? 10% perf-profile.children.cycles-pp.process_vm_writev
48.65 +24.3 72.98 ? 6% perf-profile.children.cycles-pp.process_vm_rw_single_vec
50.39 +25.4 75.74 ? 6% perf-profile.children.cycles-pp.process_vm_rw_core
51.40 +26.5 77.94 ? 6% perf-profile.children.cycles-pp.process_vm_rw
7.23 -6.3 0.97 ?125% perf-profile.self.cycles-pp.update_cfs_group
6.42 -5.6 0.78 ?142% perf-profile.self.cycles-pp.mwait_idle_with_hints
4.82 -4.2 0.67 ? 85% perf-profile.self.cycles-pp.update_load_avg
1.78 ? 2% -1.5 0.29 ?137% perf-profile.self.cycles-pp.available_idle_cpu
0.96 -0.5 0.42 ? 26% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.98 ? 2% -0.5 0.45 ? 28% perf-profile.self.cycles-pp.__schedule
0.59 ? 2% -0.5 0.13 ? 78% perf-profile.self.cycles-pp.switch_mm_irqs_off
0.97 -0.4 0.61 ? 15% perf-profile.self.cycles-pp.stress_vm_child
0.45 ? 2% -0.3 0.14 ? 58% perf-profile.self.cycles-pp.update_rq_clock
0.58 -0.3 0.30 ? 22% perf-profile.self.cycles-pp.__switch_to_asm
0.58 -0.3 0.33 ? 17% perf-profile.self.cycles-pp.pipe_read
0.35 ? 3% -0.2 0.10 ? 59% perf-profile.self.cycles-pp.__wake_up_common
0.47 -0.2 0.24 ? 23% perf-profile.self.cycles-pp.___perf_sw_event
0.32 ? 2% -0.2 0.13 ? 39% perf-profile.self.cycles-pp.finish_task_switch
0.38 ? 2% -0.2 0.21 ? 20% perf-profile.self.cycles-pp.prepare_to_wait_event
0.25 ? 2% -0.2 0.08 ? 53% perf-profile.self.cycles-pp._find_next_bit
0.31 ? 3% -0.2 0.14 ? 27% perf-profile.self.cycles-pp.enqueue_entity
0.34 ? 15% -0.2 0.18 ? 12% perf-profile.self.cycles-pp.read
0.26 ? 2% -0.2 0.11 ? 20% perf-profile.self.cycles-pp.try_to_wake_up
0.29 ? 3% -0.2 0.13 ? 30% perf-profile.self.cycles-pp.prepare_task_switch
0.45 ? 2% -0.1 0.31 ? 9% perf-profile.self.cycles-pp.mutex_lock
0.26 -0.1 0.16 ? 14% perf-profile.self.cycles-pp.apparmor_file_permission
0.13 ? 2% -0.1 0.03 ?103% perf-profile.self.cycles-pp.perf_trace_sched_wakeup_template
0.35 -0.1 0.26 ? 11% perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
0.16 ? 3% -0.1 0.08 ? 22% perf-profile.self.cycles-pp.dequeue_entity
0.56 -0.1 0.48 ? 5% perf-profile.self.cycles-pp.__switch_to
0.15 ? 2% -0.1 0.08 ? 37% perf-profile.self.cycles-pp.vmacache_find
0.18 ? 2% -0.1 0.11 ? 25% perf-profile.self.cycles-pp.select_idle_sibling
0.19 ? 3% -0.1 0.13 ? 10% perf-profile.self.cycles-pp.native_sched_clock
0.13 ? 3% -0.1 0.07 ? 17% perf-profile.self.cycles-pp.security_file_permission
0.19 -0.1 0.13 ? 18% perf-profile.self.cycles-pp.enqueue_task_fair
0.35 ? 2% -0.1 0.30 ? 8% perf-profile.self.cycles-pp.update_curr
0.15 ? 2% -0.1 0.10 ? 12% perf-profile.self.cycles-pp.dequeue_task_fair
0.26 -0.0 0.21 ? 2% perf-profile.self.cycles-pp.mutex_unlock
0.38 -0.0 0.34 ? 3% perf-profile.self.cycles-pp.find_get_task_by_vpid
0.09 ? 4% -0.0 0.06 ? 13% perf-profile.self.cycles-pp.anon_pipe_buf_release
0.11 ? 4% -0.0 0.08 ? 10% perf-profile.self.cycles-pp.atime_needs_update
0.21 ? 2% -0.0 0.19 ? 3% perf-profile.self.cycles-pp.vfs_read
0.17 ? 2% -0.0 0.14 ? 3% perf-profile.self.cycles-pp.switch_fpu_return
0.11 ? 4% -0.0 0.09 ? 10% perf-profile.self.cycles-pp.aa_file_perm
0.08 ? 6% -0.0 0.06 ? 13% perf-profile.self.cycles-pp.select_task_rq
0.05 +0.0 0.06 perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
0.06 +0.0 0.07 perf-profile.self.cycles-pp.set_next_entity
0.07 ? 5% +0.0 0.09 perf-profile.self.cycles-pp.get_task_mm
0.06 ? 9% +0.0 0.08 perf-profile.self.cycles-pp.__get_user_pages_remote
0.06 ? 6% +0.0 0.09 ? 5% perf-profile.self.cycles-pp.__rdgsbase_inactive
0.10 ? 5% +0.0 0.13 ? 4% perf-profile.self.cycles-pp.__wrgsbase_inactive
0.11 +0.0 0.15 ? 2% perf-profile.self.cycles-pp.pick_next_task_fair
0.09 ? 9% +0.0 0.13 ? 8% perf-profile.self.cycles-pp.ktime_get
0.03 ? 70% +0.0 0.08 ? 6% perf-profile.self.cycles-pp.pick_next_entity
0.08 ? 6% +0.1 0.13 ? 7% perf-profile.self.cycles-pp.vfs_write
0.00 +0.1 0.06 ? 9% perf-profile.self.cycles-pp.resched_curr
0.01 ?223% +0.1 0.06 ? 11% perf-profile.self.cycles-pp.perf_trace_sched_switch
0.00 +0.1 0.06 ? 8% perf-profile.self.cycles-pp.put_prev_entity
0.00 +0.1 0.06 ? 8% perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
0.00 +0.1 0.06 ? 8% perf-profile.self.cycles-pp.idr_find
0.22 ? 2% +0.1 0.28 ? 4% perf-profile.self.cycles-pp.__update_load_avg_se
0.00 +0.1 0.06 ? 11% perf-profile.self.cycles-pp.ksys_write
0.00 +0.1 0.06 ? 14% perf-profile.self.cycles-pp.__wake_up_common_lock
0.00 +0.1 0.06 ? 7% perf-profile.self.cycles-pp.kmalloc_slab
0.10 ? 6% +0.1 0.16 ? 6% perf-profile.self.cycles-pp.write
0.00 +0.1 0.07 ? 15% perf-profile.self.cycles-pp.check_preempt_wakeup
0.08 ? 6% +0.1 0.15 ? 9% perf-profile.self.cycles-pp.up_read
0.01 ?223% +0.1 0.08 ? 12% perf-profile.self.cycles-pp.perf_trace_sched_stat_runtime
0.01 ?223% +0.1 0.08 ? 10% perf-profile.self.cycles-pp.__calc_delta
0.00 +0.1 0.08 ? 10% perf-profile.self.cycles-pp.check_stack_object
0.00 +0.1 0.08 ? 20% perf-profile.self.cycles-pp.exit_to_user_mode_loop
0.02 ?141% +0.1 0.09 ? 10% perf-profile.self.cycles-pp.new_sync_write
0.02 ?141% +0.1 0.09 ? 10% perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
0.06 ? 8% +0.1 0.14 ? 11% perf-profile.self.cycles-pp.follow_huge_addr
0.10 ? 4% +0.1 0.19 ? 10% perf-profile.self.cycles-pp.syscall_enter_from_user_mode
0.00 +0.1 0.09 ? 12% perf-profile.self.cycles-pp.current_time
0.06 +0.1 0.15 ? 11% perf-profile.self.cycles-pp.__import_iovec
0.05 ? 8% +0.1 0.14 ? 13% perf-profile.self.cycles-pp.__check_object_size
0.03 ? 70% +0.1 0.14 ? 15% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.19 ? 3% +0.1 0.30 ? 5% perf-profile.self.cycles-pp.process_vm_readv
0.15 ? 3% +0.1 0.25 ? 5% perf-profile.self.cycles-pp.os_xsave
0.00 +0.1 0.11 ? 9% perf-profile.self.cycles-pp.memcg_slab_free_hook
0.07 ? 5% +0.1 0.18 ? 10% perf-profile.self.cycles-pp._copy_from_user
0.15 ? 2% +0.1 0.26 ? 6% perf-profile.self.cycles-pp.pipe_write
0.05 +0.1 0.16 ? 12% perf-profile.self.cycles-pp.exit_to_user_mode_prepare
0.12 ? 3% +0.1 0.24 ? 8% perf-profile.self.cycles-pp.__entry_text_start
0.00 +0.1 0.12 ? 14% perf-profile.self.cycles-pp.schedule
0.14 ? 12% +0.1 0.26 ? 5% perf-profile.self.cycles-pp.process_vm_rw
0.38 +0.1 0.51 ? 6% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.19 ? 7% +0.1 0.32 ? 6% perf-profile.self.cycles-pp.__radix_tree_lookup
0.09 ? 4% +0.1 0.23 ? 10% perf-profile.self.cycles-pp.iovec_from_user
0.09 +0.1 0.24 ? 11% perf-profile.self.cycles-pp.mark_page_accessed
0.14 ? 3% +0.2 0.29 ? 9% perf-profile.self.cycles-pp.process_vm_rw_core
0.13 ? 2% +0.2 0.28 ? 9% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.13 ? 2% +0.2 0.29 ? 10% perf-profile.self.cycles-pp.process_vm_writev
0.10 ? 3% +0.2 0.26 ? 13% perf-profile.self.cycles-pp.pud_huge
0.09 ? 5% +0.2 0.26 ? 13% perf-profile.self.cycles-pp.pmd_huge
0.14 ? 3% +0.2 0.31 ? 11% perf-profile.self.cycles-pp.copyout
0.14 ? 4% +0.2 0.31 ? 10% perf-profile.self.cycles-pp.kfree
0.17 ? 3% +0.2 0.35 ? 9% perf-profile.self.cycles-pp.do_syscall_64
0.18 ? 3% +0.2 0.36 ? 9% perf-profile.self.cycles-pp.__kmalloc
0.18 ? 2% +0.3 0.45 ? 10% perf-profile.self.cycles-pp.copyin
0.24 ? 3% +0.3 0.50 ? 9% perf-profile.self.cycles-pp.__might_fault
0.11 ? 5% +0.3 0.41 ? 22% perf-profile.self.cycles-pp.ksys_read
0.30 ? 3% +0.3 0.65 ? 12% perf-profile.self.cycles-pp.unpin_user_pages
0.22 ? 3% +0.4 0.58 ? 12% perf-profile.self.cycles-pp.vm_normal_page
0.26 ? 4% +0.4 0.62 ? 11% perf-profile.self.cycles-pp.rcu_all_qs
0.28 ? 2% +0.4 0.65 ? 10% perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.23 ? 3% +0.4 0.61 ? 12% perf-profile.self.cycles-pp.folio_mark_accessed
0.28 ? 4% +0.5 0.73 ? 13% perf-profile.self.cycles-pp.unpin_user_pages_dirty_lock
0.67 ? 4% +0.5 1.15 ? 12% perf-profile.self.cycles-pp.stress_vm_rw
0.56 ? 2% +0.5 1.06 ? 8% perf-profile.self.cycles-pp.process_vm_rw_single_vec
0.91 +0.6 1.55 ? 8% perf-profile.self.cycles-pp.__cond_resched
0.81 ? 2% +0.9 1.72 ? 10% perf-profile.self.cycles-pp.__might_sleep
0.61 ? 3% +0.9 1.55 ? 11% perf-profile.self.cycles-pp.follow_pud_mask
1.94 +0.9 2.89 ? 7% perf-profile.self.cycles-pp._raw_spin_lock
1.60 +1.0 2.57 ? 7% perf-profile.self.cycles-pp.copy_page_to_iter
2.56 ? 2% +1.0 3.61 ? 7% perf-profile.self.cycles-pp.try_grab_page
0.88 ? 2% +1.1 1.98 ? 10% perf-profile.self.cycles-pp.copy_page_from_iter
0.82 ? 3% +1.1 1.95 ? 11% perf-profile.self.cycles-pp.follow_page_mask
2.60 +1.4 3.98 ? 6% perf-profile.self.cycles-pp.__might_resched
0.88 ? 2% +1.4 2.27 ? 11% perf-profile.self.cycles-pp.follow_pmd_mask
0.88 ? 2% +1.4 2.28 ? 11% perf-profile.self.cycles-pp.__get_user_pages
1.43 ? 2% +1.7 3.15 ? 12% perf-profile.self.cycles-pp.gup_put_folio
1.50 ? 3% +2.3 3.79 ? 11% perf-profile.self.cycles-pp.follow_page_pte
28.89 +2.5 31.42 perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
1.26 +2.8 4.04 ? 18% perf-profile.self.cycles-pp.mod_node_page_state
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://01.org/lkp
On 7/9/22 4:55 PM, Chen Yu Wrote:
> On Thu, Jun 30, 2022 at 06:46:08PM +0800, Abel Wu wrote:
>>
>> On 6/30/22 12:16 PM, Chen Yu Wrote:
>>> On Tue, Jun 28, 2022 at 03:58:55PM +0800, Abel Wu wrote:
>>>>
>>>> On 6/27/22 6:13 PM, Abel Wu Wrote:
>>>> There seems like not much difference except hackbench pipe test at
>>>> certain groups (30~110).
>>> OK, smaller LLC domain seems to not have much difference, which might
>>> suggest that by leveraging load balance code path, the read/write
>>> to LLC shared mask might not be the bottleneck. I have an vague
>>> impression that during Aubrey's cpumask searching for idle CPUs
>>> work[1], there is concern that updating the shared mask in large LLC
>>> has introduced cache contention and performance degrading. Maybe we
>>> can find that regressed test case to verify.
>>> [1] https://lore.kernel.org/all/[email protected]/
>>
>> I just went through Aubrey's v1-v11 patches and didn't find any
>> particular tests other than hackbench/tbench/uperf. Please let
>> me know if I missed something, thanks!
>>
> I haven't found any testcase that could trigger the cache contention
> issue. I thought we could stick with these testcases for now, especially
> for tbench, it has detected a cache issue described in
> https://lore.kernel.org/lkml/[email protected]
> if I understand correctly.
I agree.
>>>> I am intended to provide better scalability
>>>> by applying the filter which will be enabled when:
>>>>
>>>> - The LLC is large enough that simply traversing becomes
>>>> in-sufficient, and/or
>>>>
>>>> - The LLC is loaded that unoccupied cpus are minority.
>>>>
>>>> But it would be very nice if a more fine grained pattern works well
>>>> so we can drop the above constrains.
>>>>
>>> We can first try to push a simple version, and later optimize it.
>>> One concern about v4 is that, we changed the logic in v3, which recorded
>>> the overloaded CPU, while v4 tracks unoccupied CPUs. An overloaded CPU is
>>> more "stable" because there are more than 1 running tasks on that runqueue.
>>> It is more likely to remain "occupied" for a while. That is to say,
>>> nr_task = 1, 2, 3... will all be regarded as occupied, while only nr_task = 0
>>> is unoccupied. The former would bring less false negative/positive.
>>
>> Yes, I like the 'overloaded mask' too, but the downside is extra
>> cpumask ops needed in the SIS path (the added cpumask_andnot).
>> Besides, in this patch, the 'overloaded mask' is also unstable due
>> to the state is maintained at core level rather than per-cpu, some
>> more thoughts are in cover letter.
>>
> I see.
>>>
>>> By far I have tested hackbench/schbench/netperf on top of Peter's sched/core branch,
>>> with SIS_UTIL enabled. Overall it looks good, and netperf has especially
>>> significant improvement when the load approaches overloaded(which is aligned
>>> with your comment above). I'll re-run the netperf for several cycles to check the
>>> standard deviation. And I'm also curious about v3's performance because it
>>> tracks overloaded CPUs, so I'll also test on v3 with small modifications.
>>
>> Thanks very much for your reviewing and testing.
>>
> I modified your v3 patch a little bit, and the test result shows good improvement
> on netperf and no significant regression on schbench/tbench/hackbench on this draft
I don't know why there is such a big improvement in netperf TCP_RR
at 168 threads while the results under the other configs are flat.
> patch. I would like to vote for your v3 version as it seems to be more straightforward,
> what do you think of the following change:
>
> From 277b60b7cd055d5be93188a552da50fdfe53214c Mon Sep 17 00:00:00 2001
> From: Abel Wu <[email protected]>
> Date: Fri, 8 Jul 2022 02:16:47 +0800
> Subject: [PATCH] sched/fair: Introduce SIS_FILTER to skip overloaded CPUs
> during SIS
>
> Currently SIS_UTIL is used to limit the scan depth of idle CPUs in
> select_idle_cpu(). There could be another optimization to filter
> the overloaded CPUs so as to further speed up select_idle_cpu().
> Launch the CPU overload check in periodic tick, and take consideration
> of nr_running, avg_util and runnable_avg of that CPU. If the CPU is
> overloaded, add it into per LLC overload cpumask, so select_idle_cpu()
> could skip those overloaded CPUs. Although this detection is in periodic
> tick, checking the pelt signal of the CPU would make the 'overloaded' state
> more stable and reduce the frequency to update the LLC shared mask,
> so as to mitigate the cache contention in the LLC.
>
> The following results are tested on top of latest sched/core tip.
> The baseline is with SIS_UTIL enabled, and compared it with both SIS_FILTER
> /SIS_UTIL enabled. Positive %compare stands for better performance.
Can you share the cpu topology please?
>
> hackbench
> =========
> case load baseline(std%) compare%( std%)
> process-pipe 1 group 1.00 ( 0.59) -1.35 ( 0.88)
> process-pipe 2 groups 1.00 ( 0.38) -1.49 ( 0.04)
> process-pipe 4 groups 1.00 ( 0.45) +0.10 ( 0.91)
> process-pipe 8 groups 1.00 ( 0.11) +0.03 ( 0.38)
> process-sockets 1 group 1.00 ( 3.48) +2.88 ( 7.07)
> process-sockets 2 groups 1.00 ( 2.38) -3.78 ( 2.81)
> process-sockets 4 groups 1.00 ( 0.26) -1.79 ( 0.82)
> process-sockets 8 groups 1.00 ( 0.07) -0.35 ( 0.07)
> threads-pipe 1 group 1.00 ( 0.87) -0.21 ( 0.71)
> threads-pipe 2 groups 1.00 ( 0.63) +0.34 ( 0.45)
> threads-pipe 4 groups 1.00 ( 0.18) -0.02 ( 0.50)
> threads-pipe 8 groups 1.00 ( 0.08) +0.46 ( 0.05)
> threads-sockets 1 group 1.00 ( 0.80) -0.08 ( 1.06)
> threads-sockets 2 groups 1.00 ( 0.55) +0.06 ( 0.85)
> threads-sockets 4 groups 1.00 ( 1.00) -2.13 ( 0.18)
> threads-sockets 8 groups 1.00 ( 0.07) -0.41 ( 0.08)
>
> netperf
> =======
> case load baseline(std%) compare%( std%)
> TCP_RR 28 threads 1.00 ( 0.50) +0.19 ( 0.53)
> TCP_RR 56 threads 1.00 ( 0.33) +0.31 ( 0.35)
> TCP_RR 84 threads 1.00 ( 0.23) +0.15 ( 0.28)
> TCP_RR 112 threads 1.00 ( 0.20) +0.03 ( 0.21)
> TCP_RR 140 threads 1.00 ( 0.17) +0.20 ( 0.18)
> TCP_RR 168 threads 1.00 ( 0.17) +112.84 ( 40.35)
> TCP_RR 196 threads 1.00 ( 16.66) +0.39 ( 15.72)
> TCP_RR 224 threads 1.00 ( 10.28) +0.05 ( 9.97)
> UDP_RR 28 threads 1.00 ( 16.15) -0.13 ( 0.93)
> UDP_RR 56 threads 1.00 ( 7.76) +1.24 ( 0.44)
> UDP_RR 84 threads 1.00 ( 11.68) -0.49 ( 6.33)
> UDP_RR 112 threads 1.00 ( 8.49) -0.21 ( 7.77)
> UDP_RR 140 threads 1.00 ( 8.49) +2.05 ( 19.88)
> UDP_RR 168 threads 1.00 ( 8.91) +1.67 ( 11.74)
> UDP_RR 196 threads 1.00 ( 19.96) +4.35 ( 21.37)
> UDP_RR 224 threads 1.00 ( 19.44) +4.38 ( 16.61)
>
> tbench
> ======
> case load baseline(std%) compare%( std%)
> loopback 28 threads 1.00 ( 0.12) +0.57 ( 0.12)
> loopback 56 threads 1.00 ( 0.11) +0.42 ( 0.11)
> loopback 84 threads 1.00 ( 0.09) +0.71 ( 0.03)
> loopback 112 threads 1.00 ( 0.03) -0.13 ( 0.08)
> loopback 140 threads 1.00 ( 0.29) +0.59 ( 0.01)
> loopback 168 threads 1.00 ( 0.01) +0.86 ( 0.03)
> loopback 196 threads 1.00 ( 0.02) +0.97 ( 0.21)
> loopback 224 threads 1.00 ( 0.04) +0.83 ( 0.22)
>
> schbench
> ========
> case load baseline(std%) compare%( std%)
> normal 1 mthread 1.00 ( 0.00) -8.82 ( 0.00)
> normal 2 mthreads 1.00 ( 0.00) +0.00 ( 0.00)
> normal 4 mthreads 1.00 ( 0.00) +17.02 ( 0.00)
> normal 8 mthreads 1.00 ( 0.00) -4.84 ( 0.00)
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> include/linux/sched/topology.h | 6 +++++
> kernel/sched/core.c | 1 +
> kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> kernel/sched/sched.h | 2 ++
> kernel/sched/topology.c | 3 ++-
> 6 files changed, 59 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 816df6cc444e..c03076850a67 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -82,8 +82,14 @@ struct sched_domain_shared {
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> int nr_idle_scan;
> + unsigned long overloaded_cpus[];
> };
>
> +static inline struct cpumask *sdo_mask(struct sched_domain_shared *sds)
> +{
> + return to_cpumask(sds->overloaded_cpus);
> +}
> +
> struct sched_domain {
> /* These fields must be setup */
> struct sched_domain __rcu *parent; /* top domain must be null terminated */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d3e2c5a7c1b7..452eb63ee6f6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5395,6 +5395,7 @@ void scheduler_tick(void)
> resched_latency = cpu_resched_latency(rq);
> calc_global_load_tick(rq);
> sched_core_tick(rq);
> + update_overloaded_rq(rq);
I didn't see this update in the idle path. Is this intended? A rough
sketch of what an idle-path hook could look like follows this hunk.
>
> rq_unlock(rq, &rf);
>
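Just to illustrate (untested, and the function name and call site are
made up here, not something from the posted series), an idle-path
update reusing the helpers from this patch could look roughly like:

	/* e.g. called when a cpu enters the idle loop */
	static inline void sis_filter_enter_idle(struct rq *rq)
	{
		struct sched_domain_shared *sds;
		int cpu = cpu_of(rq);

		if (!sched_feat(SIS_FILTER))
			return;

		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
		if (!sds)
			return;

		/* a cpu entering idle is certainly not overloaded */
		if (cpumask_test_cpu(cpu, sdo_mask(sds)))
			cpumask_clear_cpu(cpu, sdo_mask(sds));
	}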
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f80ae86bb404..34b1650f85f6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6323,6 +6323,50 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>
> #endif /* CONFIG_SCHED_SMT */
>
> +/* derived from group_is_overloaded() */
> +static inline bool rq_overloaded(struct rq *rq, int cpu, unsigned int imbalance_pct)
> +{
> + if (rq->nr_running - rq->cfs.idle_h_nr_running <= 1)
> + return false;
> +
> + if ((SCHED_CAPACITY_SCALE * 100) <
> + (cpu_util_cfs(cpu) * imbalance_pct))
> + return true;
> +
> + if ((SCHED_CAPACITY_SCALE * imbalance_pct) <
> + (cpu_runnable(rq) * 100))
> + return true;
So the filter now contains the cpus that are over-utilized or
overloaded. This goes a step further in making the filter reliable,
but at the cost of scan efficiency.
The idea behind my recent patches is to keep the filter radical,
but use it conservatively; a rough sketch of the contrast follows
the function below.
> +
> + return false;
> +}
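Illustrative only (rq_busy() is just a name picked for the example):

	/*
	 * This draft: a cpu is admitted to the filter only when it runs
	 * more than one non-idle task *and* its PELT signals say it is
	 * over-utilized/over-runnable, so the filter stays small but
	 * very stable.
	 *
	 * The "radical" variant: admit any cpu that merely runs more
	 * than one non-idle task, and restrict *when* the filter is
	 * applied instead (see the other sketch further below).
	 */
	static inline bool rq_busy(struct rq *rq)
	{
		return rq->nr_running - rq->cfs.idle_h_nr_running > 1;
	}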
> +
> +void update_overloaded_rq(struct rq *rq)
> +{
> + struct sched_domain_shared *sds;
> + struct sched_domain *sd;
> + int cpu;
> +
> + if (!sched_feat(SIS_FILTER))
> + return;
> +
> + cpu = cpu_of(rq);
> + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> + if (unlikely(!sd))
> + return;
> +
> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> + if (unlikely(!sds))
> + return;
> +
> + if (rq_overloaded(rq, cpu, sd->imbalance_pct)) {
I'm not sure whether it is appropriate to use the LLC imbalance_pct
here, because we are comparing within the LLC rather than between
LLCs.
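For reference, assuming the common default of imbalance_pct = 117 for
the LLC domain, the two checks in rq_overloaded() above translate to
roughly:

	cpu_util_cfs(cpu) > 1024 * 100 / 117 ~= 875    (~85% of capacity)
	cpu_runnable(rq)  > 1024 * 117 / 100 ~= 1198

so a cpu is only treated as overloaded once it runs more than one
non-idle task and is beyond roughly 85% utilized (or its runnable sum
exceeds its capacity by ~17%).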
> + /* avoid duplicated write, mitigate cache contention */
> + if (!cpumask_test_cpu(cpu, sdo_mask(sds)))
> + cpumask_set_cpu(cpu, sdo_mask(sds));
> + } else {
> + if (cpumask_test_cpu(cpu, sdo_mask(sds)))
> + cpumask_clear_cpu(cpu, sdo_mask(sds));
> + }
> +}
> /*
> * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> @@ -6383,6 +6427,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> }
> }
>
> + if (sched_feat(SIS_FILTER) && !has_idle_core && sd->shared)
> + cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));
> +
> for_each_cpu_wrap(cpu, cpus, target + 1) {
> if (has_idle_core) {
> i = select_idle_core(p, cpu, cpus, &idle_cpu);
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..1bebdb87c2f4 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_FILTER, true)
The filter should only kick in when there is a need. If the system
is idle enough, I don't think it's a good idea to clear the
overloaded cpus out of the domain scan, and making the filter a
sched-feat doesn't help with that since it is a static switch.
My latest patch only applies the filter when nr is less than the
LLC size. It doesn't work perfectly yet, but it is still better
than doing nothing as in my v4 patchset; a rough sketch of the
gating is below.
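Something along these lines in select_idle_cpu(), where 'nr' is the
scan depth already computed from SIS_UTIL/SIS_PROP (just a sketch,
not the exact code in my tree):

	/* only bother with the filter when the scan depth cannot
	 * cover the whole LLC anyway */
	if (sched_feat(SIS_FILTER) && !has_idle_core && sd->shared &&
	    nr < per_cpu(sd_llc_size, target))
		cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));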
I will give this patch a test on my machine in a few days.
Thanks & BR,
Abel
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 02c970501295..316127ab1ec7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1812,6 +1812,8 @@ static inline struct cpumask *group_balance_mask(struct sched_group *sg)
>
> extern int group_balance_cpu(struct sched_group *sg);
>
> +void update_overloaded_rq(struct rq *rq);
> +
> #ifdef CONFIG_SCHED_DEBUG
> void update_sched_domain_debugfs(void);
> void dirty_sched_domain_sysctl(int cpu);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..0d149e76a3b3 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1641,6 +1641,7 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> atomic_inc(&sd->shared->ref);
> atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> + cpumask_clear(sdo_mask(sd->shared));
> }
>
> sd->private = sdd;
> @@ -2106,7 +2107,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>
> *per_cpu_ptr(sdd->sd, j) = sd;
>
> - sds = kzalloc_node(sizeof(struct sched_domain_shared),
> + sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> GFP_KERNEL, cpu_to_node(j));
> if (!sds)
> return -ENOMEM;
Hi Robot, thanks for your testing!
On 7/9/22 10:42 PM, kernel test robot Wrote:
>
>
> Greeting,
>
> FYI, we noticed a -11.7% regression of phoronix-test-suite.fio.SequentialWrite.IO_uring.Yes.No.4KB.DefaultTestDirectory.mb_s due to commit:
>
>
> commit: 32fe13cd7aa184ed349d698ebf6f420fa426dd73 ("[PATCH v4 7/7] sched/fair: de-entropy for SIS filter")
> url: https://github.com/intel-lab-lkp/linux/commits/Abel-Wu/sched-fair-improve-scan-efficiency-of-SIS/20220619-200743
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git f3dd3f674555bd9455c5ae7fafce0696bd9931b3
> patch link: https://lore.kernel.org/lkml/[email protected]
>
> in testcase: phoronix-test-suite
> on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 512G memory
Is SNC enabled?
> with following parameters:
>
> test: fio-1.14.1
> option_a: Sequential Write
> option_b: IO_uring
> option_c: Yes
> option_d: No
> option_e: 4KB
> option_f: Default Test Directory
> cpufreq_governor: performance
> ucode: 0x500320a
>
> test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
> test-url: http://www.phoronix-test-suite.com/
>
> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | stress-ng: stress-ng.vm-rw.ops_per_sec 113.5% improvement |
> | test machine | 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory |
> | test parameters | class=memory |
> | | cpufreq_governor=performance |
> | | nr_threads=100% |
> | | test=vm-rw |
> | | testtime=60s |
> | | ucode=0xd000331 |
> +------------------+-------------------------------------------------------------------------------------+
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <[email protected]>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> sudo bin/lkp install job.yaml # job file is attached in this email
> bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
> sudo bin/lkp run generated-yaml-file
>
> # if come across any failure that blocks the test,
> # please remove ~/.lkp and /lkp dir to run from a clean state.
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/option_a/option_b/option_c/option_d/option_e/option_f/rootfs/tbox_group/test/testcase/ucode:
> gcc-11/performance/x86_64-rhel-8.3/Sequential Write/IO_uring/Yes/No/4KB/Default Test Directory/debian-x86_64-phoronix/lkp-csl-2sp7/fio-1.14.1/phoronix-test-suite/0x500320a
>
> commit:
> fcc108377a ("sched/fair: skip busy cores in SIS search")
> 32fe13cd7a ("sched/fair: de-entropy for SIS filter")
Is the 5th patch applied? It's also important to bail out early if
the system is busy enough that idle cpus can hardly exist; a rough
sketch of what I mean is below.
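A rough sketch against my v4 series, where the filter tracks the
unoccupied cpus (sdi_mask() is a name made up here just for the
example): if the LLC has no unoccupied cpu left that the task can
run on, skip the domain scan in select_idle_cpu() entirely.

	if (sched_feat(SIS_FILTER) && sd->shared &&
	    !cpumask_intersects(sdi_mask(sd->shared), p->cpus_ptr))
		return -1;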
>
> fcc108377a7cf79c 32fe13cd7aa184ed349d698ebf6
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 166666 -11.6% 147277 ± 3% phoronix-test-suite.fio.SequentialWrite.IO_uring.Yes.No.4KB.DefaultTestDirectory.iops
> 651.00 -11.7% 574.83 ± 3% phoronix-test-suite.fio.SequentialWrite.IO_uring.Yes.No.4KB.DefaultTestDirectory.mb_s
> 3145 ± 5% -18.4% 2565 ± 12% meminfo.Writeback
> 0.19 ± 4% -0.0 0.17 ± 2% mpstat.cpu.all.iowait%
> 2228 ± 33% -37.5% 1392 ± 21% numa-meminfo.node0.Writeback
> 553.33 ± 37% -35.9% 354.83 ± 18% numa-vmstat.node0.nr_writeback
I will try to reproduce the test to see why there is such a big change.
> 445604 ± 4% -12.5% 390116 ± 4% vmstat.io.bo
> 14697101 ± 3% -11.0% 13074497 ± 4% perf-stat.i.cache-misses
> 9447 ± 8% -37.6% 5890 ± 5% perf-stat.i.cpu-migrations
> 5125 ± 6% +12.9% 5786 ± 6% perf-stat.i.instructions-per-iTLB-miss
> 2330431 ± 4% -11.4% 2064845 ± 4% perf-stat.i.node-loads
> 2.55 ±104% -1.6 0.96 ± 14% perf-profile.calltrace.cycles-pp.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
> 2.62 ±102% -1.6 0.99 ± 14% perf-profile.children.cycles-pp.poll_idle
> 0.82 ± 23% -0.3 0.53 ± 23% perf-profile.children.cycles-pp.asm_sysvec_call_function_single
> 0.74 ± 23% -0.3 0.46 ± 23% perf-profile.children.cycles-pp.sysvec_call_function_single
> 0.69 ± 24% -0.3 0.44 ± 24% perf-profile.children.cycles-pp.__sysvec_call_function_single
> 0.38 ± 10% -0.1 0.28 ± 18% perf-profile.children.cycles-pp.__perf_event_header__init_id
> 0.16 ± 13% -0.0 0.11 ± 22% perf-profile.children.cycles-pp.__task_pid_nr_ns
> 2.10 ±108% -1.3 0.79 ± 11% perf-profile.self.cycles-pp.poll_idle
> 0.16 ± 13% -0.0 0.11 ± 22% perf-profile.self.cycles-pp.__task_pid_nr_ns
>
>
> ***************************************************************************************************
> lkp-icl-2sp6: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> =========================================================================================
> class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime/ucode:
> memory/gcc-11/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp6/vm-rw/stress-ng/60s/0xd000331
>
> commit:
> fcc108377a ("sched/fair: skip busy cores in SIS search")
> 32fe13cd7a ("sched/fair: de-entropy for SIS filter")
>
> fcc108377a7cf79c 32fe13cd7aa184ed349d698ebf6
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 7328835 ± 17% +3441.0% 2.595e+08 ± 12% stress-ng.time.involuntary_context_switches
It's really horrible...
> 123165 ± 3% -14.1% 105742 ± 2% stress-ng.time.minor_page_faults
> 8940 +32.8% 11872 ± 2% stress-ng.time.percent_of_cpu_this_job_got
> 5268 +33.4% 7027 ± 2% stress-ng.time.system_time
> 278.70 +21.5% 338.70 ± 2% stress-ng.time.user_time
> 2.554e+08 +13.3% 2.894e+08 stress-ng.time.voluntary_context_switches
> 1.283e+08 +113.5% 2.74e+08 ± 6% stress-ng.vm-rw.ops
> 2139049 +113.5% 4567054 ± 6% stress-ng.vm-rw.ops_per_sec
> 39411 ± 34% +56.3% 61612 ± 24% numa-meminfo.node1.Mapped
> 5013 -22.5% 3883 ± 4% uptime.idle
> 1.798e+09 -60.3% 7.135e+08 ± 21% cpuidle..time
> 1.701e+08 -87.3% 21598951 ± 90% cpuidle..usage
> 75821 ± 2% -11.6% 67063 ± 5% meminfo.Active
> 75821 ± 2% -11.6% 67063 ± 5% meminfo.Active(anon)
> 81710 ± 2% +20.1% 98158 ± 3% meminfo.Mapped
> 26.00 -59.6% 10.50 ± 18% vmstat.cpu.id
> 112.00 +10.9% 124.17 vmstat.procs.r
> 6561639 +31.6% 8634043 ± 2% vmstat.system.cs
> 990604 -62.4% 372118 ± 18% vmstat.system.in
> 24.13 -16.1 8.03 ± 23% mpstat.cpu.all.idle%
This indicates that the SIS scan efficiency is largely improved, which
is in line with our expectations.
> 2.71 -1.6 1.11 ± 10% mpstat.cpu.all.irq%
> 0.17 ± 6% -0.1 0.06 ± 30% mpstat.cpu.all.soft%
> 69.33 +17.4 86.71 ± 2% mpstat.cpu.all.sys%
> 3.66 +0.4 4.09 mpstat.cpu.all.usr%
> 2.024e+09 +93.3% 3.912e+09 ± 16% numa-vmstat.node0.nr_foll_pin_acquired
> 2.024e+09 +93.3% 3.912e+09 ± 16% numa-vmstat.node0.nr_foll_pin_released
> 2.043e+09 ± 2% +119.0% 4.473e+09 numa-vmstat.node1.nr_foll_pin_acquired
> 2.043e+09 ± 2% +119.0% 4.473e+09 numa-vmstat.node1.nr_foll_pin_released
> 9865 ± 34% +54.1% 15201 ± 23% numa-vmstat.node1.nr_mapped
> 18954 ± 2% -11.5% 16767 ± 5% proc-vmstat.nr_active_anon
> 4.062e+09 +107.3% 8.419e+09 ± 7% proc-vmstat.nr_foll_pin_acquired
> 4.062e+09 +107.3% 8.419e+09 ± 7% proc-vmstat.nr_foll_pin_released
> 87380 +5.3% 92039 proc-vmstat.nr_inactive_anon
> 24453 -3.2% 23658 proc-vmstat.nr_kernel_stack
> 20437 ± 2% +19.6% 24443 ± 3% proc-vmstat.nr_mapped
> 18954 ± 2% -11.5% 16767 ± 5% proc-vmstat.nr_zone_active_anon
> 87380 +5.3% 92039 proc-vmstat.nr_zone_inactive_anon
> 108777 ± 4% -17.2% 90014 proc-vmstat.numa_hint_faults
> 96756 ± 6% -17.6% 79691 ± 2% proc-vmstat.numa_hint_faults_local
> 490607 -4.4% 469155 proc-vmstat.pgfault
> 80.85 +10.9 91.75 turbostat.Busy%
> 3221 -5.0% 3060 turbostat.Bzy_MHz
> 77259218 ± 3% -87.0% 10057388 ± 92% turbostat.C1
> 6.74 ± 2% -5.9 0.85 ± 90% turbostat.C1%
> 92212921 -87.8% 11243535 ± 91% turbostat.C1E
> 12.00 ± 22% -6.6 5.42 ± 57% turbostat.C1E%
and this.
> 16.39 ± 16% -62.0% 6.24 ± 55% turbostat.CPU%c1
> 0.16 ± 3% +74.7% 0.29 ± 6% turbostat.IPC
> 65322725 -62.5% 24502370 ± 18% turbostat.IRQ
> 339708 -86.5% 45941 ± 88% turbostat.POLL
> 0.05 -0.0 0.01 ± 82% turbostat.POLL%
> 165121 ± 23% -100.0% 39.19 ±101% sched_debug.cfs_rq:/.MIN_vruntime.avg
> 2462709 -99.9% 3407 ±102% sched_debug.cfs_rq:/.MIN_vruntime.max
> 607348 ± 11% -99.9% 348.57 ±100% sched_debug.cfs_rq:/.MIN_vruntime.stddev
> 0.56 ± 4% +11.8% 0.62 ± 3% sched_debug.cfs_rq:/.h_nr_running.avg
> 2.58 ± 13% -38.7% 1.58 ± 11% sched_debug.cfs_rq:/.h_nr_running.max
> 0.54 ± 9% -39.7% 0.33 ± 6% sched_debug.cfs_rq:/.h_nr_running.stddev
> 165121 ± 23% -100.0% 39.19 ±101% sched_debug.cfs_rq:/.max_vruntime.avg
> 2462709 -99.9% 3407 ±102% sched_debug.cfs_rq:/.max_vruntime.max
> 607348 ± 11% -99.9% 348.57 ±100% sched_debug.cfs_rq:/.max_vruntime.stddev
> 2439879 +43.2% 3493834 ± 4% sched_debug.cfs_rq:/.min_vruntime.avg
> 2485561 +49.1% 3705888 sched_debug.cfs_rq:/.min_vruntime.max
> 2129935 +34.5% 2865147 ± 2% sched_debug.cfs_rq:/.min_vruntime.min
> 35480 ± 17% +324.2% 150497 ± 59% sched_debug.cfs_rq:/.min_vruntime.stddev
> 0.43 ± 3% +27.9% 0.55 sched_debug.cfs_rq:/.nr_running.avg
> 0.35 ± 5% -57.2% 0.15 ± 4% sched_debug.cfs_rq:/.nr_running.stddev
> 2186 ± 15% -27.9% 1575 ± 11% sched_debug.cfs_rq:/.runnable_avg.max
> 152.08 ± 6% +134.5% 356.58 ± 31% sched_debug.cfs_rq:/.runnable_avg.min
> 399.32 ± 4% -50.5% 197.69 ± 8% sched_debug.cfs_rq:/.runnable_avg.stddev
> 25106 ± 50% +1121.1% 306577 ± 66% sched_debug.cfs_rq:/.spread0.max
> 35510 ± 17% +323.3% 150305 ± 59% sched_debug.cfs_rq:/.spread0.stddev
> 545.95 ± 3% +16.4% 635.59 sched_debug.cfs_rq:/.util_avg.avg
> 1726 ± 15% -26.7% 1266 ± 14% sched_debug.cfs_rq:/.util_avg.max
> 154.67 ± 2% +112.9% 329.33 ± 30% sched_debug.cfs_rq:/.util_avg.min
> 317.35 ± 4% -43.1% 180.53 ± 10% sched_debug.cfs_rq:/.util_avg.stddev
> 192.70 ± 6% +104.5% 393.98 ± 7% sched_debug.cfs_rq:/.util_est_enqueued.avg
> 5359 ± 4% -26.1% 3958 ± 8% sched_debug.cpu.avg_idle.min
> 4.69 ± 7% +136.0% 11.07 ± 5% sched_debug.cpu.clock.stddev
> 2380 ± 4% +31.0% 3117 sched_debug.cpu.curr->pid.avg
> 1818 ± 3% -65.9% 620.26 ± 8% sched_debug.cpu.curr->pid.stddev
> 0.00 ± 8% +59.7% 0.00 ± 10% sched_debug.cpu.next_balance.stddev
> 2.58 ± 17% -41.9% 1.50 sched_debug.cpu.nr_running.max
> 0.52 ± 9% -43.2% 0.29 ± 5% sched_debug.cpu.nr_running.stddev
> 1610935 +31.3% 2115112 ± 2% sched_debug.cpu.nr_switches.avg
> 1661619 +34.5% 2234069 sched_debug.cpu.nr_switches.max
> 1415677 ± 3% +20.3% 1702445 sched_debug.cpu.nr_switches.min
> 30576 ± 26% +151.6% 76923 ± 37% sched_debug.cpu.nr_switches.stddev
> 25.47 -91.3% 2.21 ± 69% perf-stat.i.MPKI
> 3.342e+10 +84.7% 6.172e+10 ± 5% perf-stat.i.branch-instructions
> 0.58 -0.3 0.33 ± 5% perf-stat.i.branch-miss-rate%
> 1.667e+08 -13.2% 1.448e+08 ± 2% perf-stat.i.branch-misses
> 0.63 ± 17% +4.8 5.42 ± 39% perf-stat.i.cache-miss-rate%
> 18939524 ± 4% -46.6% 10109353 ± 18% perf-stat.i.cache-misses
> 4.422e+09 -87.1% 5.724e+08 ± 77% perf-stat.i.cache-references
> 6897069 +30.8% 9023752 ± 2% perf-stat.i.context-switches
> 2.04 -43.3% 1.16 ± 5% perf-stat.i.cpi
> 3.523e+11 +3.8% 3.656e+11 perf-stat.i.cpu-cycles
> 2322589 -86.6% 310934 ± 93% perf-stat.i.cpu-migrations
> 18560 ± 4% +113.2% 39578 ± 15% perf-stat.i.cycles-between-cache-misses
> 0.20 -0.2 0.02 ± 70% perf-stat.i.dTLB-load-miss-rate%
> 85472762 -87.2% 10962661 ± 82% perf-stat.i.dTLB-load-misses
> 4.266e+10 +83.8% 7.841e+10 ± 5% perf-stat.i.dTLB-loads
> 0.10 ± 4% -0.1 0.01 ± 72% perf-stat.i.dTLB-store-miss-rate%
> 25396322 ± 4% -86.5% 3437369 ± 90% perf-stat.i.dTLB-store-misses
> 2.483e+10 +85.2% 4.598e+10 ± 5% perf-stat.i.dTLB-stores
> 1.699e+11 +85.8% 3.157e+11 ± 5% perf-stat.i.instructions
> 0.50 +73.4% 0.87 ± 4% perf-stat.i.ipc
> 2.75 +3.8% 2.86 perf-stat.i.metric.GHz
> 822.90 +77.2% 1458 ± 5% perf-stat.i.metric.M/sec
> 5691 -3.4% 5500 perf-stat.i.minor-faults
> 91.09 +4.6 95.71 perf-stat.i.node-load-miss-rate%
> 334087 ± 17% -67.4% 109033 ± 18% perf-stat.i.node-loads
> 70.09 +17.6 87.68 ± 6% perf-stat.i.node-store-miss-rate%
> 1559730 ± 5% -64.9% 548115 ± 56% perf-stat.i.node-stores
> 5704 -3.3% 5513 perf-stat.i.page-faults
> 26.03 -92.7% 1.89 ± 83% perf-stat.overall.MPKI
> 0.50 -0.3 0.24 ± 8% perf-stat.overall.branch-miss-rate%
> 0.43 ± 3% +2.5 2.91 ± 60% perf-stat.overall.cache-miss-rate%
> 2.08 -44.0% 1.16 ± 5% perf-stat.overall.cpi
> 18664 ± 4% +100.4% 37402 ± 16% perf-stat.overall.cycles-between-cache-misses
> 0.20 -0.2 0.01 ± 87% perf-stat.overall.dTLB-load-miss-rate%
> 0.10 ± 4% -0.1 0.01 ± 96% perf-stat.overall.dTLB-store-miss-rate%
> 0.48 +79.1% 0.86 ± 4% perf-stat.overall.ipc
> 91.02 +5.1 96.07 perf-stat.overall.node-load-miss-rate%
> 70.91 +17.6 88.54 ± 6% perf-stat.overall.node-store-miss-rate%
> 3.289e+10 +85.0% 6.085e+10 ± 5% perf-stat.ps.branch-instructions
> 1.641e+08 -13.1% 1.425e+08 ± 2% perf-stat.ps.branch-misses
> 18633656 ± 4% -46.7% 9931368 ± 18% perf-stat.ps.cache-misses
> 4.354e+09 -87.1% 5.613e+08 ± 77% perf-stat.ps.cache-references
> 6788892 +31.0% 8894592 ± 2% perf-stat.ps.context-switches
> 3.47e+11 +3.9% 3.604e+11 perf-stat.ps.cpu-cycles
> 2286778 -86.7% 304327 ± 94% perf-stat.ps.cpu-migrations
> 84173329 -87.2% 10770448 ± 82% perf-stat.ps.dTLB-load-misses
> 4.198e+10 +84.1% 7.73e+10 ± 5% perf-stat.ps.dTLB-loads
> 25001705 ± 4% -86.5% 3364501 ± 91% perf-stat.ps.dTLB-store-misses
> 2.444e+10 +85.5% 4.533e+10 ± 5% perf-stat.ps.dTLB-stores
> 1.673e+11 +86.1% 3.112e+11 ± 5% perf-stat.ps.instructions
> 12.40 -1.5% 12.22 perf-stat.ps.major-faults
> 5543 -3.9% 5329 perf-stat.ps.minor-faults
> 332272 ± 17% -66.0% 112911 ± 16% perf-stat.ps.node-loads
> 1533930 ± 5% -65.2% 534337 ± 57% perf-stat.ps.node-stores
> 5556 -3.9% 5341 perf-stat.ps.page-faults
> 1.065e+13 +86.7% 1.988e+13 ± 5% perf-stat.total.instructions
> 18.10 -16.2 1.91 ±142% perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
> 17.94 -16.1 1.88 ±142% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
> 17.93 -16.1 1.88 ±142% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
> 17.90 -16.0 1.88 ±142% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
> 15.83 -8.0 7.86 ± 20% perf-profile.calltrace.cycles-pp.read
> 13.06 -8.0 5.11 ± 30% perf-profile.calltrace.cycles-pp.pipe_read.new_sync_read.vfs_read.ksys_read.do_syscall_64
> 13.21 -7.9 5.30 ± 29% perf-profile.calltrace.cycles-pp.new_sync_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 13.58 -7.9 5.68 ± 27% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
> 14.72 -7.9 6.86 ± 22% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read
> 14.51 -7.8 6.73 ± 22% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
> 13.77 -7.5 6.23 ± 23% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
> 9.90 -7.1 2.83 ± 47% perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_read.new_sync_read.vfs_read
> 9.96 -7.0 2.92 ± 45% perf-profile.calltrace.cycles-pp.schedule.pipe_read.new_sync_read.vfs_read.ksys_read
> 7.84 -6.9 0.94 ±142% perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
> 7.44 -6.8 0.63 ±142% perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
> 7.07 -6.2 0.85 ±142% perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
> 7.03 -6.2 0.84 ±142% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
> 8.60 -5.3 3.30 ± 44% perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_write.new_sync_write.vfs_write
> 8.75 -5.3 3.49 ± 41% perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_write.new_sync_write.vfs_write.ksys_write
> 10.84 -5.3 5.58 ± 27% perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
> 10.53 -5.2 5.29 ± 28% perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
> 6.70 -5.2 1.49 ± 62% perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.pipe_read.new_sync_read
> 10.02 -5.2 4.83 ± 30% perf-profile.calltrace.cycles-pp.pipe_write.new_sync_write.vfs_write.ksys_write.do_syscall_64
> 5.64 -5.2 0.48 ±142% perf-profile.calltrace.cycles-pp.sched_ttwu_pending.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
> 10.08 -5.1 4.98 ± 29% perf-profile.calltrace.cycles-pp.new_sync_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 8.21 -5.1 3.14 ± 44% perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write
> 8.25 -5.1 3.20 ± 44% perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write.new_sync_write
> 5.11 -4.7 0.40 ±141% perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue.do_idle.cpu_startup_entry
> 5.07 -4.7 0.40 ±141% perf-profile.calltrace.cycles-pp.enqueue_task_fair.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue.do_idle
> 5.19 -4.6 0.55 ±141% perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.pipe_read
> 20.28 -4.6 15.69 ± 5% perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string.copyout.copy_page_to_iter.process_vm_rw_single_vec.process_vm_rw_core
> 20.55 -4.3 16.22 ± 5% perf-profile.calltrace.cycles-pp.copyout.copy_page_to_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
> 10.91 -3.7 7.17 ± 17% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
> 24.78 -3.7 21.12 ± 2% perf-profile.calltrace.cycles-pp.copy_page_to_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_readv
> 10.95 -3.6 7.30 ± 16% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.write
> 11.15 -3.0 8.18 ± 13% perf-profile.calltrace.cycles-pp.write
> 1.03 ± 4% -0.4 0.62 ± 14% perf-profile.calltrace.cycles-pp.stress_vm_child
> 0.76 ± 4% +0.5 1.22 ± 10% perf-profile.calltrace.cycles-pp.stress_vm_rw
> 0.63 +0.5 1.15 ± 23% perf-profile.calltrace.cycles-pp.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common
> 0.53 ± 2% +0.5 1.06 ± 9% perf-profile.calltrace.cycles-pp.__might_fault.copy_page_to_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
> 0.64 +0.5 1.18 ± 22% perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
> 0.00 +0.6 0.60 ± 7% perf-profile.calltrace.cycles-pp.__might_resched.copy_page_to_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
> 0.00 +1.0 0.96 ± 15% perf-profile.calltrace.cycles-pp.__schedule.schedule.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode
> 0.00 +1.0 1.02 ± 17% perf-profile.calltrace.cycles-pp.mod_node_page_state.gup_put_folio.unpin_user_pages.process_vm_rw_single_vec.process_vm_rw_core
> 0.00 +1.0 1.02 ± 10% perf-profile.calltrace.cycles-pp.__might_fault.copy_page_from_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
> 0.00 +1.0 1.03 ± 15% perf-profile.calltrace.cycles-pp.schedule.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
> 0.00 +1.0 1.04 ± 17% perf-profile.calltrace.cycles-pp.mod_node_page_state.gup_put_folio.unpin_user_pages_dirty_lock.process_vm_rw_single_vec.process_vm_rw_core
> 0.00 +1.1 1.10 ± 16% perf-profile.calltrace.cycles-pp.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 1.24 ± 2% +1.4 2.60 ± 11% perf-profile.calltrace.cycles-pp._raw_spin_lock.follow_page_pte.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec
> 0.00 +1.4 1.40 ± 16% perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
> 0.00 +1.5 1.46 ± 16% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
> 1.04 +1.5 2.54 ± 14% perf-profile.calltrace.cycles-pp.gup_put_folio.unpin_user_pages.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
> 0.99 ± 2% +1.6 2.57 ± 14% perf-profile.calltrace.cycles-pp.gup_put_folio.unpin_user_pages_dirty_lock.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
> 0.00 +1.7 1.69 ± 11% perf-profile.calltrace.cycles-pp.follow_pud_mask.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core
> 1.35 +1.8 3.20 ± 14% perf-profile.calltrace.cycles-pp.unpin_user_pages.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_readv
> 0.00 +2.0 1.97 ± 10% perf-profile.calltrace.cycles-pp.follow_page_mask.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core
> 1.27 ± 2% +2.0 3.30 ± 14% perf-profile.calltrace.cycles-pp.unpin_user_pages_dirty_lock.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_writev
> 0.00 +2.1 2.12 ± 18% perf-profile.calltrace.cycles-pp.mod_node_page_state.try_grab_page.follow_page_pte.__get_user_pages.__get_user_pages_remote
> 0.00 +2.3 2.30 ± 11% perf-profile.calltrace.cycles-pp.follow_pmd_mask.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core
> 3.16 ± 2% +2.4 5.51 ± 11% perf-profile.calltrace.cycles-pp.try_grab_page.follow_page_pte.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec
> 32.57 +5.2 37.78 ± 3% perf-profile.calltrace.cycles-pp.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_readv.do_syscall_64
> 33.67 +5.6 39.24 ± 3% perf-profile.calltrace.cycles-pp.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_readv.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 6.24 +6.1 12.34 ± 10% perf-profile.calltrace.cycles-pp.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_readv
> 34.23 +6.2 40.39 ± 3% perf-profile.calltrace.cycles-pp.process_vm_rw.__x64_sys_process_vm_readv.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_readv
> 34.22 +6.2 40.42 ± 3% perf-profile.calltrace.cycles-pp.__x64_sys_process_vm_readv.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_readv
> 34.39 +6.3 40.68 ± 3% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_readv
> 34.49 +6.4 40.88 ± 3% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.process_vm_readv
> 34.87 +6.6 41.43 ± 3% perf-profile.calltrace.cycles-pp.process_vm_readv
> 6.26 +6.6 12.83 ± 11% perf-profile.calltrace.cycles-pp.follow_page_pte.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core
> 7.43 +6.6 14.02 ± 8% perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string.copyin.copy_page_from_iter.process_vm_rw_single_vec.process_vm_rw_core
> 7.70 +6.9 14.64 ± 8% perf-profile.calltrace.cycles-pp.copyin.copy_page_from_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
> 4.80 ± 2% +7.2 11.95 ± 12% perf-profile.calltrace.cycles-pp.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_writev
> 9.52 +9.3 18.86 ± 8% perf-profile.calltrace.cycles-pp.copy_page_from_iter.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_writev
> 10.80 +12.9 23.71 ± 11% perf-profile.calltrace.cycles-pp.__get_user_pages.__get_user_pages_remote.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw
> 16.02 ± 2% +19.0 35.05 ± 10% perf-profile.calltrace.cycles-pp.process_vm_rw_single_vec.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_writev.do_syscall_64
> 16.70 ± 2% +19.7 36.44 ± 10% perf-profile.calltrace.cycles-pp.process_vm_rw_core.process_vm_rw.__x64_sys_process_vm_writev.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 17.15 ± 2% +20.4 37.51 ± 10% perf-profile.calltrace.cycles-pp.process_vm_rw.__x64_sys_process_vm_writev.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_writev
> 17.17 ± 2% +20.4 37.55 ± 10% perf-profile.calltrace.cycles-pp.__x64_sys_process_vm_writev.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_writev
> 17.26 ± 2% +20.5 37.72 ± 10% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.process_vm_writev
> 17.31 ± 2% +20.5 37.83 ± 10% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.process_vm_writev
> 17.60 ± 2% +20.8 38.42 ± 10% perf-profile.calltrace.cycles-pp.process_vm_writev
> 18.10 -16.2 1.92 ±141% perf-profile.children.cycles-pp.secondary_startup_64_no_verify
> 18.10 -16.2 1.92 ±141% perf-profile.children.cycles-pp.cpu_startup_entry
> 18.08 -16.2 1.92 ±141% perf-profile.children.cycles-pp.do_idle
> 17.94 -16.0 1.89 ±141% perf-profile.children.cycles-pp.start_secondary
> 16.00 -8.0 8.02 ± 19% perf-profile.children.cycles-pp.read
> 12.02 -7.9 4.10 ± 37% perf-profile.children.cycles-pp.__schedule
> 13.10 -7.9 5.18 ± 30% perf-profile.children.cycles-pp.pipe_read
> 13.22 -7.9 5.31 ± 29% perf-profile.children.cycles-pp.new_sync_read
> 13.60 -7.9 5.70 ± 27% perf-profile.children.cycles-pp.vfs_read
> 13.78 -7.5 6.25 ± 23% perf-profile.children.cycles-pp.ksys_read
> 7.62 -7.0 0.65 ±142% perf-profile.children.cycles-pp.flush_smp_call_function_queue
> 7.92 -7.0 0.96 ±142% perf-profile.children.cycles-pp.cpuidle_idle_call
> 7.18 -6.6 0.59 ±142% perf-profile.children.cycles-pp.sched_ttwu_pending
> 7.14 -6.3 0.86 ±142% perf-profile.children.cycles-pp.cpuidle_enter
> 7.24 -6.3 0.98 ±124% perf-profile.children.cycles-pp.update_cfs_group
> 7.12 -6.3 0.86 ±142% perf-profile.children.cycles-pp.cpuidle_enter_state
> 9.97 -6.0 3.96 ± 29% perf-profile.children.cycles-pp.schedule
> 6.52 -5.7 0.79 ±142% perf-profile.children.cycles-pp.mwait_idle_with_hints
> 7.11 -5.4 1.69 ± 57% perf-profile.children.cycles-pp.ttwu_do_activate
> 7.07 -5.4 1.66 ± 58% perf-profile.children.cycles-pp.enqueue_task_fair
> 8.61 -5.3 3.30 ± 44% perf-profile.children.cycles-pp.__wake_up_common
> 10.86 -5.2 5.61 ± 26% perf-profile.children.cycles-pp.ksys_write
> 8.76 -5.2 3.51 ± 41% perf-profile.children.cycles-pp.__wake_up_common_lock
> 10.55 -5.2 5.32 ± 28% perf-profile.children.cycles-pp.vfs_write
> 6.72 -5.2 1.50 ± 61% perf-profile.children.cycles-pp.dequeue_task_fair
> 10.04 -5.2 4.88 ± 30% perf-profile.children.cycles-pp.pipe_write
> 10.10 -5.1 5.00 ± 29% perf-profile.children.cycles-pp.new_sync_write
> 8.26 -5.1 3.20 ± 44% perf-profile.children.cycles-pp.autoremove_wake_function
> 8.22 -5.0 3.17 ± 44% perf-profile.children.cycles-pp.try_to_wake_up
> 5.62 -4.7 0.94 ± 62% perf-profile.children.cycles-pp.enqueue_entity
> 21.11 -4.4 16.69 ± 5% perf-profile.children.cycles-pp.copyout
> 5.21 -4.4 0.84 ± 70% perf-profile.children.cycles-pp.dequeue_entity
> 5.50 -4.3 1.25 ± 48% perf-profile.children.cycles-pp.update_load_avg
> 25.20 -3.0 22.21 ± 2% perf-profile.children.cycles-pp.copy_page_to_iter
> 11.21 -2.9 8.33 ± 13% perf-profile.children.cycles-pp.write
> 3.29 -2.5 0.83 ± 78% perf-profile.children.cycles-pp.select_task_rq
> 3.22 -2.4 0.78 ± 83% perf-profile.children.cycles-pp.select_task_rq_fair
> 2.78 -2.2 0.62 ± 93% perf-profile.children.cycles-pp.select_idle_sibling
> 1.80 ± 2% -1.5 0.29 ±137% perf-profile.children.cycles-pp.available_idle_cpu
> 0.87 -0.6 0.22 ± 57% perf-profile.children.cycles-pp.finish_task_switch
> 0.98 -0.5 0.45 ± 23% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
> 0.94 -0.5 0.41 ± 27% perf-profile.children.cycles-pp.prepare_to_wait_event
> 0.60 ± 2% -0.5 0.13 ± 76% perf-profile.children.cycles-pp.switch_mm_irqs_off
> 1.04 ± 4% -0.4 0.63 ± 14% perf-profile.children.cycles-pp.stress_vm_child
> 0.58 -0.3 0.29 ± 28% perf-profile.children.cycles-pp.update_rq_clock
> 0.69 -0.3 0.40 ± 19% perf-profile.children.cycles-pp.prepare_task_switch
> 0.89 ± 3% -0.3 0.62 ± 13% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
> 0.58 -0.3 0.31 ± 21% perf-profile.children.cycles-pp.__switch_to_asm
> 0.52 -0.2 0.28 ± 21% perf-profile.children.cycles-pp.___perf_sw_event
> 0.76 ± 4% -0.2 0.56 ± 12% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
> 0.28 ± 3% -0.2 0.09 ± 52% perf-profile.children.cycles-pp._find_next_bit
> 0.50 -0.2 0.33 ± 13% perf-profile.children.cycles-pp.security_file_permission
> 0.26 ± 2% -0.2 0.09 ± 41% perf-profile.children.cycles-pp.task_tick_fair
> 0.24 ± 6% -0.2 0.08 ± 57% perf-profile.children.cycles-pp.__irq_exit_rcu
> 0.48 -0.2 0.32 ± 5% perf-profile.children.cycles-pp.set_next_entity
> 0.42 ± 4% -0.1 0.27 ± 13% perf-profile.children.cycles-pp.__hrtimer_run_queues
> 0.30 ± 3% -0.1 0.16 ± 18% perf-profile.children.cycles-pp.scheduler_tick
> 0.37 ± 4% -0.1 0.24 ± 13% perf-profile.children.cycles-pp.tick_sched_timer
> 0.35 ± 4% -0.1 0.22 ± 14% perf-profile.children.cycles-pp.tick_sched_handle
> 0.34 ± 4% -0.1 0.22 ± 15% perf-profile.children.cycles-pp.update_process_times
> 0.25 ± 3% -0.1 0.12 ± 40% perf-profile.children.cycles-pp.find_vma
> 0.19 ± 7% -0.1 0.07 ± 56% perf-profile.children.cycles-pp.__softirqentry_text_start
> 0.68 -0.1 0.56 ± 5% perf-profile.children.cycles-pp.mutex_lock
> 0.26 ± 3% -0.1 0.14 ± 34% perf-profile.children.cycles-pp.find_extend_vma
> 0.38 ± 2% -0.1 0.26 ± 10% perf-profile.children.cycles-pp.apparmor_file_permission
> 0.37 -0.1 0.27 ± 9% perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
> 0.34 -0.1 0.25 ± 10% perf-profile.children.cycles-pp.perf_trace_sched_wakeup_template
> 0.24 ± 3% -0.1 0.16 ± 9% perf-profile.children.cycles-pp.sched_clock_cpu
> 0.17 ± 4% -0.1 0.09 ± 33% perf-profile.children.cycles-pp.vmacache_find
> 0.64 -0.1 0.56 ± 6% perf-profile.children.cycles-pp.switch_fpu_return
> 0.53 ± 4% -0.1 0.46 ± 8% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
> 0.53 ± 4% -0.1 0.46 ± 8% perf-profile.children.cycles-pp.hrtimer_interrupt
> 0.19 ± 2% -0.1 0.13 ± 9% perf-profile.children.cycles-pp.native_sched_clock
> 0.57 -0.1 0.50 ± 5% perf-profile.children.cycles-pp.__switch_to
> 0.27 -0.0 0.22 ± 2% perf-profile.children.cycles-pp.mutex_unlock
> 0.10 ± 5% -0.0 0.06 ± 17% perf-profile.children.cycles-pp.anon_pipe_buf_release
> 0.13 ± 2% -0.0 0.10 ± 9% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
> 0.17 ± 2% -0.0 0.14 ± 8% perf-profile.children.cycles-pp.perf_tp_event
> 0.08 ± 4% -0.0 0.05 ± 8% perf-profile.children.cycles-pp.__list_add_valid
> 0.10 -0.0 0.09 perf-profile.children.cycles-pp.finish_wait
> 0.14 +0.0 0.16 ± 3% perf-profile.children.cycles-pp.atime_needs_update
> 0.12 ± 4% +0.0 0.15 ± 2% perf-profile.children.cycles-pp.file_update_time
> 0.06 ± 6% +0.0 0.09 ± 7% perf-profile.children.cycles-pp.__rdgsbase_inactive
> 0.10 ± 6% +0.0 0.13 ± 5% perf-profile.children.cycles-pp.__wrgsbase_inactive
> 0.75 +0.0 0.79 perf-profile.children.cycles-pp.pick_next_task_fair
> 0.05 +0.0 0.09 ± 5% perf-profile.children.cycles-pp.pick_next_entity
> 0.19 ± 2% +0.0 0.23 ± 5% perf-profile.children.cycles-pp.down_read_killable
> 0.02 ±141% +0.0 0.06 ± 11% perf-profile.children.cycles-pp.perf_trace_sched_switch
> 0.05 +0.0 0.10 ± 6% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
> 0.02 ± 99% +0.1 0.08 ± 6% perf-profile.children.cycles-pp.kmalloc_slab
> 0.14 ± 2% +0.1 0.20 perf-profile.children.cycles-pp.down_read
> 0.00 +0.1 0.06 ± 8% perf-profile.children.cycles-pp.resched_curr
> 0.02 ± 99% +0.1 0.08 ± 8% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
> 0.16 ± 3% +0.1 0.21 ± 2% perf-profile.children.cycles-pp.mmput
> 0.00 +0.1 0.06 ± 9% perf-profile.children.cycles-pp.ktime_get_coarse_real_ts64
> 0.13 ± 2% +0.1 0.19 ± 6% perf-profile.children.cycles-pp.get_task_mm
> 0.00 +0.1 0.07 ± 7% perf-profile.children.cycles-pp.idr_find
> 0.24 ± 2% +0.1 0.31 ± 4% perf-profile.children.cycles-pp.__update_load_avg_se
> 0.02 ±141% +0.1 0.09 ± 10% perf-profile.children.cycles-pp.__calc_delta
> 0.66 +0.1 0.73 ± 2% perf-profile.children.cycles-pp.update_curr
> 0.04 ± 44% +0.1 0.12 ± 9% perf-profile.children.cycles-pp.memcg_slab_free_hook
> 0.00 +0.1 0.07 ± 18% perf-profile.children.cycles-pp.cpumask_next_and
> 0.00 +0.1 0.08 ± 10% perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
> 0.08 ± 4% +0.1 0.16 ± 10% perf-profile.children.cycles-pp.up_read
> 0.07 ± 11% +0.1 0.15 ± 8% perf-profile.children.cycles-pp.clockevents_program_event
> 0.13 +0.1 0.22 ± 3% perf-profile.children.cycles-pp.ttwu_do_wakeup
> 0.06 ± 6% +0.1 0.15 ± 9% perf-profile.children.cycles-pp.current_time
> 0.00 +0.1 0.09 ± 10% perf-profile.children.cycles-pp.check_stack_object
> 0.11 ± 4% +0.1 0.20 ± 4% perf-profile.children.cycles-pp.check_preempt_curr
> 0.15 ± 4% +0.1 0.26 ± 6% perf-profile.children.cycles-pp.os_xsave
> 0.12 ± 4% +0.1 0.23 ± 9% perf-profile.children.cycles-pp.syscall_enter_from_user_mode
> 0.16 ± 3% +0.1 0.27 ± 9% perf-profile.children.cycles-pp.reweight_entity
> 0.61 ± 2% +0.1 0.75 ± 2% perf-profile.children.cycles-pp.find_get_task_by_vpid
> 0.19 ± 5% +0.1 0.33 ± 6% perf-profile.children.cycles-pp.__radix_tree_lookup
> 0.10 ± 3% +0.2 0.25 ± 10% perf-profile.children.cycles-pp.__check_object_size
> 0.13 ± 3% +0.2 0.29 ± 9% perf-profile.children.cycles-pp.syscall_return_via_sysret
> 0.11 ± 6% +0.2 0.26 ± 11% perf-profile.children.cycles-pp.follow_huge_addr
> 0.37 ± 2% +0.2 0.53 ± 6% perf-profile.children.cycles-pp.mm_access
> 0.00 +0.2 0.17 ± 9% perf-profile.children.cycles-pp.check_preempt_wakeup
> 0.00 +0.2 0.17 ± 14% perf-profile.children.cycles-pp.put_prev_entity
> 0.15 ± 3% +0.2 0.39 ± 12% perf-profile.children.cycles-pp.pud_huge
> 0.14 ± 3% +0.2 0.39 ± 11% perf-profile.children.cycles-pp.mark_page_accessed
> 0.19 ± 3% +0.3 0.44 ± 9% perf-profile.children.cycles-pp.kfree
> 0.19 ± 3% +0.3 0.50 ± 12% perf-profile.children.cycles-pp.pmd_huge
> 0.31 ± 3% +0.3 0.65 ± 9% perf-profile.children.cycles-pp.__kmalloc
> 0.28 ± 3% +0.4 0.67 ± 10% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
> 0.26 ± 3% +0.4 0.70 ± 12% perf-profile.children.cycles-pp.vm_normal_page
> 0.28 ± 2% +0.5 0.73 ± 11% perf-profile.children.cycles-pp.folio_mark_accessed
> 0.40 ± 2% +0.5 0.86 ± 10% perf-profile.children.cycles-pp.__entry_text_start
> 0.77 ± 4% +0.5 1.24 ± 10% perf-profile.children.cycles-pp.stress_vm_rw
> 2.54 +0.6 3.10 ± 4% perf-profile.children.cycles-pp._raw_spin_lock
> 0.44 ± 3% +0.6 1.02 ± 10% perf-profile.children.cycles-pp.__import_iovec
> 0.46 ± 3% +0.6 1.05 ± 10% perf-profile.children.cycles-pp.import_iovec
> 0.57 ± 2% +0.7 1.26 ± 9% perf-profile.children.cycles-pp._copy_from_user
> 0.47 ± 3% +0.7 1.18 ± 11% perf-profile.children.cycles-pp.rcu_all_qs
> 1.72 +0.9 2.58 ± 6% perf-profile.children.cycles-pp.__cond_resched
> 0.75 ± 2% +1.0 1.71 ± 9% perf-profile.children.cycles-pp.iovec_from_user
> 0.00 +1.1 1.11 ± 16% perf-profile.children.cycles-pp.exit_to_user_mode_loop
> 0.98 ± 2% +1.1 2.10 ± 10% perf-profile.children.cycles-pp.__might_sleep
> 0.70 +1.1 1.84 ± 9% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
> 0.76 ± 3% +1.2 1.95 ± 11% perf-profile.children.cycles-pp.follow_pud_mask
> 0.79 +1.3 2.06 ± 9% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
> 0.93 ± 2% +1.3 2.22 ± 10% perf-profile.children.cycles-pp.follow_page_mask
> 1.30 ± 2% +1.5 2.80 ± 9% perf-profile.children.cycles-pp.__might_fault
> 1.03 ± 2% +1.6 2.67 ± 11% perf-profile.children.cycles-pp.follow_pmd_mask
> 2.83 +1.7 4.55 ± 7% perf-profile.children.cycles-pp.__might_resched
> 1.37 +1.9 3.26 ± 14% perf-profile.children.cycles-pp.unpin_user_pages
> 1.30 ± 2% +2.1 3.38 ± 14% perf-profile.children.cycles-pp.unpin_user_pages_dirty_lock
> 3.29 ± 2% +2.6 5.87 ± 11% perf-profile.children.cycles-pp.try_grab_page
> 29.19 +2.6 31.78 perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
> 1.40 +3.0 4.42 ± 17% perf-profile.children.cycles-pp.mod_node_page_state
> 2.13 ± 2% +3.2 5.35 ± 14% perf-profile.children.cycles-pp.gup_put_folio
> 34.26 +6.2 40.45 ± 3% perf-profile.children.cycles-pp.__x64_sys_process_vm_readv
> 34.96 +6.7 41.64 ± 3% perf-profile.children.cycles-pp.process_vm_readv
> 8.05 +7.0 15.10 ± 8% perf-profile.children.cycles-pp.copyin
> 6.54 +7.1 13.60 ± 11% perf-profile.children.cycles-pp.follow_page_pte
> 10.06 +9.7 19.72 ± 8% perf-profile.children.cycles-pp.copy_page_from_iter
> 10.99 +13.2 24.22 ± 11% perf-profile.children.cycles-pp.__get_user_pages
> 11.04 +13.3 24.31 ± 11% perf-profile.children.cycles-pp.__get_user_pages_remote
> 77.15 +15.3 92.46 ± 2% perf-profile.children.cycles-pp.do_syscall_64
> 77.51 +15.4 92.94 ± 2% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
> 17.18 ± 2% +20.4 37.56 ± 10% perf-profile.children.cycles-pp.__x64_sys_process_vm_writev
> 17.69 ± 2% +20.9 38.64 ± 10% perf-profile.children.cycles-pp.process_vm_writev
> 48.65 +24.3 72.98 ± 6% perf-profile.children.cycles-pp.process_vm_rw_single_vec
> 50.39 +25.4 75.74 ± 6% perf-profile.children.cycles-pp.process_vm_rw_core
> 51.40 +26.5 77.94 ± 6% perf-profile.children.cycles-pp.process_vm_rw
> 7.23 -6.3 0.97 ±125% perf-profile.self.cycles-pp.update_cfs_group
> 6.42 -5.6 0.78 ±142% perf-profile.self.cycles-pp.mwait_idle_with_hints
> 4.82 -4.2 0.67 ± 85% perf-profile.self.cycles-pp.update_load_avg
> 1.78 ± 2% -1.5 0.29 ±137% perf-profile.self.cycles-pp.available_idle_cpu
> 0.96 -0.5 0.42 ± 26% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
> 0.98 ± 2% -0.5 0.45 ± 28% perf-profile.self.cycles-pp.__schedule
> 0.59 ± 2% -0.5 0.13 ± 78% perf-profile.self.cycles-pp.switch_mm_irqs_off
> 0.97 -0.4 0.61 ± 15% perf-profile.self.cycles-pp.stress_vm_child
> 0.45 ± 2% -0.3 0.14 ± 58% perf-profile.self.cycles-pp.update_rq_clock
> 0.58 -0.3 0.30 ± 22% perf-profile.self.cycles-pp.__switch_to_asm
> 0.58 -0.3 0.33 ± 17% perf-profile.self.cycles-pp.pipe_read
> 0.35 ± 3% -0.2 0.10 ± 59% perf-profile.self.cycles-pp.__wake_up_common
> 0.47 -0.2 0.24 ± 23% perf-profile.self.cycles-pp.___perf_sw_event
> 0.32 ± 2% -0.2 0.13 ± 39% perf-profile.self.cycles-pp.finish_task_switch
> 0.38 ± 2% -0.2 0.21 ± 20% perf-profile.self.cycles-pp.prepare_to_wait_event
> 0.25 ± 2% -0.2 0.08 ± 53% perf-profile.self.cycles-pp._find_next_bit
> 0.31 ± 3% -0.2 0.14 ± 27% perf-profile.self.cycles-pp.enqueue_entity
> 0.34 ± 15% -0.2 0.18 ± 12% perf-profile.self.cycles-pp.read
> 0.26 ± 2% -0.2 0.11 ± 20% perf-profile.self.cycles-pp.try_to_wake_up
> 0.29 ± 3% -0.2 0.13 ± 30% perf-profile.self.cycles-pp.prepare_task_switch
> 0.45 ± 2% -0.1 0.31 ± 9% perf-profile.self.cycles-pp.mutex_lock
> 0.26 -0.1 0.16 ± 14% perf-profile.self.cycles-pp.apparmor_file_permission
> 0.13 ± 2% -0.1 0.03 ±103% perf-profile.self.cycles-pp.perf_trace_sched_wakeup_template
> 0.35 -0.1 0.26 ± 11% perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
> 0.16 ± 3% -0.1 0.08 ± 22% perf-profile.self.cycles-pp.dequeue_entity
> 0.56 -0.1 0.48 ± 5% perf-profile.self.cycles-pp.__switch_to
> 0.15 ± 2% -0.1 0.08 ± 37% perf-profile.self.cycles-pp.vmacache_find
> 0.18 ± 2% -0.1 0.11 ± 25% perf-profile.self.cycles-pp.select_idle_sibling
> 0.19 ± 3% -0.1 0.13 ± 10% perf-profile.self.cycles-pp.native_sched_clock
> 0.13 ± 3% -0.1 0.07 ± 17% perf-profile.self.cycles-pp.security_file_permission
> 0.19 -0.1 0.13 ± 18% perf-profile.self.cycles-pp.enqueue_task_fair
> 0.35 ± 2% -0.1 0.30 ± 8% perf-profile.self.cycles-pp.update_curr
> 0.15 ± 2% -0.1 0.10 ± 12% perf-profile.self.cycles-pp.dequeue_task_fair
> 0.26 -0.0 0.21 ± 2% perf-profile.self.cycles-pp.mutex_unlock
> 0.38 -0.0 0.34 ± 3% perf-profile.self.cycles-pp.find_get_task_by_vpid
> 0.09 ± 4% -0.0 0.06 ± 13% perf-profile.self.cycles-pp.anon_pipe_buf_release
> 0.11 ± 4% -0.0 0.08 ± 10% perf-profile.self.cycles-pp.atime_needs_update
> 0.21 ± 2% -0.0 0.19 ± 3% perf-profile.self.cycles-pp.vfs_read
> 0.17 ± 2% -0.0 0.14 ± 3% perf-profile.self.cycles-pp.switch_fpu_return
> 0.11 ± 4% -0.0 0.09 ± 10% perf-profile.self.cycles-pp.aa_file_perm
> 0.08 ± 6% -0.0 0.06 ± 13% perf-profile.self.cycles-pp.select_task_rq
> 0.05 +0.0 0.06 perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
> 0.06 +0.0 0.07 perf-profile.self.cycles-pp.set_next_entity
> 0.07 ± 5% +0.0 0.09 perf-profile.self.cycles-pp.get_task_mm
> 0.06 ± 9% +0.0 0.08 perf-profile.self.cycles-pp.__get_user_pages_remote
> 0.06 ± 6% +0.0 0.09 ± 5% perf-profile.self.cycles-pp.__rdgsbase_inactive
> 0.10 ± 5% +0.0 0.13 ± 4% perf-profile.self.cycles-pp.__wrgsbase_inactive
> 0.11 +0.0 0.15 ± 2% perf-profile.self.cycles-pp.pick_next_task_fair
> 0.09 ± 9% +0.0 0.13 ± 8% perf-profile.self.cycles-pp.ktime_get
> 0.03 ± 70% +0.0 0.08 ± 6% perf-profile.self.cycles-pp.pick_next_entity
> 0.08 ± 6% +0.1 0.13 ± 7% perf-profile.self.cycles-pp.vfs_write
> 0.00 +0.1 0.06 ± 9% perf-profile.self.cycles-pp.resched_curr
> 0.01 ±223% +0.1 0.06 ± 11% perf-profile.self.cycles-pp.perf_trace_sched_switch
> 0.00 +0.1 0.06 ± 8% perf-profile.self.cycles-pp.put_prev_entity
> 0.00 +0.1 0.06 ± 8% perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
> 0.00 +0.1 0.06 ± 8% perf-profile.self.cycles-pp.idr_find
> 0.22 ± 2% +0.1 0.28 ± 4% perf-profile.self.cycles-pp.__update_load_avg_se
> 0.00 +0.1 0.06 ± 11% perf-profile.self.cycles-pp.ksys_write
> 0.00 +0.1 0.06 ± 14% perf-profile.self.cycles-pp.__wake_up_common_lock
> 0.00 +0.1 0.06 ± 7% perf-profile.self.cycles-pp.kmalloc_slab
> 0.10 ± 6% +0.1 0.16 ± 6% perf-profile.self.cycles-pp.write
> 0.00 +0.1 0.07 ± 15% perf-profile.self.cycles-pp.check_preempt_wakeup
> 0.08 ± 6% +0.1 0.15 ± 9% perf-profile.self.cycles-pp.up_read
> 0.01 ±223% +0.1 0.08 ± 12% perf-profile.self.cycles-pp.perf_trace_sched_stat_runtime
> 0.01 ±223% +0.1 0.08 ± 10% perf-profile.self.cycles-pp.__calc_delta
> 0.00 +0.1 0.08 ± 10% perf-profile.self.cycles-pp.check_stack_object
> 0.00 +0.1 0.08 ± 20% perf-profile.self.cycles-pp.exit_to_user_mode_loop
> 0.02 ±141% +0.1 0.09 ± 10% perf-profile.self.cycles-pp.new_sync_write
> 0.02 ±141% +0.1 0.09 ± 10% perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
> 0.06 ± 8% +0.1 0.14 ± 11% perf-profile.self.cycles-pp.follow_huge_addr
> 0.10 ± 4% +0.1 0.19 ± 10% perf-profile.self.cycles-pp.syscall_enter_from_user_mode
> 0.00 +0.1 0.09 ± 12% perf-profile.self.cycles-pp.current_time
> 0.06 +0.1 0.15 ± 11% perf-profile.self.cycles-pp.__import_iovec
> 0.05 ± 8% +0.1 0.14 ± 13% perf-profile.self.cycles-pp.__check_object_size
> 0.03 ± 70% +0.1 0.14 ± 15% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
> 0.19 ± 3% +0.1 0.30 ± 5% perf-profile.self.cycles-pp.process_vm_readv
> 0.15 ± 3% +0.1 0.25 ± 5% perf-profile.self.cycles-pp.os_xsave
> 0.00 +0.1 0.11 ± 9% perf-profile.self.cycles-pp.memcg_slab_free_hook
> 0.07 ± 5% +0.1 0.18 ± 10% perf-profile.self.cycles-pp._copy_from_user
> 0.15 ± 2% +0.1 0.26 ± 6% perf-profile.self.cycles-pp.pipe_write
> 0.05 +0.1 0.16 ± 12% perf-profile.self.cycles-pp.exit_to_user_mode_prepare
> 0.12 ± 3% +0.1 0.24 ± 8% perf-profile.self.cycles-pp.__entry_text_start
> 0.00 +0.1 0.12 ± 14% perf-profile.self.cycles-pp.schedule
> 0.14 ± 12% +0.1 0.26 ± 5% perf-profile.self.cycles-pp.process_vm_rw
> 0.38 +0.1 0.51 ± 6% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
> 0.19 ± 7% +0.1 0.32 ± 6% perf-profile.self.cycles-pp.__radix_tree_lookup
> 0.09 ± 4% +0.1 0.23 ± 10% perf-profile.self.cycles-pp.iovec_from_user
> 0.09 +0.1 0.24 ± 11% perf-profile.self.cycles-pp.mark_page_accessed
> 0.14 ± 3% +0.2 0.29 ± 9% perf-profile.self.cycles-pp.process_vm_rw_core
> 0.13 ± 2% +0.2 0.28 ± 9% perf-profile.self.cycles-pp.syscall_return_via_sysret
> 0.13 ± 2% +0.2 0.29 ± 10% perf-profile.self.cycles-pp.process_vm_writev
> 0.10 ± 3% +0.2 0.26 ± 13% perf-profile.self.cycles-pp.pud_huge
> 0.09 ± 5% +0.2 0.26 ± 13% perf-profile.self.cycles-pp.pmd_huge
> 0.14 ± 3% +0.2 0.31 ± 11% perf-profile.self.cycles-pp.copyout
> 0.14 ± 4% +0.2 0.31 ± 10% perf-profile.self.cycles-pp.kfree
> 0.17 ± 3% +0.2 0.35 ± 9% perf-profile.self.cycles-pp.do_syscall_64
> 0.18 ± 3% +0.2 0.36 ± 9% perf-profile.self.cycles-pp.__kmalloc
> 0.18 ± 2% +0.3 0.45 ± 10% perf-profile.self.cycles-pp.copyin
> 0.24 ± 3% +0.3 0.50 ± 9% perf-profile.self.cycles-pp.__might_fault
> 0.11 ± 5% +0.3 0.41 ± 22% perf-profile.self.cycles-pp.ksys_read
> 0.30 ± 3% +0.3 0.65 ± 12% perf-profile.self.cycles-pp.unpin_user_pages
> 0.22 ± 3% +0.4 0.58 ± 12% perf-profile.self.cycles-pp.vm_normal_page
> 0.26 ± 4% +0.4 0.62 ± 11% perf-profile.self.cycles-pp.rcu_all_qs
> 0.28 ± 2% +0.4 0.65 ± 10% perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
> 0.23 ± 3% +0.4 0.61 ± 12% perf-profile.self.cycles-pp.folio_mark_accessed
> 0.28 ± 4% +0.5 0.73 ± 13% perf-profile.self.cycles-pp.unpin_user_pages_dirty_lock
> 0.67 ± 4% +0.5 1.15 ± 12% perf-profile.self.cycles-pp.stress_vm_rw
> 0.56 ± 2% +0.5 1.06 ± 8% perf-profile.self.cycles-pp.process_vm_rw_single_vec
> 0.91 +0.6 1.55 ± 8% perf-profile.self.cycles-pp.__cond_resched
> 0.81 ± 2% +0.9 1.72 ± 10% perf-profile.self.cycles-pp.__might_sleep
> 0.61 ± 3% +0.9 1.55 ± 11% perf-profile.self.cycles-pp.follow_pud_mask
> 1.94 +0.9 2.89 ± 7% perf-profile.self.cycles-pp._raw_spin_lock
> 1.60 +1.0 2.57 ± 7% perf-profile.self.cycles-pp.copy_page_to_iter
> 2.56 ± 2% +1.0 3.61 ± 7% perf-profile.self.cycles-pp.try_grab_page
> 0.88 ± 2% +1.1 1.98 ± 10% perf-profile.self.cycles-pp.copy_page_from_iter
> 0.82 ± 3% +1.1 1.95 ± 11% perf-profile.self.cycles-pp.follow_page_mask
> 2.60 +1.4 3.98 ± 6% perf-profile.self.cycles-pp.__might_resched
> 0.88 ± 2% +1.4 2.27 ± 11% perf-profile.self.cycles-pp.follow_pmd_mask
> 0.88 ± 2% +1.4 2.28 ± 11% perf-profile.self.cycles-pp.__get_user_pages
> 1.43 ± 2% +1.7 3.15 ± 12% perf-profile.self.cycles-pp.gup_put_folio
> 1.50 ± 3% +2.3 3.79 ± 11% perf-profile.self.cycles-pp.follow_page_pte
> 28.89 +2.5 31.42 perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
> 1.26 +2.8 4.04 ± 18% perf-profile.self.cycles-pp.mod_node_page_state
>
>
>
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
On Sat, Jul 09, 2022 at 11:56:19PM +0800, Abel Wu wrote:
>
> On 7/9/22 4:55 PM, Chen Yu Wrote:
> > On Thu, Jun 30, 2022 at 06:46:08PM +0800, Abel Wu wrote:
> > >
> > > On 6/30/22 12:16 PM, Chen Yu Wrote:
> > > > On Tue, Jun 28, 2022 at 03:58:55PM +0800, Abel Wu wrote:
> > > > >
> > > > > On 6/27/22 6:13 PM, Abel Wu Wrote:
> > > > > There seems to be not much difference except the hackbench pipe test at
> > > > > certain groups (30~110).
> > > > OK, a smaller LLC domain seems to not make much difference, which might
> > > > suggest that by leveraging the load balance code path, the read/write
> > > > to the LLC shared mask might not be the bottleneck. I have a vague
> > > > impression that during Aubrey's cpumask searching for idle CPUs
> > > > work[1], there was concern that updating the shared mask in a large LLC
> > > > introduced cache contention and performance degradation. Maybe we
> > > > can find that regressed test case to verify.
> > > > [1] https://lore.kernel.org/all/[email protected]/
> > >
> > > I just went through Aubrey's v1-v11 patches and didn't find any
> > > particular tests other than hackbench/tbench/uperf. Please let
> > > me know if I missed something, thanks!
> > >
> > I haven't found any testcase that could trigger the cache contention
> > issue. I thought we could stick with these testcases for now, especially
> > tbench, which has detected a cache issue described in
> > https://lore.kernel.org/lkml/[email protected]
> > if I understand correctly.
>
> I agree.
>
> > > > > I intend to provide better scalability
> > > > > by applying the filter, which will be enabled when:
> > > > >
> > > > > - The LLC is large enough that simply traversing becomes
> > > > > insufficient, and/or
> > > > >
> > > > > - The LLC is loaded enough that unoccupied cpus are a minority.
> > > > >
> > > > > But it would be very nice if a more fine-grained pattern works well
> > > > > so we can drop the above constraints.
> > > > >
> > > > We can first try to push a simple version, and later optimize it.
> > > > One concern about v4 is that we changed the logic in v3, which recorded
> > > > the overloaded CPU, while v4 tracks unoccupied CPUs. An overloaded CPU is
> > > > more "stable" because there is more than one running task on that runqueue.
> > > > It is more likely to remain "occupied" for a while. That is to say,
> > > > nr_task = 1, 2, 3... will all be regarded as occupied, while only nr_task = 0
> > > > is unoccupied. The former would bring fewer false negatives/positives.
> > >
> > > Yes, I like the 'overloaded mask' too, but the downside is the extra
> > > cpumask ops needed in the SIS path (the added cpumask_andnot).
> > > Besides, in this patch, the 'overloaded mask' is also unstable because
> > > the state is maintained at core level rather than per-cpu; some
> > > more thoughts are in the cover letter.
> > >
> > I see.
> > > >
> > > > So far I have tested hackbench/schbench/netperf on top of Peter's sched/core branch,
> > > > with SIS_UTIL enabled. Overall it looks good, and netperf shows an especially
> > > > significant improvement when the load approaches overloaded (which is aligned
> > > > with your comment above). I'll re-run netperf for several cycles to check the
> > > > standard deviation. And I'm also curious about v3's performance because it
> > > > tracks overloaded CPUs, so I'll also test v3 with small modifications.
> > >
> > > Thanks very much for your reviewing and testing.
> > >
> > I modified your v3 patch a little bit, and the test result shows good improvement
> > on netperf and no significant regression on schbench/tbench/hackbench on this draft
>
> > I don't know why there is such a big improvement in netperf TCP_RR
> > 168-threads while the results under other configs are flat.
>
Here is my thought: in previous testing of SIS_UTIL on the same platform, netperf
prefers to run on the previous CPU rather than on a new idle one. It brings an
improvement for 224-threads when SIS_UTIL is enabled, because SIS_UTIL terminates
the scan earlier in that case. And the current v3 patch also terminates the scan
in the 168-threads case (which SIS_UTIL does not), so it gets a further improvement.
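For reference, SIS_UTIL cuts the domain scan short roughly as follows (a
simplified sketch of the logic in select_idle_cpu(), not a verbatim quote of
the mainline code):

        struct sched_domain_shared *sd_share;
        int nr = INT_MAX;

        sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
        if (sched_feat(SIS_UTIL) && sd_share) {
                /* nr_idle_scan is derived from the LLC's average utilization */
                nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
                /* an overloaded LLC is unlikely to have an idle cpu, bail out */
                if (nr == 1)
                        return -1;
        }

Once the LLC gets saturated, nr collapses and the scan is skipped almost
entirely, which is what "terminates the scan earlier" refers to above.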
> > patch. I would like to vote for your v3 version as it seems to be more
> > straightforward. What do you think of the following change:
> >
> > From 277b60b7cd055d5be93188a552da50fdfe53214c Mon Sep 17 00:00:00 2001
> > From: Abel Wu <[email protected]>
> > Date: Fri, 8 Jul 2022 02:16:47 +0800
> > Subject: [PATCH] sched/fair: Introduce SIS_FILTER to skip overloaded CPUs
> > during SIS
> >
> > Currently SIS_UTIL is used to limit the scan depth of idle CPUs in
> > select_idle_cpu(). There could be another optimization to filter
> > the overloaded CPUs so as to further speed up select_idle_cpu().
> > Launch the CPU overload check in the periodic tick, and take into
> > consideration the nr_running, avg_util and runnable_avg of that CPU.
> > If the CPU is overloaded, add it into the per-LLC overload cpumask, so
> > select_idle_cpu() could skip those overloaded CPUs. Although this
> > detection is done in the periodic tick, checking the pelt signal of
> > the CPU would make the 'overloaded' state more stable and reduce the
> > frequency of updating the LLC shared mask, so as to mitigate the cache
> > contention in the LLC.
> >
> > The following results are tested on top of the latest sched/core tip.
> > The baseline is with SIS_UTIL enabled, and it is compared against both
> > SIS_FILTER/SIS_UTIL enabled. Positive %compare stands for better performance.
>
> Can you share the cpu topology please?
>
It is a 2-socket system, with 112 CPUs in each LLC domain/socket.
> >
> > hackbench
> > =========
> > case load baseline(std%) compare%( std%)
> > process-pipe 1 group 1.00 ( 0.59) -1.35 ( 0.88)
> > process-pipe 2 groups 1.00 ( 0.38) -1.49 ( 0.04)
> > process-pipe 4 groups 1.00 ( 0.45) +0.10 ( 0.91)
> > process-pipe 8 groups 1.00 ( 0.11) +0.03 ( 0.38)
> > process-sockets 1 group 1.00 ( 3.48) +2.88 ( 7.07)
> > process-sockets 2 groups 1.00 ( 2.38) -3.78 ( 2.81)
> > process-sockets 4 groups 1.00 ( 0.26) -1.79 ( 0.82)
> > process-sockets 8 groups 1.00 ( 0.07) -0.35 ( 0.07)
> > threads-pipe 1 group 1.00 ( 0.87) -0.21 ( 0.71)
> > threads-pipe 2 groups 1.00 ( 0.63) +0.34 ( 0.45)
> > threads-pipe 4 groups 1.00 ( 0.18) -0.02 ( 0.50)
> > threads-pipe 8 groups 1.00 ( 0.08) +0.46 ( 0.05)
> > threads-sockets 1 group 1.00 ( 0.80) -0.08 ( 1.06)
> > threads-sockets 2 groups 1.00 ( 0.55) +0.06 ( 0.85)
> > threads-sockets 4 groups 1.00 ( 1.00) -2.13 ( 0.18)
> > threads-sockets 8 groups 1.00 ( 0.07) -0.41 ( 0.08)
> >
> > netperf
> > =======
> > case load baseline(std%) compare%( std%)
> > TCP_RR 28 threads 1.00 ( 0.50) +0.19 ( 0.53)
> > TCP_RR 56 threads 1.00 ( 0.33) +0.31 ( 0.35)
> > TCP_RR 84 threads 1.00 ( 0.23) +0.15 ( 0.28)
> > TCP_RR 112 threads 1.00 ( 0.20) +0.03 ( 0.21)
> > TCP_RR 140 threads 1.00 ( 0.17) +0.20 ( 0.18)
> > TCP_RR 168 threads 1.00 ( 0.17) +112.84 ( 40.35)
> > TCP_RR 196 threads 1.00 ( 16.66) +0.39 ( 15.72)
> > TCP_RR 224 threads 1.00 ( 10.28) +0.05 ( 9.97)
> > UDP_RR 28 threads 1.00 ( 16.15) -0.13 ( 0.93)
> > UDP_RR 56 threads 1.00 ( 7.76) +1.24 ( 0.44)
> > UDP_RR 84 threads 1.00 ( 11.68) -0.49 ( 6.33)
> > UDP_RR 112 threads 1.00 ( 8.49) -0.21 ( 7.77)
> > UDP_RR 140 threads 1.00 ( 8.49) +2.05 ( 19.88)
> > UDP_RR 168 threads 1.00 ( 8.91) +1.67 ( 11.74)
> > UDP_RR 196 threads 1.00 ( 19.96) +4.35 ( 21.37)
> > UDP_RR 224 threads 1.00 ( 19.44) +4.38 ( 16.61)
> >
> > tbench
> > ======
> > case load baseline(std%) compare%( std%)
> > loopback 28 threads 1.00 ( 0.12) +0.57 ( 0.12)
> > loopback 56 threads 1.00 ( 0.11) +0.42 ( 0.11)
> > loopback 84 threads 1.00 ( 0.09) +0.71 ( 0.03)
> > loopback 112 threads 1.00 ( 0.03) -0.13 ( 0.08)
> > loopback 140 threads 1.00 ( 0.29) +0.59 ( 0.01)
> > loopback 168 threads 1.00 ( 0.01) +0.86 ( 0.03)
> > loopback 196 threads 1.00 ( 0.02) +0.97 ( 0.21)
> > loopback 224 threads 1.00 ( 0.04) +0.83 ( 0.22)
> >
> > schbench
> > ========
> > case load baseline(std%) compare%( std%)
> > normal 1 mthread 1.00 ( 0.00) -8.82 ( 0.00)
> > normal 2 mthreads 1.00 ( 0.00) +0.00 ( 0.00)
> > normal 4 mthreads 1.00 ( 0.00) +17.02 ( 0.00)
> > normal 8 mthreads 1.00 ( 0.00) -4.84 ( 0.00)
> >
> > Signed-off-by: Abel Wu <[email protected]>
> > ---
> > include/linux/sched/topology.h | 6 +++++
> > kernel/sched/core.c | 1 +
> > kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++
> > kernel/sched/features.h | 1 +
> > kernel/sched/sched.h | 2 ++
> > kernel/sched/topology.c | 3 ++-
> > 6 files changed, 59 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 816df6cc444e..c03076850a67 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -82,8 +82,14 @@ struct sched_domain_shared {
> > atomic_t nr_busy_cpus;
> > int has_idle_cores;
> > int nr_idle_scan;
> > + unsigned long overloaded_cpus[];
> > };
> > +static inline struct cpumask *sdo_mask(struct sched_domain_shared *sds)
> > +{
> > + return to_cpumask(sds->overloaded_cpus);
> > +}
> > +
> > struct sched_domain {
> > /* These fields must be setup */
> > struct sched_domain __rcu *parent; /* top domain must be null terminated */
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d3e2c5a7c1b7..452eb63ee6f6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5395,6 +5395,7 @@ void scheduler_tick(void)
> > resched_latency = cpu_resched_latency(rq);
> > calc_global_load_tick(rq);
> > sched_core_tick(rq);
> > + update_overloaded_rq(rq);
>
> I didn't see this update in the idle path. Is this intended?
>
It is intended to exclude the idle path. My thought was that, since
the avg_util already contains the historic activity, checking the cpu
status in each idle path seems to make little difference...
> > rq_unlock(rq, &rf);
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f80ae86bb404..34b1650f85f6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6323,6 +6323,50 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> > #endif /* CONFIG_SCHED_SMT */
> > +/* derived from group_is_overloaded() */
> > +static inline bool rq_overloaded(struct rq *rq, int cpu, unsigned int imbalance_pct)
> > +{
> > + if (rq->nr_running - rq->cfs.idle_h_nr_running <= 1)
> > + return false;
> > +
> > + if ((SCHED_CAPACITY_SCALE * 100) <
> > + (cpu_util_cfs(cpu) * imbalance_pct))
> > + return true;
> > +
> > + if ((SCHED_CAPACITY_SCALE * imbalance_pct) <
> > + (cpu_runnable(rq) * 100))
> > + return true;
>
> So the filter now contains cpus that are over-utilized or overloaded.
> This goes a step further to make the filter reliable, at the cost
> of sacrificing some scan efficiency.
>
Right. Ideally if there is a 'realtime' idle cpumask for SIS, the
scan would be quite accurate. The issue is how to maintain this
cpumask at low cost.
> The idea behind my recent patches is to keep the filter radical,
> but use it conservatively.
>
Do you mean update the per-core idle filter frequently, but only
propagate the filter to the LLC cpumask when the system is overloaded?
> > +
> > + return false;
> > +}
> > +
> > +void update_overloaded_rq(struct rq *rq)
> > +{
> > + struct sched_domain_shared *sds;
> > + struct sched_domain *sd;
> > + int cpu;
> > +
> > + if (!sched_feat(SIS_FILTER))
> > + return;
> > +
> > + cpu = cpu_of(rq);
> > + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> > + if (unlikely(!sd))
> > + return;
> > +
> > + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > + if (unlikely(!sds))
> > + return;
> > +
> > + if (rq_overloaded(rq, cpu, sd->imbalance_pct)) {
>
> I'm not sure whether it is appropriate to use LLC imbalance pct here,
> because we are comparing inside the LLC rather than between the LLCs.
>
Right, the imbalance_pct should not be the LLC's; it could be the core
domain's imbalance_pct.
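Something like the following is what I have in mind (a rough sketch only;
the helper name and the fallback value are made up here and not part of the
posted patch):

/* Sketch: take imbalance_pct from the cpu's lowest (SMT/core) domain. */
static inline unsigned int overload_pct(int cpu)
{
        struct sched_domain *sd = rcu_dereference(cpu_rq(cpu)->sd);

        /* arbitrary fallback in case no domain is attached */
        return sd ? sd->imbalance_pct : 100;
}

rq_overloaded() could then use this value instead of the LLC domain's
imbalance_pct.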
> > + /* avoid duplicated write, mitigate cache contention */
> > + if (!cpumask_test_cpu(cpu, sdo_mask(sds)))
> > + cpumask_set_cpu(cpu, sdo_mask(sds));
> > + } else {
> > + if (cpumask_test_cpu(cpu, sdo_mask(sds)))
> > + cpumask_clear_cpu(cpu, sdo_mask(sds));
> > + }
> > +}
> > /*
> > * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> > * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> > @@ -6383,6 +6427,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > }
> > }
> > + if (sched_feat(SIS_FILTER) && !has_idle_core && sd->shared)
> > + cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));
> > +
> > for_each_cpu_wrap(cpu, cpus, target + 1) {
> > if (has_idle_core) {
> > i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index ee7f23c76bd3..1bebdb87c2f4 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> > */
> > SCHED_FEAT(SIS_PROP, false)
> > SCHED_FEAT(SIS_UTIL, true)
> > +SCHED_FEAT(SIS_FILTER, true)
>
> The filter should be enabled when there is a need. If the system
> is idle enough, I don't think it's a good idea to clear out the
> overloaded cpus from domain scan. Making the filter a sched-feat
> won't help the problem.
>
> My latest patch will only apply the filter when nr is less than
> the LLC size.
Do you mean only update the filter (idle cpu mask), or only use the
filter in SIS when the system meets: nr_running < LLC size?
thanks,
Chenyu
> It doesn't work perfectly yet, but it is still better
> than doing nothing as in my v4 patchset.
>
>
> I will give this patch a test on my machine in a few days.
>
On 7/11/22 8:02 PM, Chen Yu Wrote:
>>> ...
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index d3e2c5a7c1b7..452eb63ee6f6 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -5395,6 +5395,7 @@ void scheduler_tick(void)
>>> resched_latency = cpu_resched_latency(rq);
>>> calc_global_load_tick(rq);
>>> sched_core_tick(rq);
>>> + update_overloaded_rq(rq);
>>
>> I didn't see this update in the idle path. Is this intended?
>>
> It is intended to exclude the idle path. My thought was that, since
> the avg_util already contains the historic activity, checking the cpu
> status in each idle path seems to make little difference...
I presume the avg_util is used to decide how many cpus to scan, while
the update determines which cpus to scan.
>>> rq_unlock(rq, &rf);
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index f80ae86bb404..34b1650f85f6 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6323,6 +6323,50 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>>> #endif /* CONFIG_SCHED_SMT */
>>> +/* derived from group_is_overloaded() */
>>> +static inline bool rq_overloaded(struct rq *rq, int cpu, unsigned int imbalance_pct)
>>> +{
>>> + if (rq->nr_running - rq->cfs.idle_h_nr_running <= 1)
>>> + return false;
>>> +
>>> + if ((SCHED_CAPACITY_SCALE * 100) <
>>> + (cpu_util_cfs(cpu) * imbalance_pct))
>>> + return true;
>>> +
>>> + if ((SCHED_CAPACITY_SCALE * imbalance_pct) <
>>> + (cpu_runnable(rq) * 100))
>>> + return true;
>>
>> So the filter now contains cpus that are over-utilized or overloaded.
>> This goes a step further to make the filter reliable, at the cost
>> of sacrificing some scan efficiency.
>>
> Right. Ideally if there is a 'realtime' idle cpumask for SIS, the
> scan would be quite accurate. The issue is how to maintain this
> cpumask at low cost.
Yes indeed.
>> The idea behind my recent patches is to keep the filter radical,
>> but use it conservatively.
>>
> Do you mean update the per-core idle filter frequently, but only
> propagate the filter to the LLC cpumask when the system is overloaded?
Not exactly. I want to update the filter (BTW there is only the LLC
filter, no core filters :)) once the core state changes, and apply it
in the SIS domain scan only if the domain is busy enough.
>>> +
>>> + return false;
>>> +}
>>> +
>>> +void update_overloaded_rq(struct rq *rq)
>>> +{
>>> + struct sched_domain_shared *sds;
>>> + struct sched_domain *sd;
>>> + int cpu;
>>> +
>>> + if (!sched_feat(SIS_FILTER))
>>> + return;
>>> +
>>> + cpu = cpu_of(rq);
>>> + sd = rcu_dereference(per_cpu(sd_llc, cpu));
>>> + if (unlikely(!sd))
>>> + return;
>>> +
>>> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>>> + if (unlikely(!sds))
>>> + return;
>>> +
>>> + if (rq_overloaded(rq, cpu, sd->imbalance_pct)) {
>>
>> I'm not sure whether it is appropriate to use LLC imbalance pct here,
>> because we are comparing inside the LLC rather than between the LLCs.
>>
> Right, the imbalance_pct should not be the LLC's; it could be the core
> domain's imbalance_pct.
>>> + /* avoid duplicated write, mitigate cache contention */
>>> + if (!cpumask_test_cpu(cpu, sdo_mask(sds)))
>>> + cpumask_set_cpu(cpu, sdo_mask(sds));
>>> + } else {
>>> + if (cpumask_test_cpu(cpu, sdo_mask(sds)))
>>> + cpumask_clear_cpu(cpu, sdo_mask(sds));
>>> + }
>>> +}
>>> /*
>>> * Scan the LLC domain for idle CPUs; this is dynamically regulated by
>>> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
>>> @@ -6383,6 +6427,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>> }
>>> }
>>> + if (sched_feat(SIS_FILTER) && !has_idle_core && sd->shared)
>>> + cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));
>>> +
>>> for_each_cpu_wrap(cpu, cpus, target + 1) {
>>> if (has_idle_core) {
>>> i = select_idle_core(p, cpu, cpus, &idle_cpu);
>>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>>> index ee7f23c76bd3..1bebdb87c2f4 100644
>>> --- a/kernel/sched/features.h
>>> +++ b/kernel/sched/features.h
>>> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>>> */
>>> SCHED_FEAT(SIS_PROP, false)
>>> SCHED_FEAT(SIS_UTIL, true)
>>> +SCHED_FEAT(SIS_FILTER, true)
>>
>> The filter should be enabled when there is a need. If the system
>> is idle enough, I don't think it's a good idea to clear out the
>> overloaded cpus from domain scan. Making the filter a sched-feat
>> won't help the problem.
>>
>> My latest patch will only apply the filter when nr is less than
>> the LLC size.
> Do you mean only update the filter (idle cpu mask), or only use the
> filter in SIS when the system meets: nr_running < LLC size?
>
In the SIS domain search, apply the filter when nr < LLC_size. But I
haven't tested this with SIS_UTIL, and in the SIS_UTIL case this
condition seems to always hold.
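Roughly like below, on top of the select_idle_cpu() hunk quoted earlier
(just a sketch; the exact gating condition is still to be settled):

        /*
         * Consult the filter only when the scan budget is already clamped
         * below the LLC weight, i.e. the LLC looks busy enough.
         */
        if (sched_feat(SIS_FILTER) && !has_idle_core && sd->shared &&
            nr < per_cpu(sd_llc_size, target))
                cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));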
Thanks,
Abel
Hello Abel,
We've tested the patch on a dual socket Zen3 System (2 x 64C/128T).
tl;dr
- There is a noticeable regression for Hackbench with the system
configured in NPS4 mode. This regression is more noticeable
with SIS_UTIL enabled and not as severe with SIS_PROP.
This regression is surprising given that the patch should have
improved SIS Efficiency in the case of a fully loaded system, and it is
consistently reproducible across multiple runs and reboots.
- Apart from the above anomaly, the results look positive overall
with the patched kernel behaving as well as, or better than the tip.
I'll leave the full test results below.
On 6/19/2022 5:34 PM, Abel Wu wrote:
>
> [..snip..]
>
> Benchmark
> =========
>
> Tests are done in an Intel Xeon(R) Platinum 8260 [email protected] machine
> with 2 NUMA nodes each of which has 24 cores with SMT2 enabled, so 96
> CPUs in total.
>
> All of the benchmarks are done inside a normal cpu cgroup in a clean
> environment with cpu turbo disabled.
>
> Based on tip sched/core f3dd3f674555 (v5.19-rc2).
>
> Results
> =======
>
> hackbench-process-pipes
> vanilla filter
> Amean 1 0.2357 ( 0.00%) 0.2540 ( -7.78%)
> Amean 4 0.6403 ( 0.00%) 0.6363 ( 0.62%)
> Amean 7 0.7583 ( 0.00%) 0.7367 ( 2.86%)
> Amean 12 1.0703 ( 0.00%) 1.0520 ( 1.71%)
> Amean 21 2.0363 ( 0.00%) 1.9610 * 3.70%*
> Amean 30 3.2487 ( 0.00%) 2.9083 * 10.48%*
> Amean 48 6.3620 ( 0.00%) 4.8543 * 23.70%*
> Amean 79 8.3653 ( 0.00%) 7.1077 * 15.03%*
> Amean 110 9.8370 ( 0.00%) 8.5740 * 12.84%*
> Amean 141 11.4667 ( 0.00%) 10.8750 * 5.16%*
> Amean 172 13.4433 ( 0.00%) 12.6443 * 5.94%*
> Amean 203 15.8970 ( 0.00%) 14.9143 * 6.18%*
> Amean 234 17.9643 ( 0.00%) 16.9123 * 5.86%*
> Amean 265 20.3910 ( 0.00%) 19.2040 * 5.82%*
> Amean 296 22.5253 ( 0.00%) 21.2547 * 5.64%*
>
> hackbench-process-sockets
> vanilla filter
> Amean 1 0.4177 ( 0.00%) 0.4133 ( 1.04%)
> Amean 4 1.4397 ( 0.00%) 1.4240 * 1.09%*
> Amean 7 2.4720 ( 0.00%) 2.4310 * 1.66%*
> Amean 12 4.1407 ( 0.00%) 4.0683 * 1.75%*
> Amean 21 7.0550 ( 0.00%) 6.8830 * 2.44%*
> Amean 30 9.9633 ( 0.00%) 9.7750 * 1.89%*
> Amean 48 15.9837 ( 0.00%) 15.5313 * 2.83%*
> Amean 79 26.7740 ( 0.00%) 26.2703 * 1.88%*
> Amean 110 37.2913 ( 0.00%) 36.5433 * 2.01%*
> Amean 141 47.8937 ( 0.00%) 46.5300 * 2.85%*
> Amean 172 58.0273 ( 0.00%) 56.4530 * 2.71%*
> Amean 203 68.2530 ( 0.00%) 66.3320 * 2.81%*
> Amean 234 78.8987 ( 0.00%) 76.8497 * 2.60%*
> Amean 265 89.1520 ( 0.00%) 86.8213 * 2.61%*
> Amean 296 99.6920 ( 0.00%) 96.9707 * 2.73%*
>
> hackbench-thread-pipes
> vanilla filter
> Amean 1 0.2647 ( 0.00%) 0.2633 ( 0.50%)
> Amean 4 0.6290 ( 0.00%) 0.6607 ( -5.03%)
> Amean 7 0.7850 ( 0.00%) 0.7870 ( -0.25%)
> Amean 12 1.3347 ( 0.00%) 1.2577 ( 5.77%)
> Amean 21 3.1233 ( 0.00%) 2.4613 * 21.20%*
> Amean 30 5.7120 ( 0.00%) 3.6847 * 35.49%*
> Amean 48 8.1947 ( 0.00%) 6.2670 * 23.52%*
> Amean 79 9.1750 ( 0.00%) 8.0640 * 12.11%*
> Amean 110 10.6300 ( 0.00%) 9.5583 * 10.08%*
> Amean 141 12.7490 ( 0.00%) 12.0167 * 5.74%*
> Amean 172 15.1567 ( 0.00%) 14.1570 * 6.60%*
> Amean 203 17.5160 ( 0.00%) 16.7883 ( 4.15%)
> Amean 234 19.8710 ( 0.00%) 19.5370 ( 1.68%)
> Amean 265 23.2700 ( 0.00%) 21.4017 * 8.03%*
> Amean 296 25.4093 ( 0.00%) 23.9943 * 5.57%*
>
> hackbench-thread-sockets
> vanilla filter
> Amean 1 0.4467 ( 0.00%) 0.4347 ( 2.69%)
> Amean 4 1.4757 ( 0.00%) 1.4533 * 1.51%*
> Amean 7 2.5320 ( 0.00%) 2.4993 * 1.29%*
> Amean 12 4.2617 ( 0.00%) 4.1780 * 1.96%*
> Amean 21 7.2397 ( 0.00%) 7.0660 * 2.40%*
> Amean 30 10.2200 ( 0.00%) 9.9810 * 2.34%*
> Amean 48 16.2623 ( 0.00%) 16.0483 * 1.32%*
> Amean 79 27.4307 ( 0.00%) 26.8410 * 2.15%*
> Amean 110 37.8993 ( 0.00%) 37.3223 * 1.52%*
> Amean 141 48.3890 ( 0.00%) 47.4823 * 1.87%*
> Amean 172 58.9887 ( 0.00%) 57.7753 * 2.06%*
> Amean 203 69.5853 ( 0.00%) 68.0563 * 2.20%*
> Amean 234 80.0743 ( 0.00%) 78.4857 * 1.98%*
> Amean 265 90.5473 ( 0.00%) 89.3363 * 1.34%*
> Amean 296 101.3857 ( 0.00%) 99.7717 * 1.59%*
>
> schbench
> vanilla filter
> Lat 50.0th-qrtle-1 6.00 ( 0.00%) 6.00 ( 0.00%)
> Lat 75.0th-qrtle-1 8.00 ( 0.00%) 8.00 ( 0.00%)
> Lat 90.0th-qrtle-1 9.00 ( 0.00%) 8.00 ( 11.11%)
> Lat 95.0th-qrtle-1 9.00 ( 0.00%) 8.00 ( 11.11%)
> Lat 99.0th-qrtle-1 10.00 ( 0.00%) 9.00 ( 10.00%)
> Lat 99.5th-qrtle-1 11.00 ( 0.00%) 9.00 ( 18.18%)
> Lat 99.9th-qrtle-1 11.00 ( 0.00%) 9.00 ( 18.18%)
> Lat 50.0th-qrtle-2 6.00 ( 0.00%) 7.00 ( -16.67%)
> Lat 75.0th-qrtle-2 7.00 ( 0.00%) 8.00 ( -14.29%)
> Lat 90.0th-qrtle-2 8.00 ( 0.00%) 9.00 ( -12.50%)
> Lat 95.0th-qrtle-2 8.00 ( 0.00%) 10.00 ( -25.00%)
> Lat 99.0th-qrtle-2 9.00 ( 0.00%) 11.00 ( -22.22%)
> Lat 99.5th-qrtle-2 9.00 ( 0.00%) 11.00 ( -22.22%)
> Lat 99.9th-qrtle-2 9.00 ( 0.00%) 11.00 ( -22.22%)
> Lat 50.0th-qrtle-4 9.00 ( 0.00%) 8.00 ( 11.11%)
> Lat 75.0th-qrtle-4 10.00 ( 0.00%) 10.00 ( 0.00%)
> Lat 90.0th-qrtle-4 11.00 ( 0.00%) 11.00 ( 0.00%)
> Lat 95.0th-qrtle-4 12.00 ( 0.00%) 11.00 ( 8.33%)
> Lat 99.0th-qrtle-4 13.00 ( 0.00%) 13.00 ( 0.00%)
> Lat 99.5th-qrtle-4 13.00 ( 0.00%) 16.00 ( -23.08%)
> Lat 99.9th-qrtle-4 13.00 ( 0.00%) 19.00 ( -46.15%)
> Lat 50.0th-qrtle-8 11.00 ( 0.00%) 11.00 ( 0.00%)
> Lat 75.0th-qrtle-8 14.00 ( 0.00%) 14.00 ( 0.00%)
> Lat 90.0th-qrtle-8 16.00 ( 0.00%) 16.00 ( 0.00%)
> Lat 95.0th-qrtle-8 17.00 ( 0.00%) 17.00 ( 0.00%)
> Lat 99.0th-qrtle-8 22.00 ( 0.00%) 19.00 ( 13.64%)
> Lat 99.5th-qrtle-8 28.00 ( 0.00%) 23.00 ( 17.86%)
> Lat 99.9th-qrtle-8 31.00 ( 0.00%) 42.00 ( -35.48%)
> Lat 50.0th-qrtle-16 17.00 ( 0.00%) 17.00 ( 0.00%)
> Lat 75.0th-qrtle-16 23.00 ( 0.00%) 23.00 ( 0.00%)
> Lat 90.0th-qrtle-16 26.00 ( 0.00%) 27.00 ( -3.85%)
> Lat 95.0th-qrtle-16 28.00 ( 0.00%) 29.00 ( -3.57%)
> Lat 99.0th-qrtle-16 32.00 ( 0.00%) 33.00 ( -3.12%)
> Lat 99.5th-qrtle-16 37.00 ( 0.00%) 35.00 ( 5.41%)
> Lat 99.9th-qrtle-16 54.00 ( 0.00%) 46.00 ( 14.81%)
> Lat 50.0th-qrtle-32 30.00 ( 0.00%) 29.00 ( 3.33%)
> Lat 75.0th-qrtle-32 43.00 ( 0.00%) 42.00 ( 2.33%)
> Lat 90.0th-qrtle-32 51.00 ( 0.00%) 49.00 ( 3.92%)
> Lat 95.0th-qrtle-32 54.00 ( 0.00%) 51.00 ( 5.56%)
> Lat 99.0th-qrtle-32 61.00 ( 0.00%) 57.00 ( 6.56%)
> Lat 99.5th-qrtle-32 64.00 ( 0.00%) 60.00 ( 6.25%)
> Lat 99.9th-qrtle-32 72.00 ( 0.00%) 82.00 ( -13.89%)
> Lat 50.0th-qrtle-47 44.00 ( 0.00%) 45.00 ( -2.27%)
> Lat 75.0th-qrtle-47 64.00 ( 0.00%) 65.00 ( -1.56%)
> Lat 90.0th-qrtle-47 75.00 ( 0.00%) 77.00 ( -2.67%)
> Lat 95.0th-qrtle-47 81.00 ( 0.00%) 82.00 ( -1.23%)
> Lat 99.0th-qrtle-47 92.00 ( 0.00%) 98.00 ( -6.52%)
> Lat 99.5th-qrtle-47 101.00 ( 0.00%) 114.00 ( -12.87%)
> Lat 99.9th-qrtle-47 271.00 ( 0.00%) 167.00 ( 38.38%)
>
> netperf-udp
> vanilla filter
> Hmean send-64 199.12 ( 0.00%) 201.32 ( 1.11%)
> Hmean send-128 396.22 ( 0.00%) 397.01 ( 0.20%)
> Hmean send-256 777.80 ( 0.00%) 783.96 ( 0.79%)
> Hmean send-1024 2972.62 ( 0.00%) 3011.87 * 1.32%*
> Hmean send-2048 5600.64 ( 0.00%) 5730.50 * 2.32%*
> Hmean send-3312 8757.45 ( 0.00%) 8703.62 ( -0.61%)
> Hmean send-4096 10578.90 ( 0.00%) 10590.93 ( 0.11%)
> Hmean send-8192 17051.22 ( 0.00%) 17189.62 * 0.81%*
> Hmean send-16384 27915.16 ( 0.00%) 27816.01 ( -0.36%)
> Hmean recv-64 199.12 ( 0.00%) 201.32 ( 1.11%)
> Hmean recv-128 396.22 ( 0.00%) 397.01 ( 0.20%)
> Hmean recv-256 777.80 ( 0.00%) 783.96 ( 0.79%)
> Hmean recv-1024 2972.62 ( 0.00%) 3011.87 * 1.32%*
> Hmean recv-2048 5600.64 ( 0.00%) 5730.49 * 2.32%*
> Hmean recv-3312 8757.45 ( 0.00%) 8703.61 ( -0.61%)
> Hmean recv-4096 10578.90 ( 0.00%) 10590.93 ( 0.11%)
> Hmean recv-8192 17051.21 ( 0.00%) 17189.57 * 0.81%*
> Hmean recv-16384 27915.08 ( 0.00%) 27815.86 ( -0.36%)
>
> netperf-tcp
> vanilla filter
> Hmean 64 811.07 ( 0.00%) 835.46 * 3.01%*
> Hmean 128 1614.86 ( 0.00%) 1652.27 * 2.32%*
> Hmean 256 3131.16 ( 0.00%) 3119.01 ( -0.39%)
> Hmean 1024 10286.12 ( 0.00%) 10333.64 ( 0.46%)
> Hmean 2048 16231.88 ( 0.00%) 17141.88 * 5.61%*
> Hmean 3312 20705.91 ( 0.00%) 21703.49 * 4.82%*
> Hmean 4096 22650.75 ( 0.00%) 23904.09 * 5.53%*
> Hmean 8192 27984.06 ( 0.00%) 29170.57 * 4.24%*
> Hmean 16384 32816.85 ( 0.00%) 33351.41 * 1.63%*
>
> tbench4 Throughput
> vanilla filter
> Hmean 1 300.07 ( 0.00%) 302.52 * 0.82%*
> Hmean 2 617.72 ( 0.00%) 598.45 * -3.12%*
> Hmean 4 1213.99 ( 0.00%) 1206.36 * -0.63%*
> Hmean 8 2373.78 ( 0.00%) 2372.28 * -0.06%*
> Hmean 16 4777.82 ( 0.00%) 4711.44 * -1.39%*
> Hmean 32 7182.50 ( 0.00%) 7718.15 * 7.46%*
> Hmean 64 8611.44 ( 0.00%) 9409.29 * 9.27%*
> Hmean 128 18102.63 ( 0.00%) 20650.23 * 14.07%*
> Hmean 256 18029.28 ( 0.00%) 20611.03 * 14.32%*
> Hmean 384 17986.44 ( 0.00%) 19361.29 * 7.64%
Following are the results from dual socket Zen3 platform (2 x 64C/128T) running with
various NPS configuration:
Following is the NUMA configuration for each NPS mode on the system:
NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.
Node 0: 0-63, 128-191
Node 1: 64-127, 192-255
NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.
Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255
NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.
Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255
Kernel versions:
- tip: 5.19-rc2 tip sched/core
- SIS_Eff: 5.19-rc2 tip sched/core + this patch series
When we started testing, the tip was at:
commit: c02d5546ea34 "sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling"
Note: All the testing was done with SIS_UTIL as the default SIS logic.
~~~~~~~~~
Hackbench
~~~~~~~~~
NPS1
Test: tip SIS_Eff
1-groups: 4.85 (0.00 pct) 4.83 (0.41 pct)
2-groups: 5.06 (0.00 pct) 5.24 (-3.55 pct)
4-groups: 5.25 (0.00 pct) 5.29 (-0.76 pct)
8-groups: 5.64 (0.00 pct) 5.57 (1.24 pct)
16-groups: 7.46 (0.00 pct) 7.40 (0.80 pct)
NPS2
Test: tip SIS_Eff
1-groups: 4.73 (0.00 pct) 4.70 (0.63 pct)
2-groups: 4.94 (0.00 pct) 4.95 (-0.20 pct)
4-groups: 5.14 (0.00 pct) 5.10 (0.77 pct)
8-groups: 5.59 (0.00 pct) 5.48 (1.96 pct)
16-groups: 7.42 (0.00 pct) 7.52 (-1.34 pct)
NPS4
Test: tip SIS_Eff
1-groups: 4.75 (0.00 pct) 4.76 (-0.21 pct)
2-groups: 5.10 (0.00 pct) 5.01 (1.76 pct)
4-groups: 5.87 (0.00 pct) 5.54 (5.62 pct)
8-groups: 5.97 (0.00 pct) 6.30 (-5.52 pct) *
8-groups: 5.80 (0.00 pct) 5.95 (-2.58 pct) [Verification Run]
16-groups: 8.33 (0.00 pct) 13.45 (-61.46 pct) * (System Overloaded ~ 2.5 task per CPU)
16-groups: 8.25 (0.00 pct) 14.26 (-72.84 pct) [Verification Run]
16-groups: 8.25 (0.00 pct) 13.44 (-62.90 pct) [Verification Run]
~~~~~~~~
schbench
~~~~~~~~
NPS1
#workers: tip SIS_Eff
1: 24.50 (0.00 pct) 19.00 (22.44 pct)
2: 28.00 (0.00 pct) 37.50 (-33.92 pct) *
2: 21.50 (0.00 pct) 23.00 (-6.97 pct) [Verification Run]
4: 35.50 (0.00 pct) 31.00 (12.67 pct)
8: 44.50 (0.00 pct) 43.50 (2.24 pct)
16: 69.50 (0.00 pct) 67.00 (3.59 pct)
32: 103.00 (0.00 pct) 102.50 (0.48 pct)
64: 183.50 (0.00 pct) 183.00 (0.27 pct)
128: 388.50 (0.00 pct) 380.50 (2.05 pct)
256: 868.00 (0.00 pct) 868.00 (0.00 pct)
512: 57856.00 (0.00 pct) 60224.00 (-4.09 pct)
NPS2
#workers: tip SIS_Eff
1: 18.50 (0.00 pct) 18.50 (0.00 pct)
2: 33.00 (0.00 pct) 27.50 (16.66 pct)
4: 37.00 (0.00 pct) 35.50 (4.05 pct)
8: 54.00 (0.00 pct) 52.50 (2.77 pct)
16: 70.50 (0.00 pct) 74.00 (-4.96 pct) *
16: 79.00 (0.00 pct) 71.00 (10.12 pct) [Verification Run]
32: 103.50 (0.00 pct) 105.00 (-1.44 pct)
64: 183.50 (0.00 pct) 192.00 (-4.63 pct) *
64: 179.50 (0.00 pct) 179.50 (0.00 pct) [Verification Run]
128: 378.50 (0.00 pct) 380.00 (-0.39 pct)
256: 902.00 (0.00 pct) 907.00 (-0.55 pct)
512: 58304.00 (0.00 pct) 60160.00 (-3.18 pct)
NPS4
#workers: tip SIS_Eff
1: 20.50 (0.00 pct) 18.50 (9.75 pct)
2: 35.00 (0.00 pct) 35.50 (-1.42 pct)
4: 36.00 (0.00 pct) 36.00 (0.00 pct)
8: 47.00 (0.00 pct) 52.00 (-10.63 pct) *
8: 54.00 (0.00 pct) 51.00 (5.55 pct) [Verification Run]
16: 83.50 (0.00 pct) 85.00 (-1.79 pct)
32: 102.50 (0.00 pct) 102.50 (0.00 pct)
64: 178.50 (0.00 pct) 181.00 (-1.40 pct)
128: 366.50 (0.00 pct) 380.00 (-3.68 pct)
256: 921.00 (0.00 pct) 920.00 (0.10 pct)
512: 60032.00 (0.00 pct) 60096.00 (-0.10 pct)
Note: schbench shows run-to-run variance which is linked to
newidle balance.
~~~~~~
tbench
~~~~~~
NPS1
Clients: tip SIS_Eff
1 446.42 (0.00 pct) 446.15 (-0.06 pct)
2 869.92 (0.00 pct) 876.12 (0.71 pct)
4 1637.44 (0.00 pct) 1641.65 (0.25 pct)
8 3210.27 (0.00 pct) 3213.43 (0.09 pct)
16 6196.44 (0.00 pct) 5972.09 (-3.62 pct)
32 11844.65 (0.00 pct) 10903.18 (-7.94 pct) *
32 11761.81 (0.00 pct) 12072.37 (2.64 pct) [Verification Run]
64 21678.40 (0.00 pct) 21747.25 (0.31 pct)
128 31311.82 (0.00 pct) 31499.24 (0.59 pct)
256 50250.62 (0.00 pct) 50930.33 (1.35 pct)
512 51377.40 (0.00 pct) 51377.19 (0.00 pct)
1024 53628.03 (0.00 pct) 53876.23 (0.46 pct)
NPS2
Clients: tip SIS_Eff
1 445.03 (0.00 pct) 447.66 (0.59 pct)
2 865.65 (0.00 pct) 872.88 (0.83 pct)
4 1619.09 (0.00 pct) 1645.17 (1.61 pct)
8 3117.29 (0.00 pct) 3168.90 (1.65 pct)
16 5950.42 (0.00 pct) 5657.02 (-4.93 pct) *
16 5663.73 (0.00 pct) 5915.61 (4.44 pct) [Verification Run]
32 11708.61 (0.00 pct) 11342.37 (-3.12 pct)
64 20415.11 (0.00 pct) 20876.49 (2.25 pct)
128 30134.02 (0.00 pct) 29912.02 (-0.73 pct)
256 41660.75 (0.00 pct) 49418.80 (18.62 pct)
512 49560.67 (0.00 pct) 52372.63 (5.67 pct)
1024 52340.77 (0.00 pct) 53714.13 (2.62 pct)
NPS4
Clients: tip SIS_Eff
1 441.41 (0.00 pct) 446.12 (1.06 pct)
2 863.53 (0.00 pct) 871.30 (0.89 pct)
4 1646.06 (0.00 pct) 1646.85 (0.04 pct)
8 3160.26 (0.00 pct) 3103.85 (-1.78 pct)
16 5605.21 (0.00 pct) 5853.23 (4.42 pct)
32 10657.29 (0.00 pct) 11397.58 (6.94 pct)
64 21054.34 (0.00 pct) 20234.69 (-3.89 pct)
128 31285.51 (0.00 pct) 30188.03 (-3.50 pct)
256 49287.47 (0.00 pct) 51330.34 (4.14 pct)
512 50042.95 (0.00 pct) 52658.13 (5.22 pct)
1024 52589.40 (0.00 pct) 53067.80 (0.90 pct)
~~~~~~
Stream
~~~~~~
- 10 runs
NPS1
Test: tip SIS_Eff
Copy: 211443.49 (0.00 pct) 224768.55 (6.30 pct)
Scale: 209797.67 (0.00 pct) 208050.69 (-0.83 pct)
Add: 250586.89 (0.00 pct) 250268.08 (-0.12 pct)
Triad: 243034.31 (0.00 pct) 244552.37 (0.62 pct)
NPS2
Test: tip SIS_Eff
Copy: 146432.23 (0.00 pct) 140412.56 (-4.11 pct)
Scale: 179508.95 (0.00 pct) 178038.96 (-0.81 pct)
Add: 188328.69 (0.00 pct) 189046.25 (0.38 pct)
Triad: 186353.95 (0.00 pct) 187660.23 (0.70 pct)
NPS4
Test: tip SIS_Eff
Copy: 133934.59 (0.00 pct) 132595.29 (-0.99 pct)
Scale: 179877.71 (0.00 pct) 185094.80 (2.90 pct)
Add: 178079.94 (0.00 pct) 186399.22 (4.67 pct)
Triad: 177277.29 (0.00 pct) 184725.00 (4.20 pct)
- 100 runs
NPS1
Test: tip SIS_Eff
Copy: 220780.64 (0.00 pct) 226194.97 (2.45 pct)
Scale: 204372.85 (0.00 pct) 209461.05 (2.48 pct)
Add: 245548.31 (0.00 pct) 253363.99 (3.18 pct)
Triad: 233611.73 (0.00 pct) 242670.41 (3.87 pct)
NPS2
Test: tip SIS_Eff
Copy: 229624.58 (0.00 pct) 233458.30 (1.66 pct)
Scale: 210163.13 (0.00 pct) 211198.12 (0.49 pct)
Add: 261992.62 (0.00 pct) 260243.74 (-0.66 pct)
Triad: 248713.02 (0.00 pct) 247594.21 (-0.44 pct)
NPS4
Test: tip SIS_Eff
Copy: 249308.79 (0.00 pct) 245362.22 (-1.58 pct)
Scale: 218948.88 (0.00 pct) 217819.07 (-0.51 pct)
Add: 268469.03 (0.00 pct) 270935.96 (0.91 pct)
Triad: 251920.99 (0.00 pct) 255730.95 (1.51 pct)
~~~~~~~~~~~~
ycsb-mongodb
~~~~~~~~~~~~
NPS1
tip: 289306.33 (var: 0.91)
SIS_Eff: 293789.33 (var: 1.38) (1.54 pct)
NPS2
tip: 290226.00 (var: 1.62)
SIS_Eff 290796.00 (var: 1.76) (0.19 pct)
NPS4
tip: 290427.33 (var: 1.57)
SIS_Eff: 288134.33 (var: 0.88) (-0.78pct)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Hackbench - 15 runs statistics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
o NPS 4 - 16 groups (SIS_UTIL)
- tip
Min : 7.35
Max : 12.66
Median : 10.60
AMean : 10.00
GMean : 9.82
HMean : 9.64
AMean Stddev : 1.88
AMean CoefVar : 18.85 pct
- SIS_Eff
Min : 12.32
Max : 18.92
Median : 13.82
AMean : 14.96 (-49.60 pct)
GMean : 14.80
HMean : 14.66
AMean Stddev : 2.25
AMean CoefVar : 15.01 pct
o NPS 4 - 16 groups (SIS_PROP)
- tip
Min : 7.04
Max : 8.22
Median : 7.49
AMean : 7.52
GMean : 7.52
HMean : 7.51
AMean Stddev : 0.29
AMean CoefVar : 3.88 pct
- SIS_Eff
Min : 7.04
Max : 9.78
Median : 8.16
AMean : 8.42 (-11.06 pct)
GMean : 8.39
HMean : 8.36
AMean Stddev : 0.78
AMean CoefVar : 9.23 pct
The Hackbench regression is much more noticeable with SIS_UTIL
enabled, but only when the test machine is running in NPS4 mode.
It is not obvious why this is happening given that the patch series
aims at improving SIS Efficiency.
It would be great if you could test the series with SIS_UTIL
enabled and SIS_PROP disabled to see if it affects any benchmark
behavior, given SIS_UTIL is the default SIS logic currently on
the tip.
>
> [..snip..]
>
Please let me know if you need any more data from the
test system.
--
Thanks and Regards,
Prateek
Hello Abel,
On Sun, Jun 19, 2022 at 08:04:49PM +0800, Abel Wu wrote:
> If a full scan of the SIS domain fails, then no unoccupied cpus are
> available and the LLC is fully busy. In this case we'd better spend the
> time on something more useful, rather than wasting it trying to find an
> idle cpu that probably does not exist.
>
> The fully busy status will be re-evaluated when any core of this LLC
> domain enters load balancing, and cleared once idle cpus found.
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
[..snip..]
> @@ -6197,24 +6201,44 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> EXPORT_SYMBOL_GPL(sched_smt_present);
>
> -static inline void set_idle_cores(int cpu, int val)
> +static inline void sd_set_state(int cpu, enum sd_state state)
Nit: We are setting the state of only the LLC domain and not any other
domain via this function. So should we name it
set_llc_state()/get_llc_state() for better readability?
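For instance (just a sketch; the body quoted below stays as-is, only the
name changes, and get_llc_state() would mirror it):

static inline void set_llc_state(int cpu, enum sd_state state)
{
        struct sched_domain_shared *sds;

        sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
        if (sds)
                WRITE_ONCE(sds->state, state);
}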
> {
> struct sched_domain_shared *sds;
>
> sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> if (sds)
> - WRITE_ONCE(sds->has_idle_cores, val);
> + WRITE_ONCE(sds->state, state);
> }
>
> -static inline bool test_idle_cores(int cpu)
> +static inline enum sd_state sd_get_state(int cpu)
> {
> struct sched_domain_shared *sds;
>
> sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> if (sds)
> - return READ_ONCE(sds->has_idle_cores);
> + return READ_ONCE(sds->state);
>
> - return false;
> + return sd_has_icpus;
> +}
> +
> +static inline void set_idle_cores(int cpu, int idle)
^^^^^
I agree with Josh. We can use core_idle instead of idle here.
> +{
> + sd_set_state(cpu, idle ? sd_has_icores : sd_has_icpus);
> +}
> +
> +static inline bool test_idle_cores(int cpu)
> +{
> + return sd_get_state(cpu) == sd_has_icores;
> +}
> +
> +static inline void set_idle_cpus(int cpu, int idle)
> +{
> + sd_set_state(cpu, idle ? sd_has_icpus : sd_is_busy);
> +}
> +
> +static inline bool test_idle_cpus(int cpu)
> +{
> + return sd_get_state(cpu) != sd_is_busy;
> }
>
> /*
[...]
> @@ -8661,6 +8702,12 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> }
>
> +static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq)
> +{
> + if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq))
Nit: sds->sd_state can either be sd_has_icpus or sd_is_busy. So for
better readability, we can just use the positive check
if (sds->sd_state == sd_is_busy && unoccupied_rq(rq))
sds->sd_state = sd_has_icpus;
> + sds->sd_state = sd_has_icpus;
> +}
> +
> /**
> * update_sg_lb_stats - Update sched_group's statistics for load balancing.
> * @env: The load balancing environment.
> @@ -8675,11 +8722,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> struct sg_lb_stats *sgs,
> int *sg_status)
> {
> - int i, nr_running, local_group;
> + int i, nr_running, local_group, update_core;
>
> memset(sgs, 0, sizeof(*sgs));
>
> local_group = group == sds->local;
> + update_core = env->sd->flags & SD_SHARE_CPUCAPACITY;
>
> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> struct rq *rq = cpu_rq(i);
> @@ -8692,6 +8740,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> nr_running = rq->nr_running;
> sgs->sum_nr_running += nr_running;
>
> + if (update_core)
> + sd_classify(sds, rq);
> +
> if (nr_running > 1)
> *sg_status |= SG_OVERLOAD;
>
> @@ -9220,6 +9271,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> return idlest;
> }
>
> +static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> +{
> + if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
> + set_idle_cpus(env->dst_cpu, true);
We could enter this if condition when env->dst_cpu is the only idle
CPU in the SMT domain (which is likely to be the case every time we do
a NEW_IDLE balance). By the end of this load-balancing round, the
env->dst_cpu can pull a task from some other CPU and thereby no longer
remain idle but the LLC state would still be sd_has_icpus.
That would mean that some CPU on this LLC would do a full scan during
the wakeup only to find no idle CPU and reset the state to
sd_is_busy. Have you seen instances where this false-positive pattern
can result in a wasteful scan, thereby causing a performance degradation?
Ideally it should not be worse than what we currently have.
Apart from this, patch looks good to me.
It would be worthwhile to explore if the LLC state can be used
early on in select_task_rq_fair() to determine whether we need to do a
wake-affine or allow the task to stick to its previous LLC, depending
on which among the previous LLC and the waker's LLC has an idle CPU.
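Something along these lines, perhaps (purely illustrative, and the
helper name is made up; it only relies on the test_idle_cpus() helper
introduced by this patch):

	/*
	 * Illustrative sketch: prefer sticking to the previous LLC when
	 * the waker's LLC reports no unoccupied cpus in its filter state
	 * but the previous LLC still does.
	 */
	static inline bool prefer_prev_llc(int this_cpu, int prev_cpu)
	{
		return !test_idle_cpus(this_cpu) && test_idle_cpus(prev_cpu);
	}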
--
Thanks and Regards
gautham.
Hello Abel,
On Sun, Jun 19, 2022 at 08:04:50PM +0800, Abel Wu wrote:
[..snip..]
>
> +static void sd_update_icpus(int core, int icpu)
How about update_llc_icpus() ?
> +{
> + struct sched_domain_shared *sds;
> + struct cpumask *icpus;
> +
> + sds = rcu_dereference(per_cpu(sd_llc_shared, core));
> + if (!sds)
> + return;
> +
> + icpus = sched_domain_icpus(sds);
> +
> + /*
> + * XXX: The update is racy between different cores.
> + * The non-atomic ops here is a tradeoff of accuracy
> + * for easing the cache traffic.
> + */
> + if (icpu == -1)
> + cpumask_andnot(icpus, icpus, cpu_smt_mask(core));
> + else if (!cpumask_test_cpu(icpu, icpus))
> + __cpumask_set_cpu(icpu, icpus);
> +}
> +
> /*
> * Scans the local SMT mask to see if the entire core is idle, and records this
> * information in sd_llc_shared->has_idle_cores.
> @@ -6340,6 +6362,10 @@ static inline bool test_idle_cpus(int cpu)
> return true;
> }
>
> +static inline void sd_update_icpus(int core, int icpu)
> +{
> +}
> +
> static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> {
> return __select_idle_cpu(core, p);
> @@ -6370,7 +6396,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> if (!this_sd)
> return -1;
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + cpumask_and(cpus, has_idle_core ? sched_domain_span(sd) :
> + sched_domain_icpus(sd->shared), p->cpus_ptr);
With this we get an idea of the likely idle CPUs. However, we may
still want SIS_UTIL on top of this as it determines the number of idle
CPUs to scan based on the utilization average that will iron out any
transient idle CPUs which may feature in
sched_domain_icpus(sd->shared) but are not likely to remain idle. Is
this understanding correct ?
>
> if (sched_feat(SIS_PROP) && !has_idle_core) {
> u64 avg_cost, avg_idle, span_avg;
> @@ -8342,6 +8369,7 @@ struct sd_lb_stats {
> unsigned int prefer_sibling; /* tasks should go to sibling first */
>
> int sd_state;
> + int idle_cpu;
>
> struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
> struct sg_lb_stats local_stat; /* Statistics of the local group */
> @@ -8362,6 +8390,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> .total_load = 0UL,
> .total_capacity = 0UL,
> .sd_state = sd_is_busy,
> + .idle_cpu = -1,
> .busiest_stat = {
> .idle_cpus = UINT_MAX,
> .group_type = group_has_spare,
> @@ -8702,10 +8731,18 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> }
>
> -static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq)
> +static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq, int cpu)
> {
> - if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq))
> + if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq)) {
> + /*
> + * Prefer idle cpus than unoccupied ones. This
> + * is achieved by only allowing the idle ones
> + * unconditionally overwrite the preious record
^^^^^^^^
Nit: previous
> + * while the occupied ones can't.
> + */
This if condition is only executed when we encounter the very first
unoccupied cpu in the SMT domain. So why do we need this comment here
about preferring idle cpus over unoccupied ones ?
> + sds->idle_cpu = cpu;
> sds->sd_state = sd_has_icpus;
> + }
> }
>
> /**
> @@ -8741,7 +8778,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> sgs->sum_nr_running += nr_running;
>
> if (update_core)
> - sd_classify(sds, rq);
> + sd_classify(sds, rq, i);
>
> if (nr_running > 1)
> *sg_status |= SG_OVERLOAD;
> @@ -8757,7 +8794,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> * No need to call idle_cpu() if nr_running is not 0
> */
> if (!nr_running && idle_cpu(i)) {
> + /*
> + * Prefer the last idle cpu by overwriting
> + * preious one. The first idle cpu in this
^^^^^^^
Nit: previous
> + * domain (if any) can trigger balancing
> + * and fed with tasks, so we'd better choose
> + * a candidate in an opposite way.
> + */
This is a better place to call out the fact that an idle cpu is
preferable to an unoccupied cpu.
> + sds->idle_cpu = i;
> sgs->idle_cpus++;
> +
> /* Idle cpu can't have misfit task */
> continue;
> }
> @@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>
> static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> {
> - if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
> - set_idle_cpus(env->dst_cpu, true);
> + struct sched_domain_shared *sd_smt_shared = env->sd->shared;
> + enum sd_state new = sds->sd_state;
> + int this = env->dst_cpu;
> +
> + /*
> + * Parallel updating can hardly contribute accuracy to
> + * the filter, besides it can be one of the burdens on
> + * cache traffic.
> + */
> + if (cmpxchg(&sd_smt_shared->updating, 0, 1))
> + return;
> +
> + /*
> + * There is at least one unoccupied cpu available, so
> + * propagate it to the filter to avoid false negative
> + * issue which could result in lost tracking of some
> + * idle cpus thus throughupt downgraded.
> + */
> + if (new != sd_is_busy) {
> + if (!test_idle_cpus(this))
> + set_idle_cpus(this, true);
> + } else {
> + /*
> + * Nothing changes so nothing to update or
> + * propagate.
> + */
> + if (sd_smt_shared->state == sd_is_busy)
> + goto out;
The main use of sd_smt_shared->state is to detect the transition
between sd_has_icpus --> sd_is_busy, during which sds->idle_cpu == -1,
which will ensure that sd_update_icpus() below clears this core's CPUs
from the LLC's icpus mask. Calling this out may be a more useful
comment instead of the comment above.
> + }
> +
> + sd_update_icpus(this, sds->idle_cpu);
> + sd_smt_shared->state = new;
> +out:
> + xchg(&sd_smt_shared->updating, 0);
> }
--
Thanks and Regards
gautham.
Hello Abel,
On Sun, Jun 19, 2022 at 08:04:51PM +0800, Abel Wu wrote:
> Now when updating core state, there are two main problems that could
> pollute the SIS filter:
>
> - The updating is before task migration, so if dst_cpu is
> selected to be propagated which might be fed with tasks
> soon, the efforts we paid is no more than setting a busy
> cpu into the SIS filter. While on the other hand it is
> important that we update as early as possible to keep the
> filter fresh, so it's not wise to delay the update to the
> end of load balancing.
>
> - False negative propagation hurts performance since some
> idle cpus could be out of reach. So in general we will
> aggressively propagate idle cpus but allow false positive
> continue to exist for a while, which may lead to filter
> being fully polluted.
Ok, so the false positive case is being addressed in this patch.
>
> Pains can be relieved by a force correction when false positive
> continuously detected.
>
[..snip..]
> @@ -111,6 +117,7 @@ struct sched_group;
> enum sd_state {
> sd_has_icores,
> sd_has_icpus,
> + sd_may_idle,
> sd_is_busy
> };
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d55fdcedf2c0..9713d183d35e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
[...snip..]
> @@ -9321,7 +9327,7 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> {
> struct sched_domain_shared *sd_smt_shared = env->sd->shared;
> enum sd_state new = sds->sd_state;
> - int this = env->dst_cpu;
> + int icpu = sds->idle_cpu, this = env->dst_cpu;
>
> /*
> * Parallel updating can hardly contribute accuracy to
> @@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> if (cmpxchg(&sd_smt_shared->updating, 0, 1))
> return;
>
> + /*
> + * The dst_cpu is likely to be fed with tasks soon.
> + * If it is the only unoccupied cpu in this domain,
> + * we still handle it the same way as as_has_icpus
^^^^^^^^^^^^^
Nit: sd_has_icpus
> + * but turn the SMT into the unstable state, rather
> + * than waiting to the end of load balancing since
> + * it's also important that update the filter as
> + * early as possible to keep it fresh.
> + */
> + if (new == sd_is_busy) {
> + if (idle_cpu(this) || sched_idle_cpu(this)) {
> + new = sd_may_idle;
> + icpu = this;
> + }
> + }
> +
> /*
> * There is at least one unoccupied cpu available, so
> * propagate it to the filter to avoid false negative
> @@ -9338,6 +9360,12 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> * idle cpus thus throughupt downgraded.
> */
> if (new != sd_is_busy) {
> + /*
> + * The sd_may_idle state is taken into
> + * consideration as well because from
> + * here we couldn't actually know task
> + * migrations would happen or not.
> + */
> if (!test_idle_cpus(this))
> set_idle_cpus(this, true);
> } else {
> @@ -9347,9 +9375,26 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> */
> if (sd_smt_shared->state == sd_is_busy)
> goto out;
> +
> + /*
> + * Allow false positive to exist for some time
> + * to make a tradeoff of accuracy of the filter
> + * for relieving cache traffic.
> + */
I can understand allowing the false positive to exist when there are
no other idle CPUs in this SMT domain other than this CPU, which is
handled by the case where new != sd_is_busy in the current
load-balance round and will be handled by the "else" clause in the
subsequent round if env->dst_cpu ends up becoming busy.
However, when we know that new == sd_is_busy and the previous state of
this SMT domain was sd_has_icpus, should we not immediately clear this
core's cpumask from the LLC's icpus mask? What is the need for the
intermediate sd_may_idle state transition between sd_has_icpus and
sd_is_busy in this case ?
> + if (sd_smt_shared->state == sd_has_icpus) {
> + new = sd_may_idle;
> + goto update;
> + }
> +
> + /*
> + * If the false positive issue has already been
> + * there for a while, a correction of the filter
> + * is needed.
> + */
> }
>
> sd_update_icpus(this, sds->idle_cpu);
^^^^^^^^^^^^^^
I think you meant to use icpu here ? sds->idle_cpu == -1 in the case
when new == sd_may_idle, which will end up clearing this core's
cpumask from this LLC's icpus mask. This defeats the
"allow-false-positive" heuristic.
> +update:
> sd_smt_shared->state = new;
> out:
> xchg(&sd_smt_shared->updating, 0);
> --
> 2.31.1
>
--
Thanks and Regards
gautham.
Hi Gautham, thanks for your review and sorry for my late reply.
On 7/20/22 11:34 PM, Gautham R. Shenoy Wrote:
>
> [..snip..]
>
>> @@ -6197,24 +6201,44 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>> DEFINE_STATIC_KEY_FALSE(sched_smt_present);
>> EXPORT_SYMBOL_GPL(sched_smt_present);
>>
>> -static inline void set_idle_cores(int cpu, int val)
>> +static inline void sd_set_state(int cpu, enum sd_state state)
>
> Nit: We are setting the state of only the LLC domain and not any other
> domain via this function. So should we name it as
> set_llc_state()/get_llc_state() for better readability ?
>
Makes sense, will rename in next version.
>
>> {
>> struct sched_domain_shared *sds;
>>
>> sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> if (sds)
>> - WRITE_ONCE(sds->has_idle_cores, val);
>> + WRITE_ONCE(sds->state, state);
>> }
>>
>> -static inline bool test_idle_cores(int cpu)
>> +static inline enum sd_state sd_get_state(int cpu)
>> {
>> struct sched_domain_shared *sds;
>>
>> sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> if (sds)
>> - return READ_ONCE(sds->has_idle_cores);
>> + return READ_ONCE(sds->state);
>>
>> - return false;
>> + return sd_has_icpus;
>> +}
>> +
>> +static inline void set_idle_cores(int cpu, int idle)
> ^^^^^
> I agree with Josh. We can use core_idle instead of idle here.
OK, I will make the param more verbose...
>
>> +{
>> + sd_set_state(cpu, idle ? sd_has_icores : sd_has_icpus);
>> +}
>> +
>> +static inline bool test_idle_cores(int cpu)
>> +{
>> + return sd_get_state(cpu) == sd_has_icores;
>> +}
>> +
>> +static inline void set_idle_cpus(int cpu, int idle)
and this one too.
>> +{
>> + sd_set_state(cpu, idle ? sd_has_icpus : sd_is_busy);
>> +}
>> +
>> +static inline bool test_idle_cpus(int cpu)
>> +{
>> + return sd_get_state(cpu) != sd_is_busy;
>> }
>>
>> /*
>
> [...]
>
>
>> @@ -8661,6 +8702,12 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
>> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
>> }
>>
>> +static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq)
>> +{
>> + if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq))
>
> Nit: sds->sd_state can either be sd_has_icpus or sd_is_busy. So for
> better readability, we can just use the positive check
For now, yes. But sd_state can be expanded and once that happens, the
positive check could be error prone.
>
> if (sds->sd_state == sd_is_busy && unoccupied_rq(rq))
> sds->sd_state = sd_has_icpus;
>
>
>> + sds->sd_state = sd_has_icpus;
>> +}
>> +
>> /**
>> * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>> * @env: The load balancing environment.
>> @@ -8675,11 +8722,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> struct sg_lb_stats *sgs,
>> int *sg_status)
>> {
>> - int i, nr_running, local_group;
>> + int i, nr_running, local_group, update_core;
>>
>> memset(sgs, 0, sizeof(*sgs));
>>
>> local_group = group == sds->local;
>> + update_core = env->sd->flags & SD_SHARE_CPUCAPACITY;
>>
>> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>> struct rq *rq = cpu_rq(i);
>> @@ -8692,6 +8740,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> nr_running = rq->nr_running;
>> sgs->sum_nr_running += nr_running;
>>
>> + if (update_core)
>> + sd_classify(sds, rq);
>> +
>> if (nr_running > 1)
>> *sg_status |= SG_OVERLOAD;
>>
>> @@ -9220,6 +9271,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>> return idlest;
>> }
>>
>> +static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> +{
>> + if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
>> + set_idle_cpus(env->dst_cpu, true);
>
> We could enter this if condition when env->dst_cpu is the only idle
> CPU in the SMT domain (which is likely to be the case every time we do
> a NEW_IDLE balance). By the end of this load-balancing round, the
> env->dst_cpu can pull a task from some other CPU and thereby no longer
> remain idle but the LLC state would still be sd_has_icpus.
>
> That would mean that some CPU on this LLC would do a full scan during
> the wakeup only to find no idle CPU and reset the state to
> sd_is_busy. Have you seen instances where this false-positive pattern
> can result in a wasteful scan, thereby causing a performance degradation?
> Ideally it should not be worse than what we currently have.
Yes, indeed. We will talk about this later in the 7th patch.
>
> Apart from this, patch looks good to me.
Thanks!
>
> It would be worthwhile to explore if the LLC state can be used
> early on in select_task_rq_fair() to determine whether we need to do a
> wake-affine or allow the task to stick to its previous LLC, depending
> on which among the previous LLC and the waker's LLC has an idle CPU.
>
Sounds like a good idea!
Best Regards,
Abel
On 7/21/22 1:08 AM, Gautham R. Shenoy Wrote:
> Hello Abel,
>
> On Sun, Jun 19, 2022 at 08:04:51PM +0800, Abel Wu wrote:
>> Now when updating core state, there are two main problems that could
>> pollute the SIS filter:
>>
>> - The updating is before task migration, so if dst_cpu is
>> selected to be propagated which might be fed with tasks
>> soon, the efforts we paid is no more than setting a busy
>> cpu into the SIS filter. While on the other hand it is
>> important that we update as early as possible to keep the
>> filter fresh, so it's not wise to delay the update to the
>> end of load balancing.
>>
>> - False negative propagation hurts performance since some
>> idle cpus could be out of reach. So in general we will
>> aggressively propagate idle cpus but allow false positive
>> continue to exist for a while, which may lead to filter
>> being fully polluted.
>
> Ok, so the false positive case is being addressed in this patch.
>
>>
>> Pains can be relieved by a force correction when false positive
>> continuously detected.
>>
> [..snip..]
>
>> @@ -111,6 +117,7 @@ struct sched_group;
>> enum sd_state {
>> sd_has_icores,
>> sd_has_icpus,
>> + sd_may_idle,
>> sd_is_busy
>> };
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d55fdcedf2c0..9713d183d35e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>
> [...snip..]
>
>> @@ -9321,7 +9327,7 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> {
>> struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>> enum sd_state new = sds->sd_state;
>> - int this = env->dst_cpu;
>> + int icpu = sds->idle_cpu, this = env->dst_cpu;
>>
>> /*
>> * Parallel updating can hardly contribute accuracy to
>> @@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>> return;
>>
>> + /*
>> + * The dst_cpu is likely to be fed with tasks soon.
>> + * If it is the only unoccupied cpu in this domain,
>> + * we still handle it the same way as as_has_icpus
> ^^^^^^^^^^^^^
> Nit: sd_has_icpus
Will fix.
>
>> + * but turn the SMT into the unstable state, rather
>> + * than waiting to the end of load balancing since
>> + * it's also important that update the filter as
>> + * early as possible to keep it fresh.
>> + */
>> + if (new == sd_is_busy) {
>> + if (idle_cpu(this) || sched_idle_cpu(this)) {
>> + new = sd_may_idle;
>> + icpu = this;
>> + }
>> + }
>> +
>> /*
>> * There is at least one unoccupied cpu available, so
>> * propagate it to the filter to avoid false negative
>> @@ -9338,6 +9360,12 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> * idle cpus thus throughupt downgraded.
>> */
>> if (new != sd_is_busy) {
>> + /*
>> + * The sd_may_idle state is taken into
>> + * consideration as well because from
>> + * here we couldn't actually know task
>> + * migrations would happen or not.
>> + */
>> if (!test_idle_cpus(this))
>> set_idle_cpus(this, true);
>> } else {
>> @@ -9347,9 +9375,26 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> */
>> if (sd_smt_shared->state == sd_is_busy)
>> goto out;
>> +
>> + /*
>> + * Allow false positive to exist for some time
>> + * to make a tradeoff of accuracy of the filter
>> + * for relieving cache traffic.
>> + */
>
> I can understand allowing the false positive to exist when there are
> no other idle CPUs in this SMT domain other than this CPU, which is
> handled by the case where new != sd_is_busy in the current
> load-balance round and will be handled by the "else" clause in the
> subsequent round if env->dst_cpu ends up becoming busy.
>
Yes.
>
> However, when we know that new == sd_is_busy and the previous state of
> this SMT domain was sd_has_icpus, should we not immediately clear this
> core's cpumask from the LLC's icpus mask? What is the need for the
> intermediate sd_may_idle state transition between sd_has_icpus and
> sd_is_busy in this case ?
>
The thought was to make additions to the filter more aggressive than
deletions, to try to avoid real idle cpus being out of reach. Take
short-running tasks for example: the cpus in this newly-busy SMT domain
can become idle multiple times during the next balancing period, but
they won't be selected if we update the state to sd_is_busy immediately.
For workloads that are not short-running, the only downside is
sacrificing the filter's accuracy for a short while.
IOW, temporarily keeping false positive cpus in the filter is more
acceptable than missing some real idle ones, which would reduce
throughput. Besides, this way the cache traffic can be somewhat relieved.
>
>
>> + if (sd_smt_shared->state == sd_has_icpus) {
>> + new = sd_may_idle;
>> + goto update;
>> + }
>> +
>> + /*
>> + * If the false positive issue has already been
>> + * there for a while, a correction of the filter
>> + * is needed.
>> + */
>> }
>>
>> sd_update_icpus(this, sds->idle_cpu);
> ^^^^^^^^^^^^^^
>
> I think you meant to use icpu here ? sds->idle_cpu == -1 in the case
> when new == sd_may_idle, which will end up clearing this core's
> cpumask from this LLC's icpus mask. This defeats the
> "allow-false-positive" heuristic.
>
Nice catch, will fix.
Thanks & Best Regards,
Abel
On 7/21/22 12:16 AM, Gautham R. Shenoy Wrote:
> On Sun, Jun 19, 2022 at 08:04:50PM +0800, Abel Wu wrote:
>
> [..snip..]
>
>>
>> +static void sd_update_icpus(int core, int icpu)
>
> How about update_llc_icpus() ?
LGTM, will rename.
>
>> +{
>> + struct sched_domain_shared *sds;
>> + struct cpumask *icpus;
>> +
>> + sds = rcu_dereference(per_cpu(sd_llc_shared, core));
>> + if (!sds)
>> + return;
>> +
>> + icpus = sched_domain_icpus(sds);
>> +
>> + /*
>> + * XXX: The update is racy between different cores.
>> + * The non-atomic ops here is a tradeoff of accuracy
>> + * for easing the cache traffic.
>> + */
>> + if (icpu == -1)
>> + cpumask_andnot(icpus, icpus, cpu_smt_mask(core));
>> + else if (!cpumask_test_cpu(icpu, icpus))
>> + __cpumask_set_cpu(icpu, icpus);
>> +}
>> +
>> /*
>> * Scans the local SMT mask to see if the entire core is idle, and records this
>> * information in sd_llc_shared->has_idle_cores.
>> @@ -6340,6 +6362,10 @@ static inline bool test_idle_cpus(int cpu)
>> return true;
>> }
>>
>> +static inline void sd_update_icpus(int core, int icpu)
>> +{
>> +}
>> +
>> static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
>> {
>> return __select_idle_cpu(core, p);
>> @@ -6370,7 +6396,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> if (!this_sd)
>> return -1;
>>
>> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> + cpumask_and(cpus, has_idle_core ? sched_domain_span(sd) :
>> + sched_domain_icpus(sd->shared), p->cpus_ptr);
>
> With this we get an idea of the likely idle CPUs. However, we may
> still want SIS_UTIL on top of this as it determines the number of idle
> CPUs to scan based on the utilization average that will iron out any
> transient idle CPUs which may feature in
> sched_domain_icpus(sd->shared) but are not likely to remain idle. Is
> this understanding correct ?
>
Yes, sd->shared is not updated in real time, so it could contain
false positives. SIS_UTIL limits how much effort we should spend,
and the SIS filter tries to make that effort more efficient by
ironing out the cpus that are unlikely to be idle.
>
>>
>> if (sched_feat(SIS_PROP) && !has_idle_core) {
>> u64 avg_cost, avg_idle, span_avg;
>> @@ -8342,6 +8369,7 @@ struct sd_lb_stats {
>> unsigned int prefer_sibling; /* tasks should go to sibling first */
>>
>> int sd_state;
>> + int idle_cpu;
>>
>> struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
>> struct sg_lb_stats local_stat; /* Statistics of the local group */
>> @@ -8362,6 +8390,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>> .total_load = 0UL,
>> .total_capacity = 0UL,
>> .sd_state = sd_is_busy,
>> + .idle_cpu = -1,
>> .busiest_stat = {
>> .idle_cpus = UINT_MAX,
>> .group_type = group_has_spare,
>> @@ -8702,10 +8731,18 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
>> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
>> }
>>
>> -static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq)
>> +static inline void sd_classify(struct sd_lb_stats *sds, struct rq *rq, int cpu)
>> {
>> - if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq))
>> + if (sds->sd_state != sd_has_icpus && unoccupied_rq(rq)) {
>> + /*
>> + * Prefer idle cpus than unoccupied ones. This
>> + * is achieved by only allowing the idle ones
>> + * unconditionally overwrite the preious record
> ^^^^^^^^
> Nit: previous
>
Will fix.
>
>> + * while the occupied ones can't.
>> + */
>
> This if condition is only executed when we encounter the very first
> unoccupied cpu in the SMT domain. So why do we need this comment here
> about preferring idle cpus over unoccupied ones ?
>
Agreed, this comment should be removed.
>
>> + sds->idle_cpu = cpu;
>> sds->sd_state = sd_has_icpus;
>> + }
>> }
>>
>> /**
>> @@ -8741,7 +8778,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> sgs->sum_nr_running += nr_running;
>>
>> if (update_core)
>> - sd_classify(sds, rq);
>> + sd_classify(sds, rq, i);
>>
>> if (nr_running > 1)
>> *sg_status |= SG_OVERLOAD;
>> @@ -8757,7 +8794,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> * No need to call idle_cpu() if nr_running is not 0
>> */
>> if (!nr_running && idle_cpu(i)) {
>> + /*
>> + * Prefer the last idle cpu by overwriting
>> + * preious one. The first idle cpu in this
> ^^^^^^^
> Nit: previous
Will fix.
>
>> + * domain (if any) can trigger balancing
>> + * and fed with tasks, so we'd better choose
>> + * a candidate in an opposite way.
>> + */
>
> This is a better place to call out the fact that an idle cpu is
> preferable to an unoccupied cpu.
>
>> + sds->idle_cpu = i;
>> sgs->idle_cpus++;
>> +
>> /* Idle cpu can't have misfit task */
>> continue;
>> }
>> @@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>
>> static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> {
>> - if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
>> - set_idle_cpus(env->dst_cpu, true);
>> + struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>> + enum sd_state new = sds->sd_state;
>> + int this = env->dst_cpu;
>> +
>> + /*
>> + * Parallel updating can hardly contribute accuracy to
>> + * the filter, besides it can be one of the burdens on
>> + * cache traffic.
>> + */
>> + if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>> + return;
>> +
>> + /*
>> + * There is at least one unoccupied cpu available, so
>> + * propagate it to the filter to avoid false negative
>> + * issue which could result in lost tracking of some
>> + * idle cpus thus throughupt downgraded.
>> + */
>> + if (new != sd_is_busy) {
>> + if (!test_idle_cpus(this))
>> + set_idle_cpus(this, true);
>> + } else {
>> + /*
>> + * Nothing changes so nothing to update or
>> + * propagate.
>> + */
>> + if (sd_smt_shared->state == sd_is_busy)
>> + goto out;
>
>
> The main use of sd_smt_shared->state is to detect the transition
> between sd_has_icpus --> sd_is_busy, during which sds->idle_cpu == -1,
> which will ensure that sd_update_icpus() below clears this core's CPUs
> from the LLC's icpus mask. Calling this out may be a more useful
> comment instead of the comment above.
>
The sd_has_icpus --> sd_is_busy transition is just one of them; the
full decision matrix is:

    old        new        decision
    *          has_icpus  update(icpu)
    has_icpus  is_busy    update(-1)
    is_busy    is_busy    -

The comment here corresponds to the hyphen above. Please let me know
if I understood you incorrectly.
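If it helps, the matrix maps onto the code roughly like this (just an
illustration using the names from the hunk above, not the actual diff):

	if (new != sd_is_busy) {
		/* "* -> has_icpus": propagate the recorded idle cpu */
		sd_update_icpus(this, sds->idle_cpu);
	} else if (sd_smt_shared->state == sd_has_icpus) {
		/* "has_icpus -> is_busy": clear this core from the filter */
		sd_update_icpus(this, -1);
	}
	/* else "is_busy -> is_busy": nothing to update */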
Best Regards,
Abel
>
>> + }
>> +
>> + sd_update_icpus(this, sds->idle_cpu);
>> + sd_smt_shared->state = new;
>> +out:
>> + xchg(&sd_smt_shared->updating, 0);
>> }
>
>
> --
> Thanks and Regards
> gautham.
Hi K Prateek, thanks for the testing and sorry for the late reply.
On 7/18/22 7:00 PM, K Prateek Nayak Wrote:
> Hello Abel,
>
> We've tested the patch on a dual socket Zen3 System (2 x 64C/128T).
>
> tl;dr
>
> - There is a noticeable regression for Hackbench with the system
> configured in NPS4 mode. This regression is more noticeable
> with SIS_UTIL enabled and not as severe with SIS_PROP.
> This regression is surprising given the patch should have
> improved SIS Efficiency in case of fully loaded system and is
> consistently reproducible across multiple runs and reboots.
The regression seems unexpected; I will try to reproduce it on my
Intel server. While staring at the code, I found a couple of things
that may be related to the issue:

- The cpumask_and() in select_idle_cpu() comes before the SIS_UTIL
  check, which can bail out early. So when the SIS filter is enabled,
  a lot of useless effort can be spent if nr_idle_scan == 0 (e.g. the
  16-groups case). The SIS_PROP case is different in that the effort
  done by the filter is not entirely in vain, which is probably why
  the regression under SIS_UTIL is more noticeable. I am working on a
  patch to optimize this (a rough sketch of the reordering is below).

- If nr_idle_scan == 0 then select_idle_cpu() will bail out early, so
  it's pointless to update the SIS filter, which only adds overhead on
  top of the above issue. This will be fixed in the next version.
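The sketch of the first point (against the SIS_UTIL logic currently on
the tip; untested, just to show the intended ordering):

	/*
	 * Sketch only: do the cheap SIS_UTIL admission check first so a
	 * fully loaded LLC bails out before we pay for the filter mask.
	 */
	if (sched_feat(SIS_UTIL)) {
		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sd_share) {
			/* nr == 1 means nr_idle_scan == 0: nothing to scan */
			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
			if (nr == 1)
				return -1;
		}
	}

	/* Only now intersect the candidate mask with the SIS filter. */
	cpumask_and(cpus, has_idle_core ? sched_domain_span(sd) :
		    sched_domain_icpus(sd->shared), p->cpus_ptr);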
I will rework the whole patchset to fit the new SIS_UTIL feature.
>
> - Apart from the above anomaly, the results look positive overall
> with the patched kernel behaving as well as, or better than the tip.
Cheers!
>
> [..snip..]
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Hackbench - 15 runs statistics
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> o NPS 4 - 16 groups (SIS_UTIL)
>
> - tip
>
> Min : 7.35
> Max : 12.66
> Median : 10.60
> AMean : 10.00
> GMean : 9.82
> HMean : 9.64
> AMean Stddev : 1.88
> AMean CoefVar : 18.85 pct
>
> - SIS_Eff
>
> Min : 12.32
> Max : 18.92
> Median : 13.82
> AMean : 14.96 (-49.60 pct)
> GMean : 14.80
> HMean : 14.66
> AMean Stddev : 2.25
> AMean CoefVar : 15.01 pct
>
> o NPS 4 - 16 groups (SIS_PROP)
>
> - tip
>
> Min : 7.04
> Max : 8.22
> Median : 7.49
> AMean : 7.52
> GMean : 7.52
> HMean : 7.51
> AMean Stddev : 0.29
> AMean CoefVar : 3.88 pct
>
> - SIS_Eff
>
> Min : 7.04
> Max : 9.78
> Median : 8.16
> AMean : 8.42 (-11.06 pct)
> GMean : 8.39
> HMean : 8.36
> AMean Stddev : 0.78
> AMean CoefVar : 9.23 pct
>
> The Hackbench regression is much more noticeable with SIS_UTIL
> enabled but only when the test machine is running in NPS4 mode.
> It is not obvious why this is happening given the patch series
> aims at improving SIS Efficiency.
The result seems to have some connection with the LLC size. I need
some time to figure it out.
>
> It would be great if you can test the series with SIS_UTIL
> enabled and SIS_PROP disabled to see if it affects any benchmark
> behavior given SIS_UTIL is the default SIS logic currently on
> the tip.
Yes, I will.
Thanks & Best Regards,
Abel