The wakeup fastpath for fair tasks, select_idle_sibling() aka SIS,
plays an important role in maximizing the usage of cpu resources and
can greatly affect the overall performance of the system.
SIS tries to find an idle cpu inside the target LLC to place the
woken-up task. Cache-hot cpus are preferred, but if none of them is
idle, it falls back to the other cpus of that LLC by firing a domain
scan. The domain scan works well under light pressure by simply
traversing the cpus of the LLC, since plenty of idle cpus are
usually available.
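For reference, the fastpath roughly looks like the sketch below. This
is a simplified rendering, not the exact mainline code: checks such
as recent_used_cpu, asym capacity and the has_idle_core handling are
elided.

        static int sis_sketch(struct task_struct *p, int prev, int target)
        {
                struct sched_domain *sd;

                /* Cache-hot candidates first: the wakeup target cpu... */
                if (available_idle_cpu(target) || sched_idle_cpu(target))
                        return target;

                /* ...and the cpu the task ran on last time. */
                if (prev != target && cpus_share_cache(prev, target) &&
                    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
                        return prev;

                /* None of them is idle: fire a scan over the whole LLC. */
                sd = rcu_dereference(per_cpu(sd_llc, target));
                if (sd)    /* has_idle_core handling elided */
                        return select_idle_cpu(p, sd, false, target);

                return target;
        }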
But things change. LLCs are getting bigger in modern and future
machines, and cloud service providers are consistently trying to
reduce TCO by packing more load and squeezing all kinds of
resources. So simply traversing the LLC doesn't fit the future
requirements well.
Features like SIS_{UTIL,PROP} already exist to cope with the
scalability issue by limiting the scan depth, and it would be even
better if we could improve the way the scan itself works. This is
exactly what SIS_FILTER is born for.
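The depth limiting can be sketched as below. This is a simplified
view of the consumption side in select_idle_cpu() only, glossing
over SIS_PROP and the has_idle_core case; the nr_idle_scan value
itself is derived from LLC utilization at tick time.

        static int scan_llc(struct task_struct *p, struct cpumask *cpus,
                            int target)
        {
                struct sched_domain_shared *sd_share;
                int cpu, nr = INT_MAX;

                sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
                if (sd_share)
                        nr = READ_ONCE(sd_share->nr_idle_scan);

                for_each_cpu_wrap(cpu, cpus, target + 1) {
                        if (nr-- <= 0)
                                return -1;      /* scan budget exhausted */
                        if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                                return cpu;
                }

                return -1;
        }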
So this patchset will focus on improving the efficiency of the SIS
domain scan:
- Patch 1~2, give a chance to scan for idle cores if has_idle_core
is true even when the LLC is overloaded (nr_idle_scan == 0). This
helps exploit the cpu resources by more actively putting idle
cores to work.
- Patch 3, add a check on whether there are any tasks pending right
at the beginning of __update_idle_core(), so the has_idle_cores
hint won't be mistakenly updated if the cpu isn't idle any more.
- Patch 4, skip the SIS domain scan if the target LLC is fully busy,
so cpu resources can be used more wisely rather than wasting time
trying to find an idle cpu that probably doesn't exist (see the
sketch after this list).
- Patch 5, introduce the SIS_FILTER feature to optimize the way of
scanning LLC, helping SIS_{UTIL,PROP} to work more efficiently.
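The patch 4 idea mentioned above can be sketched as follows. This is
a hypothetical shape only; helper names like test_idle_cpus() follow
this series, but the actual patch may differ in detail.

        static int select_idle_cpu_sketch(struct task_struct *p,
                                          struct sched_domain *sd,
                                          bool has_idle_core, int target)
        {
                int cpu;

                /*
                 * The LLC advertised itself as fully busy: don't
                 * bother scanning for an idle cpu that probably
                 * doesn't exist.
                 */
                if (!has_idle_core && !test_idle_cpus(target))
                        return -1;

                for_each_cpu_wrap(cpu, sched_domain_span(sd), target + 1) {
                        if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                                continue;
                        if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                                return cpu;
                }

                return -1;
        }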
Benchmark
=========
Tests are done on a dual-socket (2 x 24C/48T) Intel Xeon(R)
Platinum 8260 machine, with the following SNC configurations:
SNC on: 4 NUMA nodes each of which has 12C/24T
SNC off: 2 NUMA nodes each of which has 24C/48T
All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled.
Based on tip sched/core 0fba527e959d (v5.19.0).
Results
=======
hackbench-process-pipes
vanilla patched
(SNC on)
Amean 1 0.4480 ( 0.00%) 0.4500 ( -0.45%)
Amean 4 0.6137 ( 0.00%) 0.5990 ( 2.39%)
Amean 7 0.7530 ( 0.00%) 0.7377 ( 2.04%)
Amean 12 1.1230 ( 0.00%) 1.0490 * 6.59%*
Amean 21 2.0567 ( 0.00%) 1.8680 * 9.17%*
Amean 30 3.0847 ( 0.00%) 2.7710 * 10.17%*
Amean 48 5.9043 ( 0.00%) 4.5393 * 23.12%*
Amean 79 9.3477 ( 0.00%) 7.6610 * 18.04%*
Amean 110 11.0647 ( 0.00%) 10.5560 ( 4.60%)
Amean 141 13.3297 ( 0.00%) 12.6137 * 5.37%*
Amean 172 15.2210 ( 0.00%) 14.6547 ( 3.72%)
Amean 203 17.8510 ( 0.00%) 16.9000 * 5.33%*
Amean 234 19.9263 ( 0.00%) 18.7687 * 5.81%*
Amean 265 21.9117 ( 0.00%) 21.4060 ( 2.31%)
Amean 296 23.7683 ( 0.00%) 23.0990 * 2.82%*
(SNC off)
Amean 1 0.2963 ( 0.00%) 0.2933 ( 1.01%)
Amean 4 0.6093 ( 0.00%) 0.5883 ( 3.45%)
Amean 7 0.7837 ( 0.00%) 0.7570 ( 3.40%)
Amean 12 1.2703 ( 0.00%) 1.0780 ( 15.14%)
Amean 21 2.6260 ( 0.00%) 1.8903 * 28.01%*
Amean 30 4.3483 ( 0.00%) 2.7903 * 35.83%*
Amean 48 7.9753 ( 0.00%) 4.8920 * 38.66%*
Amean 79 9.6540 ( 0.00%) 8.0127 * 17.00%*
Amean 110 11.2597 ( 0.00%) 10.1557 * 9.80%*
Amean 141 13.8077 ( 0.00%) 12.7387 * 7.74%*
Amean 172 16.3513 ( 0.00%) 14.5860 * 10.80%*
Amean 203 19.0880 ( 0.00%) 17.1950 * 9.92%*
Amean 234 21.7660 ( 0.00%) 19.6763 * 9.60%*
Amean 265 23.0447 ( 0.00%) 22.5557 ( 2.12%)
Amean 296 25.4660 ( 0.00%) 24.4273 ( 4.08%)
hackbench-process-sockets
vanilla patched
(SNC on)
Amean 1 0.6503 ( 0.00%) 0.6430 ( 1.13%)
Amean 4 1.6320 ( 0.00%) 1.6247 ( 0.45%)
Amean 7 2.4810 ( 0.00%) 2.4927 ( -0.47%)
Amean 12 4.0943 ( 0.00%) 4.0743 ( 0.49%)
Amean 21 6.8833 ( 0.00%) 6.9220 ( -0.56%)
Amean 30 9.7560 ( 0.00%) 9.7107 * 0.46%*
Amean 48 15.5020 ( 0.00%) 15.5013 ( 0.00%)
Amean 79 25.7690 ( 0.00%) 25.7860 ( -0.07%)
Amean 110 35.7700 ( 0.00%) 35.9203 ( -0.42%)
Amean 141 45.8710 ( 0.00%) 46.0040 ( -0.29%)
Amean 172 55.7460 ( 0.00%) 56.1457 * -0.72%*
Amean 203 65.9393 ( 0.00%) 65.8903 ( 0.07%)
Amean 234 75.7340 ( 0.00%) 76.0653 * -0.44%*
Amean 265 85.8940 ( 0.00%) 85.9670 ( -0.08%)
Amean 296 96.0660 ( 0.00%) 96.3823 ( -0.33%)
(SNC off)
Amean 1 0.4050 ( 0.00%) 0.4267 ( -5.35%)
Amean 4 1.4517 ( 0.00%) 1.4260 * 1.77%*
Amean 7 2.4617 ( 0.00%) 2.4583 ( 0.14%)
Amean 12 4.0730 ( 0.00%) 4.0900 ( -0.42%)
Amean 21 6.9217 ( 0.00%) 6.9297 ( -0.12%)
Amean 30 9.7527 ( 0.00%) 9.7483 ( 0.04%)
Amean 48 15.5490 ( 0.00%) 15.5040 ( 0.29%)
Amean 79 26.1670 ( 0.00%) 26.2290 ( -0.24%)
Amean 110 36.3910 ( 0.00%) 36.4073 ( -0.04%)
Amean 141 46.6660 ( 0.00%) 46.1683 * 1.07%*
Amean 172 56.7627 ( 0.00%) 55.9527 * 1.43%*
Amean 203 66.8097 ( 0.00%) 66.4117 * 0.60%*
Amean 234 77.3577 ( 0.00%) 76.3683 ( 1.28%)
Amean 265 87.5923 ( 0.00%) 85.9367 * 1.89%*
Amean 296 97.2430 ( 0.00%) 96.4750 * 0.79%*
hackbench-thread-pipes
vanilla patched
(SNC on)
Amean 1 0.4470 ( 0.00%) 0.4477 ( -0.15%)
Amean 4 0.6563 ( 0.00%) 0.6803 ( -3.66%)
Amean 7 0.8303 ( 0.00%) 0.7963 ( 4.09%)
Amean 12 1.3983 ( 0.00%) 1.2387 * 11.42%*
Amean 21 2.8490 ( 0.00%) 2.3220 * 18.50%*
Amean 30 4.3807 ( 0.00%) 3.2800 * 25.13%*
Amean 48 7.0663 ( 0.00%) 5.4790 * 22.46%*
Amean 79 10.0643 ( 0.00%) 8.4573 * 15.97%*
Amean 110 12.2953 ( 0.00%) 11.2157 * 8.78%*
Amean 141 15.1210 ( 0.00%) 14.2923 * 5.48%*
Amean 172 16.8823 ( 0.00%) 16.3013 * 3.44%*
Amean 203 19.3927 ( 0.00%) 18.9503 ( 2.28%)
Amean 234 22.1817 ( 0.00%) 21.1763 * 4.53%*
Amean 265 24.4327 ( 0.00%) 24.1200 ( 1.28%)
Amean 296 26.9480 ( 0.00%) 26.3020 * 2.40%*
(SNC off)
Amean 1 0.2663 ( 0.00%) 0.2583 ( 3.00%)
Amean 4 0.6627 ( 0.00%) 0.6147 * 7.24%*
Amean 7 0.8183 ( 0.00%) 0.7757 ( 5.21%)
Amean 12 1.3550 ( 0.00%) 1.2430 ( 8.27%)
Amean 21 3.4853 ( 0.00%) 2.2950 * 34.15%*
Amean 30 7.1470 ( 0.00%) 3.6847 * 48.44%*
Amean 48 9.6940 ( 0.00%) 6.9593 * 28.21%*
Amean 79 10.3100 ( 0.00%) 8.7727 * 14.91%*
Amean 110 11.9537 ( 0.00%) 10.4757 * 12.36%*
Amean 141 14.2673 ( 0.00%) 12.8350 * 10.04%*
Amean 172 16.6850 ( 0.00%) 15.1883 * 8.97%*
Amean 203 19.5060 ( 0.00%) 18.1390 * 7.01%*
Amean 234 21.6900 ( 0.00%) 20.2783 * 6.51%*
Amean 265 24.9073 ( 0.00%) 23.3440 * 6.28%*
Amean 296 26.9260 ( 0.00%) 26.3007 ( 2.32%)
hackbench-thread-sockets
vanilla patched
(SNC on)
Amean 1 0.6700 ( 0.00%) 0.6983 ( -4.23%)
Amean 4 1.6450 ( 0.00%) 1.6530 ( -0.49%)
Amean 7 2.5587 ( 0.00%) 2.5420 * 0.65%*
Amean 12 4.1793 ( 0.00%) 4.1630 ( 0.39%)
Amean 21 7.1000 ( 0.00%) 7.1087 ( -0.12%)
Amean 30 9.9783 ( 0.00%) 9.9740 ( 0.04%)
Amean 48 15.8737 ( 0.00%) 15.8643 ( 0.06%)
Amean 79 26.3667 ( 0.00%) 26.4097 ( -0.16%)
Amean 110 36.5910 ( 0.00%) 36.7513 ( -0.44%)
Amean 141 46.8240 ( 0.00%) 47.0613 * -0.51%*
Amean 172 56.9787 ( 0.00%) 57.3263 * -0.61%*
Amean 203 67.4160 ( 0.00%) 67.6133 ( -0.29%)
Amean 234 77.9683 ( 0.00%) 78.1877 ( -0.28%)
Amean 265 88.1377 ( 0.00%) 88.2207 ( -0.09%)
Amean 296 98.1933 ( 0.00%) 98.4537 ( -0.27%)
(SNC off)
Amean 1 0.4280 ( 0.00%) 0.4973 ( -16.20%)
Amean 4 1.4897 ( 0.00%) 1.4593 * 2.04%*
Amean 7 2.5220 ( 0.00%) 2.4803 * 1.65%*
Amean 12 4.1887 ( 0.00%) 4.1693 ( 0.46%)
Amean 21 7.1040 ( 0.00%) 7.1267 ( -0.32%)
Amean 30 10.0300 ( 0.00%) 10.0183 ( 0.12%)
Amean 48 15.9953 ( 0.00%) 15.9597 ( 0.22%)
Amean 79 26.8963 ( 0.00%) 26.6973 * 0.74%*
Amean 110 37.1493 ( 0.00%) 37.0533 ( 0.26%)
Amean 141 47.3680 ( 0.00%) 47.1407 ( 0.48%)
Amean 172 57.8660 ( 0.00%) 57.5513 ( 0.54%)
Amean 203 68.0573 ( 0.00%) 68.3440 ( -0.42%)
Amean 234 78.5393 ( 0.00%) 78.2117 ( 0.42%)
Amean 265 89.1383 ( 0.00%) 88.6450 ( 0.55%)
Amean 296 99.8007 ( 0.00%) 99.1537 * 0.65%*
tbench4 Throughput
vanilla patched
(SNC on)
Hmean 1 300.70 ( 0.00%) 302.52 * 0.61%*
Hmean 2 597.53 ( 0.00%) 604.76 * 1.21%*
Hmean 4 1188.34 ( 0.00%) 1204.79 * 1.38%*
Hmean 8 2336.22 ( 0.00%) 2375.87 * 1.70%*
Hmean 16 4459.17 ( 0.00%) 4681.25 * 4.98%*
Hmean 32 7606.69 ( 0.00%) 7607.93 ( 0.02%)
Hmean 64 9009.48 ( 0.00%) 8956.00 * -0.59%*
Hmean 128 19456.88 ( 0.00%) 19258.30 * -1.02%*
Hmean 256 19771.10 ( 0.00%) 20783.82 * 5.12%*
Hmean 384 20118.74 ( 0.00%) 20407.40 * 1.43%*
(SNC off)
Hmean 1 284.44 ( 0.00%) 286.27 * 0.64%*
Hmean 2 564.10 ( 0.00%) 574.82 * 1.90%*
Hmean 4 1120.93 ( 0.00%) 1137.27 * 1.46%*
Hmean 8 2248.94 ( 0.00%) 2261.98 * 0.58%*
Hmean 16 4360.10 ( 0.00%) 4430.95 * 1.63%*
Hmean 32 7300.52 ( 0.00%) 7341.70 * 0.56%*
Hmean 64 8912.37 ( 0.00%) 8954.61 * 0.47%*
Hmean 128 19874.16 ( 0.00%) 20198.82 * 1.63%*
Hmean 256 19759.42 ( 0.00%) 19785.57 * 0.13%*
Hmean 384 19502.40 ( 0.00%) 19956.96 * 2.33%*
netperf-udp
vanilla patched
(SNC on)
Hmean send-64 209.06 ( 0.00%) 210.32 ( 0.60%)
Hmean send-128 416.70 ( 0.00%) 415.34 ( -0.33%)
Hmean send-256 819.65 ( 0.00%) 808.52 * -1.36%*
Hmean send-1024 3163.12 ( 0.00%) 3132.35 * -0.97%*
Hmean send-2048 5958.21 ( 0.00%) 5926.40 ( -0.53%)
Hmean send-3312 9168.81 ( 0.00%) 9194.53 ( 0.28%)
Hmean send-4096 11039.27 ( 0.00%) 11159.21 * 1.09%*
Hmean send-8192 17804.42 ( 0.00%) 17840.95 ( 0.21%)
Hmean send-16384 28529.57 ( 0.00%) 28389.97 ( -0.49%)
Hmean recv-64 209.06 ( 0.00%) 210.32 ( 0.60%)
Hmean recv-128 416.70 ( 0.00%) 415.34 ( -0.33%)
Hmean recv-256 819.65 ( 0.00%) 808.52 * -1.36%*
Hmean recv-1024 3163.12 ( 0.00%) 3132.35 * -0.97%*
Hmean recv-2048 5958.21 ( 0.00%) 5926.40 ( -0.53%)
Hmean recv-3312 9168.81 ( 0.00%) 9194.53 ( 0.28%)
Hmean recv-4096 11039.27 ( 0.00%) 11159.10 * 1.09%*
Hmean recv-8192 17804.32 ( 0.00%) 17840.92 ( 0.21%)
Hmean recv-16384 28529.38 ( 0.00%) 28389.96 ( -0.49%)
(SNC off)
Hmean send-64 211.39 ( 0.00%) 210.60 ( -0.37%)
Hmean send-128 415.25 ( 0.00%) 420.91 * 1.36%*
Hmean send-256 814.75 ( 0.00%) 822.22 * 0.92%*
Hmean send-1024 3171.61 ( 0.00%) 3135.16 ( -1.15%)
Hmean send-2048 6015.92 ( 0.00%) 5943.85 * -1.20%*
Hmean send-3312 9210.17 ( 0.00%) 9159.59 ( -0.55%)
Hmean send-4096 11084.55 ( 0.00%) 11098.02 ( 0.12%)
Hmean send-8192 17769.83 ( 0.00%) 17804.44 ( 0.19%)
Hmean send-16384 27718.62 ( 0.00%) 27828.89 ( 0.40%)
Hmean recv-64 211.39 ( 0.00%) 210.60 ( -0.37%)
Hmean recv-128 415.25 ( 0.00%) 420.90 * 1.36%*
Hmean recv-256 814.75 ( 0.00%) 822.22 * 0.92%*
Hmean recv-1024 3171.61 ( 0.00%) 3135.16 ( -1.15%)
Hmean recv-2048 6015.92 ( 0.00%) 5943.85 * -1.20%*
Hmean recv-3312 9210.17 ( 0.00%) 9159.59 ( -0.55%)
Hmean recv-4096 11084.55 ( 0.00%) 11098.02 ( 0.12%)
Hmean recv-8192 17769.76 ( 0.00%) 17804.41 ( 0.20%)
Hmean recv-16384 27718.62 ( 0.00%) 27828.79 ( 0.40%)
netperf-tcp
vanilla patched
(SNC on)
Hmean 64 1192.41 ( 0.00%) 1219.91 * 2.31%*
Hmean 128 2354.50 ( 0.00%) 2360.65 ( 0.26%)
Hmean 256 4371.10 ( 0.00%) 4393.92 ( 0.52%)
Hmean 1024 13813.84 ( 0.00%) 13712.10 ( -0.74%)
Hmean 2048 21518.91 ( 0.00%) 21950.82 * 2.01%*
Hmean 3312 25585.77 ( 0.00%) 26087.72 * 1.96%*
Hmean 4096 27402.77 ( 0.00%) 27927.67 * 1.92%*
Hmean 8192 31766.67 ( 0.00%) 31914.49 ( 0.47%)
Hmean 16384 36227.30 ( 0.00%) 36630.26 ( 1.11%)
(SNC off)
Hmean 64 1182.09 ( 0.00%) 1211.82 * 2.52%*
Hmean 128 2316.35 ( 0.00%) 2356.54 * 1.74%*
Hmean 256 4231.05 ( 0.00%) 4282.52 ( 1.22%)
Hmean 1024 13461.44 ( 0.00%) 13571.02 ( 0.81%)
Hmean 2048 21016.51 ( 0.00%) 21356.97 * 1.62%*
Hmean 3312 24834.03 ( 0.00%) 25140.70 * 1.23%*
Hmean 4096 26700.53 ( 0.00%) 26734.55 ( 0.13%)
Hmean 8192 31094.10 ( 0.00%) 31201.16 ( 0.34%)
Hmean 16384 34953.23 ( 0.00%) 34949.00 ( -0.01%)
Conclusion
==========
Substantial improvement can be seen in the hackbench pipe tests,
while the others are generally neutral. This is because if a
workload already has a relatively high SIS success rate, e.g.
tbench4 ~= 50% and netperf ~= 100% compared to hackbench ~= 3.5%,
the cost of trying to improve its efficiency negates much of the
benefit it brings.
But the real world is more complicated. Different workloads can be
co-located on the same machine to share resources, and their
profiles can vary quite a lot, so the SIS success rate is not
predictable.
Please see more detailed analysis in individual patches.
v4 -> v5:
- Add limited scan for idle cores when overloaded, suggested by Mel
- Split out several patches since they are irrelevant to this scope
- Add quick check on ttwu_pending before core update
- Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
- Move the main filter logic to the idle path, because the newidle
balance can bail out early if rq->avg_idle is small enough and
lose chances to update the filter.
v3 -> v4:
- Update filter in load_balance rather than in the tick
- Now the filter contains unoccupied cpus rather than overloaded ones
- Added mechanisms to deal with the false positive cases
v2 -> v3:
- Removed sched-idle balance feature and focus on SIS
- Take non-CFS tasks into consideration
- Several fixes/improvement suggested by Josh Don
v1 -> v2:
- Several optimizations on sched-idle balancing
- Ignore asym topos in can_migrate_task
- Add more benchmarks including SIS efficiency
- Re-organize patch as suggested by Mel Gorman
v4: https://lore.kernel.org/lkml/[email protected]/
v3: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/lkml/[email protected]/
v1: https://lore.kernel.org/lkml/[email protected]/
Abel Wu (5):
sched/fair: Ignore SIS_UTIL when has idle core
sched/fair: Limited scan for idle cores when overloaded
sched/fair: Skip core update if task pending
sched/fair: Skip SIS domain scan if fully busy
sched/fair: Introduce SIS_FILTER
include/linux/sched/topology.h | 53 +++++++++-
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 171 +++++++++++++++++++++++++++++----
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 3 +
kernel/sched/topology.c | 9 +-
6 files changed, 218 insertions(+), 20 deletions(-)
--
2.37.3
The function __update_idle_core() assumes this cpu is idle, so it
only checks its siblings to decide whether the resident core is
idle, and updates the has_idle_cores hint if necessary. But the
problem is that this cpu might not be idle at that moment any
more, resulting in a misleading hint.
It's not appropriate to make this check everywhere in the idle
path, but checking just before the core update can make the
has_idle_cores hint more reliable with negligible cost.
Signed-off-by: Abel Wu <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7abe188a1533..fad289530e07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6294,6 +6294,9 @@ void __update_idle_core(struct rq *rq)
int core = cpu_of(rq);
int cpu;
+ if (rq->ttwu_pending)
+ return;
+
rcu_read_lock();
if (test_idle_cores(core, true))
goto unlock;
--
2.37.3
The SIS filter is supposed to contain the idle and sched-idle cpus
of the target LLC, and to improve the efficiency of the SIS domain
scan by filtering out busy cpus, so the limited scan depth is used
more wisely.
Idle cpus are propagated to the filter lazily if their resident
cores are already present in the filter. This eases the pressure of
LLC cache traffic, and can also bring benefit by spreading load out
to different cores.
There is also a sequence number to indicate the generation of the
filter. The generation advances every time the filter resets. Once
a cpu is propagated/set into the filter, the filter generation is
cached locally to help identify whether the cpu is present in the
filter, without peeking at the LLC-shared filter itself.
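To make the generation trick concrete, below is a tiny user-space
model of it. This is purely illustrative, with made-up names; in
the actual patch the cached generation lives in rq->last_idle_seq
and the shared one in sched_domain_shared->seq, with READ_ONCE/
WRITE_ONCE around the racy accesses.

        #include <stdbool.h>
        #include <stdint.h>

        #define NR_CPUS 128

        /* Model of the LLC-shared state (sched_domain_shared). */
        struct llc_filter {
                uint64_t seq;                   /* filter generation */
                uint64_t icpus[NR_CPUS / 64];   /* idle-cpu bitmap */
        };

        /* Per-cpu cached generation, updated when setting the cpu. */
        static uint64_t last_idle_seq[NR_CPUS];

        static void filter_set(struct llc_filter *f, int cpu)
        {
                last_idle_seq[cpu] = f->seq;
                f->icpus[cpu / 64] |= 1ULL << (cpu % 64);
        }

        /*
         * The membership test never touches the shared bitmap: a
         * cpu is present iff its cached generation is current.
         */
        static bool filter_test(const struct llc_filter *f, int cpu)
        {
                return last_idle_seq[cpu] >= f->seq;
        }

        /* Reset clears the bitmap and advances the generation,
         * implicitly invalidating all the cached per-cpu state. */
        static void filter_reset(struct llc_filter *f)
        {
                for (int i = 0; i < NR_CPUS / 64; i++)
                        f->icpus[i] = 0;
                f->seq++;
        }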
Benchmark
=========
Tests are done on a dual-socket (2 x 24C/48T) Intel Xeon(R)
Platinum 8260 machine, with the following SNC configurations:
SNC on: 4 NUMA nodes each of which has 12C/24T
SNC off: 2 NUMA nodes each of which has 24C/48T
All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled.
Baseline is tip sched/core 0fba527e959d (v5.19.0) plus the first
two patches of this series, and "patched" contains the whole series.
Results
=======
hackbench-process-pipes
baseline patched
(SNC on)
Amean 1 0.4557 ( 0.00%) 0.4500 ( 1.24%)
Amean 4 0.6033 ( 0.00%) 0.5990 ( 0.72%)
Amean 7 0.7627 ( 0.00%) 0.7377 ( 3.28%)
Amean 12 1.0653 ( 0.00%) 1.0490 ( 1.53%)
Amean 21 2.0283 ( 0.00%) 1.8680 * 7.90%*
Amean 30 2.9670 ( 0.00%) 2.7710 * 6.61%*
Amean 48 4.6863 ( 0.00%) 4.5393 ( 3.14%)
Amean 79 7.9443 ( 0.00%) 7.6610 * 3.57%*
Amean 110 10.2393 ( 0.00%) 10.5560 ( -3.09%)
Amean 141 12.6343 ( 0.00%) 12.6137 ( 0.16%)
Amean 172 14.9957 ( 0.00%) 14.6547 * 2.27%*
Amean 203 16.9133 ( 0.00%) 16.9000 ( 0.08%)
Amean 234 19.2433 ( 0.00%) 18.7687 ( 2.47%)
Amean 265 21.6917 ( 0.00%) 21.4060 ( 1.32%)
Amean 296 23.8743 ( 0.00%) 23.0990 * 3.25%*
(SNC off)
Amean 1 0.3143 ( 0.00%) 0.2933 ( 6.68%)
Amean 4 0.6070 ( 0.00%) 0.5883 ( 3.08%)
Amean 7 0.7960 ( 0.00%) 0.7570 ( 4.90%)
Amean 12 1.1947 ( 0.00%) 1.0780 * 9.77%*
Amean 21 2.4837 ( 0.00%) 1.8903 * 23.89%*
Amean 30 3.4577 ( 0.00%) 2.7903 * 19.30%*
Amean 48 5.5227 ( 0.00%) 4.8920 ( 11.42%)
Amean 79 9.2933 ( 0.00%) 8.0127 * 13.78%*
Amean 110 11.0443 ( 0.00%) 10.1557 * 8.05%*
Amean 141 13.1360 ( 0.00%) 12.7387 ( 3.02%)
Amean 172 15.7770 ( 0.00%) 14.5860 * 7.55%*
Amean 203 17.9557 ( 0.00%) 17.1950 * 4.24%*
Amean 234 20.4373 ( 0.00%) 19.6763 * 3.72%*
Amean 265 23.6053 ( 0.00%) 22.5557 ( 4.45%)
Amean 296 25.6803 ( 0.00%) 24.4273 * 4.88%*
Generally the results show greater improvement on larger LLCs. As
load increases but doesn't yet saturate the cpus (<30 groups), the
benefit is obvious, and even under extreme pressure the filter
still helps squeeze out some performance (remember that the
baseline already includes the first two patches).
tbench4 Throughput
baseline patched
(SNC on)
Hmean 1 301.54 ( 0.00%) 302.52 * 0.32%*
Hmean 2 607.81 ( 0.00%) 604.76 * -0.50%*
Hmean 4 1210.64 ( 0.00%) 1204.79 * -0.48%*
Hmean 8 2381.73 ( 0.00%) 2375.87 * -0.25%*
Hmean 16 4601.21 ( 0.00%) 4681.25 * 1.74%*
Hmean 32 7626.84 ( 0.00%) 7607.93 * -0.25%*
Hmean 64 9251.51 ( 0.00%) 8956.00 * -3.19%*
Hmean 128 20620.98 ( 0.00%) 19258.30 * -6.61%*
Hmean 256 20406.51 ( 0.00%) 20783.82 * 1.85%*
Hmean 384 20312.97 ( 0.00%) 20407.40 * 0.46%*
(SNC off)
Hmean 1 292.37 ( 0.00%) 286.27 * -2.09%*
Hmean 2 583.29 ( 0.00%) 574.82 * -1.45%*
Hmean 4 1155.83 ( 0.00%) 1137.27 * -1.61%*
Hmean 8 2297.63 ( 0.00%) 2261.98 * -1.55%*
Hmean 16 4562.44 ( 0.00%) 4430.95 * -2.88%*
Hmean 32 7425.69 ( 0.00%) 7341.70 * -1.13%*
Hmean 64 9021.77 ( 0.00%) 8954.61 * -0.74%*
Hmean 128 20257.76 ( 0.00%) 20198.82 * -0.29%*
Hmean 256 20043.54 ( 0.00%) 19785.57 * -1.29%*
Hmean 384 19528.03 ( 0.00%) 19956.96 * 2.20%*
The slight regression indicates that if a workload already has a
relatively high SIS success rate, e.g. tbench4 ~= 50%, the benefit
the filter brings is reduced while its cost remains. The benefit
might not balance the cost once the SIS success rate goes high
enough.
But the real world is more complicated. Different workloads can be
co-located on the same machine to share resources, and their
profiles can vary quite a lot, so the SIS success rate is not
predictable.
Conclusion
==========
The SIS filter is much more efficient than a linear scan under
certain circumstances, and even if it turns out to be a loss, the
filter can be disabled at any time.
Signed-off-by: Abel Wu <[email protected]>
---
include/linux/sched/topology.h | 18 ++++++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 83 ++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 3 ++
kernel/sched/topology.c | 9 +++-
6 files changed, 109 insertions(+), 6 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cc6089765b64..f8e6154b5c37 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -114,7 +114,20 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
enum sd_state state;
+ u64 seq;
int nr_idle_scan;
+
+ /*
+ * The SIS filter
+ *
+ * Record idle and sched-idle cpus to improve efficiency of
+ * the SIS domain scan.
+ *
+ * NOTE: this field is variable length. (Allocated dynamically
+ * by attaching extra space to the end of the structure,
+ * depending on how many CPUs the kernel has booted up with)
+ */
+ unsigned long icpus[];
};
struct sched_domain {
@@ -200,6 +213,11 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
return to_cpumask(sd->span);
}
+static inline struct cpumask *sched_domain_icpus(struct sched_domain *sd)
+{
+ return to_cpumask(sd->shared->icpus);
+}
+
extern void partition_sched_domains_locked(int ndoms_new,
cpumask_var_t doms_new[],
struct sched_domain_attr *dattr_new);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7d289d87acf7..a0cbf6c0d540 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9719,6 +9719,7 @@ void __init sched_init(void)
rq->wake_stamp = jiffies;
rq->wake_avg_idle = rq->avg_idle;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+ rq->last_idle_seq = 0;
INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25df73c7e73c..354e6e646a7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6262,6 +6262,57 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
DEFINE_STATIC_KEY_FALSE(sched_smt_present);
EXPORT_SYMBOL_GPL(sched_smt_present);
+static inline u64 filter_seq(int cpu)
+{
+ struct sched_domain_shared *sds;
+
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds)
+ return READ_ONCE(sds->seq);
+
+ return 0;
+}
+
+static inline void filter_set_cpu(int cpu, int nr)
+{
+ struct sched_domain_shared *sds;
+
+ if (!sched_feat(SIS_FILTER))
+ return;
+
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds) {
+ cpu_rq(nr)->last_idle_seq = filter_seq(cpu);
+ set_bit(nr, sds->icpus);
+ }
+}
+
+static inline bool filter_test_cpu(int cpu, int nr)
+{
+ if (!sched_feat(SIS_FILTER))
+ return true;
+
+ return cpu_rq(nr)->last_idle_seq >= filter_seq(cpu);
+}
+
+static inline void filter_reset(int cpu)
+{
+ struct sched_domain_shared *sds;
+
+ if (!sched_feat(SIS_FILTER))
+ return;
+
+ sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (sds) {
+ bitmap_zero(sds->icpus, nr_cpumask_bits);
+ /*
+ * The seq field is racy but at least we can
+ * use WRITE_ONCE() to prevent store tearing.
+ */
+ WRITE_ONCE(sds->seq, filter_seq(cpu) + 1);
+ }
+}
+
static inline void set_llc_state(int cpu, enum sd_state state)
{
struct sched_domain_shared *sds;
@@ -6285,6 +6336,8 @@ static inline enum sd_state get_llc_state(int cpu, enum sd_state def)
static inline void clear_idle_cpus(int cpu)
{
set_llc_state(cpu, sd_is_busy);
+ if (sched_smt_active())
+ filter_reset(cpu);
}
static inline bool test_idle_cpus(int cpu)
@@ -6314,13 +6367,15 @@ void __update_idle_core(struct rq *rq)
{
enum sd_state old, new = sd_has_icores;
int core = cpu_of(rq), cpu;
+ int exist;
if (rq->ttwu_pending)
return;
rcu_read_lock();
old = get_llc_state(core, sd_has_icores);
- if (old == sd_has_icores)
+ exist = filter_test_cpu(core, core);
+ if (old == sd_has_icores && exist)
goto unlock;
for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6329,11 +6384,26 @@ void __update_idle_core(struct rq *rq)
if (!available_idle_cpu(cpu)) {
new = sd_has_icpus;
- break;
+
+ /*
+ * If any cpu of this core has already
+ * been set in the filter, then this
+ * core is present and won't be missed
+ * during SIS domain scan.
+ */
+ if (exist)
+ break;
+ if (!sched_idle_cpu(cpu))
+ continue;
}
+
+ if (!exist)
+ exist = filter_test_cpu(core, cpu);
}
- if (old != new)
+ if (!exist)
+ filter_set_cpu(core, core);
+ if (old != sd_has_icores && old != new)
set_llc_state(core, new);
unlock:
rcu_read_unlock();
@@ -6447,8 +6517,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
if (!this_sd)
return -1;
- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
if (sched_feat(SIS_PROP) && !has_idle_core) {
u64 avg_cost, avg_idle, span_avg;
unsigned long now = jiffies;
@@ -6496,6 +6564,11 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
}
}
+ if (sched_smt_active() && sched_feat(SIS_FILTER))
+ cpumask_and(cpus, sched_domain_icpus(sd), p->cpus_ptr);
+ else
+ cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
for_each_cpu_wrap(cpu, cpus, target + 1) {
/*
* This might get the has_idle_cores hint cleared for a
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..1bebdb87c2f4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_FILTER, true)
/*
* Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b75ac74986fb..1fe1b152bc20 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1071,6 +1071,9 @@ struct rq {
/* This is used to determine avg_idle's max value */
u64 max_idle_balance_cost;
+ /* Cached filter generation when setting this cpu */
+ u64 last_idle_seq;
+
#ifdef CONFIG_HOTPLUG_CPU
struct rcuwait hotplug_wait;
#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..01dccaca0fa8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,6 +1641,13 @@ sd_init(struct sched_domain_topology_level *tl,
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+ cpumask_copy(sched_domain_icpus(sd), sd_span);
+
+ /*
+ * The cached per-cpu seq starts from 0, so initialize
+ * filter seq to 1 to discard all cached cpu state.
+ */
+ WRITE_ONCE(sd->shared->seq, 1);
}
sd->private = sdd;
@@ -2106,7 +2113,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sd, j) = sd;
- sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sds)
return -ENOMEM;
--
2.37.3
On 2022-09-09 at 13:53:02 +0800, Abel Wu wrote:
> The function __update_idle_core() assumes this cpu is idle, so it
> only checks its siblings to decide whether the resident core is
> idle, and updates the has_idle_cores hint if necessary. But the
> problem is that this cpu might not be idle at that moment any
> more, resulting in a misleading hint.
>
> It's not appropriate to make this check everywhere in the idle
> path, but checking just before the core update can make the
> has_idle_cores hint more reliable with negligible cost.
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> kernel/sched/fair.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7abe188a1533..fad289530e07 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6294,6 +6294,9 @@ void __update_idle_core(struct rq *rq)
> int core = cpu_of(rq);
> int cpu;
>
> + if (rq->ttwu_pending)
> + return;
> +
Is it to deal with the race condition? I'm thinking of the
following scenario: task p1 on rq1 is about to switch to idle.
However, when p1 reaches __update_idle_core(), someone on another
CPU tries to wake up p2 and leverages rq1 to queue p2, thus
setting the ttwu_pending flag on rq1. It is likely that rq1
becomes idle but soon finds that TIF_NEED_RESCHED is set, and thus
quits the idle loop. As a result rq1 will not be idle and we would
get a false positive here.
thanks,
Chenyu
> rcu_read_lock();
> if (test_idle_cores(core, true))
> goto unlock;
> --
> 2.37.3
>
On 9/9/22 6:09 PM, Chen Yu wrote:
> On 2022-09-09 at 13:53:02 +0800, Abel Wu wrote:
>> The function __update_idle_core() assumes this cpu is idle, so it
>> only checks its siblings to decide whether the resident core is
>> idle, and updates the has_idle_cores hint if necessary. But the
>> problem is that this cpu might not be idle at that moment any
>> more, resulting in a misleading hint.
>>
>> It's not appropriate to make this check everywhere in the idle
>> path, but checking just before the core update can make the
>> has_idle_cores hint more reliable with negligible cost.
>>
>> Signed-off-by: Abel Wu <[email protected]>
>> ---
>> kernel/sched/fair.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7abe188a1533..fad289530e07 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6294,6 +6294,9 @@ void __update_idle_core(struct rq *rq)
>> int core = cpu_of(rq);
>> int cpu;
>>
>> + if (rq->ttwu_pending)
>> + return;
>> +
> Is it to deal with the race condition? I'm thinking of the
> following scenario: task p1 on rq1 is about to switch to idle.
> However, when p1 reaches __update_idle_core(), someone on another
> CPU tries to wake up p2 and leverages rq1 to queue p2, thus
> setting the ttwu_pending flag on rq1. It is likely that rq1
> becomes idle but soon finds that TIF_NEED_RESCHED is set, and thus
> quits the idle loop. As a result rq1 will not be idle and we would
> get a false positive here.
Yes, exactly as you said.
On Fri, 2022-09-09 at 13:53 +0800, Abel Wu wrote:
> The function __update_idle_core() assumes this cpu is idle, so it
> only checks its siblings to decide whether the resident core is
> idle, and updates the has_idle_cores hint if necessary. But the
> problem is that this cpu might not be idle at that moment any
> more, resulting in a misleading hint.
>
> It's not appropriate to make this check everywhere in the idle
> path, but checking just before the core update can make the
> has_idle_cores hint more reliable with negligible cost.
>
Reviewed-by: Tim Chen <[email protected]>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> kernel/sched/fair.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7abe188a1533..fad289530e07 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6294,6 +6294,9 @@ void __update_idle_core(struct rq *rq)
> int core = cpu_of(rq);
> int cpu;
>
> + if (rq->ttwu_pending)
> + return;
> +
> rcu_read_lock();
> if (test_idle_cores(core, true))
> goto unlock;