2022-10-19 13:53:55

by Abel Wu

Subject: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

This patchset tries to improve SIS scan efficiency by recording idle
cpus in a per-LLC cpumask, which is then used as the target cpuset
of the domain scan. The cpus are recorded at CORE granularity to
avoid tasks being stacked on the same core.

v5 -> v6:
- Rename SIS_FILTER to SIS_CORE, since it can only be activated when
SMT is enabled and the new name better describes the behavior of
CORE-granularity update & load delivery.
- Removed the limited scan for idle cores, since strategies such as
limited or scaled depth might be better discussed in a separate
thread. But the full scan for idle cores when the LLC is overloaded
is kept, because SIS_CORE can greatly reduce the overhead of a full
scan in that case.
- Removed the sd_is_busy state, which indicated that an LLC is fully
busy so the SIS domain scan could be safely skipped. I would prefer
to leave this to SIS_UTIL.
- The filter generation mechanism is replaced by in-place updates
during the domain scan to better deal with partial scan failures.
- Collected Reviewed-bys from Tim Chen

v4 -> v5:
- Add limited scan for idle cores when overloaded, suggested by Mel
- Split out several patches that are irrelevant to this scope
- Add a quick check on ttwu_pending before the core update
- Wrap the filter in the SIS_FILTER feature, suggested by Chen Yu
- Move the main filter logic to the idle path, because the newidle
balance can bail out early if rq->avg_idle is small enough, losing
the chance to update the filter.

v3 -> v4:
- Update filter in load_balance rather than in the tick
- Now the filter contains unoccupied cpus rather than overloaded ones
- Added mechanisms to deal with the false positive cases

v2 -> v3:
- Removed sched-idle balance feature and focus on SIS
- Take non-CFS tasks into consideration
- Several fixes/improvement suggested by Josh Don

v1 -> v2:
- Several optimizations on sched-idle balancing
- Ignore asym topos in can_migrate_task
- Add more benchmarks including SIS efficiency
- Re-organize patch as suggested by Mel Gorman

Abel Wu (4):
sched/fair: Skip core update if task pending
sched/fair: Ignore SIS_UTIL when has_idle_core
sched/fair: Introduce SIS_CORE
sched/fair: Deal with SIS scan failures

include/linux/sched/topology.h | 15 ++++
kernel/sched/fair.c | 122 +++++++++++++++++++++++++++++----
kernel/sched/features.h | 7 ++
kernel/sched/sched.h | 3 +
kernel/sched/topology.c | 8 ++-
5 files changed, 141 insertions(+), 14 deletions(-)

--
2.37.3


2022-10-19 14:01:27

by Abel Wu

Subject: [PATCH v6 2/4] sched/fair: Ignore SIS_UTIL when has_idle_core

When SIS_UTIL is enabled, the SIS domain scan will be skipped if the
LLC is overloaded, even if the has_idle_core hint is true. Since idle
load balancing is only triggered at tick boundaries, the idle cores
can stay cold for a whole tick period while some of the other cpus
might be overloaded.

Give the scan a chance to find idle cores if the hint implies the
effort is worthwhile.

Benchmark
=========

All of the benchmarks are run inside a normal cpu cgroup in a clean
environment with cpu turbo disabled. The test machines are:

A) A dual-socket Intel Xeon(R) Platinum 8260 machine with SNC
disabled, so there are 2 NUMA nodes, each with 24C/48T. Each NUMA
node shares one LLC.

B) A dual-socket AMD EPYC 7Y83 64-Core Processor machine with NPS1
enabled, so there are 2 NUMA nodes, each with 64C/128T. Each NUMA
node contains several LLCs of 16 cpus each.

Based on tip sched/core fb04563d1cae (v5.19.0).

Results
=======

hackbench-process-pipes
(A) vanilla patched
Amean 1 0.2767 ( 0.00%) 0.2540 ( 8.19%)
Amean 4 0.6080 ( 0.00%) 0.6220 ( -2.30%)
Amean 7 0.7923 ( 0.00%) 0.8020 ( -1.22%)
Amean 12 1.3917 ( 0.00%) 1.1823 ( 15.04%)
Amean 21 3.6747 ( 0.00%) 2.7717 ( 24.57%)
Amean 30 6.7070 ( 0.00%) 5.1200 * 23.66%*
Amean 48 9.3537 ( 0.00%) 8.5890 * 8.18%*
Amean 79 11.6627 ( 0.00%) 11.2580 ( 3.47%)
Amean 110 13.4473 ( 0.00%) 13.1283 ( 2.37%)
Amean 141 16.4747 ( 0.00%) 15.5967 * 5.33%*
Amean 172 19.0000 ( 0.00%) 18.1153 * 4.66%*
Amean 203 21.4200 ( 0.00%) 21.1340 ( 1.34%)
Amean 234 24.2250 ( 0.00%) 23.8227 ( 1.66%)
Amean 265 27.2400 ( 0.00%) 26.8293 ( 1.51%)
Amean 296 30.6937 ( 0.00%) 29.5800 * 3.63%*
(B)
Amean 1 0.3543 ( 0.00%) 0.3650 ( -3.01%)
Amean 4 0.4623 ( 0.00%) 0.4837 ( -4.61%)
Amean 7 0.5117 ( 0.00%) 0.4997 ( 2.35%)
Amean 12 0.5707 ( 0.00%) 0.5863 ( -2.75%)
Amean 21 0.9717 ( 0.00%) 0.8930 * 8.10%*
Amean 30 1.4423 ( 0.00%) 1.2530 ( 13.13%)
Amean 48 2.3520 ( 0.00%) 1.9743 * 16.06%*
Amean 79 5.7193 ( 0.00%) 3.4933 * 38.92%*
Amean 110 6.9893 ( 0.00%) 5.5963 * 19.93%*
Amean 141 9.1103 ( 0.00%) 7.6550 ( 15.97%)
Amean 172 10.2490 ( 0.00%) 8.8323 * 13.82%*
Amean 203 11.3727 ( 0.00%) 10.8683 ( 4.43%)
Amean 234 12.7627 ( 0.00%) 11.8683 ( 7.01%)
Amean 265 13.8947 ( 0.00%) 13.4717 ( 3.04%)
Amean 296 14.1093 ( 0.00%) 13.8130 ( 2.10%)

The results can be roughly divided into 3 sections:
- busy, e.g. <12 groups on A and <21 groups on B
- overloaded, e.g. 12~48 groups on A and 21~172 groups on B
- saturated, the rest

For the busy part the results are neutral, with slight wins or
losses. This is probably because idle cpus are still easy to find, so
the effort spent locating an idle core brings limited benefit that is
easily negated by the cost of a full scan.

For the overloaded but not yet saturated part, a great improvement
can be seen, since actively putting idle cores to work exploits the
cpu resources better. But once all cpus are totally saturated,
scanning for idle cores doesn't help much.

One concern with the full scan is that its cost grows in larger LLCs,
but the test results still look positive. One possible reason is the
low SIS success rate (<2%), so the effort paid is indeed traded for
efficiency.

tbench4 Throughput
(A) vanilla patched
Hmean 1 275.61 ( 0.00%) 280.53 * 1.78%*
Hmean 2 541.28 ( 0.00%) 561.94 * 3.82%*
Hmean 4 1102.62 ( 0.00%) 1109.14 * 0.59%*
Hmean 8 2149.58 ( 0.00%) 2229.39 * 3.71%*
Hmean 16 4305.40 ( 0.00%) 4383.06 * 1.80%*
Hmean 32 7088.36 ( 0.00%) 7124.14 * 0.50%*
Hmean 64 8609.16 ( 0.00%) 8815.41 * 2.40%*
Hmean 128 19304.92 ( 0.00%) 19519.35 * 1.11%*
Hmean 256 19147.04 ( 0.00%) 19392.24 * 1.28%*
Hmean 384 18970.86 ( 0.00%) 19201.07 * 1.21%*
(B)
Hmean 1 519.62 ( 0.00%) 515.98 * -0.70%*
Hmean 2 1042.92 ( 0.00%) 1031.54 * -1.09%*
Hmean 4 1959.10 ( 0.00%) 1953.44 * -0.29%*
Hmean 8 3842.82 ( 0.00%) 3622.52 * -5.73%*
Hmean 16 6768.50 ( 0.00%) 6545.82 * -3.29%*
Hmean 32 12589.50 ( 0.00%) 13697.73 * 8.80%*
Hmean 64 24797.23 ( 0.00%) 25589.59 * 3.20%*
Hmean 128 38036.66 ( 0.00%) 35667.64 * -6.23%*
Hmean 256 65069.93 ( 0.00%) 65215.85 * 0.22%*
Hmean 512 61147.99 ( 0.00%) 66035.57 * 7.99%*
Hmean 1024 48542.73 ( 0.00%) 53391.64 * 9.99%*

The tbench4 test has a ~40% success rate on the used target, prev or
recent cpus, and a ~45% total success rate. The core scan is also not
very frequent, so the benefit this patch brings is limited, though
some gains can still be seen.

netperf

netperf has an almost 100% success rate on the used target, prev or
recent cpus, so the domain scan is generally not performed and is not
affected by this patch.

Conclusion
==========

Doing a full scan for idle cores is generally good for making better
use of the cpu resources.

Signed-off-by: Abel Wu <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Tested-by: Chen Yu <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7f82fa92c5b..7b668e16812e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6436,7 +6436,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
time = cpu_clock(this);
}

- if (sched_feat(SIS_UTIL)) {
+ if (sched_feat(SIS_UTIL) && !has_idle_core) {
sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
if (sd_share) {
/* because !--nr is the condition to stop scan */
--
2.37.3

2022-11-04 08:31:47

by Abel Wu

Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

Ping :)

On 10/19/22 8:28 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stack on same core.
>
> v5 -> v6:
> - Rename SIS_FILTER to SIS_CORE as it can only be activated when
> SMT is enabled and better describes the behavior of CORE granule
> update & load delivery.
> - Removed the part of limited scan for idle cores since it might be
> better to open another thread to discuss the strategies such as
> limited or scaled depth. But keep the part of full scan for idle
> cores when LLC is overloaded because SIS_CORE can greatly reduce
> the overhead of full scan in such case.
> - Removed the state of sd_is_busy which indicates an LLC is fully
> busy and we can safely skip the SIS domain scan. I would prefer
> leave this to SIS_UTIL.
> - The filter generation mechanism is replaced by in-place updates
> during domain scan to better deal with partial scan failures.
> - Collect Reviewed-bys from Tim Chen
>
> ...
>

2022-11-14 06:06:50

by K Prateek Nayak

Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

Hello Abel,

Sorry for the delay. I've tested the patch on a dual socket Zen3 system
(2 x 64C/128T)

tl;dr

o I do not notice any regressions with the standard benchmarks.
o schbench sees a nice improvement in tail latency when the number
of workers equals the number of cores in the system in NPS1 and
NPS2 mode. (Marked with "^")
o A few data points show improvements in tbench in NPS1 and NPS2
mode. (Marked with "^")

I'm still in the process of running larger workloads. If there is any
specific workload you would like me to run on the test system, please
do let me know. Below is the detailed report:

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255

Benchmark Results:

Kernel versions:
- tip: 5.19.0 tip sched/core
- sis_core: 5.19.0 tip sched/core + this series

When we started testing, the tip was at:
commit fdf756f71271 ("sched: Fix more TASK_state comparisons")

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test: tip sis_core
1-groups: 4.06 (0.00 pct) 4.26 (-4.92 pct) *
1-groups: 4.14 (0.00 pct) 4.09 (1.20 pct) [Verification Run]
2-groups: 4.76 (0.00 pct) 4.71 (1.05 pct)
4-groups: 5.22 (0.00 pct) 5.11 (2.10 pct)
8-groups: 5.35 (0.00 pct) 5.31 (0.74 pct)
16-groups: 7.21 (0.00 pct) 6.80 (5.68 pct)

o NPS2

Test: tip sis_core
1-groups: 4.09 (0.00 pct) 4.08 (0.24 pct)
2-groups: 4.70 (0.00 pct) 4.69 (0.21 pct)
4-groups: 5.05 (0.00 pct) 4.92 (2.57 pct)
8-groups: 5.35 (0.00 pct) 5.26 (1.68 pct)
16-groups: 6.37 (0.00 pct) 6.34 (0.47 pct)

o NPS4

Test: tip sis_core
1-groups: 4.07 (0.00 pct) 3.99 (1.96 pct)
2-groups: 4.65 (0.00 pct) 4.59 (1.29 pct)
4-groups: 5.13 (0.00 pct) 5.00 (2.53 pct)
8-groups: 5.47 (0.00 pct) 5.43 (0.73 pct)
16-groups: 6.82 (0.00 pct) 6.56 (3.81 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers: tip sis_core
1: 33.00 (0.00 pct) 33.00 (0.00 pct)
2: 35.00 (0.00 pct) 35.00 (0.00 pct)
4: 39.00 (0.00 pct) 38.00 (2.56 pct)
8: 49.00 (0.00 pct) 48.00 (2.04 pct)
16: 63.00 (0.00 pct) 66.00 (-4.76 pct)
32: 109.00 (0.00 pct) 107.00 (1.83 pct)
64: 208.00 (0.00 pct) 216.00 (-3.84 pct)
128: 559.00 (0.00 pct) 469.00 (16.10 pct) ^
256: 45888.00 (0.00 pct) 47552.00 (-3.62 pct)
512: 80000.00 (0.00 pct) 79744.00 (0.32 pct)

o NPS2

#workers: tip sis_core
1: 30.00 (0.00 pct) 32.00 (-6.66 pct)
2: 37.00 (0.00 pct) 34.00 (8.10 pct)
4: 39.00 (0.00 pct) 36.00 (7.69 pct)
8: 51.00 (0.00 pct) 49.00 (3.92 pct)
16: 67.00 (0.00 pct) 66.00 (1.49 pct)
32: 117.00 (0.00 pct) 109.00 (6.83 pct)
64: 216.00 (0.00 pct) 213.00 (1.38 pct)
128: 529.00 (0.00 pct) 465.00 (12.09 pct) ^
256: 47040.00 (0.00 pct) 46528.00 (1.08 pct)
512: 84864.00 (0.00 pct) 83584.00 (1.50 pct)

o NPS4

#workers: tip sis_core
1: 23.00 (0.00 pct) 28.00 (-21.73 pct)
2: 28.00 (0.00 pct) 36.00 (-28.57 pct)
4: 41.00 (0.00 pct) 43.00 (-4.87 pct)
8: 60.00 (0.00 pct) 48.00 (20.00 pct)
16: 71.00 (0.00 pct) 69.00 (2.81 pct)
32: 117.00 (0.00 pct) 115.00 (1.70 pct)
64: 227.00 (0.00 pct) 228.00 (-0.44 pct)
128: 545.00 (0.00 pct) 545.00 (0.00 pct)
256: 45632.00 (0.00 pct) 47680.00 (-4.48 pct)
512: 81024.00 (0.00 pct) 76416.00 (5.68 pct)

Note: For lower worker counts, schbench can show run-to-run
variation depending on external factors, so regressions at lower
worker counts can be ignored. The results are included to spot
any large blow-up in the tail latency at larger worker counts.

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients: tip sis_core
1 578.37 (0.00 pct) 582.09 (0.64 pct)
2 1062.09 (0.00 pct) 1063.95 (0.17 pct)
4 1800.62 (0.00 pct) 1879.18 (4.36 pct)
8 3211.02 (0.00 pct) 3220.44 (0.29 pct)
16 4848.92 (0.00 pct) 4890.08 (0.84 pct)
32 9091.36 (0.00 pct) 9721.13 (6.92 pct) ^
64 15454.01 (0.00 pct) 15124.42 (-2.13 pct)
128 3511.33 (0.00 pct) 14314.79 (307.67 pct)
128 19910.99 (0.00pct) 19935.61 (0.12 pct) [Verification Run]
256 50019.32 (0.00 pct) 50708.24 (1.37 pct)
512 44317.68 (0.00 pct) 44787.48 (1.06 pct)
1024 41200.85 (0.00 pct) 42079.29 (2.13 pct)

o NPS2

Clients: tip sis_core
1 576.05 (0.00 pct) 579.18 (0.54 pct)
2 1037.68 (0.00 pct) 1070.49 (3.16 pct)
4 1818.13 (0.00 pct) 1860.22 (2.31 pct)
8 3004.16 (0.00 pct) 3087.09 (2.76 pct)
16 4520.11 (0.00 pct) 4789.53 (5.96 pct)
32 8624.23 (0.00 pct) 9439.50 (9.45 pct) ^
64 14886.75 (0.00 pct) 15004.96 (0.79 pct)
128 20602.00 (0.00 pct) 17730.31 (-13.93 pct) *
128 20602.00 (0.00 pct) 19585.20 (-4.93 pct) [Verification Run]
256 45566.83 (0.00 pct) 47922.70 (5.17 pct)
512 42717.49 (0.00 pct) 43809.68 (2.55 pct)
1024 40936.61 (0.00 pct) 40787.71 (-0.36 pct)

o NPS4

Clients: tip sis_core
1 576.36 (0.00 pct) 580.83 (0.77 pct)
2 1044.26 (0.00 pct) 1066.50 (2.12 pct)
4 1839.77 (0.00 pct) 1867.56 (1.51 pct)
8 3043.53 (0.00 pct) 3115.17 (2.35 pct)
16 5207.54 (0.00 pct) 4847.53 (-6.91 pct) *
16 4722.56 (0.00 pct) 4811.29 (1.87 pct) [Verification Run]
32 9263.86 (0.00 pct) 9478.68 (2.31 pct)
64 14959.66 (0.00 pct) 15267.39 (2.05 pct)
128 20698.65 (0.00 pct) 20432.19 (-1.28 pct)
256 46666.21 (0.00 pct) 46664.81 (0.00 pct)
512 41532.80 (0.00 pct) 44241.12 (6.52 pct)
1024 39459.49 (0.00 pct) 41043.22 (4.01 pct)

Note: On the tested kernel, with 128 clients, tbench can run
into a bottleneck during C2 exit. More details can be found at:
https://lore.kernel.org/lkml/[email protected]/
This issue has been fixed in v6.0, but the fix was not part of
the tip kernel when I started testing. This data point has been
rerun with C2 disabled to get representative results.

~~~~~~~~~~
~ Stream ~
~~~~~~~~~~

o NPS1

-> 10 Runs:

Test: tip sis_core
Copy: 328419.14 (0.00 pct) 337857.83 (2.87 pct)
Scale: 206071.21 (0.00 pct) 212133.82 (2.94 pct)
Add: 235271.48 (0.00 pct) 243811.97 (3.63 pct)
Triad: 253175.80 (0.00 pct) 252333.43 (-0.33 pct)

-> 100 Runs:

Test: tip sis_core
Copy: 328209.61 (0.00 pct) 339817.27 (3.53 pct)
Scale: 216310.13 (0.00 pct) 218635.16 (1.07 pct)
Add: 244417.83 (0.00 pct) 245641.47 (0.50 pct)
Triad: 237508.83 (0.00 pct) 255387.28 (7.52 pct)

o NPS2

-> 10 Runs:

Test: tip sis_core
Copy: 336503.88 (0.00 pct) 339684.21 (0.94 pct)
Scale: 218035.23 (0.00 pct) 217601.11 (-0.19 pct)
Add: 257677.42 (0.00 pct) 258608.34 (0.36 pct)
Triad: 268872.37 (0.00 pct) 272548.09 (1.36 pct)

-> 100 Runs:

Test: tip sis_core
Copy: 332304.34 (0.00 pct) 341565.75 (2.78 pct)
Scale: 223421.60 (0.00 pct) 224267.40 (0.37 pct)
Add: 252363.56 (0.00 pct) 254926.98 (1.01 pct)
Triad: 266687.56 (0.00 pct) 270782.81 (1.53 pct)

o NPS4

-> 10 Runs:

Test: tip sis_core
Copy: 353515.62 (0.00 pct) 342060.85 (-3.24 pct)
Scale: 228854.37 (0.00 pct) 218262.41 (-4.62 pct)
Add: 254942.12 (0.00 pct) 241975.90 (-5.08 pct)
Triad: 270521.87 (0.00 pct) 257686.71 (-4.74 pct)

-> 100 Runs:

Test: tip sis_core
Copy: 374520.81 (0.00 pct) 369353.13 (-1.37 pct)
Scale: 246280.23 (0.00 pct) 253881.69 (3.08 pct)
Add: 262772.72 (0.00 pct) 266484.58 (1.41 pct)
Triad: 283740.92 (0.00 pct) 279981.18 (-1.32 pct)

On 10/19/2022 5:58 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stack on same core.
>
> v5 -> v6:
> - Rename SIS_FILTER to SIS_CORE as it can only be activated when
> SMT is enabled and better describes the behavior of CORE granule
> update & load delivery.
> - Removed the part of limited scan for idle cores since it might be
> better to open another thread to discuss the strategies such as
> limited or scaled depth. But keep the part of full scan for idle
> cores when LLC is overloaded because SIS_CORE can greatly reduce
> the overhead of full scan in such case.
> - Removed the state of sd_is_busy which indicates an LLC is fully
> busy and we can safely skip the SIS domain scan. I would prefer
> leave this to SIS_UTIL.
> - The filter generation mechanism is replaced by in-place updates
> during domain scan to better deal with partial scan failures.
> - Collect Reviewed-bys from Tim Chen
>
> v4 -> v5:
> - Add limited scan for idle cores when overloaded, suggested by Mel
> - Split out several patches since they are irrelevant to this scope
> - Add quick check on ttwu_pending before core update
> - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
> - Move the main filter logic to the idle path, because the newidle
> balance can bail out early if rq->avg_idle is small enough and
> lose chances to update the filter.
>
> v3 -> v4:
> - Update filter in load_balance rather than in the tick
> - Now the filter contains unoccupied cpus rather than overloaded ones
> - Added mechanisms to deal with the false positive cases
>
> v2 -> v3:
> - Removed sched-idle balance feature and focus on SIS
> - Take non-CFS tasks into consideration
> - Several fixes/improvement suggested by Josh Don
>
> v1 -> v2:
> - Several optimizations on sched-idle balancing
> - Ignore asym topos in can_migrate_task
> - Add more benchmarks including SIS efficiency
> - Re-organize patch as suggested by Mel Gorman
>
> Abel Wu (4):
> sched/fair: Skip core update if task pending
> sched/fair: Ignore SIS_UTIL when has_idle_core
> sched/fair: Introduce SIS_CORE
> sched/fair: Deal with SIS scan failures
>
> include/linux/sched/topology.h | 15 ++++
> kernel/sched/fair.c | 122 +++++++++++++++++++++++++++++----
> kernel/sched/features.h | 7 ++
> kernel/sched/sched.h | 3 +
> kernel/sched/topology.c | 8 ++-
> 5 files changed, 141 insertions(+), 14 deletions(-)
>

I ran pgbench from mmtests but realised there is too much run-to-run
variation on the system. I'm planning to run the MongoDB benchmark,
which is more stable on this system, plus a couple more workloads,
but the initial results look good. I'll get back with results later
this week or early next week. Meanwhile, if you need data for any
specific workload on the test system, please do let me know.

--
Thanks and Regards,
Prateek

2022-11-15 08:39:35

by Abel Wu

Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

Hi Prateek, thanks very much for your detailed testing!

On 11/14/22 1:45 PM, K Prateek Nayak wrote:
> Hello Abel,
>
> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
> (2 x 64C/128T)
>
> tl;dr
>
> o I do not notice any regressions with the standard benchmarks.
> o schbench sees a nice improvement to the tail latency when the number
> of worker are equal to the number of cores in the system in NPS1 and
> NPS2 mode. (Marked with "^")
> o Few data points show improvements in tbench in NPS1 and NPS2 mode.
> (Marked with "^")
>
> I'm still in the process of running larger workloads. If there is any
> specific workload you would like me to run on the test system, please
> do let me know. Below is the detailed report:

Nothing particular comes to mind, and I think testing larger
workloads is great. Thanks!

>
> Following are the results from running standard benchmarks on a
> dual socket Zen3 (2 x 64C/128T) machine configured in different
> NPS modes.
>
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 socket.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 223-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 socket.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 223-231
> Node 7: 112-127, 232-255
>
> Benchmark Results:
>
> Kernel versions:
> - tip: 5.19.0 tip sched/core
> - sis_core: 5.19.0 tip sched/core + this series
>
> When we started testing, the tip was at:
> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
>
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test: tip sis_core
> 1-groups: 4.06 (0.00 pct) 4.26 (-4.92 pct) *
> 1-groups: 4.14 (0.00 pct) 4.09 (1.20 pct) [Verification Run]
> 2-groups: 4.76 (0.00 pct) 4.71 (1.05 pct)
> 4-groups: 5.22 (0.00 pct) 5.11 (2.10 pct)
> 8-groups: 5.35 (0.00 pct) 5.31 (0.74 pct)
> 16-groups: 7.21 (0.00 pct) 6.80 (5.68 pct)
>
> o NPS2
>
> Test: tip sis_core
> 1-groups: 4.09 (0.00 pct) 4.08 (0.24 pct)
> 2-groups: 4.70 (0.00 pct) 4.69 (0.21 pct)
> 4-groups: 5.05 (0.00 pct) 4.92 (2.57 pct)
> 8-groups: 5.35 (0.00 pct) 5.26 (1.68 pct)
> 16-groups: 6.37 (0.00 pct) 6.34 (0.47 pct)
>
> o NPS4
>
> Test: tip sis_core
> 1-groups: 4.07 (0.00 pct) 3.99 (1.96 pct)
> 2-groups: 4.65 (0.00 pct) 4.59 (1.29 pct)
> 4-groups: 5.13 (0.00 pct) 5.00 (2.53 pct)
> 8-groups: 5.47 (0.00 pct) 5.43 (0.73 pct)
> 16-groups: 6.82 (0.00 pct) 6.56 (3.81 pct)

Although each cpu gets ~2.5 tasks with 16 groups, which can be
considered overloaded, when I tested on an AMD EPYC 7Y83 machine the
total cpu usage was only ~82% (with some older kernel version), so
there was still lots of idle time.

I guess the cutoff at 16 groups is because that is loaded enough
compared to real workloads, so testing more groups might just be a
waste of time?

Thanks & Best Regards,
Abel

>
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
>
> o NPS1
>
> #workers: tip sis_core
> 1: 33.00 (0.00 pct) 33.00 (0.00 pct)
> 2: 35.00 (0.00 pct) 35.00 (0.00 pct)
> 4: 39.00 (0.00 pct) 38.00 (2.56 pct)
> 8: 49.00 (0.00 pct) 48.00 (2.04 pct)
> 16: 63.00 (0.00 pct) 66.00 (-4.76 pct)
> 32: 109.00 (0.00 pct) 107.00 (1.83 pct)
> 64: 208.00 (0.00 pct) 216.00 (-3.84 pct)
> 128: 559.00 (0.00 pct) 469.00 (16.10 pct) ^
> 256: 45888.00 (0.00 pct) 47552.00 (-3.62 pct)
> 512: 80000.00 (0.00 pct) 79744.00 (0.32 pct)
>
> o NPS2
>
> #workers: =tip sis_core
> 1: 30.00 (0.00 pct) 32.00 (-6.66 pct)
> 2: 37.00 (0.00 pct) 34.00 (8.10 pct)
> 4: 39.00 (0.00 pct) 36.00 (7.69 pct)
> 8: 51.00 (0.00 pct) 49.00 (3.92 pct)
> 16: 67.00 (0.00 pct) 66.00 (1.49 pct)
> 32: 117.00 (0.00 pct) 109.00 (6.83 pct)
> 64: 216.00 (0.00 pct) 213.00 (1.38 pct)
> 128: 529.00 (0.00 pct) 465.00 (12.09 pct) ^
> 256: 47040.00 (0.00 pct) 46528.00 (1.08 pct)
> 512: 84864.00 (0.00 pct) 83584.00 (1.50 pct)
>
> o NPS4
>
> #workers: tip sis_core
> 1: 23.00 (0.00 pct) 28.00 (-21.73 pct)
> 2: 28.00 (0.00 pct) 36.00 (-28.57 pct)
> 4: 41.00 (0.00 pct) 43.00 (-4.87 pct)
> 8: 60.00 (0.00 pct) 48.00 (20.00 pct)
> 16: 71.00 (0.00 pct) 69.00 (2.81 pct)
> 32: 117.00 (0.00 pct) 115.00 (1.70 pct)
> 64: 227.00 (0.00 pct) 228.00 (-0.44 pct)
> 128: 545.00 (0.00 pct) 545.00 (0.00 pct)
> 256: 45632.00 (0.00 pct) 47680.00 (-4.48 pct)
> 512: 81024.00 (0.00 pct) 76416.00 (5.68 pct)
>
> Note: For lower worker count, schbench can show run to
> run variation depending on external factors. Regression
> for lower worker count can be ignored. The results are
> included to spot any large blow up in the tail latency
> for larger worker count.
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> o NPS1
>
> Clients: tip sis_core
> 1 578.37 (0.00 pct) 582.09 (0.64 pct)
> 2 1062.09 (0.00 pct) 1063.95 (0.17 pct)
> 4 1800.62 (0.00 pct) 1879.18 (4.36 pct)
> 8 3211.02 (0.00 pct) 3220.44 (0.29 pct)
> 16 4848.92 (0.00 pct) 4890.08 (0.84 pct)
> 32 9091.36 (0.00 pct) 9721.13 (6.92 pct) ^
> 64 15454.01 (0.00 pct) 15124.42 (-2.13 pct)
> 128 3511.33 (0.00 pct) 14314.79 (307.67 pct)
> 128 19910.99 (0.00pct) 19935.61 (0.12 pct) [Verification Run]
> 256 50019.32 (0.00 pct) 50708.24 (1.37 pct)
> 512 44317.68 (0.00 pct) 44787.48 (1.06 pct)
> 1024 41200.85 (0.00 pct) 42079.29 (2.13 pct)
>
> o NPS2
>
> Clients: tip sis_core
> 1 576.05 (0.00 pct) 579.18 (0.54 pct)
> 2 1037.68 (0.00 pct) 1070.49 (3.16 pct)
> 4 1818.13 (0.00 pct) 1860.22 (2.31 pct)
> 8 3004.16 (0.00 pct) 3087.09 (2.76 pct)
> 16 4520.11 (0.00 pct) 4789.53 (5.96 pct)
> 32 8624.23 (0.00 pct) 9439.50 (9.45 pct) ^
> 64 14886.75 (0.00 pct) 15004.96 (0.79 pct)
> 128 20602.00 (0.00 pct) 17730.31 (-13.93 pct) *
> 128 20602.00 (0.00 pct) 19585.20 (-4.93 pct) [Verification Run]
> 256 45566.83 (0.00 pct) 47922.70 (5.17 pct)
> 512 42717.49 (0.00 pct) 43809.68 (2.55 pct)
> 1024 40936.61 (0.00 pct) 40787.71 (-0.36 pct)
>
> o NPS4
>
> Clients: tip sis_core
> 1 576.36 (0.00 pct) 580.83 (0.77 pct)
> 2 1044.26 (0.00 pct) 1066.50 (2.12 pct)
> 4 1839.77 (0.00 pct) 1867.56 (1.51 pct)
> 8 3043.53 (0.00 pct) 3115.17 (2.35 pct)
> 16 5207.54 (0.00 pct) 4847.53 (-6.91 pct) *
> 16 4722.56 (0.00 pct) 4811.29 (1.87 pct) [Verification Run]
> 32 9263.86 (0.00 pct) 9478.68 (2.31 pct)
> 64 14959.66 (0.00 pct) 15267.39 (2.05 pct)
> 128 20698.65 (0.00 pct) 20432.19 (-1.28 pct)
> 256 46666.21 (0.00 pct) 46664.81 (0.00 pct)
> 512 41532.80 (0.00 pct) 44241.12 (6.52 pct)
> 1024 39459.49 (0.00 pct) 41043.22 (4.01 pct)
>
> Note: On the tested kernel, with 128 clients, tbench can
> run into a bottleneck during C2 exit. More details can be
> found at:
> https://lore.kernel.org/lkml/[email protected]/
> This issue has been fixed in v6.0 but was not part of the
> tip kernel when I started testing. This data point has
> been rerun with C2 disabled to get representative results.
>
> ~~~~~~~~~~
> ~ Stream ~
> ~~~~~~~~~~
>
> o NPS1
>
> -> 10 Runs:
>
> Test: tip sis_core
> Copy: 328419.14 (0.00 pct) 337857.83 (2.87 pct)
> Scale: 206071.21 (0.00 pct) 212133.82 (2.94 pct)
> Add: 235271.48 (0.00 pct) 243811.97 (3.63 pct)
> Triad: 253175.80 (0.00 pct) 252333.43 (-0.33 pct)
>
> -> 100 Runs:
>
> Test: tip sis_core
> Copy: 328209.61 (0.00 pct) 339817.27 (3.53 pct)
> Scale: 216310.13 (0.00 pct) 218635.16 (1.07 pct)
> Add: 244417.83 (0.00 pct) 245641.47 (0.50 pct)
> Triad: 237508.83 (0.00 pct) 255387.28 (7.52 pct)
>
> o NPS2
>
> -> 10 Runs:
>
> Test: tip sis_core
> Copy: 336503.88 (0.00 pct) 339684.21 (0.94 pct)
> Scale: 218035.23 (0.00 pct) 217601.11 (-0.19 pct)
> Add: 257677.42 (0.00 pct) 258608.34 (0.36 pct)
> Triad: 268872.37 (0.00 pct) 272548.09 (1.36 pct)
>
> -> 100 Runs:
>
> Test: tip sis_core
> Copy: 332304.34 (0.00 pct) 341565.75 (2.78 pct)
> Scale: 223421.60 (0.00 pct) 224267.40 (0.37 pct)
> Add: 252363.56 (0.00 pct) 254926.98 (1.01 pct)
> Triad: 266687.56 (0.00 pct) 270782.81 (1.53 pct)
>
> o NPS4
>
> -> 10 Runs:
>
> Test: tip sis_core
> Copy: 353515.62 (0.00 pct) 342060.85 (-3.24 pct)
> Scale: 228854.37 (0.00 pct) 218262.41 (-4.62 pct)
> Add: 254942.12 (0.00 pct) 241975.90 (-5.08 pct)
> Triad: 270521.87 (0.00 pct) 257686.71 (-4.74 pct)
>
> -> 100 Runs:
>
> Test: tip sis_core
> Copy: 374520.81 (0.00 pct) 369353.13 (-1.37 pct)
> Scale: 246280.23 (0.00 pct) 253881.69 (3.08 pct)
> Add: 262772.72 (0.00 pct) 266484.58 (1.41 pct)
> Triad: 283740.92 (0.00 pct) 279981.18 (-1.32 pct)
>
> On 10/19/2022 5:58 PM, Abel Wu wrote:
>> This patchset tries to improve SIS scan efficiency by recording idle
>> cpus in a cpumask for each LLC which will be used as a target cpuset
>> in the domain scan. The cpus are recorded at CORE granule to avoid
>> tasks being stack on same core.
>>
>> v5 -> v6:
>> - Rename SIS_FILTER to SIS_CORE as it can only be activated when
>> SMT is enabled and better describes the behavior of CORE granule
>> update & load delivery.
>> - Removed the part of limited scan for idle cores since it might be
>> better to open another thread to discuss the strategies such as
>> limited or scaled depth. But keep the part of full scan for idle
>> cores when LLC is overloaded because SIS_CORE can greatly reduce
>> the overhead of full scan in such case.
>> - Removed the state of sd_is_busy which indicates an LLC is fully
>> busy and we can safely skip the SIS domain scan. I would prefer
>> leave this to SIS_UTIL.
>> - The filter generation mechanism is replaced by in-place updates
>> during domain scan to better deal with partial scan failures.
>> - Collect Reviewed-bys from Tim Chen
>>
>> v4 -> v5:
>> - Add limited scan for idle cores when overloaded, suggested by Mel
>> - Split out several patches since they are irrelevant to this scope
>> - Add quick check on ttwu_pending before core update
>> - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
>> - Move the main filter logic to the idle path, because the newidle
>> balance can bail out early if rq->avg_idle is small enough and
>> lose chances to update the filter.
>>
>> v3 -> v4:
>> - Update filter in load_balance rather than in the tick
>> - Now the filter contains unoccupied cpus rather than overloaded ones
>> - Added mechanisms to deal with the false positive cases
>>
>> v2 -> v3:
>> - Removed sched-idle balance feature and focus on SIS
>> - Take non-CFS tasks into consideration
>> - Several fixes/improvement suggested by Josh Don
>>
>> v1 -> v2:
>> - Several optimizations on sched-idle balancing
>> - Ignore asym topos in can_migrate_task
>> - Add more benchmarks including SIS efficiency
>> - Re-organize patch as suggested by Mel Gorman
>>
>> Abel Wu (4):
>> sched/fair: Skip core update if task pending
>> sched/fair: Ignore SIS_UTIL when has_idle_core
>> sched/fair: Introduce SIS_CORE
>> sched/fair: Deal with SIS scan failures
>>
>> include/linux/sched/topology.h | 15 ++++
>> kernel/sched/fair.c | 122 +++++++++++++++++++++++++++++----
>> kernel/sched/features.h | 7 ++
>> kernel/sched/sched.h | 3 +
>> kernel/sched/topology.c | 8 ++-
>> 5 files changed, 141 insertions(+), 14 deletions(-)
>>
>
> I ran pgbench from mmtest but realised there is too much run to run
> variation on the system. Planning on running MongoDB benchmark which
> is more stable on the system and couple more workloads but the
> initial results look good. I'll get back with results later this week
> or by early next week. Meanwhile, if you need data for any specific
> workload on the test system, please do let me know.
>
> --
> Thanks and Regards,
> Prateek

2022-11-15 11:36:51

by K Prateek Nayak

Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

Hello Abel,

Thank you for taking a look at the report.

On 11/15/2022 2:01 PM, Abel Wu wrote:
> Hi Prateek, thanks very much for your detailed testing!
>
> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>> Hello Abel,
>>
>> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
>> (2 x 64C/128T).
>>
>> tl;dr
>>
>> o I do not notice any regressions with the standard benchmarks.
>> o schbench sees a nice improvement to the tail latency when the number
>>    of workers is equal to the number of cores in the system in NPS1 and
>>    NPS2 mode. (Marked with "^")
>> o A few data points show improvements in tbench in NPS1 and NPS2 mode.
>>    (Marked with "^")
>>
>> I'm still in the process of running larger workloads. If there is any
>> specific workload you would like me to run on the test system, please
>> do let me know. Below is the detailed report:
>
> Not particularly in my mind, and I think testing larger workloads is
> great. Thanks!
>
>>
>> Following are the results from running standard benchmarks on a
>> dual socket Zen3 (2 x 64C/128T) machine configured in different
>> NPS modes.
>>
>> NPS Modes are used to logically divide a single socket into
>> multiple NUMA regions.
>> Following is the NUMA configuration for each NPS mode on the system:
>>
>> NPS1: Each socket is a NUMA node.
>>      Total 2 NUMA nodes in the dual socket machine.
>>
>>      Node 0: 0-63,   128-191
>>      Node 1: 64-127, 192-255
>>
>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>>      Total 4 NUMA nodes exist over the 2 sockets.
>>      Node 0: 0-31,   128-159
>>      Node 1: 32-63,  160-191
>>      Node 2: 64-95,  192-223
>>      Node 3: 96-127, 224-255
>>
>> NPS4: Each socket is logically divided into 4 NUMA regions.
>>      Total 8 NUMA nodes exist over the 2 sockets.
>>      Node 0: 0-15,    128-143
>>      Node 1: 16-31,   144-159
>>      Node 2: 32-47,   160-175
>>      Node 3: 48-63,   176-191
>>      Node 4: 64-79,   192-207
>>      Node 5: 80-95,   208-223
>>      Node 6: 96-111,  224-239
>>      Node 7: 112-127, 240-255
>>
>> Benchmark Results:
>>
>> Kernel versions:
>> - tip:          5.19.0 tip sched/core
>> - sis_core:     5.19.0 tip sched/core + this series
>>
>> When we started testing, the tip was at:
>> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
>>
>> ~~~~~~~~~~~~~
>> ~ hackbench ~
>> ~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> Test:            tip            sis_core
>>   1-groups:       4.06 (0.00 pct)       4.26 (-4.92 pct)    *
>>   1-groups:       4.14 (0.00 pct)       4.09 (1.20 pct)    [Verification Run]
>>   2-groups:       4.76 (0.00 pct)       4.71 (1.05 pct)
>>   4-groups:       5.22 (0.00 pct)       5.11 (2.10 pct)
>>   8-groups:       5.35 (0.00 pct)       5.31 (0.74 pct)
>> 16-groups:       7.21 (0.00 pct)       6.80 (5.68 pct)
>>
>> o NPS2
>>
>> Test:            tip            sis_core
>>   1-groups:       4.09 (0.00 pct)       4.08 (0.24 pct)
>>   2-groups:       4.70 (0.00 pct)       4.69 (0.21 pct)
>>   4-groups:       5.05 (0.00 pct)       4.92 (2.57 pct)
>>   8-groups:       5.35 (0.00 pct)       5.26 (1.68 pct)
>> 16-groups:       6.37 (0.00 pct)       6.34 (0.47 pct)
>>
>> o NPS4
>>
>> Test:            tip            sis_core
>>   1-groups:       4.07 (0.00 pct)       3.99 (1.96 pct)
>>   2-groups:       4.65 (0.00 pct)       4.59 (1.29 pct)
>>   4-groups:       5.13 (0.00 pct)       5.00 (2.53 pct)
>>   8-groups:       5.47 (0.00 pct)       5.43 (0.73 pct)
>> 16-groups:       6.82 (0.00 pct)       6.56 (3.81 pct)
>
> Although each cpu will get 2.5 tasks with 16-groups, which can
> be considered overloaded, I tested on an AMD EPYC 7Y83 machine and
> the total cpu usage was ~82% (with some older kernel version),
> so there is still lots of idle time.
>
> I guess the cutoff at 16-groups is because it is loaded enough
> compared to the real workloads, so testing more groups might just
> be a waste of time?

The machine has 16 LLCs, and I had previously seen some run-to-run
variance with larger group counts, so I capped the reports at
16-groups. I'll run hackbench with a larger number of groups
(32, 64, 128, 256) and get back to you with the results, along with
results for a couple of long-running workloads.

>
> Thanks & Best Regards,
>     Abel
>
> [..snip..]
>


--
Thanks and Regards,
Prateek

2022-11-22 13:01:42

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

Hello Abel,

Following are the results for hackbench with a larger number of
groups, ycsb-mongodb, Spec-JBB, and unixbench. Apart from a
regression in unixbench spawn in NPS2 and NPS4 mode and in
unixbench syscall in NPS4 mode, everything looks good.

Detailed results are below:

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

tip: 131696.33 (var: 2.03%)
sis_core: 129519.00 (var: 1.46%) (-1.65%)

o NPS2:

tip: 129895.33 (var: 2.34%)
sis_core: 130774.33 (var: 2.57%) (+0.67%)

o NPS4:

tip: 131165.00 (var: 1.06%)
sis_core: 133547.33 (var: 3.90%) (+1.81%)

~~~~~~~~~~~~~~~~~
~ Spec-JBB NPS1 ~
~~~~~~~~~~~~~~~~~

Max-jOPS and Critical-jOPS are the same as on the tip kernel.

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

-> unixbench-dhry2reg

o NPS1

kernel: tip sis_core
Min unixbench-dhry2reg-1 48876615.50 ( 0.00%) 48891544.00 ( 0.03%)
Min unixbench-dhry2reg-512 6260344658.90 ( 0.00%) 6282967594.10 ( 0.36%)
Hmean unixbench-dhry2reg-1 49299721.81 ( 0.00%) 49233828.70 ( -0.13%)
Hmean unixbench-dhry2reg-512 6267459427.19 ( 0.00%) 6288772961.79 * 0.34%*
CoeffVar unixbench-dhry2reg-1 0.90 ( 0.00%) 0.68 ( 24.66%)
CoeffVar unixbench-dhry2reg-512 0.10 ( 0.00%) 0.10 ( 7.54%)

o NPS2

kernel: tip sis_core
Min unixbench-dhry2reg-1 48828251.70 ( 0.00%) 48856709.20 ( 0.06%)
Min unixbench-dhry2reg-512 6244987739.10 ( 0.00%) 6271229549.10 ( 0.42%)
Hmean unixbench-dhry2reg-1 48869882.65 ( 0.00%) 49302481.81 ( 0.89%)
Hmean unixbench-dhry2reg-512 6261073948.84 ( 0.00%) 6272564898.35 ( 0.18%)
CoeffVar unixbench-dhry2reg-1 0.08 ( 0.00%) 0.87 (-945.28%)
CoeffVar unixbench-dhry2reg-512 0.23 ( 0.00%) 0.03 ( 85.94%)

o NPS4

kernel: tip sis_core
Min unixbench-dhry2reg-1 48523981.30 ( 0.00%) 49083957.50 ( 1.15%)
Min unixbench-dhry2reg-512 6253738837.10 ( 0.00%) 6271747119.10 ( 0.29%)
Hmean unixbench-dhry2reg-1 48781044.09 ( 0.00%) 49232218.87 * 0.92%*
Hmean unixbench-dhry2reg-512 6264428474.90 ( 0.00%) 6280484789.64 ( 0.26%)
CoeffVar unixbench-dhry2reg-1 0.46 ( 0.00%) 0.26 ( 42.63%)
CoeffVar unixbench-dhry2reg-512 0.17 ( 0.00%) 0.21 ( -26.72%)

-> unixbench-syscall

o NPS1

kernel: tip sis_core
Min unixbench-syscall-1 2975654.80 ( 0.00%) 2978489.40 ( -0.10%)
Min unixbench-syscall-512 7840226.50 ( 0.00%) 7822133.40 ( 0.23%)
Amean unixbench-syscall-1 2976326.47 ( 0.00%) 2980985.27 * -0.16%*
Amean unixbench-syscall-512 7850493.90 ( 0.00%) 7844527.50 ( 0.08%)
CoeffVar unixbench-syscall-1 0.03 ( 0.00%) 0.07 (-154.43%)
CoeffVar unixbench-syscall-512 0.13 ( 0.00%) 0.34 (-158.96%)

o NPS2

kernel: tip sis_core
Min unixbench-syscall-1 2969863.60 ( 0.00%) 2977936.50 ( -0.27%)
Min unixbench-syscall-512 8053157.60 ( 0.00%) 8072239.00 ( -0.24%)
Amean unixbench-syscall-1 2970462.30 ( 0.00%) 2981732.50 * -0.38%*
Amean unixbench-syscall-512 8061454.50 ( 0.00%) 8079287.73 * -0.22%*
CoeffVar unixbench-syscall-1 0.02 ( 0.00%) 0.11 (-527.26%)
CoeffVar unixbench-syscall-512 0.12 ( 0.00%) 0.08 ( 37.30%)

o NPS4

kernel: tip sis_core
Min unixbench-syscall-1 2971799.80 ( 0.00%) 2979335.60 ( -0.25%)
Min unixbench-syscall-512 7824196.90 ( 0.00%) 8155610.20 ( -4.24%)
Amean unixbench-syscall-1 2973045.43 ( 0.00%) 2982036.13 * -0.30%*
Amean unixbench-syscall-512 7826302.17 ( 0.00%) 8173026.57 * -4.43%* <-- Regression in syscall for larger worker count
CoeffVar unixbench-syscall-1 0.04 ( 0.00%) 0.09 (-139.63%)
CoeffVar unixbench-syscall-512 0.03 ( 0.00%) 0.20 (-701.13%)


-> unixbench-pipe

o NPS1

kernel: tip sis_core
Min unixbench-pipe-1 2894765.30 ( 0.00%) 2891505.30 ( -0.11%)
Min unixbench-pipe-512 329818573.50 ( 0.00%) 325610257.80 ( -1.28%)
Hmean unixbench-pipe-1 2898803.38 ( 0.00%) 2896940.25 ( -0.06%)
Hmean unixbench-pipe-512 330226401.69 ( 0.00%) 326311984.29 * -1.19%*
CoeffVar unixbench-pipe-1 0.14 ( 0.00%) 0.17 ( -21.99%)
CoeffVar unixbench-pipe-512 0.11 ( 0.00%) 0.20 ( -88.38%)

o NPS2

kernel: tip sis_core
Min unixbench-pipe-1 2895327.90 ( 0.00%) 2894798.20 ( -0.02%)
Min unixbench-pipe-512 328350065.60 ( 0.00%) 325681163.10 ( -0.81%)
Hmean unixbench-pipe-1 2899129.86 ( 0.00%) 2897067.80 ( -0.07%)
Hmean unixbench-pipe-512 329436096.80 ( 0.00%) 326023030.94 * -1.04%*
CoeffVar unixbench-pipe-1 0.12 ( 0.00%) 0.09 ( 21.96%)
CoeffVar unixbench-pipe-512 0.30 ( 0.00%) 0.12 ( 60.80%)

o NPS4

kernel: tip sis_core
Min unixbench-pipe-1 2901525.60 ( 0.00%) 2885730.80 ( -0.54%)
Min unixbench-pipe-512 330265873.90 ( 0.00%) 326730770.60 ( -1.07%)
Hmean unixbench-pipe-1 2906184.70 ( 0.00%) 2891616.18 * -0.50%*
Hmean unixbench-pipe-512 330854683.27 ( 0.00%) 327113296.63 * -1.13%*
CoeffVar unixbench-pipe-1 0.14 ( 0.00%) 0.19 ( -33.74%)
CoeffVar unixbench-pipe-512 0.16 ( 0.00%) 0.11 ( 31.75%)

-> unixbench-spawn

o NPS1

kernel: tip sis_core
Min unixbench-spawn-1 6536.50 ( 0.00%) 6000.30 ( -8.20%)
Min unixbench-spawn-512 72571.40 ( 0.00%) 70829.60 ( -2.40%)
Hmean unixbench-spawn-1 6811.16 ( 0.00%) 7016.11 ( 3.01%)
Hmean unixbench-spawn-512 72801.77 ( 0.00%) 71012.03 * -2.46%*
CoeffVar unixbench-spawn-1 3.69 ( 0.00%) 13.52 (-266.69%)
CoeffVar unixbench-spawn-512 0.27 ( 0.00%) 0.22 ( 18.25%)

o NPS2

kernel: tip sis_core
Min unixbench-spawn-1 7042.20 ( 0.00%) 7078.70 ( 0.52%)
Min unixbench-spawn-512 85571.60 ( 0.00%) 77362.60 ( -9.59%)
Hmean unixbench-spawn-1 7199.01 ( 0.00%) 7276.55 ( 1.08%)
Hmean unixbench-spawn-512 85717.77 ( 0.00%) 77923.73 * -9.09%* <-- Regression in spawn test for larger worker count
CoeffVar unixbench-spawn-1 3.50 ( 0.00%) 3.30 ( 5.70%)
CoeffVar unixbench-spawn-512 0.20 ( 0.00%) 0.82 (-304.88%)

o NPS4

kernel: tip sis_core
Min unixbench-spawn-1 7521.90 ( 0.00%) 8102.80 ( 7.72%)
Min unixbench-spawn-512 84245.70 ( 0.00%) 73074.50 ( -13.26%)
Hmean unixbench-spawn-1 7659.12 ( 0.00%) 8645.19 * 12.87%*
Hmean unixbench-spawn-512 84908.77 ( 0.00%) 73409.49 * -13.54%* <-- Regression in spawn test for larger worker count
CoeffVar unixbench-spawn-1 1.92 ( 0.00%) 5.78 (-200.56%)
CoeffVar unixbench-spawn-512 0.76 ( 0.00%) 0.41 ( 46.58%)

-> unixbench-execl

o NPS1

kernel: tip sis_core
Min unixbench-execl-1 5421.50 ( 0.00%) 5471.50 ( 0.92%)
Min unixbench-execl-512 11213.50 ( 0.00%) 11677.20 ( 4.14%)
Hmean unixbench-execl-1 5443.75 ( 0.00%) 5475.36 * 0.58%*
Hmean unixbench-execl-512 11311.94 ( 0.00%) 11804.52 * 4.35%*
CoeffVar unixbench-execl-1 0.38 ( 0.00%) 0.12 ( 69.22%)
CoeffVar unixbench-execl-512 1.03 ( 0.00%) 1.73 ( -68.91%)

o NPS2

kernel: tip sis_core
Min unixbench-execl-1 5089.10 ( 0.00%) 5405.40 ( 6.22%)
Min unixbench-execl-512 11772.70 ( 0.00%) 11917.20 ( 1.23%)
Hmean unixbench-execl-1 5321.65 ( 0.00%) 5421.41 ( 1.87%)
Hmean unixbench-execl-512 12201.73 ( 0.00%) 12327.95 ( 1.03%)
CoeffVar unixbench-execl-1 3.87 ( 0.00%) 0.28 ( 92.88%)
CoeffVar unixbench-execl-512 6.23 ( 0.00%) 5.78 ( 7.21%)

o NPS4

kernel: tip sis_core
Min unixbench-execl-1 5099.40 ( 0.00%) 5479.60 ( 7.46%)
Min unixbench-execl-512 11692.80 ( 0.00%) 12205.50 ( 4.38%)
Hmean unixbench-execl-1 5136.86 ( 0.00%) 5487.93 * 6.83%*
Hmean unixbench-execl-512 12053.71 ( 0.00%) 12712.96 ( 5.47%)
CoeffVar unixbench-execl-1 1.05 ( 0.00%) 0.14 ( 86.57%)
CoeffVar unixbench-execl-512 3.85 ( 0.00%) 5.86 ( -52.14%)

For the unixbench regressions, I do not see anything obvious jump out
in the perf traces captured with IBS. top shows over 99% utilization,
which would ideally mean there are not many updates to the mask.
I'll take a closer look at the spawn test case and get back to you.

On 11/15/2022 4:58 PM, K Prateek Nayak wrote:
> Hello Abel,
>
> Thank you for taking a look at the report.
>
> On 11/15/2022 2:01 PM, Abel Wu wrote:
>> Hi Prateek, thanks very much for your detailed testing!
>>
>> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>>> Hello Abel,
>>>
>>> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
>>> (2 x 64C/128T)
>>>
>>> tl;dr
>>>
>>> o I do not notice any regressions with the standard benchmarks.
>>> o schbench sees a nice improvement to the tail latency when the number
>>>    of worker are equal to the number of cores in the system in NPS1 and
>>>    NPS2 mode. (Marked with "^")
>>> o Few data points show improvements in tbench in NPS1 and NPS2 mode.
>>>    (Marked with "^")
>>>
>>> I'm still in the process of running larger workloads. If there is any
>>> specific workload you would like me to run on the test system, please
>>> do let me know. Below is the detailed report:
>>
>> Not particularly in my mind, and I think testing larger workloads is
>> great. Thanks!
>>
>>>
>>> [..snip..]
>>>
>>> ~~~~~~~~~~~~~
>>> ~ hackbench ~
>>> ~~~~~~~~~~~~~
>>>
>>> o NPS1
>>>
>>> Test:            tip            sis_core
>>>   1-groups:       4.06 (0.00 pct)       4.26 (-4.92 pct)    *
>>>   1-groups:       4.14 (0.00 pct)       4.09 (1.20 pct)    [Verification Run]
>>>   2-groups:       4.76 (0.00 pct)       4.71 (1.05 pct)
>>>   4-groups:       5.22 (0.00 pct)       5.11 (2.10 pct)
>>>   8-groups:       5.35 (0.00 pct)       5.31 (0.74 pct)
>>> 16-groups:       7.21 (0.00 pct)       6.80 (5.68 pct)
>>>
>>> o NPS2
>>>
>>> Test:            tip            sis_core
>>>   1-groups:       4.09 (0.00 pct)       4.08 (0.24 pct)
>>>   2-groups:       4.70 (0.00 pct)       4.69 (0.21 pct)
>>>   4-groups:       5.05 (0.00 pct)       4.92 (2.57 pct)
>>>   8-groups:       5.35 (0.00 pct)       5.26 (1.68 pct)
>>> 16-groups:       6.37 (0.00 pct)       6.34 (0.47 pct)
>>>
>>> o NPS4
>>>
>>> Test:            tip            sis_core
>>>   1-groups:       4.07 (0.00 pct)       3.99 (1.96 pct)
>>>   2-groups:       4.65 (0.00 pct)       4.59 (1.29 pct)
>>>   4-groups:       5.13 (0.00 pct)       5.00 (2.53 pct)
>>>   8-groups:       5.47 (0.00 pct)       5.43 (0.73 pct)
>>> 16-groups:       6.82 (0.00 pct)       6.56 (3.81 pct)
>>
>> Although each cpu will get 2.5 tasks with 16-groups, which can
>> be considered overloaded, I tested on an AMD EPYC 7Y83 machine and
>> the total cpu usage was ~82% (with some older kernel version),
>> so there is still lots of idle time.
>>
>> I guess the cutoff at 16-groups is because it is loaded enough
>> compared to the real workloads, so testing more groups might just
>> be a waste of time?
>
> The machine has 16 LLCs, and I had previously seen some run-to-run
> variance with larger group counts, so I capped the reports at
> 16-groups. I'll run hackbench with a larger number of groups
> (32, 64, 128, 256) and get back to you with the results, along with
> results for a couple of long-running workloads.

~~~~~~~~~~~~~
~ Hackbench ~
~~~~~~~~~~~~~

$ perf bench sched messaging -p -l 50000 -g <groups>

o NPS1

kernel: tip sis_core
32-groups: 6.20 (0.00 pct) 5.86 (5.48 pct)
64-groups: 16.55 (0.00 pct) 15.21 (8.09 pct)
128-groups: 42.57 (0.00 pct) 34.63 (18.65 pct)
256-groups: 71.69 (0.00 pct) 67.11 (6.38 pct)
512-groups: 108.48 (0.00 pct) 110.23 (-1.61 pct)

o NPS2

kernel: tip sis_core
32-groups: 6.56 (0.00 pct) 5.60 (14.63 pct)
64-groups: 15.74 (0.00 pct) 14.45 (8.19 pct)
128-groups: 39.93 (0.00 pct) 35.33 (11.52 pct)
256-groups: 74.49 (0.00 pct) 69.65 (6.49 pct)
512-groups: 112.22 (0.00 pct) 113.75 (-1.36 pct)

o NPS4:

kernel: tip sis_core
32-groups: 9.48 (0.00 pct) 5.64 (40.50 pct)
64-groups: 15.38 (0.00 pct) 14.13 (8.12 pct)
128-groups: 39.93 (0.00 pct) 34.47 (13.67 pct)
256-groups: 75.31 (0.00 pct) 67.98 (9.73 pct)
512-groups: 115.37 (0.00 pct) 111.15 (3.65 pct)

Note: Hackbench with 32-groups shows run-to-run variation
on tip but is more stable with sis_core. Hackbench with
64-groups and beyond is stable on both kernels.

>
>>
>> Thanks & Best Regards,
>>     Abel
>>
>> [..snip..]
>>
>
>
> --
> Thanks and Regards,
> Prateek

Apart from the couple of regressions in Unixbench, everything looks good.
If you would like me to get any more data for any workload on the test
system, please do let me know.
--
Thanks and Regards,
Prateek

2022-11-24 04:51:25

by Abel Wu

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

Hi Prateek, thanks again for your detailed test!

On 11/22/22 7:28 PM, K Prateek Nayak wrote:
> Hello Abel,
>
> Following are the results for hackbench with a larger number of
> groups, ycsb-mongodb, Spec-JBB, and unixbench. Apart from a
> regression in unixbench spawn in NPS2 and NPS4 mode and in
> unixbench syscall in NPS4 mode, everything looks good.
>
> ...
>
> -> unixbench-syscall
>
> o NPS4
>
> kernel: tip sis_core
> Min unixbench-syscall-1 2971799.80 ( 0.00%) 2979335.60 ( -0.25%)
> Min unixbench-syscall-512 7824196.90 ( 0.00%) 8155610.20 ( -4.24%)
> Amean unixbench-syscall-1 2973045.43 ( 0.00%) 2982036.13 * -0.30%*
> Amean unixbench-syscall-512 7826302.17 ( 0.00%) 8173026.57 * -4.43%* <-- Regression in syscall for larger worker count
> CoeffVar unixbench-syscall-1 0.04 ( 0.00%) 0.09 (-139.63%)
> CoeffVar unixbench-syscall-512 0.03 ( 0.00%) 0.20 (-701.13%)
>
>
> -> unixbench-spawn
>
> o NPS1
>
> kernel: tip sis_core
> Min unixbench-spawn-1 6536.50 ( 0.00%) 6000.30 ( -8.20%)
> Min unixbench-spawn-512 72571.40 ( 0.00%) 70829.60 ( -2.40%)
> Hmean unixbench-spawn-1 6811.16 ( 0.00%) 7016.11 ( 3.01%)
> Hmean unixbench-spawn-512 72801.77 ( 0.00%) 71012.03 * -2.46%*
> CoeffVar unixbench-spawn-1 3.69 ( 0.00%) 13.52 (-266.69%)
> CoeffVar unixbench-spawn-512 0.27 ( 0.00%) 0.22 ( 18.25%)
>
> o NPS2
>
> kernel: tip sis_core
> Min unixbench-spawn-1 7042.20 ( 0.00%) 7078.70 ( 0.52%)
> Min unixbench-spawn-512 85571.60 ( 0.00%) 77362.60 ( -9.59%)
> Hmean unixbench-spawn-1 7199.01 ( 0.00%) 7276.55 ( 1.08%)
> Hmean unixbench-spawn-512 85717.77 ( 0.00%) 77923.73 * -9.09%* <-- Regression in spawn test for larger worker count
> CoeffVar unixbench-spawn-1 3.50 ( 0.00%) 3.30 ( 5.70%)
> CoeffVar unixbench-spawn-512 0.20 ( 0.00%) 0.82 (-304.88%)
>
> o NPS4
>
> kernel: tip sis_core
> Min unixbench-spawn-1 7521.90 ( 0.00%) 8102.80 ( 7.72%)
> Min unixbench-spawn-512 84245.70 ( 0.00%) 73074.50 ( -13.26%)
> Hmean unixbench-spawn-1 7659.12 ( 0.00%) 8645.19 * 12.87%*
> Hmean unixbench-spawn-512 84908.77 ( 0.00%) 73409.49 * -13.54%* <-- Regression in spawn test for larger worker count
> CoeffVar unixbench-spawn-1 1.92 ( 0.00%) 5.78 (-200.56%)
> CoeffVar unixbench-spawn-512 0.76 ( 0.00%) 0.41 ( 46.58%)
>
> ...
>
> For the unixbench regressions, I do not see anything obvious jump out
> in the perf traces captured with IBS. top shows over 99% utilization,
> which would ideally mean there are not many updates to the mask.
> I'll take a closer look at the spawn test case and get back to you.

These regressions seem to be common in fully parallel tests. I
guess it might be due to over-updating the idle cpumask when the LLC
is overloaded, which is not necessary if SIS_UTIL is enabled, but I
need to dig into it further. Maybe the rq avg_idle or nr_idle_scan
needs to be taken into consideration as well. Thanks for providing
this important information.

>
> ~~~~~~~~~~~~~
> ~ Hackbench ~
> ~~~~~~~~~~~~~
>
> $ perf bench sched messaging -p -l 50000 -g <groups>
>
> o NPS1
>
> kernel: tip sis_core
> 32-groups: 6.20 (0.00 pct) 5.86 (5.48 pct)
> 64-groups: 16.55 (0.00 pct) 15.21 (8.09 pct)
> 128-groups: 42.57 (0.00 pct) 34.63 (18.65 pct)
> 256-groups: 71.69 (0.00 pct) 67.11 (6.38 pct)
> 512-groups: 108.48 (0.00 pct) 110.23 (-1.61 pct)
>
> o NPS2
>
> kernel: tip sis_core
> 32-groups: 6.56 (0.00 pct) 5.60 (14.63 pct)
> 64-groups: 15.74 (0.00 pct) 14.45 (8.19 pct)
> 128-groups: 39.93 (0.00 pct) 35.33 (11.52 pct)
> 256-groups: 74.49 (0.00 pct) 69.65 (6.49 pct)
> 512-groups: 112.22 (0.00 pct) 113.75 (-1.36 pct)
>
> o NPS4:
>
> kernel: tip sis_core
> 32-groups: 9.48 (0.00 pct) 5.64 (40.50 pct)
> 64-groups: 15.38 (0.00 pct) 14.13 (8.12 pct)
> 128-groups: 39.93 (0.00 pct) 34.47 (13.67 pct)
> 256-groups: 75.31 (0.00 pct) 67.98 (9.73 pct)
> 512-groups: 115.37 (0.00 pct) 111.15 (3.65 pct)
>
> Note: Hackbench with 32-groups shows run-to-run variation
> on tip but is more stable with sis_core. Hackbench with
> 64-groups and beyond is stable on both kernels.
>
The results are consistent with mine except for 512-groups, which I
didn't test. The 512-groups test may have the same problem as
mentioned above.

Thanks & Regards,
Abel

2023-02-07 03:43:00

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

Hello Abel,

I've retested the patches on the updated tip and the results
are still promising.

tl;dr

o Hackbench sees improvements when the machine is overloaded.
o tbench shows improvements when the machine is overloaded.
o The unixbench regression seen previously seems to be unrelated
to the patch, as the spawn test scores are vastly different
after a reboot/kexec for the same kernel.
o Other benchmarks show slight improvements or are comparable to
the numbers on tip.

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS Modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over the 2 sockets.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over the 2 sockets.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255

Following are the Kernel versions:

tip: 6.2.0-rc2 tip:sched/core at
commit: bbd0b031509b "sched/rseq: Fix concurrency ID handling of usermodehelper kthreads"
sis_short: tip + series

The patch applied cleanly on the tip.

Benchmark Results:

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

NPS1

Test: tip sis_core
1-groups: 4.36 (0.00 pct) 4.17 (4.35 pct)
2-groups: 5.17 (0.00 pct) 5.03 (2.70 pct)
4-groups: 4.17 (0.00 pct) 4.14 (0.71 pct)
8-groups: 4.64 (0.00 pct) 4.63 (0.21 pct)
16-groups: 5.43 (0.00 pct) 5.32 (2.02 pct)

NPS2

Test: tip sis_core
1-groups: 4.43 (0.00 pct) 4.27 (3.61 pct)
2-groups: 4.61 (0.00 pct) 4.92 (-6.72 pct) *
2-groups: 4.52 (0.00 pct) 4.55 (-0.66 pct) [Verification Run]
4-groups: 4.25 (0.00 pct) 4.10 (3.52 pct)
8-groups: 4.91 (0.00 pct) 4.53 (7.73 pct)
16-groups: 5.84 (0.00 pct) 5.54 (5.13 pct)

NPS4

Test: tip sis_core
1-groups: 4.34 (0.00 pct) 4.23 (2.53 pct)
2-groups: 4.64 (0.00 pct) 4.84 (-4.31 pct)
4-groups: 4.20 (0.00 pct) 4.17 (0.71 pct)
8-groups: 5.21 (0.00 pct) 5.06 (2.87 pct)
16-groups: 6.24 (0.00 pct) 5.60 (10.25 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

NPS1

#workers: tip sis_core
1: 36.00 (0.00 pct) 23.00 (36.11 pct)
2: 37.00 (0.00 pct) 37.00 (0.00 pct)
4: 37.00 (0.00 pct) 38.00 (-2.70 pct)
8: 47.00 (0.00 pct) 52.00 (-10.63 pct)
16: 64.00 (0.00 pct) 65.00 (-1.56 pct)
32: 109.00 (0.00 pct) 111.00 (-1.83 pct)
64: 222.00 (0.00 pct) 215.00 (3.15 pct)
128: 515.00 (0.00 pct) 486.00 (5.63 pct)
256: 39744.00 (0.00 pct) 47808.00 (-20.28 pct) * (Machine Overloaded ~ 2 tasks per rq)
256: 43242.00 (0.00 pct) 42293.00 (2.19 pct) [Verification Run]
512: 81280.00 (0.00 pct) 76416.00 (5.98 pct)

NPS2

#workers: tip sis_core
1: 27.00 (0.00 pct) 27.00 (0.00 pct)
2: 31.00 (0.00 pct) 30.00 (3.22 pct)
4: 38.00 (0.00 pct) 37.00 (2.63 pct)
8: 50.00 (0.00 pct) 46.00 (8.00 pct)
16: 66.00 (0.00 pct) 68.00 (-3.03 pct)
32: 116.00 (0.00 pct) 113.00 (2.58 pct)
64: 210.00 (0.00 pct) 228.00 (-8.57 pct) *
64: 206.00 (0.00 pct) 219.00 (-6.31 pct) [Verification Run]
128: 523.00 (0.00 pct) 559.00 (-6.88 pct) *
128: 474.00 (0.00 pct) 497.00 (-4.85 pct) [Verification Run]
256: 44864.00 (0.00 pct) 47040.00 (-4.85 pct)
512: 78464.00 (0.00 pct) 81280.00 (-3.58 pct)

NPS4

#workers: tip sis_core
1: 32.00 (0.00 pct) 27.00 (15.62 pct)
2: 32.00 (0.00 pct) 35.00 (-9.37 pct)
4: 34.00 (0.00 pct) 41.00 (-20.58 pct)
8: 58.00 (0.00 pct) 58.00 (0.00 pct)
16: 67.00 (0.00 pct) 69.00 (-2.98 pct)
32: 118.00 (0.00 pct) 112.00 (5.08 pct)
64: 224.00 (0.00 pct) 209.00 (6.69 pct)
128: 533.00 (0.00 pct) 519.00 (2.62 pct)
256: 43456.00 (0.00 pct) 45248.00 (-4.12 pct)
512: 78976.00 (0.00 pct) 76160.00 (3.56 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

NPS1

Clients: tip sis_core
1 539.96 (0.00 pct) 538.19 (-0.32 pct)
2 1068.21 (0.00 pct) 1063.04 (-0.48 pct)
4 1994.76 (0.00 pct) 1990.47 (-0.21 pct)
8 3602.30 (0.00 pct) 3496.07 (-2.94 pct)
16 6075.49 (0.00 pct) 6061.74 (-0.22 pct)
32 11641.07 (0.00 pct) 11904.58 (2.26 pct)
64 21529.16 (0.00 pct) 22124.81 (2.76 pct)
128 30852.92 (0.00 pct) 31258.56 (1.31 pct)
256 51901.20 (0.00 pct) 53249.69 (2.59 pct)
512 46797.40 (0.00 pct) 54477.79 (16.41 pct)
1024 46057.28 (0.00 pct) 53676.58 (16.54 pct)

NPS2

Clients: tip sis_core
1 536.11 (0.00 pct) 541.18 (0.94 pct)
2 1044.58 (0.00 pct) 1064.16 (1.87 pct)
4 2043.92 (0.00 pct) 2017.84 (-1.27 pct)
8 3572.50 (0.00 pct) 3494.83 (-2.17 pct)
16 6040.97 (0.00 pct) 5530.10 (-8.45 pct) *
16 5814.03 (0.00 pct) 6012.33 (3.41 pct) [Verification Run]
32 10794.10 (0.00 pct) 10841.68 (0.44 pct)
64 20905.89 (0.00 pct) 21438.82 (2.54 pct)
128 30885.39 (0.00 pct) 30064.78 (-2.65 pct)
256 48901.25 (0.00 pct) 51395.08 (5.09 pct)
512 49673.91 (0.00 pct) 51725.89 (4.13 pct)
1024 47626.34 (0.00 pct) 52662.01 (10.57 pct)

NPS4

Clients: tip sis_core
1 544.91 (0.00 pct) 544.66 (-0.04 pct)
2 1046.49 (0.00 pct) 1072.42 (2.47 pct)
4 2007.11 (0.00 pct) 1970.05 (-1.84 pct)
8 3590.66 (0.00 pct) 3670.45 (2.22 pct)
16 5956.60 (0.00 pct) 6045.07 (1.48 pct)
32 10431.73 (0.00 pct) 10439.40 (0.07 pct)
64 21563.37 (0.00 pct) 19344.05 (-10.29 pct) *
64 19387.71 (0.00 pct) 19050.47 (-1.73 pct) [Verification Run]
128 30352.16 (0.00 pct) 26998.85 (-11.04 pct) *
128 29110.99 (0.00 pct) 29690.37 (1.99 pct) [Verification Run]
256 49504.51 (0.00 pct) 50921.66 (2.86 pct)
512 44916.61 (0.00 pct) 52176.11 (16.16 pct)
1024 49986.21 (0.00 pct) 51639.91 (3.30 pct)


~~~~~~~~~~
~ stream ~
~~~~~~~~~~

NPS1

10 Runs:

Test: tip sis_core
Copy: 339390.30 (0.00 pct) 324656.88 (-4.34 pct)
Scale: 212472.78 (0.00 pct) 210641.39 (-0.86 pct)
Add: 247598.48 (0.00 pct) 241669.10 (-2.39 pct)
Triad: 261852.07 (0.00 pct) 252088.55 (-3.72 pct)

100 Runs:

Test: tip sis_core
Copy: 335938.02 (0.00 pct) 331491.32 (-1.32 pct)
Scale: 212597.92 (0.00 pct) 218705.84 (2.87 pct)
Add: 248294.62 (0.00 pct) 243830.42 (-1.79 pct)
Triad: 258400.88 (0.00 pct) 248178.42 (-3.95 pct)

NPS2

10 Runs:

Test: tip sis_core
Copy: 334500.32 (0.00 pct) 335317.70 (0.24 pct)
Scale: 216804.76 (0.00 pct) 217862.71 (0.48 pct)
Add: 250787.33 (0.00 pct) 258839.00 (3.21 pct)
Triad: 259451.40 (0.00 pct) 264847.88 (2.07 pct)

100 Runs:

Test: tip sis_core
Copy: 326385.13 (0.00 pct) 338030.70 (3.56 pct)
Scale: 216440.37 (0.00 pct) 230053.24 (6.28 pct)
Add: 255062.22 (0.00 pct) 259197.23 (1.62 pct)
Triad: 265442.03 (0.00 pct) 271365.65 (2.23 pct)

NPS4

10 Runs:

Test: tip sis_core
Copy: 363927.86 (0.00 pct) 361014.15 (-0.80 pct)
Scale: 238190.49 (0.00 pct) 242176.02 (1.67 pct)
Add: 262806.49 (0.00 pct) 266348.50 (1.34 pct)
Triad: 276492.33 (0.00 pct) 276769.10 (0.10 pct)

100 Runs:

Test: tip sis_core
Copy: 365041.37 (0.00 pct) 349299.35 (-4.31 pct)
Scale: 239295.27 (0.00 pct) 229944.85 (-3.90 pct)
Add: 264085.21 (0.00 pct) 252651.56 (-4.32 pct)
Triad: 279664.56 (0.00 pct) 274254.22 (-1.93 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip: 131328.67 (var: 2.97%)
sis_core: 131702.33 (var: 3.61%) (0.28%)

o NPS2:

tip: 132482.33 (var: 2.06%)
sis_core: 132338.33 (var: 0.97%) (-0.11%)

o NPS4:

tip: 134130.00 (var: 4.12%)
sis_core: 133224.33 (var: 4.13%) (-0.67%)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test Metric Parallelism tip sis_core
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48770555.20 ( 0.00%) 49025161.73 ( 0.52%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6268185467.60 ( 0.00%) 6266351964.20 ( -0.03%)
unixbench-syscall Amean unixbench-syscall-1 2685321.17 ( 0.00%) 2694468.30 * -0.34%*
unixbench-syscall Amean unixbench-syscall-512 7291476.20 ( 0.00%) 7295087.67 ( -0.05%)
unixbench-pipe Hmean unixbench-pipe-1 2480858.53 ( 0.00%) 2536923.44 * 2.26%*
unixbench-pipe Hmean unixbench-pipe-512 300739256.62 ( 0.00%) 303470605.93 * 0.91%*
unixbench-spawn Hmean unixbench-spawn-1 4358.14 ( 0.00%) 4104.88 ( -5.81%) * (Known to be unstable)
unixbench-spawn Hmean unixbench-spawn-1 4711.00 ( 0.00%) 4006.20 ( -14.96%) [Verification Run]
unixbench-spawn Hmean unixbench-spawn-512 76497.32 ( 0.00%) 75555.94 * -1.23%*
unixbench-execl Hmean unixbench-execl-1 4147.12 ( 0.00%) 4157.33 ( 0.25%)
unixbench-execl Hmean unixbench-execl-512 12435.26 ( 0.00%) 11992.43 ( -3.56%)

o NPS2

Test Metric Parallelism tip sis_core
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48872335.50 ( 0.00%) 48902553.70 ( 0.06%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6264134378.20 ( 0.00%) 6260631689.40 ( -0.06%)
unixbench-syscall Amean unixbench-syscall-1 2683903.13 ( 0.00%) 2694829.17 * -0.41%*
unixbench-syscall Amean unixbench-syscall-512 7746773.60 ( 0.00%) 7493782.67 * 3.27%*
unixbench-pipe Hmean unixbench-pipe-1 2476724.23 ( 0.00%) 2537127.96 * 2.44%*
unixbench-pipe Hmean unixbench-pipe-512 300277350.41 ( 0.00%) 302979776.19 * 0.90%*
unixbench-spawn Hmean unixbench-spawn-1 5026.50 ( 0.00%) 4680.63 ( -6.88%) *
unixbench-spawn Hmean unixbench-spawn-1 5421.70 ( 0.00%) 5311.50 ( -2.03%) [Verification Run]
unixbench-spawn Hmean unixbench-spawn-512 80549.70 ( 0.00%) 78888.60 ( -2.06%)
unixbench-execl Hmean unixbench-execl-1 4151.70 ( 0.00%) 3913.76 * -5.73%* *
unixbench-execl Hmean unixbench-execl-1 4304.30 ( 0.00%) 4303.20 ( -0.02%) [Verification run]
unixbench-execl Hmean unixbench-execl-512 13605.15 ( 0.00%) 13129.23 ( -3.50%)

o NPS4

Test Metric Parallelism tip sis_core
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48506771.20 ( 0.00%) 48894866.70 ( 0.80%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6280954362.50 ( 0.00%) 6282759876.40 ( 0.03%)
unixbench-syscall Amean unixbench-syscall-1 2687259.30 ( 0.00%) 2695379.93 * -0.30%*
unixbench-syscall Amean unixbench-syscall-512 7350275.67 ( 0.00%) 7366923.73 ( -0.23%)
unixbench-pipe Hmean unixbench-pipe-1 2478893.01 ( 0.00%) 2540015.88 * 2.47%*
unixbench-pipe Hmean unixbench-pipe-512 301830155.61 ( 0.00%) 304305539.27 * 0.82%*
unixbench-spawn Hmean unixbench-spawn-1 5208.55 ( 0.00%) 5273.11 ( 1.24%)
unixbench-spawn Hmean unixbench-spawn-512 80745.79 ( 0.00%) 81940.71 * 1.48%*
unixbench-execl Hmean unixbench-execl-1 4072.72 ( 0.00%) 4126.13 * 1.31%*
unixbench-execl Hmean unixbench-execl-512 13746.56 ( 0.00%) 12848.77 ( -6.53%) *
unixbench-execl Hmean unixbench-execl-512 13898.30 ( 0.00%) 13959.70 ( 0.44%) [Verification Run]

On 10/19/2022 5:58 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stacked on the same core.
>
> v5 -> v6:
> - Rename SIS_FILTER to SIS_CORE as it can only be activated when
> SMT is enabled and better describes the behavior of CORE granule
> update & load delivery.
> - Removed the part of limited scan for idle cores since it might be
> better to open another thread to discuss the strategies such as
> limited or scaled depth. But keep the part of full scan for idle
> cores when LLC is overloaded because SIS_CORE can greatly reduce
> the overhead of full scan in such case.
> - Removed the state of sd_is_busy which indicates an LLC is fully
> busy and we can safely skip the SIS domain scan. I would prefer
> to leave this to SIS_UTIL.
> - The filter generation mechanism is replaced by in-place updates
> during domain scan to better deal with partial scan failures.
> - Collect Reviewed-bys from Tim Chen
>
> v4 -> v5:
> - Add limited scan for idle cores when overloaded, suggested by Mel
> - Split out several patches since they are irrelevant to this scope
> - Add quick check on ttwu_pending before core update
> - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
> - Move the main filter logic to the idle path, because the newidle
> balance can bail out early if rq->avg_idle is small enough and
> lose chances to update the filter.
>
> v3 -> v4:
> - Update filter in load_balance rather than in the tick
> - Now the filter contains unoccupied cpus rather than overloaded ones
> - Added mechanisms to deal with the false positive cases
>
> v2 -> v3:
> - Removed sched-idle balance feature and focus on SIS
> - Take non-CFS tasks into consideration
> - Several fixes/improvement suggested by Josh Don
>
> v1 -> v2:
> - Several optimizations on sched-idle balancing
> - Ignore asym topos in can_migrate_task
> - Add more benchmarks including SIS efficiency
> - Re-organize patch as suggested by Mel Gorman
>
> Abel Wu (4):
> sched/fair: Skip core update if task pending
> sched/fair: Ignore SIS_UTIL when has_idle_core
> sched/fair: Introduce SIS_CORE
> sched/fair: Deal with SIS scan failures
>
> include/linux/sched/topology.h | 15 ++++
> kernel/sched/fair.c | 122 +++++++++++++++++++++++++++++----
> kernel/sched/features.h | 7 ++
> kernel/sched/sched.h | 3 +
> kernel/sched/topology.c | 8 ++-
> 5 files changed, 141 insertions(+), 14 deletions(-)
>

Testing with a couple of larger workloads like SpecJBB is still underway.
I'll update the thread with the results once they are done. The idea
is promising. I'll also try to run schbench / hackbench pinned in a
manner such that all wakeups happen on an external LLC to spot any
impact of rapid changes to the idle cpu mask of an external LLC.
Please let me know if you would like me to test or get data for any
particular benchmark from my test setup.

--
Thanks and Regards,
Prateek

2023-02-16 13:18:26

by Abel Wu

[permalink] [raw]
Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

Hi Prateek, thanks very much for your solid testing!

On 2/7/23 11:42 AM, K Prateek Nayak wrote:
> Hello Abel,
>
> I've retested the patches on the updated tip and the results
> are still promising.
>
> tl;dr
>
> o Hackbench sees improvements when the machine is overloaded.
> o tbench shows improvements when the machine is overloaded.
> o The unixbench regression seen previously seems to be unrelated
> to the patch as the spawn test scores are vastly different
> after a reboot/kexec for the same kernel.
> o Other benchmarks show slight improvements or are comparable to
> the numbers on tip.

Cheers! Yet I still see some minor regressions in the report
below. As we discussed last time, reducing unnecessary updates
on the idle cpumask when the LLC is overloaded should help.

Thanks & Best regards,
Abel