2023-01-30 02:34:06

by Oliver Sang

Subject: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression


Please note that we previously reported
"[linux-next:master] [mm] f1a7941243: unixbench.score -5.1% regression"
at
https://lore.kernel.org/all/[email protected]/
while this commit was still in linux-next.

Now that the commit has been merged to mainline, testing with other
unixbench options shows the regression is still present.

Reporting again FYI; the full report is below.


Greetings,

FYI, we noticed a -19.2% regression of unixbench.score due to commit:


commit: f1a7941243c102a44e8847e3b94ff4ff3ec56f25 ("mm: convert mm's rss stats into percpu_counter")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: unixbench
on test machine: 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
with following parameters:

runtime: 300s
nr_task: 30%
test: spawn
cpufreq_governor: performance

test-description: UnixBench is the original BYTE UNIX benchmark suite, which aims to test the performance of Unix-like systems.
test-url: https://github.com/kdlucas/byte-unixbench

In addition to that, the commit also has significant impact on the following tests:

+------------------+--------------------------------------------------------------------------------------------+
| testcase: change | unixbench: unixbench.score -14.6% regression |
| test machine | 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory |
| test parameters | cpufreq_governor=performance |
| | nr_task=1 |
| | runtime=300s |
| | test=spawn |
+------------------+--------------------------------------------------------------------------------------------+
| testcase: change | unixbench: unixbench.score -5.1% regression |
| test machine | 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory |
| test parameters | cpufreq_governor=performance |
| | nr_task=1 |
| | runtime=300s |
| | test=shell8 |
+------------------+--------------------------------------------------------------------------------------------+


If you fix the issue, kindly add the following tags
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if you come across any failure that blocks the test,
# please remove the ~/.lkp and /lkp directories to run from a clean state.

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase:
gcc-11/performance/x86_64-rhel-8.3/30%/debian-11.1-x86_64-20220510.cgz/300s/lkp-icl-2sp2/spawn/unixbench

commit:
9cd6ffa602 ("selftests/damon: add tests for DAMON_LRU_SORT's enabled parameter")
f1a7941243 ("mm: convert mm's rss stats into percpu_counter")

9cd6ffa60256e931 f1a7941243c102a44e8847e3b94
---------------- ---------------------------
%stddev %change %stddev
\ | \
11110 -19.2% 8974 unixbench.score
1090843 -12.2% 957314 unixbench.time.involuntary_context_switches
4243909 ? 6% -32.4% 2867136 ? 5% unixbench.time.major_page_faults
10547 -12.6% 9216 unixbench.time.maximum_resident_set_size
9.913e+08 -19.6% 7.969e+08 unixbench.time.minor_page_faults
5638 +19.1% 6714 unixbench.time.system_time
5502 -20.7% 4363 unixbench.time.user_time
67991885 -16.9% 56507507 unixbench.time.voluntary_context_switches
46198768 -19.1% 37355723 unixbench.workload
1.365e+08 -12.5% 1.195e+08 ? 7% cpuidle..usage
1220612 ? 4% -38.0% 757009 ? 28% meminfo.Active
1220354 ? 4% -38.0% 756754 ? 28% meminfo.Active(anon)
0.50 ? 2% -0.1 0.45 ? 4% mpstat.cpu.all.soft%
1.73 -0.2 1.52 ? 2% mpstat.cpu.all.usr%
532266 -18.4% 434559 vmstat.system.cs
495826 -12.2% 435455 ? 8% vmstat.system.in
1.36e+08 -13.2% 1.18e+08 ? 9% turbostat.C1
68.80 +0.8 69.60 turbostat.C1%
1.663e+08 -12.1% 1.462e+08 ? 8% turbostat.IRQ
15.54 ? 20% -49.0% 7.93 ? 24% sched_debug.cfs_rq:/.runnable_avg.min
13.26 ? 19% -46.6% 7.08 ? 29% sched_debug.cfs_rq:/.util_avg.min
48.96 ? 8% +51.5% 74.20 ? 13% sched_debug.cfs_rq:/.util_est_enqueued.avg
138.00 ? 5% +28.9% 177.87 ? 7% sched_debug.cfs_rq:/.util_est_enqueued.stddev
228060 ? 3% +13.3% 258413 ? 4% sched_debug.cpu.avg_idle.stddev
432533 ? 5% -16.4% 361517 ? 4% sched_debug.cpu.nr_switches.min
2.665e+08 -18.9% 2.162e+08 numa-numastat.node0.local_node
2.666e+08 -18.9% 2.163e+08 numa-numastat.node0.numa_hit
2.746e+08 -20.9% 2.172e+08 numa-numastat.node1.local_node
2.747e+08 -20.9% 2.172e+08 numa-numastat.node1.numa_hit
2.602e+08 -17.4% 2.149e+08 numa-numastat.node2.local_node
2.603e+08 -17.4% 2.149e+08 numa-numastat.node2.numa_hit
2.423e+08 -15.0% 2.06e+08 numa-numastat.node3.local_node
2.424e+08 -15.0% 2.061e+08 numa-numastat.node3.numa_hit
2.666e+08 -18.9% 2.163e+08 numa-vmstat.node0.numa_hit
2.665e+08 -18.9% 2.162e+08 numa-vmstat.node0.numa_local
2.747e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_hit
2.746e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_local
2.603e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_hit
2.602e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_local
2.424e+08 -15.0% 2.061e+08 numa-vmstat.node3.numa_hit
2.423e+08 -15.0% 2.06e+08 numa-vmstat.node3.numa_local
304947 ? 4% -38.0% 189144 ? 28% proc-vmstat.nr_active_anon
36188 +2.1% 36963 proc-vmstat.nr_kernel_stack
251594 -6.7% 234634 proc-vmstat.nr_mapped
22030 +1.8% 22429 proc-vmstat.nr_page_table_pages
435748 ? 2% -8.1% 400525 ? 2% proc-vmstat.nr_shmem
304947 ? 4% -38.0% 189144 ? 28% proc-vmstat.nr_zone_active_anon
1.044e+09 -18.1% 8.546e+08 proc-vmstat.numa_hit
1.044e+09 -18.1% 8.543e+08 proc-vmstat.numa_local
1.131e+09 -18.4% 9.226e+08 proc-vmstat.pgalloc_normal
1.012e+09 -19.6% 8.142e+08 proc-vmstat.pgfault
1.125e+09 -18.3% 9.184e+08 proc-vmstat.pgfree
82521029 -18.3% 67380744 proc-vmstat.pgreuse
26.58 +7.4% 28.55 perf-stat.i.MPKI
1.101e+10 -6.4% 1.03e+10 perf-stat.i.branch-instructions
0.96 -0.1 0.83 perf-stat.i.branch-miss-rate%
1.056e+08 ? 2% -19.2% 85304885 perf-stat.i.branch-misses
23.15 +3.9 27.06 perf-stat.i.cache-miss-rate%
3.569e+08 +18.5% 4.231e+08 perf-stat.i.cache-misses
538874 -17.9% 442333 perf-stat.i.context-switches
1.084e+11 -2.9% 1.052e+11 perf-stat.i.cpu-cycles
61598 -38.0% 38211 ? 2% perf-stat.i.cpu-migrations
796.50 ? 3% -9.0% 724.46 ? 4% perf-stat.i.cycles-between-cache-misses
0.16 ? 3% +0.1 0.24 ? 6% perf-stat.i.dTLB-load-miss-rate%
24029890 ? 3% +40.8% 33845166 ? 7% perf-stat.i.dTLB-load-misses
1.55e+10 -8.4% 1.419e+10 perf-stat.i.dTLB-loads
0.20 +0.0 0.21 perf-stat.i.dTLB-store-miss-rate%
16564619 -11.1% 14720327 ? 2% perf-stat.i.dTLB-store-misses
7.675e+09 -12.8% 6.691e+09 perf-stat.i.dTLB-stores
5.438e+10 -6.1% 5.107e+10 perf-stat.i.instructions
0.50 -3.3% 0.48 perf-stat.i.ipc
13046 ? 6% -32.5% 8801 ? 5% perf-stat.i.major-faults
0.85 -3.5% 0.82 perf-stat.i.metric.GHz
1644 -32.6% 1108 perf-stat.i.metric.K/sec
278.67 -8.5% 255.01 perf-stat.i.metric.M/sec
3031980 -19.2% 2448488 perf-stat.i.minor-faults
71228649 +23.8% 88205091 ? 2% perf-stat.i.node-load-misses
75.10 +8.8 83.91 perf-stat.i.node-store-miss-rate%
84169391 +72.1% 1.449e+08 perf-stat.i.node-store-misses
27725536 -4.0% 26624099 ? 2% perf-stat.i.node-stores
3045027 -19.3% 2457289 perf-stat.i.page-faults
27.60 +7.5% 29.68 perf-stat.overall.MPKI
0.96 -0.1 0.83 perf-stat.overall.branch-miss-rate%
23.58 +4.1 27.70 perf-stat.overall.cache-miss-rate%
1.99 +3.3% 2.06 perf-stat.overall.cpi
306.03 -18.2% 250.28 perf-stat.overall.cycles-between-cache-misses
0.15 ? 4% +0.1 0.24 ? 6% perf-stat.overall.dTLB-load-miss-rate%
0.50 -3.2% 0.49 perf-stat.overall.ipc
75.22 +9.2 84.45 perf-stat.overall.node-store-miss-rate%
391938 +15.7% 453507 perf-stat.overall.path-length
1.108e+10 -7.0% 1.03e+10 perf-stat.ps.branch-instructions
1.062e+08 ? 2% -19.6% 85367186 perf-stat.ps.branch-misses
3.559e+08 +18.0% 4.198e+08 perf-stat.ps.cache-misses
536207 -18.3% 438218 perf-stat.ps.context-switches
1.089e+11 -3.5% 1.05e+11 perf-stat.ps.cpu-cycles
60827 -38.1% 37677 ? 2% perf-stat.ps.cpu-migrations
23919311 ? 3% +41.3% 33808464 ? 6% perf-stat.ps.dTLB-load-misses
1.559e+10 -9.0% 1.419e+10 perf-stat.ps.dTLB-loads
16510990 -11.3% 14644522 perf-stat.ps.dTLB-store-misses
7.709e+09 -13.2% 6.688e+09 perf-stat.ps.dTLB-stores
5.469e+10 -6.7% 5.105e+10 perf-stat.ps.instructions
12792 ? 6% -32.5% 8632 ? 5% perf-stat.ps.major-faults
3023478 -19.6% 2431916 perf-stat.ps.minor-faults
71296907 +23.1% 87799775 perf-stat.ps.node-load-misses
83918758 +71.1% 1.436e+08 perf-stat.ps.node-store-misses
27642149 -4.3% 26441426 perf-stat.ps.node-stores
3036271 -19.6% 2440548 perf-stat.ps.page-faults
1.811e+13 -6.4% 1.694e+13 perf-stat.total.instructions


***************************************************************************************************
lkp-icl-2sp2: 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase:
gcc-11/performance/x86_64-rhel-8.3/1/debian-11.1-x86_64-20220510.cgz/300s/lkp-icl-2sp2/spawn/unixbench

commit:
9cd6ffa602 ("selftests/damon: add tests for DAMON_LRU_SORT's enabled parameter")
f1a7941243 ("mm: convert mm's rss stats into percpu_counter")

9cd6ffa60256e931 f1a7941243c102a44e8847e3b94
---------------- ---------------------------
%stddev %change %stddev
\ | \
794.90 -14.6% 678.57 unixbench.score
49613 ? 3% -4.7% 47286 ? 2% unixbench.time.involuntary_context_switches
10448 -11.8% 9216 unixbench.time.maximum_resident_set_size
68610231 -14.1% 58919669 unixbench.time.minor_page_faults
120.51 +11.4% 134.28 unixbench.time.system_time
181.44 -7.4% 167.96 unixbench.time.user_time
5947487 -14.6% 5081122 unixbench.time.voluntary_context_switches
3305203 -14.6% 2821489 unixbench.workload
0.08 ? 2% -0.0 0.07 ? 3% mpstat.cpu.all.usr%
12560 -11.0% 11184 turbostat.POLL
28.68 +1.2% 29.04 turbostat.RAMWatt
42049 -12.7% 36691 vmstat.system.cs
36312 -7.4% 33611 vmstat.system.in
4861 +13.9% 5537 ? 7% meminfo.Active
133.00 ? 5% +483.0% 775.33 ? 57% meminfo.Active(file)
1369756 -26.6% 1005212 ? 20% meminfo.DirectMap4k
16522554 ? 82% -98.7% 209105 ? 39% numa-numastat.node0.local_node
16522494 ? 82% -98.7% 209126 ? 39% numa-numastat.node0.numa_hit
171.33 ? 3% -35.6% 110.33 ? 35% numa-numastat.node1.other_node
37076 ? 4% -24.7% 27902 ? 21% sched_debug.cfs_rq:/.min_vruntime.avg
58332 ? 8% -15.6% 49210 ? 15% sched_debug.cfs_rq:/.min_vruntime.max
19442 ? 14% -25.4% 14501 ? 18% sched_debug.cfs_rq:/.min_vruntime.min
3117 ? 41% -210.0% -3430 sched_debug.cfs_rq:/.spread0.avg
1348100 -14.4% 1153515 sched_debug.cpu.curr->pid.max
145976 ? 5% -11.3% 129477 sched_debug.cpu.curr->pid.stddev
0.00 ? 9% -39.5% 0.00 ? 12% sched_debug.cpu.next_balance.stddev
51899 -11.6% 45858 sched_debug.cpu.nr_switches.avg
69329 ? 9% +17.8% 81638 ? 4% sched_debug.cpu.nr_switches.stddev
45546 ? 48% +2685.2% 1268560 ?130% numa-meminfo.node0.FilePages
496317 ? 17% +263.6% 1804856 ? 89% numa-meminfo.node0.MemUsed
42844 ? 53% +2851.6% 1264582 ?131% numa-meminfo.node0.Unevictable
8587 ?101% +276.8% 32356 ? 36% numa-meminfo.node1.AnonHugePages
59.67 ?141% +3069.3% 1891 ? 85% numa-meminfo.node1.Inactive(file)
1385 ? 31% -55.2% 621.00 ? 19% numa-meminfo.node1.Shmem
145445 ? 5% -23.0% 112048 ? 13% numa-meminfo.node1.Slab
384762 ? 11% -14.2% 330051 numa-meminfo.node2.MemUsed
55313 ? 5% -9.8% 49897 ? 6% numa-meminfo.node2.SUnreclaim
48756 ? 4% +14.5% 55819 ? 6% numa-meminfo.node3.SUnreclaim
57192 ? 6% +24.7% 71323 ? 12% numa-meminfo.node3.Slab
11385 ? 48% +2685.6% 317140 ?130% numa-vmstat.node0.nr_file_pages
10711 ? 53% +2851.6% 316145 ?131% numa-vmstat.node0.nr_unevictable
10711 ? 53% +2851.6% 316145 ?131% numa-vmstat.node0.nr_zone_unevictable
16522396 ? 82% -98.7% 209021 ? 39% numa-vmstat.node0.numa_hit
16522455 ? 82% -98.7% 209000 ? 39% numa-vmstat.node0.numa_local
14.67 ?141% +3118.2% 472.00 ? 86% numa-vmstat.node1.nr_inactive_file
346.00 ? 31% -55.2% 155.00 ? 19% numa-vmstat.node1.nr_shmem
14.67 ?141% +3118.2% 472.00 ? 86% numa-vmstat.node1.nr_zone_inactive_file
171.33 ? 3% -35.6% 110.33 ? 35% numa-vmstat.node1.numa_other
13837 ? 5% -9.7% 12490 ? 6% numa-vmstat.node2.nr_slab_unreclaimable
12188 ? 4% +14.5% 13961 ? 6% numa-vmstat.node3.nr_slab_unreclaimable
33.00 ? 6% +486.9% 193.67 ? 57% proc-vmstat.nr_active_file
63431 -1.6% 62397 proc-vmstat.nr_slab_unreclaimable
33.00 ? 6% +486.9% 193.67 ? 57% proc-vmstat.nr_zone_active_file
81877455 -14.3% 70147714 proc-vmstat.numa_hit
81885712 -14.3% 70150362 proc-vmstat.numa_local
13131 ? 26% -35.4% 8476 ? 18% proc-vmstat.numa_pages_migrated
92199304 -14.6% 78747348 proc-vmstat.pgalloc_normal
69555738 -13.9% 59868616 proc-vmstat.pgfault
92230026 -14.6% 78778237 proc-vmstat.pgfree
13131 ? 26% -35.4% 8476 ? 18% proc-vmstat.pgmigrate_success
6014572 -10.3% 5395629 ? 3% proc-vmstat.pgreuse



***************************************************************************************************
lkp-icl-2sp2: 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase:
gcc-11/performance/x86_64-rhel-8.3/1/debian-11.1-x86_64-20220510.cgz/300s/lkp-icl-2sp2/shell8/unixbench

commit:
9cd6ffa602 ("selftests/damon: add tests for DAMON_LRU_SORT's enabled parameter")
f1a7941243 ("mm: convert mm's rss stats into percpu_counter")

9cd6ffa60256e931 f1a7941243c102a44e8847e3b94
---------------- ---------------------------
%stddev %change %stddev
\ | \
7917 -5.1% 7509 unixbench.score
10485 -12.1% 9216 unixbench.time.maximum_resident_set_size
37236706 -5.1% 35324104 unixbench.time.minor_page_faults
567.67 -1.3% 560.50 unixbench.time.percent_of_cpu_this_job_got
215.51 +3.7% 223.41 unixbench.time.system_time
143.81 -8.0% 132.23 unixbench.time.user_time
1075861 -4.0% 1033025 unixbench.time.voluntary_context_switches
299268 -4.9% 284627 unixbench.workload
13368320 ? 7% +18.1% 15788032 ? 6% meminfo.DirectMap2M
44104 -3.2% 42701 vmstat.system.cs
0.98 ? 20% +0.7 1.64 ? 38% perf-profile.calltrace.cycles-pp.link_path_walk.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat
2.12 ? 19% +0.8 2.96 ? 13% perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
2.35 ? 13% +0.9 3.28 ? 13% perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
0.14 ? 74% +0.4 0.55 ? 32% perf-profile.children.cycles-pp.do_task_dead
0.04 ?223% +0.4 0.47 ? 49% perf-profile.children.cycles-pp.__mmdrop
25909747 -5.1% 24594282 proc-vmstat.numa_hit
25735224 -5.1% 24419554 proc-vmstat.numa_local
46218815 -5.1% 43858849 proc-vmstat.pgalloc_normal
37577543 -5.1% 35663315 proc-vmstat.pgfault
46035593 -5.1% 43669821 proc-vmstat.pgfree
1592335 -3.9% 1529709 proc-vmstat.pgreuse
38041 -5.1% 36089 proc-vmstat.thp_fault_alloc
532220 -5.1% 504893 proc-vmstat.unevictable_pgs_culled
11.43 +7.9% 12.34 perf-stat.i.MPKI
4.415e+09 -3.2% 4.272e+09 perf-stat.i.branch-instructions
1.71 -0.0 1.67 perf-stat.i.branch-miss-rate%
74672466 -5.3% 70687323 perf-stat.i.branch-misses
14.90 +1.3 16.18 perf-stat.i.cache-miss-rate%
38201955 +14.5% 43725695 perf-stat.i.cache-misses
2.517e+08 +5.0% 2.644e+08 perf-stat.i.cache-references
46317 -3.8% 44553 perf-stat.i.context-switches
0.94 +1.7% 0.96 perf-stat.i.cpi
2.032e+10 -1.5% 2.002e+10 perf-stat.i.cpu-cycles
2760 -6.3% 2585 perf-stat.i.cpu-migrations
0.05 +0.0 0.07 ? 34% perf-stat.i.dTLB-load-miss-rate%
2687620 +37.3% 3690868 ? 35% perf-stat.i.dTLB-load-misses
5.592e+09 -3.0% 5.422e+09 perf-stat.i.dTLB-loads
0.09 +0.0 0.09 ? 8% perf-stat.i.dTLB-store-miss-rate%
3.295e+09 -3.6% 3.175e+09 perf-stat.i.dTLB-stores
2.162e+10 -2.9% 2.1e+10 perf-stat.i.instructions
1.06 -1.5% 1.05 perf-stat.i.ipc
0.16 -1.5% 0.16 perf-stat.i.metric.GHz
188.29 +21.3% 228.40 perf-stat.i.metric.K/sec
105.85 -3.1% 102.57 perf-stat.i.metric.M/sec
580192 -5.4% 548596 perf-stat.i.minor-faults
7332887 +14.4% 8388008 perf-stat.i.node-load-misses
197858 +2.4% 202553 perf-stat.i.node-loads
77.77 +4.8 82.56 perf-stat.i.node-store-miss-rate%
8653164 +45.4% 12581116 perf-stat.i.node-store-misses
2454964 +5.3% 2585143 perf-stat.i.node-stores
580247 -5.4% 548650 perf-stat.i.page-faults
11.64 +8.1% 12.59 perf-stat.overall.MPKI
1.69 -0.0 1.65 perf-stat.overall.branch-miss-rate%
15.18 +1.4 16.54 perf-stat.overall.cache-miss-rate%
0.94 +1.4% 0.95 perf-stat.overall.cpi
531.91 -13.9% 457.79 perf-stat.overall.cycles-between-cache-misses
0.05 +0.0 0.07 ? 34% perf-stat.overall.dTLB-load-miss-rate%
0.09 +0.0 0.10 ? 8% perf-stat.overall.dTLB-store-miss-rate%
1.06 -1.4% 1.05 perf-stat.overall.ipc
77.90 +5.1 82.95 perf-stat.overall.node-store-miss-rate%
4574083 +2.5% 4688385 perf-stat.overall.path-length
4.344e+09 -3.2% 4.203e+09 perf-stat.ps.branch-instructions
73479648 -5.3% 69558968 perf-stat.ps.branch-misses
37586020 +14.5% 43024951 perf-stat.ps.cache-misses
2.477e+08 +5.0% 2.601e+08 perf-stat.ps.cache-references
45573 -3.8% 43840 perf-stat.ps.context-switches
1.999e+10 -1.5% 1.97e+10 perf-stat.ps.cpu-cycles
2715 -6.3% 2543 perf-stat.ps.cpu-migrations
2644325 +37.3% 3631630 ? 35% perf-stat.ps.dTLB-load-misses
5.502e+09 -3.0% 5.336e+09 perf-stat.ps.dTLB-loads
3.242e+09 -3.6% 3.124e+09 perf-stat.ps.dTLB-stores
2.127e+10 -2.9% 2.066e+10 perf-stat.ps.instructions
570836 -5.4% 539805 perf-stat.ps.minor-faults
7214650 +14.4% 8253560 perf-stat.ps.node-load-misses
194668 +2.4% 199305 perf-stat.ps.node-loads
8513681 +45.4% 12379540 perf-stat.ps.node-store-misses
2415376 +5.3% 2543709 perf-stat.ps.node-stores
570889 -5.4% 539858 perf-stat.ps.page-faults
1.369e+12 -2.5% 1.334e+12 perf-stat.total.instructions





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests



Attachments:
config-6.1.0-rc4-00152-gf1a7941243c1 (162.07 kB)
job-script (7.77 kB)
job.yaml (5.35 kB)
reproduce (277.00 B)

2023-01-30 04:15:43

by Matthew Wilcox

Subject: Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression

On Mon, Jan 30, 2023 at 10:32:56AM +0800, kernel test robot wrote:
> FYI, we noticed a -19.2% regression of unixbench.score due to commit:
>
> commit: f1a7941243c102a44e8847e3b94ff4ff3ec56f25 ("mm: convert mm's rss stats into percpu_counter")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: unixbench
> on test machine: 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> with following parameters:
>
> runtime: 300s
> nr_task: 30%
> test: spawn
> cpufreq_governor: performance

...

> 9cd6ffa60256e931 f1a7941243c102a44e8847e3b94
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 11110 -19.2% 8974 unixbench.score
> 1090843 -12.2% 957314 unixbench.time.involuntary_context_switches
> 4243909 ? 6% -32.4% 2867136 ? 5% unixbench.time.major_page_faults
> 10547 -12.6% 9216 unixbench.time.maximum_resident_set_size
> 9.913e+08 -19.6% 7.969e+08 unixbench.time.minor_page_faults
> 5638 +19.1% 6714 unixbench.time.system_time
> 5502 -20.7% 4363 unixbench.time.user_time

So we're spending a lot more time in the kernel and correspondingly less
time in userspace.

> 67991885 -16.9% 56507507 unixbench.time.voluntary_context_switches
> 46198768 -19.1% 37355723 unixbench.workload
> 1.365e+08 -12.5% 1.195e+08 ? 7% cpuidle..usage
> 1220612 ? 4% -38.0% 757009 ? 28% meminfo.Active
> 1220354 ? 4% -38.0% 756754 ? 28% meminfo.Active(anon)
> 0.50 ? 2% -0.1 0.45 ? 4% mpstat.cpu.all.soft%
> 1.73 -0.2 1.52 ? 2% mpstat.cpu.all.usr%
> 532266 -18.4% 434559 vmstat.system.cs
> 495826 -12.2% 435455 ? 8% vmstat.system.in
> 1.36e+08 -13.2% 1.18e+08 ? 9% turbostat.C1
> 68.80 +0.8 69.60 turbostat.C1%
> 1.663e+08 -12.1% 1.462e+08 ? 8% turbostat.IRQ
> 15.54 ? 20% -49.0% 7.93 ? 24% sched_debug.cfs_rq:/.runnable_avg.min
> 13.26 ? 19% -46.6% 7.08 ? 29% sched_debug.cfs_rq:/.util_avg.min
> 48.96 ? 8% +51.5% 74.20 ? 13% sched_debug.cfs_rq:/.util_est_enqueued.avg
> 138.00 ? 5% +28.9% 177.87 ? 7% sched_debug.cfs_rq:/.util_est_enqueued.stddev
> 228060 ? 3% +13.3% 258413 ? 4% sched_debug.cpu.avg_idle.stddev
> 432533 ? 5% -16.4% 361517 ? 4% sched_debug.cpu.nr_switches.min
> 2.665e+08 -18.9% 2.162e+08 numa-numastat.node0.local_node
> 2.666e+08 -18.9% 2.163e+08 numa-numastat.node0.numa_hit
> 2.746e+08 -20.9% 2.172e+08 numa-numastat.node1.local_node
> 2.747e+08 -20.9% 2.172e+08 numa-numastat.node1.numa_hit
> 2.602e+08 -17.4% 2.149e+08 numa-numastat.node2.local_node
> 2.603e+08 -17.4% 2.149e+08 numa-numastat.node2.numa_hit
> 2.423e+08 -15.0% 2.06e+08 numa-numastat.node3.local_node
> 2.424e+08 -15.0% 2.061e+08 numa-numastat.node3.numa_hit

So we're going off-node a lot more for ... something.

> 2.666e+08 -18.9% 2.163e+08 numa-vmstat.node0.numa_hit
> 2.665e+08 -18.9% 2.162e+08 numa-vmstat.node0.numa_local
> 2.747e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_hit
> 2.746e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_local
> 2.603e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_hit
> 2.602e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_local
> 2.424e+08 -15.0% 2.061e+08 numa-vmstat.node3.numa_hit
> 2.423e+08 -15.0% 2.06e+08 numa-vmstat.node3.numa_local
> 304947 ? 4% -38.0% 189144 ? 28% proc-vmstat.nr_active_anon

Umm. Are we running vmstat a lot during this test? The commit says:

At the
moment the readers are either procfs interface, oom_killer and memory
reclaim which I think are not performance critical and should be ok with
slow read. However I think we can make that change in a separate patch.

This would explain the increased cross-NUMA references (we're going to
the other nodes to collect the stats), and the general slowdown. But I
don't think it reflects a real workload; it's reflecting that the
monitoring of this workload that we're doing is now more accurate and
more expensive.
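
For reference, the expensive part of a stats read here is
percpu_counter_sum(), which has to walk every CPU's slot of the counter;
on a 4-socket machine most of those slots live on remote nodes.  Roughly
(a simplified sketch, not the exact lib/percpu_counter.c code):

  /*
   * Simplified sketch of the percpu_counter read path; the real code
   * also takes fbc->lock and respects the CPU hotplug masks.
   */
  s64 percpu_counter_sum_sketch(struct percpu_counter *fbc)
  {
          s64 ret = fbc->count;           /* central part */
          int cpu;

          /* one cacheline per CPU, cross-node on a 4-socket box */
          for_each_possible_cpu(cpu)
                  ret += *per_cpu_ptr(fbc->counters, cpu);
          return ret;
  }

The old SPLIT_RSS_COUNTING scheme kept plain per-mm counters, so a stats
read was just a few loads (at the cost of some staleness).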


2023-01-31 05:24:02

by Shakeel Butt

Subject: Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression

On Mon, Jan 30, 2023 at 04:15:09AM +0000, Matthew Wilcox wrote:
> On Mon, Jan 30, 2023 at 10:32:56AM +0800, kernel test robot wrote:
> > FYI, we noticed a -19.2% regression of unixbench.score due to commit:
> >
> > commit: f1a7941243c102a44e8847e3b94ff4ff3ec56f25 ("mm: convert mm's rss stats into percpu_counter")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > in testcase: unixbench
> > on test machine: 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> > with following parameters:
> >
> > runtime: 300s
> > nr_task: 30%
> > test: spawn
> > cpufreq_governor: performance
>
> ...
>
> > 9cd6ffa60256e931 f1a7941243c102a44e8847e3b94
> > ---------------- ---------------------------
> > %stddev %change %stddev
> > \ | \
> > 11110 -19.2% 8974 unixbench.score
> > 1090843 -12.2% 957314 unixbench.time.involuntary_context_switches
> > 4243909 ± 6% -32.4% 2867136 ± 5% unixbench.time.major_page_faults
> > 10547 -12.6% 9216 unixbench.time.maximum_resident_set_size
> > 9.913e+08 -19.6% 7.969e+08 unixbench.time.minor_page_faults
> > 5638 +19.1% 6714 unixbench.time.system_time
> > 5502 -20.7% 4363 unixbench.time.user_time
>
> So we're spending a lot more time in the kernel and correspondingly less
> time in userspace.
>
> > 67991885 -16.9% 56507507 unixbench.time.voluntary_context_switches
> > 46198768 -19.1% 37355723 unixbench.workload
> > 1.365e+08 -12.5% 1.195e+08 ± 7% cpuidle..usage
> > 1220612 ± 4% -38.0% 757009 ± 28% meminfo.Active
> > 1220354 ± 4% -38.0% 756754 ± 28% meminfo.Active(anon)
> > 0.50 ± 2% -0.1 0.45 ± 4% mpstat.cpu.all.soft%
> > 1.73 -0.2 1.52 ± 2% mpstat.cpu.all.usr%
> > 532266 -18.4% 434559 vmstat.system.cs
> > 495826 -12.2% 435455 ± 8% vmstat.system.in
> > 1.36e+08 -13.2% 1.18e+08 ± 9% turbostat.C1
> > 68.80 +0.8 69.60 turbostat.C1%
> > 1.663e+08 -12.1% 1.462e+08 ± 8% turbostat.IRQ
> > 15.54 ± 20% -49.0% 7.93 ± 24% sched_debug.cfs_rq:/.runnable_avg.min
> > 13.26 ± 19% -46.6% 7.08 ± 29% sched_debug.cfs_rq:/.util_avg.min
> > 48.96 ± 8% +51.5% 74.20 ± 13% sched_debug.cfs_rq:/.util_est_enqueued.avg
> > 138.00 ± 5% +28.9% 177.87 ± 7% sched_debug.cfs_rq:/.util_est_enqueued.stddev
> > 228060 ± 3% +13.3% 258413 ± 4% sched_debug.cpu.avg_idle.stddev
> > 432533 ± 5% -16.4% 361517 ± 4% sched_debug.cpu.nr_switches.min
> > 2.665e+08 -18.9% 2.162e+08 numa-numastat.node0.local_node
> > 2.666e+08 -18.9% 2.163e+08 numa-numastat.node0.numa_hit
> > 2.746e+08 -20.9% 2.172e+08 numa-numastat.node1.local_node
> > 2.747e+08 -20.9% 2.172e+08 numa-numastat.node1.numa_hit
> > 2.602e+08 -17.4% 2.149e+08 numa-numastat.node2.local_node
> > 2.603e+08 -17.4% 2.149e+08 numa-numastat.node2.numa_hit
> > 2.423e+08 -15.0% 2.06e+08 numa-numastat.node3.local_node
> > 2.424e+08 -15.0% 2.061e+08 numa-numastat.node3.numa_hit
>
> So we're going off-node a lot more for ... something.
>
> > 2.666e+08 -18.9% 2.163e+08 numa-vmstat.node0.numa_hit
> > 2.665e+08 -18.9% 2.162e+08 numa-vmstat.node0.numa_local
> > 2.747e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_hit
> > 2.746e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_local
> > 2.603e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_hit
> > 2.602e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_local
> > 2.424e+08 -15.0% 2.061e+08 numa-vmstat.node3.numa_hit
> > 2.423e+08 -15.0% 2.06e+08 numa-vmstat.node3.numa_local
> > 304947 ± 4% -38.0% 189144 ± 28% proc-vmstat.nr_active_anon
>
> Umm. Are we running vmstat a lot during this test? The commit says:
>
> At the
> moment the readers are either procfs interface, oom_killer and memory
> reclaim which I think are not performance critical and should be ok with
> slow read. However I think we can make that change in a separate patch.
>
> This would explain the increased cross-NUMA references (we're going to
> the other nodes to collect the stats), and the general slowdown. But I
> don't think it reflects a real workload; it's reflecting that the
> monitoring of this workload that we're doing is now more accurate and
> more expensive.
>

Thanks Willy for taking a stab at this issue. The numa_hit stat is
updated on allocations, so I don't think stat collection would increase
these stats. I looked at workload "spawn" in UnixBench and it is a
simple fork ping pong i.e. process does fork and then waits for the
child while the child simply exits.
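
For anyone unfamiliar with the test, the core of that spawn loop is
roughly the sketch below (a minimal stand-in, not the actual UnixBench
source, which also counts iterations against a timer):

  /* Minimal fork ping-pong: fork, child exits at once, parent reaps it. */
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
          for (;;) {
                  pid_t pid = fork();

                  if (pid == 0)
                          _exit(0);               /* child: exit immediately */
                  if (pid < 0)
                          return 1;               /* fork failed */
                  waitpid(pid, NULL, 0);          /* parent: reap the child */
          }
  }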

I ran perf and it seems like the percpu counter allocation is the additional
cost with this patch. See the report below. However, when I made spawn a bit
more sophisticated by adding an mmap() of a GiB, the page table copy became
the significant cost and there was no difference with or without the given
patch.

I am now wondering whether this fork ping-pong is really an important enough
workload that we should revert the patch, or whether we should ignore it for
now and work on improving the performance of the __alloc_percpu_gfp code.


- 90.97% 0.06% spawn [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
- 90.91% entry_SYSCALL_64_after_hwframe
- 90.86% do_syscall_64
- 80.03% __x64_sys_clone
- 79.98% kernel_clone
- 75.97% copy_process
+ 46.04% perf_event_init_task
- 21.50% copy_mm
- 10.05% mm_init
----------------------> - 8.92% __percpu_counter_init
- 8.67% __alloc_percpu_gfp
- 5.70% pcpu_alloc
1.29% _find_next_bit
2.57% memset_erms
+ 0.96% pgd_alloc
+ 6.16% copy_page_range
+ 1.72% anon_vma_fork
+ 0.87% mas_store
0.72% kmem_cache_alloc
+ 2.71% dup_task_struct
+ 1.37% perf_event_fork
0.63% alloc_pid
0.51% copy_files
+ 3.71% wake_up_new_task
+ 7.40% __x64_sys_exit_group
+ 2.32% __x64_sys_wait4
+ 1.03% syscall_exit_to_user_mode
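
In other words, with this patch every fork now pays for NR_MM_COUNTERS
per-CPU allocations in mm_init().  Very roughly (a simplified paraphrase
of the post-patch fork path, not the exact mainline code; error unwinding
and the rest of mm_init() are omitted):

  /* NR_MM_COUNTERS is 4: file, anon, swap and shmem RSS. */
  static int mm_init_rss_counters(struct mm_struct *mm)
  {
          int i;

          for (i = 0; i < NR_MM_COUNTERS; i++)
                  /* each init is a separate pcpu_alloc(), i.e. four per fork */
                  if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL_ACCOUNT))
                          return -ENOMEM;
          return 0;
  }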

2023-01-31 05:45:42

by Matthew Wilcox

Subject: Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression

On Tue, Jan 31, 2023 at 05:23:52AM +0000, Shakeel Butt wrote:
> On Mon, Jan 30, 2023 at 04:15:09AM +0000, Matthew Wilcox wrote:
> > On Mon, Jan 30, 2023 at 10:32:56AM +0800, kernel test robot wrote:
> > > FYI, we noticed a -19.2% regression of unixbench.score due to commit:
> > >
> > > commit: f1a7941243c102a44e8847e3b94ff4ff3ec56f25 ("mm: convert mm's rss stats into percpu_counter")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > in testcase: unixbench
> > > on test machine: 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> > > with following parameters:
> > >
> > > runtime: 300s
> > > nr_task: 30%
> > > test: spawn
> > > cpufreq_governor: performance
> >
> > ...
> >
> > > 9cd6ffa60256e931 f1a7941243c102a44e8847e3b94
> > > ---------------- ---------------------------
> > > %stddev %change %stddev
> > > \ | \
> > > 11110 -19.2% 8974 unixbench.score
> > > 1090843 -12.2% 957314 unixbench.time.involuntary_context_switches
> > > 4243909 ? 6% -32.4% 2867136 ? 5% unixbench.time.major_page_faults
> > > 10547 -12.6% 9216 unixbench.time.maximum_resident_set_size
> > > 9.913e+08 -19.6% 7.969e+08 unixbench.time.minor_page_faults
> > > 5638 +19.1% 6714 unixbench.time.system_time
> > > 5502 -20.7% 4363 unixbench.time.user_time
> >
> > So we're spending a lot more time in the kernel and correspondingly less
> > time in userspace.
> >
> > > 67991885 -16.9% 56507507 unixbench.time.voluntary_context_switches
> > > 46198768 -19.1% 37355723 unixbench.workload
> > > 1.365e+08 -12.5% 1.195e+08 ? 7% cpuidle..usage
> > > 1220612 ? 4% -38.0% 757009 ? 28% meminfo.Active
> > > 1220354 ? 4% -38.0% 756754 ? 28% meminfo.Active(anon)
> > > 0.50 ? 2% -0.1 0.45 ? 4% mpstat.cpu.all.soft%
> > > 1.73 -0.2 1.52 ? 2% mpstat.cpu.all.usr%
> > > 532266 -18.4% 434559 vmstat.system.cs
> > > 495826 -12.2% 435455 ? 8% vmstat.system.in
> > > 1.36e+08 -13.2% 1.18e+08 ? 9% turbostat.C1
> > > 68.80 +0.8 69.60 turbostat.C1%
> > > 1.663e+08 -12.1% 1.462e+08 ? 8% turbostat.IRQ
> > > 15.54 ? 20% -49.0% 7.93 ? 24% sched_debug.cfs_rq:/.runnable_avg.min
> > > 13.26 ? 19% -46.6% 7.08 ? 29% sched_debug.cfs_rq:/.util_avg.min
> > > 48.96 ? 8% +51.5% 74.20 ? 13% sched_debug.cfs_rq:/.util_est_enqueued.avg
> > > 138.00 ? 5% +28.9% 177.87 ? 7% sched_debug.cfs_rq:/.util_est_enqueued.stddev
> > > 228060 ? 3% +13.3% 258413 ? 4% sched_debug.cpu.avg_idle.stddev
> > > 432533 ? 5% -16.4% 361517 ? 4% sched_debug.cpu.nr_switches.min
> > > 2.665e+08 -18.9% 2.162e+08 numa-numastat.node0.local_node
> > > 2.666e+08 -18.9% 2.163e+08 numa-numastat.node0.numa_hit
> > > 2.746e+08 -20.9% 2.172e+08 numa-numastat.node1.local_node
> > > 2.747e+08 -20.9% 2.172e+08 numa-numastat.node1.numa_hit
> > > 2.602e+08 -17.4% 2.149e+08 numa-numastat.node2.local_node
> > > 2.603e+08 -17.4% 2.149e+08 numa-numastat.node2.numa_hit
> > > 2.423e+08 -15.0% 2.06e+08 numa-numastat.node3.local_node
> > > 2.424e+08 -15.0% 2.061e+08 numa-numastat.node3.numa_hit
> >
> > So we're going off-node a lot more for ... something.
> >
> > > 2.666e+08 -18.9% 2.163e+08 numa-vmstat.node0.numa_hit
> > > 2.665e+08 -18.9% 2.162e+08 numa-vmstat.node0.numa_local
> > > 2.747e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_hit
> > > 2.746e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_local
> > > 2.603e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_hit
> > > 2.602e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_local
> > > 2.424e+08 -15.0% 2.061e+08 numa-vmstat.node3.numa_hit
> > > 2.423e+08 -15.0% 2.06e+08 numa-vmstat.node3.numa_local
> > > 304947 ? 4% -38.0% 189144 ? 28% proc-vmstat.nr_active_anon
> >
> > Umm. Are we running vmstat a lot during this test? The commit says:
> >
> > At the
> > moment the readers are either procfs interface, oom_killer and memory
> > reclaim which I think are not performance critical and should be ok with
> > slow read. However I think we can make that change in a separate patch.
> >
> > This would explain the increased cross-NUMA references (we're going to
> > the other nodes to collect the stats), and the general slowdown. But I
> > don't think it reflects a real workload; it's reflecting that the
> > monitoring of this workload that we're doing is now more accurate and
> > more expensive.
> >
>
> Thanks Willy for taking a stab at this issue. The numa_hit stat is
> updated on allocations, so I don't think stat collection would increase
> these stats. I looked at workload "spawn" in UnixBench and it is a
> simple fork ping pong i.e. process does fork and then waits for the
> child while the child simply exits.

Perhaps I was unclear. What I meant was the monitoring done by the 01.org
team might be the thing which is causing the slowdown they're seeing,
since your commit says that reading the stats becomes more expensive,
and the stats they report are consistent with that.

> I ran perf and it seems like the percpu counter allocation is the additional
> cost with this patch. See the report below. However, when I made spawn a bit
> more sophisticated by adding an mmap() of a GiB, the page table copy became
> the significant cost and there was no difference with or without the given
> patch.
>
> I am now wondering whether this fork ping-pong is really an important enough
> workload that we should revert the patch, or whether we should ignore it for
> now and work on improving the performance of the __alloc_percpu_gfp code.
>
>
> - 90.97% 0.06% spawn [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
> - 90.91% entry_SYSCALL_64_after_hwframe
> - 90.86% do_syscall_64
> - 80.03% __x64_sys_clone
> - 79.98% kernel_clone
> - 75.97% copy_process
> + 46.04% perf_event_init_task
> - 21.50% copy_mm
> - 10.05% mm_init
> ----------------------> - 8.92% __percpu_counter_init
> - 8.67% __alloc_percpu_gfp
> - 5.70% pcpu_alloc

5.7% of our time spent in pcpu_alloc seems excessive. Are we contending
on pcpu_alloc_mutex perhaps? Also, are you doing this on a 4-socket
machine like the kernel test robot ran on?

We could cut down the number of calls to pcpu_alloc() by a factor of 4
by having a pcpu_alloc_bulk() that would allocate all four RSS counters
at once.

Just throwing out ideas ...
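
To make the shape of that idea concrete: whether done at the pcpu_alloc()
level or one layer up in percpu_counter, a bulk initializer could do a
single per-CPU allocation sized for all the counters and carve it up,
along these lines (purely a hypothetical sketch; neither this name nor
this signature exists in the kernel today):

  /* Hypothetical: one pcpu_alloc() shared by 'nr' counters instead of
   * 'nr' separate allocations.  The real struct percpu_counter also has
   * hotplug bookkeeping that is ignored here. */
  static int percpu_counter_init_bulk(struct percpu_counter *fbc, int nr, gfp_t gfp)
  {
          s32 __percpu *counters;
          int i;

          counters = __alloc_percpu_gfp(nr * sizeof(s32), sizeof(s32), gfp);
          if (!counters)
                  return -ENOMEM;

          for (i = 0; i < nr; i++) {
                  raw_spin_lock_init(&fbc[i].lock);
                  fbc[i].count = 0;
                  /* hand each counter one s32 slot out of the shared area */
                  fbc[i].counters = (void __percpu *)counters + i * sizeof(s32);
          }
          return 0;
  }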

2023-01-31 05:57:50

by Shakeel Butt

Subject: Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression

On Tue, Jan 31, 2023 at 05:45:21AM +0000, Matthew Wilcox wrote:
[...]
> > I ran perf and it seems like the percpu counter allocation is the additional
> > cost with this patch. See the report below. However, when I made spawn a bit
> > more sophisticated by adding an mmap() of a GiB, the page table copy became
> > the significant cost and there was no difference with or without the given
> > patch.
> >
> > I am now wondering whether this fork ping-pong is really an important enough
> > workload that we should revert the patch, or whether we should ignore it for
> > now and work on improving the performance of the __alloc_percpu_gfp code.
> >
> >
> > - 90.97% 0.06% spawn [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
> > - 90.91% entry_SYSCALL_64_after_hwframe
> > - 90.86% do_syscall_64
> > - 80.03% __x64_sys_clone
> > - 79.98% kernel_clone
> > - 75.97% copy_process
> > + 46.04% perf_event_init_task
> > - 21.50% copy_mm
> > - 10.05% mm_init
> > ----------------------> - 8.92% __percpu_counter_init
> > - 8.67% __alloc_percpu_gfp
> > - 5.70% pcpu_alloc
>
> 5.7% of our time spent in pcpu_alloc seems excessive. Are we contending
> on pcpu_alloc_mutex perhaps? Also, are you doing this on a 4-socket
> machine like the kernel test robot ran on?

I ran this on a 2-socket machine. I am not sure about pcpu_alloc_mutex, but I
doubt it is contended because I ran a single instance of the spawn test, i.e.
a single fork ping-pong.

>
> We could cut down the number of calls to pcpu_alloc() by a factor of 4
> by having a pcpu_alloc_bulk() that would allocate all four RSS counters
> at once.
>
> Just throwing out ideas ...

Thanks, I will take a stab at pcpu_alloc_bulk() and will share the
result tomorrow.

thanks,
Shakeel

2023-01-31 06:14:42

by Feng Tang

Subject: Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression

On Mon, Jan 30, 2023 at 12:15:09PM +0800, Matthew Wilcox wrote:
> On Mon, Jan 30, 2023 at 10:32:56AM +0800, kernel test robot wrote:
> > FYI, we noticed a -19.2% regression of unixbench.score due to commit:
> >
> > commit: f1a7941243c102a44e8847e3b94ff4ff3ec56f25 ("mm: convert mm's rss stats into percpu_counter")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > in testcase: unixbench
> > on test machine: 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> > with following parameters:
> >
> > runtime: 300s
> > nr_task: 30%
> > test: spawn
> > cpufreq_governor: performance
>
> ...
>
> > 9cd6ffa60256e931 f1a7941243c102a44e8847e3b94
> > ---------------- ---------------------------
> > %stddev %change %stddev
> > \ | \
> > 11110 -19.2% 8974 unixbench.score
> > 1090843 -12.2% 957314 unixbench.time.involuntary_context_switches
> > 4243909 ± 6% -32.4% 2867136 ± 5% unixbench.time.major_page_faults
> > 10547 -12.6% 9216 unixbench.time.maximum_resident_set_size
> > 9.913e+08 -19.6% 7.969e+08 unixbench.time.minor_page_faults
> > 5638 +19.1% 6714 unixbench.time.system_time
> > 5502 -20.7% 4363 unixbench.time.user_time
>
> So we're spending a lot more time in the kernel and correspondingly less
> time in userspace.
>
> > 67991885 -16.9% 56507507 unixbench.time.voluntary_context_switches
> > 46198768 -19.1% 37355723 unixbench.workload
> > 1.365e+08 -12.5% 1.195e+08 ± 7% cpuidle..usage
> > 1220612 ± 4% -38.0% 757009 ± 28% meminfo.Active
> > 1220354 ± 4% -38.0% 756754 ± 28% meminfo.Active(anon)
> > 0.50 ± 2% -0.1 0.45 ± 4% mpstat.cpu.all.soft%
> > 1.73 -0.2 1.52 ± 2% mpstat.cpu.all.usr%
> > 532266 -18.4% 434559 vmstat.system.cs
> > 495826 -12.2% 435455 ± 8% vmstat.system.in
> > 1.36e+08 -13.2% 1.18e+08 ± 9% turbostat.C1
> > 68.80 +0.8 69.60 turbostat.C1%
> > 1.663e+08 -12.1% 1.462e+08 ± 8% turbostat.IRQ
> > 15.54 ± 20% -49.0% 7.93 ± 24% sched_debug.cfs_rq:/.runnable_avg.min
> > 13.26 ± 19% -46.6% 7.08 ± 29% sched_debug.cfs_rq:/.util_avg.min
> > 48.96 ± 8% +51.5% 74.20 ± 13% sched_debug.cfs_rq:/.util_est_enqueued.avg
> > 138.00 ± 5% +28.9% 177.87 ± 7% sched_debug.cfs_rq:/.util_est_enqueued.stddev
> > 228060 ± 3% +13.3% 258413 ± 4% sched_debug.cpu.avg_idle.stddev
> > 432533 ± 5% -16.4% 361517 ± 4% sched_debug.cpu.nr_switches.min
> > 2.665e+08 -18.9% 2.162e+08 numa-numastat.node0.local_node
> > 2.666e+08 -18.9% 2.163e+08 numa-numastat.node0.numa_hit
> > 2.746e+08 -20.9% 2.172e+08 numa-numastat.node1.local_node
> > 2.747e+08 -20.9% 2.172e+08 numa-numastat.node1.numa_hit
> > 2.602e+08 -17.4% 2.149e+08 numa-numastat.node2.local_node
> > 2.603e+08 -17.4% 2.149e+08 numa-numastat.node2.numa_hit
> > 2.423e+08 -15.0% 2.06e+08 numa-numastat.node3.local_node
> > 2.424e+08 -15.0% 2.061e+08 numa-numastat.node3.numa_hit
>
> So we're going off-node a lot more for ... something.
>
> > 2.666e+08 -18.9% 2.163e+08 numa-vmstat.node0.numa_hit
> > 2.665e+08 -18.9% 2.162e+08 numa-vmstat.node0.numa_local
> > 2.747e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_hit
> > 2.746e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_local
> > 2.603e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_hit
> > 2.602e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_local
> > 2.424e+08 -15.0% 2.061e+08 numa-vmstat.node3.numa_hit
> > 2.423e+08 -15.0% 2.06e+08 numa-vmstat.node3.numa_local
> > 304947 ± 4% -38.0% 189144 ± 28% proc-vmstat.nr_active_anon
>
> Umm. Are we running vmstat a lot during this test? The commit says:

Thanks for the hint.

Right. IIUC, 0Day normally runs 'vmstat 1' to collect info, and also reads
/proc/vmstat and /sys/devices/system/node/nodeXX/numastat once per second.

> At the
> moment the readers are either procfs interface, oom_killer and memory
> reclaim which I think are not performance critical and should be ok with
> slow read. However I think we can make that change in a separate patch.
>
> This would explain the increased cross-NUMA references (we're going to
> the other nodes to collect the stats), and the general slowdown. But I
> don't think it reflects a real workload; it's reflecting that the
> monitoring of this workload that we're doing is now more accurate and
> more expensive.

We tried disabling the three types of 'vmstat/numastat' monitors mentioned
above, and the regression is still there at a similar rate of around 19%.

Thanks,
Feng


2023-01-31 18:27:08

by Shakeel Butt

Subject: Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression

+per-cpu memory maintainers for FYI.

Thread started at
https://lore.kernel.org/linux-mm/[email protected]/

On Mon, Jan 30, 2023 at 9:57 PM Shakeel Butt <[email protected]> wrote:
>
[...]
> >
> > We could cut down the number of calls to pcpu_alloc() by a factor of 4
> > by having a pcpu_alloc_bulk() that would allocate all four RSS counters
> > at once.
> >
> > Just throwing out ideas ...
>
> Thanks, I will take a stab at pcpu_alloc_bulk() and will share the
> result tomorrow.
>

OK, not a one day effort.

Andrew, which option do you prefer?

1. Keep the patches as the test workload (fork ping pong) is not a
representative of real world workload and work on improving
pcpu_alloc() for 6.4+.

2. Revert the patches for now, improve pcpu_alloc() and re-introduce
the patches once we confirm that fork-ping-pong is not regressed
anymore.

thanks,
Shakeel

2023-02-27 06:35:46

by Yin, Fengwei

Subject: Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression

Hi Andrew, Shakeel,

On Tue, 2023-01-31 at 10:26 -0800, Shakeel Butt wrote:
> +per-cpu memory maintainers for FYI.
>
> Thread started at
> https://lore.kernel.org/linux-mm/[email protected]/
>
> On Mon, Jan 30, 2023 at 9:57 PM Shakeel Butt <[email protected]>
> wrote:
> >
> [...]
> > >
> > > We could cut down the number of calls to pcpu_alloc() by a factor
> > > of 4
> > > by having a pcpu_alloc_bulk() that would allocate all four RSS
> > > counters
> > > at once.
> > >
> > > Just throwing out ideas ...
> >
> > Thanks, I will take a stab at pcpu_alloc_bulk() and will share the
> > result tomorrow.
> >
>
> OK, not a one day effort.
>
> Andrew, which option do you prefer?
>
> 1. Keep the patches as the test workload (fork ping pong) is not a
> representative of real world workload and work on improving
> pcpu_alloc() for 6.4+.
>
> 2. Revert the patches for now, improve pcpu_alloc() and re-introduce
> the patches once we confirm that fork-ping-pong is not regressed
> anymore.
This performance regression can still be reproduced on the latest master
branch. So we take option 1 here, right? Thanks.

Regards
Yin, Fengwei

>
> thanks,
> Shakeel

2023-02-27 16:50:24

by Shakeel Butt

Subject: Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression

On Sun, Feb 26, 2023 at 10:35 PM Yin, Fengwei <[email protected]> wrote:
>
> Hi Andrew, Shakeel,
>
> On Tue, 2023-01-31 at 10:26 -0800, Shakeel Butt wrote:
> > +per-cpu memory maintainers for FYI.
> >
> > Thread started at
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > On Mon, Jan 30, 2023 at 9:57 PM Shakeel Butt <[email protected]>
> > wrote:
> > >
> > [...]
> > > >
> > > > We could cut down the number of calls to pcpu_alloc() by a factor
> > > > of 4
> > > > by having a pcpu_alloc_bulk() that would allocate all four RSS
> > > > counters
> > > > at once.
> > > >
> > > > Just throwing out ideas ...
> > >
> > > Thanks, I will take a stab at pcpu_alloc_bulk() and will share the
> > > result tomorrow.
> > >
> >
> > OK, not a one day effort.
> >
> > Andrew, which option do you prefer?
> >
> > 1. Keep the patches as the test workload (fork ping pong) is not a
> > representative of real world workload and work on improving
> > pcpu_alloc() for 6.4+.
> >
> > 2. Revert the patches for now, improve pcpu_alloc() and re-introduce
> > the patches once we confirm that fork-ping-pong is not regressed
> > anymore.
> This performance regression can still be reproduced on the latest master
> branch. So we take option 1 here, right? Thanks.
>

Yes unless some real workload reports regression.

Shakeel

2023-02-28 00:32:56

by Yin, Fengwei

Subject: Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression

Hi Shakeel,

On 2/28/2023 12:50 AM, Shakeel Butt wrote:
> On Sun, Feb 26, 2023 at 10:35 PM Yin, Fengwei <[email protected]> wrote:
>>
>> Hi Andrew, Shakeel,
>>
>> On Tue, 2023-01-31 at 10:26 -0800, Shakeel Butt wrote:
>>> +per-cpu memory maintainers for FYI.
>>>
>>> Thread started at
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> On Mon, Jan 30, 2023 at 9:57 PM Shakeel Butt <[email protected]>
>>> wrote:
>>>>
>>> [...]
>>>>>
>>>>> We could cut down the number of calls to pcpu_alloc() by a factor
>>>>> of 4
>>>>> by having a pcpu_alloc_bulk() that would allocate all four RSS
>>>>> counters
>>>>> at once.
>>>>>
>>>>> Just throwing out ideas ...
>>>>
>>>> Thanks, I will take a stab at pcpu_alloc_bulk() and will share the
>>>> result tomorrow.
>>>>
>>>
>>> OK, not a one day effort.
>>>
>>> Andrew, which option do you prefer?
>>>
>>> 1. Keep the patches as the test workload (fork ping pong) is not a
>>> representative of real world workload and work on improving
>>> pcpu_alloc() for 6.4+.
>>>
>>> 2. Revert the patches for now, improve pcpu_alloc() and re-introduce
>>> the patches once we confirm that fork-ping-pong is not regressed
>>> anymore.
>> This performance regression can still be reproduced on the latest master
>> branch. So we take option 1 here, right? Thanks.
>>
>
> Yes unless some real workload reports regression.
Thanks for the confirmation.

Regards
Yin, Fengwei

>
> Shakeel