2022-12-19 10:35:51

by kernel test robot

Subject: [linus:master] will-it-scale.per_thread_ops -40.2% regression in mmap1 benchmark

Greetings,

FYI, we noticed a -40.2% regression of will-it-scale.per_thread_ops
between commits 524e00b36e8c and e15e06a83923 of mainline

524e00b36e8c5 mm: remove rb tree.
0c563f1480435 proc: remove VMA rbtree use from nommu
d0cf3dd47f0d5 damon: convert __damon_va_three_regions to use the VMA iterator
c9dbe82cb99db kernel/fork: use maple tree for dup_mmap() during forking
3499a13168da6 mm/mmap: use maple tree for unmapped_area{_topdown}
7fdbd37da5c6f mm/mmap: use the maple tree for find_vma_prev() instead of the rbtree
be8432e7166ef mm/mmap: use the maple tree in find_vma() instead of the rbtree.
2e3af1db17442 mmap: use the VMA iterator in count_vma_pages_range()
f39af05949a42 mm: add VMA iterator
d4af56c5c7c67 mm: start tracking VMAs with maple tree
e15e06a839232 lib/test_maple_tree: add testing for maple tree

in testcase: will-it-scale
on test machine: 104 threads 2 sockets (Skylake) with 192G memory
with following parameters:

nr_task: 50%
mode: thread
test: mmap1
cpufreq_governor: performance

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
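
For context, each mmap1 worker runs a tight map/unmap loop. Below is a
minimal sketch of the testcase, based on mmap1.c in the will-it-scale
repository linked above; the 128MB size, the mmap flags, and the
testcase() signature are taken from that source and may differ between
versions.

    #include <assert.h>
    #include <sys/mman.h>

    #define MEMSIZE (128 * 1024 * 1024)

    /*
     * Each thread maps and then immediately unmaps 128MB of anonymous
     * memory; the harness samples *iterations to derive per_thread_ops.
     */
    void testcase(unsigned long long *iterations, unsigned long nr)
    {
            while (1) {
                    char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                    assert(c != MAP_FAILED);
                    munmap(c, MEMSIZE);

                    (*iterations)++;
            }
    }

With nr_task at 50% of this 104-thread machine, 52 such threads share a
single mm, so every iteration takes mmap_lock for write twice (once for
mmap() and once for munmap()).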


We couldn't identify the commit that introduced this regression because
some of the above commits failed to boot during bisection, but it looks
related to the maple tree code. Please check the following details:

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-11/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/mmap1/will-it-scale

commit:
e15e06a839232 ("lib/test_maple_tree: add testing for maple tree")
524e00b36e8c5 ("mm: remove rb tree.")

e15e06a8392321a1 524e00b36e8c547f5582eef3fb6
---------------- ---------------------------
%stddev %change %stddev
\ | \
238680 -40.2% 142816 will-it-scale.52.threads
4589 -40.2% 2746 will-it-scale.per_thread_ops
238680 -40.2% 142816 will-it-scale.workload
0.28 -0.1 0.20 ± 3% mpstat.cpu.all.usr%
7758 -1.6% 7636 proc-vmstat.nr_mapped
0.03 ± 14% +40.0% 0.05 ± 10% time.system_time
35.87 ± 41% -17.2 18.71 ± 92% turbostat.C1E%
14.11 ±105% +15.5 29.62 ± 52% turbostat.C6%
466662 ± 3% +20.3% 561351 ± 3% turbostat.POLL
42.33 +3.9% 44.00 turbostat.PkgTmp
838.08 ± 49% -50.7% 412.94 ± 19% sched_debug.cfs_rq:/.load_avg.max
466231 ± 14% -53.4% 217040 ± 82% sched_debug.cfs_rq:/.min_vruntime.min
-335910 +146.5% -828023 sched_debug.cfs_rq:/.spread0.min
602391 ± 4% +6.5% 641749 ± 4% sched_debug.cpu.avg_idle.avg
26455 ± 7% +16.1% 30723 ± 6% sched_debug.cpu.nr_switches.max
230323 ± 6% +42.4% 327946 ± 3% numa-numastat.node0.local_node
257238 ± 2% +29.2% 332446 numa-numastat.node0.numa_hit
26826 ± 35% -83.1% 4532 ±138% numa-numastat.node0.other_node
344370 ± 3% -26.8% 251981 ± 2% numa-numastat.node1.local_node
351214 ± 2% -19.9% 281185 numa-numastat.node1.numa_hit
6779 ±139% +330.8% 29204 ± 21% numa-numastat.node1.other_node
111776 ± 8% +43.9% 160892 ± 17% numa-meminfo.node0.AnonHugePages
163879 ± 5% +34.9% 221083 ± 21% numa-meminfo.node0.AnonPages
182360 ± 2% +39.7% 254705 ± 15% numa-meminfo.node0.AnonPages.max
167687 ± 4% +33.0% 223029 ± 20% numa-meminfo.node0.Inactive
165329 ± 4% +34.9% 223029 ± 20% numa-meminfo.node0.Inactive(anon)
2357 ±131% -100.0% 0.00 numa-meminfo.node0.Inactive(file)
2087 ± 11% +22.1% 2548 ± 9% numa-meminfo.node0.PageTables
170594 ± 7% -27.5% 123611 ± 23% numa-meminfo.node1.AnonHugePages
238127 ± 3% -23.9% 181170 ± 25% numa-meminfo.node1.AnonPages
278201 ± 3% -26.8% 203778 ± 22% numa-meminfo.node1.AnonPages.max
244262 ± 2% -24.0% 185599 ± 25% numa-meminfo.node1.Inactive
244206 ± 2% -24.1% 185419 ± 25% numa-meminfo.node1.Inactive(anon)
20767 ± 64% -48.4% 10717 ±124% numa-meminfo.node1.Mapped
40936 ± 5% +34.9% 55213 ± 21% numa-vmstat.node0.nr_anon_pages
41317 ± 4% +34.8% 55700 ± 20% numa-vmstat.node0.nr_inactive_anon
41317 ± 4% +34.8% 55700 ± 20% numa-vmstat.node0.nr_zone_inactive_anon
257331 ± 2% +29.2% 332536 numa-vmstat.node0.numa_hit
230417 ± 5% +42.4% 328036 ± 3% numa-vmstat.node0.numa_local
26826 ± 35% -83.1% 4532 ±138% numa-vmstat.node0.numa_other
59518 ± 4% -24.0% 45237 ± 25% numa-vmstat.node1.nr_anon_pages
61041 ± 3% -24.2% 46287 ± 25% numa-vmstat.node1.nr_inactive_anon
5196 ± 64% -48.7% 2666 ±126% numa-vmstat.node1.nr_mapped
61041 ± 3% -24.2% 46287 ± 25% numa-vmstat.node1.nr_zone_inactive_anon
351314 ± 2% -20.0% 281191 numa-vmstat.node1.numa_hit
344470 ± 4% -26.8% 251987 ± 2% numa-vmstat.node1.numa_local
6779 ±139% +330.8% 29204 ± 21% numa-vmstat.node1.numa_other
3.12 ± 10% -25.7% 2.32 ± 2% perf-stat.i.MPKI
3.111e+09 +4.4% 3.247e+09 perf-stat.i.branch-instructions
0.43 -0.0 0.39 perf-stat.i.branch-miss-rate%
13577850 -5.5% 12837395 perf-stat.i.branch-misses
38.85 ± 3% +4.6 43.44 ± 3% perf-stat.i.cache-miss-rate%
47922345 ± 10% -21.9% 37423833 ± 2% perf-stat.i.cache-references
9.42 -5.1% 8.94 perf-stat.i.cpi
0.02 -0.0 0.01 perf-stat.i.dTLB-load-miss-rate%
632005 -28.8% 449814 perf-stat.i.dTLB-load-misses
4.127e+09 +3.8% 4.282e+09 perf-stat.i.dTLB-loads
0.00 ± 7% -0.0 0.00 ± 11% perf-stat.i.dTLB-store-miss-rate%
3.131e+08 +26.5% 3.962e+08 perf-stat.i.dTLB-stores
599587 ± 8% -20.0% 479492 ± 6% perf-stat.i.iTLB-load-misses
2324378 -12.7% 2028806 ± 7% perf-stat.i.iTLB-loads
1.54e+10 +5.4% 1.622e+10 perf-stat.i.instructions
25907 ± 7% +31.4% 34030 ± 6% perf-stat.i.instructions-per-iTLB-miss
0.11 +5.4% 0.11 perf-stat.i.ipc
570.88 ± 8% -22.1% 444.53 ± 2% perf-stat.i.metric.K/sec
72.60 +5.0% 76.20 perf-stat.i.metric.M/sec
90.37 +1.5 91.82 perf-stat.i.node-load-miss-rate%
7458505 ± 2% -27.2% 5431142 ± 3% perf-stat.i.node-load-misses
795163 -39.1% 484036 perf-stat.i.node-loads
3.11 ± 10% -25.9% 2.31 ± 2% perf-stat.overall.MPKI
0.44 -0.0 0.40 perf-stat.overall.branch-miss-rate%
38.72 ± 3% +4.5 43.24 ± 3% perf-stat.overall.cache-miss-rate%
9.40 -5.1% 8.93 perf-stat.overall.cpi
0.02 -0.0 0.01 perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 6% -0.0 0.00 ± 11% perf-stat.overall.dTLB-store-miss-rate%
25842 ± 7% +31.5% 33976 ± 6% perf-stat.overall.instructions-per-iTLB-miss
0.11 +5.4% 0.11 perf-stat.overall.ipc
90.36 +1.4 91.81 perf-stat.overall.node-load-miss-rate%
19478525 +76.1% 34307144 perf-stat.overall.path-length
3.101e+09 +4.4% 3.236e+09 perf-stat.ps.branch-instructions
13536210 -5.5% 12794692 perf-stat.ps.branch-misses
47758992 ± 10% -21.9% 37302259 ± 2% perf-stat.ps.cache-references
629957 -28.8% 448327 perf-stat.ps.dTLB-load-misses
4.113e+09 +3.8% 4.268e+09 perf-stat.ps.dTLB-loads
3.121e+08 +26.5% 3.949e+08 perf-stat.ps.dTLB-stores
597514 ± 8% -20.0% 477834 ± 6% perf-stat.ps.iTLB-load-misses
2316405 -12.7% 2021878 ± 7% perf-stat.ps.iTLB-loads
1.535e+10 +5.4% 1.617e+10 perf-stat.ps.instructions
7434434 ± 2% -27.2% 5412315 ± 3% perf-stat.ps.node-load-misses
792675 -39.1% 482405 perf-stat.ps.node-loads
4.648e+12 +5.4% 4.9e+12 perf-stat.total.instructions
24.16 ± 66% -16.4 7.77 ±122% perf-profile.calltrace.cycles-pp.mwait_idle_with_hints.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
24.16 ± 66% -16.4 7.77 ±122% perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
33.88 ± 20% -9.6 24.32 ± 7% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
33.89 ± 20% -9.3 24.61 ± 6% perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
33.05 ± 20% -8.9 24.13 ± 6% perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
33.04 ± 20% -8.9 24.13 ± 6% perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
33.05 ± 20% -8.9 24.14 ± 6% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
33.05 ± 20% -8.9 24.14 ± 6% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
33.05 ± 20% -8.9 24.14 ± 6% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
0.38 ± 70% +0.2 0.61 ± 2% perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +0.6 0.56 ± 2% perf-profile.calltrace.cycles-pp.rwsem_spin_on_owner.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
0.00 +0.6 0.57 ± 2% perf-profile.calltrace.cycles-pp.rwsem_spin_on_owner.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
0.00 +0.6 0.60 ± 3% perf-profile.calltrace.cycles-pp.do_mmap.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
31.73 ± 10% +4.5 36.24 ± 2% perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
31.50 ± 10% +4.6 36.05 perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
33.19 ± 10% +4.6 37.77 ± 2% perf-profile.calltrace.cycles-pp.__munmap
32.39 ± 10% +4.6 36.97 ± 2% perf-profile.calltrace.cycles-pp.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
32.34 ± 10% +4.6 36.94 ± 2% perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64
33.08 ± 10% +4.6 37.69 ± 2% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
32.15 ± 10% +4.6 36.76 perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
33.05 ± 10% +4.6 37.66 ± 2% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
32.31 ± 10% +4.6 36.92 ± 2% perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
32.10 ± 10% +4.6 36.73 perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
32.98 ± 10% +4.6 37.61 ± 2% perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
32.07 ± 10% +4.6 36.70 perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.do_syscall_64
32.98 ± 10% +4.6 37.61 ± 2% perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
32.86 ± 10% +4.7 37.56 perf-profile.calltrace.cycles-pp.__mmap
32.74 ± 10% +4.7 37.48 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__mmap
32.71 ± 10% +4.8 37.46 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
32.62 ± 10% +4.8 37.39 perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
24.31 ± 66% -16.4 7.88 ±122% perf-profile.children.cycles-pp.intel_idle
33.89 ± 20% -9.3 24.61 ± 6% perf-profile.children.cycles-pp.secondary_startup_64_no_verify
33.89 ± 20% -9.3 24.61 ± 6% perf-profile.children.cycles-pp.cpu_startup_entry
33.89 ± 20% -9.3 24.61 ± 6% perf-profile.children.cycles-pp.do_idle
33.85 ± 20% -9.3 24.57 ± 6% perf-profile.children.cycles-pp.mwait_idle_with_hints
33.88 ± 20% -9.3 24.60 ± 6% perf-profile.children.cycles-pp.cpuidle_enter
33.88 ± 20% -9.3 24.60 ± 6% perf-profile.children.cycles-pp.cpuidle_enter_state
33.88 ± 20% -9.3 24.61 ± 6% perf-profile.children.cycles-pp.cpuidle_idle_call
33.05 ± 20% -8.9 24.14 ± 6% perf-profile.children.cycles-pp.start_secondary
0.84 ± 25% -0.4 0.48 ± 16% perf-profile.children.cycles-pp.start_kernel
0.84 ± 25% -0.4 0.48 ± 16% perf-profile.children.cycles-pp.arch_call_rest_init
0.84 ± 25% -0.4 0.48 ± 16% perf-profile.children.cycles-pp.rest_init
0.16 ± 12% -0.1 0.08 perf-profile.children.cycles-pp.unmap_region
0.14 ± 11% -0.0 0.10 ± 8% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.13 ± 12% -0.0 0.10 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.00 +0.1 0.06 ± 13% perf-profile.children.cycles-pp.mas_wr_node_store
0.00 +0.1 0.06 ± 7% perf-profile.children.cycles-pp.memset_erms
0.00 +0.1 0.06 ± 7% perf-profile.children.cycles-pp.mas_wr_modify
0.00 +0.1 0.07 ± 6% perf-profile.children.cycles-pp.kmem_cache_free_bulk
0.53 ± 10% +0.1 0.61 ± 2% perf-profile.children.cycles-pp.__do_munmap
0.00 +0.1 0.08 ± 5% perf-profile.children.cycles-pp.mas_destroy
0.00 +0.1 0.09 ± 5% perf-profile.children.cycles-pp.mt_find
0.00 +0.1 0.10 perf-profile.children.cycles-pp.mas_spanning_rebalance
0.00 +0.1 0.10 ± 4% perf-profile.children.cycles-pp.mas_wr_spanning_store
0.00 +0.1 0.12 ± 4% perf-profile.children.cycles-pp.mas_rev_awalk
0.00 +0.1 0.13 perf-profile.children.cycles-pp.mas_empty_area_rev
0.00 +0.1 0.14 ± 5% perf-profile.children.cycles-pp.kmem_cache_alloc_bulk
0.00 +0.2 0.16 ± 5% perf-profile.children.cycles-pp.mas_alloc_nodes
0.00 +0.2 0.17 ± 4% perf-profile.children.cycles-pp.mas_preallocate
0.42 ± 15% +0.2 0.60 ± 3% perf-profile.children.cycles-pp.do_mmap
0.06 ± 7% +0.2 0.27 perf-profile.children.cycles-pp.vma_link
0.20 ± 14% +0.2 0.41 ± 4% perf-profile.children.cycles-pp.mmap_region
0.00 +0.3 0.35 ± 4% perf-profile.children.cycles-pp.mas_store_prealloc
0.78 ± 8% +0.4 1.13 ± 2% perf-profile.children.cycles-pp.rwsem_spin_on_owner
33.20 ± 10% +4.6 37.77 ± 2% perf-profile.children.cycles-pp.__munmap
32.98 ± 10% +4.6 37.61 ± 2% perf-profile.children.cycles-pp.__x64_sys_munmap
32.98 ± 10% +4.6 37.61 ± 2% perf-profile.children.cycles-pp.__vm_munmap
32.86 ± 10% +4.7 37.56 perf-profile.children.cycles-pp.__mmap
32.62 ± 10% +4.8 37.40 perf-profile.children.cycles-pp.vm_mmap_pgoff
63.26 ± 10% +9.1 72.32 ± 2% perf-profile.children.cycles-pp.osq_lock
64.54 ± 10% +9.2 73.72 ± 2% perf-profile.children.cycles-pp.down_write_killable
64.44 ± 10% +9.2 73.66 ± 2% perf-profile.children.cycles-pp.rwsem_down_write_slowpath
64.38 ± 10% +9.2 73.62 ± 2% perf-profile.children.cycles-pp.rwsem_optimistic_spin
65.87 ± 10% +9.3 75.21 ± 2% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
65.79 ± 10% +9.4 75.15 ± 2% perf-profile.children.cycles-pp.do_syscall_64
33.85 ± 20% -9.3 24.57 ± 6% perf-profile.self.cycles-pp.mwait_idle_with_hints
0.29 ± 19% -0.1 0.14 ± 3% perf-profile.self.cycles-pp.rwsem_optimistic_spin
0.13 ± 10% -0.0 0.09 ± 9% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.09 ± 9% -0.0 0.05 ± 8% perf-profile.self.cycles-pp.down_write_killable
0.13 ± 12% -0.0 0.10 perf-profile.self.cycles-pp.syscall_return_via_sysret
0.00 +0.1 0.06 perf-profile.self.cycles-pp.memset_erms
0.00 +0.1 0.06 ± 13% perf-profile.self.cycles-pp.kmem_cache_free_bulk
0.00 +0.1 0.06 ± 7% perf-profile.self.cycles-pp.kmem_cache_alloc_bulk
0.00 +0.1 0.08 perf-profile.self.cycles-pp.mt_find
0.00 +0.1 0.11 ± 4% perf-profile.self.cycles-pp.mas_rev_awalk
0.76 ± 8% +0.4 1.12 ± 2% perf-profile.self.cycles-pp.rwsem_spin_on_owner
62.94 ± 10% +9.0 71.91 ± 2% perf-profile.self.cycles-pp.osq_lock


If you fix the issue, kindly add the following tags
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.

--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
config-6.1.0-rc7-00211-g0ba09b173387 (168.61 kB)
job-script (7.67 kB)
job.yaml (5.22 kB)
reproduce (354.00 B)

2022-12-19 18:23:58

by Liam R. Howlett

Subject: Re: [linus:master] will-it-scale.per_thread_ops -40.2% regression in mmap1 benchmark

* kernel test robot <[email protected]> [221219 05:01]:
> Greetings,
>
> FYI, we noticed a -40.2% regression of will-it-scale.per_thread_ops
> between commits 524e00b36e8c and e15e06a83923 of mainline

Thank you for running this test.

We are aware of this regression. The regression was taken as an
acceptable trade-off for the gain in read speed. Applications
perform more reads than writes to the VMA tree, so the overall
performance of real applications is either even or faster with the
maple tree. This can be seen in kernel build times, where forked
processes are short lived and so would be close to the worst-case
scenario.
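
For reference, the read side referred to here includes lookups like
find_vma(), which after this series is a single maple tree walk.
Simplified from mm/mmap.c as of these commits (the exact form may vary
between kernel versions):

    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
            unsigned long index = addr;

            mmap_assert_locked(mm);
            /* First entry at or after addr, i.e. the first VMA with
             * vm_end > addr, found in one walk of the wide nodes. */
            return mt_find(&mm->mm_mt, &index, ULONG_MAX);
    }

The rbtree version had to chase per-VMA node pointers on every lookup;
the maple tree's wide nodes keep this walk short and cache friendly,
which is where the read-speed gain comes from.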

This isn't to say we can't do better, and we are constantly working
towards faster performance. Please continue to report on the
performance.

Looking specifically at mmap1, it is mapping and then unmapping in a
tight loop. Considering the internals of what is going on, the
regression is expected, but I don't believe this pattern would ever
occur in an application that is doing what it is supposed to be doing.

If you find a real application that shows a performance regression,
please let us know.

>
> 524e00b36e8c5 mm: remove rb tree.
> 0c563f1480435 proc: remove VMA rbtree use from nommu
> d0cf3dd47f0d5 damon: convert __damon_va_three_regions to use the VMA iterator
> c9dbe82cb99db kernel/fork: use maple tree for dup_mmap() during forking
> 3499a13168da6 mm/mmap: use maple tree for unmapped_area{_topdown}
> 7fdbd37da5c6f mm/mmap: use the maple tree for find_vma_prev() instead of the rbtree
> be8432e7166ef mm/mmap: use the maple tree in find_vma() instead of the rbtree.
> 2e3af1db17442 mmap: use the VMA iterator in count_vma_pages_range()
> f39af05949a42 mm: add VMA iterator
> d4af56c5c7c67 mm: start tracking VMAs with maple tree
> e15e06a839232 lib/test_maple_tree: add testing for maple tree
>
> in testcase: will-it-scale
> on test machine: 104 threads 2 sockets (Skylake) with 192G memory
> with following parameters:
>
> nr_task: 50%
> mode: thread
> test: mmap1
> cpufreq_governor: performance
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
>
>
> We couldn't identify the commit that introduced this regression because
> some of the above commits failed to boot during bisection, but it looks
> related to the maple tree code. Please check the following details:

It is interesting that these issues were not detected by myself or other
build bots. Perhaps there is a configuration option that wasn't tested.
In any case, all of the listed commits were in preparation for the last
commit, which removes the rb tree. Regardless of which commit introduced
the regression, what is being detected is the fact that the maple tree
is slower on writes.

Thanks,
Liam