2021-05-25 03:27:26

by kernel test robot

Subject: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression



Greetings,

FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:


commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master


in testcase: will-it-scale
on test machine: 96 threads 2 sockets Ice Lake with 256G memory
with following parameters:

nr_task: 50%
mode: thread
test: mmap1
cpufreq_governor: performance
ucode: 0xb000280

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process-based and a thread-based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
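
For orientation, the mmap1 testcase boils down to each thread repeatedly mapping and unmapping an anonymous region, so both sides of every iteration take mmap_lock for write. A minimal standalone approximation is sketched below; the real harness, region size and thread count differ (see the test-url above for the actual source), so the values here are assumptions:

/* rough standalone approximation of will-it-scale's mmap1 per-thread loop */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MEMSIZE    (128UL * 1024 * 1024)  /* assumed mapping size */
#define NR_LOOPS   100000
#define NR_THREADS 4                      /* stand-in for nr_task=50% */

static void *testcase(void *arg)
{
	unsigned long *ops = arg;

	for (unsigned long i = 0; i < NR_LOOPS; i++) {
		/* each iteration takes mmap_lock for write twice:
		 * once in mmap() and once in munmap() */
		void *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (c == MAP_FAILED)
			exit(1);
		munmap(c, MEMSIZE);
		(*ops)++;
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	unsigned long ops[NR_THREADS] = { 0 };

	for (int i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, testcase, &ops[i]);
	for (int i = 0; i < NR_THREADS; i++) {
		pthread_join(tid[i], NULL);
		printf("thread %d: %lu ops\n", i, ops[i]);
	}
	return 0;
}

This is why the rwsem write paths (down_write_killable via vm_mmap_pgoff and __vm_munmap) dominate the perf profiles below.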

In addition to that, the commit also has significant impact on the following tests:

+------------------+---------------------------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |
| test machine | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory |
| test parameters | cpufreq_governor=performance |
| | mode=thread |
| | nr_task=50% |
| | test=mmap1 |
| | ucode=0x5003006 |
+------------------+---------------------------------------------------------------------------------+


If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
bin/lkp run generated-yaml-file

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp1/mmap1/will-it-scale/0xb000280

commit:
c28b1fc703 ("mm/gup: reorganize internal_get_user_pages_fast()")
57efa1fe59 ("mm/gup: prevent gup_fast from racing with COW during fork")

c28b1fc70390df32 57efa1fe5957694fa541c9062de
---------------- ---------------------------
%stddev %change %stddev
\ | \
342141 -9.2% 310805 ? 2% will-it-scale.48.threads
7127 -9.2% 6474 ? 2% will-it-scale.per_thread_ops
342141 -9.2% 310805 ? 2% will-it-scale.workload
2555927 ? 3% +45.8% 3727702 meminfo.Committed_AS
12108 ? 13% -36.7% 7665 ? 7% vmstat.system.cs
1142492 ? 30% -47.3% 602364 ? 11% cpuidle.C1.usage
282373 ? 13% -45.6% 153684 ? 7% cpuidle.POLL.usage
48437 ? 3% -5.9% 45563 proc-vmstat.nr_active_anon
54617 ? 3% -5.5% 51602 proc-vmstat.nr_shmem
48437 ? 3% -5.9% 45563 proc-vmstat.nr_zone_active_anon
70511 ? 3% -5.1% 66942 ? 2% proc-vmstat.pgactivate
278653 ? 8% +23.4% 343904 ? 4% sched_debug.cpu.avg_idle.stddev
22572 ? 16% -36.3% 14378 ? 4% sched_debug.cpu.nr_switches.avg
66177 ? 16% -36.8% 41800 ? 21% sched_debug.cpu.nr_switches.max
11613 ? 15% -41.4% 6810 ? 23% sched_debug.cpu.nr_switches.stddev
22.96 ? 15% +55.6% 35.73 ? 12% perf-sched.total_wait_and_delay.average.ms
69713 ? 19% -38.0% 43235 ? 12% perf-sched.total_wait_and_delay.count.ms
22.95 ? 15% +55.6% 35.72 ? 12% perf-sched.total_wait_time.average.ms
29397 ? 23% -35.3% 19030 ? 17% perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
31964 ? 20% -50.8% 15738 ? 14% perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
4.59 ? 6% +12.2% 5.15 ? 4% perf-stat.i.MPKI
3.105e+09 -2.1% 3.04e+09 perf-stat.i.branch-instructions
12033 ? 12% -36.8% 7600 ? 7% perf-stat.i.context-switches
10.06 +1.9% 10.25 perf-stat.i.cpi
4.067e+09 -1.3% 4.016e+09 perf-stat.i.dTLB-loads
4.521e+08 -5.1% 4.291e+08 ? 2% perf-stat.i.dTLB-stores
1.522e+10 -1.6% 1.497e+10 perf-stat.i.instructions
0.10 -1.9% 0.10 perf-stat.i.ipc
0.19 ? 8% -22.8% 0.15 ? 5% perf-stat.i.metric.K/sec
80.30 -1.7% 78.93 perf-stat.i.metric.M/sec
167270 ? 6% -14.9% 142312 ? 11% perf-stat.i.node-loads
49.76 -1.6 48.11 perf-stat.i.node-store-miss-rate%
3945152 +6.2% 4189006 perf-stat.i.node-stores
4.59 ? 6% +12.1% 5.15 ? 4% perf-stat.overall.MPKI
10.04 +1.8% 10.23 perf-stat.overall.cpi
0.10 -1.8% 0.10 perf-stat.overall.ipc
49.76 -1.6 48.12 perf-stat.overall.node-store-miss-rate%
13400506 +8.2% 14504566 perf-stat.overall.path-length
3.094e+09 -2.1% 3.03e+09 perf-stat.ps.branch-instructions
12087 ? 13% -36.9% 7622 ? 7% perf-stat.ps.context-switches
4.054e+09 -1.3% 4.002e+09 perf-stat.ps.dTLB-loads
4.508e+08 -5.1% 4.278e+08 ? 2% perf-stat.ps.dTLB-stores
1.516e+10 -1.6% 1.492e+10 perf-stat.ps.instructions
3932404 +6.2% 4175831 perf-stat.ps.node-stores
4.584e+12 -1.7% 4.506e+12 perf-stat.total.instructions
364038 ? 6% -40.3% 217265 ? 9% interrupts.CAL:Function_call_interrupts
5382 ? 33% -63.4% 1970 ? 35% interrupts.CPU44.CAL:Function_call_interrupts
6325 ? 19% -58.1% 2650 ? 37% interrupts.CPU47.CAL:Function_call_interrupts
11699 ? 19% -60.6% 4610 ? 23% interrupts.CPU48.CAL:Function_call_interrupts
94.20 ? 22% -45.8% 51.09 ? 46% interrupts.CPU48.TLB:TLB_shootdowns
9223 ? 24% -52.5% 4383 ? 28% interrupts.CPU49.CAL:Function_call_interrupts
9507 ? 24% -57.5% 4040 ? 27% interrupts.CPU50.CAL:Function_call_interrupts
4530 ? 18% -33.9% 2993 ? 28% interrupts.CPU62.CAL:Function_call_interrupts
82.00 ? 21% -41.9% 47.64 ? 38% interrupts.CPU63.TLB:TLB_shootdowns
4167 ? 16% -45.4% 2276 ? 22% interrupts.CPU64.CAL:Function_call_interrupts
135.20 ? 31% -58.4% 56.27 ? 51% interrupts.CPU64.TLB:TLB_shootdowns
4155 ? 17% -42.5% 2387 ? 27% interrupts.CPU65.CAL:Function_call_interrupts
95.00 ? 48% -53.8% 43.91 ? 42% interrupts.CPU65.TLB:TLB_shootdowns
4122 ? 20% -39.4% 2497 ? 29% interrupts.CPU66.CAL:Function_call_interrupts
3954 ? 14% -41.4% 2318 ? 28% interrupts.CPU67.CAL:Function_call_interrupts
3802 ? 17% -41.9% 2209 ? 17% interrupts.CPU70.CAL:Function_call_interrupts
3787 ? 11% -48.2% 1961 ? 29% interrupts.CPU71.CAL:Function_call_interrupts
3580 ? 14% -45.1% 1964 ? 19% interrupts.CPU72.CAL:Function_call_interrupts
3711 ? 20% -51.3% 1806 ? 25% interrupts.CPU73.CAL:Function_call_interrupts
3494 ? 21% -40.6% 2076 ? 21% interrupts.CPU76.CAL:Function_call_interrupts
3416 ? 21% -45.2% 1873 ? 26% interrupts.CPU77.CAL:Function_call_interrupts
3047 ? 24% -38.0% 1890 ? 18% interrupts.CPU78.CAL:Function_call_interrupts
3102 ? 28% -41.8% 1805 ? 16% interrupts.CPU80.CAL:Function_call_interrupts
2811 ? 23% -36.5% 1785 ? 22% interrupts.CPU83.CAL:Function_call_interrupts
2617 ? 17% -30.7% 1814 ? 30% interrupts.CPU84.CAL:Function_call_interrupts
3322 ? 25% -38.1% 2055 ? 29% interrupts.CPU87.CAL:Function_call_interrupts
2941 ? 12% -39.2% 1787 ? 27% interrupts.CPU93.CAL:Function_call_interrupts
72.56 -19.7 52.82 perf-profile.calltrace.cycles-pp.__mmap
72.52 -19.7 52.78 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__mmap
72.48 -19.7 52.74 perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
72.49 -19.7 52.76 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
72.47 -19.7 52.74 perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
71.74 -19.7 52.04 perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
71.63 -19.7 51.95 perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
71.52 -19.6 51.88 perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
70.12 -19.2 50.92 perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
0.91 ? 2% -0.2 0.70 perf-profile.calltrace.cycles-pp.rwsem_spin_on_owner.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
0.87 ? 2% +0.1 0.95 ? 2% perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +0.6 0.63 ? 2% perf-profile.calltrace.cycles-pp.rwsem_spin_on_owner.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
24.24 ? 3% +19.4 43.62 perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
24.72 ? 3% +19.8 44.47 perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
24.87 ? 3% +19.8 44.62 perf-profile.calltrace.cycles-pp.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
24.78 ? 3% +19.8 44.54 perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64
25.94 ? 3% +19.8 45.73 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
25.97 ? 3% +19.8 45.77 perf-profile.calltrace.cycles-pp.__munmap
25.90 ? 3% +19.8 45.70 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
25.88 ? 3% +19.8 45.68 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
25.87 ? 3% +19.8 45.67 perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
72.57 -19.7 52.83 perf-profile.children.cycles-pp.__mmap
72.48 -19.7 52.74 perf-profile.children.cycles-pp.ksys_mmap_pgoff
72.48 -19.7 52.74 perf-profile.children.cycles-pp.vm_mmap_pgoff
0.22 ? 5% -0.1 0.14 ? 6% perf-profile.children.cycles-pp.unmap_region
0.08 ? 23% -0.0 0.04 ? 61% perf-profile.children.cycles-pp.__schedule
0.06 ? 7% -0.0 0.03 ? 75% perf-profile.children.cycles-pp.perf_event_mmap
0.12 ? 4% -0.0 0.09 ? 5% perf-profile.children.cycles-pp.up_write
0.09 ? 7% -0.0 0.06 ? 16% perf-profile.children.cycles-pp.unmap_vmas
0.10 ? 4% -0.0 0.08 ? 3% perf-profile.children.cycles-pp.up_read
0.18 ? 2% +0.0 0.20 ? 3% perf-profile.children.cycles-pp.vm_area_dup
0.18 ? 5% +0.0 0.21 ? 2% perf-profile.children.cycles-pp.vma_merge
0.12 ? 4% +0.0 0.14 ? 4% perf-profile.children.cycles-pp.kmem_cache_free
0.19 ? 6% +0.0 0.23 ? 2% perf-profile.children.cycles-pp.get_unmapped_area
0.16 ? 6% +0.0 0.20 ? 2% perf-profile.children.cycles-pp.vm_unmapped_area
0.17 ? 6% +0.0 0.21 ? 2% perf-profile.children.cycles-pp.arch_get_unmapped_area_topdown
0.07 ? 10% +0.1 0.14 ? 16% perf-profile.children.cycles-pp.find_vma
0.27 ? 4% +0.1 0.35 ? 2% perf-profile.children.cycles-pp.__vma_adjust
0.35 ? 2% +0.1 0.43 ? 3% perf-profile.children.cycles-pp.__split_vma
0.87 ? 2% +0.1 0.95 ? 2% perf-profile.children.cycles-pp.__do_munmap
1.23 +0.1 1.33 perf-profile.children.cycles-pp.rwsem_spin_on_owner
25.98 ? 3% +19.8 45.78 perf-profile.children.cycles-pp.__munmap
25.87 ? 3% +19.8 45.68 perf-profile.children.cycles-pp.__vm_munmap
25.88 ? 3% +19.8 45.68 perf-profile.children.cycles-pp.__x64_sys_munmap
0.53 ? 2% -0.2 0.35 ? 3% perf-profile.self.cycles-pp.rwsem_optimistic_spin
0.08 ? 5% -0.1 0.03 ? 75% perf-profile.self.cycles-pp.do_mmap
0.11 ? 6% -0.0 0.09 ? 5% perf-profile.self.cycles-pp.up_write
0.19 ? 4% -0.0 0.16 ? 5% perf-profile.self.cycles-pp.down_write_killable
0.05 ? 8% +0.0 0.07 ? 8% perf-profile.self.cycles-pp.downgrade_write
0.11 ? 4% +0.0 0.14 ? 4% perf-profile.self.cycles-pp.__vma_adjust
0.16 ? 6% +0.0 0.20 ? 3% perf-profile.self.cycles-pp.vm_unmapped_area
0.05 ? 9% +0.0 0.10 ? 13% perf-profile.self.cycles-pp.find_vma
1.21 +0.1 1.31 perf-profile.self.cycles-pp.rwsem_spin_on_owner



will-it-scale.per_thread_ops

7400 +--------------------------------------------------------------------+
| + |
7200 |.++. .+. +.+ .++. : :.+ .+ +. |
| ++.+.++ ++ + +.+.++.+ ++.+.++.++. : + + : : + |
7000 |-+ + + : + :: |
| + + .+ : + |
6800 |-+ + + |
| O |
6600 |-+ O O O O O O OO OO O |
| OO O O O OO O O O O O O O O O |
6400 |-+ O O O O O O OO O |
| O O O |
6200 |-+ O |
| O |
6000 +--------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample

***************************************************************************************************
lkp-csl-2sp9: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2sp9/mmap1/will-it-scale/0x5003006

commit:
c28b1fc703 ("mm/gup: reorganize internal_get_user_pages_fast()")
57efa1fe59 ("mm/gup: prevent gup_fast from racing with COW during fork")

c28b1fc70390df32 57efa1fe5957694fa541c9062de
---------------- ---------------------------
%stddev %change %stddev
\ | \
247840 +3.7% 257132 ? 2% will-it-scale.44.threads
5632 +3.7% 5843 ? 2% will-it-scale.per_thread_ops
247840 +3.7% 257132 ? 2% will-it-scale.workload
0.10 ? 5% +0.0 0.13 ? 8% perf-profile.children.cycles-pp.find_vma
14925 ? 19% -48.2% 7724 ? 8% softirqs.CPU87.SCHED
9950 ? 3% -36.1% 6355 ? 2% vmstat.system.cs
3312916 ? 4% +13.9% 3774536 ? 9% cpuidle.C1.time
1675504 ? 5% -36.6% 1061625 cpuidle.POLL.time
987055 ? 5% -41.8% 574757 ? 2% cpuidle.POLL.usage
165545 ? 3% -12.2% 145358 ? 4% meminfo.Active
165235 ? 3% -12.1% 145188 ? 4% meminfo.Active(anon)
180757 ? 3% -11.7% 159538 ? 3% meminfo.Shmem
2877001 ? 11% +16.2% 3342948 ? 10% sched_debug.cfs_rq:/.min_vruntime.avg
5545708 ? 11% +9.8% 6086941 ? 8% sched_debug.cfs_rq:/.min_vruntime.max
2773178 ? 11% +15.4% 3199941 ? 9% sched_debug.cfs_rq:/.spread0.avg
733740 ? 3% -12.0% 646033 ? 5% sched_debug.cpu.avg_idle.avg
17167 ? 10% -28.2% 12332 ? 7% sched_debug.cpu.nr_switches.avg
49180 ? 14% -33.5% 32687 ? 22% sched_debug.cpu.nr_switches.max
9311 ? 18% -36.2% 5943 ? 22% sched_debug.cpu.nr_switches.stddev
41257 ? 3% -12.1% 36252 ? 4% proc-vmstat.nr_active_anon
339681 -1.6% 334294 proc-vmstat.nr_file_pages
10395 -3.5% 10036 proc-vmstat.nr_mapped
45130 ? 3% -11.7% 39848 ? 3% proc-vmstat.nr_shmem
41257 ? 3% -12.1% 36252 ? 4% proc-vmstat.nr_zone_active_anon
841530 -1.7% 826917 proc-vmstat.numa_local
21515 ? 11% -68.9% 6684 ? 70% proc-vmstat.numa_pages_migrated
60224 ? 3% -11.1% 53513 ? 3% proc-vmstat.pgactivate
981265 -2.5% 956415 proc-vmstat.pgalloc_normal
895893 -1.9% 878978 proc-vmstat.pgfree
21515 ? 11% -68.9% 6684 ? 70% proc-vmstat.pgmigrate_success
0.07 ?135% -74.1% 0.02 ? 5% perf-sched.sch_delay.max.ms.preempt_schedule_common._cond_resched.stop_one_cpu.__set_cpus_allowed_ptr.sched_setaffinity
21.44 ? 5% +80.9% 38.79 ? 3% perf-sched.total_wait_and_delay.average.ms
67273 ? 6% -44.9% 37095 ? 5% perf-sched.total_wait_and_delay.count.ms
21.44 ? 5% +80.9% 38.79 ? 3% perf-sched.total_wait_time.average.ms
0.08 ? 14% +60.1% 0.13 ? 9% perf-sched.wait_and_delay.avg.ms.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
0.09 ? 12% +58.0% 0.15 ? 15% perf-sched.wait_and_delay.avg.ms.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
255.38 ? 14% +22.1% 311.71 ? 17% perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
31877 ? 10% -54.2% 14606 ? 13% perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
27110 ? 7% -47.3% 14280 ? 4% perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
138.60 ? 13% -21.4% 109.00 ? 15% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
1.00 ?199% -99.9% 0.00 ?200% perf-sched.wait_time.avg.ms.preempt_schedule_common._cond_resched.remove_vma.__do_munmap.__vm_munmap
0.08 ? 14% +60.9% 0.13 ? 9% perf-sched.wait_time.avg.ms.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
0.09 ? 12% +58.2% 0.15 ? 15% perf-sched.wait_time.avg.ms.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
255.38 ? 14% +22.1% 311.71 ? 17% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
1.00 ?199% -99.9% 0.00 ?200% perf-sched.wait_time.max.ms.preempt_schedule_common._cond_resched.remove_vma.__do_munmap.__vm_munmap
4.99 -36.1% 3.19 ? 36% perf-sched.wait_time.max.ms.rcu_gp_kthread.kthread.ret_from_fork
9869 ? 3% -36.2% 6295 ? 2% perf-stat.i.context-switches
0.00 ? 7% +0.0 0.00 ? 29% perf-stat.i.dTLB-load-miss-rate%
76953 ? 7% +327.4% 328871 ? 29% perf-stat.i.dTLB-load-misses
4152320 -3.0% 4026365 perf-stat.i.iTLB-load-misses
1665297 -2.2% 1628746 perf-stat.i.iTLB-loads
8627 +3.5% 8933 perf-stat.i.instructions-per-iTLB-miss
0.33 ? 3% -11.0% 0.29 ? 6% perf-stat.i.metric.K/sec
87.42 +1.7 89.11 perf-stat.i.node-load-miss-rate%
7507752 -9.2% 6814138 perf-stat.i.node-load-misses
1078418 ? 2% -22.9% 831563 ? 3% perf-stat.i.node-loads
3091445 -8.2% 2838247 perf-stat.i.node-store-misses
0.00 ? 7% +0.0 0.00 ? 29% perf-stat.overall.dTLB-load-miss-rate%
8599 +3.6% 8907 perf-stat.overall.instructions-per-iTLB-miss
87.44 +1.7 89.13 perf-stat.overall.node-load-miss-rate%
43415811 -3.3% 41994695 ? 2% perf-stat.overall.path-length
9895 ? 3% -36.4% 6291 ? 2% perf-stat.ps.context-switches
76756 ? 7% +327.0% 327716 ? 29% perf-stat.ps.dTLB-load-misses
4138410 -3.0% 4012712 perf-stat.ps.iTLB-load-misses
1659653 -2.2% 1623167 perf-stat.ps.iTLB-loads
7483002 -9.2% 6791226 perf-stat.ps.node-load-misses
1074856 ? 2% -22.9% 828780 ? 3% perf-stat.ps.node-loads
3081222 -8.2% 2828732 perf-stat.ps.node-store-misses
335021 ? 2% -27.9% 241715 ? 12% interrupts.CAL:Function_call_interrupts
3662 ? 31% -61.3% 1417 ? 16% interrupts.CPU10.CAL:Function_call_interrupts
4671 ? 32% -65.6% 1607 ? 30% interrupts.CPU12.CAL:Function_call_interrupts
4999 ? 34% -68.1% 1592 ? 43% interrupts.CPU14.CAL:Function_call_interrupts
129.00 ? 30% -46.8% 68.60 ? 34% interrupts.CPU14.RES:Rescheduling_interrupts
4531 ? 49% -58.5% 1881 ? 39% interrupts.CPU15.CAL:Function_call_interrupts
4639 ? 28% -37.6% 2893 ? 2% interrupts.CPU18.NMI:Non-maskable_interrupts
4639 ? 28% -37.6% 2893 ? 2% interrupts.CPU18.PMI:Performance_monitoring_interrupts
6310 ? 49% -68.5% 1988 ? 57% interrupts.CPU21.CAL:Function_call_interrupts
149.40 ? 49% -49.3% 75.80 ? 42% interrupts.CPU21.RES:Rescheduling_interrupts
3592 ? 38% -63.0% 1330 ? 14% interrupts.CPU24.CAL:Function_call_interrupts
5350 ? 21% -30.5% 3720 ? 44% interrupts.CPU24.NMI:Non-maskable_interrupts
5350 ? 21% -30.5% 3720 ? 44% interrupts.CPU24.PMI:Performance_monitoring_interrupts
139.00 ? 27% -33.4% 92.60 ? 26% interrupts.CPU24.RES:Rescheduling_interrupts
3858 ? 42% -53.7% 1785 ? 38% interrupts.CPU26.CAL:Function_call_interrupts
5964 ? 28% -42.4% 3432 ? 55% interrupts.CPU27.NMI:Non-maskable_interrupts
5964 ? 28% -42.4% 3432 ? 55% interrupts.CPU27.PMI:Performance_monitoring_interrupts
3429 ? 37% -57.1% 1470 ? 44% interrupts.CPU28.CAL:Function_call_interrupts
3008 ? 35% -37.6% 1877 ? 38% interrupts.CPU29.CAL:Function_call_interrupts
4684 ? 73% -60.0% 1872 ? 34% interrupts.CPU30.CAL:Function_call_interrupts
4300 ? 46% -54.7% 1949 ? 13% interrupts.CPU43.CAL:Function_call_interrupts
10255 ? 26% -50.0% 5127 ? 29% interrupts.CPU44.CAL:Function_call_interrupts
5800 ? 20% -28.3% 4158 ? 27% interrupts.CPU52.CAL:Function_call_interrupts
4802 ? 19% -31.7% 3279 ? 18% interrupts.CPU58.CAL:Function_call_interrupts
4042 ? 32% -65.6% 1391 ? 41% interrupts.CPU6.CAL:Function_call_interrupts
128.60 ? 31% -52.9% 60.60 ? 38% interrupts.CPU6.RES:Rescheduling_interrupts
4065 ? 20% -37.8% 2530 ? 6% interrupts.CPU63.CAL:Function_call_interrupts
4340 ? 24% -36.2% 2771 ? 11% interrupts.CPU64.CAL:Function_call_interrupts
3983 ? 11% -27.1% 2904 ? 19% interrupts.CPU65.CAL:Function_call_interrupts
3392 ? 25% -55.2% 1518 ? 53% interrupts.CPU7.CAL:Function_call_interrupts
171.80 ? 67% -62.5% 64.40 ? 32% interrupts.CPU7.RES:Rescheduling_interrupts
2942 ? 33% -50.5% 1455 ? 25% interrupts.CPU8.CAL:Function_call_interrupts
7818 -27.3% 5681 ? 31% interrupts.CPU85.NMI:Non-maskable_interrupts
7818 -27.3% 5681 ? 31% interrupts.CPU85.PMI:Performance_monitoring_interrupts
320.80 ? 54% -44.6% 177.80 ? 58% interrupts.CPU87.TLB:TLB_shootdowns
3212 ? 31% -64.8% 1130 ? 36% interrupts.CPU9.CAL:Function_call_interrupts





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (26.93 kB)
job-script (7.92 kB)
job.yaml (5.14 kB)
reproduce (346.00 B)

2021-05-25 03:55:18

by Linus Torvalds

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Mon, May 24, 2021 at 5:00 PM kernel test robot <[email protected]> wrote:
>
> FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")

Hmm. This looks like one of those "random fluctuations" things.

It would be good to hear if other test-cases also bisect to the same
thing, but this report already says:

> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+---------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |

which does kind of reinforce that "this benchmark gives unstable numbers".

The perf data doesn't even mention any of the GUP paths, and on the
pure fork path the biggest impact would be:

(a) maybe "struct mm_struct" changed in size or had a different cache layout

(b) two added (nonatomic) increment operations in the fork path due
to the seqcount

and I'm not seeing what would cause that 9% change. Obviously cache
placement has done it before.
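
(For reference on point (b): what 57efa1fe5957 adds is roughly the following seqcount pattern. This is a simplified sketch, not the exact upstream hunks; the function names used are real kernel APIs, but the surrounding code is abbreviated.)

/* fork side (copy_page_range()): the begin/end pair is the two
 * non-atomic seqcount increments referred to above; only taken for
 * COW mappings, with mmap_lock already held for write. */
static void fork_side_sketch(struct mm_struct *src_mm)
{
	mmap_assert_write_locked(src_mm);
	raw_write_seqcount_begin(&src_mm->write_protect_seq);
	/* ... copy page tables, write-protecting COW pages ... */
	raw_write_seqcount_end(&src_mm->write_protect_seq);
}

/* gup_fast side: sample the seqcount before the lockless walk and
 * bail out if a fork raced with the pinning. */
static int gup_fast_side_sketch(struct mm_struct *mm)
{
	unsigned int seq = raw_read_seqcount(&mm->write_protect_seq);

	if (seq & 1)		/* fork in progress, take the slow path */
		return -EAGAIN;
	/* ... walk page tables without locks, pin pages ... */
	if (read_seqcount_retry(&mm->write_protect_seq, seq))
		return -EAGAIN;	/* raced with fork: unpin and retry */
	return 0;
}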

If somebody else sees something that I'm missing, please holler. But
I'll ignore this as "noise" otherwise.

Linus

2021-06-04 07:05:34

by Feng Tang

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

Hi Linus,

Sorry for the late response.

On Mon, May 24, 2021 at 05:11:37PM -1000, Linus Torvalds wrote:
> On Mon, May 24, 2021 at 5:00 PM kernel test robot <[email protected]> wrote:
> >
> > FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> > commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
>
> Hmm. This looks like one of those "random fluctuations" things.
>
> It would be good to hear if other test-cases also bisect to the same
> thing, but this report already says:
>
> > In addition to that, the commit also has significant impact on the following tests:
> >
> > +------------------+---------------------------------------------------------------------------------+
> > | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |
>
> which does kind of reinforce that "this benchmark gives unstable numbers".
>
> The perf data doesn't even mention any of the GUP paths, and on the
> pure fork path the biggest impact would be:
>
> (a) maybe "struct mm_struct" changed in size or had a different cache layout

Yes, this seems to be the cause of the regression.

In the test case, many threads are doing map/unmap at the same time,
so the process's rw_semaphore 'mmap_lock' is highly contended.

Before the patch (with 0day's kconfig), the mmap_lock is split across
2 cachelines: the 'count' is in one line, and the other members
sit in the next line, which luckily avoids some cache bouncing. After
the patch, the 'mmap_lock' is pushed into one cacheline, which may
cause the regression.

Below is the pahole info:

- before the patch

spinlock_t page_table_lock; /* 116 4 */
struct rw_semaphore mmap_lock; /* 120 40 */
/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
struct list_head mmlist; /* 160 16 */
long unsigned int hiwater_rss; /* 176 8 */

- after the patch

spinlock_t page_table_lock; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
struct rw_semaphore mmap_lock; /* 128 40 */
struct list_head mmlist; /* 168 16 */
long unsigned int hiwater_rss; /* 184 8 */

perf c2c log can also confirm this.
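
(To make the split explicit with the offsets above, assuming 64-byte
cachelines: before the patch 'count' occupies bytes 120-127, the tail
of one line, and 'owner' starts the next line at byte 128; after the
patch both words sit in the line starting at 128. A trivial userspace
check of that arithmetic, with the offsets hard-coded from the pahole
output above, which are kconfig-dependent:)

#include <stdio.h>

int main(void)
{
	const unsigned cl = 64;               /* assumed cacheline size */
	unsigned before = 120, after = 128;   /* offset of mmap_lock    */

	printf("before: count in line %u, owner in line %u\n",
	       before / cl, (before + 8) / cl);   /* -> 1 and 2 */
	printf("after:  count in line %u, owner in line %u\n",
	       after / cl, (after + 8) / cl);     /* -> 2 and 2 */
	return 0;
}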

Thanks,
Feng

> (b) two added (nonatomic) increment operations in the fork path due
> to the seqcount
>
> and I'm not seeing what would cause that 9% change. Obviously cache
> placement has done it before.
>
> If somebody else sees something that I'm missing, please holler. But
> I'll ignore this as "noise" otherwise.
>
> Linus

2021-06-04 07:54:59

by Feng Tang

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> Hi Linus,
>
> Sorry for the late response.
>
> On Mon, May 24, 2021 at 05:11:37PM -1000, Linus Torvalds wrote:
> > On Mon, May 24, 2021 at 5:00 PM kernel test robot <[email protected]> wrote:
> > >
> > > FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> > > commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
> >
> > Hmm. This looks like one of those "random fluctuations" things.
> >
> > It would be good to hear if other test-cases also bisect to the same
> > thing, but this report already says:
> >
> > > In addition to that, the commit also has significant impact on the following tests:
> > >
> > > +------------------+---------------------------------------------------------------------------------+
> > > | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |
> >
> > which does kind of reinforce that "this benchmark gives unstable numbers".
> >
> > The perf data doesn't even mention any of the GUP paths, and on the
> > pure fork path the biggest impact would be:
> >
> > (a) maybe "struct mm_struct" changed in size or had a different cache layout
>
> Yes, this seems to be the cause of the regression.
>
> The test case is many thread are doing map/unmap at the same time,
> so the process's rw_semaphore 'mmap_lock' is highly contended.
>
> Before the patch (with 0day's kconfig), the mmap_lock is separated
> into 2 cachelines, the 'count' is in one line, and the other members
> sit in the next line, so it luckily avoid some cache bouncing. After
> the patch, the 'mmap_lock' is pushed into one cacheline, which may
> cause the regression.
>
> Below is the pahole info:
>
> - before the patch
>
> spinlock_t page_table_lock; /* 116 4 */
> struct rw_semaphore mmap_lock; /* 120 40 */
> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> struct list_head mmlist; /* 160 16 */
> long unsigned int hiwater_rss; /* 176 8 */
>
> - after the patch
>
> spinlock_t page_table_lock; /* 124 4 */
> /* --- cacheline 2 boundary (128 bytes) --- */
> struct rw_semaphore mmap_lock; /* 128 40 */
> struct list_head mmlist; /* 168 16 */
> long unsigned int hiwater_rss; /* 184 8 */
>
> perf c2c log can also confirm this.

We've tried a patch which can resolve the regression. As the
newly added member 'write_protect_seq' is 4 bytes long, putting
it into an existing 4-byte hole can recover the lost performance,
while not affecting the alignment of most other members. Please
review the following patch, thanks!

- Feng

From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
From: Feng Tang <[email protected]>
Date: Fri, 4 Jun 2021 15:20:57 +0800
Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct

Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
racing with COW during fork"), on 64-bit systems the hot member
rw_semaphore 'mmap_lock' of 'mm_struct' could be split across
2 cachelines: its member 'count' sits in one cacheline while
all the other members sit in the next cacheline. This naturally
reduces some cache bouncing, and with the commit, the 'mmap_lock'
is pushed into one cacheline, as shown in the pahole info:

- before the commit

spinlock_t page_table_lock; /* 116 4 */
struct rw_semaphore mmap_lock; /* 120 40 */
/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
struct list_head mmlist; /* 160 16 */
long unsigned int hiwater_rss; /* 176 8 */

- after the commit

spinlock_t page_table_lock; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
struct rw_semaphore mmap_lock; /* 128 40 */
struct list_head mmlist; /* 168 16 */
long unsigned int hiwater_rss; /* 184 8 */

and it causes a 9.2% regression for the 'mmap1' case of the
will-it-scale benchmark[1], as in this case 'mmap_lock' is highly
contended (occupying 90%+ of cpu cycles).

Re-laying out a structure can be a double-edged sword, as it may
help some cases but hurt others. Since the newly added 'seqcount_t'
is 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n), one solution is to
place it into an existing 4-byte hole in 'mm_struct'; this will not
affect the alignment of most other members, while recovering the
lost performance.

[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
include/linux/mm_types.h | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c..5b55f88 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,13 +445,6 @@ struct mm_struct {
*/
atomic_t has_pinned;

- /**
- * @write_protect_seq: Locked when any thread is write
- * protecting pages mapped by this mm to enforce a later COW,
- * for instance during page table copying for fork().
- */
- seqcount_t write_protect_seq;
-
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* PTE page table pages */
#endif
@@ -480,7 +473,15 @@ struct mm_struct {
unsigned long stack_vm; /* VM_STACK */
unsigned long def_flags;

+ /**
+ * @write_protect_seq: Locked when any thread is write
+ * protecting pages mapped by this mm to enforce a later COW,
+ * for instance during page table copying for fork().
+ */
+ seqcount_t write_protect_seq;
+
spinlock_t arg_lock; /* protect the below fields */
+
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
--
2.7.4



2021-06-04 08:41:42

by Xing Zhengjun

Subject: Re: [LKP] Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

Hi Linus,

On 5/25/2021 11:11 AM, Linus Torvalds wrote:
> On Mon, May 24, 2021 at 5:00 PM kernel test robot <[email protected]> wrote:
>> FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
>> commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
> Hmm. This looks like one of those "random fluctuations" things.
>
> It would be good to hear if other test-cases also bisect to the same
> thing, but this report already says:
>
>> In addition to that, the commit also has significant impact on the following tests:
>>
>> +------------------+---------------------------------------------------------------------------------+
>> | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |
> which does kind of reinforce that "this benchmark gives unstable numbers".
>
> The perf data doesn't even mention any of the GUP paths, and on the
> pure fork path the biggest impact would be:
>
> (a) maybe "struct mm_struct" changed in size or had a different cache layout
I moved "write_protect_seq" to the tail of "struct mm_struct", and the
regression was reduced to -3.6%. The regression should be related to the
cache layout.
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_governor/ucode:
lkp-icl-2sp1/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/50%/thread/mmap1/performance/0xb000280

commit:
  c28b1fc70390df32e29991eedd52bd86e7aba080
  57efa1fe5957694fa541c9062de0a127f0b9acb0
  f6a9c27882d51ff551e15522992d3725c342372d  (the test patch)

c28b1fc70390df32 57efa1fe5957694fa541c9062de f6a9c27882d51ff551e15522992
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
    341938            -9.0%     311218 ±  2%      -3.6%     329513        will-it-scale.48.threads
      7123            -9.0%       6483 ±  2%      -3.6%       6864        will-it-scale.per_thread_ops
    341938            -9.0%     311218 ±  2%      -3.6%     329513        will-it-scale.workload

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 915f4f100383..34bb2a01806c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -447,13 +447,6 @@ struct mm_struct {
 		 */
 		atomic_t has_pinned;
 
-		/**
-		 * @write_protect_seq: Locked when any thread is write
-		 * protecting pages mapped by this mm to enforce a later COW,
-		 * for instance during page table copying for fork().
-		 */
-		seqcount_t write_protect_seq;
-
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
@@ -564,6 +557,12 @@ struct mm_struct {
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
 #endif
+		/**
+		 * @write_protect_seq: Locked when any thread is write
+		 * protecting pages mapped by this mm to enforce a later COW,
+		 * for instance during page table copying for fork().
+		 */
+		seqcount_t write_protect_seq;
 	} __randomize_layout;
 
 	/*

>
> (b) two added (nonatomic) increment operations in the fork path due
> to the seqcount
>
> and I'm not seeing what would cause that 9% change. Obviously cache
> placement has done it before.
>
> If somebody else sees something that I'm missing, please holler. But
> I'll ignore this as "noise" otherwise.
>
> Linus
> _______________________________________________
> LKP mailing list -- [email protected]
> To unsubscribe send an email to [email protected]

--
Zhengjun Xing

2021-06-04 17:59:54

by Linus Torvalds

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Fri, Jun 4, 2021 at 12:52 AM Feng Tang <[email protected]> wrote:
>
> On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> > >
> > > The perf data doesn't even mention any of the GUP paths, and on the
> > > pure fork path the biggest impact would be:
> > >
> > > (a) maybe "struct mm_struct" changed in size or had a different cache layout
> >
> > Yes, this seems to be the cause of the regression.
> >
> > The test case is many thread are doing map/unmap at the same time,
> > so the process's rw_semaphore 'mmap_lock' is highly contended.
> >
> > Before the patch (with 0day's kconfig), the mmap_lock is separated
> > into 2 cachelines, the 'count' is in one line, and the other members
> > sit in the next line, so it luckily avoid some cache bouncing. After
> > the patch, the 'mmap_lock' is pushed into one cacheline, which may
> > cause the regression.

Ok, thanks for following up on this.

> We've tried some patch, which can restore the regerssion. As the
> newly added member 'write_protect_seq' is 4 bytes long, and putting
> it into an existing 4 bytes long hole can restore the regeression,
> while not affecting most of other member's alignment. Please review
> the following patch, thanks!

The patch looks fine to me.

At the same time, I do wonder if maybe it would be worth exploring if
it's a good idea to perhaps move the 'mmap_sem' thing instead.

Or at least add a big comment. It's not clear to me exactly _which_
other fields are the ones that are so hot that the contention on
mmap_sem then causes even more cacheline bouncing.

For example, is it either

(a) we *want* the mmap_sem to be in the first 128-byte region,
because then when we get the mmap_sem, the other fields in that same
cacheline are hot

OR

(b) we do *not* want mmap_sem to be in the *second* 128-byte region,
because there is something *else* in that region that is touched
independently of mmap_sem that is very very hot and now you get even
more bouncing?

but I can't tell which one it is.

It would be great to have a comment in the code - and in the commit
message - about exactly which fields are the critical ones. Because I
doubt it is 'write_protect_seq' itself that matters at all.

If it's "mmap_sem should be close to other commonly used fields",
maybe we should just move mmap_sem upwards in the structure?

Linus

2021-06-04 18:00:21

by John Hubbard

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On 6/4/21 12:52 AM, Feng Tang wrote:
...
>>> The perf data doesn't even mention any of the GUP paths, and on the
>>> pure fork path the biggest impact would be:
>>>
>>> (a) maybe "struct mm_struct" changed in size or had a different cache layout
>>
>> Yes, this seems to be the cause of the regression.
>>
>> The test case is many thread are doing map/unmap at the same time,
>> so the process's rw_semaphore 'mmap_lock' is highly contended.
>>
>> Before the patch (with 0day's kconfig), the mmap_lock is separated
>> into 2 cachelines, the 'count' is in one line, and the other members
>> sit in the next line, so it luckily avoid some cache bouncing. After

Wow! That's quite a fortunate layout to land on by accident. Almost
makes me wonder if mmap_lock should be designed to do that, but it's
probably even better to just keep working on having a less contended
mmap_lock.

I *suppose* it's worth trying to keep this fragile layout in place,
but it is a landmine for anyone who touches mm_struct. And the struct
is so large already that I'm not sure a comment warning would even
be noticed. Anyway...

>> the patch, the 'mmap_lock' is pushed into one cacheline, which may
>> cause the regression.
>>
>> Below is the pahole info:
>>
>> - before the patch
>>
>> spinlock_t page_table_lock; /* 116 4 */
>> struct rw_semaphore mmap_lock; /* 120 40 */
>> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
>> struct list_head mmlist; /* 160 16 */
>> long unsigned int hiwater_rss; /* 176 8 */
>>
>> - after the patch
>>
>> spinlock_t page_table_lock; /* 124 4 */
>> /* --- cacheline 2 boundary (128 bytes) --- */
>> struct rw_semaphore mmap_lock; /* 128 40 */
>> struct list_head mmlist; /* 168 16 */
>> long unsigned int hiwater_rss; /* 184 8 */
>>
>> perf c2c log can also confirm this.
>
> We've tried some patch, which can restore the regerssion. As the
> newly added member 'write_protect_seq' is 4 bytes long, and putting
> it into an existing 4 bytes long hole can restore the regeression,
> while not affecting most of other member's alignment. Please review
> the following patch, thanks!
>

So, this is a neat little solution, if we agree that it's worth "fixing".

I'm definitely on the fence, but leaning toward, "go for it", because
I like the "no cache effect" result of using up the hole.

Reviewed-by: John Hubbard <[email protected]>

thanks,
--
John Hubbard
NVIDIA

> - Feng
>
> From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
> From: Feng Tang <[email protected]>
> Date: Fri, 4 Jun 2021 15:20:57 +0800
> Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct
>
> Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
> racing with COW during fork), on 64bits system, the hot member
> rw_semaphore 'mmap_lock' of 'mm_struct' could be separated into
> 2 cachelines, that its member 'count' sits in one cacheline while
> all other members in next cacheline, this naturally reduces some
> cache bouncing, and with the commit, the 'mmap_lock' is pushed
> into one cacheline, as shown in the pahole info:
>
> - before the commit
>
> spinlock_t page_table_lock; /* 116 4 */
> struct rw_semaphore mmap_lock; /* 120 40 */
> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> struct list_head mmlist; /* 160 16 */
> long unsigned int hiwater_rss; /* 176 8 */
>
> - after the commit
>
> spinlock_t page_table_lock; /* 124 4 */
> /* --- cacheline 2 boundary (128 bytes) --- */
> struct rw_semaphore mmap_lock; /* 128 40 */
> struct list_head mmlist; /* 168 16 */
> long unsigned int hiwater_rss; /* 184 8 */
>
> and it causes one 9.2% regression for 'mmap1' case of will-it-scale
> benchmark[1], as in the case 'mmap_lock' is highly contented (occupies
> 90%+ cpu cycles).
>
> Though relayouting a structure could be a double-edged sword, as it
> helps some case, but may hurt other cases. So one solution is the
> newly added 'seqcount_t' is 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n),
> placing it into an existing 4 bytes hole in 'mm_struct' will not
> affect most of other members's alignment, while restoring the
> regression.
>
> [1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> include/linux/mm_types.h | 15 ++++++++-------
> 1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5aacc1c..5b55f88 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -445,13 +445,6 @@ struct mm_struct {
> */
> atomic_t has_pinned;
>
> - /**
> - * @write_protect_seq: Locked when any thread is write
> - * protecting pages mapped by this mm to enforce a later COW,
> - * for instance during page table copying for fork().
> - */
> - seqcount_t write_protect_seq;
> -
> #ifdef CONFIG_MMU
> atomic_long_t pgtables_bytes; /* PTE page table pages */
> #endif
> @@ -480,7 +473,15 @@ struct mm_struct {
> unsigned long stack_vm; /* VM_STACK */
> unsigned long def_flags;
>
> + /**
> + * @write_protect_seq: Locked when any thread is write
> + * protecting pages mapped by this mm to enforce a later COW,
> + * for instance during page table copying for fork().
> + */
> + seqcount_t write_protect_seq;
> +
> spinlock_t arg_lock; /* protect the below fields */
> +
> unsigned long start_code, end_code, start_data, end_data;
> unsigned long start_brk, brk, start_stack;
> unsigned long arg_start, arg_end, env_start, env_end;
>

2021-06-06 04:50:44

by Feng Tang

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Fri, Jun 04, 2021 at 10:58:14AM -0700, John Hubbard wrote:
> On 6/4/21 12:52 AM, Feng Tang wrote:
> ...
> >>>The perf data doesn't even mention any of the GUP paths, and on the
> >>>pure fork path the biggest impact would be:
> >>>
> >>> (a) maybe "struct mm_struct" changed in size or had a different cache layout
> >>
> >>Yes, this seems to be the cause of the regression.
> >>
> >>The test case is many thread are doing map/unmap at the same time,
> >>so the process's rw_semaphore 'mmap_lock' is highly contended.
> >>
> >>Before the patch (with 0day's kconfig), the mmap_lock is separated
> >>into 2 cachelines, the 'count' is in one line, and the other members
> >>sit in the next line, so it luckily avoid some cache bouncing. After
>
> Wow! That's quite a fortunate layout to land on by accident. Almost
> makes me wonder if mmap_lock should be designed to do that, but it's
> probably even better to just keep working on having a less contended
> mmap_lock.

Yes, manipulating cache alignment is always tricky and fragile: the
data structure keeps changing, it is affected by different kernel
config options, and different workloads will see different hot
fields of it.

Optimizing 'mmap_lock' is the better and ultimate solution.

> I *suppose* it's worth trying to keep this fragile layout in place,
> but it is a landmine for anyone who touches mm_struct. And the struct
> is so large already that I'm not sure a comment warning would even
> be noticed. Anyway...

Linus also mentioned that a clear comment is needed. Will collect more info.


> >>the patch, the 'mmap_lock' is pushed into one cacheline, which may
> >>cause the regression.
> >>
> >>Below is the pahole info:
> >>
> >>- before the patch
> >>
> >> spinlock_t page_table_lock; /* 116 4 */
> >> struct rw_semaphore mmap_lock; /* 120 40 */
> >> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> >> struct list_head mmlist; /* 160 16 */
> >> long unsigned int hiwater_rss; /* 176 8 */
> >>
> >>- after the patch
> >>
> >> spinlock_t page_table_lock; /* 124 4 */
> >> /* --- cacheline 2 boundary (128 bytes) --- */
> >> struct rw_semaphore mmap_lock; /* 128 40 */
> >> struct list_head mmlist; /* 168 16 */
> >> long unsigned int hiwater_rss; /* 184 8 */
> >>
> >>perf c2c log can also confirm this.
> >We've tried some patch, which can restore the regerssion. As the
> >newly added member 'write_protect_seq' is 4 bytes long, and putting
> >it into an existing 4 bytes long hole can restore the regeression,
> >while not affecting most of other member's alignment. Please review
> >the following patch, thanks!
> >
>
> So, this is a neat little solution, if we agree that it's worth "fixing".
>
> I'm definitely on the fence, but leaning toward, "go for it", because
> I like the "no cache effect" result of using up the hole.
>
> Reviewed-by: John Hubbard <[email protected]>

Thanks for the reviewing!

- Feng


> thanks,
> --
> John Hubbard
> NVIDIA
>
> >- Feng
> >
> > From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
> >From: Feng Tang <[email protected]>
> >Date: Fri, 4 Jun 2021 15:20:57 +0800
> >Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct
> >
> >Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
> >racing with COW during fork), on 64bits system, the hot member
> >rw_semaphore 'mmap_lock' of 'mm_struct' could be separated into
> >2 cachelines, that its member 'count' sits in one cacheline while
> >all other members in next cacheline, this naturally reduces some
> >cache bouncing, and with the commit, the 'mmap_lock' is pushed
> >into one cacheline, as shown in the pahole info:
> >
> > - before the commit
> >
> > spinlock_t page_table_lock; /* 116 4 */
> > struct rw_semaphore mmap_lock; /* 120 40 */
> > /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> > struct list_head mmlist; /* 160 16 */
> > long unsigned int hiwater_rss; /* 176 8 */
> >
> > - after the commit
> >
> > spinlock_t page_table_lock; /* 124 4 */
> > /* --- cacheline 2 boundary (128 bytes) --- */
> > struct rw_semaphore mmap_lock; /* 128 40 */
> > struct list_head mmlist; /* 168 16 */
> > long unsigned int hiwater_rss; /* 184 8 */
> >
> >and it causes one 9.2% regression for 'mmap1' case of will-it-scale
> >benchmark[1], as in the case 'mmap_lock' is highly contented (occupies
> >90%+ cpu cycles).
> >
> >Though relayouting a structure could be a double-edged sword, as it
> >helps some case, but may hurt other cases. So one solution is the
> >newly added 'seqcount_t' is 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n),
> >placing it into an existing 4 bytes hole in 'mm_struct' will not
> >affect most of other members's alignment, while restoring the
> >regression.
> >
> >[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
> >Reported-by: kernel test robot <[email protected]>
> >Signed-off-by: Feng Tang <[email protected]>
> >---
> > include/linux/mm_types.h | 15 ++++++++-------
> > 1 file changed, 8 insertions(+), 7 deletions(-)
> >
> >diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >index 5aacc1c..5b55f88 100644
> >--- a/include/linux/mm_types.h
> >+++ b/include/linux/mm_types.h
> >@@ -445,13 +445,6 @@ struct mm_struct {
> > */
> > atomic_t has_pinned;
> >- /**
> >- * @write_protect_seq: Locked when any thread is write
> >- * protecting pages mapped by this mm to enforce a later COW,
> >- * for instance during page table copying for fork().
> >- */
> >- seqcount_t write_protect_seq;
> >-
> > #ifdef CONFIG_MMU
> > atomic_long_t pgtables_bytes; /* PTE page table pages */
> > #endif
> >@@ -480,7 +473,15 @@ struct mm_struct {
> > unsigned long stack_vm; /* VM_STACK */
> > unsigned long def_flags;
> >+ /**
> >+ * @write_protect_seq: Locked when any thread is write
> >+ * protecting pages mapped by this mm to enforce a later COW,
> >+ * for instance during page table copying for fork().
> >+ */
> >+ seqcount_t write_protect_seq;
> >+
> > spinlock_t arg_lock; /* protect the below fields */
> >+
> > unsigned long start_code, end_code, start_data, end_data;
> > unsigned long start_brk, brk, start_stack;
> > unsigned long arg_start, arg_end, env_start, env_end;
> >

2021-06-06 10:48:05

by Feng Tang

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Fri, Jun 04, 2021 at 10:57:44AM -0700, Linus Torvalds wrote:
> On Fri, Jun 4, 2021 at 12:52 AM Feng Tang <[email protected]> wrote:
> >
> > On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> > > >
> > > > The perf data doesn't even mention any of the GUP paths, and on the
> > > > pure fork path the biggest impact would be:
> > > >
> > > > (a) maybe "struct mm_struct" changed in size or had a different cache layout
> > >
> > > Yes, this seems to be the cause of the regression.
> > >
> > > The test case is many thread are doing map/unmap at the same time,
> > > so the process's rw_semaphore 'mmap_lock' is highly contended.
> > >
> > > Before the patch (with 0day's kconfig), the mmap_lock is separated
> > > into 2 cachelines, the 'count' is in one line, and the other members
> > > sit in the next line, so it luckily avoid some cache bouncing. After
> > > the patch, the 'mmap_lock' is pushed into one cacheline, which may
> > > cause the regression.
>
> Ok, thanks for following up on this.
>
> > We've tried some patch, which can restore the regerssion. As the
> > newly added member 'write_protect_seq' is 4 bytes long, and putting
> > it into an existing 4 bytes long hole can restore the regeression,
> > while not affecting most of other member's alignment. Please review
> > the following patch, thanks!
>
> The patch looks fine to me.
>
> At the same time, I do wonder if maybe it would be worth exploring if
> it's a good idea to perhaps move the 'mmap_sem' thing instead.
>
> Or at least add a big comment. It's not clear to me exactly _which_
> other fields are the ones that are so hot that the contention on
> mmap_sem then causes even more cacheline bouncing.
>
> For example, is it either
>
> (a) we *want* the mmap_sem to be in the first 128-byte region,
> because then when we get the mmap_sem, the other fields in that same
> cacheline are hot
>
> OR
>
> (b) we do *not* want mmap_sem to be in the *second* 128-byte region,
> because there is something *else* in that region that is touched
> independently of mmap_sem that is very very hot and now you get even
> more bouncing?
>
> but I can't tell which one it is.

Yes, it's better to get more details on which fields are hottest;
the following are some perf data details. Let me know if more info
is needed.

* perf-stat: we see more cache-misses

32158577 ± 7% +9.0% 35060321 ± 6% perf-stat.ps.cache-misses
69612918 ± 6% +11.2% 77382336 ± 5% perf-stat.ps.cache-references


* perf profile: 'mmap_lock' is the hottest, though the mmap/munmap
ratio changes from 72:24 to 52:45, and this is the part that I don't
understand

- old kernel (without commit 57efa1fe59)

96.60% 0.19% [kernel.kallsyms] [k] down_write_killable - -
72.46% down_write_killable;vm_mmap_pgoff;ksys_mmap_pgoff;do_syscall_64;entry_SYSCALL_64_after_hwframe;__mmap
24.14% down_write_killable;__vm_munmap;__x64_sys_munmap;do_syscall_64;entry_SYSCALL_64_after_hwframe;__munmap

- new kernel

96.60% 0.16% [kernel.kallsyms] [k] down_write_killable - -
51.85% down_write_killable;vm_mmap_pgoff;ksys_mmap_pgoff;do_syscall_64;entry_SYSCALL_64_after_hwframe;__mmap
44.74% down_write_killable;__vm_munmap;__x64_sys_munmap;do_syscall_64;entry_SYSCALL_64_after_hwframe;__munmap


* perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
data structure change

- old kernel

- first cacheline
mmap_lock->count (75%)
mm->mapcount (14%)

- second cacheline
mmap_lock->owner (97%)

- new kernel

mainly in the cacheline of 'mmap_lock'

mmap_lock->count (~2%)
mmap_lock->owner (95%)

I also attached the reduced pahole (pah) and perf-c2c logs for further
checking. (The absolute HITM event counts can be ignored, as the
recording time for the new/old kernels may be different.)


> It would be great to have a comment in the code - and in the commit
> message - about exactly which fields are the criticial ones. Because I
> doubt it is 'write_protect_seq' itself that matters at all.
>
> If it's "mmap_sem should be close to other commonly used fields",
> maybe we should just move mmap_sem upwards in the structure?

Ok, will add more comments if the patch is still fine with
the above updated info.

Thanks,
Feng

> Linus


Attachments:
(No filename) (4.41 kB)
pah_new.log (5.60 kB)
pah_old.log (5.49 kB)
c2c_new.log (36.65 kB)
c2c_old.log (38.80 kB)

2021-06-06 19:24:40

by Linus Torvalds

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

[ Adding Waiman Long to the participants, because this seems to be a
very specific cacheline alignment behavior of rwsems, maybe Waiman has
some comments ]

On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <[email protected]> wrote:
>
> * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
> data structure change
>
> - old kernel
>
> - first cacheline
> mmap_lock->count (75%)
> mm->mapcount (14%)
>
> - second cacheline
> mmap_lock->owner (97%)
>
> - new kernel
>
> mainly in the cacheline of 'mmap_lock'
>
> mmap_lock->count (~2%)
> mmap_lock->owner (95%)

Oooh.

It looks like pretty much all the contention is on mmap_lock, and the
difference is that the old kernel just _happened_ to split the
mmap_lock rwsem at *exactly* the right place.

The rw_semaphore structure looks like this:

struct rw_semaphore {
atomic_long_t count;
atomic_long_t owner;
struct optimistic_spin_queue osq; /* spinner MCS lock */
...

and before the addition of the 'write_protect_seq' field, the mmap_sem
was at offset 120 in 'struct mm_struct'.

Which meant that count and owner were in two different cachelines, and
then when you have contention and spend time in
rwsem_down_write_slowpath(), this is probably *exactly* the kind of
layout you want.

Because first the rwsem_write_trylock() will do a cmpxchg on the first
cacheline (for the optimistic fast-path), and then in the case of
contention, rwsem_down_write_slowpath() will just access the second
cacheline.

Which is probably just optimal for a load that spends a lot of time
contended - new waiters touch that first cacheline, and then they
queue themselves up on the second cacheline. Waiman, does that sound
believable?
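
(Very roughly, the two halves of the write-acquire path touch those two
words like this; these are simplified sketches, not the real rwsem
implementation, though count, owner, RWSEM_WRITER_LOCKED and current
are the upstream names:)

/* uncontended write acquire: one cmpxchg on sem->count, which in the
 * old layout was the last word of its cacheline */
static inline bool write_trylock_sketch(struct rw_semaphore *sem)
{
	long old = 0;	/* RWSEM_UNLOCKED_VALUE */

	if (atomic_long_try_cmpxchg_acquire(&sem->count, &old,
					    RWSEM_WRITER_LOCKED)) {
		atomic_long_set(&sem->owner, (long)current);
		return true;
	}
	return false;
}

/* contended path: the optimistic spinner mostly re-reads sem->owner,
 * which in the old layout lived in the following cacheline */
static void spin_on_owner_sketch(struct rw_semaphore *sem)
{
	/* ignoring the flag bits kept in the low bits of owner */
	while (atomic_long_read(&sem->owner))
		cpu_relax();
}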

Anyway, I'm certainly ok with the patch that just moves
'write_protect_seq' down, it might be worth commenting about how this
is about some very special cache layout of the mmap_sem part of the
structure.

That said, this means that it all is very subtle dependent on a lot of
kernel config options, and I'm not sure how relevant the exact kernel
options are that the test robot has been using. But even if this is
just a "kernel test robot reports", I think it's an interesting case
and worth a comment for when this happens next time...

Linus

2021-06-06 22:20:11

by Waiman Long

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On 6/6/21 3:20 PM, Linus Torvalds wrote:
> [ Adding Waiman Long to the participants, because this seems to be a
> very specific cacheline alignment behavior of rwsems, maybe Waiman has
> some comments ]
>
> On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <[email protected]> wrote:
>> * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
>> data structure change
>>
>> - old kernel
>>
>> - first cacheline
>> mmap_lock->count (75%)
>> mm->mapcount (14%)
>>
>> - second cacheline
>> mmap_lock->owner (97%)
>>
>> - new kernel
>>
>> mainly in the cacheline of 'mmap_lock'
>>
>> mmap_lock->count (~2%)
>> mmap_lock->owner (95%)
> Oooh.
>
> It looks like pretty much all the contention is on mmap_lock, and the
> difference is that the old kernel just _happened_ to split the
> mmap_lock rwsem at *exactly* the right place.
>
> The rw_semaphore structure looks like this:
>
> struct rw_semaphore {
> atomic_long_t count;
> atomic_long_t owner;
> struct optimistic_spin_queue osq; /* spinner MCS lock */
> ...
>
> and before the addition of the 'write_protect_seq' field, the mmap_sem
> was at offset 120 in 'struct mm_struct'.
>
> Which meant that count and owner were in two different cachelines, and
> then when you have contention and spend time in
> rwsem_down_write_slowpath(), this is probably *exactly* the kind of
> layout you want.
>
> Because first the rwsem_write_trylock() will do a cmpxchg on the first
> cacheline (for the optimistic fast-path), and then in the case of
> contention, rwsem_down_write_slowpath() will just access the second
> cacheline.
>
> Which is probably just optimal for a load that spends a lot of time
> contended - new waiters touch that first cacheline, and then they
> queue themselves up on the second cacheline. Waiman, does that sound
> believable?

Yes, I think so.

The count field is accessed when a task tries to acquire the rwsem or
when an owner releases the lock. If the trylock fails, the writer will go
into the slowpath doing optimistic spinning on the owner field. As a
result, a lot of reads to owner are issued relative to the read/write of
count. Normally, there should only be one spinner that has the OSQ lock
spinning on owner and the 9% performance degradation seems a bit high to
me. In the rare case that the head waiter in the wait queue sets the
handoff flag, the waiter may also spin on owner causing a bit more
contention on the owner cacheline. I will do further investigation on
this possibility when I have time.

Cheers,
Longman
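
[ A heavily simplified userspace sketch (C11 atomics) of the access
pattern described above -- a toy stand-in for the real rwsem, only
meant to show which field each writer path hammers: ]

#include <stdatomic.h>
#include <stdbool.h>

struct toy_rwsem {
	_Atomic long count;	/* touched on every acquire/release attempt */
	_Atomic long owner;	/* read in a tight loop by a spinning writer */
};

bool toy_write_trylock(struct toy_rwsem *sem, long me)
{
	long expected = 0;

	/* fast path: a single cmpxchg on the 'count' field */
	if (!atomic_compare_exchange_strong(&sem->count, &expected, 1))
		return false;
	atomic_store(&sem->owner, me);
	return true;
}

void toy_write_lock(struct toy_rwsem *sem, long me)
{
	while (!toy_write_trylock(sem, me)) {
		/* "optimistic spinning": keep re-reading 'owner' until the
		 * current holder appears to have released the lock */
		while (atomic_load(&sem->owner) != 0)
			;	/* cpu_relax() in the real thing */
	}
}

void toy_write_unlock(struct toy_rwsem *sem)
{
	atomic_store(&sem->owner, 0);
	atomic_store(&sem->count, 0);
}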

2021-06-07 06:07:13

by Feng Tang

[permalink] [raw]
Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Sun, Jun 06, 2021 at 12:20:46PM -0700, Linus Torvalds wrote:
> [ Adding Waiman Long to the participants, because this seems to be a
> very specific cacheline alignment behavior of rwsems, maybe Waiman has
> some comments ]
>
> On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <[email protected]> wrote:
> >
> > * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
> > data structure change
> >
> > - old kernel
> >
> > - first cacheline
> > mmap_lock->count (75%)
> > mm->mapcount (14%)
> >
> > - second cacheline
> > mmap_lock->owner (97%)
> >
> > - new kernel
> >
> > mainly in the cacheline of 'mmap_lock'
> >
> > mmap_lock->count (~2%)
> > mmap_lock->owner (95%)
>
> Oooh.
>
> It looks like pretty much all the contention is on mmap_lock, and the
> difference is that the old kernel just _happened_ to split the
> mmap_lock rwsem at *exactly* the right place.
>
> The rw_semaphore structure looks like this:
>
> struct rw_semaphore {
> atomic_long_t count;
> atomic_long_t owner;
> struct optimistic_spin_queue osq; /* spinner MCS lock */
> ...
>
> and before the addition of the 'write_protect_seq' field, the mmap_sem
> was at offset 120 in 'struct mm_struct'.
>
> Which meant that count and owner were in two different cachelines, and
> then when you have contention and spend time in
> rwsem_down_write_slowpath(), this is probably *exactly* the kind of
> layout you want.
>
> Because first the rwsem_write_trylock() will do a cmpxchg on the first
> cacheline (for the optimistic fast-path), and then in the case of
> contention, rwsem_down_write_slowpath() will just access the second
> cacheline.
>
> Which is probably just optimal for a load that spends a lot of time
> contended - new waiters touch that first cacheline, and then they
> queue themselves up on the second cacheline. Waiman, does that sound
> believable?
>
> Anyway, I'm certainly ok with the patch that just moves
> 'write_protect_seq' down, it might be worth commenting about how this
> is about some very special cache layout of the mmap_sem part of the
> structure.
>
> That said, this means that it all is very subtly dependent on a lot of
> kernel config options, and I'm not sure how relevant the exact kernel
> options are that the test robot has been using. But even if this is
> just a "kernel test robot reports", I think it's an interesting case
> and worth a comment for when this happens next time...

There are 3 kernel config options controlling fields placed before
'mmap_lock' (inside 'mm_struct'):

CONFIG_MMU
CONFIG_MEMBARRIER
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES

0day's default kernel config is similar to RHEL-8.3's, which has all
three of these enabled. IIUC, the first 2 options are 'y' in many common
configs, while 'CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES' is only available
on x86 systems.
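
[ A toy illustration of that config dependence -- standalone C on a
64-bit (LP64) build, with a made-up CONFIG_TOY_OPTION macro standing
in for the options above; build with and without -DCONFIG_TOY_OPTION
to see the later field's offset (and cacheline split) move: ]

#include <stdio.h>
#include <stddef.h>

struct toy_mm {
	char earlier_fields[116];	/* stand-in for everything earlier */
#ifdef CONFIG_TOY_OPTION
	int config_gated_field;		/* present only when the option is set */
#endif
	int seq;			/* a 4-byte field, like seqcount_t */
	struct { long count; long owner; } lock;	/* stand-in rwsem */
};

int main(void)
{
	size_t off = offsetof(struct toy_mm, lock);

	printf("lock at %zu: count in cacheline %zu, owner in cacheline %zu\n",
	       off, off / 64, (off + 8) / 64);
	return 0;
}

[ Without the option the toy lock lands at offset 120 and its two words
straddle a cacheline boundary; with it, the lock moves to 128 and they
share one -- the same 120-vs-128 situation discussed above. ]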

Please review the updated patch, thanks

- Feng


From cbdbe70fb9e5bab2988d645c6f0f614d51b2e386 Mon Sep 17 00:00:00 2001
From: Feng Tang <[email protected]>
Date: Fri, 4 Jun 2021 15:20:57 +0800
Subject: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct

0day robot reports a 9.2% regression for will-it-scale mmap1 test
case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast
from racing with COW during fork").

Further debugging shows the regression is due to the commit changing
the offset of the hot field 'mmap_lock' inside 'struct mm_struct',
and thus changing its cache alignment.

From the perf data, the contention on 'mmap_lock' is very severe and
takes around 95% of the cpu cycles; it is an rw_semaphore:

struct rw_semaphore {
	atomic_long_t count;			/* 8 bytes */
	atomic_long_t owner;			/* 8 bytes */
	struct optimistic_spin_queue osq;	/* spinner MCS lock */
	...

Before commit 57efa1fe5957 adds the 'write_protect_seq', it
happens to have a very optimal cache alignment layout, as
Linus explained:

"and before the addition of the 'write_protect_seq' field, the
mmap_sem was at offset 120 in 'struct mm_struct'.

Which meant that count and owner were in two different cachelines,
and then when you have contention and spend time in
rwsem_down_write_slowpath(), this is probably *exactly* the kind
of layout you want.

Because first the rwsem_write_trylock() will do a cmpxchg on the
first cacheline (for the optimistic fast-path), and then in the
case of contention, rwsem_down_write_slowpath() will just access
the second cacheline.

Which is probably just optimal for a load that spends a lot of
time contended - new waiters touch that first cacheline, and then
they queue themselves up on the second cacheline."

After the commit, the rw_semaphore is at offset 128, which means the
'count' and 'owner' fields are now in the same cacheline, which causes
more cache bouncing.

Currently there are "#ifdef CONFIG_XXX" blocks before 'mmap_lock'
which will affect its offset:

CONFIG_MMU
CONFIG_MEMBARRIER
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES

The layout above is for a 64-bit system with 0day's default kernel
config (similar to RHEL-8.3's config), in which all these 3 options
are 'y'. The layout can vary with different kernel configs.

Re-laying out a structure is usually a double-edged sword, as it can
help one case but hurt other cases. For this case, one solution is:
as the newly added 'write_protect_seq' is a 4-byte seqcount_t (when
CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4-byte hole
in 'mm_struct' will not change other fields' alignment, while still
fixing the regression.

[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
---
include/linux/mm_types.h | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c..cba6022 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,13 +445,6 @@ struct mm_struct {
 		 */
 		atomic_t has_pinned;

-		/**
-		 * @write_protect_seq: Locked when any thread is write
-		 * protecting pages mapped by this mm to enforce a later COW,
-		 * for instance during page table copying for fork().
-		 */
-		seqcount_t write_protect_seq;
-
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
@@ -460,6 +453,18 @@ struct mm_struct {
 		spinlock_t page_table_lock;	/* Protects page tables and some
 						 * counters
 						 */
+		/*
+		 * With some kernel config, the current mmap_lock's offset
+		 * inside 'mm_struct' is at 0x120, which is very optimal, as
+		 * its two hot fields 'count' and 'owner' sit in 2 different
+		 * cachelines, and when mmap_lock is highly contended, both
+		 * of the 2 fields will be accessed frequently, current layout
+		 * will help to reduce cache bouncing.
+		 *
+		 * So please be careful with adding new fields before
+		 * mmap_lock, which can easily push the 2 fields into one
+		 * cacheline.
+		 */
 		struct rw_semaphore mmap_lock;

 		struct list_head mmlist;	/* List of maybe swapped mm's. These
@@ -480,7 +485,15 @@ struct mm_struct {
 		unsigned long stack_vm;		/* VM_STACK */
 		unsigned long def_flags;

+		/**
+		 * @write_protect_seq: Locked when any thread is write
+		 * protecting pages mapped by this mm to enforce a later COW,
+		 * for instance during page table copying for fork().
+		 */
+		seqcount_t write_protect_seq;
+
 		spinlock_t arg_lock;		/* protect the below fields */
+
 		unsigned long start_code, end_code, start_data, end_data;
 		unsigned long start_brk, brk, start_stack;
 		unsigned long arg_start, arg_end, env_start, env_end;
--
2.7.4
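
[ The "reuse an existing hole" reasoning in the patch above can be
checked with a toy example -- standalone C11 with made-up field names,
not the real mm_struct: a 4-byte member that lands in padding which was
already there does not shift any later field. ]

#include <stddef.h>

struct toy_before {
	int a;				/* 4 bytes, then 4 bytes of padding */
	_Alignas(8) long long lock_ish;	/* 8-byte aligned, at offset 8 */
};

struct toy_after {
	int a;				/* 4 bytes */
	int new_seq;			/* fills the former padding hole */
	_Alignas(8) long long lock_ish;	/* still at offset 8 */
};

_Static_assert(offsetof(struct toy_before, lock_ish) ==
	       offsetof(struct toy_after, lock_ish),
	       "filling the hole must not move later fields");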

2021-06-08 00:08:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Sun, Jun 6, 2021 at 11:06 PM Feng Tang <[email protected]> wrote:
>
> Please review the updated patch, thanks

Looks good to me. Thanks,

Linus