2021-05-25 03:27:26

by kernel test robot

Subject: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression



Greetings,

FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:


commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master


in testcase: will-it-scale
on test machine: 96 threads 2 sockets Ice Lake with 256G memory
with following parameters:

nr_task: 50%
mode: thread
test: mmap1
cpufreq_governor: performance
ucode: 0xb000280

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process-based and a thread-based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
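
For orientation, the mmap1 testcase boils down to each thread repeatedly mapping and unmapping an anonymous region, so both sides of every iteration take mmap_lock for write. A minimal standalone approximation is sketched below; the real harness, region size and thread count differ (see the test-url above for the actual source), so the values here are assumptions:

/* rough standalone approximation of will-it-scale's mmap1 per-thread loop */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MEMSIZE    (128UL * 1024 * 1024)  /* assumed mapping size */
#define NR_LOOPS   100000
#define NR_THREADS 4                      /* stand-in for nr_task=50% */

static void *testcase(void *arg)
{
	unsigned long *ops = arg;

	for (unsigned long i = 0; i < NR_LOOPS; i++) {
		/* each iteration takes mmap_lock for write twice:
		 * once in mmap() and once in munmap() */
		void *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (c == MAP_FAILED)
			exit(1);
		munmap(c, MEMSIZE);
		(*ops)++;
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	unsigned long ops[NR_THREADS] = { 0 };

	for (int i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, testcase, &ops[i]);
	for (int i = 0; i < NR_THREADS; i++) {
		pthread_join(tid[i], NULL);
		printf("thread %d: %lu ops\n", i, ops[i]);
	}
	return 0;
}

This is why the rwsem write paths (down_write_killable via vm_mmap_pgoff and __vm_munmap) dominate the perf profiles below.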

In addition to that, the commit also has significant impact on the following tests:

+------------------+---------------------------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |
| test machine | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory |
| test parameters | cpufreq_governor=performance |
| | mode=thread |
| | nr_task=50% |
| | test=mmap1 |
| | ucode=0x5003006 |
+------------------+---------------------------------------------------------------------------------+


If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
bin/lkp run generated-yaml-file

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp1/mmap1/will-it-scale/0xb000280

commit:
c28b1fc703 ("mm/gup: reorganize internal_get_user_pages_fast()")
57efa1fe59 ("mm/gup: prevent gup_fast from racing with COW during fork")

c28b1fc70390df32 57efa1fe5957694fa541c9062de
---------------- ---------------------------
%stddev %change %stddev
\ | \
342141 -9.2% 310805 ? 2% will-it-scale.48.threads
7127 -9.2% 6474 ? 2% will-it-scale.per_thread_ops
342141 -9.2% 310805 ? 2% will-it-scale.workload
2555927 ? 3% +45.8% 3727702 meminfo.Committed_AS
12108 ? 13% -36.7% 7665 ? 7% vmstat.system.cs
1142492 ? 30% -47.3% 602364 ? 11% cpuidle.C1.usage
282373 ? 13% -45.6% 153684 ? 7% cpuidle.POLL.usage
48437 ? 3% -5.9% 45563 proc-vmstat.nr_active_anon
54617 ? 3% -5.5% 51602 proc-vmstat.nr_shmem
48437 ? 3% -5.9% 45563 proc-vmstat.nr_zone_active_anon
70511 ? 3% -5.1% 66942 ? 2% proc-vmstat.pgactivate
278653 ? 8% +23.4% 343904 ? 4% sched_debug.cpu.avg_idle.stddev
22572 ? 16% -36.3% 14378 ? 4% sched_debug.cpu.nr_switches.avg
66177 ? 16% -36.8% 41800 ? 21% sched_debug.cpu.nr_switches.max
11613 ? 15% -41.4% 6810 ? 23% sched_debug.cpu.nr_switches.stddev
22.96 ? 15% +55.6% 35.73 ? 12% perf-sched.total_wait_and_delay.average.ms
69713 ? 19% -38.0% 43235 ? 12% perf-sched.total_wait_and_delay.count.ms
22.95 ? 15% +55.6% 35.72 ? 12% perf-sched.total_wait_time.average.ms
29397 ? 23% -35.3% 19030 ? 17% perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
31964 ? 20% -50.8% 15738 ? 14% perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
4.59 ? 6% +12.2% 5.15 ? 4% perf-stat.i.MPKI
3.105e+09 -2.1% 3.04e+09 perf-stat.i.branch-instructions
12033 ? 12% -36.8% 7600 ? 7% perf-stat.i.context-switches
10.06 +1.9% 10.25 perf-stat.i.cpi
4.067e+09 -1.3% 4.016e+09 perf-stat.i.dTLB-loads
4.521e+08 -5.1% 4.291e+08 ? 2% perf-stat.i.dTLB-stores
1.522e+10 -1.6% 1.497e+10 perf-stat.i.instructions
0.10 -1.9% 0.10 perf-stat.i.ipc
0.19 ? 8% -22.8% 0.15 ? 5% perf-stat.i.metric.K/sec
80.30 -1.7% 78.93 perf-stat.i.metric.M/sec
167270 ? 6% -14.9% 142312 ? 11% perf-stat.i.node-loads
49.76 -1.6 48.11 perf-stat.i.node-store-miss-rate%
3945152 +6.2% 4189006 perf-stat.i.node-stores
4.59 ? 6% +12.1% 5.15 ? 4% perf-stat.overall.MPKI
10.04 +1.8% 10.23 perf-stat.overall.cpi
0.10 -1.8% 0.10 perf-stat.overall.ipc
49.76 -1.6 48.12 perf-stat.overall.node-store-miss-rate%
13400506 +8.2% 14504566 perf-stat.overall.path-length
3.094e+09 -2.1% 3.03e+09 perf-stat.ps.branch-instructions
12087 ? 13% -36.9% 7622 ? 7% perf-stat.ps.context-switches
4.054e+09 -1.3% 4.002e+09 perf-stat.ps.dTLB-loads
4.508e+08 -5.1% 4.278e+08 ? 2% perf-stat.ps.dTLB-stores
1.516e+10 -1.6% 1.492e+10 perf-stat.ps.instructions
3932404 +6.2% 4175831 perf-stat.ps.node-stores
4.584e+12 -1.7% 4.506e+12 perf-stat.total.instructions
364038 ? 6% -40.3% 217265 ? 9% interrupts.CAL:Function_call_interrupts
5382 ? 33% -63.4% 1970 ? 35% interrupts.CPU44.CAL:Function_call_interrupts
6325 ? 19% -58.1% 2650 ? 37% interrupts.CPU47.CAL:Function_call_interrupts
11699 ? 19% -60.6% 4610 ? 23% interrupts.CPU48.CAL:Function_call_interrupts
94.20 ? 22% -45.8% 51.09 ? 46% interrupts.CPU48.TLB:TLB_shootdowns
9223 ? 24% -52.5% 4383 ? 28% interrupts.CPU49.CAL:Function_call_interrupts
9507 ? 24% -57.5% 4040 ? 27% interrupts.CPU50.CAL:Function_call_interrupts
4530 ? 18% -33.9% 2993 ? 28% interrupts.CPU62.CAL:Function_call_interrupts
82.00 ? 21% -41.9% 47.64 ? 38% interrupts.CPU63.TLB:TLB_shootdowns
4167 ? 16% -45.4% 2276 ? 22% interrupts.CPU64.CAL:Function_call_interrupts
135.20 ? 31% -58.4% 56.27 ? 51% interrupts.CPU64.TLB:TLB_shootdowns
4155 ? 17% -42.5% 2387 ? 27% interrupts.CPU65.CAL:Function_call_interrupts
95.00 ? 48% -53.8% 43.91 ? 42% interrupts.CPU65.TLB:TLB_shootdowns
4122 ? 20% -39.4% 2497 ? 29% interrupts.CPU66.CAL:Function_call_interrupts
3954 ? 14% -41.4% 2318 ? 28% interrupts.CPU67.CAL:Function_call_interrupts
3802 ? 17% -41.9% 2209 ? 17% interrupts.CPU70.CAL:Function_call_interrupts
3787 ? 11% -48.2% 1961 ? 29% interrupts.CPU71.CAL:Function_call_interrupts
3580 ? 14% -45.1% 1964 ? 19% interrupts.CPU72.CAL:Function_call_interrupts
3711 ? 20% -51.3% 1806 ? 25% interrupts.CPU73.CAL:Function_call_interrupts
3494 ? 21% -40.6% 2076 ? 21% interrupts.CPU76.CAL:Function_call_interrupts
3416 ? 21% -45.2% 1873 ? 26% interrupts.CPU77.CAL:Function_call_interrupts
3047 ? 24% -38.0% 1890 ? 18% interrupts.CPU78.CAL:Function_call_interrupts
3102 ? 28% -41.8% 1805 ? 16% interrupts.CPU80.CAL:Function_call_interrupts
2811 ? 23% -36.5% 1785 ? 22% interrupts.CPU83.CAL:Function_call_interrupts
2617 ? 17% -30.7% 1814 ? 30% interrupts.CPU84.CAL:Function_call_interrupts
3322 ? 25% -38.1% 2055 ? 29% interrupts.CPU87.CAL:Function_call_interrupts
2941 ? 12% -39.2% 1787 ? 27% interrupts.CPU93.CAL:Function_call_interrupts
72.56 -19.7 52.82 perf-profile.calltrace.cycles-pp.__mmap
72.52 -19.7 52.78 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__mmap
72.48 -19.7 52.74 perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
72.49 -19.7 52.76 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
72.47 -19.7 52.74 perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
71.74 -19.7 52.04 perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
71.63 -19.7 51.95 perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
71.52 -19.6 51.88 perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
70.12 -19.2 50.92 perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
0.91 ? 2% -0.2 0.70 perf-profile.calltrace.cycles-pp.rwsem_spin_on_owner.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
0.87 ? 2% +0.1 0.95 ? 2% perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +0.6 0.63 ? 2% perf-profile.calltrace.cycles-pp.rwsem_spin_on_owner.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
24.24 ? 3% +19.4 43.62 perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
24.72 ? 3% +19.8 44.47 perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
24.87 ? 3% +19.8 44.62 perf-profile.calltrace.cycles-pp.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
24.78 ? 3% +19.8 44.54 perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64
25.94 ? 3% +19.8 45.73 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
25.97 ? 3% +19.8 45.77 perf-profile.calltrace.cycles-pp.__munmap
25.90 ? 3% +19.8 45.70 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
25.88 ? 3% +19.8 45.68 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
25.87 ? 3% +19.8 45.67 perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
72.57 -19.7 52.83 perf-profile.children.cycles-pp.__mmap
72.48 -19.7 52.74 perf-profile.children.cycles-pp.ksys_mmap_pgoff
72.48 -19.7 52.74 perf-profile.children.cycles-pp.vm_mmap_pgoff
0.22 ? 5% -0.1 0.14 ? 6% perf-profile.children.cycles-pp.unmap_region
0.08 ? 23% -0.0 0.04 ? 61% perf-profile.children.cycles-pp.__schedule
0.06 ? 7% -0.0 0.03 ? 75% perf-profile.children.cycles-pp.perf_event_mmap
0.12 ? 4% -0.0 0.09 ? 5% perf-profile.children.cycles-pp.up_write
0.09 ? 7% -0.0 0.06 ? 16% perf-profile.children.cycles-pp.unmap_vmas
0.10 ? 4% -0.0 0.08 ? 3% perf-profile.children.cycles-pp.up_read
0.18 ? 2% +0.0 0.20 ? 3% perf-profile.children.cycles-pp.vm_area_dup
0.18 ? 5% +0.0 0.21 ? 2% perf-profile.children.cycles-pp.vma_merge
0.12 ? 4% +0.0 0.14 ? 4% perf-profile.children.cycles-pp.kmem_cache_free
0.19 ? 6% +0.0 0.23 ? 2% perf-profile.children.cycles-pp.get_unmapped_area
0.16 ? 6% +0.0 0.20 ? 2% perf-profile.children.cycles-pp.vm_unmapped_area
0.17 ? 6% +0.0 0.21 ? 2% perf-profile.children.cycles-pp.arch_get_unmapped_area_topdown
0.07 ? 10% +0.1 0.14 ? 16% perf-profile.children.cycles-pp.find_vma
0.27 ? 4% +0.1 0.35 ? 2% perf-profile.children.cycles-pp.__vma_adjust
0.35 ? 2% +0.1 0.43 ? 3% perf-profile.children.cycles-pp.__split_vma
0.87 ? 2% +0.1 0.95 ? 2% perf-profile.children.cycles-pp.__do_munmap
1.23 +0.1 1.33 perf-profile.children.cycles-pp.rwsem_spin_on_owner
25.98 ? 3% +19.8 45.78 perf-profile.children.cycles-pp.__munmap
25.87 ? 3% +19.8 45.68 perf-profile.children.cycles-pp.__vm_munmap
25.88 ? 3% +19.8 45.68 perf-profile.children.cycles-pp.__x64_sys_munmap
0.53 ? 2% -0.2 0.35 ? 3% perf-profile.self.cycles-pp.rwsem_optimistic_spin
0.08 ? 5% -0.1 0.03 ? 75% perf-profile.self.cycles-pp.do_mmap
0.11 ? 6% -0.0 0.09 ? 5% perf-profile.self.cycles-pp.up_write
0.19 ? 4% -0.0 0.16 ? 5% perf-profile.self.cycles-pp.down_write_killable
0.05 ? 8% +0.0 0.07 ? 8% perf-profile.self.cycles-pp.downgrade_write
0.11 ? 4% +0.0 0.14 ? 4% perf-profile.self.cycles-pp.__vma_adjust
0.16 ? 6% +0.0 0.20 ? 3% perf-profile.self.cycles-pp.vm_unmapped_area
0.05 ? 9% +0.0 0.10 ? 13% perf-profile.self.cycles-pp.find_vma
1.21 +0.1 1.31 perf-profile.self.cycles-pp.rwsem_spin_on_owner



will-it-scale.per_thread_ops

7400 +--------------------------------------------------------------------+
| + |
7200 |.++. .+. +.+ .++. : :.+ .+ +. |
| ++.+.++ ++ + +.+.++.+ ++.+.++.++. : + + : : + |
7000 |-+ + + : + :: |
| + + .+ : + |
6800 |-+ + + |
| O |
6600 |-+ O O O O O O OO OO O |
| OO O O O OO O O O O O O O O O |
6400 |-+ O O O O O O OO O |
| O O O |
6200 |-+ O |
| O |
6000 +--------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample

***************************************************************************************************
lkp-csl-2sp9: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2sp9/mmap1/will-it-scale/0x5003006

commit:
c28b1fc703 ("mm/gup: reorganize internal_get_user_pages_fast()")
57efa1fe59 ("mm/gup: prevent gup_fast from racing with COW during fork")

c28b1fc70390df32 57efa1fe5957694fa541c9062de
---------------- ---------------------------
%stddev %change %stddev
\ | \
247840 +3.7% 257132 ? 2% will-it-scale.44.threads
5632 +3.7% 5843 ? 2% will-it-scale.per_thread_ops
247840 +3.7% 257132 ? 2% will-it-scale.workload
0.10 ? 5% +0.0 0.13 ? 8% perf-profile.children.cycles-pp.find_vma
14925 ? 19% -48.2% 7724 ? 8% softirqs.CPU87.SCHED
9950 ? 3% -36.1% 6355 ? 2% vmstat.system.cs
3312916 ? 4% +13.9% 3774536 ? 9% cpuidle.C1.time
1675504 ? 5% -36.6% 1061625 cpuidle.POLL.time
987055 ? 5% -41.8% 574757 ? 2% cpuidle.POLL.usage
165545 ? 3% -12.2% 145358 ? 4% meminfo.Active
165235 ? 3% -12.1% 145188 ? 4% meminfo.Active(anon)
180757 ? 3% -11.7% 159538 ? 3% meminfo.Shmem
2877001 ? 11% +16.2% 3342948 ? 10% sched_debug.cfs_rq:/.min_vruntime.avg
5545708 ? 11% +9.8% 6086941 ? 8% sched_debug.cfs_rq:/.min_vruntime.max
2773178 ? 11% +15.4% 3199941 ? 9% sched_debug.cfs_rq:/.spread0.avg
733740 ? 3% -12.0% 646033 ? 5% sched_debug.cpu.avg_idle.avg
17167 ? 10% -28.2% 12332 ? 7% sched_debug.cpu.nr_switches.avg
49180 ? 14% -33.5% 32687 ? 22% sched_debug.cpu.nr_switches.max
9311 ? 18% -36.2% 5943 ? 22% sched_debug.cpu.nr_switches.stddev
41257 ? 3% -12.1% 36252 ? 4% proc-vmstat.nr_active_anon
339681 -1.6% 334294 proc-vmstat.nr_file_pages
10395 -3.5% 10036 proc-vmstat.nr_mapped
45130 ? 3% -11.7% 39848 ? 3% proc-vmstat.nr_shmem
41257 ? 3% -12.1% 36252 ? 4% proc-vmstat.nr_zone_active_anon
841530 -1.7% 826917 proc-vmstat.numa_local
21515 ? 11% -68.9% 6684 ? 70% proc-vmstat.numa_pages_migrated
60224 ? 3% -11.1% 53513 ? 3% proc-vmstat.pgactivate
981265 -2.5% 956415 proc-vmstat.pgalloc_normal
895893 -1.9% 878978 proc-vmstat.pgfree
21515 ? 11% -68.9% 6684 ? 70% proc-vmstat.pgmigrate_success
0.07 ?135% -74.1% 0.02 ? 5% perf-sched.sch_delay.max.ms.preempt_schedule_common._cond_resched.stop_one_cpu.__set_cpus_allowed_ptr.sched_setaffinity
21.44 ? 5% +80.9% 38.79 ? 3% perf-sched.total_wait_and_delay.average.ms
67273 ? 6% -44.9% 37095 ? 5% perf-sched.total_wait_and_delay.count.ms
21.44 ? 5% +80.9% 38.79 ? 3% perf-sched.total_wait_time.average.ms
0.08 ? 14% +60.1% 0.13 ? 9% perf-sched.wait_and_delay.avg.ms.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
0.09 ? 12% +58.0% 0.15 ? 15% perf-sched.wait_and_delay.avg.ms.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
255.38 ? 14% +22.1% 311.71 ? 17% perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
31877 ? 10% -54.2% 14606 ? 13% perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
27110 ? 7% -47.3% 14280 ? 4% perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
138.60 ? 13% -21.4% 109.00 ? 15% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
1.00 ?199% -99.9% 0.00 ?200% perf-sched.wait_time.avg.ms.preempt_schedule_common._cond_resched.remove_vma.__do_munmap.__vm_munmap
0.08 ? 14% +60.9% 0.13 ? 9% perf-sched.wait_time.avg.ms.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
0.09 ? 12% +58.2% 0.15 ? 15% perf-sched.wait_time.avg.ms.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
255.38 ? 14% +22.1% 311.71 ? 17% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
1.00 ?199% -99.9% 0.00 ?200% perf-sched.wait_time.max.ms.preempt_schedule_common._cond_resched.remove_vma.__do_munmap.__vm_munmap
4.99 -36.1% 3.19 ? 36% perf-sched.wait_time.max.ms.rcu_gp_kthread.kthread.ret_from_fork
9869 ? 3% -36.2% 6295 ? 2% perf-stat.i.context-switches
0.00 ? 7% +0.0 0.00 ? 29% perf-stat.i.dTLB-load-miss-rate%
76953 ? 7% +327.4% 328871 ? 29% perf-stat.i.dTLB-load-misses
4152320 -3.0% 4026365 perf-stat.i.iTLB-load-misses
1665297 -2.2% 1628746 perf-stat.i.iTLB-loads
8627 +3.5% 8933 perf-stat.i.instructions-per-iTLB-miss
0.33 ? 3% -11.0% 0.29 ? 6% perf-stat.i.metric.K/sec
87.42 +1.7 89.11 perf-stat.i.node-load-miss-rate%
7507752 -9.2% 6814138 perf-stat.i.node-load-misses
1078418 ? 2% -22.9% 831563 ? 3% perf-stat.i.node-loads
3091445 -8.2% 2838247 perf-stat.i.node-store-misses
0.00 ? 7% +0.0 0.00 ? 29% perf-stat.overall.dTLB-load-miss-rate%
8599 +3.6% 8907 perf-stat.overall.instructions-per-iTLB-miss
87.44 +1.7 89.13 perf-stat.overall.node-load-miss-rate%
43415811 -3.3% 41994695 ? 2% perf-stat.overall.path-length
9895 ? 3% -36.4% 6291 ? 2% perf-stat.ps.context-switches
76756 ? 7% +327.0% 327716 ? 29% perf-stat.ps.dTLB-load-misses
4138410 -3.0% 4012712 perf-stat.ps.iTLB-load-misses
1659653 -2.2% 1623167 perf-stat.ps.iTLB-loads
7483002 -9.2% 6791226 perf-stat.ps.node-load-misses
1074856 ? 2% -22.9% 828780 ? 3% perf-stat.ps.node-loads
3081222 -8.2% 2828732 perf-stat.ps.node-store-misses
335021 ? 2% -27.9% 241715 ? 12% interrupts.CAL:Function_call_interrupts
3662 ? 31% -61.3% 1417 ? 16% interrupts.CPU10.CAL:Function_call_interrupts
4671 ? 32% -65.6% 1607 ? 30% interrupts.CPU12.CAL:Function_call_interrupts
4999 ? 34% -68.1% 1592 ? 43% interrupts.CPU14.CAL:Function_call_interrupts
129.00 ? 30% -46.8% 68.60 ? 34% interrupts.CPU14.RES:Rescheduling_interrupts
4531 ? 49% -58.5% 1881 ? 39% interrupts.CPU15.CAL:Function_call_interrupts
4639 ? 28% -37.6% 2893 ? 2% interrupts.CPU18.NMI:Non-maskable_interrupts
4639 ? 28% -37.6% 2893 ? 2% interrupts.CPU18.PMI:Performance_monitoring_interrupts
6310 ? 49% -68.5% 1988 ? 57% interrupts.CPU21.CAL:Function_call_interrupts
149.40 ? 49% -49.3% 75.80 ? 42% interrupts.CPU21.RES:Rescheduling_interrupts
3592 ? 38% -63.0% 1330 ? 14% interrupts.CPU24.CAL:Function_call_interrupts
5350 ? 21% -30.5% 3720 ? 44% interrupts.CPU24.NMI:Non-maskable_interrupts
5350 ? 21% -30.5% 3720 ? 44% interrupts.CPU24.PMI:Performance_monitoring_interrupts
139.00 ? 27% -33.4% 92.60 ? 26% interrupts.CPU24.RES:Rescheduling_interrupts
3858 ? 42% -53.7% 1785 ? 38% interrupts.CPU26.CAL:Function_call_interrupts
5964 ? 28% -42.4% 3432 ? 55% interrupts.CPU27.NMI:Non-maskable_interrupts
5964 ? 28% -42.4% 3432 ? 55% interrupts.CPU27.PMI:Performance_monitoring_interrupts
3429 ? 37% -57.1% 1470 ? 44% interrupts.CPU28.CAL:Function_call_interrupts
3008 ? 35% -37.6% 1877 ? 38% interrupts.CPU29.CAL:Function_call_interrupts
4684 ? 73% -60.0% 1872 ? 34% interrupts.CPU30.CAL:Function_call_interrupts
4300 ? 46% -54.7% 1949 ? 13% interrupts.CPU43.CAL:Function_call_interrupts
10255 ? 26% -50.0% 5127 ? 29% interrupts.CPU44.CAL:Function_call_interrupts
5800 ? 20% -28.3% 4158 ? 27% interrupts.CPU52.CAL:Function_call_interrupts
4802 ? 19% -31.7% 3279 ? 18% interrupts.CPU58.CAL:Function_call_interrupts
4042 ? 32% -65.6% 1391 ? 41% interrupts.CPU6.CAL:Function_call_interrupts
128.60 ? 31% -52.9% 60.60 ? 38% interrupts.CPU6.RES:Rescheduling_interrupts
4065 ? 20% -37.8% 2530 ? 6% interrupts.CPU63.CAL:Function_call_interrupts
4340 ? 24% -36.2% 2771 ? 11% interrupts.CPU64.CAL:Function_call_interrupts
3983 ? 11% -27.1% 2904 ? 19% interrupts.CPU65.CAL:Function_call_interrupts
3392 ? 25% -55.2% 1518 ? 53% interrupts.CPU7.CAL:Function_call_interrupts
171.80 ? 67% -62.5% 64.40 ? 32% interrupts.CPU7.RES:Rescheduling_interrupts
2942 ? 33% -50.5% 1455 ? 25% interrupts.CPU8.CAL:Function_call_interrupts
7818 -27.3% 5681 ? 31% interrupts.CPU85.NMI:Non-maskable_interrupts
7818 -27.3% 5681 ? 31% interrupts.CPU85.PMI:Performance_monitoring_interrupts
320.80 ? 54% -44.6% 177.80 ? 58% interrupts.CPU87.TLB:TLB_shootdowns
3212 ? 31% -64.8% 1130 ? 36% interrupts.CPU9.CAL:Function_call_interrupts





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (26.93 kB)
job-script (7.92 kB)
job.yaml (5.14 kB)
reproduce (346.00 B)

2021-05-25 03:55:18

by Linus Torvalds

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Mon, May 24, 2021 at 5:00 PM kernel test robot <[email protected]> wrote:
>
> FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")

Hmm. This looks like one of those "random fluctuations" things.

It would be good to hear if other test-cases also bisect to the same
thing, but this report already says:

> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+---------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |

which does kind of reinforce that "this benchmark gives unstable numbers".

The perf data doesn't even mention any of the GUP paths, and on the
pure fork path the biggest impact would be:

(a) maybe "struct mm_struct" changed in size or had a different cache layout

(b) two added (nonatomic) increment operations in the fork path due
to the seqcount

and I'm not seeing what would cause that 9% change. Obviously cache
placement has done it before.
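
(For reference on point (b): what 57efa1fe5957 adds is roughly the following seqcount pattern. This is a simplified sketch, not the exact upstream hunks; the function names used are real kernel APIs, but the surrounding code is abbreviated.)

/* fork side (copy_page_range()): the begin/end pair is the two
 * non-atomic seqcount increments referred to above; only taken for
 * COW mappings, with mmap_lock already held for write. */
static void fork_side_sketch(struct mm_struct *src_mm)
{
	mmap_assert_write_locked(src_mm);
	raw_write_seqcount_begin(&src_mm->write_protect_seq);
	/* ... copy page tables, write-protecting COW pages ... */
	raw_write_seqcount_end(&src_mm->write_protect_seq);
}

/* gup_fast side: sample the seqcount before the lockless walk and
 * bail out if a fork raced with the pinning. */
static int gup_fast_side_sketch(struct mm_struct *mm)
{
	unsigned int seq = raw_read_seqcount(&mm->write_protect_seq);

	if (seq & 1)		/* fork in progress, take the slow path */
		return -EAGAIN;
	/* ... walk page tables without locks, pin pages ... */
	if (read_seqcount_retry(&mm->write_protect_seq, seq))
		return -EAGAIN;	/* raced with fork: unpin and retry */
	return 0;
}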

If somebody else sees something that I'm missing, please holler. But
I'll ignore this as "noise" otherwise.

Linus

2021-06-04 07:05:34

by Feng Tang

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

Hi Linus,

Sorry for the late response.

On Mon, May 24, 2021 at 05:11:37PM -1000, Linus Torvalds wrote:
> On Mon, May 24, 2021 at 5:00 PM kernel test robot <[email protected]> wrote:
> >
> > FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> > commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
>
> Hmm. This looks like one of those "random fluctuations" things.
>
> It would be good to hear if other test-cases also bisect to the same
> thing, but this report already says:
>
> > In addition to that, the commit also has significant impact on the following tests:
> >
> > +------------------+---------------------------------------------------------------------------------+
> > | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |
>
> which does kind of reinforce that "this benchmark gives unstable numbers".
>
> The perf data doesn't even mention any of the GUP paths, and on the
> pure fork path the biggest impact would be:
>
> (a) maybe "struct mm_struct" changed in size or had a different cache layout

Yes, this seems to be the cause of the regression.

In the test case, many threads are doing map/unmap at the same time,
so the process's rw_semaphore 'mmap_lock' is highly contended.

Before the patch (with 0day's kconfig), the mmap_lock is split across
2 cachelines: the 'count' is in one line, and the other members
sit in the next line, which luckily avoids some cache bouncing. After
the patch, the 'mmap_lock' is pushed into one cacheline, which may
cause the regression.

Below is the pahole info:

- before the patch

spinlock_t page_table_lock; /* 116 4 */
struct rw_semaphore mmap_lock; /* 120 40 */
/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
struct list_head mmlist; /* 160 16 */
long unsigned int hiwater_rss; /* 176 8 */

- after the patch

spinlock_t page_table_lock; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
struct rw_semaphore mmap_lock; /* 128 40 */
struct list_head mmlist; /* 168 16 */
long unsigned int hiwater_rss; /* 184 8 */

perf c2c log can also confirm this.
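
(To make the split explicit with the offsets above, assuming 64-byte
cachelines: before the patch 'count' occupies bytes 120-127, the tail
of one line, and 'owner' starts the next line at byte 128; after the
patch both words sit in the line starting at 128. A trivial userspace
check of that arithmetic, with the offsets hard-coded from the pahole
output above, which are kconfig-dependent:)

#include <stdio.h>

int main(void)
{
	const unsigned cl = 64;               /* assumed cacheline size */
	unsigned before = 120, after = 128;   /* offset of mmap_lock    */

	printf("before: count in line %u, owner in line %u\n",
	       before / cl, (before + 8) / cl);   /* -> 1 and 2 */
	printf("after:  count in line %u, owner in line %u\n",
	       after / cl, (after + 8) / cl);     /* -> 2 and 2 */
	return 0;
}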

Thanks,
Feng

> (b) two added (nonatomic) increment operations in the fork path due
> to the seqcount
>
> and I'm not seeing what would cause that 9% change. Obviously cache
> placement has done it before.
>
> If somebody else sees something that I'm missing, please holler. But
> I'll ignore this as "noise" otherwise.
>
> Linus

2021-06-04 07:54:59

by Feng Tang

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> Hi Linus,
>
> Sorry for the late response.
>
> On Mon, May 24, 2021 at 05:11:37PM -1000, Linus Torvalds wrote:
> > On Mon, May 24, 2021 at 5:00 PM kernel test robot <[email protected]> wrote:
> > >
> > > FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> > > commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
> >
> > Hmm. This looks like one of those "random fluctuations" things.
> >
> > It would be good to hear if other test-cases also bisect to the same
> > thing, but this report already says:
> >
> > > In addition to that, the commit also has significant impact on the following tests:
> > >
> > > +------------------+---------------------------------------------------------------------------------+
> > > | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |
> >
> > which does kind of reinforce that "this benchmark gives unstable numbers".
> >
> > The perf data doesn't even mention any of the GUP paths, and on the
> > pure fork path the biggest impact would be:
> >
> > (a) maybe "struct mm_struct" changed in size or had a different cache layout
>
> Yes, this seems to be the cause of the regression.
>
> The test case is many thread are doing map/unmap at the same time,
> so the process's rw_semaphore 'mmap_lock' is highly contended.
>
> Before the patch (with 0day's kconfig), the mmap_lock is separated
> into 2 cachelines, the 'count' is in one line, and the other members
> sit in the next line, so it luckily avoid some cache bouncing. After
> the patch, the 'mmap_lock' is pushed into one cacheline, which may
> cause the regression.
>
> Below is the pahole info:
>
> - before the patch
>
> spinlock_t page_table_lock; /* 116 4 */
> struct rw_semaphore mmap_lock; /* 120 40 */
> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> struct list_head mmlist; /* 160 16 */
> long unsigned int hiwater_rss; /* 176 8 */
>
> - after the patch
>
> spinlock_t page_table_lock; /* 124 4 */
> /* --- cacheline 2 boundary (128 bytes) --- */
> struct rw_semaphore mmap_lock; /* 128 40 */
> struct list_head mmlist; /* 168 16 */
> long unsigned int hiwater_rss; /* 184 8 */
>
> perf c2c log can also confirm this.

We've tried a patch which can resolve the regression. As the
newly added member 'write_protect_seq' is 4 bytes long, putting
it into an existing 4-byte hole can recover the lost performance,
while not affecting the alignment of most other members. Please
review the following patch, thanks!

- Feng

From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
From: Feng Tang <[email protected]>
Date: Fri, 4 Jun 2021 15:20:57 +0800
Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct

Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
racing with COW during fork"), on 64-bit systems the hot member
rw_semaphore 'mmap_lock' of 'mm_struct' could be split across
2 cachelines: its member 'count' sits in one cacheline while
all the other members sit in the next cacheline. This naturally
reduces some cache bouncing, and with the commit, the 'mmap_lock'
is pushed into one cacheline, as shown in the pahole info:

- before the commit

spinlock_t page_table_lock; /* 116 4 */
struct rw_semaphore mmap_lock; /* 120 40 */
/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
struct list_head mmlist; /* 160 16 */
long unsigned int hiwater_rss; /* 176 8 */

- after the commit

spinlock_t page_table_lock; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
struct rw_semaphore mmap_lock; /* 128 40 */
struct list_head mmlist; /* 168 16 */
long unsigned int hiwater_rss; /* 184 8 */

and it causes a 9.2% regression for the 'mmap1' case of the
will-it-scale benchmark[1], as in this case 'mmap_lock' is highly
contended (occupying 90%+ of cpu cycles).

Re-laying out a structure can be a double-edged sword, as it may
help some cases but hurt others. Since the newly added 'seqcount_t'
is 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n), one solution is to
place it into an existing 4-byte hole in 'mm_struct'; this will not
affect the alignment of most other members, while recovering the
lost performance.

[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
include/linux/mm_types.h | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c..5b55f88 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,13 +445,6 @@ struct mm_struct {
*/
atomic_t has_pinned;

- /**
- * @write_protect_seq: Locked when any thread is write
- * protecting pages mapped by this mm to enforce a later COW,
- * for instance during page table copying for fork().
- */
- seqcount_t write_protect_seq;
-
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* PTE page table pages */
#endif
@@ -480,7 +473,15 @@ struct mm_struct {
unsigned long stack_vm; /* VM_STACK */
unsigned long def_flags;

+ /**
+ * @write_protect_seq: Locked when any thread is write
+ * protecting pages mapped by this mm to enforce a later COW,
+ * for instance during page table copying for fork().
+ */
+ seqcount_t write_protect_seq;
+
spinlock_t arg_lock; /* protect the below fields */
+
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
--
2.7.4



2021-06-04 08:41:42

by Xing Zhengjun

Subject: Re: [LKP] Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

Hi Linus,

On 5/25/2021 11:11 AM, Linus Torvalds wrote:
> On Mon, May 24, 2021 at 5:00 PM kernel test robot <[email protected]> wrote:
>> FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
>> commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
> Hmm. This looks like one of those "random fluctuations" things.
>
> It would be good to hear if other test-cases also bisect to the same
> thing, but this report already says:
>
>> In addition to that, the commit also has significant impact on the following tests:
>>
>> +------------------+---------------------------------------------------------------------------------+
>> | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |
> which does kind of reinforce that "this benchmark gives unstable numbers".
>
> The perf data doesn't even mention any of the GUP paths, and on the
> pure fork path the biggest impact would be:
>
> (a) maybe "struct mm_struct" changed in size or had a different cache layout
I moved "write_protect_seq" to the tail of "struct mm_struct", and the
regression was reduced to -3.6%. The regression should be related to the
cache layout.
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_governor/ucode:
lkp-icl-2sp1/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/50%/thread/mmap1/performance/0xb000280

commit:
  c28b1fc70390df32e29991eedd52bd86e7aba080
  57efa1fe5957694fa541c9062de0a127f0b9acb0
  f6a9c27882d51ff551e15522992d3725c342372d  (the test patch)

c28b1fc70390df32 57efa1fe5957694fa541c9062de f6a9c27882d51ff551e15522992
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
    341938            -9.0%     311218 ±  2%      -3.6%     329513        will-it-scale.48.threads
      7123            -9.0%       6483 ±  2%      -3.6%       6864        will-it-scale.per_thread_ops
    341938            -9.0%     311218 ±  2%      -3.6%     329513        will-it-scale.workload

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 915f4f100383..34bb2a01806c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -447,13 +447,6 @@ struct mm_struct {
 		 */
 		atomic_t has_pinned;
 
-		/**
-		 * @write_protect_seq: Locked when any thread is write
-		 * protecting pages mapped by this mm to enforce a later COW,
-		 * for instance during page table copying for fork().
-		 */
-		seqcount_t write_protect_seq;
-
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
@@ -564,6 +557,12 @@ struct mm_struct {
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
 #endif
+		/**
+		 * @write_protect_seq: Locked when any thread is write
+		 * protecting pages mapped by this mm to enforce a later COW,
+		 * for instance during page table copying for fork().
+		 */
+		seqcount_t write_protect_seq;
 	} __randomize_layout;
 
 	/*

>
> (b) two added (nonatomic) increment operations in the fork path due
> to the seqcount
>
> and I'm not seeing what would cause that 9% change. Obviously cache
> placement has done it before.
>
> If somebody else sees something that I'm missing, please holler. But
> I'll ignore this as "noise" otherwise.
>
> Linus
> _______________________________________________
> LKP mailing list -- [email protected]
> To unsubscribe send an email to [email protected]

--
Zhengjun Xing

2021-06-04 17:59:54

by Linus Torvalds

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Fri, Jun 4, 2021 at 12:52 AM Feng Tang <[email protected]> wrote:
>
> On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> > >
> > > The perf data doesn't even mention any of the GUP paths, and on the
> > > pure fork path the biggest impact would be:
> > >
> > > (a) maybe "struct mm_struct" changed in size or had a different cache layout
> >
> > Yes, this seems to be the cause of the regression.
> >
> > The test case is many thread are doing map/unmap at the same time,
> > so the process's rw_semaphore 'mmap_lock' is highly contended.
> >
> > Before the patch (with 0day's kconfig), the mmap_lock is separated
> > into 2 cachelines, the 'count' is in one line, and the other members
> > sit in the next line, so it luckily avoid some cache bouncing. After
> > the patch, the 'mmap_lock' is pushed into one cacheline, which may
> > cause the regression.

Ok, thanks for following up on this.

> We've tried some patch, which can restore the regerssion. As the
> newly added member 'write_protect_seq' is 4 bytes long, and putting
> it into an existing 4 bytes long hole can restore the regeression,
> while not affecting most of other member's alignment. Please review
> the following patch, thanks!

The patch looks fine to me.

At the same time, I do wonder if maybe it would be worth exploring if
it's a good idea to perhaps move the 'mmap_sem' thing instead.

Or at least add a big comment. It's not clear to me exactly _which_
other fields are the ones that are so hot that the contention on
mmap_sem then causes even more cacheline bouncing.

For example, is it either

(a) we *want* the mmap_sem to be in the first 128-byte region,
because then when we get the mmap_sem, the other fields in that same
cacheline are hot

OR

(b) we do *not* want mmap_sem to be in the *second* 128-byte region,
because there is something *else* in that region that is touched
independently of mmap_sem that is very very hot and now you get even
more bouncing?

but I can't tell which one it is.

It would be great to have a comment in the code - and in the commit
message - about exactly which fields are the critical ones. Because I
doubt it is 'write_protect_seq' itself that matters at all.

If it's "mmap_sem should be close to other commonly used fields",
maybe we should just move mmap_sem upwards in the structure?

Linus

2021-06-04 18:00:21

by John Hubbard

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On 6/4/21 12:52 AM, Feng Tang wrote:
...
>>> The perf data doesn't even mention any of the GUP paths, and on the
>>> pure fork path the biggest impact would be:
>>>
>>> (a) maybe "struct mm_struct" changed in size or had a different cache layout
>>
>> Yes, this seems to be the cause of the regression.
>>
>> The test case is many thread are doing map/unmap at the same time,
>> so the process's rw_semaphore 'mmap_lock' is highly contended.
>>
>> Before the patch (with 0day's kconfig), the mmap_lock is separated
>> into 2 cachelines, the 'count' is in one line, and the other members
>> sit in the next line, so it luckily avoid some cache bouncing. After

Wow! That's quite a fortunate layout to land on by accident. Almost
makes me wonder if mmap_lock should be designed to do that, but it's
probably even better to just keep working on having a less contended
mmap_lock.

I *suppose* it's worth trying to keep this fragile layout in place,
but it is a landmine for anyone who touches mm_struct. And the struct
is so large already that I'm not sure a comment warning would even
be noticed. Anyway...

>> the patch, the 'mmap_lock' is pushed into one cacheline, which may
>> cause the regression.
>>
>> Below is the pahole info:
>>
>> - before the patch
>>
>> spinlock_t page_table_lock; /* 116 4 */
>> struct rw_semaphore mmap_lock; /* 120 40 */
>> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
>> struct list_head mmlist; /* 160 16 */
>> long unsigned int hiwater_rss; /* 176 8 */
>>
>> - after the patch
>>
>> spinlock_t page_table_lock; /* 124 4 */
>> /* --- cacheline 2 boundary (128 bytes) --- */
>> struct rw_semaphore mmap_lock; /* 128 40 */
>> struct list_head mmlist; /* 168 16 */
>> long unsigned int hiwater_rss; /* 184 8 */
>>
>> perf c2c log can also confirm this.
>
> We've tried some patch, which can restore the regerssion. As the
> newly added member 'write_protect_seq' is 4 bytes long, and putting
> it into an existing 4 bytes long hole can restore the regeression,
> while not affecting most of other member's alignment. Please review
> the following patch, thanks!
>

So, this is a neat little solution, if we agree that it's worth "fixing".

I'm definitely on the fence, but leaning toward, "go for it", because
I like the "no cache effect" result of using up the hole.

Reviewed-by: John Hubbard <[email protected]>

thanks,
--
John Hubbard
NVIDIA

> - Feng
>
> From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
> From: Feng Tang <[email protected]>
> Date: Fri, 4 Jun 2021 15:20:57 +0800
> Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct
>
> Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
> racing with COW during fork), on 64bits system, the hot member
> rw_semaphore 'mmap_lock' of 'mm_struct' could be separated into
> 2 cachelines, that its member 'count' sits in one cacheline while
> all other members in next cacheline, this naturally reduces some
> cache bouncing, and with the commit, the 'mmap_lock' is pushed
> into one cacheline, as shown in the pahole info:
>
> - before the commit
>
> spinlock_t page_table_lock; /* 116 4 */
> struct rw_semaphore mmap_lock; /* 120 40 */
> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> struct list_head mmlist; /* 160 16 */
> long unsigned int hiwater_rss; /* 176 8 */
>
> - after the commit
>
> spinlock_t page_table_lock; /* 124 4 */
> /* --- cacheline 2 boundary (128 bytes) --- */
> struct rw_semaphore mmap_lock; /* 128 40 */
> struct list_head mmlist; /* 168 16 */
> long unsigned int hiwater_rss; /* 184 8 */
>
> and it causes one 9.2% regression for 'mmap1' case of will-it-scale
> benchmark[1], as in the case 'mmap_lock' is highly contented (occupies
> 90%+ cpu cycles).
>
> Though relayouting a structure could be a double-edged sword, as it
> helps some case, but may hurt other cases. So one solution is the
> newly added 'seqcount_t' is 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n),
> placing it into an existing 4 bytes hole in 'mm_struct' will not
> affect most of other members's alignment, while restoring the
> regression.
>
> [1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> include/linux/mm_types.h | 15 ++++++++-------
> 1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5aacc1c..5b55f88 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -445,13 +445,6 @@ struct mm_struct {
> */
> atomic_t has_pinned;
>
> - /**
> - * @write_protect_seq: Locked when any thread is write
> - * protecting pages mapped by this mm to enforce a later COW,
> - * for instance during page table copying for fork().
> - */
> - seqcount_t write_protect_seq;
> -
> #ifdef CONFIG_MMU
> atomic_long_t pgtables_bytes; /* PTE page table pages */
> #endif
> @@ -480,7 +473,15 @@ struct mm_struct {
> unsigned long stack_vm; /* VM_STACK */
> unsigned long def_flags;
>
> + /**
> + * @write_protect_seq: Locked when any thread is write
> + * protecting pages mapped by this mm to enforce a later COW,
> + * for instance during page table copying for fork().
> + */
> + seqcount_t write_protect_seq;
> +
> spinlock_t arg_lock; /* protect the below fields */
> +
> unsigned long start_code, end_code, start_data, end_data;
> unsigned long start_brk, brk, start_stack;
> unsigned long arg_start, arg_end, env_start, env_end;
>

2021-06-06 04:50:44

by Feng Tang

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Fri, Jun 04, 2021 at 10:58:14AM -0700, John Hubbard wrote:
> On 6/4/21 12:52 AM, Feng Tang wrote:
> ...
> >>>The perf data doesn't even mention any of the GUP paths, and on the
> >>>pure fork path the biggest impact would be:
> >>>
> >>> (a) maybe "struct mm_struct" changed in size or had a different cache layout
> >>
> >>Yes, this seems to be the cause of the regression.
> >>
> >>The test case is many thread are doing map/unmap at the same time,
> >>so the process's rw_semaphore 'mmap_lock' is highly contended.
> >>
> >>Before the patch (with 0day's kconfig), the mmap_lock is separated
> >>into 2 cachelines, the 'count' is in one line, and the other members
> >>sit in the next line, so it luckily avoid some cache bouncing. After
>
> Wow! That's quite a fortunate layout to land on by accident. Almost
> makes me wonder if mmap_lock should be designed to do that, but it's
> probably even better to just keep working on having a less contended
> mmap_lock.

Yes, manipulating cache alignment is always tricky and fragile: the
data structure keeps changing, it is affected by different kernel
config options, and different workloads will see different hot
fields of it.

Optimizing 'mmap_lock' is the better and ultimate solution.

> I *suppose* it's worth trying to keep this fragile layout in place,
> but it is a landmine for anyone who touches mm_struct. And the struct
> is so large already that I'm not sure a comment warning would even
> be noticed. Anyway...

Linus also mentioned that a clear comment is needed. Will collect more info.


> >>the patch, the 'mmap_lock' is pushed into one cacheline, which may
> >>cause the regression.
> >>
> >>Below is the pahole info:
> >>
> >>- before the patch
> >>
> >> spinlock_t page_table_lock; /* 116 4 */
> >> struct rw_semaphore mmap_lock; /* 120 40 */
> >> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> >> struct list_head mmlist; /* 160 16 */
> >> long unsigned int hiwater_rss; /* 176 8 */
> >>
> >>- after the patch
> >>
> >> spinlock_t page_table_lock; /* 124 4 */
> >> /* --- cacheline 2 boundary (128 bytes) --- */
> >> struct rw_semaphore mmap_lock; /* 128 40 */
> >> struct list_head mmlist; /* 168 16 */
> >> long unsigned int hiwater_rss; /* 184 8 */
> >>
> >>perf c2c log can also confirm this.
> >We've tried some patch, which can restore the regerssion. As the
> >newly added member 'write_protect_seq' is 4 bytes long, and putting
> >it into an existing 4 bytes long hole can restore the regeression,
> >while not affecting most of other member's alignment. Please review
> >the following patch, thanks!
> >
>
> So, this is a neat little solution, if we agree that it's worth "fixing".
>
> I'm definitely on the fence, but leaning toward, "go for it", because
> I like the "no cache effect" result of using up the hole.
>
> Reviewed-by: John Hubbard <[email protected]>

Thanks for the reviewing!

- Feng


> thanks,
> --
> John Hubbard
> NVIDIA
>
> >- Feng
> >
> > From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
> >From: Feng Tang <[email protected]>
> >Date: Fri, 4 Jun 2021 15:20:57 +0800
> >Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct
> >
> >Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
> >racing with COW during fork), on 64bits system, the hot member
> >rw_semaphore 'mmap_lock' of 'mm_struct' could be separated into
> >2 cachelines, that its member 'count' sits in one cacheline while
> >all other members in next cacheline, this naturally reduces some
> >cache bouncing, and with the commit, the 'mmap_lock' is pushed
> >into one cacheline, as shown in the pahole info:
> >
> > - before the commit
> >
> > spinlock_t page_table_lock; /* 116 4 */
> > struct rw_semaphore mmap_lock; /* 120 40 */
> > /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> > struct list_head mmlist; /* 160 16 */
> > long unsigned int hiwater_rss; /* 176 8 */
> >
> > - after the commit
> >
> > spinlock_t page_table_lock; /* 124 4 */
> > /* --- cacheline 2 boundary (128 bytes) --- */
> > struct rw_semaphore mmap_lock; /* 128 40 */
> > struct list_head mmlist; /* 168 16 */
> > long unsigned int hiwater_rss; /* 184 8 */
> >
> >and it causes one 9.2% regression for 'mmap1' case of will-it-scale
> >benchmark[1], as in the case 'mmap_lock' is highly contented (occupies
> >90%+ cpu cycles).
> >
> >Though relayouting a structure could be a double-edged sword, as it
> >helps some case, but may hurt other cases. So one solution is the
> >newly added 'seqcount_t' is 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n),
> >placing it into an existing 4 bytes hole in 'mm_struct' will not
> >affect most of other members's alignment, while restoring the
> >regression.
> >
> >[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
> >Reported-by: kernel test robot <[email protected]>
> >Signed-off-by: Feng Tang <[email protected]>
> >---
> > include/linux/mm_types.h | 15 ++++++++-------
> > 1 file changed, 8 insertions(+), 7 deletions(-)
> >
> >diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >index 5aacc1c..5b55f88 100644
> >--- a/include/linux/mm_types.h
> >+++ b/include/linux/mm_types.h
> >@@ -445,13 +445,6 @@ struct mm_struct {
> > */
> > atomic_t has_pinned;
> >- /**
> >- * @write_protect_seq: Locked when any thread is write
> >- * protecting pages mapped by this mm to enforce a later COW,
> >- * for instance during page table copying for fork().
> >- */
> >- seqcount_t write_protect_seq;
> >-
> > #ifdef CONFIG_MMU
> > atomic_long_t pgtables_bytes; /* PTE page table pages */
> > #endif
> >@@ -480,7 +473,15 @@ struct mm_struct {
> > unsigned long stack_vm; /* VM_STACK */
> > unsigned long def_flags;
> >+ /**
> >+ * @write_protect_seq: Locked when any thread is write
> >+ * protecting pages mapped by this mm to enforce a later COW,
> >+ * for instance during page table copying for fork().
> >+ */
> >+ seqcount_t write_protect_seq;
> >+
> > spinlock_t arg_lock; /* protect the below fields */
> >+
> > unsigned long start_code, end_code, start_data, end_data;
> > unsigned long start_brk, brk, start_stack;
> > unsigned long arg_start, arg_end, env_start, env_end;
> >

2021-06-06 10:48:05

by Feng Tang

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Fri, Jun 04, 2021 at 10:57:44AM -0700, Linus Torvalds wrote:
> On Fri, Jun 4, 2021 at 12:52 AM Feng Tang <[email protected]> wrote:
> >
> > On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> > > >
> > > > The perf data doesn't even mention any of the GUP paths, and on the
> > > > pure fork path the biggest impact would be:
> > > >
> > > > (a) maybe "struct mm_struct" changed in size or had a different cache layout
> > >
> > > Yes, this seems to be the cause of the regression.
> > >
> > > The test case is many thread are doing map/unmap at the same time,
> > > so the process's rw_semaphore 'mmap_lock' is highly contended.
> > >
> > > Before the patch (with 0day's kconfig), the mmap_lock is separated
> > > into 2 cachelines, the 'count' is in one line, and the other members
> > > sit in the next line, so it luckily avoid some cache bouncing. After
> > > the patch, the 'mmap_lock' is pushed into one cacheline, which may
> > > cause the regression.
>
> Ok, thanks for following up on this.
>
> > We've tried some patch, which can restore the regerssion. As the
> > newly added member 'write_protect_seq' is 4 bytes long, and putting
> > it into an existing 4 bytes long hole can restore the regeression,
> > while not affecting most of other member's alignment. Please review
> > the following patch, thanks!
>
> The patch looks fine to me.
>
> At the same time, I do wonder if maybe it would be worth exploring if
> it's a good idea to perhaps move the 'mmap_sem' thing instead.
>
> Or at least add a big comment. It's not clear to me exactly _which_
> other fields are the ones that are so hot that the contention on
> mmap_sem then causes even more cacheline bouncing.
>
> For example, is it either
>
> (a) we *want* the mmap_sem to be in the first 128-byte region,
> because then when we get the mmap_sem, the other fields in that same
> cacheline are hot
>
> OR
>
> (b) we do *not* want mmap_sem to be in the *second* 128-byte region,
> because there is something *else* in that region that is touched
> independently of mmap_sem that is very very hot and now you get even
> more bouncing?
>
> but I can't tell which one it is.

Yes, it's better to get more details on which fields are hottest;
the following are some perf data details. Let me know if more info
is needed.

* perf-stat: we see more cache-misses

32158577 ± 7% +9.0% 35060321 ± 6% perf-stat.ps.cache-misses
69612918 ± 6% +11.2% 77382336 ± 5% perf-stat.ps.cache-references


* perf profile: 'mmap_lock' is the hottest, though the mmap/munmap
ratio changes from 72:24 to 52:45, and this is the part that I don't
understand

- old kernel (without commit 57efa1fe59)

96.60% 0.19% [kernel.kallsyms] [k] down_write_killable - -
72.46% down_write_killable;vm_mmap_pgoff;ksys_mmap_pgoff;do_syscall_64;entry_SYSCALL_64_after_hwframe;__mmap
24.14% down_write_killable;__vm_munmap;__x64_sys_munmap;do_syscall_64;entry_SYSCALL_64_after_hwframe;__munmap

- new kernel

96.60% 0.16% [kernel.kallsyms] [k] down_write_killable - -
51.85% down_write_killable;vm_mmap_pgoff;ksys_mmap_pgoff;do_syscall_64;entry_SYSCALL_64_after_hwframe;__mmap
44.74% down_write_killable;__vm_munmap;__x64_sys_munmap;do_syscall_64;entry_SYSCALL_64_after_hwframe;__munmap


* perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
data structure change

- old kernel

- first cacheline
mmap_lock->count (75%)
mm->mapcount (14%)

- second cacheline
mmap_lock->owner (97%)

- new kernel

mainly in the cacheline of 'mmap_lock'

mmap_lock->count (~2%)
mmap_lock->owner (95%)

I also attached the reduced pahole (pah) and perf-c2c logs for further
checking. (The absolute HITM event counts can be ignored, as the
recording time for the new/old kernels may be different.)


> It would be great to have a comment in the code - and in the commit
> message - about exactly which fields are the criticial ones. Because I
> doubt it is 'write_protect_seq' itself that matters at all.
>
> If it's "mmap_sem should be close to other commonly used fields",
> maybe we should just move mmap_sem upwards in the structure?

Ok, will add more comments if the patch is still fine with
the above updated info.

Thanks,
Feng

> Linus


Attachments:
(No filename) (4.41 kB)
pah_new.log (5.60 kB)
pah_old.log (5.49 kB)
c2c_new.log (36.65 kB)
c2c_old.log (38.80 kB)

2021-06-06 19:24:40

by Linus Torvalds

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

[ Adding Waiman Long to the participants, because this seems to be a
very specific cacheline alignment behavior of rwsems, maybe Waiman has
some comments ]

On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <[email protected]> wrote:
>
> * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
> data structure change
>
> - old kernel
>
> - first cacheline
> mmap_lock->count (75%)
> mm->mapcount (14%)
>
> - second cacheline
> mmap_lock->owner (97%)
>
> - new kernel
>
> mainly in the cacheline of 'mmap_lock'
>
> mmap_lock->count (~2%)
> mmap_lock->owner (95%)

Oooh.

It looks like pretty much all the contention is on mmap_lock, and the
difference is that the old kernel just _happened_ to split the
mmap_lock rwsem at *exactly* the right place.

The rw_semaphore structure looks like this:

struct rw_semaphore {
atomic_long_t count;
atomic_long_t owner;
struct optimistic_spin_queue osq; /* spinner MCS lock */
...

and before the addition of the 'write_protect_seq' field, the mmap_sem
was at offset 120 in 'struct mm_struct'.

Which meant that count and owner were in two different cachelines, and
then when you have contention and spend time in
rwsem_down_write_slowpath(), this is probably *exactly* the kind of
layout you want.

Because first the rwsem_write_trylock() will do a cmpxchg on the first
cacheline (for the optimistic fast-path), and then in the case of
contention, rwsem_down_write_slowpath() will just access the second
cacheline.

Which is probably just optimal for a load that spends a lot of time
contended - new waiters touch that first cacheline, and then they
queue themselves up on the second cacheline. Waiman, does that sound
believable?
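
(Very roughly, the two halves of the write-acquire path touch those two
words like this; these are simplified sketches, not the real rwsem
implementation, though count, owner, RWSEM_WRITER_LOCKED and current
are the upstream names:)

/* uncontended write acquire: one cmpxchg on sem->count, which in the
 * old layout was the last word of its cacheline */
static inline bool write_trylock_sketch(struct rw_semaphore *sem)
{
	long old = 0;	/* RWSEM_UNLOCKED_VALUE */

	if (atomic_long_try_cmpxchg_acquire(&sem->count, &old,
					    RWSEM_WRITER_LOCKED)) {
		atomic_long_set(&sem->owner, (long)current);
		return true;
	}
	return false;
}

/* contended path: the optimistic spinner mostly re-reads sem->owner,
 * which in the old layout lived in the following cacheline */
static void spin_on_owner_sketch(struct rw_semaphore *sem)
{
	/* ignoring the flag bits kept in the low bits of owner */
	while (atomic_long_read(&sem->owner))
		cpu_relax();
}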

Anyway, I'm certainly ok with the patch that just moves
'write_protect_seq' down, it might be worth commenting about how this
is about some very special cache layout of the mmap_sem part of the
structure.

That said, this means that it all is very subtle dependent on a lot of
kernel config options, and I'm not sure how relevant the exact kernel
options are that the test robot has been using. But even if this is
just a "kernel test robot reports", I think it's an interesting case
and worth a comment for when this happens next time...

Linus

2021-06-06 22:20:11

by Waiman Long

Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On 6/6/21 3:20 PM, Linus Torvalds wrote:
> [ Adding Waiman Long to the participants, because this seems to be a
> very specific cacheline alignment behavior of rwsems, maybe Waiman has
> some comments ]
>
> On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <[email protected]> wrote:
>> * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
>> data structure change
>>
>> - old kernel
>>
>> - first cacheline
>> mmap_lock->count (75%)
>> mm->mapcount (14%)
>>
>> - second cacheline
>> mmap_lock->owner (97%)
>>
>> - new kernel
>>
>> mainly in the cacheline of 'mmap_lock'
>>
>> mmap_lock->count (~2%)
>> mmap_lock->owner (95%)
> Oooh.
>
> It looks like pretty much all the contention is on mmap_lock, and the
> difference is that the old kernel just _happened_ to split the
> mmap_lock rwsem at *exactly* the right place.
>
> The rw_semaphore structure looks like this:
>
> struct rw_semaphore {
> atomic_long_t count;
> atomic_long_t owner;
> struct optimistic_spin_queue osq; /* spinner MCS lock */
> ...
>
> and before the addition of the 'write_protect_seq' field, the mmap_sem
> was at offset 120 in 'struct mm_struct'.
>
> Which meant that count and owner were in two different cachelines, and
> then when you have contention and spend time in
> rwsem_down_write_slowpath(), this is probably *exactly* the kind of
> layout you want.
>
> Because first the rwsem_write_trylock() will do a cmpxchg on the first
> cacheline (for the optimistic fast-path), and then in the case of
> contention, rwsem_down_write_slowpath() will just access the second
> cacheline.
>
> Which is probably just optimal for a load that spends a lot of time
> contended - new waiters touch that first cacheline, and then they
> queue themselves up on the second cacheline. Waiman, does that sound
> believable?

Yes, I think so.

The count field is accessed when a task tries to acquire the rwsem or
when an owner releases the lock. If the trylock fails, the writer will go
into the slowpath doing optimistic spinning on the owner field. As a
result, a lot of reads to owner are issued relative to the read/write of
count. Normally, there should only be one spinner that has the OSQ lock
spinning on owner and the 9% performance degradation seems a bit high to
me. In the rare case that the head waiter in the wait queue sets the
handoff flag, the waiter may also spin on owner causing a bit more
contention on the owner cacheline. I will do further investigation on
this possibility when I have time.

Cheers,
Longman
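
[ A heavily simplified userspace sketch (C11 atomics) of the access
pattern described above -- a toy stand-in for the real rwsem, only
meant to show which field each writer path hammers: ]

#include <stdatomic.h>
#include <stdbool.h>

struct toy_rwsem {
	_Atomic long count;	/* touched on every acquire/release attempt */
	_Atomic long owner;	/* read in a tight loop by a spinning writer */
};

bool toy_write_trylock(struct toy_rwsem *sem, long me)
{
	long expected = 0;

	/* fast path: a single cmpxchg on the 'count' field */
	if (!atomic_compare_exchange_strong(&sem->count, &expected, 1))
		return false;
	atomic_store(&sem->owner, me);
	return true;
}

void toy_write_lock(struct toy_rwsem *sem, long me)
{
	while (!toy_write_trylock(sem, me)) {
		/* "optimistic spinning": keep re-reading 'owner' until the
		 * current holder appears to have released the lock */
		while (atomic_load(&sem->owner) != 0)
			;	/* cpu_relax() in the real thing */
	}
}

void toy_write_unlock(struct toy_rwsem *sem)
{
	atomic_store(&sem->owner, 0);
	atomic_store(&sem->count, 0);
}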

2021-06-07 06:07:13

by Feng Tang

[permalink] [raw]
Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Sun, Jun 06, 2021 at 12:20:46PM -0700, Linus Torvalds wrote:
> [ Adding Waiman Long to the participants, because this seems to be a
> very specific cacheline alignment behavior of rwsems, maybe Waiman has
> some comments ]
>
> On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <[email protected]> wrote:
> >
> > * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
> > data structure change
> >
> > - old kernel
> >
> > - first cacheline
> > mmap_lock->count (75%)
> > mm->mapcount (14%)
> >
> > - second cacheline
> > mmap_lock->owner (97%)
> >
> > - new kernel
> >
> > mainly in the cacheline of 'mmap_lock'
> >
> > mmap_lock->count (~2%)
> > mmap_lock->owner (95%)
>
> Oooh.
>
> It looks like pretty much all the contention is on mmap_lock, and the
> difference is that the old kernel just _happened_ to split the
> mmap_lock rwsem at *exactly* the right place.
>
> The rw_semaphore structure looks like this:
>
> struct rw_semaphore {
> atomic_long_t count;
> atomic_long_t owner;
> struct optimistic_spin_queue osq; /* spinner MCS lock */
> ...
>
> and before the addition of the 'write_protect_seq' field, the mmap_sem
> was at offset 120 in 'struct mm_struct'.
>
> Which meant that count and owner were in two different cachelines, and
> then when you have contention and spend time in
> rwsem_down_write_slowpath(), this is probably *exactly* the kind of
> layout you want.
>
> Because first the rwsem_write_trylock() will do a cmpxchg on the first
> cacheline (for the optimistic fast-path), and then in the case of
> contention, rwsem_down_write_slowpath() will just access the second
> cacheline.
>
> Which is probably just optimal for a load that spends a lot of time
> contended - new waiters touch that first cacheline, and then they
> queue themselves up on the second cacheline. Waiman, does that sound
> believable?
>
> Anyway, I'm certainly ok with the patch that just moves
> 'write_protect_seq' down, it might be worth commenting about how this
> is about some very special cache layout of the mmap_sem part of the
> structure.
>
> That said, this means that it all is very subtly dependent on a lot of
> kernel config options, and I'm not sure how relevant the exact kernel
> options are that the test robot has been using. But even if this is
> just a "kernel test robot reports", I think it's an interesting case
> and worth a comment for when this happens next time...

There are 3 kernel config options controlling fields placed before
'mmap_lock' (inside 'mm_struct'):

CONFIG_MMU
CONFIG_MEMBARRIER
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES

0day's default kernel config is similar to RHEL-8.3's, which has all
three of these enabled. IIUC, the first 2 options are 'y' in many common
configs, while 'CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES' is only available
on x86 systems.
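
[ A toy illustration of that config dependence -- standalone C on a
64-bit (LP64) build, with a made-up CONFIG_TOY_OPTION macro standing
in for the options above; build with and without -DCONFIG_TOY_OPTION
to see the later field's offset (and cacheline split) move: ]

#include <stdio.h>
#include <stddef.h>

struct toy_mm {
	char earlier_fields[116];	/* stand-in for everything earlier */
#ifdef CONFIG_TOY_OPTION
	int config_gated_field;		/* present only when the option is set */
#endif
	int seq;			/* a 4-byte field, like seqcount_t */
	struct { long count; long owner; } lock;	/* stand-in rwsem */
};

int main(void)
{
	size_t off = offsetof(struct toy_mm, lock);

	printf("lock at %zu: count in cacheline %zu, owner in cacheline %zu\n",
	       off, off / 64, (off + 8) / 64);
	return 0;
}

[ Without the option the toy lock lands at offset 120 and its two words
straddle a cacheline boundary; with it, the lock moves to 128 and they
share one -- the same 120-vs-128 situation discussed above. ]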

Please review the updated patch, thanks

- Feng


From cbdbe70fb9e5bab2988d645c6f0f614d51b2e386 Mon Sep 17 00:00:00 2001
From: Feng Tang <[email protected]>
Date: Fri, 4 Jun 2021 15:20:57 +0800
Subject: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct

0day robot reports a 9.2% regression for will-it-scale mmap1 test
case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast
from racing with COW during fork").

Further debugging shows the regression is due to the commit changing
the offset of the hot field 'mmap_lock' inside 'struct mm_struct',
and thus changing its cache alignment.

From the perf data, the contention on 'mmap_lock' is very severe and
takes around 95% of the cpu cycles; it is an rw_semaphore:

struct rw_semaphore {
	atomic_long_t count;			/* 8 bytes */
	atomic_long_t owner;			/* 8 bytes */
	struct optimistic_spin_queue osq;	/* spinner MCS lock */
	...

Before commit 57efa1fe5957 adds the 'write_protect_seq', it
happens to have a very optimal cache alignment layout, as
Linus explained:

"and before the addition of the 'write_protect_seq' field, the
mmap_sem was at offset 120 in 'struct mm_struct'.

Which meant that count and owner were in two different cachelines,
and then when you have contention and spend time in
rwsem_down_write_slowpath(), this is probably *exactly* the kind
of layout you want.

Because first the rwsem_write_trylock() will do a cmpxchg on the
first cacheline (for the optimistic fast-path), and then in the
case of contention, rwsem_down_write_slowpath() will just access
the second cacheline.

Which is probably just optimal for a load that spends a lot of
time contended - new waiters touch that first cacheline, and then
they queue themselves up on the second cacheline."

After the commit, the rw_semaphore is at offset 128, which means the
'count' and 'owner' fields are now in the same cacheline, which causes
more cache bouncing.

Currently there are "#ifdef CONFIG_XXX" blocks before 'mmap_lock'
which will affect its offset:

CONFIG_MMU
CONFIG_MEMBARRIER
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES

The layout above is for a 64-bit system with 0day's default kernel
config (similar to RHEL-8.3's config), in which all these 3 options
are 'y'. The layout can vary with different kernel configs.

Re-laying out a structure is usually a double-edged sword, as it can
help one case but hurt other cases. For this case, one solution is:
as the newly added 'write_protect_seq' is a 4-byte seqcount_t (when
CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4-byte hole
in 'mm_struct' will not change other fields' alignment, while still
fixing the regression.

[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
---
include/linux/mm_types.h | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c..cba6022 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,13 +445,6 @@ struct mm_struct {
 		 */
 		atomic_t has_pinned;

-		/**
-		 * @write_protect_seq: Locked when any thread is write
-		 * protecting pages mapped by this mm to enforce a later COW,
-		 * for instance during page table copying for fork().
-		 */
-		seqcount_t write_protect_seq;
-
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
@@ -460,6 +453,18 @@ struct mm_struct {
 		spinlock_t page_table_lock;	/* Protects page tables and some
 						 * counters
 						 */
+		/*
+		 * With some kernel config, the current mmap_lock's offset
+		 * inside 'mm_struct' is at 0x120, which is very optimal, as
+		 * its two hot fields 'count' and 'owner' sit in 2 different
+		 * cachelines, and when mmap_lock is highly contended, both
+		 * of the 2 fields will be accessed frequently, current layout
+		 * will help to reduce cache bouncing.
+		 *
+		 * So please be careful with adding new fields before
+		 * mmap_lock, which can easily push the 2 fields into one
+		 * cacheline.
+		 */
 		struct rw_semaphore mmap_lock;

 		struct list_head mmlist;	/* List of maybe swapped mm's. These
@@ -480,7 +485,15 @@ struct mm_struct {
 		unsigned long stack_vm;		/* VM_STACK */
 		unsigned long def_flags;

+		/**
+		 * @write_protect_seq: Locked when any thread is write
+		 * protecting pages mapped by this mm to enforce a later COW,
+		 * for instance during page table copying for fork().
+		 */
+		seqcount_t write_protect_seq;
+
 		spinlock_t arg_lock;		/* protect the below fields */
+
 		unsigned long start_code, end_code, start_data, end_data;
 		unsigned long start_brk, brk, start_stack;
 		unsigned long arg_start, arg_end, env_start, env_end;
--
2.7.4
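
[ The "reuse an existing hole" reasoning in the patch above can be
checked with a toy example -- standalone C11 with made-up field names,
not the real mm_struct: a 4-byte member that lands in padding which was
already there does not shift any later field. ]

#include <stddef.h>

struct toy_before {
	int a;				/* 4 bytes, then 4 bytes of padding */
	_Alignas(8) long long lock_ish;	/* 8-byte aligned, at offset 8 */
};

struct toy_after {
	int a;				/* 4 bytes */
	int new_seq;			/* fills the former padding hole */
	_Alignas(8) long long lock_ish;	/* still at offset 8 */
};

_Static_assert(offsetof(struct toy_before, lock_ish) ==
	       offsetof(struct toy_after, lock_ish),
	       "filling the hole must not move later fields");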

2021-06-08 00:08:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

On Sun, Jun 6, 2021 at 11:06 PM Feng Tang <[email protected]> wrote:
>
> Please review the updated patch, thanks

Looks good to me. Thanks,

Linus