LinuxLists.cc - [LKP] [fs/locks] 83b381078b: will-it-scale.per_thread

2018-11-27 06:03:42

Subject: [LKP] [fs/locks] 83b381078b: will-it-scale.per_thread_ops -62.5% regression

Greeting,

FYI, we

commit: https://git.kernel.org/cgit/linux/kernel/git/jlayton/linux.git locks-next
machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
href="https://github.com/antonblanchard/will-it-scale">https://github.com/antonblanchard/will-it-scale
to that, the commit also has significant impact on the following tests:
--+-----------------------------------------------------------------------+
change | will-it-scale: will-it-scale.per_thread_ops -65.5% regression |
| 72 threads Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz with 128G memory |
| cpufreq_governor=performance |
| mode=thread |
| nr_task=100% |
| test=lock1 |
| ucode=0x3d |
--+-----------------------------------------------------------------------+
--------------------------------------------------------------------------------->
href="https://github.com/intel/lkp-tests.git">https://github.com/intel/lkp-tests.git
bin/lkp install job.yaml # job file is attached in this email
========================================================================
governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
ce/x86_64-rhel-7.2/thread/16/debian-x86_64-2018-04-03.cgz/lkp-bdw-ep3d/lock1/will-it-scale
("fs/locks: allow a lock request to block other requests.")
("fs/locks: always delete_block after waiting.")
83b381078b5ecab098ebf6bc95
--------------------------
%change %stddev
| \
-62.5% 140024 will-it-scale.per_thread_ops
+2.9% 4736 will-it-scale.time.system_time
-62.9% 79.47 will-it-scale.time.user_time
-62.5% 2240400 will-it-scale.workload
+1.1% 688.54 boot-time.idle
-1.8 1.20 ± 2% mpstat.cpu.usr%
-26.6% 169168 ± 26% softirqs.RCU
-2.5% 126.96 turbostat.PkgWatt
-3.0% 8.51 turbostat.RAMWatt
+24.3% 978.96 ± 4% sched_debug.cfs_rq:/.util_est_enqueued.max
-45.2% 8953 ± 17% sched_debug.cpu.ttwu_local.max
-28.9% 2527 ± 8% sched_debug.cpu.ttwu_local.stddev
-22.5% 1.052e+12 perf-stat.branch-instructions
-0.2 0.44 ± 3% perf-stat.branch-miss-rate%
-42.7% 4.642e+09 ± 3% perf-stat.branch-misses
-1.3 43.45 perf-stat.cache-miss-rate%
-20.3% 8.018e+09 ± 2% perf-stat.cache-misses
-18.1% 1.845e+10 ± 2% perf-stat.cache-references
+36.2% 3.31 perf-stat.cpi
+0.0 0.00 ± 8% perf-stat.dTLB-load-miss-rate%
-30.3% 1.185e+12 perf-stat.dTLB-loads
-59.3% 3.337e+11 perf-stat.dTLB-stores
-65.3% 1.403e+09 perf-stat.iTLB-load-misses
-62.2% 2.223e+08 ± 17% perf-stat.iTLB-loads
-26.9% 4.489e+12 perf-stat.instructions
+110.3% 3201 perf-stat.instructions-per-iTLB-miss
-26.6% 0.30 perf-stat.ipc
-2.5 79.75 perf-stat.node-store-miss-rate%
-16.0% 1.894e+09 perf-stat.node-store-misses
+94.2% 2003878 perf-stat.path-length
-39.2 0.96 ± 8% perf-profile.calltrace.cycles-pp._raw_spin_lock.fcntl_setlk.do_fcntl.__x64_sys_fcntl.do_syscall_64
-38.3 0.00 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.fcntl_setlk.do_fcntl.__x64_sys_fcntl
-1.5 0.94 ± 11% perf-profile.calltrace.cycles-pp.locks_alloc_lock.posix_lock_inode.do_lock_file_wait.fcntl_setlk.do_fcntl
-1.4 0.83 ± 8% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64
-1.3 0.82 ± 12% perf-profile.calltrace.cycles-pp.kmem_cache_alloc.locks_alloc_lock.posix_lock_inode.do_lock_file_wait.fcntl_setlk
-1.3 0.54 ± 3% perf-profile.calltrace.cycles-pp.fput.__x64_sys_fcntl.do_syscall_64.entry_SYSCALL_64_after_hwframe
-1.3 0.77 ± 4% perf-profile.calltrace.cycles-pp.syscall_return_via_sysret
-0.9 0.27 ±100% perf-profile.calltrace.cycles-pp.locks_alloc_lock.fcntl_setlk.do_fcntl.__x64_sys_fcntl.do_syscall_64
+8.7 76.49 ± 8% perf-profile.calltrace.cycles-pp.do_fcntl.__x64_sys_fcntl.do_syscall_64.entry_SYSCALL_64_after_hwframe
+9.3 76.09 ± 8% perf-profile.calltrace.cycles-pp.fcntl_setlk.do_fcntl.__x64_sys_fcntl.do_syscall_64.entry_SYSCALL_64_after_hwframe
+46.1 46.09 ± 8% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.locks_delete_block.do_lock_file_wait.fcntl_setlk
+47.0 47.01 ± 8% perf-profile.calltrace.cycles-pp._raw_spin_lock.locks_delete_block.do_lock_file_wait.fcntl_setlk.do_fcntl
+47.1 47.10 ± 8% perf-profile.calltrace.cycles-pp.locks_delete_block.do_lock_file_wait.fcntl_setlk.do_fcntl.__x64_sys_fcntl
+49.8 73.98 ± 8% perf-profile.calltrace.cycles-pp.do_lock_file_wait.fcntl_setlk.do_fcntl.__x64_sys_fcntl.do_syscall_64
-2.3 1.44 ± 11% perf-profile.children.cycles-pp.locks_alloc_lock
-2.0 1.27 ± 12% perf-profile.children.cycles-pp.kmem_cache_alloc
-1.5 0.90 ± 4% perf-profile.children.cycles-pp.syscall_return_via_sysret
-1.4 0.83 ± 8% perf-profile.children.cycles-pp.entry_SYSCALL_64
-1.3 0.55 ± 3% perf-profile.children.cycles-pp.fput
-1.1 0.29 ± 9% perf-profile.children.cycles-pp.__fget_light
-1.0 0.60 ± 6% perf-profile.children.cycles-pp.file_has_perm
-0.9 0.22 ± 11% perf-profile.children.cycles-pp.__fget
-0.7 0.45 ± 14% perf-profile.children.cycles-pp.memset_erms
-0.7 0.42 ± 11% perf-profile.children.cycles-pp.security_file_lock
-0.6 0.40 ± 7% perf-profile.children.cycles-pp.security_file_fcntl
-0.5 0.34 ± 7% perf-profile.children.cycles-pp._copy_from_user
-0.5 0.30 ± 7% perf-profile.children.cycles-pp.avc_has_perm
-0.4 0.54 ± 15% perf-profile.children.cycles-pp.kmem_cache_free
-0.4 0.26 ± 13% perf-profile.children.cycles-pp.___might_sleep
-0.2 0.15 ± 3% perf-profile.children.cycles-pp.copy_user_generic_unrolled
-0.2 0.17 ± 18% perf-profile.children.cycles-pp.__might_sleep
-0.2 0.14 ± 11% perf-profile.children.cycles-pp.locks_dispose_list
-0.2 0.12 ± 7% perf-profile.children.cycles-pp.locks_delete_lock_ctx
-0.2 0.10 ± 12% perf-profile.children.cycles-pp._cond_resched
-0.2 0.12 ± 10% perf-profile.children.cycles-pp.selinux_file_lock
-0.1 0.11 ± 13% perf-profile.children.cycles-pp.__might_fault
-0.1 0.08 ± 5% perf-profile.children.cycles-pp.inode_has_perm
-0.1 0.08 ± 13% perf-profile.children.cycles-pp.locks_unlink_lock_ctx
-0.1 0.03 ±100% perf-profile.children.cycles-pp.rcu_all_qs
-0.1 0.06 ± 11% perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
-0.1 0.03 ±100% perf-profile.children.cycles-pp.should_failslab
-0.1 0.05 ± 8% perf-profile.children.cycles-pp.__x86_indirect_thunk_rax
-0.1 0.03 ±100% perf-profile.children.cycles-pp.selinux_file_fcntl
-0.1 0.04 ± 57% perf-profile.children.cycles-pp.flock64_to_posix_lock
+8.7 76.51 ± 8% perf-profile.children.cycles-pp.do_fcntl
+9.3 76.11 ± 8% perf-profile.children.cycles-pp.fcntl_setlk
+14.2 72.52 ± 8% perf-profile.children.cycles-pp._raw_spin_lock
+15.5 69.06 ± 8% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
+47.1 47.10 ± 8% perf-profile.children.cycles-pp.locks_delete_block
+49.8 73.99 ± 8% perf-profile.children.cycles-pp.do_lock_file_wait
-1.5 0.90 ± 4% perf-profile.self.cycles-pp.syscall_return_via_sysret
-1.4 0.83 ± 8% perf-profile.self.cycles-pp.entry_SYSCALL_64
-1.3 0.55 ± 3% perf-profile.self.cycles-pp.fput
-1.2 3.45 ± 7% perf-profile.self.cycles-pp._raw_spin_lock
-0.8 0.21 ± 14% perf-profile.self.cycles-pp.__fget
-0.8 0.46 ± 11% perf-profile.self.cycles-pp.kmem_cache_alloc
-0.7 0.43 ± 14% perf-profile.self.cycles-pp.memset_erms
-0.5 0.30 ± 7% perf-profile.self.cycles-pp.avc_has_perm
-0.4 0.46 ± 13% perf-profile.self.cycles-pp.kmem_cache_free
-0.4 0.21 ± 5% perf-profile.self.cycles-pp.fcntl_setlk
-0.4 0.26 ± 12% perf-profile.self.cycles-pp.___might_sleep
-0.4 0.24 ± 14% perf-profile.self.cycles-pp.posix_lock_inode
-0.3 0.20 ± 11% perf-profile.self.cycles-pp.file_has_perm
-0.3 0.14 ± 8% perf-profile.self.cycles-pp.locks_alloc_lock
-0.3 0.15 ± 13% perf-profile.self.cycles-pp.__x64_sys_fcntl
-0.2 0.14 ± 6% perf-profile.self.cycles-pp.copy_user_generic_unrolled
-0.2 0.08 ± 10% perf-profile.self.cycles-pp.__fget_light
-0.2 0.15 ± 18% perf-profile.self.cycles-pp.__might_sleep
-0.2 0.11 ± 4% perf-profile.self.cycles-pp.do_syscall_64
-0.1 0.11 ± 11% perf-profile.self.cycles-pp.selinux_file_lock
-0.1 0.07 ± 17% perf-profile.self.cycles-pp.do_lock_file_wait
-0.1 0.08 ± 10% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
-0.1 0.08 ± 5% perf-profile.self.cycles-pp.inode_has_perm
-0.1 0.07 ± 5% perf-profile.self.cycles-pp.do_fcntl
-0.1 0.05 ± 8% perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
-0.1 0.03 ±100% perf-profile.self.cycles-pp._copy_from_user
-0.1 0.04 ± 58% perf-profile.self.cycles-pp._cond_resched
-0.1 0.15 ± 16% perf-profile.self.cycles-pp.locks_free_lock
+15.4 68.84 ± 8% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath

will-it-scale.per_thread_ops

+-+----------------------------------------------------------------+
|
+..+.. .+..+.. .+..+..+...+..+..+.. +.. .+.. ..|
+. .. +. +..+ |
+ + : |
: : |
: : |
: : |
: : |
: : |
O O O O O O O O O O O O O O O O O :O: O O O O O
: : |
: : |
: |
+-+----------------------------------------------------------------+

will-it-scale.workload

+-+-----------------------------------------------------------------+
+...+.. .+..+..+ + +.. |
.+..+..+. + + + .. ..|
+. + + + + +..+ |
+ + : |
: : |
: : |
: : |
: : |
O O : : O O |
O O O O O O O O O : O: O O O
: : |
: : |
: |
+-+-----------------------------------------------------------------+

will-it-scale.time.user_time

-----------------------------------------------------------+
.+.. .+.. +.. |
.+...+..+..+. +.. +.. .. . ..|
+. +. . .. + +..+ |
+ + : |
: : |
: : |
: : |
: : |
: : |
O O O O O O O O O O O O O :O: O O O O O
: : |
: : |
: |
-----------------------------------------------------------+

will-it-scale.time.system_time

-----------------------------------------------------------+
...O..O..O..O..O..O..O...O..O..O..O..O..O O O...O..O..O..O
: : |
: : |
: : |
: : |
: : |
: : |
: : |
: : |
: : |
: : |
: |
: |
-----------------------------------------------------------+

**********************************************************************************
72 threads Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz with 128G memory
========================================================================
governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
ce/x86_64-rhel-7.2/thread/100%/debian-x86_64-2018-04-03.cgz/lkp-hsw-ep4/lock1/will-it-scale/0x3d
("fs/locks: allow a lock request to block other requests.")
("fs/locks: always delete_block after waiting.")
83b381078b5ecab098ebf6bc95
--------------------------
%change %stddev
| \
-65.5% 27921 will-it-scale.per_thread_ops
-2.6% 134943 will-it-scale.time.involuntary_context_switches
+1.6% 7964 will-it-scale.time.maximum_resident_set_size
-64.8% 94.41 will-it-scale.time.user_time
-65.5% 2010357 will-it-scale.workload
-0.8 0.45 ± 4% mpstat.cpu.usr%
-15.7% 82.29 ± 5% sched_debug.cpu.ttwu_count.min
-2.3% 250.49 ± 2% turbostat.PkgWatt
-5.2% 1062 vmstat.system.cs
-12.2% 27.67 ± 10% boot-time.boot
-12.5% 22.59 ± 12% boot-time.dhcp
-15.4% 1517 ± 9% boot-time.idle
+34.7% 150535 ± 9% numa-meminfo.node0.Active
+34.7% 150533 ± 9% numa-meminfo.node0.Active(anon)
+44.8% 132283 ± 15% numa-meminfo.node0.AnonPages
+34.7% 37633 ± 9% numa-vmstat.node0.nr_active_anon
+44.8% 33067 ± 15% numa-vmstat.node0.nr_anon_pages
+34.7% 37633 ± 9% numa-vmstat.node0.nr_zone_active_anon
+11.1% 1647 ± 5% slabinfo.UNIX.active_objs
+11.1% 1647 ± 5% slabinfo.UNIX.num_objs
-28.9% 283.50 ± 6% slabinfo.kmem_cache.active_objs
-28.9% 283.50 ± 6% slabinfo.kmem_cache.num_objs
-25.7% 510.00 ± 5% slabinfo.kmem_cache_node.active_objs
-23.9% 560.00 ± 4% slabinfo.kmem_cache_node.num_objs
-14.5% 556.50 ± 8% slabinfo.mnt_cache.active_objs
-14.5% 556.50 ± 8% slabinfo.mnt_cache.num_objs
+19.6% 1311 ± 5% slabinfo.task_group.active_objs
+19.6% 1311 ± 5% slabinfo.task_group.num_objs
-8.2% 3.457e+12 perf-stat.branch-instructions
-0.1 0.16 ± 4% perf-stat.branch-miss-rate%
-43.1% 5.628e+09 ± 4% perf-stat.branch-misses
-13.0% 4.208e+09 perf-stat.cache-misses
-11.8% 1.019e+10 ± 4% perf-stat.cache-references
+12.4% 4.24 perf-stat.cpi
-12.7% 3.582e+12 perf-stat.dTLB-loads
+0.0 0.01 ± 11% perf-stat.dTLB-store-miss-rate%
-61.3% 3.158e+11 perf-stat.dTLB-stores
-63.0% 1.148e+09 ± 5% perf-stat.iTLB-load-misses
-60.7% 1.836e+08 ± 16% perf-stat.iTLB-loads
-10.6% 1.41e+13 perf-stat.instructions
+142.2% 12320 ± 5% perf-stat.instructions-per-iTLB-miss
-11.0% 0.24 perf-stat.ipc
+1.2 99.89 perf-stat.node-load-miss-rate%
-92.2% 2213871 ± 20% perf-stat.node-loads
-6.5 71.39 perf-stat.node-store-miss-rate%
-27.5% 1.503e+09 perf-stat.node-store-misses
+159.3% 7014271 perf-stat.path-length
-62.3 0.00 perf-profile.calltrace.cycles-pp._raw_spin_lock.fcntl_setlk.do_fcntl.__x64_sys_fcntl.do_syscall_64
-61.9 0.00 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.fcntl_setlk.do_fcntl.__x64_sys_fcntl
+0.9 99.42 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
+0.9 99.39 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
+1.0 99.34 perf-profile.calltrace.cycles-pp.__x64_sys_fcntl.do_syscall_64.entry_SYSCALL_64_after_hwframe
+1.8 99.00 perf-profile.calltrace.cycles-pp.do_fcntl.__x64_sys_fcntl.do_syscall_64.entry_SYSCALL_64_after_hwframe
+2.0 98.91 perf-profile.calltrace.cycles-pp.fcntl_setlk.do_fcntl.__x64_sys_fcntl.do_syscall_64.entry_SYSCALL_64_after_hwframe
+64.4 98.20 perf-profile.calltrace.cycles-pp.do_lock_file_wait.fcntl_setlk.do_fcntl.__x64_sys_fcntl.do_syscall_64
+64.5 64.55 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.locks_delete_block.do_lock_file_wait.fcntl_setlk
+64.9 64.89 perf-profile.calltrace.cycles-pp._raw_spin_lock.locks_delete_block.do_lock_file_wait.fcntl_setlk.do_fcntl
+64.9 64.92 perf-profile.calltrace.cycles-pp.locks_delete_block.do_lock_file_wait.fcntl_setlk.do_fcntl.__x64_sys_fcntl
-0.8 0.53 perf-profile.children.cycles-pp.locks_alloc_lock
-0.7 0.46 perf-profile.children.cycles-pp.kmem_cache_alloc
-0.5 0.28 ± 3% perf-profile.children.cycles-pp.syscall_return_via_sysret
-0.4 0.24 ± 4% perf-profile.children.cycles-pp.entry_SYSCALL_64
-0.4 0.07 perf-profile.children.cycles-pp.fput
-0.3 0.18 ± 2% perf-profile.children.cycles-pp.file_has_perm
-0.2 0.15 perf-profile.children.cycles-pp.memset_erms
-0.2 0.12 ± 4% perf-profile.children.cycles-pp.security_file_lock
-0.2 0.12 ± 3% perf-profile.children.cycles-pp.security_file_fcntl
-0.2 0.09 ± 4% perf-profile.children.cycles-pp.__fget_light
-0.2 0.09 perf-profile.children.cycles-pp.avc_has_perm
-0.2 0.10 ± 5% perf-profile.children.cycles-pp.___might_sleep
-0.1 0.07 ± 5% perf-profile.children.cycles-pp.__fget
-0.1 0.07 ± 6% perf-profile.children.cycles-pp._copy_from_user
-0.1 0.06 perf-profile.children.cycles-pp.__might_sleep
-0.1 0.04 ± 57% perf-profile.children.cycles-pp.locks_dispose_list
-0.1 0.24 ± 5% perf-profile.children.cycles-pp.kmem_cache_free
-0.0 0.10 ± 8% perf-profile.children.cycles-pp.locks_free_lock
+0.0 0.39 perf-profile.children.cycles-pp.apic_timer_interrupt
+0.9 99.44 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
+0.9 99.42 perf-profile.children.cycles-pp.do_syscall_64
+1.0 99.34 perf-profile.children.cycles-pp.__x64_sys_fcntl
+1.8 99.00 perf-profile.children.cycles-pp.do_fcntl
+2.0 98.92 perf-profile.children.cycles-pp.fcntl_setlk
+3.7 97.63 perf-profile.children.cycles-pp._raw_spin_lock
+4.0 96.79 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
+64.4 98.20 perf-profile.children.cycles-pp.do_lock_file_wait
+64.9 64.93 perf-profile.children.cycles-pp.locks_delete_block
-0.5 0.28 ± 3% perf-profile.self.cycles-pp.syscall_return_via_sysret
-0.4 0.24 ± 4% perf-profile.self.cycles-pp.entry_SYSCALL_64
-0.4 0.07 ± 6% perf-profile.self.cycles-pp.fput
-0.3 0.84 ± 2% perf-profile.self.cycles-pp._raw_spin_lock
-0.3 0.17 ± 4% perf-profile.self.cycles-pp.kmem_cache_alloc
-0.2 0.15 ± 3% perf-profile.self.cycles-pp.memset_erms
-0.2 0.08 ± 5% perf-profile.self.cycles-pp.posix_lock_inode
-0.2 0.09 ± 4% perf-profile.self.cycles-pp.avc_has_perm
-0.2 0.09 perf-profile.self.cycles-pp.___might_sleep
-0.1 0.07 perf-profile.self.cycles-pp.__fget
-0.1 0.06 ± 6% perf-profile.self.cycles-pp.locks_alloc_lock
-0.1 0.06 ± 7% perf-profile.self.cycles-pp.file_has_perm
-0.1 0.07 ± 10% perf-profile.self.cycles-pp.fcntl_setlk
-0.1 0.03 ±100% perf-profile.self.cycles-pp.do_syscall_64
-0.1 0.05 ± 8% perf-profile.self.cycles-pp.__might_sleep
-0.1 0.21 ± 5% perf-profile.self.cycles-pp.kmem_cache_free
-0.1 0.05 ± 8% perf-profile.self.cycles-pp.__x64_sys_fcntl
-0.0 0.06 ± 7% perf-profile.self.cycles-pp.locks_free_lock
+3.9 96.37 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
have been estimated based on internal Intel analysis and are provided
purposes only. Any difference in system hardware or software
configuration may affect actual performance.

Attachments:

(No filename) (29.82 kB)
config-4.20.0-rc2-00008-g83b3810 (171.30 kB)
job-script (6.92 kB)
job.yaml (4.60 kB)
reproduce (318.00 B)
Download all attachments

2018-11-27 17:48:22

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [LKP] [fs/locks] 83b381078b: will-it-scale.per_thread_ops -62.5% regression

Thanks for the report!

On Tue, Nov 27, 2018 at 02:01:02PM +0800, kernel test robot wrote:
> FYI, we noticed a -62.5% regression of will-it-scale.per_thread_ops due to commit:
>
>
> commit: 83b381078b5ecab098ebf6bc9548bb32af1dbf31 ("fs/locks: always delete_block after waiting.")
> https://git.kernel.org/cgit/linux/kernel/git/jlayton/linux.git locks-next
>
> in testcase: will-it-scale
> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
> with following parameters:
>
> nr_task: 16
> mode: thread
> test: lock1

So I guess it's doing this, uncontended file lock/unlock?:

https://github.com/antonblanchard/will-it-scale/blob/master/tests/lock1.c

Each thread is repeatedly locking and unlocking a file that is only used
by that thread.

By the way, what's the X-axis on these graphs? (Or the y-axis, for that
matter?)

--b.

> will-it-scale.per_thread_ops
>
> 450000 +-+----------------------------------------------------------------+
> | |
> 400000 +-+ +..+.. .+..+.. .+..+..+...+..+..+.. +.. .+.. ..|
> 350000 +-+ .. +. +. .. +. +..+ |
> | + + + : |
> 300000 +-+ : : |
> 250000 +-+ : : |
> | : : |
> 200000 +-+ : : |
> 150000 +-+ : : |
> O O O O O O O O O O O O O O O O O :O: O O O O O
> 100000 +-+ : : |
> 50000 +-+ : : |
> | : |
> 0 +-+----------------------------------------------------------------+
>
>
> will-it-scale.workload
>
> 7e+06 +-+-----------------------------------------------------------------+
> | +...+.. .+..+..+ + +.. |
> 6e+06 +-+ +..+.. .. .+..+..+. + + + .. ..|
> | .. + +. + + + + +..+ |
> 5e+06 +-++ + + : |
> | : : |
> 4e+06 +-+ : : |
> | : : |
> 3e+06 +-+ : : |
> | O O : : O O |
> 2e+06 O-+O O O O O O O O O O O O O O : O: O O O
> | : : |
> 1e+06 +-+ : : |
> | : |
> 0 +-+-----------------------------------------------------------------+
>
>
> will-it-scale.time.user_time
>
> 250 +-+-------------------------------------------------------------------+
> | .+.. .+.. +.. |
> |.. +...+.. .+. .+...+..+..+. +.. +.. .. . ..|
> 200 +-+ .. +. +. . .. + +..+ |
> | + + + : |
> | : : |
> 150 +-+ : : |
> | : : |
> 100 +-+ : : |
> | O O : : |
> O O O O O O O O O O O O O O O :O: O O O O O
> 50 +-+ : : |
> | : : |
> | : |
> 0 +-+-------------------------------------------------------------------+
>
>
> will-it-scale.time.system_time
>
> 5000 +-+------------------------------------------------------------------+
> 4500 O-+O..O..O...O..O..O..O..O..O..O...O..O..O..O..O..O O O...O..O..O..O
> | : : |
> 4000 +-+ : : |
> 3500 +-+ : : |
> | : : |
> 3000 +-+ : : |
> 2500 +-+ : : |
> 2000 +-+ : : |
> | : : |
> 1500 +-+ : : |
> 1000 +-+ : : |
> | : |
> 500 +-+ : |
> 0 +-+------------------------------------------------------------------+
>
>
> [*] bisect-good sample
> [O] bisect-bad sample

2018-11-27 23:22:14

by NeilBrown

[permalink] [raw]

Subject: Re: [LKP] [fs/locks] 83b381078b: will-it-scale.per_thread_ops -62.5% regression

On Tue, Nov 27 2018, J. Bruce Fields wrote:

> Thanks for the report!

Yes, thanks. I thought I had replied to the previous report of a similar
problem, but I didn't actually send that email - oops.
Though the test is the same and the regression similar, this is a
different patch. The previous report identified
fs/locks: allow a lock request to block other requests
this one identifies
fs/locks: always delete_block after waiting.

Both cause blocked_lock_lock to be taken more often.

In one case is it due to locks_move_blocks(). That can probably be
optimised to skip the lock if list_empty(&fl->fl_blocked_requests).
I'd need to double-check, but I think that is safe to check without
locking.

This one causes locks_delete_blocks() to be called more often. We now
call it even if no waiting happened at all. I suspect we can test for
that and avoid it. I'll have a look.

>
> On Tue, Nov 27, 2018 at 02:01:02PM +0800, kernel test robot wrote:
>> FYI, we noticed a -62.5% regression of will-it-scale.per_thread_ops due to commit:
>>
>>
>> commit: 83b381078b5ecab098ebf6bc9548bb32af1dbf31 ("fs/locks: always delete_block after waiting.")
>> https://git.kernel.org/cgit/linux/kernel/git/jlayton/linux.git locks-next
>>
>> in testcase: will-it-scale
>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
>> with following parameters:
>>
>> nr_task: 16
>> mode: thread
>> test: lock1
>
> So I guess it's doing this, uncontended file lock/unlock?:
>
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/lock1.c
>
> Each thread is repeatedly locking and unlocking a file that is only used
> by that thread.

Thanks for identifying that Bruce.
This would certainly be a case where locks_delete_block() is now being
called when it wasn't before.

>
> By the way, what's the X-axis on these graphs? (Or the y-axis, for that
> matter?)

A key would help. I think the X-axis is number-of-threads. y-axis
might be ops-per-second ??.

Thanks,
NeilBrown

>
> --b.
>
>> will-it-scale.per_thread_ops
>>
>> 450000 +-+----------------------------------------------------------------+
>> | |
>> 400000 +-+ +..+.. .+..+.. .+..+..+...+..+..+.. +.. .+.. ..|
>> 350000 +-+ .. +. +. .. +. +..+ |
>> | + + + : |
>> 300000 +-+ : : |
>> 250000 +-+ : : |
>> | : : |
>> 200000 +-+ : : |
>> 150000 +-+ : : |
>> O O O O O O O O O O O O O O O O O :O: O O O O O
>> 100000 +-+ : : |
>> 50000 +-+ : : |
>> | : |
>> 0 +-+----------------------------------------------------------------+
>>
>>
>> will-it-scale.workload
>>
>> 7e+06 +-+-----------------------------------------------------------------+
>> | +...+.. .+..+..+ + +.. |
>> 6e+06 +-+ +..+.. .. .+..+..+. + + + .. ..|
>> | .. + +. + + + + +..+ |
>> 5e+06 +-++ + + : |
>> | : : |
>> 4e+06 +-+ : : |
>> | : : |
>> 3e+06 +-+ : : |
>> | O O : : O O |
>> 2e+06 O-+O O O O O O O O O O O O O O : O: O O O
>> | : : |
>> 1e+06 +-+ : : |
>> | : |
>> 0 +-+-----------------------------------------------------------------+
>>
>>
>> will-it-scale.time.user_time
>>
>> 250 +-+-------------------------------------------------------------------+
>> | .+.. .+.. +.. |
>> |.. +...+.. .+. .+...+..+..+. +.. +.. .. . ..|
>> 200 +-+ .. +. +. . .. + +..+ |
>> | + + + : |
>> | : : |
>> 150 +-+ : : |
>> | : : |
>> 100 +-+ : : |
>> | O O : : |
>> O O O O O O O O O O O O O O O :O: O O O O O
>> 50 +-+ : : |
>> | : : |
>> | : |
>> 0 +-+-------------------------------------------------------------------+
>>
>>
>> will-it-scale.time.system_time
>>
>> 5000 +-+------------------------------------------------------------------+
>> 4500 O-+O..O..O...O..O..O..O..O..O..O...O..O..O..O..O..O O O...O..O..O..O
>> | : : |
>> 4000 +-+ : : |
>> 3500 +-+ : : |
>> | : : |
>> 3000 +-+ : : |
>> 2500 +-+ : : |
>> 2000 +-+ : : |
>> | : : |
>> 1500 +-+ : : |
>> 1000 +-+ : : |
>> | : |
>> 500 +-+ : |
>> 0 +-+------------------------------------------------------------------+
>>
>>
>> [*] bisect-good sample
>> [O] bisect-bad sample

Attachments:

signature.asc (847.00 B)

2018-11-28 00:56:05

by NeilBrown

[permalink] [raw]

Subject: [PATCH] locks: fix performance regressions.

The kernel test robot reported two performance regressions
caused by recent patches.
Both appear to related to the global spinlock blocked_lock_lock
being taken more often.

This patch avoids taking that lock in the cases tested.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---

Hi Jeff,
you might like to merge these back into the patches that introduced
the problem.
Or you might like me to re-send the series with these merged in,
in which case, please ask.

And a BIG thank-you to the kernel-test-robot team!!

Thanks,
NeilBrown

fs/locks.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/fs/locks.c b/fs/locks.c
index f456cd3d9d50..67519a43e27a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -444,6 +444,13 @@ static void locks_move_blocks(struct file_lock *new, struct file_lock *fl)
{
struct file_lock *f;

+ /*
+ * As ctx->flc_lock is held, new requests cannot be added to
+ * ->fl_blocked_requests, so we don't need a lock to check if it
+ * is empty.
+ */
+ if (list_empty(&fl->fl_blocked_requests))
+ return;
spin_lock(&blocked_lock_lock);
list_splice_init(&fl->fl_blocked_requests, &new->fl_blocked_requests);
list_for_each_entry(f, &fl->fl_blocked_requests, fl_blocked_member)
@@ -749,6 +756,20 @@ int locks_delete_block(struct file_lock *waiter)
{
int status = -ENOENT;

+ /*
+ * If fl_blocker is NULL, it won't be set again as this thread
+ * "owns" the lock and is the only one that might try to claim
+ * the lock. So it is safe to test fl_blocker locklessly.
+ * Also if fl_blocker is NULL, this waiter is not listed on
+ * fl_blocked_requests for some lock, so no other request can
+ * be added to the list of fl_blocked_requests for this
+ * request. So if fl_blocker is NULL, it is safe to
+ * locklessly check if fl_blocked_requests is empty. If both
+ * of these checks succeed, there is no need to take the lock.
+ */
+ if (waiter->fl_blocker == NULL &&
+ list_empty(&waiter->fl_blocked_requests))
+ return status;
spin_lock(&blocked_lock_lock);
if (waiter->fl_blocker)
status = 0;
--
2.14.0.rc0.dirty

Attachments:

signature.asc (847.00 B)

2018-11-28 09:18:29

by Chen, Rong A

[permalink] [raw]

Subject: Re: [PATCH] locks: fix performance regressions.

Hi,

On Wed, Nov 28, 2018 at 11:53:48AM +1100, NeilBrown wrote:
>
> The kernel test robot reported two performance regressions
> caused by recent patches.
> Both appear to related to the global spinlock blocked_lock_lock
> being taken more often.
>
> This patch avoids taking that lock in the cases tested.
>
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: NeilBrown <[email protected]>
> ---
>
> Hi Jeff,
> you might like to merge these back into the patches that introduced
> the problem.
> Or you might like me to re-send the series with these merged in,
> in which case, please ask.
>
> And a BIG thank-you to the kernel-test-robot team!!
>
> Thanks,
> NeilBrown
>
> fs/locks.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index f456cd3d9d50..67519a43e27a 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -444,6 +444,13 @@ static void locks_move_blocks(struct file_lock *new, struct file_lock *fl)
> {
> struct file_lock *f;
>
> + /*
> + * As ctx->flc_lock is held, new requests cannot be added to
> + * ->fl_blocked_requests, so we don't need a lock to check if it
> + * is empty.
> + */
> + if (list_empty(&fl->fl_blocked_requests))
> + return;
> spin_lock(&blocked_lock_lock);
> list_splice_init(&fl->fl_blocked_requests, &new->fl_blocked_requests);
> list_for_each_entry(f, &fl->fl_blocked_requests, fl_blocked_member)
> @@ -749,6 +756,20 @@ int locks_delete_block(struct file_lock *waiter)
> {
> int status = -ENOENT;
>
> + /*
> + * If fl_blocker is NULL, it won't be set again as this thread
> + * "owns" the lock and is the only one that might try to claim
> + * the lock. So it is safe to test fl_blocker locklessly.
> + * Also if fl_blocker is NULL, this waiter is not listed on
> + * fl_blocked_requests for some lock, so no other request can
> + * be added to the list of fl_blocked_requests for this
> + * request. So if fl_blocker is NULL, it is safe to
> + * locklessly check if fl_blocked_requests is empty. If both
> + * of these checks succeed, there is no need to take the lock.
> + */
> + if (waiter->fl_blocker == NULL &&
> + list_empty(&waiter->fl_blocked_requests))
> + return status;
> spin_lock(&blocked_lock_lock);
> if (waiter->fl_blocker)
> status = 0;
> --
> 2.14.0.rc0.dirty
>

FYI, the performance recovered back, we didn't find any regression between the two commits.

commit:
48a7a13ff3 ("locks: use properly initialized file_lock when unlocking.")
8f64e497be ("locks: fix performance regressions.")

48a7a13ff31f0728 8f64e497be9929a2d5904c39c4
---------------- --------------------------
%stddev change %stddev
\ | \
33.56 ± 3% 5% 35.30 boot-time.boot
10497 ± 3% 12% 11733 ± 4% proc-vmstat.nr_shmem
67392 68449 proc-vmstat.nr_zone_active_anon
67392 68449 proc-vmstat.nr_active_anon
16303 16206 proc-vmstat.nr_slab_reclaimable
30602 29921 proc-vmstat.nr_slab_unreclaimable
0 9e+03 9009 ± 80% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
0 6e+03 5837 ±139% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
149 ± 17% 5e+03 5457 ±137% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
175 ± 29% 4e+03 3807 ±136% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
52868 ±110% -4e+04 17482 ± 4% latency_stats.avg.max
45055 ±141% -5e+04 0 latency_stats.avg.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
227 ± 10% 1e+04 9907 ±136% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
0 9e+03 9367 ± 78% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
0 6e+03 5837 ±139% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
175 ± 29% 4e+03 3807 ±136% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
98043 ±124% -8e+04 20999 ± 27% latency_stats.max.max
90059 ±141% -9e+04 0 latency_stats.max.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
1251 ± 23% 5e+04 49005 ±137% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
0 1e+04 12061 ± 70% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
0 6e+03 5837 ±139% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
175 ± 29% 4e+03 3807 ±136% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
90111 ±141% -9e+04 0 latency_stats.sum.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe

Best Regards,
Rong Chen

2018-11-28 11:38:02

by Jeff Layton

[permalink] [raw]

Subject: Re: [PATCH] locks: fix performance regressions.

On Wed, 2018-11-28 at 11:53 +1100, NeilBrown wrote:
> The kernel test robot reported two performance regressions
> caused by recent patches.
> Both appear to related to the global spinlock blocked_lock_lock
> being taken more often.
>
> This patch avoids taking that lock in the cases tested.
>
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: NeilBrown <[email protected]>
> ---
>
> Hi Jeff,
> you might like to merge these back into the patches that introduced
> the problem.
> Or you might like me to re-send the series with these merged in,
> in which case, please ask.
>

Thanks Neil,

This looks great. I'll go ahead and toss this patch on top of the pile
in linux-next for now.

Would you mind resending the series with this patch merged in? I took a
quick stab at squashing it into the earlier patch, but there is some
churn in this area.

Maybe you can also turn that Reported-by: into a Tested-by: in the
changelog afterward?

> And a BIG thank-you to the kernel-test-robot team!!
>

Absolutely! We love you guys!

> Thanks,
> NeilBrown
>
> fs/locks.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index f456cd3d9d50..67519a43e27a 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -444,6 +444,13 @@ static void locks_move_blocks(struct file_lock *new, struct file_lock *fl)
> {
> struct file_lock *f;
>
> + /*
> + * As ctx->flc_lock is held, new requests cannot be added to
> + * ->fl_blocked_requests, so we don't need a lock to check if it
> + * is empty.
> + */
> + if (list_empty(&fl->fl_blocked_requests))
> + return;
> spin_lock(&blocked_lock_lock);
> list_splice_init(&fl->fl_blocked_requests, &new->fl_blocked_requests);
> list_for_each_entry(f, &fl->fl_blocked_requests, fl_blocked_member)
> @@ -749,6 +756,20 @@ int locks_delete_block(struct file_lock *waiter)
> {
> int status = -ENOENT;
>
> + /*
> + * If fl_blocker is NULL, it won't be set again as this thread
> + * "owns" the lock and is the only one that might try to claim
> + * the lock. So it is safe to test fl_blocker locklessly.
> + * Also if fl_blocker is NULL, this waiter is not listed on
> + * fl_blocked_requests for some lock, so no other request can
> + * be added to the list of fl_blocked_requests for this
> + * request. So if fl_blocker is NULL, it is safe to
> + * locklessly check if fl_blocked_requests is empty. If both
> + * of these checks succeed, there is no need to take the lock.
> + */
> + if (waiter->fl_blocker == NULL &&
> + list_empty(&waiter->fl_blocked_requests))
> + return status;
> spin_lock(&blocked_lock_lock);
> if (waiter->fl_blocker)
> status = 0;

--
Jeff Layton <[email protected]>