2022-03-17 10:05:27

by kernel test robot

Subject: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression



Greetings,

FYI, we noticed a -13.2% regression of will-it-scale.per_thread_ops due to commit:


commit: 6035152d8eebe16a5bb60398d3e05dc7799067b0 ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: will-it-scale
on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
with the following parameters:

nr_task: 100%
mode: thread
test: tlb_flush1
cpufreq_governor: performance
ucode: 0xd000331

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale



If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if you come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/100%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp5/tlb_flush1/will-it-scale/0xd000331

commit:
4c1ba3923e ("x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()")
6035152d8e ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()")

4c1ba3923e6c8aa7 6035152d8eebe16a5bb60398d3e
---------------- ---------------------------
%stddev %change %stddev
\ | \
823626 -13.2% 715243 will-it-scale.128.threads
1.53 -13.2% 1.32 will-it-scale.128.threads_idle
6434 -13.2% 5587 will-it-scale.per_thread_ops
823626 -13.2% 715243 will-it-scale.workload
6.342e+10 -12.9% 5.524e+10 turbostat.IRQ
13455 -10.9% 11995 vmstat.system.cs
8834119 -7.9% 8132489 vmstat.system.in
31.26 -6.6 24.66 ? 2% mpstat.cpu.all.irq%
66.51 +6.7 73.23 mpstat.cpu.all.sys%
0.30 -0.0 0.26 mpstat.cpu.all.usr%
1.282e+08 -11.2% 1.139e+08 numa-numastat.node0.local_node
1.282e+08 -11.1% 1.139e+08 numa-numastat.node0.numa_hit
1.279e+08 ? 2% -14.4% 1.095e+08 ? 2% numa-numastat.node1.local_node
1.279e+08 ? 2% -14.3% 1.096e+08 ? 2% numa-numastat.node1.numa_hit
66398782 -10.5% 59416495 numa-vmstat.node0.numa_hit
66355740 -10.6% 59333364 numa-vmstat.node0.numa_local
66322224 ? 2% -13.8% 57147061 ? 2% numa-vmstat.node1.numa_hit
65908198 ? 2% -13.9% 56773100 ? 2% numa-vmstat.node1.numa_local
13146738 +11.3% 14628057 sched_debug.cfs_rq:/.min_vruntime.avg
10371412 ? 3% +12.7% 11683859 ? 3% sched_debug.cfs_rq:/.min_vruntime.min
2505980 ? 12% -32.7% 1686413 ? 9% sched_debug.cfs_rq:/.spread0.max
18298 -9.0% 16648 sched_debug.cpu.nr_switches.avg
15393 -10.8% 13737 sched_debug.cpu.nr_switches.min
0.04 ? 71% -69.1% 0.01 ? 30% sched_debug.cpu.nr_uninterruptible.avg
194329 +5.8% 205510 proc-vmstat.nr_inactive_anon
194329 +5.8% 205510 proc-vmstat.nr_zone_inactive_anon
2.562e+08 -12.7% 2.235e+08 proc-vmstat.numa_hit
2.56e+08 -12.7% 2.234e+08 proc-vmstat.numa_local
2.563e+08 -12.7% 2.236e+08 proc-vmstat.pgalloc_normal
5.004e+08 -13.0% 4.354e+08 proc-vmstat.pgfault
2.561e+08 -12.7% 2.234e+08 proc-vmstat.pgfree
1.392e+10 -6.5% 1.302e+10 perf-stat.i.branch-instructions
1.246e+08 -9.7% 1.125e+08 perf-stat.i.branch-misses
4.507e+08 -4.6% 4.3e+08 perf-stat.i.cache-misses
1.848e+09 -7.3% 1.713e+09 perf-stat.i.cache-references
13512 -11.0% 12032 perf-stat.i.context-switches
6.74 +7.0% 7.22 perf-stat.i.cpi
182.97 -2.4% 178.63 perf-stat.i.cpu-migrations
916.35 +5.4% 965.43 ? 2% perf-stat.i.cycles-between-cache-misses
1.646e+10 -8.3% 1.51e+10 perf-stat.i.dTLB-loads
0.14 -0.0 0.13 perf-stat.i.dTLB-store-miss-rate%
9784986 -13.2% 8495466 perf-stat.i.dTLB-store-misses
7.083e+09 -10.0% 6.378e+09 perf-stat.i.dTLB-stores
6.113e+10 -6.5% 5.714e+10 perf-stat.i.instructions
0.15 -6.4% 0.14 perf-stat.i.ipc
308.99 -7.9% 284.59 perf-stat.i.metric.M/sec
1655373 -13.1% 1438884 perf-stat.i.minor-faults
27419309 -5.5% 25923350 perf-stat.i.node-loads
1655375 -13.1% 1438886 perf-stat.i.page-faults
0.90 -0.0 0.86 perf-stat.overall.branch-miss-rate%
24.40 +0.7 25.10 perf-stat.overall.cache-miss-rate%
6.75 +7.1% 7.23 perf-stat.overall.cpi
915.93 +4.9% 961.03 perf-stat.overall.cycles-between-cache-misses
0.14 -0.0 0.13 perf-stat.overall.dTLB-store-miss-rate%
0.15 -6.6% 0.14 perf-stat.overall.ipc
22432987 +7.7% 24171448 perf-stat.overall.path-length
1.387e+10 -6.5% 1.297e+10 perf-stat.ps.branch-instructions
1.242e+08 -9.7% 1.121e+08 perf-stat.ps.branch-misses
4.493e+08 -4.6% 4.286e+08 perf-stat.ps.cache-misses
1.842e+09 -7.3% 1.708e+09 perf-stat.ps.cache-references
13466 -10.9% 11992 perf-stat.ps.context-switches
182.44 -2.4% 177.98 perf-stat.ps.cpu-migrations
1.641e+10 -8.3% 1.505e+10 perf-stat.ps.dTLB-loads
9753730 -13.2% 8467120 perf-stat.ps.dTLB-store-misses
7.061e+09 -10.0% 6.357e+09 perf-stat.ps.dTLB-stores
6.094e+10 -6.5% 5.695e+10 perf-stat.ps.instructions
1649980 -13.1% 1433930 perf-stat.ps.minor-faults
27334503 -5.4% 25845323 perf-stat.ps.node-loads
1649982 -13.1% 1433932 perf-stat.ps.page-faults
1.848e+13 -6.4% 1.729e+13 perf-stat.total.instructions
41.88 -41.9 0.00 perf-profile.calltrace.cycles-pp.on_each_cpu_cond_mask.flush_tlb_mm_range.ptep_clear_flush.wp_page_copy.__handle_mm_fault
41.50 -41.5 0.00 perf-profile.calltrace.cycles-pp.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.ptep_clear_flush.wp_page_copy
16.03 ? 2% -16.0 0.00 perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.ptep_clear_flush
6.79 ? 3% -6.8 0.00 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.ptep_clear_flush
6.52 ? 3% -6.5 0.00 perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range
6.44 ? 3% -6.4 0.00 perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask
5.21 ? 2% -5.2 0.00 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range
4.97 ? 2% -5.0 0.00 perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask
14.50 ? 4% -4.2 10.28 ? 5% perf-profile.calltrace.cycles-pp.flush_tlb_func.flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function
12.85 ? 3% -4.1 8.74 ? 3% perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond
6.79 ? 3% -2.0 4.79 ? 3% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.flush_tlb_mm_range.tlb_flush_mmu.tlb_finish_mmu
6.52 ? 3% -2.0 4.57 ? 3% perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.flush_tlb_mm_range.tlb_flush_mmu
51.35 -1.8 49.55 perf-profile.calltrace.cycles-pp.__madvise
51.14 -1.8 49.38 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__madvise
51.13 -1.8 49.37 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
51.11 -1.8 49.35 perf-profile.calltrace.cycles-pp.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
51.11 -1.8 49.35 perf-profile.calltrace.cycles-pp.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
50.10 -1.6 48.45 perf-profile.calltrace.cycles-pp.zap_page_range.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe
48.04 -1.4 46.69 perf-profile.calltrace.cycles-pp.tlb_finish_mmu.zap_page_range.do_madvise.__x64_sys_madvise.do_syscall_64
47.80 -1.3 46.49 perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.zap_page_range.do_madvise.__x64_sys_madvise
46.37 -1.1 45.24 perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.tlb_flush_mmu.tlb_finish_mmu.zap_page_range.do_madvise
6.59 ? 2% -0.8 5.74 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.flush_tlb_mm_range.tlb_flush_mmu
44.91 -0.8 44.08 perf-profile.calltrace.cycles-pp.smp_call_function_many_cond.flush_tlb_mm_range.tlb_flush_mmu.tlb_finish_mmu.zap_page_range
7.49 ? 2% -0.5 7.01 ? 2% perf-profile.calltrace.cycles-pp.llist_reverse_order.flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function
1.77 -0.3 1.49 ? 2% perf-profile.calltrace.cycles-pp.default_send_IPI_mask_sequence_phys.smp_call_function_many_cond.flush_tlb_mm_range.tlb_flush_mmu.tlb_finish_mmu
1.05 -0.2 0.81 perf-profile.calltrace.cycles-pp.cpumask_next.smp_call_function_many_cond.flush_tlb_mm_range.tlb_flush_mmu.tlb_finish_mmu
1.07 -0.2 0.90 perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
0.98 -0.1 0.83 perf-profile.calltrace.cycles-pp.filemap_map_pages.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
0.66 -0.1 0.51 perf-profile.calltrace.cycles-pp._find_next_bit.cpumask_next.smp_call_function_many_cond.flush_tlb_mm_range.tlb_flush_mmu
1.06 ? 2% -0.1 0.98 ? 2% perf-profile.calltrace.cycles-pp.__default_send_IPI_dest_field.default_send_IPI_mask_sequence_phys.smp_call_function_many_cond.flush_tlb_mm_range.tlb_flush_mmu
0.00 +0.6 0.56 ? 3% perf-profile.calltrace.cycles-pp.cpumask_next.native_flush_tlb_others.flush_tlb_mm_range.ptep_clear_flush.wp_page_copy
0.00 +0.8 0.76 perf-profile.calltrace.cycles-pp.cpumask_next.smp_call_function_many_cond.flush_tlb_mm_range.ptep_clear_flush.wp_page_copy
0.00 +0.8 0.83 ? 2% perf-profile.calltrace.cycles-pp.__default_send_IPI_dest_field.default_send_IPI_mask_sequence_phys.smp_call_function_many_cond.flush_tlb_mm_range.ptep_clear_flush
0.00 +1.3 1.27 ? 2% perf-profile.calltrace.cycles-pp.default_send_IPI_mask_sequence_phys.smp_call_function_many_cond.flush_tlb_mm_range.ptep_clear_flush.wp_page_copy
20.24 ? 2% +1.7 21.94 ? 2% perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.flush_tlb_mm_range.tlb_flush_mmu.tlb_finish_mmu
48.18 +1.8 50.02 perf-profile.calltrace.cycles-pp.testcase
47.77 +1.9 49.69 perf-profile.calltrace.cycles-pp.asm_exc_page_fault.testcase
47.66 +1.9 49.59 perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.testcase
47.63 +1.9 49.58 perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
0.00 +2.0 1.98 ? 3% perf-profile.calltrace.cycles-pp.native_flush_tlb_others.flush_tlb_mm_range.ptep_clear_flush.wp_page_copy.__handle_mm_fault
45.24 +2.2 47.47 perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
45.09 +2.3 47.34 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
6.44 ? 3% +2.3 8.77 ? 3% perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.flush_tlb_mm_range
43.71 +2.5 46.23 perf-profile.calltrace.cycles-pp.wp_page_copy.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
42.92 +2.7 45.57 perf-profile.calltrace.cycles-pp.ptep_clear_flush.wp_page_copy.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
42.89 +2.7 45.55 perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.ptep_clear_flush.wp_page_copy.__handle_mm_fault.handle_mm_fault
6.31 ? 2% +4.3 10.62 ? 2% perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.flush_tlb_mm_range
0.00 +4.3 4.32 ? 4% perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.flush_tlb_mm_range.ptep_clear_flush
0.00 +4.5 4.53 ? 3% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.flush_tlb_mm_range.ptep_clear_flush.wp_page_copy
0.00 +5.6 5.56 ? 2% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.flush_tlb_mm_range.ptep_clear_flush
0.00 +21.7 21.66 ? 3% perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.flush_tlb_mm_range.ptep_clear_flush.wp_page_copy
0.00 +42.5 42.48 perf-profile.calltrace.cycles-pp.smp_call_function_many_cond.flush_tlb_mm_range.ptep_clear_flush.wp_page_copy.__handle_mm_fault
41.89 -41.9 0.00 perf-profile.children.cycles-pp.on_each_cpu_cond_mask
30.59 ? 2% -6.4 24.19 ? 2% perf-profile.children.cycles-pp.flush_smp_call_function_queue
31.91 ? 2% -6.3 25.60 ? 2% perf-profile.children.cycles-pp.asm_sysvec_call_function
30.51 ? 2% -6.3 24.20 ? 2% perf-profile.children.cycles-pp.sysvec_call_function
30.12 ? 2% -6.3 23.84 ? 2% perf-profile.children.cycles-pp.__sysvec_call_function
19.62 ? 3% -5.7 13.88 ? 5% perf-profile.children.cycles-pp.flush_tlb_func
51.37 -1.8 49.57 perf-profile.children.cycles-pp.__madvise
51.33 -1.8 49.55 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
51.32 -1.8 49.54 perf-profile.children.cycles-pp.do_syscall_64
51.11 -1.8 49.35 perf-profile.children.cycles-pp.__x64_sys_madvise
51.11 -1.8 49.35 perf-profile.children.cycles-pp.do_madvise
50.11 -1.6 48.46 perf-profile.children.cycles-pp.zap_page_range
48.05 -1.4 46.69 perf-profile.children.cycles-pp.tlb_finish_mmu
47.82 -1.3 46.50 perf-profile.children.cycles-pp.tlb_flush_mmu
3.68 -0.8 2.84 ? 2% perf-profile.children.cycles-pp.default_send_IPI_mask_sequence_phys
9.19 ? 2% -0.8 8.42 perf-profile.children.cycles-pp.llist_reverse_order
2.65 ? 2% -0.6 2.04 ? 2% perf-profile.children.cycles-pp.native_flush_tlb_local
2.15 -0.3 1.82 ? 2% perf-profile.children.cycles-pp.__default_send_IPI_dest_field
1.08 -0.2 0.91 perf-profile.children.cycles-pp.do_fault
1.42 ? 7% -0.2 1.25 ? 6% perf-profile.children.cycles-pp.release_pages
0.64 ? 10% -0.2 0.49 ? 11% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
1.00 -0.2 0.85 perf-profile.children.cycles-pp.filemap_map_pages
0.53 -0.1 0.39 perf-profile.children.cycles-pp.asm_sysvec_call_function_single
0.51 -0.1 0.37 perf-profile.children.cycles-pp.sysvec_call_function_single
0.50 -0.1 0.36 perf-profile.children.cycles-pp.__sysvec_call_function_single
0.47 -0.1 0.38 perf-profile.children.cycles-pp.copy_page
0.47 ? 3% -0.1 0.39 ? 2% perf-profile.children.cycles-pp.unmap_page_range
0.49 -0.1 0.41 perf-profile.children.cycles-pp.next_uptodate_page
0.42 ? 2% -0.1 0.35 ? 3% perf-profile.children.cycles-pp.zap_pte_range
0.17 ? 6% -0.0 0.14 ? 4% perf-profile.children.cycles-pp.tlb_gather_mmu
0.12 ? 10% -0.0 0.10 ? 5% perf-profile.children.cycles-pp.sync_mm_rss
0.27 ? 2% -0.0 0.24 ? 2% perf-profile.children.cycles-pp.irq_exit_rcu
0.24 ? 3% -0.0 0.22 perf-profile.children.cycles-pp.error_entry
0.14 ? 4% -0.0 0.12 perf-profile.children.cycles-pp.sync_regs
0.13 -0.0 0.11 perf-profile.children.cycles-pp.__x86_indirect_thunk_rax
0.26 -0.0 0.24 perf-profile.children.cycles-pp.irqtime_account_irq
0.12 ? 4% -0.0 0.10 ? 7% perf-profile.children.cycles-pp.rwsem_down_read_slowpath
0.15 ? 3% -0.0 0.13 ? 2% perf-profile.children.cycles-pp.unlock_page
0.10 ? 3% -0.0 0.09 ? 8% perf-profile.children.cycles-pp.__x64_sys_munmap
0.10 ? 3% -0.0 0.09 ? 8% perf-profile.children.cycles-pp.__vm_munmap
0.10 -0.0 0.09 ? 8% perf-profile.children.cycles-pp.__munmap
0.10 -0.0 0.09 ? 8% perf-profile.children.cycles-pp.__do_munmap
0.09 ? 4% -0.0 0.08 ? 4% perf-profile.children.cycles-pp.unmap_region
0.09 -0.0 0.08 ? 5% perf-profile.children.cycles-pp.unmap_vmas
0.08 ? 5% -0.0 0.07 perf-profile.children.cycles-pp.do_set_pte
0.06 ? 5% -0.0 0.05 perf-profile.children.cycles-pp.page_add_file_rmap
0.06 -0.0 0.05 perf-profile.children.cycles-pp.___perf_sw_event
0.08 -0.0 0.07 perf-profile.children.cycles-pp.__perf_sw_event
1.54 ? 2% +0.1 1.66 ? 2% perf-profile.children.cycles-pp._find_next_bit
2.31 +0.1 2.44 perf-profile.children.cycles-pp.cpumask_next
89.30 +1.5 90.82 perf-profile.children.cycles-pp.flush_tlb_mm_range
48.44 +1.8 50.25 perf-profile.children.cycles-pp.testcase
48.01 +1.9 49.92 perf-profile.children.cycles-pp.asm_exc_page_fault
47.67 +1.9 49.61 perf-profile.children.cycles-pp.do_user_addr_fault
47.72 +1.9 49.67 perf-profile.children.cycles-pp.exc_page_fault
0.00 +2.0 2.02 ? 3% perf-profile.children.cycles-pp.native_flush_tlb_others
45.26 +2.2 47.48 perf-profile.children.cycles-pp.handle_mm_fault
45.10 +2.3 47.35 perf-profile.children.cycles-pp.__handle_mm_fault
43.72 +2.5 46.24 perf-profile.children.cycles-pp.wp_page_copy
42.92 +2.6 45.57 perf-profile.children.cycles-pp.ptep_clear_flush
36.85 ? 2% +7.5 44.33 ? 3% perf-profile.children.cycles-pp.llist_add_batch
16.95 ? 4% -5.1 11.81 ? 6% perf-profile.self.cycles-pp.flush_tlb_func
9.19 ? 2% -0.8 8.41 perf-profile.self.cycles-pp.llist_reverse_order
2.61 ? 2% -0.6 2.01 ? 2% perf-profile.self.cycles-pp.native_flush_tlb_local
2.14 -0.3 1.81 ? 2% perf-profile.self.cycles-pp.__default_send_IPI_dest_field
1.04 ? 5% -0.2 0.82 ? 7% perf-profile.self.cycles-pp.flush_tlb_mm_range
0.17 ? 37% -0.1 0.12 ? 7% perf-profile.self.cycles-pp.__handle_mm_fault
0.24 ? 2% -0.1 0.19 ? 3% perf-profile.self.cycles-pp.default_send_IPI_mask_sequence_phys
0.32 -0.0 0.27 perf-profile.self.cycles-pp.testcase
0.31 -0.0 0.27 ? 2% perf-profile.self.cycles-pp.copy_page
0.31 ? 2% -0.0 0.29 perf-profile.self.cycles-pp.next_uptodate_page
0.14 ? 4% -0.0 0.12 perf-profile.self.cycles-pp.sync_regs
0.16 ? 3% -0.0 0.14 perf-profile.self.cycles-pp.filemap_map_pages
0.08 ? 10% -0.0 0.06 ? 7% perf-profile.self.cycles-pp.sync_mm_rss
0.23 ? 2% +0.0 0.25 perf-profile.self.cycles-pp.find_next_bit
0.35 ? 2% +0.0 0.38 perf-profile.self.cycles-pp.cpumask_next
1.00 ? 2% +0.1 1.15 perf-profile.self.cycles-pp._find_next_bit
0.00 +1.0 1.05 ? 4% perf-profile.self.cycles-pp.native_flush_tlb_others
24.77 ? 2% +8.1 32.86 ? 3% perf-profile.self.cycles-pp.llist_add_batch




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


---
0-DAY CI Kernel Test Service
https://lists.01.org/hyperkitty/list/[email protected]

Thanks,
Oliver Sang


Attachments:
config-5.12.0-rc2-00003-g6035152d8eeb (171.64 kB)
job-script (7.94 kB)
job.yaml (5.36 kB)
reproduce (356.00 B)

2022-03-17 19:55:04

by Dave Hansen

Subject: Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression

On 3/17/22 02:04, kernel test robot wrote:
> FYI, we noticed a -13.2% regression of will-it-scale.per_thread_ops due to commit:
...
> commit: 6035152d8eebe16a5bb60398d3e05dc7799067b0 ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
...
> 24.77 ± 2% +8.1 32.86 ± 3% perf-profile.self.cycles-pp.llist_add_batch


tl;dr: This commit made the tlb_is_not_lazy() check happen earlier.
That earlier check can miss threads _going_ lazy because of mmap_lock
contention. Fewer lazy threads means more IPIs and lower performance.

===

There's a lot of noise in that profile, but I filtered most of it out.
The main thing is that, somehow the llist_add() in
smp_call_function_many_cond() got more expensive. Either we're doing
more of them or the cacheline is bouncing around more.

Turns out that we're sending *more* IPIs with this patch applied than
without. That shouldn't happen since the old code did the same exact
logical check:

	if (cond_func && !cond_func(cpu, info))
		continue;

and the new code does:

	if (tlb_is_not_lazy(cpu))
		...

where cond_func==tlb_is_not_lazy.

So, what's the difference? Timing. With the old scheme, if a CPU
enters lazy mode between native_flush_tlb_others() and
the loop in smp_call_function_many_cond(), it won't get an IPI and won't
need to do the llist_add().
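
Roughly, as a sketch (simplified and abbreviated, not the literal
upstream code; the real functions take more arguments than shown):

	/* Old flow: the lazy check happens inside the IPI loop itself */
	smp_call_function_many_cond(..., cond_func /* == tlb_is_not_lazy */)
	{
		for_each_cpu(cpu, cfd->cpumask) {
			if (cond_func && !cond_func(cpu, info))
				continue;	/* lazy *right now*: no IPI */
			/* llist_add() the csd and mark the CPU for an IPI */
		}
	}

	/* New flow: the set of non-lazy CPUs is computed up front */
	native_flush_tlb_others(...)
	{
		for_each_cpu(cpu, cpumask)
			if (tlb_is_not_lazy(cpu))
				__cpumask_set_cpu(cpu, cond_cpumask);

		/* a CPU that goes lazy after this point still gets an IPI */
		on_each_cpu_mask(cond_cpumask, flush_tlb_func, (void *)info, true);
	}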

I stuck some printk()s in there and can confirm that the
earlier-calculated mask always seems to have more bits set, at least
when running will-it-scale tests that induce TLB flush IPIs.

I was kinda surprised that there were so many threads going idle with a
cpu-eating micro like this. But, it makes sense since they're
contending on mmap_lock. Basically, since TLB-flushing operations like
mmap() hold mmap_lock for write they tend to *force* other threads into
idle. Idle threads are lazy and they tend to _become_ lazy around the
time that the flushing starts.

This new "early lazy check" behavior could theoretically work both ways.
If threads tended to be waking up from idle when TLB flushes were being
sent, this would tend to reduce the number of IPIs. But, since they
tend to be going to sleep it increases the number of IPIs.

Anybody have a better theory? I think we should probably revert the commit.

2022-03-17 20:18:59

by Nadav Amit

Subject: Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression



> On Mar 17, 2022, at 11:38 AM, Dave Hansen <[email protected]> wrote:
>
> On 3/17/22 02:04, kernel test robot wrote:
>> FYI, we noticed a -13.2% regression of will-it-scale.per_thread_ops due to commit:
> ...
>> commit: 6035152d8eebe16a5bb60398d3e05dc7799067b0 ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> ...
>> 24.77 ± 2% +8.1 32.86 ± 3% perf-profile.self.cycles-pp.llist_add_batch
>
>
> tl;dr: This commit made the tlb_is_not_lazy() check happen earlier.
> That earlier check can miss threads _going_ lazy because of mmap_lock
> contention. Fewer lazy threads means more IPIs and lower performance.
>
> ===
>
> There's a lot of noise in that profile, but I filtered most of it out.
> The main thing is that, somehow the llist_add() in
> smp_call_function_many_cond() got more expensive. Either we're doing
> more of them or the cacheline is bouncing around more.
>
> Turns out that we're sending *more* IPIs with this patch applied than
> without. That shouldn't happen since the old code did the same exact
> logical check:
>
> if (cond_func && !cond_func(cpu, info))
> continue;
>
> and the new code does:
>
> if (tlb_is_not_lazy(cpu))
> ...
>
> where cond_func==tlb_is_not_lazy.
>
> So, what's the difference? Timing. With the old scheme, if a CPU
> enters lazy mode between native_flush_tlb_others() and
> the loop in smp_call_function_many_cond(), it won't get an IPI and won't
> need to do the llist_add().
>
> I stuck some printk()s in there and can confirm that the
> earlier-calculated mask always seems to have more bits set, at least
> when running will-it-scale tests that induce TLB flush IPIs.
>
> I was kinda surprised that there were so many threads going idle with a
> cpu-eating micro like this. But, it makes sense since they're
> contending on mmap_lock. Basically, since TLB-flushing operations like
> mmap() hold mmap_lock for write they tend to *force* other threads into
> idle. Idle threads are lazy and they tend to _become_ lazy around the
> time that the flushing starts.
>
> This new "early lazy check" behavior could theoretically work both ways.
> If threads tended to be waking up from idle when TLB flushes were being
> sent, this would tend to reduce the number of IPIs. But, since they
> tend to be going to sleep it increases the number of IPIs.
>
> Anybody have a better theory? I think we should probably revert the commit.

Let’s get back to the motivation behind this patch.

Originally we had an indirect branch that, on systems which are
vulnerable to Spectre v2, translates into a retpoline.

So I would not paraphrase this patch's purpose as “early lazy check”
but rather “more efficient lazy check”. There is very little code
executed between the call to on_each_cpu_cond_mask() and
the actual check of tlb_is_not_lazy(). So what seems to happen
in this test-case - according to what you say - is that *slower*
checks of is-lazy allow sending fewer IPIs, since some cores go
into idle-state.

Was this test run with retpolines? If there is a difference in
performance without retpoline - I am probably wrong.

Otherwise, I do not see why this patch should be removed. We can
just as well add a busy-wait loop to tlb_is_not_lazy() to get the
same effect…

2022-03-17 20:19:01

by Dave Hansen

Subject: Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression

On 3/17/22 12:02, Nadav Amit wrote:
>> This new "early lazy check" behavior could theoretically work both ways.
>> If threads tended to be waking up from idle when TLB flushes were being
>> sent, this would tend to reduce the number of IPIs. But, since they
>> tend to be going to sleep it increases the number of IPIs.
>>
>> Anybody have a better theory? I think we should probably revert the commit.
>
> Let’s get back to the motivation behind this patch.
>
> Originally we had an indirect branch that on system which are
> vulnerable to Spectre v2 translates into a retpoline.
>
> So I would not paraphrase this patch purpose as “early lazy check”
> but instead “more efficient lazy check”. There is very little code
> that was executed between the call to on_each_cpu_cond_mask() and
> the actual check of tlb_is_not_lazy(). So what it seems to happen
> in this test-case - according to what you say - is that *slower*
> checks of is-lazy allows to send fewer IPIs since some cores go
> into idle-state.
>
> Was this test run with retpolines? If there is a difference in
> performance without retpoline - I am probably wrong.

Nope, no retpolines:

> /sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Enhanced IBRS, IBPB: conditional, RSB filling

which is the same situation as the "Xeon Platinum 8358" which found this
in 0day.

Maybe the increased IPIs with this approach end up being a wash with the
reduced retpoline overhead.

Did you have any specific performance numbers that show the benefit on
retpoline systems?

2022-03-17 21:04:23

by Nadav Amit

Subject: Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression



> On Mar 17, 2022, at 12:11 PM, Dave Hansen <[email protected]> wrote:
>
> On 3/17/22 12:02, Nadav Amit wrote:
>>> This new "early lazy check" behavior could theoretically work both ways.
>>> If threads tended to be waking up from idle when TLB flushes were being
>>> sent, this would tend to reduce the number of IPIs. But, since they
>>> tend to be going to sleep it increases the number of IPIs.
>>>
>>> Anybody have a better theory? I think we should probably revert the commit.
>>
>> Let’s get back to the motivation behind this patch.
>>
>> Originally we had an indirect branch that on system which are
>> vulnerable to Spectre v2 translates into a retpoline.
>>
>> So I would not paraphrase this patch purpose as “early lazy check”
>> but instead “more efficient lazy check”. There is very little code
>> that was executed between the call to on_each_cpu_cond_mask() and
>> the actual check of tlb_is_not_lazy(). So what it seems to happen
>> in this test-case - according to what you say - is that *slower*
>> checks of is-lazy allows to send fewer IPIs since some cores go
>> into idle-state.
>>
>> Was this test run with retpolines? If there is a difference in
>> performance without retpoline - I am probably wrong.
>
> Nope, no retpolines:

Err..

>
>> /sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Enhanced IBRS, IBPB: conditional, RSB filling
>
> which is the same situation as the "Xeon Platinum 8358" which found this
> in 0day.
>
> Maybe the increased IPIs with this approach end up being a wash with the
> reduced retpoline overhead.
>
> Did you have any specific performance numbers that show the benefit on
> retpoline systems?

I had profiled this thing to death at the time. I don’t have the numbers
with me now though. I did not run will-it-scale but a similar benchmark
that I wrote.

Another possible reason is that perhaps with this patch alone, without
subsequent patches we get some negative impact. I do not have a good
explanation, but can we rule this one out?

Can you please clarify how the bot works - did it notice a performance
regression and then started bisecting, or did it just check one patch
at a time?

I ask because I got a different report, which said that a
subsequent patch ("x86/mm/tlb: Privatize cpu_tlbstate") made a
23.3% improvement [1] for a very similar (yet different) test.

Without a good explanation, my knee-jerk reaction is that this seems
like a pathological case. I do not expect a performance improvement
without retpolines, and perhaps the few cycles by which the is-lazy
test is performed earlier matter.

I’m not married to this patch, but before a revert it would be good
to know why it even matters. I wonder whether you can confirm that
reverting the patch (without the rest of the series) even helps. If
it does, I’ll try to run some tests to understand what the heck is
going on.

[1] https://lists.ofono.org/hyperkitty/list/[email protected]/thread/UTC7DVZX4O5DKT2WUTWBTCVQ6W5QLGFA/


2022-03-17 21:17:56

by Dave Hansen

Subject: Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression

On 3/17/22 13:32, Nadav Amit wrote:
> Can you please clarify how the bot works - did it notice a performance
> regression and then started bisecting, or did it just check one patch
> at a time?

Oliver can tell us for sure, but it usually finds things by bisecting.
It will pick an upstream commit and compare it to the latest baseline.
If it sees a delta it starts bisecting for the triggering commit.

It isn't a literal 'git bisect', but it's logically similar.

I did ask the 0day folks privately if they had any more performance data
on that commit: good, bad or neutral.

That commit didn't actually look to me like it was fundamental to
anything built after it. It might not revert cleanly, but it doesn't
look like it would be hard to logically remove. What other side-effects
are you worried about?

BTW, there's also a dirt simple hack to do the on_each_cpu_cond_mask()
without a retpoline:

	if ((cond_func == tlb_is_not_lazy) &&
	    !tlb_is_not_lazy(...))
		continue;

You can't do that literally in arch-independent code, but you get the point.
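
Spelled out inside the generic loop, it might look something like this
(a sketch only; the signature of tlb_is_not_lazy() here is illustrative
and, as said, kernel/smp.c can't literally reference that symbol):

	for_each_cpu(cpu, cfd->cpumask) {
		if (cond_func == tlb_is_not_lazy) {
			/* known comparand: plain direct call, no retpoline */
			if (!tlb_is_not_lazy(cpu, info))
				continue;
		} else if (cond_func && !cond_func(cpu, info)) {
			/* everything else keeps the indirect call */
			continue;
		}

		/* queue the csd and mark the CPU for an IPI as usual */
	}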

I know folks have discussed ways of doing this kind of stuff for other
high-value indirect calls. I need to see if there's anything around
that we could use.

2022-03-18 01:15:43

by Dave Hansen

Subject: Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression

On 3/17/22 13:32, Nadav Amit wrote:
> I’m not married to this patch, but before a revert it would be good
> to know why it even matters. I wonder whether you can confirm that
> reverting the patch (without the rest of the series) even helps. If
> it does, I’ll try to run some tests to understand what the heck is
> going on.

I went back and tested on an "Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz"
which is evidently a 6-core "Coffee Lake". It needs retpolines:

> /sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling

I ran the will-it-scale test:

./malloc1_threads -s 30 -t 12

and took the 30-second average "ops/sec" at the two commits:

4c1ba3923e:197876
6035152d8e:199367 +0.75%

Where bigger is better. So, a small win, but probably mostly in the
noise. The number of IPIs definitely went up, probably 3-4% to get that
win.

IPI costs go up the more threads you throw at it. The retpolines do
too, though because you do *more* of them. Systems with no retpolines
get hit harder by the IPI costs and have no upsides from removing the
retpoline.

So, we've got a small (<1%, possibly zero) win on the bulk of systems
(which have retpolines). Newer, retpoline-free systems see a
double-digit regression. The bigger the system, the bigger the
regression (probably).

I tend to think the bigger regression wins and we should probably revert
the patch, or at least back out its behavior.

Nadav, do you have some different data or a different take?

2022-03-18 03:39:52

by Nadav Amit

Subject: Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression



> On Mar 17, 2022, at 5:16 PM, Dave Hansen <[email protected]> wrote:
>
> On 3/17/22 13:32, Nadav Amit wrote:
>> I’m not married to this patch, but before a revert it would be good
>> to know why it even matters. I wonder whether you can confirm that
>> reverting the patch (without the rest of the series) even helps. If
>> it does, I’ll try to run some tests to understand what the heck is
>> going on.
>
> I went back and tested on a "Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz"
> which is evidently a 6-core "Coffee Lake". It needs retpolines:
>
>> /sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
>
> I ran the will-it-scale test:
>
> ./malloc1_threads -s 30 -t 12
>
> and took the 30-second average "ops/sec" at the two commits:
>
> 4c1ba3923e:197876
> 6035152d8e:199367 +0.75%
>
> Where bigger is better. So, a small win, but probably mostly in the
> noise. The number of IPIs definitely went up, probably 3-4% to get that
> win.
>
> IPI costs go up the more threads you throw at it. The retpolines do
> too, though because you do *more* of them. Systems with no retpolines
> get hit harder by the IPI costs and have no upsides from removing the
> retpoline.
>
> So, we've got a small (<1%, possibly zero) win on the bulk of systems
> (which have retpolines). Newer, retpoline-free systems see a
> double-digit regression. The bigger the system, the bigger the
> regression (probably).
>
> I tend to think the bigger regression wins and we should probably revert
> the patch, or at least back out its behavior.
>
> Nadav, do you have some different data or a different take?

Thanks for testing.

I don’t have other data right now. Let me run some measurements later
tonight. I understand your explanation, but I still do not see how
much “later” the lazy check can be that it really matters. Just
strange.

2022-03-18 09:37:48

by kernel test robot

Subject: Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression

Hi Dave, Hi Nadav,

On Thu, Mar 17, 2022 at 01:49:48PM -0700, Dave Hansen wrote:
> On 3/17/22 13:32, Nadav Amit wrote:
> > Can you please clarify how the bot works - did it notice a performance
> > regression and then started bisecting, or did it just check one patch
> > at a time?
>
> Oliver can tell us for sure, but it usually finds things by bisecting.
> It will pick an upstream commit and compare it to the latest baseline.
> If it sees a delta it starts bisecting for the triggering commit.
>
> It isn't a literal 'git bisect', but it's logically similar.

yes, this is exactly how 0-day bot works.

regarding below from Nadav,
> > I ask because I got a different report from the report that a
> > subsequent patch ("x86/mm/tlb: Privatize cpu_tlbstate”) made a
> > 23.3% improvement [1] for a very similar (yet different) test.

yes, we also noticed this:
* 2f4305b19fe6a x86/mm/tlb: Privatize cpu_tlbstate
* 4ce94eabac16b x86/mm/tlb: Flush remote and local TLBs concurrently
* 6035152d8eebe x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
* 4c1ba3923e6c8 x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
* a32a4d8a815c4 smp: Run functions concurrently in smp_call_function_many_cond()
* a38fd87484648 (tag: v5.12-rc2,

but we confirmed there is no obvious performance change on this test upon
2f4305b19fe6a ("x86/mm/tlb: Privatize cpu_tlbstate")

below is what we tested along mainline recently, from latest to oldest, for this
test:

7e57714cd0ad2 Linux 5.17-rc6 5533 5551 5572 5536 5544 5523
9137eda53752e Merge tag 'configfs-5.17-2022-02-25' of git://git.infradead.org/users/hch/configfs 5591
53ab78cd6d5ab Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux 5571 5569 5525 5542
d8152cfe2f21d Merge tag 'pci-v5.17-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci 5569 5541
23d04328444a8 Merge tag 'for-5.17/parisc-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 5535 5565 5526
5c1ee569660d4 Merge branch 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 5480 5527 5486
038101e6b2cd5 Merge tag 'platform-drivers-x86-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 5508
cfb92440ee71a Linux 5.17-rc5 5506
754e0b0e35608 Linux 5.17-rc4 5498
e783362eb54cd Linux 5.17-rc1 5557
df0cc57e057f1 Linux 5.16 5571
2f4305b19fe6a x86/mm/tlb: Privatize cpu_tlbstate 5601 5642 5674 5634 5678 5702
6035152d8eebe x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() 5598 5571 5571 5639 5579 5587 5571 5582
4c1ba3923e6c8 x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote() 6292 6508 6478 6505 6411 6475 6269 6494 6474

as the above shows, the performance drop caused by 6035152d8eebe does not seem to
recover on 2f4305b19fe6a and the following commits.


as a contrast, for the report
"[x86/mm/tlb] 2f4305b19f: will-it-scale.per_thread_ops 23.3% improvement"
which is a different subtest under will-it-scale, also on another platform
(we have much more test history on it from before we upgraded the ucode for
this platform, so I only show part of it):

df0cc57e057f1 Linux 5.16 3247
fc74e0a40e4f9 Linux 5.16-rc7 3242
8bb7eca972ad5 Linux 5.15 2856 2879 2900 2871 2890
519d81956ee27 Linux 5.15-rc6 2822
64570fbc14f8d Linux 5.15-rc5 2820 2839 2852 2833
62fb9874f5da5 Linux 5.13 3311 3299 3288
13311e74253fe Linux 5.13-rc7 3302 3316 3303
9f4ad9e425a1d Linux 5.12 2765 2774 2779 2784 2768
1608e4cf31b88 x86/mm/tlb: Remove unnecessary uses of the inline keyword 3448 3447 3483 3506 3494
291c4011dd7ac cpumask: Mark functions as pure 3469 3520 3419 3437 3418 3499
09c5272e48614 x86/mm/tlb: Do not make is_lazy dirty for no reason 3421 3473 3392 3463 3474 3434
2f4305b19fe6a x86/mm/tlb: Privatize cpu_tlbstate 3509 3475 3368 3450 3445 3442
4ce94eabac16b x86/mm/tlb: Flush remote and local TLBs concurrently 2796 2792 2796 2812 2784 2796 2779
6035152d8eebe x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() 2755 2797 2792
4c1ba3923e6c8 x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote() 2836 2827 2825
a38fd87484648 Linux 5.12-rc2 2997 2981 3003

as above, there is a performance improvement on 2f4305b19fe6a,
but the data from this subtest fluctuates more along mainline.

>
> I did ask the 0day folks privately if they had any more performance data
> on that commit: good, bad or neutral.

we don't have other performance data on this commit so far,
but this may mean that no other bisection has pointed to this commit.

>
> That commit didn't actually look to me like it was fundamental to
> anything built after it. It might not revert cleanly, but it doesn't
> look like it would be hard to logically remove. What other side-effects
> are you worried about?
>
> BTW, there's also a dirt simple hack to do the on_each_cpu_cond_mask()
> without a retpoline:
>
> if ((cond_func == tlb_is_not_lazy) &&
> !tlb_is_not_lazy(...))
> continue;
>
> You can't do that literally in arch-independent code, but you get the point.
>
> I know folks have discussed ways of doing this kind of stuff for other
> high-value indirect calls. I need to see if there's anything around
> that we could use.

2022-03-18 10:40:35

by Dave Hansen

Subject: Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression

On 3/17/22 17:20, Nadav Amit wrote:
> I don’t have other data right now. Let me run some measurements later
> tonight. I understand your explanation, but I still do not see how
> much “later” can the lazy check be that it really matters. Just
> strange.

These will-it-scale tests are really brutal. They're usually sitting in
really tight kernel entry/exit loops. Everything is pounding on kernel
locks and bouncing cachelines around like crazy. It might only be a few
thousand cycles between two successive kernel entries.

Things like the call_single_queue cacheline have to be dragged from
other CPUs *and* there are locks that you can spin on. While a thread
is doing all this spinning, it is forcing more and more threads into the
lazy TLB state. The longer you spin, the more threads have entered the
kernel, contended on the mmap_lock and gone idle.

Is it really surprising that a loop that can take hundreds of locks can
take a long time?

	for_each_cpu(cpu, cfd->cpumask) {
		csd_lock(csd);	/* may spin until the previous use of this csd is released */
		...
}