Date: 2023-11-07 01:45:29
From: kernel test robot

Subject: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression



Hello,

kernel test robot noticed a -16.9% regression of vm-scalability.throughput on:


commit: c9eec08bac96898573c236af9cb0ccee765684fc ("iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
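
For context on the change being bisected: as the commit subject suggests, memcpy_from_iter_mc() used to consult iter->copy_mc on every copy step and fall through to a plain memcpy in the common case; the commit hoists that decision so the machine-check-safe path is selected once up front instead of being re-tested per segment. A minimal userspace sketch of that restructuring (all names below are illustrative stand-ins, not the kernel's; careful_copy() plays the role of the hardware-fault-tolerant copy_mc_to_kernel()):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct iter { bool copy_mc; };  /* stands in for iter->copy_mc */

/* Stand-in for the machine-check-safe copy (copy_mc_to_kernel()). */
static void careful_copy(void *dst, const void *src, size_t len)
{
        memcpy(dst, src, len);
}

/* Old shape: the flag is re-tested on every segment of the copy. */
static void copy_old(struct iter *it, char *dst, const char *src,
                     size_t total, size_t seg)
{
        for (size_t off = 0; off < total; off += seg) {
                if (it->copy_mc)        /* branch evaluated per segment */
                        careful_copy(dst + off, src + off, seg);
                else
                        memcpy(dst + off, src + off, seg);
        }
}

/* New shape: decide once up front; the hot loop is branch-free.
 * (Assumes total is a multiple of seg, purely for brevity.) */
static void copy_new(struct iter *it, char *dst, const char *src,
                     size_t total, size_t seg)
{
        if (it->copy_mc) {              /* rare case, checked once */
                for (size_t off = 0; off < total; off += seg)
                        careful_copy(dst + off, src + off, seg);
                return;
        }
        for (size_t off = 0; off < total; off += seg)
                memcpy(dst + off, src + off, seg);
}

The perf-profile section near the end of this report is consistent with that shape change: the memcpy_from_iter_mc frames drop to zero (-11.1 percentage points) while copy_page_from_iter_atomic absorbs the cycles, i.e. the same copies are now reached through a different path.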

testcase: vm-scalability
test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
parameters:

runtime: 300s
size: 256G
test: msync
cpufreq_governor: performance

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ subsystem of the Linux kernel which are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
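
For readers unfamiliar with the case: vm-scalability's msync test boils down to dirtying a large file-backed mapping and forcing writeback with msync(MS_SYNC). A rough standalone approximation of that access pattern follows; the real harness (see the test-url above) spreads the 256G working set across many parallel tasks, and the file path and length here are placeholders:

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 1UL << 30;         /* placeholder; the test uses 256G total */
        long pgsz = sysconf(_SC_PAGESIZE);
        int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        for (size_t off = 0; off < len; off += pgsz)
                p[off] = 1;             /* dirty every page (minor faults) */
        if (msync(p, len, MS_SYNC) < 0) /* write pages back and wait */
                return 1;
        munmap(p, len);
        close(fd);
        return 0;
}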



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231106/[email protected]

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/300s/256G/lkp-cpl-4sp2/msync/vm-scalability

commit:
f1982740f5 ("iov_iter: Convert iterate*() to inline funcs")
c9eec08bac ("iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()")

f1982740f5e77090 c9eec08bac96898573c236af9cb
---------------- ---------------------------
%stddev %change %stddev
\ | \
17367 -16.8% 14444 vm-scalability.median
6.13 ± 26% +4.3 10.39 ± 18% vm-scalability.stddev%
6319269 -16.9% 5252011 vm-scalability.throughput
309.64 +6.3% 329.22 vm-scalability.time.elapsed_time
309.64 +6.3% 329.22 vm-scalability.time.elapsed_time.max
1.77e+09 -11.1% 1.574e+09 vm-scalability.time.file_system_outputs
2.357e+08 -11.1% 2.095e+08 vm-scalability.time.minor_page_faults
595.33 -15.1% 505.50 vm-scalability.time.percent_of_cpu_this_job_got
1452 -9.9% 1308 vm-scalability.time.system_time
392.26 ± 2% -8.9% 357.20 ± 3% vm-scalability.time.user_time
1369046 -4.4% 1308968 vm-scalability.time.voluntary_context_switches
9.952e+08 -11.1% 8.846e+08 vm-scalability.workload
95.00 ± 5% +37.9% 131.00 ± 4% perf-c2c.DRAM.local
2.17 -0.3 1.90 mpstat.cpu.all.sys%
0.39 ± 2% -0.1 0.34 ± 2% mpstat.cpu.all.usr%
264231 ± 7% +20.1% 317338 ± 9% numa-meminfo.node1.Writeback
3161539 ± 6% +11.1% 3513478 ± 3% numa-meminfo.node2.Inactive(anon)
2798749 -16.1% 2347030 vmstat.io.bo
7.61 ± 4% -12.7% 6.64 ± 6% vmstat.procs.r
16971 -10.4% 15204 vmstat.system.cs
1378902 ± 38% +389.9% 6755495 ± 45% numa-numastat.node0.numa_foreign
4177825 ± 40% -64.1% 1500551 ±107% numa-numastat.node0.numa_miss
4264407 ± 39% -63.4% 1562785 ±101% numa-numastat.node0.other_node
5590015 ± 78% -62.7% 2085869 ±113% numa-numastat.node3.numa_foreign
169.33 ± 2% -10.4% 151.67 turbostat.Avg_MHz
4.47 ± 2% -0.5 4.00 turbostat.Busy%
435.99 -1.9% 427.64 turbostat.PkgWatt
21.64 -3.5% 20.88 turbostat.RAMWatt
124185 ± 13% -32.3% 84070 ± 10% sched_debug.cfs_rq:/.avg_vruntime.min
124185 ± 13% -32.3% 84070 ± 10% sched_debug.cfs_rq:/.min_vruntime.min
105.08 ± 37% -61.2% 40.75 ± 7% sched_debug.cfs_rq:/.runnable_avg.avg
164.08 ± 14% -32.5% 110.71 ± 4% sched_debug.cfs_rq:/.runnable_avg.stddev
104.27 ± 38% -61.2% 40.49 ± 7% sched_debug.cfs_rq:/.util_avg.avg
162.41 ± 15% -32.4% 109.74 ± 4% sched_debug.cfs_rq:/.util_avg.stddev
2781 ± 24% -44.3% 1549 ± 14% sched_debug.cpu.curr->pid.stddev
0.59 ± 7% +12.4% 0.66 ± 4% sched_debug.cpu.nr_uninterruptible.avg
55282809 ± 6% -13.7% 47726748 ± 6% numa-vmstat.node0.nr_dirtied
1145143 ± 9% -20.4% 912102 ± 17% numa-vmstat.node0.nr_free_pages
55282809 ± 6% -13.7% 47726748 ± 6% numa-vmstat.node0.nr_written
1378902 ± 38% +389.9% 6755495 ± 45% numa-vmstat.node0.numa_foreign
4177825 ± 40% -64.1% 1500551 ±107% numa-vmstat.node0.numa_miss
4264407 ± 39% -63.4% 1562785 ±101% numa-vmstat.node0.numa_other
65521 ± 8% +20.8% 79128 ± 10% numa-vmstat.node1.nr_writeback
56237202 ± 6% -11.8% 49599462 ± 4% numa-vmstat.node2.nr_dirtied
789922 ± 6% +11.2% 878674 ± 3% numa-vmstat.node2.nr_inactive_anon
7363130 ± 14% -25.5% 5486565 ± 13% numa-vmstat.node2.nr_vmscan_immediate_reclaim
56237202 ± 6% -11.8% 49599462 ± 4% numa-vmstat.node2.nr_written
789923 ± 6% +11.2% 878673 ± 3% numa-vmstat.node2.nr_zone_inactive_anon
56312677 ± 5% -13.8% 48559691 ± 7% numa-vmstat.node3.nr_dirtied
56312677 ± 5% -13.8% 48559691 ± 7% numa-vmstat.node3.nr_written
5590015 ± 78% -62.7% 2085869 ±113% numa-vmstat.node3.numa_foreign
10311 ± 35% -68.9% 3204 ± 72% proc-vmstat.compact_success
14371045 ± 2% +3.7% 14896466 proc-vmstat.nr_active_file
98005 -1.9% 96096 proc-vmstat.nr_anon_pages
2.213e+08 -11.1% 1.967e+08 proc-vmstat.nr_dirtied
3160899 +9.1% 3447334 ± 2% proc-vmstat.nr_inactive_anon
27214352 ± 2% -14.1% 23369413 ± 2% proc-vmstat.nr_vmscan_immediate_reclaim
2.213e+08 -11.1% 1.967e+08 proc-vmstat.nr_written
14371387 ± 2% +3.7% 14896745 proc-vmstat.nr_zone_active_file
3160900 +9.1% 3447332 ± 2% proc-vmstat.nr_zone_inactive_anon
19216 ± 11% -71.2% 5539 ± 4% proc-vmstat.numa_hint_faults
9725 ± 31% -70.0% 2917 ± 59% proc-vmstat.numa_hint_faults_local
1996 ± 71% -76.8% 462.83 ± 80% proc-vmstat.numa_pages_migrated
3.344e+08 -11.4% 2.964e+08 proc-vmstat.pgactivate
2.646e+08 -10.0% 2.382e+08 proc-vmstat.pgalloc_normal
1.158e+08 ± 4% -13.7% 99968494 ± 3% proc-vmstat.pgdeactivate
2.374e+08 -11.1% 2.111e+08 proc-vmstat.pgfault
2.673e+08 -10.1% 2.402e+08 proc-vmstat.pgfree
3693 ± 3% -12.6% 3227 ± 3% proc-vmstat.pgmajfault
8.853e+08 -11.1% 7.869e+08 proc-vmstat.pgpgout
1.158e+08 ± 4% -13.7% 99968494 ± 3% proc-vmstat.pgrefill
127095 ± 2% -10.4% 113877 proc-vmstat.pgreuse
28337584 ± 2% -14.4% 24247338 proc-vmstat.pgrotated
61733485 ± 2% -11.9% 54360975 ± 4% proc-vmstat.pgscan_file
45323460 ± 5% -9.5% 40999529 ± 8% proc-vmstat.pgscan_kswapd
6262367 ± 4% -20.7% 4965325 ± 11% proc-vmstat.pgsteal_direct
31649958 ± 3% -11.4% 28049294 ± 6% proc-vmstat.pgsteal_file
11627904 -6.8% 10841344 proc-vmstat.unevictable_pgs_scanned
6985805 ± 2% -18.0% 5728058 ± 10% proc-vmstat.workingset_activate_file
7061865 ± 3% -12.0% 6214389 ± 5% proc-vmstat.workingset_refault_file
6985038 ± 2% -18.0% 5727368 ± 10% proc-vmstat.workingset_restore_file
9.17 -9.3% 8.31 perf-stat.i.MPKI
5.443e+09 -17.6% 4.486e+09 perf-stat.i.branch-instructions
13498005 ± 4% -13.9% 11626468 ± 2% perf-stat.i.branch-misses
80.16 -2.8 77.38 perf-stat.i.cache-miss-rate%
1.985e+08 -26.1% 1.467e+08 perf-stat.i.cache-misses
2.439e+08 -24.0% 1.854e+08 perf-stat.i.cache-references
16944 -10.7% 15132 perf-stat.i.context-switches
1.41 +12.8% 1.60 ± 5% perf-stat.i.cpi
3.482e+10 ± 2% -11.4% 3.086e+10 perf-stat.i.cpu-cycles
360.77 -6.1% 338.94 perf-stat.i.cpu-migrations
0.02 ± 5% +0.0 0.02 ± 3% perf-stat.i.dTLB-load-miss-rate%
826726 -9.6% 747639 ± 3% perf-stat.i.dTLB-load-misses
5.721e+09 ± 2% -21.9% 4.467e+09 perf-stat.i.dTLB-loads
0.13 +0.0 0.15 perf-stat.i.dTLB-store-miss-rate%
3828019 ± 2% -17.2% 3168599 perf-stat.i.dTLB-store-misses
2.814e+09 -26.4% 2.071e+09 perf-stat.i.dTLB-stores
55.13 -0.7 54.40 perf-stat.i.iTLB-load-miss-rate%
4113741 -10.8% 3670976 ± 2% perf-stat.i.iTLB-load-misses
3278251 -8.8% 2989968 perf-stat.i.iTLB-loads
2.171e+10 -20.0% 1.736e+10 perf-stat.i.instructions
5084 -10.0% 4578 perf-stat.i.instructions-per-iTLB-miss
0.85 -13.1% 0.74 ± 3% perf-stat.i.ipc
1448 ± 5% -14.8% 1233 ± 9% perf-stat.i.major-faults
0.15 ± 2% -11.4% 0.14 perf-stat.i.metric.GHz
925.09 -11.4% 819.35 perf-stat.i.metric.K/sec
62.72 -21.4% 49.31 perf-stat.i.metric.M/sec
703868 -16.4% 588183 perf-stat.i.minor-faults
80.84 -5.4 75.47 perf-stat.i.node-load-miss-rate%
10635123 +18.7% 12623931 perf-stat.i.node-load-misses
2610552 +59.3% 4158136 ± 2% perf-stat.i.node-loads
76.21 -2.7 73.54 perf-stat.i.node-store-miss-rate%
46019985 ± 2% -36.6% 29187579 perf-stat.i.node-store-misses
16991220 ± 3% -26.7% 12448792 perf-stat.i.node-stores
705316 -16.4% 589415 perf-stat.i.page-faults
9.21 -7.2% 8.54 perf-stat.overall.MPKI
81.37 -2.2 79.20 perf-stat.overall.cache-miss-rate%
1.65 +10.7% 1.82 perf-stat.overall.cpi
178.71 +19.4% 213.30 perf-stat.overall.cycles-between-cache-misses
0.01 ± 2% +0.0 0.02 ± 4% perf-stat.overall.dTLB-load-miss-rate%
0.14 +0.0 0.15 perf-stat.overall.dTLB-store-miss-rate%
5329 -10.1% 4789 perf-stat.overall.instructions-per-iTLB-miss
0.61 -9.7% 0.55 perf-stat.overall.ipc
80.15 -5.0 75.17 perf-stat.overall.node-load-miss-rate%
72.82 -2.9 69.89 perf-stat.overall.node-store-miss-rate%
6918 -4.2% 6625 perf-stat.overall.path-length
5.505e+09 -17.4% 4.549e+09 perf-stat.ps.branch-instructions
13649952 ± 4% -13.7% 11775243 ± 2% perf-stat.ps.branch-misses
2.023e+08 -25.6% 1.506e+08 perf-stat.ps.cache-misses
2.487e+08 -23.6% 1.901e+08 perf-stat.ps.cache-references
16976 -10.7% 15156 perf-stat.ps.context-switches
3.616e+10 ± 2% -11.2% 3.211e+10 perf-stat.ps.cpu-cycles
364.05 -6.1% 341.97 perf-stat.ps.cpu-migrations
836446 -9.5% 756895 ± 3% perf-stat.ps.dTLB-load-misses
5.773e+09 ± 2% -21.7% 4.521e+09 perf-stat.ps.dTLB-loads
3873256 ± 2% -16.9% 3216923 perf-stat.ps.dTLB-store-misses
2.862e+09 -26.0% 2.119e+09 perf-stat.ps.dTLB-stores
4124868 -10.7% 3682495 ± 2% perf-stat.ps.iTLB-load-misses
3267583 -8.7% 2983240 perf-stat.ps.iTLB-loads
2.198e+10 -19.8% 1.763e+10 perf-stat.ps.instructions
1468 ± 4% -15.2% 1245 ± 9% perf-stat.ps.major-faults
711215 -16.2% 596020 perf-stat.ps.minor-faults
10655899 +18.5% 12627905 perf-stat.ps.node-load-misses
2638928 +58.0% 4170441 ± 2% perf-stat.ps.node-loads
47247465 ± 2% -35.7% 30389920 perf-stat.ps.node-store-misses
17634674 ± 3% -25.7% 13094211 ± 2% perf-stat.ps.node-stores
712683 -16.2% 597265 perf-stat.ps.page-faults
6.885e+12 -14.9% 5.861e+12 perf-stat.total.instructions
0.00 ± 35% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.__get_user_pages.populate_vma_page_range.__mm_populate.vm_mmap_pgoff
0.02 ± 52% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.cgroup_rstat_flush_locked.cgroup_rstat_flush.do_flush_stats.mem_cgroup_wb_stats
0.00 ± 17% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.down_read.__mm_populate.vm_mmap_pgoff.ksys_mmap_pgoff
0.00 ± 62% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.down_read.page_cache_ra_order.filemap_fault.__do_fault
0.00 ± 28% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.down_read.xfs_ilock_for_iomap.xfs_read_iomap_begin.iomap_iter
0.00 ± 31% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.shrink_active_list.shrink_lruvec.shrink_node_memcgs.shrink_node
0.02 ± 43% -78.5% 0.00 ± 22% perf-sched.sch_delay.avg.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.02 ± 24% -91.6% 0.00 ±100% perf-sched.sch_delay.avg.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
3.27 ± 12% -51.4% 1.59 ± 73% perf-sched.sch_delay.avg.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_cleanup_planes
0.01 ± 21% -61.2% 0.01 ± 7% perf-sched.sch_delay.avg.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
0.01 ± 12% -100.0% 0.00 perf-sched.sch_delay.avg.ms.d_alloc_parallel.__lookup_slow.walk_component.link_path_walk.part
0.01 ± 9% -41.5% 0.01 ± 7% perf-sched.sch_delay.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
0.01 ± 32% -54.5% 0.01 ± 11% perf-sched.sch_delay.avg.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
0.00 ± 10% -50.0% 0.00 ± 17% perf-sched.sch_delay.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
0.02 ± 7% -73.9% 0.01 perf-sched.sch_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.01 ± 26% -67.9% 0.00 ±108% perf-sched.sch_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
0.01 ± 58% -100.0% 0.00 perf-sched.sch_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_call_function_single
0.00 ± 56% -100.0% 0.00 perf-sched.sch_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi
0.01 ± 27% -82.2% 0.00 ±141% perf-sched.sch_delay.avg.ms.io_schedule.folio_wait_bit_common.filemap_fault.__do_fault
0.01 ± 11% -36.8% 0.00 perf-sched.sch_delay.avg.ms.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
0.01 ± 13% -68.8% 0.00 ±141% perf-sched.sch_delay.avg.ms.kswapd_try_to_sleep.kswapd.kthread.ret_from_fork
0.01 ± 15% -69.3% 0.00 ± 9% perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
0.02 ± 9% -62.6% 0.01 ± 8% perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
0.01 ± 8% -44.6% 0.01 ± 9% perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
0.01 ± 82% -77.3% 0.00 ± 50% perf-sched.sch_delay.avg.ms.schedule_preempt_disabled.rwsem_down_read_slowpath.down_read.xfs_map_blocks
0.00 ± 63% +3966.7% 0.06 ± 68% perf-sched.sch_delay.avg.ms.schedule_timeout.__wait_for_common.__flush_workqueue.xlog_cil_push_now.isra
0.01 ± 10% -31.7% 0.00 ± 10% perf-sched.sch_delay.avg.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
0.01 ± 10% -58.9% 0.00 ± 9% perf-sched.sch_delay.avg.ms.schedule_timeout.xfsaild.kthread.ret_from_fork
0.01 ± 9% -25.6% 0.01 ± 8% perf-sched.sch_delay.avg.ms.sigsuspend.__x64_sys_rt_sigsuspend.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.03 ± 7% -65.2% 0.01 ± 5% perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.01 ± 12% -45.2% 0.01 ± 7% perf-sched.sch_delay.avg.ms.syslog_print.do_syslog.kmsg_read.vfs_read
0.00 ± 11% -40.7% 0.00 ± 17% perf-sched.sch_delay.avg.ms.wait_for_partner.fifo_open.do_dentry_open.do_open
0.25 ± 11% -82.4% 0.04 ± 35% perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
0.01 ± 51% -100.0% 0.00 perf-sched.sch_delay.avg.ms.xlog_wait_on_iclog.xfs_extent_busy_flush.xfs_alloc_ag_vextent_near.xfs_alloc_vextent_near_bno
0.01 ± 23% -33.3% 0.00 ± 14% perf-sched.sch_delay.avg.ms.xlog_wait_on_iclog.xfs_log_force_seq.xfs_file_fsync.__do_sys_msync
0.01 ± 50% -49.2% 0.01 ± 7% perf-sched.sch_delay.avg.ms.xlog_wait_on_iclog.xlog_cil_push_work.process_one_work.worker_thread
0.01 ± 34% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.__alloc_pages_slowpath.constprop.0.__alloc_pages
0.01 ± 58% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.__get_user_pages.populate_vma_page_range.__mm_populate.vm_mmap_pgoff
0.03 ± 39% -79.6% 0.01 ± 17% perf-sched.sch_delay.max.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
0.03 ± 60% -95.7% 0.00 ±141% perf-sched.sch_delay.max.ms.__cond_resched.__xfs_filemap_fault.do_page_mkwrite.do_fault.__handle_mm_fault
0.01 ± 68% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.balance_pgdat.kswapd.kthread.ret_from_fork
0.02 ± 45% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.cgroup_rstat_flush_locked.cgroup_rstat_flush.do_flush_stats.mem_cgroup_wb_stats
0.05 ± 49% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.down_read.__mm_populate.vm_mmap_pgoff.ksys_mmap_pgoff
0.00 ± 20% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.down_read.page_cache_ra_order.filemap_fault.__do_fault
0.01 ± 31% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.down_read.xfs_ilock_for_iomap.xfs_read_iomap_begin.iomap_iter
0.03 ± 38% -65.9% 0.01 ± 27% perf-sched.sch_delay.max.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev.do_iter_write
0.01 ± 51% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.shrink_active_list.shrink_lruvec.shrink_node_memcgs.shrink_node
0.03 ± 31% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.shrink_node_memcgs.shrink_node.shrink_zones.constprop
0.79 ± 77% -96.4% 0.03 ±169% perf-sched.sch_delay.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.57 ± 29% -99.7% 0.00 ±100% perf-sched.sch_delay.max.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
6.22 ± 36% -70.0% 1.87 ± 70% perf-sched.sch_delay.max.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_cleanup_planes
0.04 ± 45% -82.6% 0.01 ± 38% perf-sched.sch_delay.max.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
0.02 ± 18% -100.0% 0.00 perf-sched.sch_delay.max.ms.d_alloc_parallel.__lookup_slow.walk_component.link_path_walk.part
0.10 ± 58% -82.7% 0.02 ± 7% perf-sched.sch_delay.max.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
0.03 ± 53% -78.9% 0.01 ± 25% perf-sched.sch_delay.max.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
0.51 ± 36% -94.9% 0.03 ±129% perf-sched.sch_delay.max.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
0.89 ± 90% -98.6% 0.01 ± 17% perf-sched.sch_delay.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.06 ± 41% -94.9% 0.00 ±108% perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
0.06 ± 48% -90.4% 0.01 ± 22% perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
0.04 ± 84% -100.0% 0.00 perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_call_function_single
0.02 ± 38% -100.0% 0.00 perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi
0.03 ± 57% -95.4% 0.00 ±141% perf-sched.sch_delay.max.ms.io_schedule.folio_wait_bit_common.filemap_fault.__do_fault
0.01 ± 43% -80.7% 0.00 ±142% perf-sched.sch_delay.max.ms.kswapd_try_to_sleep.kswapd.kthread.ret_from_fork
0.28 ± 94% -97.0% 0.01 ± 16% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
0.73 ±161% -98.3% 0.01 ± 32% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
0.18 ±119% -91.8% 0.02 ± 13% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
0.06 ± 65% -95.5% 0.00 ± 46% perf-sched.sch_delay.max.ms.schedule_preempt_disabled.rwsem_down_read_slowpath.down_read.xfs_map_blocks
0.01 ± 48% +6086.4% 0.45 ± 32% perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.__flush_workqueue.xlog_cil_push_now.isra
0.04 ± 47% -81.9% 0.01 ± 20% perf-sched.sch_delay.max.ms.schedule_timeout.kcompactd.kthread.ret_from_fork
0.07 ± 35% -78.7% 0.01 ± 87% perf-sched.sch_delay.max.ms.schedule_timeout.xfsaild.kthread.ret_from_fork
13.32 ± 13% -71.3% 3.83 ± 4% perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.10 ± 46% -82.0% 0.02 ± 23% perf-sched.sch_delay.max.ms.syslog_print.do_syslog.kmsg_read.vfs_read
0.02 ± 40% -71.1% 0.01 ± 45% perf-sched.sch_delay.max.ms.wait_for_partner.fifo_open.do_dentry_open.do_open
20.07 ± 28% -76.3% 4.76 ± 58% perf-sched.sch_delay.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
0.00 ±145% +320.0% 0.00 ± 27% perf-sched.sch_delay.max.ms.xlog_force_lsn.xfs_log_force_seq.xfs_file_fsync.__do_sys_msync
0.01 ± 51% -100.0% 0.00 perf-sched.sch_delay.max.ms.xlog_wait_on_iclog.xfs_extent_busy_flush.xfs_alloc_ag_vextent_near.xfs_alloc_vextent_near_bno
0.03 ± 62% -74.2% 0.01 ± 22% perf-sched.sch_delay.max.ms.xlog_wait_on_iclog.xlog_cil_push_work.process_one_work.worker_thread
0.03 ± 3% -75.6% 0.01 ± 14% perf-sched.total_sch_delay.average.ms
52.36 ± 2% +16.5% 61.01 ± 2% perf-sched.total_wait_and_delay.average.ms
56658 ± 4% -21.4% 44529 ± 5% perf-sched.total_wait_and_delay.count.ms
52.33 ± 2% +16.6% 61.01 ± 2% perf-sched.total_wait_time.average.ms
0.15 ± 15% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.__cond_resched.down_read.__mm_populate.vm_mmap_pgoff.ksys_mmap_pgoff
0.50 ± 30% +92.1% 0.95 ± 20% perf-sched.wait_and_delay.avg.ms.__cond_resched.loop_process_work.process_one_work.worker_thread.kthread
0.95 ± 50% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.__cond_resched.shrink_node_memcgs.shrink_node.shrink_zones.constprop
26.03 ± 10% +166.5% 69.36 ± 14% perf-sched.wait_and_delay.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
1.09 ± 31% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.06 ± 31% +675.6% 0.44 ± 59% perf-sched.wait_and_delay.avg.ms.io_schedule.folio_wait_bit_common.folio_wait_writeback.__filemap_fdatawait_range
0.72 ± 65% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.io_schedule.folio_wait_bit_common.write_cache_pages.iomap_writepages
67.31 ± 6% +771.1% 586.35 ± 10% perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
8.78 ± 21% +1848.5% 171.11 ± 33% perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
3.98 ± 2% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
43.48 ± 3% +12.9% 49.09 ± 2% perf-sched.wait_and_delay.avg.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
111.19 ± 16% -89.7% 11.50 ±223% perf-sched.wait_and_delay.avg.ms.schedule_timeout.kswapd_try_to_sleep.kswapd.kthread
4.66 ± 5% +171.8% 12.67 ± 16% perf-sched.wait_and_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
194.78 ± 8% +71.6% 334.26 ± 6% perf-sched.wait_and_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
369.50 ± 7% -100.0% 0.00 perf-sched.wait_and_delay.count.__cond_resched.down_read.__mm_populate.vm_mmap_pgoff.ksys_mmap_pgoff
1208 ± 22% -100.0% 0.00 perf-sched.wait_and_delay.count.__cond_resched.shrink_node_memcgs.shrink_node.shrink_zones.constprop
9.00 ± 40% -83.3% 1.50 ±120% perf-sched.wait_and_delay.count.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_cleanup_planes
421.00 ± 7% -89.9% 42.67 ± 49% perf-sched.wait_and_delay.count.devkmsg_read.vfs_read.ksys_read.do_syscall_64
848.50 -73.3% 226.17 ± 20% perf-sched.wait_and_delay.count.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
1048 -100.0% 0.00 perf-sched.wait_and_delay.count.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
1875 ± 36% -100.0% 0.00 perf-sched.wait_and_delay.count.io_schedule.folio_wait_bit_common.write_cache_pages.iomap_writepages
411.67 ± 13% +379.1% 1972 ± 91% perf-sched.wait_and_delay.count.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
9.67 ± 22% -91.4% 0.83 ±175% perf-sched.wait_and_delay.count.kswapd_try_to_sleep.kswapd.kthread.ret_from_fork
188.33 ± 2% -87.9% 22.83 ± 6% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
1268 -92.1% 99.83 ± 56% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
433.67 ± 6% -87.1% 56.00 ± 46% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
341.50 -100.0% 0.00 perf-sched.wait_and_delay.count.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
39.33 ± 17% -94.9% 2.00 ±223% perf-sched.wait_and_delay.count.schedule_timeout.kswapd_try_to_sleep.kswapd.kthread
1003 ± 7% -61.6% 385.67 ± 21% perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
422.17 ± 6% -90.0% 42.33 ± 49% perf-sched.wait_and_delay.count.syslog_print.do_syslog.kmsg_read.vfs_read
4337 ± 7% -47.9% 2261 ± 6% perf-sched.wait_and_delay.count.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
2.86 ± 48% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.__cond_resched.down_read.__mm_populate.vm_mmap_pgoff.ksys_mmap_pgoff
206.38 ±172% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.__cond_resched.shrink_node_memcgs.shrink_node.shrink_zones.constprop
799.81 ± 24% -62.8% 297.34 ±108% perf-sched.wait_and_delay.max.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_cleanup_planes
2688 ± 42% -62.8% 1000 perf-sched.wait_and_delay.max.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
198.42 ±181% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
87.70 ± 27% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.io_schedule.folio_wait_bit_common.write_cache_pages.iomap_writepages
108.24 -67.7% 34.93 ±141% perf-sched.wait_and_delay.max.ms.kswapd_try_to_sleep.kswapd.kthread.ret_from_fork
1386 ± 21% -26.7% 1015 perf-sched.wait_and_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
20.54 ± 18% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
59.91 ± 3% -7.5% 55.43 perf-sched.wait_and_delay.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
484.94 ± 15% -96.4% 17.50 ±223% perf-sched.wait_and_delay.max.ms.schedule_timeout.kswapd_try_to_sleep.kswapd.kthread
20.82 ± 4% +1402.3% 312.82 ± 16% perf-sched.wait_and_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
2686 ± 28% +45.1% 3897 ± 11% perf-sched.wait_and_delay.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
9.41 ±165% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.__alloc_pages.folio_alloc.page_cache_ra_order.filemap_fault
6.01 ±143% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.__alloc_pages.folio_alloc.page_cache_ra_unbounded.filemap_fault
0.48 ± 49% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.__alloc_pages_slowpath.constprop.0.__alloc_pages
0.14 ± 12% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.__get_user_pages.populate_vma_page_range.__mm_populate.vm_mmap_pgoff
11.89 ±142% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.__kmem_cache_alloc_node.__kmalloc.ifs_alloc.isra
28.12 ± 52% -100.0% 0.00 ±223% perf-sched.wait_time.avg.ms.__cond_resched.balance_pgdat.kswapd.kthread.ret_from_fork
1.65 ± 83% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.cgroup_rstat_flush.do_flush_stats.prepare_scan_count.shrink_node
31.34 ± 44% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.cgroup_rstat_flush_locked.cgroup_rstat_flush.do_flush_stats.mem_cgroup_wb_stats
0.15 ± 15% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.down_read.__mm_populate.vm_mmap_pgoff.ksys_mmap_pgoff
0.14 ± 41% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.down_read.page_cache_ra_order.filemap_fault.__do_fault
0.45 ± 98% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.down_read.xfs_ilock_for_iomap.xfs_read_iomap_begin.iomap_iter
0.49 ± 30% +92.1% 0.95 ± 20% perf-sched.wait_time.avg.ms.__cond_resched.loop_process_work.process_one_work.worker_thread.kthread
16.59 ± 84% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.shrink_active_list.shrink_lruvec.shrink_node_memcgs.shrink_node
3.71 ± 98% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.shrink_folio_list.shrink_inactive_list.shrink_lruvec.shrink_node_memcgs
22.01 ±144% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.shrink_lruvec.shrink_node_memcgs.shrink_node.balance_pgdat
2.88 ± 70% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.shrink_lruvec.shrink_node_memcgs.shrink_node.shrink_zones
0.95 ± 50% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.shrink_node_memcgs.shrink_node.shrink_zones.constprop
1.75 ±217% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.shrink_slab.shrink_node_memcgs.shrink_node.shrink_zones
0.10 ± 25% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
0.02 ± 63% -100.0% 0.00 perf-sched.wait_time.avg.ms.d_alloc_parallel.__lookup_slow.walk_component.link_path_walk.part
11.83 ± 6% +614.1% 84.45 ± 33% perf-sched.wait_time.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
26.02 ± 10% +166.5% 69.36 ± 14% perf-sched.wait_time.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
1.07 ± 32% -63.2% 0.40 ± 3% perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
5.52 ± 36% +867.7% 53.45 ± 96% perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
6.45 ±157% -100.0% 0.00 perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_call_function_single
6.30 ± 91% -100.0% 0.00 perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi
0.66 ± 9% +26159.1% 172.08 ± 32% perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
0.06 ± 33% +704.8% 0.44 ± 59% perf-sched.wait_time.avg.ms.io_schedule.folio_wait_bit_common.folio_wait_writeback.__filemap_fdatawait_range
3.29 ± 6% +24.0% 4.08 ± 8% perf-sched.wait_time.avg.ms.rcu_gp_kthread.kthread.ret_from_fork.ret_from_fork_asm
67.30 ± 6% +771.3% 586.35 ± 10% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
8.77 ± 21% +1851.8% 171.10 ± 33% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
22.02 ± 5% +469.1% 125.31 ± 28% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
3.90 ±143% -100.0% 0.00 perf-sched.wait_time.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.xfs_ilock
0.23 ±144% +1155.4% 2.93 ± 70% perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.__flush_workqueue.xlog_cil_push_now.isra
3.97 ± 2% -87.0% 0.52 ± 2% perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
43.47 ± 3% +12.9% 49.09 ± 2% perf-sched.wait_time.avg.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
111.18 ± 16% -89.7% 11.50 ±223% perf-sched.wait_time.avg.ms.schedule_timeout.kswapd_try_to_sleep.kswapd.kthread
4.65 ± 5% +172.1% 12.66 ± 16% perf-sched.wait_time.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
11.82 ± 6% +611.7% 84.14 ± 33% perf-sched.wait_time.avg.ms.syslog_print.do_syslog.kmsg_read.vfs_read
194.53 ± 8% +71.8% 334.21 ± 6% perf-sched.wait_time.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
1.73 ± 35% -100.0% 0.00 perf-sched.wait_time.avg.ms.xlog_wait_on_iclog.xfs_extent_busy_flush.xfs_alloc_ag_vextent_near.xfs_alloc_vextent_near_bno
2.84 ±141% +460.6% 15.94 ± 71% perf-sched.wait_time.avg.ms.xlog_wait_on_iclog.xfs_file_fsync.__do_sys_msync.do_syscall_64
14.04 ±140% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.__alloc_pages.folio_alloc.page_cache_ra_order.filemap_fault
14.40 ±140% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.__alloc_pages.folio_alloc.page_cache_ra_unbounded.filemap_fault
43.33 ± 5% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.__alloc_pages_slowpath.constprop.0.__alloc_pages
0.89 ± 26% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.__get_user_pages.populate_vma_page_range.__mm_populate.vm_mmap_pgoff
14.41 ±139% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.__kmem_cache_alloc_node.__kmalloc.ifs_alloc.isra
335.74 ± 32% -100.0% 0.00 ±223% perf-sched.wait_time.max.ms.__cond_resched.balance_pgdat.kswapd.kthread.ret_from_fork
28.27 ± 70% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.cgroup_rstat_flush.do_flush_stats.prepare_scan_count.shrink_node
44.68 ± 7% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.cgroup_rstat_flush_locked.cgroup_rstat_flush.do_flush_stats.mem_cgroup_wb_stats
2.83 ± 48% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.down_read.__mm_populate.vm_mmap_pgoff.ksys_mmap_pgoff
0.23 ± 54% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.down_read.page_cache_ra_order.filemap_fault.__do_fault
15.76 ±131% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.down_read.xfs_ilock_for_iomap.xfs_read_iomap_begin.iomap_iter
261.27 ± 84% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.shrink_active_list.shrink_lruvec.shrink_node_memcgs.shrink_node
20.86 ± 98% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.shrink_folio_list.shrink_inactive_list.shrink_lruvec.shrink_node_memcgs
114.79 ±142% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.shrink_lruvec.shrink_node_memcgs.shrink_node.balance_pgdat
35.78 ± 44% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.shrink_lruvec.shrink_node_memcgs.shrink_node.shrink_zones
206.37 ±172% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.shrink_node_memcgs.shrink_node.shrink_zones.constprop
7.04 ±218% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.shrink_slab.shrink_node_memcgs.shrink_node.shrink_zones
3.32 ± 38% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
794.55 ± 24% -62.7% 296.03 ±108% perf-sched.wait_time.max.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_cleanup_planes
0.09 ± 49% -100.0% 0.00 perf-sched.wait_time.max.ms.d_alloc_parallel.__lookup_slow.walk_component.link_path_walk.part
2688 ± 42% -62.8% 1000 perf-sched.wait_time.max.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
198.19 ±181% -87.0% 25.83 ± 2% perf-sched.wait_time.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
47.05 ± 5% +48.0% 69.64 ± 61% perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
22.53 ± 99% -100.0% 0.00 perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_call_function_single
38.60 ± 34% -100.0% 0.00 perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi
16.45 ± 39% +4001.0% 674.63 ± 19% perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
108.23 -67.7% 34.92 ±141% perf-sched.wait_time.max.ms.kswapd_try_to_sleep.kswapd.kthread.ret_from_fork
1386 ± 21% -26.7% 1015 perf-sched.wait_time.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
121.93 ±134% -82.0% 21.99 ± 84% perf-sched.wait_time.max.ms.schedule_preempt_disabled.rwsem_down_read_slowpath.down_read.xfs_map_blocks
8.42 ±177% -100.0% 0.00 perf-sched.wait_time.max.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.xfs_ilock
2.00 ±173% +1142.8% 24.86 ± 44% perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.__flush_workqueue.xlog_cil_push_now.isra
20.49 ± 18% -94.7% 1.09 ± 5% perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
59.90 ± 3% -7.5% 55.43 perf-sched.wait_time.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
484.93 ± 15% -96.4% 17.50 ±223% perf-sched.wait_time.max.ms.schedule_timeout.kswapd_try_to_sleep.kswapd.kthread
20.82 ± 4% +1402.6% 312.82 ± 16% perf-sched.wait_time.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
2686 ± 28% +45.1% 3897 ± 11% perf-sched.wait_time.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
1.73 ± 35% -100.0% 0.00 perf-sched.wait_time.max.ms.xlog_wait_on_iclog.xfs_extent_busy_flush.xfs_alloc_ag_vextent_near.xfs_alloc_vextent_near_bno
3.34 ±142% +804.0% 30.21 ± 72% perf-sched.wait_time.max.ms.xlog_wait_on_iclog.xfs_file_fsync.__do_sys_msync.do_syscall_64
11.14 ± 5% -11.1 0.00 perf-profile.calltrace.cycles-pp.memcpy_from_iter_mc.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev
11.12 ± 5% -11.1 0.00 perf-profile.calltrace.cycles-pp.memcpy_orig.memcpy_from_iter_mc.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter
27.39 ± 2% -5.6 21.82 ± 12% perf-profile.calltrace.cycles-pp.do_access
17.96 ± 2% -3.7 14.22 ± 12% perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
13.46 ± 3% -3.1 10.38 ± 10% perf-profile.calltrace.cycles-pp.do_rw_once
12.32 ± 2% -2.4 9.88 ± 11% perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
12.22 ± 2% -2.4 9.78 ± 11% perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
10.75 ± 2% -2.2 8.60 ± 12% perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
9.73 ± 2% -2.0 7.78 ± 12% perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
8.89 ± 2% -1.8 7.14 ± 12% perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
3.51 ± 2% -0.7 2.81 ± 12% perf-profile.calltrace.cycles-pp.do_page_mkwrite.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
3.45 ± 2% -0.7 2.78 ± 12% perf-profile.calltrace.cycles-pp.__xfs_filemap_fault.do_page_mkwrite.do_fault.__handle_mm_fault.handle_mm_fault
1.10 ± 25% -0.6 0.51 ± 72% perf-profile.calltrace.cycles-pp.do_writepages.__writeback_single_inode.writeback_sb_inodes.__writeback_inodes_wb.wb_writeback
1.10 ± 25% -0.6 0.51 ± 72% perf-profile.calltrace.cycles-pp.xfs_vm_writepages.do_writepages.__writeback_single_inode.writeback_sb_inodes.__writeback_inodes_wb
1.10 ± 25% -0.6 0.51 ± 72% perf-profile.calltrace.cycles-pp.wb_workfn.process_one_work.worker_thread.kthread.ret_from_fork
1.10 ± 25% -0.6 0.51 ± 72% perf-profile.calltrace.cycles-pp.wb_do_writeback.wb_workfn.process_one_work.worker_thread.kthread
1.10 ± 25% -0.6 0.51 ± 72% perf-profile.calltrace.cycles-pp.wb_writeback.wb_do_writeback.wb_workfn.process_one_work.worker_thread
1.10 ± 25% -0.6 0.51 ± 72% perf-profile.calltrace.cycles-pp.__writeback_inodes_wb.wb_writeback.wb_do_writeback.wb_workfn.process_one_work
1.10 ± 25% -0.6 0.51 ± 72% perf-profile.calltrace.cycles-pp.writeback_sb_inodes.__writeback_inodes_wb.wb_writeback.wb_do_writeback.wb_workfn
1.10 ± 25% -0.6 0.51 ± 72% perf-profile.calltrace.cycles-pp.__writeback_single_inode.writeback_sb_inodes.__writeback_inodes_wb.wb_writeback.wb_do_writeback
1.10 ± 25% -0.6 0.51 ± 72% perf-profile.calltrace.cycles-pp.iomap_writepages.xfs_vm_writepages.do_writepages.__writeback_single_inode.writeback_sb_inodes
1.10 ± 25% -0.6 0.51 ± 72% perf-profile.calltrace.cycles-pp.write_cache_pages.iomap_writepages.xfs_vm_writepages.do_writepages.__writeback_single_inode
0.86 ± 12% -0.5 0.32 ±100% perf-profile.calltrace.cycles-pp.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags.fault_dirty_shared_page.do_fault
0.86 ± 12% -0.5 0.32 ±100% perf-profile.calltrace.cycles-pp.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags.fault_dirty_shared_page
2.56 ± 2% -0.5 2.02 ± 11% perf-profile.calltrace.cycles-pp.iomap_page_mkwrite.__xfs_filemap_fault.do_page_mkwrite.do_fault.__handle_mm_fault
0.83 ± 13% -0.5 0.31 ±100% perf-profile.calltrace.cycles-pp.schedule.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
0.83 ± 13% -0.5 0.31 ±100% perf-profile.calltrace.cycles-pp.__schedule.schedule.schedule_timeout.io_schedule_timeout.balance_dirty_pages
1.55 ± 7% -0.4 1.18 ± 15% perf-profile.calltrace.cycles-pp.balance_dirty_pages.balance_dirty_pages_ratelimited_flags.fault_dirty_shared_page.do_fault.__handle_mm_fault
0.62 ± 5% -0.3 0.28 ±100% perf-profile.calltrace.cycles-pp.filemap_get_entry.__filemap_get_folio.filemap_fault.__do_fault.do_fault
0.82 ± 5% -0.3 0.56 ± 46% perf-profile.calltrace.cycles-pp.sync_regs.asm_exc_page_fault.do_access
1.26 -0.3 1.01 ± 11% perf-profile.calltrace.cycles-pp.iomap_iter.iomap_page_mkwrite.__xfs_filemap_fault.do_page_mkwrite.do_fault
0.98 -0.2 0.79 ± 13% perf-profile.calltrace.cycles-pp.xfs_buffered_write_iomap_begin.iomap_iter.iomap_page_mkwrite.__xfs_filemap_fault.do_page_mkwrite
0.66 ± 5% -0.2 0.46 ± 45% perf-profile.calltrace.cycles-pp.__filemap_get_folio.filemap_fault.__do_fault.do_fault.__handle_mm_fault
0.66 ± 5% -0.2 0.47 ± 45% perf-profile.calltrace.cycles-pp.finish_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
1.30 ± 9% +0.2 1.54 ± 10% perf-profile.calltrace.cycles-pp.rebalance_domains.__do_softirq.__irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
0.37 ± 70% +0.3 0.65 ± 4% perf-profile.calltrace.cycles-pp.__intel_pmu_enable_all.perf_adjust_freq_unthr_context.perf_event_task_tick.scheduler_tick.update_process_times
1.60 ± 4% +0.3 1.94 ± 4% perf-profile.calltrace.cycles-pp.perf_adjust_freq_unthr_context.perf_event_task_tick.scheduler_tick.update_process_times.tick_sched_handle
1.64 ± 4% +0.3 1.98 ± 4% perf-profile.calltrace.cycles-pp.perf_event_task_tick.scheduler_tick.update_process_times.tick_sched_handle.tick_sched_timer
3.00 ± 6% +0.5 3.48 ± 7% perf-profile.calltrace.cycles-pp.scheduler_tick.update_process_times.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues
0.09 ±223% +0.5 0.58 ± 10% perf-profile.calltrace.cycles-pp.load_balance.rebalance_domains.__do_softirq.__irq_exit_rcu.sysvec_apic_timer_interrupt
0.46 ± 74% +0.5 1.00 ± 25% perf-profile.calltrace.cycles-pp.folio_mkclean.folio_clear_dirty_for_io.write_cache_pages.iomap_writepages.xfs_vm_writepages
0.44 ± 74% +0.5 0.97 ± 25% perf-profile.calltrace.cycles-pp.page_mkclean_one.rmap_walk_file.folio_mkclean.folio_clear_dirty_for_io.write_cache_pages
0.42 ± 74% +0.5 0.96 ± 25% perf-profile.calltrace.cycles-pp.page_vma_mkclean_one.page_mkclean_one.rmap_walk_file.folio_mkclean.folio_clear_dirty_for_io
0.44 ± 74% +0.5 0.98 ± 25% perf-profile.calltrace.cycles-pp.rmap_walk_file.folio_mkclean.folio_clear_dirty_for_io.write_cache_pages.iomap_writepages
0.59 ± 50% +0.6 1.17 ± 21% perf-profile.calltrace.cycles-pp.folio_clear_dirty_for_io.write_cache_pages.iomap_writepages.xfs_vm_writepages.do_writepages
4.49 ± 5% +0.7 5.15 ± 8% perf-profile.calltrace.cycles-pp.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
0.36 ± 70% +0.7 1.02 ± 25% perf-profile.calltrace.cycles-pp.drm_fb_helper_damage_work.process_one_work.worker_thread.kthread.ret_from_fork
0.36 ± 70% +0.7 1.02 ± 25% perf-profile.calltrace.cycles-pp.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work.process_one_work.worker_thread.kthread
0.09 ±223% +0.7 0.80 ± 27% perf-profile.calltrace.cycles-pp.ptep_clear_flush.page_vma_mkclean_one.page_mkclean_one.rmap_walk_file.folio_mkclean
0.17 ±141% +0.7 0.90 ± 40% perf-profile.calltrace.cycles-pp.serial8250_console_write.console_flush_all.console_unlock.vprintk_emit.devkmsg_emit
0.18 ±141% +0.8 0.95 ± 39% perf-profile.calltrace.cycles-pp.console_flush_all.console_unlock.vprintk_emit.devkmsg_emit.devkmsg_write
0.18 ±141% +0.8 0.95 ± 39% perf-profile.calltrace.cycles-pp.console_unlock.vprintk_emit.devkmsg_emit.devkmsg_write.vfs_write
0.18 ±141% +0.8 0.96 ± 38% perf-profile.calltrace.cycles-pp.vprintk_emit.devkmsg_emit.devkmsg_write.vfs_write.ksys_write
0.18 ±141% +0.8 0.96 ± 38% perf-profile.calltrace.cycles-pp.devkmsg_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.18 ±141% +0.8 0.96 ± 38% perf-profile.calltrace.cycles-pp.devkmsg_emit.devkmsg_write.vfs_write.ksys_write.do_syscall_64
0.19 ±141% +0.8 0.97 ± 37% perf-profile.calltrace.cycles-pp.write
0.18 ±141% +0.8 0.97 ± 37% perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
0.18 ±141% +0.8 0.97 ± 37% perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
0.18 ±141% +0.8 0.97 ± 37% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.write
0.18 ±141% +0.8 0.97 ± 37% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
0.09 ±223% +0.8 0.93 ± 25% perf-profile.calltrace.cycles-pp.memcpy_toio.drm_fb_memcpy.ast_primary_plane_helper_atomic_update.drm_atomic_helper_commit_planes.drm_atomic_helper_commit_tail_rpm
0.09 ±223% +0.9 0.94 ± 25% perf-profile.calltrace.cycles-pp.drm_atomic_helper_commit_planes.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail.commit_tail.drm_atomic_helper_commit
0.09 ±223% +0.9 0.94 ± 25% perf-profile.calltrace.cycles-pp.ast_primary_plane_helper_atomic_update.drm_atomic_helper_commit_planes.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail.commit_tail
0.09 ±223% +0.9 0.94 ± 25% perf-profile.calltrace.cycles-pp.drm_fb_memcpy.ast_primary_plane_helper_atomic_update.drm_atomic_helper_commit_planes.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail
0.09 ±223% +0.9 0.94 ± 25% perf-profile.calltrace.cycles-pp.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail.commit_tail.drm_atomic_helper_commit.drm_atomic_commit
0.09 ±223% +0.9 0.95 ± 25% perf-profile.calltrace.cycles-pp.commit_tail.drm_atomic_helper_commit.drm_atomic_commit.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty
0.09 ±223% +0.9 0.95 ± 25% perf-profile.calltrace.cycles-pp.ast_mode_config_helper_atomic_commit_tail.commit_tail.drm_atomic_helper_commit.drm_atomic_commit.drm_atomic_helper_dirtyfb
0.09 ±223% +0.9 0.96 ± 24% perf-profile.calltrace.cycles-pp.drm_atomic_helper_commit.drm_atomic_commit.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work
0.09 ±223% +0.9 0.96 ± 24% perf-profile.calltrace.cycles-pp.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work.process_one_work.worker_thread
0.09 ±223% +0.9 0.96 ± 24% perf-profile.calltrace.cycles-pp.drm_atomic_commit.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work.process_one_work
0.18 ±141% +1.1 1.26 ± 36% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
0.18 ±141% +1.1 1.26 ± 36% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +1.1 1.12 ± 38% perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap
0.00 +1.1 1.15 ± 38% perf-profile.calltrace.cycles-pp.unmap_vmas.exit_mmap.__mmput.exit_mm.do_exit
0.00 +1.1 1.15 ± 38% perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.exit_mmap.__mmput.exit_mm
0.00 +1.1 1.15 ± 38% perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap.__mmput
0.00 +1.2 1.18 ± 37% perf-profile.calltrace.cycles-pp.__mmput.exit_mm.do_exit.do_group_exit.__x64_sys_exit_group
0.00 +1.2 1.18 ± 37% perf-profile.calltrace.cycles-pp.exit_mmap.__mmput.exit_mm.do_exit.do_group_exit
0.00 +1.2 1.18 ± 37% perf-profile.calltrace.cycles-pp.exit_mm.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
0.00 +1.2 1.21 ± 37% perf-profile.calltrace.cycles-pp.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +1.2 1.21 ± 37% perf-profile.calltrace.cycles-pp.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +1.2 1.21 ± 37% perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
6.62 ± 7% +1.2 7.85 ± 8% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
9.43 ± 6% +1.7 11.11 ± 6% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_safe_halt
9.45 ± 6% +1.7 11.13 ± 6% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_safe_halt.acpi_idle_enter
0.00 +1.9 1.95 ± 28% perf-profile.calltrace.cycles-pp.iomap_writepages.xfs_vm_writepages.do_writepages.filemap_fdatawrite_wbc.__filemap_fdatawrite_range
0.00 +1.9 1.95 ± 28% perf-profile.calltrace.cycles-pp.write_cache_pages.iomap_writepages.xfs_vm_writepages.do_writepages.filemap_fdatawrite_wbc
0.00 +2.0 1.98 ± 30% perf-profile.calltrace.cycles-pp.do_writepages.filemap_fdatawrite_wbc.__filemap_fdatawrite_range.file_write_and_wait_range.xfs_file_fsync
0.00 +2.0 1.98 ± 30% perf-profile.calltrace.cycles-pp.xfs_vm_writepages.do_writepages.filemap_fdatawrite_wbc.__filemap_fdatawrite_range.file_write_and_wait_range
0.00 +2.0 1.98 ± 30% perf-profile.calltrace.cycles-pp.__filemap_fdatawrite_range.file_write_and_wait_range.xfs_file_fsync.__do_sys_msync.do_syscall_64
0.00 +2.0 1.98 ± 30% perf-profile.calltrace.cycles-pp.filemap_fdatawrite_wbc.__filemap_fdatawrite_range.file_write_and_wait_range.xfs_file_fsync.__do_sys_msync
0.00 +2.0 2.01 ± 30% perf-profile.calltrace.cycles-pp.file_write_and_wait_range.xfs_file_fsync.__do_sys_msync.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +2.0 2.02 ± 30% perf-profile.calltrace.cycles-pp.__do_sys_msync.do_syscall_64.entry_SYSCALL_64_after_hwframe.msync
0.00 +2.0 2.02 ± 30% perf-profile.calltrace.cycles-pp.xfs_file_fsync.__do_sys_msync.do_syscall_64.entry_SYSCALL_64_after_hwframe.msync
0.00 +2.0 2.02 ± 30% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.msync
0.00 +2.0 2.02 ± 30% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.msync
0.00 +2.0 2.02 ± 30% perf-profile.calltrace.cycles-pp.msync
13.56 ± 4% +2.1 15.67 ± 6% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state
15.08 ± 4% +2.5 17.59 ± 7% perf-profile.calltrace.cycles-pp.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
16.61 ± 3% +2.7 19.34 ± 5% perf-profile.calltrace.cycles-pp.ret_from_fork_asm
16.61 ± 3% +2.7 19.34 ± 5% perf-profile.calltrace.cycles-pp.ret_from_fork.ret_from_fork_asm
16.61 ± 3% +2.7 19.34 ± 5% perf-profile.calltrace.cycles-pp.kthread.ret_from_fork.ret_from_fork_asm
14.27 ± 5% +2.8 17.05 ± 5% perf-profile.calltrace.cycles-pp.loop_process_work.process_one_work.worker_thread.kthread.ret_from_fork
14.06 ± 5% +2.8 16.85 ± 5% perf-profile.calltrace.cycles-pp.do_iter_write.lo_write_simple.loop_process_work.process_one_work.worker_thread
14.20 ± 5% +2.8 16.99 ± 5% perf-profile.calltrace.cycles-pp.lo_write_simple.loop_process_work.process_one_work.worker_thread.kthread
16.33 ± 3% +2.8 19.12 ± 5% perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
13.91 ± 5% +2.8 16.71 ± 5% perf-profile.calltrace.cycles-pp.do_iter_readv_writev.do_iter_write.lo_write_simple.loop_process_work.process_one_work
16.19 ± 3% +2.8 19.00 ± 4% perf-profile.calltrace.cycles-pp.process_one_work.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
13.30 ± 5% +2.8 16.12 ± 5% perf-profile.calltrace.cycles-pp.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev.do_iter_write.lo_write_simple
13.74 ± 4% +2.8 16.56 ± 5% perf-profile.calltrace.cycles-pp.shmem_file_write_iter.do_iter_readv_writev.do_iter_write.lo_write_simple.loop_process_work
11.27 ± 5% +3.7 14.93 ± 5% perf-profile.calltrace.cycles-pp.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev.do_iter_write
11.14 ± 5% -11.1 0.00 perf-profile.children.cycles-pp.memcpy_from_iter_mc
11.20 ± 5% -11.1 0.10 ± 16% perf-profile.children.cycles-pp.memcpy_orig
26.79 ± 2% -5.6 21.19 ± 12% perf-profile.children.cycles-pp.do_access
14.94 -3.4 11.59 ± 11% perf-profile.children.cycles-pp.do_rw_once
15.77 ± 2% -3.2 12.54 ± 11% perf-profile.children.cycles-pp.asm_exc_page_fault
12.39 ± 2% -2.4 9.94 ± 11% perf-profile.children.cycles-pp.exc_page_fault
12.30 ± 2% -2.4 9.87 ± 11% perf-profile.children.cycles-pp.do_user_addr_fault
10.82 ± 2% -2.1 8.67 ± 12% perf-profile.children.cycles-pp.handle_mm_fault
9.79 ± 2% -1.9 7.84 ± 12% perf-profile.children.cycles-pp.__handle_mm_fault
8.93 ± 2% -1.8 7.17 ± 12% perf-profile.children.cycles-pp.do_fault
1.06 ± 10% -0.9 0.20 ± 12% perf-profile.children.cycles-pp.shmem_write_end
1.06 ± 10% -0.8 0.23 ± 17% perf-profile.children.cycles-pp.folio_unlock
3.52 ± 2% -0.7 2.82 ± 12% perf-profile.children.cycles-pp.do_page_mkwrite
3.51 -0.7 2.82 ± 12% perf-profile.children.cycles-pp.__xfs_filemap_fault
2.58 ± 2% -0.5 2.03 ± 11% perf-profile.children.cycles-pp.iomap_page_mkwrite
1.10 ± 25% -0.4 0.66 ± 26% perf-profile.children.cycles-pp.wb_workfn
1.10 ± 25% -0.4 0.66 ± 26% perf-profile.children.cycles-pp.wb_do_writeback
1.10 ± 25% -0.4 0.66 ± 26% perf-profile.children.cycles-pp.wb_writeback
1.10 ± 25% -0.4 0.66 ± 26% perf-profile.children.cycles-pp.__writeback_inodes_wb
1.10 ± 25% -0.4 0.66 ± 26% perf-profile.children.cycles-pp.writeback_sb_inodes
1.10 ± 25% -0.4 0.66 ± 26% perf-profile.children.cycles-pp.__writeback_single_inode
1.55 ± 7% -0.4 1.19 ± 15% perf-profile.children.cycles-pp.balance_dirty_pages
0.89 ± 13% -0.3 0.55 ± 24% perf-profile.children.cycles-pp.schedule_timeout
0.86 ± 12% -0.3 0.52 ± 24% perf-profile.children.cycles-pp.io_schedule_timeout
1.27 -0.3 1.02 ± 11% perf-profile.children.cycles-pp.iomap_iter
0.86 ± 5% -0.2 0.65 ± 16% perf-profile.children.cycles-pp.sync_regs
0.99 -0.2 0.80 ± 13% perf-profile.children.cycles-pp.xfs_buffered_write_iomap_begin
0.69 ± 8% -0.1 0.55 ± 7% perf-profile.children.cycles-pp.iomap_dirty_folio
0.64 ± 6% -0.1 0.51 ± 8% perf-profile.children.cycles-pp.__perf_sw_event
0.56 ± 4% -0.1 0.43 ± 12% perf-profile.children.cycles-pp.filemap_dirty_folio
0.66 ± 5% -0.1 0.54 ± 13% perf-profile.children.cycles-pp.__filemap_get_folio
0.49 ± 8% -0.1 0.37 ± 6% perf-profile.children.cycles-pp.ifs_set_range_dirty
0.68 ± 4% -0.1 0.56 ± 9% perf-profile.children.cycles-pp.finish_fault
0.42 ± 6% -0.1 0.33 ± 13% perf-profile.children.cycles-pp.lock_mm_and_find_vma
0.52 ± 7% -0.1 0.43 ± 6% perf-profile.children.cycles-pp.___perf_sw_event
0.48 ± 6% -0.1 0.40 ± 9% perf-profile.children.cycles-pp.lock_vma_under_rcu
0.34 ± 7% -0.1 0.26 ± 15% perf-profile.children.cycles-pp.xfs_ilock
0.27 ± 6% -0.1 0.21 ± 15% perf-profile.children.cycles-pp.handle_pte_fault
0.19 ± 4% -0.0 0.14 ± 17% perf-profile.children.cycles-pp.down_read_trylock
0.22 ± 4% -0.0 0.17 ± 12% perf-profile.children.cycles-pp.iomap_iter_advance
0.17 ± 7% -0.0 0.13 ± 17% perf-profile.children.cycles-pp.pte_offset_map_nolock
0.17 ± 13% -0.0 0.13 ± 16% perf-profile.children.cycles-pp.__folio_end_writeback
0.15 ± 10% -0.0 0.11 ± 13% perf-profile.children.cycles-pp.run_timer_softirq
0.19 ± 7% -0.0 0.16 ± 11% perf-profile.children.cycles-pp.__pte_offset_map_lock
0.14 ± 11% -0.0 0.10 ± 13% perf-profile.children.cycles-pp.call_timer_fn
0.14 ± 12% -0.0 0.10 ± 10% perf-profile.children.cycles-pp.down_read
0.09 ± 10% -0.0 0.07 ± 16% perf-profile.children.cycles-pp.dequeue_task_fair
0.06 ± 11% +0.0 0.08 ± 6% perf-profile.children.cycles-pp.update_rq_clock
0.06 ± 17% +0.0 0.09 ± 10% perf-profile.children.cycles-pp.x86_pmu_disable
0.05 ± 46% +0.0 0.08 ± 7% perf-profile.children.cycles-pp.timerqueue_add
0.07 ± 16% +0.0 0.10 ± 9% perf-profile.children.cycles-pp.enqueue_hrtimer
0.18 ± 6% +0.0 0.22 ± 7% perf-profile.children.cycles-pp.read_tsc
0.29 ± 7% +0.0 0.33 ± 8% perf-profile.children.cycles-pp.lapic_next_deadline
0.04 ± 75% +0.1 0.10 ± 18% perf-profile.children.cycles-pp.__folio_start_writeback
0.18 ± 10% +0.1 0.23 ± 20% perf-profile.children.cycles-pp.__mod_lruvec_page_state
0.10 ± 14% +0.1 0.16 ± 28% perf-profile.children.cycles-pp._find_next_bit
0.10 ± 16% +0.1 0.19 ± 14% perf-profile.children.cycles-pp.folio_mark_accessed
0.00 +0.1 0.09 ± 41% perf-profile.children.cycles-pp.free_swap_cache
0.07 ± 12% +0.1 0.17 ± 37% perf-profile.children.cycles-pp._compound_head
0.00 +0.1 0.10 ± 46% perf-profile.children.cycles-pp.free_pages_and_swap_cache
0.03 ± 70% +0.1 0.14 ± 31% perf-profile.children.cycles-pp.release_pages
0.08 ± 16% +0.1 0.19 ± 33% perf-profile.children.cycles-pp.wait_for_xmitr
0.00 +0.1 0.13 ± 43% perf-profile.children.cycles-pp.io_schedule
0.06 ± 50% +0.2 0.22 ± 39% perf-profile.children.cycles-pp.tlb_batch_pages_flush
0.10 ± 9% +0.2 0.26 ± 29% perf-profile.children.cycles-pp.asm_sysvec_call_function_single
1.31 ± 9% +0.2 1.55 ± 10% perf-profile.children.cycles-pp.rebalance_domains
0.15 ± 22% +0.3 0.48 ± 39% perf-profile.children.cycles-pp.page_remove_rmap
1.65 ± 4% +0.3 2.00 ± 3% perf-profile.children.cycles-pp.perf_adjust_freq_unthr_context
1.67 ± 3% +0.3 2.02 ± 4% perf-profile.children.cycles-pp.perf_event_task_tick
0.16 ± 33% +0.4 0.52 ± 35% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
0.08 ± 19% +0.4 0.46 ± 26% perf-profile.children.cycles-pp.native_flush_tlb_one_user
0.00 +0.4 0.39 ± 49% perf-profile.children.cycles-pp.prepare_to_wait_exclusive
0.10 ± 16% +0.4 0.52 ± 27% perf-profile.children.cycles-pp.flush_tlb_func
0.54 ± 13% +0.4 0.98 ± 36% perf-profile.children.cycles-pp.wait_for_lsr
0.43 ± 22% +0.5 0.88 ± 19% perf-profile.children.cycles-pp.flush_tlb_mm_range
0.56 ± 14% +0.5 1.03 ± 37% perf-profile.children.cycles-pp.serial8250_console_write
0.47 ± 9% +0.5 0.94 ± 25% perf-profile.children.cycles-pp.drm_atomic_helper_commit_planes
0.47 ± 9% +0.5 0.94 ± 25% perf-profile.children.cycles-pp.ast_primary_plane_helper_atomic_update
0.47 ± 9% +0.5 0.94 ± 25% perf-profile.children.cycles-pp.drm_fb_memcpy
0.47 ± 9% +0.5 0.94 ± 25% perf-profile.children.cycles-pp.memcpy_toio
0.47 ± 9% +0.5 0.94 ± 25% perf-profile.children.cycles-pp.drm_atomic_helper_commit_tail_rpm
0.48 ± 8% +0.5 0.95 ± 25% perf-profile.children.cycles-pp.commit_tail
0.48 ± 8% +0.5 0.95 ± 25% perf-profile.children.cycles-pp.ast_mode_config_helper_atomic_commit_tail
0.48 ± 8% +0.5 0.96 ± 24% perf-profile.children.cycles-pp.drm_atomic_helper_commit
0.48 ± 8% +0.5 0.96 ± 24% perf-profile.children.cycles-pp.drm_atomic_helper_dirtyfb
0.48 ± 8% +0.5 0.96 ± 24% perf-profile.children.cycles-pp.drm_atomic_commit
0.49 ± 12% +0.5 0.97 ± 37% perf-profile.children.cycles-pp.write
0.48 ± 11% +0.5 0.97 ± 37% perf-profile.children.cycles-pp.vfs_write
0.48 ± 11% +0.5 0.98 ± 37% perf-profile.children.cycles-pp.ksys_write
0.47 ± 11% +0.5 0.96 ± 38% perf-profile.children.cycles-pp.devkmsg_write
0.47 ± 11% +0.5 0.96 ± 38% perf-profile.children.cycles-pp.devkmsg_emit
3.06 ± 6% +0.5 3.56 ± 7% perf-profile.children.cycles-pp.scheduler_tick
0.59 ± 14% +0.5 1.09 ± 36% perf-profile.children.cycles-pp.console_unlock
0.59 ± 14% +0.5 1.09 ± 36% perf-profile.children.cycles-pp.console_flush_all
0.52 ± 22% +0.5 1.02 ± 19% perf-profile.children.cycles-pp.ptep_clear_flush
0.59 ± 14% +0.5 1.10 ± 35% perf-profile.children.cycles-pp.vprintk_emit
0.00 +0.5 0.51 ± 47% perf-profile.children.cycles-pp.__rq_qos_throttle
0.00 +0.5 0.51 ± 47% perf-profile.children.cycles-pp.wbt_wait
0.00 +0.5 0.51 ± 47% perf-profile.children.cycles-pp.rq_qos_wait
0.51 ± 8% +0.5 1.02 ± 25% perf-profile.children.cycles-pp.drm_fb_helper_damage_work
0.51 ± 8% +0.5 1.02 ± 25% perf-profile.children.cycles-pp.drm_fbdev_generic_helper_fb_dirty
0.21 ± 24% +0.5 0.73 ± 31% perf-profile.children.cycles-pp.iomap_add_to_ioend
0.00 +0.5 0.52 ± 47% perf-profile.children.cycles-pp.blk_mq_get_new_requests
0.73 ± 23% +0.6 1.28 ± 16% perf-profile.children.cycles-pp.page_vma_mkclean_one
0.74 ± 23% +0.6 1.29 ± 17% perf-profile.children.cycles-pp.page_mkclean_one
0.00 +0.6 0.56 ± 46% perf-profile.children.cycles-pp.submit_bio_noacct_nocheck
0.00 +0.6 0.56 ± 46% perf-profile.children.cycles-pp.blk_mq_submit_bio
0.78 ± 24% +0.6 1.34 ± 17% perf-profile.children.cycles-pp.folio_mkclean
0.76 ± 23% +0.6 1.33 ± 17% perf-profile.children.cycles-pp.rmap_walk_file
0.41 ± 23% +0.6 1.02 ± 24% perf-profile.children.cycles-pp.iomap_writepage_map
0.85 ± 23% +0.6 1.46 ± 17% perf-profile.children.cycles-pp.folio_clear_dirty_for_io
4.57 ± 5% +0.7 5.25 ± 8% perf-profile.children.cycles-pp.tick_sched_timer
0.35 ± 20% +0.8 1.15 ± 38% perf-profile.children.cycles-pp.unmap_vmas
0.35 ± 20% +0.8 1.15 ± 38% perf-profile.children.cycles-pp.unmap_page_range
0.34 ± 20% +0.8 1.15 ± 38% perf-profile.children.cycles-pp.zap_pmd_range
0.34 ± 20% +0.8 1.15 ± 38% perf-profile.children.cycles-pp.zap_pte_range
0.38 ± 19% +0.8 1.19 ± 37% perf-profile.children.cycles-pp.__mmput
0.38 ± 19% +0.8 1.19 ± 37% perf-profile.children.cycles-pp.exit_mmap
0.37 ? 20% +0.8 1.18 ? 37% perf-profile.children.cycles-pp.exit_mm
0.39 ? 19% +0.8 1.21 ? 37% perf-profile.children.cycles-pp.__x64_sys_exit_group
0.39 ? 19% +0.8 1.21 ? 37% perf-profile.children.cycles-pp.do_group_exit
0.39 ? 19% +0.8 1.21 ? 37% perf-profile.children.cycles-pp.do_exit
6.74 ? 6% +1.3 8.00 ? 8% perf-profile.children.cycles-pp.__hrtimer_run_queues
1.35 ? 23% +1.3 2.61 ? 19% perf-profile.children.cycles-pp.iomap_writepages
1.35 ? 23% +1.3 2.61 ? 19% perf-profile.children.cycles-pp.write_cache_pages
1.35 ? 23% +1.3 2.64 ? 20% perf-profile.children.cycles-pp.do_writepages
1.35 ? 23% +1.3 2.64 ? 20% perf-profile.children.cycles-pp.xfs_vm_writepages
9.59 ? 6% +1.7 11.30 ? 6% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
9.57 ? 6% +1.7 11.28 ? 6% perf-profile.children.cycles-pp.hrtimer_interrupt
0.25 ? 18% +1.7 1.98 ? 30% perf-profile.children.cycles-pp.__filemap_fdatawrite_range
0.25 ? 18% +1.7 1.98 ? 30% perf-profile.children.cycles-pp.filemap_fdatawrite_wbc
0.27 ? 17% +1.7 2.02 ? 30% perf-profile.children.cycles-pp.__do_sys_msync
0.27 ? 17% +1.7 2.02 ? 30% perf-profile.children.cycles-pp.xfs_file_fsync
0.26 ? 16% +1.8 2.01 ? 30% perf-profile.children.cycles-pp.file_write_and_wait_range
0.27 ? 17% +1.8 2.02 ? 30% perf-profile.children.cycles-pp.msync
13.72 ? 4% +2.1 15.87 ? 6% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
16.61 ? 3% +2.7 19.34 ? 5% perf-profile.children.cycles-pp.kthread
16.61 ? 3% +2.7 19.34 ? 5% perf-profile.children.cycles-pp.ret_from_fork_asm
16.61 ? 3% +2.7 19.34 ? 5% perf-profile.children.cycles-pp.ret_from_fork
14.27 ? 5% +2.8 17.05 ? 5% perf-profile.children.cycles-pp.loop_process_work
14.06 ? 5% +2.8 16.85 ? 5% perf-profile.children.cycles-pp.do_iter_write
14.20 ? 5% +2.8 16.99 ? 5% perf-profile.children.cycles-pp.lo_write_simple
16.33 ? 3% +2.8 19.12 ? 5% perf-profile.children.cycles-pp.worker_thread
13.92 ? 5% +2.8 16.72 ? 5% perf-profile.children.cycles-pp.do_iter_readv_writev
16.19 ? 3% +2.8 19.00 ? 4% perf-profile.children.cycles-pp.process_one_work
13.32 ? 5% +2.8 16.13 ? 5% perf-profile.children.cycles-pp.generic_perform_write
13.76 ? 4% +2.8 16.57 ? 5% perf-profile.children.cycles-pp.shmem_file_write_iter
1.50 ? 9% +3.1 4.60 ? 29% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
1.50 ? 9% +3.1 4.60 ? 29% perf-profile.children.cycles-pp.do_syscall_64
11.28 ? 5% +3.7 14.94 ? 5% perf-profile.children.cycles-pp.copy_page_from_iter_atomic
11.10 ? 5% -11.0 0.10 ? 16% perf-profile.self.cycles-pp.memcpy_orig
12.38 -2.8 9.61 ? 11% perf-profile.self.cycles-pp.do_rw_once
7.73 ? 4% -1.8 5.95 ? 11% perf-profile.self.cycles-pp.do_access
1.05 ? 10% -0.8 0.21 ? 11% perf-profile.self.cycles-pp.folio_unlock
0.85 ? 5% -0.2 0.65 ? 15% perf-profile.self.cycles-pp.sync_regs
0.56 ? 5% -0.1 0.42 ? 10% perf-profile.self.cycles-pp.__handle_mm_fault
0.38 ? 10% -0.1 0.30 ? 11% perf-profile.self.cycles-pp.handle_mm_fault
0.36 ? 3% -0.1 0.28 ? 17% perf-profile.self.cycles-pp.iomap_page_mkwrite
0.46 ? 7% -0.1 0.38 ? 6% perf-profile.self.cycles-pp.___perf_sw_event
0.26 ? 3% -0.1 0.20 ? 15% perf-profile.self.cycles-pp.filemap_fault
0.14 ? 9% -0.1 0.09 ? 21% perf-profile.self.cycles-pp.ifs_set_range_dirty
0.18 ? 8% -0.0 0.14 ? 12% perf-profile.self.cycles-pp.asm_exc_page_fault
0.11 ? 4% -0.0 0.07 ? 20% perf-profile.self.cycles-pp.do_fault
0.23 ? 7% -0.0 0.18 ? 11% perf-profile.self.cycles-pp.xfs_buffered_write_iomap_begin
0.21 ? 4% -0.0 0.17 ? 14% perf-profile.self.cycles-pp.iomap_iter_advance
0.19 ? 5% -0.0 0.14 ? 17% perf-profile.self.cycles-pp.down_read_trylock
0.17 ? 8% -0.0 0.13 ? 7% perf-profile.self.cycles-pp.filemap_dirty_folio
0.16 ? 12% -0.0 0.12 ? 11% perf-profile.self.cycles-pp.error_entry
0.10 ? 15% -0.0 0.07 ? 11% perf-profile.self.cycles-pp.down_read
0.11 ? 10% -0.0 0.09 ? 15% perf-profile.self.cycles-pp.generic_perform_write
0.06 ? 47% +0.0 0.08 ? 8% perf-profile.self.cycles-pp.x86_pmu_disable
0.08 ? 19% +0.0 0.11 ? 11% perf-profile.self.cycles-pp.folio_mark_accessed
0.18 ? 7% +0.0 0.21 ? 8% perf-profile.self.cycles-pp.read_tsc
0.08 ? 47% +0.1 0.14 ? 18% perf-profile.self.cycles-pp.ptep_clear_flush
0.08 ? 12% +0.1 0.14 ? 25% perf-profile.self.cycles-pp._find_next_bit
0.18 ? 7% +0.1 0.26 ? 17% perf-profile.self.cycles-pp.asm_sysvec_apic_timer_interrupt
0.00 +0.1 0.09 ? 40% perf-profile.self.cycles-pp.free_swap_cache
0.06 ? 11% +0.1 0.16 ? 38% perf-profile.self.cycles-pp._compound_head
0.02 ? 99% +0.1 0.13 ? 32% perf-profile.self.cycles-pp.release_pages
0.06 ? 46% +0.1 0.17 ? 46% perf-profile.self.cycles-pp.zap_pte_range
0.91 ? 5% +0.2 1.09 ? 4% perf-profile.self.cycles-pp.perf_adjust_freq_unthr_context
0.12 ? 21% +0.3 0.39 ? 42% perf-profile.self.cycles-pp.page_remove_rmap
0.16 ? 33% +0.4 0.52 ? 35% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
0.08 ? 19% +0.4 0.46 ? 26% perf-profile.self.cycles-pp.native_flush_tlb_one_user
0.47 ? 8% +0.4 0.90 ? 28% perf-profile.self.cycles-pp.memcpy_toio
0.14 ? 10% +14.7 14.83 ? 5% perf-profile.self.cycles-pp.copy_page_from_iter_atomic




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


2023-11-15 12:48:56

by David Howells

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

I can't run the test program:

andromeda1# bin/lkp run ./job-300s-256G-msync.yaml
grep: /root/lkp-tests/hosts/andromeda.procyon.org.uk: No such file or directory
/root/lkp-tests/bin/run-local:121:in `<main>': undefined method `chomp' for nil:NilClass (NoMethodError)

job['memory'] ||= `grep -w '^memory:' #{LKP_SRC}/hosts/#{HOSTNAME}`.split(' ')[1].chomp
^^^^^^

David

2023-11-15 13:19:19

by David Howells

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

Is there some way to run the testcase without it downloading and building its
own kernel, but rather just use the one running on the box?

David

2023-11-15 15:20:55

by David Howells

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

Okay, I've got this to work - kind of. Your test box has a lot more RAM than
mine (192G according to the email), so I had to reduce the sizes and make it
delete the files between tests. I ended up using the attached script to run
things. I don't see the statistical analysis stuff.

Anyway, with upstream Linus, I see something like:

Count: 27
Total: 10649173
Range: 391374...398472
Mean : 394413
Stdev: 10218

With that patch reverted, I see something like:

Count: 27
Total: 10665161
Range: 391427...399601
Mean : 395005
Stdev: 13720

But the outcome is a bit variable and the result spaces overlap considerably.
I certainly don't see a 17% performance reduction. Now, this may be due to
hardware differences. The CPU I'm using is an Intel i3-4170 - which is a few
years old at this point.

David
---
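# Pin every online CPU's cpufreq governor to "performance", then run 27
# usemem msync passes over 10G sparse files, deleting the files between
# batches (this box has much less RAM than the test robot's).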
for cpu_dir in /sys/devices/system/cpu/cpu[0-9]*
do
online_file="$cpu_dir"/online
[ -f "$online_file" ] && [ "$(cat "$online_file")" -eq 0 ] && continue

file="$cpu_dir"/cpufreq/scaling_governor
[ -f "$file" ] && echo "performance" > "$file"
done

#DATADIR=/mnt2/vm-scalability-tmp
#WORKDIR=$DATADIR/vm-scalability
#WORKDIR=/mnt2/vm-scalability
WORKDIR=/tmp/vm-scalability

cd /root/lkp-tests/pkg/vm-scalability/vm-scalability-lkp/lkp/benchmarks/vm-scalability
#mount -t tmpfs -o size=100% vm-scalability-tmp $DATADIR
#mkdir -p $DATADIR || exit $?
#truncate -s 10G $WORKDIR.img || exit $?
#mkfs.xfs -f -q $WORKDIR.img || exit $?
mkdir -p $WORKDIR || exit $?
#mount -o loop $WORKDIR.img $WORKDIR || exit $?
#./case-msync

truncate $WORKDIR/sparse-msync-1 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-1 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-1

truncate $WORKDIR/sparse-msync-2 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-2 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-2

truncate $WORKDIR/sparse-msync-3 -s 10G
truncate $WORKDIR/sparse-msync-4 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-3 -F --prealloc --open-rw 449340754
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-4 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-[34]

truncate $WORKDIR/sparse-msync-5 -s 10G
truncate $WORKDIR/sparse-msync-6 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-5 -F --prealloc --open-rw 449340754
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-6 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-[56]

truncate $WORKDIR/sparse-msync-7 -s 10G
truncate $WORKDIR/sparse-msync-8 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-7 -F --prealloc --open-rw 449340754
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-8 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-[78]

truncate $WORKDIR/sparse-msync-9 -s 10G
truncate $WORKDIR/sparse-msync-10 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-9 -F --prealloc --open-rw 449340754
truncate $WORKDIR/sparse-msync-11 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-10 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-{9,10}

truncate $WORKDIR/sparse-msync-12 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-11 -F --prealloc --open-rw 449340754
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-12 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-{11,12}

truncate $WORKDIR/sparse-msync-13 -s 10G
truncate $WORKDIR/sparse-msync-14 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-13 -F --prealloc --open-rw 449340754
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-14 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-{13,14}

truncate $WORKDIR/sparse-msync-15 -s 10G
truncate $WORKDIR/sparse-msync-16 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-15 -F --prealloc --open-rw 449340754
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-16 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-{15,16}

truncate $WORKDIR/sparse-msync-17 -s 10G
truncate $WORKDIR/sparse-msync-18 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-17 -F --prealloc --open-rw 449340754
truncate $WORKDIR/sparse-msync-19 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-18 -F --prealloc --open-rw 449340754
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-19 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-{17,18,19}

truncate $WORKDIR/sparse-msync-20 -s 10G
truncate $WORKDIR/sparse-msync-21 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-20 -F --prealloc --open-rw 449340754
truncate $WORKDIR/sparse-msync-22 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-21 -F --prealloc --open-rw 449340754
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-22 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-{20,21,22}

truncate $WORKDIR/sparse-msync-23 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-23 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-23
truncate $WORKDIR/sparse-msync-24 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-24 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-24
truncate $WORKDIR/sparse-msync-25 -s 10G
truncate $WORKDIR/sparse-msync-26 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-25 -F --prealloc --open-rw 449340754
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-26 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-{25,26}

truncate $WORKDIR/sparse-msync-27 -s 10G
truncate $WORKDIR/sparse-msync-28 -s 10G
./usemem --runtime 300 -S -f $WORKDIR/sparse-msync-27 -F --prealloc --open-rw 449340754
rm $WORKDIR/sparse-msync-{27,28}

2023-11-15 16:53:53

by Linus Torvalds

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, 15 Nov 2023 at 10:28, David Howells <[email protected]> wrote:
>
> But the outcome is a bit variable and the result spaces overlap considerably.
> I certainly don't see a 17% performance reduction. Now, this may be due to
> hardware differences. The CPU I'm using is an Intel i3-4170 - which is a few
> years old at this point.

I tried to look at the perf profile changes in the original report,
and very little of it makes sense to me.

Having looked at quite a lot of those in the past (although certainly
less than Oliver), that's *usually* a result of a test that is unstable.

In this case, though, I think the big difference is

-11.0 perf-profile.self.cycles-pp.memcpy_orig
+14.7 perf-profile.self.cycles-pp.copy_page_from_iter_atomic

which is a bit odd. It looks like the old code used to use a regular
out-of-line memcpy (and that machine doesn't have FSRM), and the new
code for some reason does it inline.

I wonder if gcc somehow decided to inline "memcpy()" in
memcpy_from_iter() as a "rep movsb" because of other inlining changes?

[ Goes out to look ]

Yup, I think that's exactly what happened. Gcc seems to decide that it
might be a small memcpy(), and seems to do at least part of it
directly.

So I *think* this all is mainly an artifact of gcc having changed code
generation due to the code re-organization.

Linus

2023-11-15 17:39:42

by Linus Torvalds

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, 15 Nov 2023 at 11:53, Linus Torvalds
<[email protected]> wrote:
>
> I wonder if gcc somehow decided to inline "memcpy()" in
> memcpy_from_iter() as a "rep movsb" because of other inlining changes?
>
> [ Goes out to look ]
>
> Yup, I think that's exactly what happened. Gcc seems to decide that it
> might be a small memcpy(), and seems to do at least part of it
> directly.
>
> So I *think* this all is mainly an artifact of gcc having changed code
> generation due to the code re-organization.

The gcc code generation here is *really* odd. I've never seen this
before, so it may be new to newer versions of gcc. I see code like
this:

# lib/iov_iter.c:73: memcpy(to + progress, iter_from, len);
cmpl $8, %edx #, _88
jb .L400 #,
movq (%rsi), %rax #, tmp288
movq %rax, (%rcx) # tmp288,
movl %edx, %eax # _88, _88
movq -8(%rsi,%rax), %rdi #, tmp295
movq %rdi, -8(%rcx,%rax) # tmp295,
leaq 8(%rcx), %rdi #, tmp296
andq $-8, %rdi #, tmp296
subq %rdi, %rcx # tmp296, tmp268
subq %rcx, %rsi # tmp268, tmp269
addl %edx, %ecx # _88, _88
shrl $3, %ecx #,
rep movsq
jmp .L392 #

.L398:
# lib/iov_iter.c:73: memcpy(to + progress, iter_from, len);
movl (%rsi), %eax #, tmp271
movl %eax, (%rcx) # tmp271,
movl %edx, %eax # _88, _88
movl -4(%rsi,%rax), %esi #, tmp278
movl %esi, -4(%rcx,%rax) # tmp278,
movl 8(%r9), %edi # p_72->bv_len, p_72->bv_len
jmp .L330 #
...

.L400:
# lib/iov_iter.c:73: memcpy(to + progress, iter_from, len);
testb $4, %dl #, _88
jne .L398 #,
testl %edx, %edx # _88
je .L330 #,
movzbl (%rsi), %eax #, tmp279
movb %al, (%rcx) # tmp279,
testb $2, %dl #, _88
jne .L390 #,
...

which makes *zero* sense. It first checks that the length is at
least 8 bytes, then it moves *one* word by hand, then it aligns the
destination to 8 bytes, and does the remaining (possibly
overlapping at the beginning) words as one "rep movsq".

And L398 is the "I have 4..7 bytes to copy" target.

And L400 seems to be "I have 0..7 bytes to copy".

This is literally insane. And it seems to be all just gcc having for
some reason decided to do this instead of "rep movsb" or calling an
out-of-line function.

I get the feeling that this is related to how your patches made that
function be an inline function that is inlined through a function
pointer. I suspect that what happens is that gcc expands the memcpy()
first into that inlined function (without caller context), and then
inserts the crazily expanded inline later into the context of that
function pointer.
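
For the curious, the suspect shape boils down to something like this
(a minimal sketch with made-up names, *not* the actual lib/iov_iter.c
code):

#include <stddef.h>
#include <string.h>

typedef size_t (*step_fn)(void *to, const void *from, size_t len);

/* memcpy() with a non-constant length inside an inline "step" helper */
static inline size_t memcpy_step(void *to, const void *from, size_t len)
{
        memcpy(to, from, len);
        return 0;
}

/* ... which gets inlined back through a function-pointer parameter;
 * the pointer is a compile-time constant here, so gcc can inline it */
static inline size_t iterate(void *to, const void *from, size_t len,
                             step_fn step)
{
        return step(to, from, len);
}

size_t copy_chunk(void *to, const void *from, size_t len)
{
        return iterate(to, from, len, memcpy_step);
}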

I dunno. I really only say that because I haven't seen gcc make this
kind of mess before, and that "inlined through a function pointer" is
the main unusual thing here.

How very annoying.

Linus

2023-11-15 18:37:22

by David Howells

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

Linus Torvalds <[email protected]> wrote:

> which makes *zero* sense. It first checks that the the length is at
> least 8 bytes, then it moves *one* word by hand, then it aligns the
> code to 8 bytes remaining, and does the remaining (possibly
> overlapping at the beginning) words as one "rep movsq",

That's not what I see. See attached for a dump of _copy_from_iter from my
kernel. It's just using REP MOVSB.

For reference, the compiler is gcc-13.2.1-3.fc39.x86_64 with
binutils-2.40-13.fc39.x86_64.

David
---

(gdb) disas _copy_from_iter
Dump of assembler code for function _copy_from_iter:
<+0>: push %r15
<+2>: push %r14
<+4>: push %r13
<+6>: push %r12
<+8>: push %rbp
<+9>: push %rbx
<+10>: sub $0x40,%rsp
<+14>: mov %gs:0x28,%rax
<+23>: mov %rax,0x38(%rsp)
<+28>: xor %eax,%eax
<+30>: cmpb $0x0,0x3(%rdx)
<+34>: je 0xffffffff81770334 <_copy_from_iter+50>
<+36>: cmpb $0x0,0x1(%rdx)
<+40>: mov %rdi,%r12
<+43>: mov %rdx,%rbx
<+46>: je 0xffffffff81770364 <_copy_from_iter+98>
<+48>: jmp 0xffffffff8177033d <_copy_from_iter+59>
<+50>: ud2
<+52>: xor %ebp,%ebp
<+54>: jmp 0xffffffff8177067e <_copy_from_iter+892>

<+59>: mov 0x38(%rsp),%rax
<+64>: sub %gs:0x28,%rax
<+73>: jne 0xffffffff8177068e <_copy_from_iter+908>
<+79>: add $0x40,%rsp
<+83>: pop %rbx
<+84>: pop %rbp
<+85>: pop %r12
<+87>: pop %r13
<+89>: pop %r14
<+91>: pop %r15
<+93>: jmp 0xffffffff8176ee3f <__copy_from_iter_mc>

<+98>: mov 0x18(%rdx),%rax
<+102>: cmp %rax,%rsi
<+105>: cmova %rax,%rsi
<+109>: test %rsi,%rsi
<+112>: mov %rsi,%rbp
<+115>: je 0xffffffff8177067e <_copy_from_iter+892>

<+121>: mov (%rdx),%dl
<+123>: test %dl,%dl # ITER_UBUF
<+125>: jne 0xffffffff817703cc <_copy_from_iter+202>
<+127>: mov 0x8(%rbx),%rsi
<+131>: mov %rbp,%rcx
<+134>: add 0x10(%rbx),%rsi
<+138>: mov %rsi,%rdx
<+141>: mov %rbp,%rsi
<+144>: mov %rdx,%rdi
<+147>: call 0xffffffff8176ec9e <__access_ok>
<+152>: test %al,%al
<+154>: je 0xffffffff817703af <_copy_from_iter+173>
<+156>: nop
<+157>: nop
<+158>: nop
<+159>: mov %r12,%rdi
<+162>: mov %rdx,%rsi
<+165>: rep movsb %ds:(%rsi),%es:(%rdi)
<+167>: nop
<+168>: nop
<+169>: nop
<+170>: nop
<+171>: nop
<+172>: nop
<+173>: mov %rbp,%rax
<+176>: sub %rcx,%rax
<+179>: add 0x18(%rbx),%rcx
<+183>: add %rax,0x8(%rbx)
<+187>: sub %rbp,%rcx
<+190>: mov %rax,%rbp
<+193>: mov %rcx,0x18(%rbx)
<+197>: jmp 0xffffffff8177067e <_copy_from_iter+892>

<+202>: cmp $0x1,%dl # ITER_IOVEC
<+205>: jne 0xffffffff8177044c <_copy_from_iter+330>
<+207>: mov 0x10(%rbx),%r9
<+211>: mov %rsi,%r8
<+214>: xor %ebp,%ebp
<+216>: mov 0x8(%rbx),%r10
<+220>: mov 0x8(%r9),%rdx
<+224>: sub %r10,%rdx
<+227>: cmp %r8,%rdx
<+230>: cmova %r8,%rdx
<+234>: test %rdx,%rdx
<+237>: je 0xffffffff81770433 <_copy_from_iter+305>
<+239>: mov (%r9),%r11
<+242>: mov %rdx,%rsi
<+245>: mov %rdx,%rcx
<+248>: add %r10,%r11
<+251>: mov %r11,%rdi
<+254>: call 0xffffffff8176ec9e <__access_ok>
<+259>: test %al,%al
<+261>: je 0xffffffff8177041b <_copy_from_iter+281>
<+263>: nop
<+264>: nop
<+265>: nop
<+266>: lea (%r12,%rbp,1),%rdi
<+270>: mov %r11,%rsi
<+273>: rep movsb %ds:(%rsi),%es:(%rdi)
<+275>: nop
<+276>: nop
<+277>: nop
<+278>: nop
<+279>: nop
<+280>: nop
<+281>: mov %rdx,%rax
<+284>: sub %rdx,%r8
<+287>: sub %rcx,%rax
<+290>: add %rcx,%r8
<+293>: add %rax,%rbp
<+296>: add %rax,%r10
<+299>: cmp 0x8(%r9),%r10
<+303>: jb 0xffffffff81770444 <_copy_from_iter+322>
<+305>: add $0x10,%r9
<+309>: xor %r10d,%r10d
<+312>: test %r8,%r8
<+315>: jne 0xffffffff817703de <_copy_from_iter+220>
<+317>: jmp 0xffffffff81770544 <_copy_from_iter+578>
<+322>: mov %r10,%r8
<+325>: jmp 0xffffffff81770544 <_copy_from_iter+578>

<+330>: cmp $0x2,%dl # ITER_BVEC
<+333>: jne 0xffffffff817704ee <_copy_from_iter+492>
<+339>: mov 0x10(%rbx),%r8
<+343>: mov %rsi,%r11
<+346>: xor %ebp,%ebp
<+348>: mov $0x1000,%r10d
<+354>: mov 0x8(%rbx),%r9
<+358>: mov 0xc(%r8),%ecx
<+362>: add %r9,%rcx
<+365>: mov %rcx,%rdi
<+368>: and $0xfff,%ecx
<+374>: shr $0xc,%rdi
<+378>: shl $0x6,%rdi
<+382>: add (%r8),%rdi
<+385>: call 0xffffffff8176e72e <kmap_local_page>
<+390>: mov %r10,%rdx
<+393>: mov %rax,%rsi
<+396>: mov 0x8(%r8),%eax
<+400>: sub %r9,%rax
<+403>: cmp %r11,%rax
<+406>: cmova %r11,%rax
<+410>: sub %rcx,%rdx
<+413>: cmp %rdx,%rax
<+416>: cmova %rdx,%rax
<+420>: add %rcx,%rsi
<+423>: lea (%r12,%rbp,1),%rdx
<+427>: mov %eax,%ecx
<+429>: sub %rax,%r11
<+432>: mov %rdx,%rdi
<+435>: add %rax,%rbp
<+438>: add %rax,%r9
<+441>: rep movsb %ds:(%rsi),%es:(%rdi)
<+443>: mov 0x8(%r8),%eax
<+447>: cmp %rax,%r9
<+450>: jb 0xffffffff817704cd <_copy_from_iter+459>
<+452>: add $0x10,%r8
<+456>: xor %r9d,%r9d
<+459>: test %r11,%r11
<+462>: jne 0xffffffff81770468 <_copy_from_iter+358>
<+464>: mov %r8,%rax
<+467>: sub 0x10(%rbx),%rax
<+471>: mov %r9,0x8(%rbx)
<+475>: mov %r8,0x10(%rbx)
<+479>: sar $0x4,%rax
<+483>: sub %rax,0x20(%rbx)
<+487>: jmp 0xffffffff81770671 <_copy_from_iter+879>

<+492>: cmp $0x3,%dl # ITER_KVEC
<+495>: jne 0xffffffff81770560 <_copy_from_iter+606>
<+497>: mov 0x10(%rbx),%r9
<+501>: mov %rsi,%r8
<+504>: xor %ebp,%ebp
<+506>: mov 0x8(%rbx),%rdx
<+510>: mov 0x8(%r9),%rax
<+514>: sub %rdx,%rax
<+517>: cmp %r8,%rax
<+520>: cmova %r8,%rax
<+524>: test %rax,%rax
<+527>: je 0xffffffff81770534 <_copy_from_iter+562>
<+529>: mov (%r9),%rsi
<+532>: lea (%r12,%rbp,1),%r10
<+536>: mov %rax,%rcx
<+539>: add %rax,%rbp
<+542>: mov %r10,%rdi
<+545>: sub %rax,%r8
<+548>: add %rdx,%rsi
<+551>: add %rax,%rdx
<+554>: rep movsb %ds:(%rsi),%es:(%rdi)
<+556>: cmp 0x8(%r9),%rdx
<+560>: jb 0xffffffff81770541 <_copy_from_iter+575>
<+562>: add $0x10,%r9
<+566>: xor %edx,%edx
<+568>: test %r8,%r8
<+571>: jne 0xffffffff81770500 <_copy_from_iter+510>
<+573>: jmp 0xffffffff81770544 <_copy_from_iter+578>
<+575>: mov %rdx,%r8
<+578>: mov %r9,%rax
<+581>: sub 0x10(%rbx),%rax
<+585>: mov %r8,0x8(%rbx)
<+589>: mov %r9,0x10(%rbx)
<+593>: sar $0x4,%rax
<+597>: sub %rax,0x20(%rbx)
<+601>: jmp 0xffffffff81770671 <_copy_from_iter+879>

<+606>: cmp $0x4,%dl # ITER_XARRAY
<+609>: jne 0xffffffff81770677 <_copy_from_iter+885>
<+615>: movq $0x3,0x18(%rsp)
<+624>: mov 0x10(%rbx),%rax
<+628>: xor %edx,%edx
<+630>: mov 0x8(%rbx),%r14
<+634>: mov %rdx,0x20(%rsp)
<+639>: add 0x20(%rbx),%r14
<+643>: mov %rdx,0x28(%rsp)
<+648>: mov %rdx,0x30(%rsp)
<+653>: mov %rax,(%rsp)
<+657>: mov %r14,%rax
<+660>: shr $0xc,%rax
<+664>: mov %rax,0x8(%rsp)
<+669>: xor %eax,%eax
<+671>: mov %eax,0x10(%rsp)
<+675>: or $0xffffffffffffffff,%rsi
<+679>: mov %rsp,%rdi
<+682>: mov %rbp,%r13
<+685>: call 0xffffffff81d617d1 <xas_find>
<+690>: xor %ebp,%ebp
<+692>: mov $0x1000,%r15d
<+698>: mov %rax,%r8
<+701>: test %r8,%r8
<+704>: je 0xffffffff8177066d <_copy_from_iter+875>
<+710>: mov %r8,%rsi
<+713>: mov %rsp,%rdi
<+716>: call 0xffffffff8176e7a8 <xas_retry>
<+721>: test %al,%al
<+723>: jne 0xffffffff8177065d <_copy_from_iter+859>
<+729>: test $0x1,%r8d
<+736>: jne 0xffffffff81770614 <_copy_from_iter+786>
<+738>: mov %r8,%rdi
<+741>: call 0xffffffff8176e6a4 <folio_test_hugetlb>
<+746>: test %al,%al
<+748>: jne 0xffffffff81770618 <_copy_from_iter+790>
<+750>: call 0xffffffff8176e714 <folio_size>
<+755>: lea (%r14,%rbp,1),%rdx
<+759>: lea -0x1(%rax),%r10
<+763>: call 0xffffffff8176e714 <folio_size>
<+768>: and %rdx,%r10
<+771>: sub %r10,%rax
<+774>: cmp %r13,%rax
<+777>: cmova %r13,%rax
<+781>: mov %rax,%r9
<+784>: jmp 0xffffffff81770658 <_copy_from_iter+854>
<+786>: ud2
<+788>: jmp 0xffffffff8177066d <_copy_from_iter+875>
<+790>: ud2
<+792>: jmp 0xffffffff8177066d <_copy_from_iter+875>
<+794>: lea (%r12,%rbp,1),%r11
<+798>: mov %r10,%rsi
<+801>: mov %r8,%rdi
<+804>: call 0xffffffff8176e753 <kmap_local_folio>
<+809>: mov %r15,%rdx
<+812>: mov %r11,%rdi
<+815>: mov %rax,%rsi
<+818>: mov %r10,%rax
<+821>: and $0xfff,%eax
<+826>: sub %rax,%rdx
<+829>: cmp %r9,%rdx
<+832>: cmova %r9,%rdx
<+836>: add %rdx,%rbp
<+839>: sub %rdx,%r13
<+842>: mov %edx,%ecx
<+844>: rep movsb %ds:(%rsi),%es:(%rdi)
<+846>: je 0xffffffff8177066d <_copy_from_iter+875>
<+848>: sub %rdx,%r9
<+851>: add %rdx,%r10
<+854>: test %r9,%r9
<+857>: jne 0xffffffff8177061c <_copy_from_iter+794>
<+859>: mov %rsp,%rdi
<+862>: call 0xffffffff8176f307 <xas_next_entry>
<+867>: mov %rax,%r8
<+870>: jmp 0xffffffff817705bf <_copy_from_iter+701>
<+875>: add %rbp,0x8(%rbx)
<+879>: sub %rbp,0x18(%rbx)
<+883>: jmp 0xffffffff8177067e <_copy_from_iter+892>

# ITER_DISCARD / default
<+885>: sub %rsi,%rax
<+888>: mov %rax,0x18(%rbx)
<+892>: mov 0x38(%rsp),%rax
<+897>: sub %gs:0x28,%rax
<+906>: je 0xffffffff81770693 <_copy_from_iter+913>
<+908>: call 0xffffffff81d67d8d <__stack_chk_fail>
<+913>: add $0x40,%rsp
<+917>: mov %rbp,%rax
<+920>: pop %rbx
<+921>: pop %rbp
<+922>: pop %r12
<+924>: pop %r13
<+926>: pop %r14
<+928>: pop %r15
<+930>: jmp 0xffffffff81d745a0 <__x86_return_thunk>

2023-11-15 18:39:06

by Linus Torvalds

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, 15 Nov 2023 at 12:38, Linus Torvalds
<[email protected]> wrote:
>
> The gcc code generation here is *really* odd. I've never seen this
> before, so it may be new to newer versions of gcc. [...]

Ok, so I've been annoyed by gcc code generation for 'memcpy()' before,
mainly because it can't do the "let's inline it as an ALTERNATIVE() of
'rep movs' or an out-of-line call" thing that we do for user copies.
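
(For context, that user-copy pattern is a single patchable call site,
roughly this shape - simplified from the arch/x86 user-copy code, with
STAC/CLAC and the exception-table annotations elided:

static __always_inline unsigned long
copy_user_generic(void *to, const void *from, unsigned long len)
{
        asm volatile(
                ALTERNATIVE("rep movsb",
                            "call rep_movs_alternative",
                            ALT_NOT(X86_FEATURE_FSRM))
                : "+c" (len), "+D" (to), "+S" (from)
                : : "memory", "rax");
        return len;
}

so the same site becomes either the bare "rep movsb" or an out-of-line
call, decided once at boot by the alternatives machinery.)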

So here's a complete hack that says "we override the
__builtin_memcpy() implementation for non-constant lengths". We
stillwant gcc to deal with the constant size case, since often that
can be turned into just plain moves.

And this is *ENTIRELY* untested, except for checking that it makes the
code generation in lib/iov_iter.c look better.

It's probably completely broken. I particularly worry about "memcpy()"
being used *during* the instruction rewriting, so the whole approach
with ALTERNATIVE() is probably entirely broken.

But I'm cc'ing Borislav, because we've talked about the whole "inline
memcpy() with alternatives" before, so maybe this gives Borislav some
ideas for how to do it right.

Borislav, see

https://lore.kernel.org/all/CAHk-=wjCUckvZUQf7gqp2ziJUWxVpikM_6srFdbcNdBJTxExRg@mail.gmail.com/

for some truly crazy code generation by gcc.

Linus


Attachments:
patch.diff (1.85 kB)

2023-11-15 18:46:32

by Linus Torvalds

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, 15 Nov 2023 at 13:35, David Howells <[email protected]> wrote:
>
> That's not what I see. See attached for a dump of _copy_from_iter from my
> kernel. It's just using REP MOVSB.

Yeah, an unconditional REP MOVSB is not right either. That just means
that it performs truly horrendously badly on some machines.

Do you perhaps have CONFIG_CC_OPTIMIZE_FOR_SIZE set? That makes gcc
use "rep movsb" - even for small copies that most definitely should
*not* use "rep movsb".

Anyway, you should never use CC_OPTIMIZE_FOR_SIZE as any kind of
baseline. I'd actually love to use it in general, but it really makes
gcc do silly things when it goes for size optimizations that make no
sense at all (because it will go for size over anything else).

It turns out that on FSRM machines (ie anything really new), it's ok,
because even small constant-sized copies do work ok with "rep movsb",
but there are cases where it's absolutely horrendously bad.

Linus

2023-11-15 19:10:43

by Linus Torvalds

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, 15 Nov 2023 at 13:45, Linus Torvalds
<[email protected]> wrote:
>
> Do you perhaps have CONFIG_CC_OPTIMIZE_FOR_SIZE set? That makes gcc
> use "rep movsb" - even for small copies that most definitely should
> *not* use "rep movsb".

Just to give some background and an example:

__builtin_memcpy(dst, src, 24);

with -O2 is done as three 64-bit move instructions (well, three in
each direction, so six instructions total), and with -Os you get

movl $6, %ecx
rep movsl

instead. And no, this isn't all that uncommon, because things like
the above are what happen when you copy a small structure around.

And that "rep movsl" is indeed nice and small, but it's truly
horrendously bad from a performance angle on most cores, compared to
the six instructions that can schedule nicely and take a cycle or two.
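
(Concretely, a plain struct assignment is enough to hit this - an
illustrative example, names made up:

struct key { unsigned long a, b, c; };  /* 24 bytes on x86-64 */

void copy_key(struct key *dst, const struct key *src)
{
        *dst = *src;    /* same as __builtin_memcpy(dst, src, 24) */
}

and that assignment is exactly where the six scheduled moves at -O2
and the "rep movsl" at -Os diverge.)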

There are some other cases of similar "-Os generates unacceptable
code". For example, dividing by a constant - when you use -Os, gcc
thinks that it's perfectly fine to actually generate a divide
instruction, because it is indeed small.

But in most cases you really *really* want to use a "multiply by
reciprocal" even though it generates bigger code. Again, it ends up
depending on microarchitecture, and modern cores tend to do better on
divides, but it's another of those things where saving a couple of
bytes of code space is not the right choice if it means that you use a
slow divider.

And again, those "divide by constant" often happen in implicit
contexts (ie the constant may be the size of a structure, and the
divide is due to taking a pointer difference). Let's say you have a
structure that isn't a power of two, but (to pick a random but not
unlikely value) is 56 bytes in size.

The code generation for -O2 is (value in %rdi)

movabsq $2635249153387078803, %rax
shrq $3, %rdi
mulq %rdi

and for -Os you get (value in %rax):

movl $56, %ecx
xorl %edx, %edx
divq %rcx

and that 'divq' is certainly again smaller and more obvious, but again
we're talking "single cycles" vs "potentially 50+ cycles" depending on
uarch.
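
(Illustrative again - the implicit pointer-difference divide looks
like:

#include <stddef.h>

struct entry { char payload[56]; };     /* 56 bytes: not a power of two */

ptrdiff_t entry_index(const struct entry *base, const struct entry *p)
{
        return p - base;                /* divides by 56 under the hood */
}

where -O2 gives the reciprocal-multiply sequence above and -Os gives
the divq.)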

Linus

2023-11-15 19:10:53

by Borislav Petkov

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, Nov 15, 2023 at 01:38:18PM -0500, Linus Torvalds wrote:
> It's probably completely broken. I particularly worry about "memcpy()"
> being used *during* the instruction rewriting, so the whole approach
> with ALTERNATIVE() is probably entirely broken.

Should we define an alternative_memcpy() which is used *only* during
rewriting so that this becomes a non-issue?

> But I'm cc'ing Borislav, because we've talked about the whole "inline
> memcpy() with alternatives" before, so maybe this gives Borislav some
> ideas for how to do it right.

Yours looks simple enough and makes sense. Lemme poke at it a bit in the
coming days and see what happens.

> Borislav, see
>
> https://lore.kernel.org/all/CAHk-=wjCUckvZUQf7gqp2ziJUWxVpikM_6srFdbcNdBJTxExRg@mail.gmail.com/
>
> for some truly crazy code generation by gcc.

Yeah, lemme show that to gcc folks. That asm is with your compiler,
right? Version?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-15 19:16:24

by Linus Torvalds

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, 15 Nov 2023 at 14:10, Borislav Petkov <[email protected]> wrote:
>
> Should we define an alternative_memcpy() which is used *only* during
> rewriting so that this becomes a non-issue?

Yeah, I think the instruction rewriting should use something that
explicitly cannot possibly itself need rewriting, and a plain
'memcpy()' is obviously not that.

The good news is that at least things like structure copies would
*not* trigger that alternative, so it's only explicit memcpy() calls
that my patch changes. But I would not be surprised if instruction
rewriting does that. I didn't actually check.

> Yours looks simple enough and makes sense. Lemme poke at it a bit in the
> coming days and see what happens.

Note that it has a nasty interaction with fortify-source, which is why
it has that hacky "#undef memcpy" in that unrelated header.

Also note that I was being very very lazy in how I re-used the
"rep_movs_alternative" function that we already have. And it's
actually a bad laziness, because our existing rep_movs_alternative
does the exception handling for user mode faults.

We don't actually want exception handling for 'memcpy()', because it
could hide bugs. If a memcpy() gets a bad pointer, we want the oops,
not a partial copy.

So my patch really is broken. It might happen to work when everything
else goes right, and it's small, but it is very much a "proof of
concept" rather than something that is actually acceptable.

Linus

2023-11-15 19:27:29

by Linus Torvalds

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, 15 Nov 2023 at 14:10, Borislav Petkov <[email protected]> wrote:
>
> > Borislav, see
> >
> > https://lore.kernel.org/all/CAHk-=wjCUckvZUQf7gqp2ziJUWxVpikM_6srFdbcNdBJTxExRg@mail.gmail.com/
> >
> > for some truly crazy code generation by gcc.
>
> Yeah, lemme show that to gcc folks. That asm is with your compiler,
> right? Version?

That was with gcc version 13.2.1.

Note that I only see that crazy thing in lib/iov_iter.s, so I really
do think it has something to do with inlining __builtin_memcpy()
behind a conditional function pointer.

In normal cases, gcc seems to just do the obvious thing (ie expand a
small constant-sized memcpy inline, or just call the external 'memcpy'
function).

So it's some odd pattern that triggers that "expand non-constant
memcpy inline". And once that happens, the resulting code generation
is still a bit odd but is at least explicable.

That "do first word by hand, then do aligned 'rep movsq' on top of it"
pattern is weird, but we've seen some similar strange patterns in
hand-written memcpy (eg "use two overlapping 8-byte writes to handle
the 8-15 byte case").
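
(That overlapping-writes trick, spelled out in C - a sketch, not
actual kernel code: the first 8-byte move covers the head, the second
covers the tail, and any middle bytes simply get written twice:

#include <stdint.h>
#include <string.h>

static void copy_8_to_15(void *dst, const void *src, size_t len)
{
        uint64_t head, tail;

        memcpy(&head, src, 8);
        memcpy(&tail, (const char *)src + len - 8, 8);
        memcpy(dst, &head, 8);
        memcpy((char *)dst + len - 8, &tail, 8);
}

valid for any len from 8 to 16.)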

So the real issue is that we don't want an inlined memcpy at all,
unless it's the simple constant-sized case that has been turned into
individual moves with no loop.

Or it's a "rep movsb" with FSRM as a CPUID-based alternative, of course.

Linus

2023-11-15 20:07:57

by Linus Torvalds

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, 15 Nov 2023 at 14:15, Linus Torvalds
<[email protected]> wrote:
>
> So my patch really is broken. It might happen to work when everything
> else goes right, and it's small, but it is very much a "proof of
> concept" rather than something that is actually acceptable.

More breakage details

(a) need to not do this with the boot code and EFI stub code that
doesn't handle alternatives.

It's not a huge deal in that obviously both alternatives work,
but it causes build issues:

ld: error: unplaced orphan section `.altinstructions' from
`arch/x86/boot/compressed/misc.o'
ld: error: unplaced orphan section `.altinstr_replacement' from
`arch/x86/boot/compressed/misc.o'
...

etc

(b) our current "memcpy_orig" fallback does unrolled copy loops, and
the rep_movs_alternative fallback obviously doesn't.

It's not clear that the unrolled copy loops matter for the in-kernel
kinds of copies, but who knows. The memcpy_orig code is definitely
trying to be smarter in some other ways too. So the fallback should
try a *bit* harder than I did, and not just with the whole "don't try
to handle exceptions" issue I mentioned.

Linus

2023-11-15 20:56:14

by David Howells

Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

Linus Torvalds <[email protected]> wrote:

> Do you perhaps have CONFIG_CC_OPTIMIZE_FOR_SIZE set? That makes gcc
> use "rep movsb" - even for small copies that most definitely should
> *not* use "rep movsb".

I do, yes. I've attached the config for reference.

David
---
CONFIG_CC_VERSION_TEXT="gcc (GCC) 13.2.1 20230918 (Red Hat 13.2.1-3)"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=130201
CONFIG_CLANG_VERSION=0
CONFIG_AS_IS_GNU=y
CONFIG_AS_VERSION=24000
CONFIG_LD_IS_BFD=y
CONFIG_LD_VERSION=24000
CONFIG_LLD_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT=y
CONFIG_TOOLS_SUPPORT_RELR=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_PAHOLE_VERSION=0
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_UAPI_HEADER_TEST=y
CONFIG_LOCALVERSION="-build3"
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
CONFIG_KERNEL_XZ=y
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_WATCH_QUEUE=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_IRQ_MSI_IOMMU=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_IDLE=y

CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
CONFIG_NO_HZ_FULL=y
CONFIG_CONTEXT_TRACKING_USER=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=100

CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y

CONFIG_BPF_SYSCALL=y

CONFIG_PREEMPT_VOLUNTARY_BUILD=y
CONFIG_PREEMPT_VOLUNTARY=y

CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

CONFIG_CPU_ISOLATION=y

CONFIG_TREE_RCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y

CONFIG_IKCONFIG=m
CONFIG_IKCONFIG_PROC=y
CONFIG_IKHEADERS=m
CONFIG_LOG_BUF_SHIFT=16
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y


CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_CC_IMPLICIT_FALLTHROUGH="-Wimplicit-fallthrough=5"
CONFIG_GCC11_NO_ARRAY_BOUNDS=y
CONFIG_CC_NO_ARRAY_BOUNDS=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_SCHED_MM_CID=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_TIME_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_INITRAMFS_PRESERVE_MTIME=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_LD_ORPHAN_WARN=y
CONFIG_LD_ORPHAN_WARN_LEVEL="warn"
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_EXPERT=y
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE=y
CONFIG_KCMP=y
CONFIG_RSEQ=y
CONFIG_CACHESTAT_SYSCALL=y
CONFIG_HAVE_PERF_EVENTS=y

CONFIG_PERF_EVENTS=y

CONFIG_SYSTEM_DATA_VERIFICATION=y
CONFIG_TRACEPOINTS=y

CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y

CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_AUDIT_ARCH=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

CONFIG_SMP=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_CPU_RESCTRL=y
CONFIG_IOSF_MBI=y
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MCORE2=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_P6_NOP=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_IA32_FEAT_CTL=y
CONFIG_X86_VMX_FEATURE_NAMES=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_HYGON=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_ZHAOXIN=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_NR_CPUS_RANGE_BEGIN=2
CONFIG_NR_CPUS_RANGE_END=512
CONFIG_NR_CPUS_DEFAULT=64
CONFIG_NR_CPUS=4
CONFIG_SCHED_CLUSTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_MC_PRIO=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_THRESHOLD=y

CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_PERF_EVENTS_INTEL_RAPL=y
CONFIG_PERF_EVENTS_INTEL_CSTATE=y

CONFIG_X86_16BIT=y
CONFIG_X86_ESPFIX64=y
CONFIG_X86_VSYSCALL_EMULATION=y
CONFIG_MICROCODE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_MEMORY_PROBE=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_MTRR=y
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_X86_UMIP=y
CONFIG_CC_HAS_IBT=y
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=y
CONFIG_X86_INTEL_TSX_MODE_OFF=y
CONFIG_EFI=y
CONFIG_EFI_STUB=y
CONFIG_EFI_HANDOVER_PROTOCOL=y
CONFIG_EFI_RUNTIME_MAP=y
CONFIG_HZ_250=y
CONFIG_HZ=250
CONFIG_SCHED_HRTICK=y
CONFIG_ARCH_SUPPORTS_KEXEC=y
CONFIG_ARCH_SUPPORTS_KEXEC_FILE=y
CONFIG_ARCH_SELECTS_KEXEC_FILE=y
CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY=y
CONFIG_ARCH_SUPPORTS_KEXEC_SIG=y
CONFIG_ARCH_SUPPORTS_KEXEC_SIG_FORCE=y
CONFIG_ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG=y
CONFIG_ARCH_SUPPORTS_KEXEC_JUMP=y
CONFIG_ARCH_SUPPORTS_CRASH_DUMP=y
CONFIG_ARCH_SUPPORTS_CRASH_HOTPLUG=y
CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
CONFIG_LEGACY_VSYSCALL_XONLY=y
CONFIG_MODIFY_LDT_SYSCALL=y
CONFIG_HAVE_LIVEPATCH=y
CONFIG_LIVEPATCH=y

CONFIG_CC_HAS_SLS=y
CONFIG_CC_HAS_RETURN_THUNK=y
CONFIG_CC_HAS_ENTRY_PADDING=y
CONFIG_FUNCTION_PADDING_CFI=11
CONFIG_FUNCTION_PADDING_BYTES=16
CONFIG_HAVE_CALL_THUNKS=y
CONFIG_SPECULATION_MITIGATIONS=y
CONFIG_PAGE_TABLE_ISOLATION=y
CONFIG_RETPOLINE=y
CONFIG_RETHUNK=y
CONFIG_CPU_UNRET_ENTRY=y
CONFIG_CPU_IBPB_ENTRY=y
CONFIG_CPU_IBRS_ENTRY=y
CONFIG_ARCH_HAS_ADD_PAGES=y

CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM=y
CONFIG_PM_CLK=y
CONFIG_WQ_POWER_EFFICIENT_DEFAULT=y
CONFIG_ENERGY_MODEL=y
CONFIG_ARCH_SUPPORTS_ACPI=y
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
CONFIG_ACPI_SPCR_TABLE=y
CONFIG_ACPI_FPDT=y
CONFIG_ACPI_LPIT=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_REV_OVERRIDE_POSSIBLE=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_TAD=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_IPMI=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_PROCESSOR_AGGREGATOR=y
CONFIG_ACPI_THERMAL=y
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_DEBUG=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_MEMORY=y
CONFIG_ACPI_HOTPLUG_IOAPIC=y
CONFIG_ACPI_HED=y
CONFIG_ACPI_NUMA=y
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_EINJ=y
CONFIG_ACPI_DPTF=y
CONFIG_DPTF_PCH_FIVR=y
CONFIG_ACPI_PCC=y
CONFIG_ACPI_VIOT=y
CONFIG_ACPI_PRMT=y
CONFIG_X86_PM_TIMER=y

CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y

CONFIG_X86_INTEL_PSTATE=y
CONFIG_X86_PCC_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ=y


CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y

CONFIG_INTEL_IDLE=y

CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_MMCONF_FAM10H=y
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y

CONFIG_IA32_EMULATION=y
CONFIG_IA32_EMULATION_DEFAULT_DISABLED=y
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y

CONFIG_HAVE_KVM=y
CONFIG_AS_AVX512=y
CONFIG_AS_SHA1_NI=y
CONFIG_AS_SHA256_NI=y
CONFIG_AS_TPAUSE=y
CONFIG_AS_GFNI=y
CONFIG_AS_WRUSS=y

CONFIG_HOTPLUG_SMT=y
CONFIG_HOTPLUG_CORE_SYNC=y
CONFIG_HOTPLUG_CORE_SYNC_DEAD=y
CONFIG_HOTPLUG_CORE_SYNC_FULL=y
CONFIG_HOTPLUG_SPLIT_STARTUP=y
CONFIG_HOTPLUG_PARALLEL=y
CONFIG_GENERIC_ENTRY=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_KRETPROBE_ON_RETHOOK=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_ARCH_CORRECT_STACKTRACE_ON_KRETPROBE=y
CONFIG_HAVE_FUNCTION_ERROR_INJECTION=y
CONFIG_HAVE_NMI=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_NMI_SUPPORT=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_HAS_SET_DIRECT_MAP=y
CONFIG_ARCH_HAS_CPU_FINALIZE_INIT=y
CONFIG_HAVE_ARCH_THREAD_STRUCT_WHITELIST=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_ARCH_WANTS_NO_INSTR=y
CONFIG_HAVE_ASM_MODVERSIONS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_RSEQ=y
CONFIG_HAVE_RUST=y
CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_MMU_GATHER_MERGE_VMAS=y
CONFIG_MMU_LAZY_TLB_REFCOUNT=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_ARCH_STACKLEAK=y
CONFIG_HAVE_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG_THIN=y
CONFIG_LTO_NONE=y
CONFIG_ARCH_SUPPORTS_CFI_CLANG=y
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING_USER=y
CONFIG_HAVE_CONTEXT_TRACKING_USER_OFFSTACK=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_MOVE_PUD=y
CONFIG_HAVE_MOVE_PMD=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_HUGE_VMALLOC=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_PMD_MKWRITE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK=y
CONFIG_SOFTIRQ_ON_OWN_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_PAGE_SIZE_LESS_THAN_64KB=y
CONFIG_PAGE_SIZE_LESS_THAN_256KB=y
CONFIG_HAVE_OBJTOOL=y
CONFIG_HAVE_JUMP_LABEL_HACK=y
CONFIG_HAVE_NOINSTR_HACK=y
CONFIG_HAVE_NOINSTR_VALIDATION=y
CONFIG_HAVE_UACCESS_VALIDATION=y
CONFIG_HAVE_STACK_VALIDATION=y
CONFIG_HAVE_RELIABLE_STACKTRACE=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
CONFIG_COMPAT_32BIT_TIME=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y
CONFIG_ARCH_USE_MEMREMAP_PROT=y
CONFIG_ARCH_HAS_MEM_ENCRYPT=y
CONFIG_HAVE_STATIC_CALL=y
CONFIG_HAVE_STATIC_CALL_INLINE=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_ARCH_WANT_LD_ORPHAN_WARN=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_SUPPORTS_PAGE_TABLE_CHECK=y
CONFIG_ARCH_HAS_ELFCORE_COMPAT=y
CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH=y
CONFIG_DYNAMIC_SIGFRAME=y
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y

CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y

CONFIG_HAVE_GCC_PLUGINS=y
CONFIG_FUNCTION_ALIGNMENT_4B=y
CONFIG_FUNCTION_ALIGNMENT_16B=y
CONFIG_FUNCTION_ALIGNMENT=16

CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULE_SIG_FORMAT=y
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_SIG=y
CONFIG_MODULE_SIG_ALL=y
CONFIG_MODULE_SIG_SHA256=y
CONFIG_MODULE_SIG_HASH="sha256"
CONFIG_MODULE_COMPRESS_NONE=y
CONFIG_MODPROBE_PATH="/sbin/modprobe"
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLOCK_LEGACY_AUTOLOAD=y
CONFIG_BLK_CGROUP_PUNT_BIO=y
CONFIG_BLK_DEV_BSG_COMMON=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_ZONED=y
CONFIG_BLK_DEBUG_FS=y
CONFIG_BLK_DEBUG_FS_ZONED=y

CONFIG_MSDOS_PARTITION=y
CONFIG_EFI_PARTITION=y

CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y
CONFIG_BLK_PM=y
CONFIG_BLOCK_HOLDER_DEPRECATED=y
CONFIG_BLK_MQ_STACKING=y

CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y

CONFIG_ASN1=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
CONFIG_FREEZER=y

CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
CONFIG_BINFMT_SCRIPT=y
CONFIG_BINFMT_MISC=y
CONFIG_COREDUMP=y

CONFIG_SWAP=y

CONFIG_SLUB=y
CONFIG_SLAB_MERGE_DEFAULT=y
CONFIG_SLUB_CPU_PARTIAL=y

CONFIG_COMPAT_BRK=y
CONFIG_SPARSEMEM=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP=y
CONFIG_ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP=y
CONFIG_HAVE_FAST_GUP=y
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_EXCLUSIVE_SYSTEM_RAM=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MHP_MEMMAP_ON_MEMORY=y
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_COMPACTION=y
CONFIG_COMPACT_UNEVICTABLE_DEFAULT=1
CONFIG_MIGRATION=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_ARCH_ENABLE_THP_MIGRATION=y
CONFIG_CONTIG_ALLOC=y
CONFIG_PCP_BATCH_SCALE_MAX=5
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_MMU_NOTIFIER=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ARCH_WANTS_THP_SWAP=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
CONFIG_THP_SWAP=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_GENERIC_EARLY_IOREMAP=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CURRENT_STACK_POINTER=y
CONFIG_ARCH_HAS_PTE_DEVMAP=y
CONFIG_ARCH_HAS_ZONE_DMA_SET=y
CONFIG_ZONE_DMA=y
CONFIG_ZONE_DMA32=y
CONFIG_HMM_MIRROR=y
CONFIG_ARCH_USES_HIGH_VMA_FLAGS=y
CONFIG_ARCH_HAS_PKEYS=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_ARCH_HAS_PTE_SPECIAL=y
CONFIG_MEMFD_CREATE=y
CONFIG_SECRETMEM=y
CONFIG_ARCH_SUPPORTS_PER_VMA_LOCK=y
CONFIG_PER_VMA_LOCK=y
CONFIG_LOCK_MM_AND_FIND_VMA=y


CONFIG_NET=y
CONFIG_NET_INGRESS=y
CONFIG_NET_EGRESS=y
CONFIG_NET_XGRESS=y
CONFIG_SKB_EXTENSIONS=y

CONFIG_PACKET=y
CONFIG_UNIX=y
CONFIG_UNIX_SCM=y
CONFIG_AF_UNIX_OOB=y
CONFIG_TLS=y
CONFIG_TLS_DEVICE=y
CONFIG_TLS_TOE=y
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_AH=y
CONFIG_XFRM_ESP=y
CONFIG_XFRM_IPCOMP=y
CONFIG_XFRM_ESPINTCP=y
CONFIG_NET_HANDSHAKE=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_FIB_TRIE_STATS=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_NET_IP_TUNNEL=y
CONFIG_NET_UDP_TUNNEL=y
CONFIG_INET_ESP=y
CONFIG_INET_ESPINTCP=y
CONFIG_INET_TABLE_PERTURB_ORDER=16
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_IPV6=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
CONFIG_INET6_AH=y
CONFIG_INET6_ESP=y
CONFIG_INET6_IPCOMP=y
CONFIG_IPV6_MIP6=y
CONFIG_INET6_XFRM_TUNNEL=y
CONFIG_INET6_TUNNEL=y
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_IPV6_SUBTREES=y
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
CONFIG_NET_PTP_CLASSIFY=y
CONFIG_IP_SCTP=y
CONFIG_SCTP_DEFAULT_COOKIE_HMAC_MD5=y
CONFIG_SCTP_COOKIE_HMAC_MD5=y
CONFIG_INET_SCTP_DIAG=y
CONFIG_RDS=y
CONFIG_RDS_TCP=m
CONFIG_L2TP=y
CONFIG_L2TP_DEBUGFS=y
CONFIG_L2TP_V3=y
CONFIG_L2TP_IP=y
CONFIG_L2TP_ETH=y
CONFIG_STP=y
CONFIG_BRIDGE=y
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_LLC=y
CONFIG_DNS_RESOLVER=y
CONFIG_NETLINK_DIAG=y
CONFIG_PCPU_DEV_REFCNT=y
CONFIG_MAX_SKB_FRAGS=17
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_SOCK_RX_QUEUE_MAPPING=y
CONFIG_XPS=y
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_NET_FLOW_LIMIT=y

CONFIG_NET_DROP_MONITOR=y

CONFIG_BT=m
CONFIG_BT_BREDR=y
CONFIG_BT_RFCOMM=m
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
CONFIG_BT_HIDP=m
CONFIG_BT_HS=y
CONFIG_BT_LE=y
CONFIG_BT_LE_L2CAP_ECRED=y
CONFIG_BT_MSFTEXT=y
CONFIG_BT_AOSPEXT=y
CONFIG_BT_DEBUGFS=y

CONFIG_BT_INTEL=m
CONFIG_BT_BCM=m
CONFIG_BT_RTL=m
CONFIG_BT_HCIBTUSB=m
CONFIG_BT_HCIBTUSB_POLL_SYNC=y
CONFIG_BT_HCIBTUSB_BCM=y
CONFIG_BT_HCIBTUSB_RTL=y

CONFIG_AF_RXRPC=y
CONFIG_AF_RXRPC_IPV6=y
CONFIG_AF_RXRPC_DEBUG=y
CONFIG_RXKAD=y
CONFIG_RXPERF=y
CONFIG_AF_KCM=y
CONFIG_STREAM_PARSER=y
CONFIG_FIB_RULES=y
CONFIG_NET_9P=y
CONFIG_NET_9P_FD=y
CONFIG_NET_9P_VIRTIO=y
CONFIG_CEPH_LIB=y
CONFIG_CEPH_LIB_USE_DNS_RESOLVER=y
CONFIG_DST_CACHE=y
CONFIG_GRO_CELLS=y
CONFIG_SOCK_VALIDATE_XMIT=y
CONFIG_NET_SELFTESTS=y
CONFIG_NET_SOCK_MSG=y
CONFIG_NET_DEVLINK=y
CONFIG_PAGE_POOL=y
CONFIG_PAGE_POOL_STATS=y
CONFIG_FAILOVER=m
CONFIG_ETHTOOL_NETLINK=y

CONFIG_HAVE_EISA=y
CONFIG_HAVE_PCI=y
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_DEFAULT=y
CONFIG_PCIE_PME=y
CONFIG_PCI_MSI=y
CONFIG_PCI_QUIRKS=y
CONFIG_PCI_ATS=y
CONFIG_PCI_LOCKLESS_CONFIG=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_PCI_LABEL=y
CONFIG_PCIE_BUS_DEFAULT=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16

CONFIG_UEVENT_HELPER=y
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y

CONFIG_FW_LOADER=y
CONFIG_FW_LOADER_PAGED_BUF=y
CONFIG_EXTRA_FIRMWARE=""
CONFIG_FW_LOADER_COMPRESS=y
CONFIG_FW_LOADER_COMPRESS_XZ=y
CONFIG_FW_CACHE=y

CONFIG_ALLOW_DEV_COREDUMP=y
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_GENERIC_CPU_VULNERABILITIES=y
CONFIG_REGMAP=y
CONFIG_DMA_SHARED_BUFFER=y

CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DMIID=y
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y

CONFIG_EFI_ESRT=y
CONFIG_EFI_VARS_PSTORE=y
CONFIG_EFI_DXE_MEM_ATTRIBUTES=y
CONFIG_EFI_RUNTIME_WRAPPERS=y
CONFIG_EFI_EARLYCON=y
CONFIG_EFI_CUSTOM_SSDT_OVERLAYS=y

CONFIG_UEFI_CPER=y
CONFIG_UEFI_CPER_X86=y



CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
CONFIG_PNP=y

CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_NULL_BLK=y
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
CONFIG_BLK_DEV_DRBD=m
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=1
CONFIG_BLK_DEV_RAM_SIZE=65536
CONFIG_VIRTIO_BLK=m
CONFIG_BLK_DEV_RBD=y

CONFIG_NVME_COMMON=y
CONFIG_NVME_AUTH=y
CONFIG_NVME_CORE=y
CONFIG_BLK_DEV_NVME=y
CONFIG_NVME_MULTIPATH=y
CONFIG_NVME_FABRICS=y
CONFIG_NVME_TCP=y
CONFIG_NVME_TARGET=y
CONFIG_NVME_TARGET_PASSTHRU=y
CONFIG_NVME_TARGET_LOOP=y
CONFIG_NVME_TARGET_TCP=y
CONFIG_NVME_TARGET_AUTH=y

CONFIG_ENCLOSURE_SERVICES=y

CONFIG_INTEL_MEI=y
CONFIG_INTEL_MEI_ME=y

CONFIG_SCSI_MOD=y
CONFIG_SCSI_COMMON=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_PROC_FS=y

CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_SG=y
CONFIG_BLK_DEV_BSG=y
CONFIG_SCSI_CONSTANTS=y

CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_ISCSI_ATTRS=m

CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m

CONFIG_ATA=y
CONFIG_SATA_HOST=y
CONFIG_PATA_TIMINGS=y
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_FORCE=y
CONFIG_ATA_ACPI=y

CONFIG_SATA_AHCI=y
CONFIG_SATA_MOBILE_LPM_POLICY=0
CONFIG_SATA_AHCI_PLATFORM=y
CONFIG_MD=y
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=y
CONFIG_DM_UEVENT=y
CONFIG_DM_ZONED=m
CONFIG_DM_AUDIT=y


CONFIG_NETDEVICES=y
CONFIG_NET_CORE=y
CONFIG_TUN=y
CONFIG_VETH=y
CONFIG_VIRTIO_NET=m
CONFIG_ETHERNET=y
CONFIG_MDIO=y
CONFIG_NET_VENDOR_CHELSIO=y
CONFIG_CHELSIO_T1=m
CONFIG_CHELSIO_T4=m
CONFIG_CHELSIO_T4VF=m
CONFIG_CHELSIO_INLINE_CRYPTO=y
CONFIG_CRYPTO_DEV_CHELSIO_TLS=m
CONFIG_CHELSIO_TLS_DEVICE=m
CONFIG_NET_VENDOR_GOOGLE=y
CONFIG_GVE=m
CONFIG_NET_VENDOR_I825XX=y
CONFIG_NET_VENDOR_INTEL=y
CONFIG_IXGBE=y
CONFIG_IXGBE_HWMON=y
CONFIG_NET_VENDOR_LITEX=y
CONFIG_NET_VENDOR_REALTEK=y
CONFIG_R8169=y
CONFIG_PHYLIB=y
CONFIG_SWPHY=y
CONFIG_FIXED_PHY=y

CONFIG_REALTEK_PHY=y
CONFIG_MDIO_DEVICE=y
CONFIG_MDIO_BUS=y
CONFIG_FWNODE_MDIO=y
CONFIG_ACPI_MDIO=y
CONFIG_MDIO_DEVRES=y

CONFIG_NETDEVSIM=y
CONFIG_NET_FAILOVER=m

CONFIG_INPUT=y
CONFIG_INPUT_VIVALDIFMAP=y

CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_EVDEV=y

CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_SYNAPTICS_SMBUS=y
CONFIG_MOUSE_PS2_SMBUS=y

CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_LIBPS2=y

CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_LEGACY_TIOCSTI=y
CONFIG_LDISC_AUTOLOAD=y

CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_PCILIB=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DWLIB=y
CONFIG_SERIAL_8250_LPSS=y
CONFIG_SERIAL_8250_MID=y

CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y

CONFIG_HVC_DRIVER=y
CONFIG_VIRTIO_CONSOLE=y
CONFIG_IPMI_HANDLER=y
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PLAT_DATA=y
CONFIG_IPMI_DEVICE_INTERFACE=y
CONFIG_IPMI_SI=y
CONFIG_IPMI_SSIF=y
CONFIG_DEVPORT=y
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
CONFIG_HPET_MMAP_DEFAULT=y
CONFIG_TCG_TPM=y
CONFIG_TCG_VTPM_PROXY=y

CONFIG_I2C=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
CONFIG_I2C_CHARDEV=y
CONFIG_I2C_MUX=y


CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_SMBUS=y


CONFIG_I2C_I801=y

CONFIG_I2C_SCMI=y

CONFIG_PPS=y



CONFIG_PTP_1588_CLOCK=y
CONFIG_PTP_1588_CLOCK_OPTIONAL=y


CONFIG_POWER_SUPPLY=y
CONFIG_POWER_SUPPLY_HWMON=y
CONFIG_HWMON=y

CONFIG_SENSORS_CORETEMP=y
CONFIG_PMBUS=y
CONFIG_SENSORS_PMBUS=y

CONFIG_SENSORS_ACPI_POWER=y
CONFIG_SENSORS_ATK0110=y
CONFIG_THERMAL=y
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_ACPI=y
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
CONFIG_THERMAL_GOV_STEP_WISE=y
CONFIG_THERMAL_GOV_USER_SPACE=y

CONFIG_X86_THERMAL_VECTOR=y
CONFIG_INTEL_TCC=y
CONFIG_X86_PKG_TEMP_THERMAL=y


CONFIG_INTEL_PCH_THERMAL=y
CONFIG_INTEL_TCC_COOLING=y

CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
CONFIG_WATCHDOG_OPEN_TIMEOUT=0


CONFIG_ITCO_WDT=y
CONFIG_ITCO_VENDOR_SUPPORT=y
CONFIG_INTEL_MEI_WDT=y


CONFIG_SSB_POSSIBLE=y
CONFIG_BCMA_POSSIBLE=y

CONFIG_MFD_CORE=y
CONFIG_LPC_ICH=y
CONFIG_MFD_INTEL_LPSS=y
CONFIG_MFD_INTEL_LPSS_ACPI=y
CONFIG_MFD_INTEL_LPSS_PCI=y

CONFIG_BACKLIGHT_CLASS_DEVICE=y

CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25

CONFIG_HID_SUPPORT=y
CONFIG_HID=y
CONFIG_HID_GENERIC=y



CONFIG_USB_HID=y

CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_PCI=y

CONFIG_USB_DEFAULT_PERSIST=y
CONFIG_USB_DYNAMIC_MINORS=y
CONFIG_USB_AUTOSUSPEND_DELAY=2

CONFIG_USB_XHCI_HCD=y
CONFIG_USB_XHCI_PCI=y
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_EHCI_PCI=y
CONFIG_USB_OHCI_HCD=y
CONFIG_USB_OHCI_HCD_PCI=y
CONFIG_USB_UHCI_HCD=y



CONFIG_USB_STORAGE=y

CONFIG_USB_GADGET=m
CONFIG_USB_GADGET_VBUS_DRAW=2
CONFIG_USB_GADGET_STORAGE_NUM_BUFFERS=2


CONFIG_USB_LIBCOMPOSITE=m
CONFIG_USB_F_FS=m
CONFIG_USB_CONFIGFS=m
CONFIG_USB_CONFIGFS_F_FS=y


CONFIG_INFINIBAND=y
CONFIG_INFINIBAND_USER_ACCESS=y
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ON_DEMAND_PAGING=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_ADDR_TRANS_CONFIGFS=y
CONFIG_INFINIBAND_VIRT_DMA=y
CONFIG_RDMA_RXE=y
CONFIG_RDMA_SIW=y
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_SYSTOHC_DEVICE="rtc0"
CONFIG_RTC_NVMEM=y

CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y


CONFIG_RTC_I2C_AND_SPI=y


CONFIG_RTC_DRV_CMOS=y



CONFIG_SYNC_FILE=y

CONFIG_VIRTIO_ANCHOR=y
CONFIG_VIRTIO=y
CONFIG_VHOST_MENU=y


CONFIG_X86_PLATFORM_DEVICES=y
CONFIG_ACPI_WMI=y
CONFIG_WMI_BMOF=y
CONFIG_MXM_WMI=y
CONFIG_INTEL_PMC_CORE=y

CONFIG_INTEL_SPEED_SELECT_INTERFACE=y



CONFIG_P2SB=y
CONFIG_HAVE_CLK=y
CONFIG_HAVE_CLK_PREPARE=y
CONFIG_COMMON_CLK=y

CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y

CONFIG_MAILBOX=y
CONFIG_PCC=y
CONFIG_IOMMU_IOVA=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y


CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
CONFIG_IOMMU_DMA=y
CONFIG_IOMMU_SVA=y
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_SVM=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_INTEL_IOMMU_PERF_EVENTS=y
CONFIG_VIRTIO_IOMMU=m

CONFIG_RAS=y

CONFIG_ANDROID_BINDER_IPC=y
CONFIG_ANDROID_BINDERFS=y
CONFIG_ANDROID_BINDER_DEVICES="y"

CONFIG_NVMEM=y

CONFIG_DCACHE_WORD_ACCESS=y
CONFIG_VALIDATE_FS_PARSER=y
CONFIG_FS_IOMAP=y
CONFIG_BUFFER_HEAD=y
CONFIG_LEGACY_DIRECT_IO=y
CONFIG_EXT2_FS=m
CONFIG_EXT4_FS=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
CONFIG_JBD2=y
CONFIG_FS_MBCACHE=y
CONFIG_XFS_FS=y
CONFIG_XFS_SUPPORT_V4=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_GFS2_FS=y
CONFIG_BTRFS_FS=y
CONFIG_BTRFS_FS_POSIX_ACL=y
CONFIG_BTRFS_DEBUG=y
CONFIG_BTRFS_FS_REF_VERIFY=y
CONFIG_F2FS_FS=m
CONFIG_F2FS_STAT_FS=y
CONFIG_F2FS_IOSTAT=y
CONFIG_ZONEFS_FS=m
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
CONFIG_EXPORTFS_BLOCK_OPS=y
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
CONFIG_QUOTA=y
CONFIG_QUOTACTL=y
CONFIG_AUTOFS_FS=y
CONFIG_FUSE_FS=y
CONFIG_CUSE=y
CONFIG_VIRTIO_FS=m
CONFIG_OVERLAY_FS=y
CONFIG_OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW=y

CONFIG_NETFS_SUPPORT=y
CONFIG_NETFS_STATS=y
CONFIG_FSCACHE=y
CONFIG_FSCACHE_STATS=y
CONFIG_FSCACHE_DEBUG=y
CONFIG_CACHEFILES=y

CONFIG_UDF_FS=y

CONFIG_NTFS_FS=m
CONFIG_NTFS3_FS=m

CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_PROC_PID_ARCH_STATUS=y
CONFIG_PROC_CPU_RESCTRL=y
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_EFIVAR_FS=y

CONFIG_MISC_FILESYSTEMS=y
CONFIG_ORANGEFS_FS=m
CONFIG_ECRYPT_FS=m
CONFIG_PSTORE=y
CONFIG_PSTORE_DEFAULT_KMSG_BYTES=10240
CONFIG_PSTORE_COMPRESS=y
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V2=m
CONFIG_NFS_V3=y
CONFIG_NFS_V4=y
CONFIG_NFS_SWAP=y
CONFIG_NFS_V4_1=y
CONFIG_NFS_V4_2=y
CONFIG_PNFS_FILE_LAYOUT=y
CONFIG_PNFS_BLOCK=y
CONFIG_PNFS_FLEXFILE_LAYOUT=y
CONFIG_NFS_V4_1_IMPLEMENTATION_ID_DOMAIN="kernel.org"
CONFIG_NFS_V4_SECURITY_LABEL=y
CONFIG_NFS_FSCACHE=y
CONFIG_NFS_USE_KERNEL_DNS=y
CONFIG_NFS_DEBUG=y
CONFIG_NFS_DISABLE_UDP_SUPPORT=y
CONFIG_NFSD=y
CONFIG_NFSD_V4=y
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_COMMON=y
CONFIG_NFS_V4_2_SSC_HELPER=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
CONFIG_SUNRPC_BACKCHANNEL=y
CONFIG_SUNRPC_SWAP=y
CONFIG_RPCSEC_GSS_KRB5=y
CONFIG_RPCSEC_GSS_KRB5_ENCTYPES_AES_SHA1=y
CONFIG_SUNRPC_DEBUG=y
CONFIG_SUNRPC_XPRT_RDMA=y
CONFIG_CEPH_FS=m
CONFIG_CEPH_FSCACHE=y
CONFIG_CEPH_FS_POSIX_ACL=y
CONFIG_CIFS=y
CONFIG_CIFS_STATS2=y
CONFIG_CIFS_ALLOW_INSECURE_LEGACY=y
CONFIG_CIFS_UPCALL=y
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
CONFIG_CIFS_DEBUG=y
CONFIG_CIFS_DFS_UPCALL=y
CONFIG_CIFS_SWN_UPCALL=y
CONFIG_CIFS_SMB_DIRECT=y
CONFIG_SMB_SERVER=y
CONFIG_SMB_SERVER_SMBDIRECT=y
CONFIG_SMB_SERVER_CHECK_CAP_NET_ADMIN=y
CONFIG_SMB_SERVER_KERBEROS5=y
CONFIG_SMBFS=y
CONFIG_CODA_FS=y
CONFIG_AFS_FS=y
CONFIG_AFS_FSCACHE=y
CONFIG_AFS_DEBUG_CURSOR=y
CONFIG_9P_FS=y
CONFIG_9P_FSCACHE=y
CONFIG_9P_FS_POSIX_ACL=y
CONFIG_9P_FS_SECURITY=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_UTF8=y
CONFIG_NLS_UCS2_UTILS=y
CONFIG_DLM=y
CONFIG_IO_WQ=y

CONFIG_KEYS=y
CONFIG_KEYS_REQUEST_CACHE=y
CONFIG_PERSISTENT_KEYRINGS=y
CONFIG_BIG_KEYS=y
CONFIG_TRUSTED_KEYS=y
CONFIG_TRUSTED_KEYS_TPM=y
CONFIG_ENCRYPTED_KEYS=y
CONFIG_KEY_DH_OPERATIONS=y
CONFIG_KEY_NOTIFICATIONS=y
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
CONFIG_SECURITY_PATH=y
CONFIG_LSM_MMAP_MIN_ADDR=65536
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_SIDTAB_HASH_BITS=9
CONFIG_SECURITY_SELINUX_SID2STR_CACHE_SIZE=256
CONFIG_SECURITY_YAMA=y
CONFIG_DEFAULT_SECURITY_SELINUX=y
CONFIG_LSM="yama,loadpin,safesetid,integrity,selinux,smack,tomoyo,apparmor"


CONFIG_CC_HAS_AUTO_VAR_INIT_PATTERN=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO_BARE=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO=y
CONFIG_INIT_STACK_NONE=y
CONFIG_CC_HAS_ZERO_CALL_USED_REGS=y


CONFIG_RANDSTRUCT_NONE=y

CONFIG_XOR_BLOCKS=y
CONFIG_CRYPTO=y

CONFIG_CRYPTO_FIPS=y
CONFIG_CRYPTO_FIPS_NAME="Linux Kernel Cryptographic API"
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_SIG2=y
CONFIG_CRYPTO_SKCIPHER=y
CONFIG_CRYPTO_SKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_RNG_DEFAULT=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_AKCIPHER=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_KPP=y
CONFIG_CRYPTO_ACOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
CONFIG_CRYPTO_USER=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
CONFIG_CRYPTO_CRYPTD=y
CONFIG_CRYPTO_AUTHENC=y
CONFIG_CRYPTO_SIMD=y

CONFIG_CRYPTO_RSA=y
CONFIG_CRYPTO_DH=y
CONFIG_CRYPTO_DH_RFC7919_GROUPS=y
CONFIG_CRYPTO_ECC=y
CONFIG_CRYPTO_ECDH=y
CONFIG_CRYPTO_ECDSA=y
CONFIG_CRYPTO_ECRDSA=y
CONFIG_CRYPTO_SM2=y
CONFIG_CRYPTO_CURVE25519=y

CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_CAMELLIA=y
CONFIG_CRYPTO_DES=y
CONFIG_CRYPTO_FCRYPT=y
CONFIG_CRYPTO_SERPENT=y
CONFIG_CRYPTO_SM4=y
CONFIG_CRYPTO_SM4_GENERIC=y

CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_CHACHA20=y
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
CONFIG_CRYPTO_CTS=y
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_PCBC=y

CONFIG_CRYPTO_CHACHA20POLY1305=y
CONFIG_CRYPTO_CCM=y
CONFIG_CRYPTO_GCM=y
CONFIG_CRYPTO_GENIV=y
CONFIG_CRYPTO_SEQIV=y
CONFIG_CRYPTO_ECHAINIV=y

CONFIG_CRYPTO_BLAKE2B=y
CONFIG_CRYPTO_CMAC=y
CONFIG_CRYPTO_GHASH=y
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_POLY1305=y
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
CONFIG_CRYPTO_SHA3=y
CONFIG_CRYPTO_SM3=y
CONFIG_CRYPTO_STREEBOG=y
CONFIG_CRYPTO_XXHASH=y

CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32=y
CONFIG_CRYPTO_CRCT10DIF=y

CONFIG_CRYPTO_DEFLATE=y

CONFIG_CRYPTO_DRBG_MENU=y
CONFIG_CRYPTO_DRBG_HMAC=y
CONFIG_CRYPTO_DRBG=y
CONFIG_CRYPTO_JITTERENTROPY=y
CONFIG_CRYPTO_JITTERENTROPY_MEMSIZE_2=y
CONFIG_CRYPTO_JITTERENTROPY_MEMORY_BLOCKS=64
CONFIG_CRYPTO_JITTERENTROPY_MEMORY_BLOCKSIZE=32
CONFIG_CRYPTO_JITTERENTROPY_OSR=1
CONFIG_CRYPTO_KDF800108_CTR=y

CONFIG_CRYPTO_USER_API=y
CONFIG_CRYPTO_USER_API_HASH=y
CONFIG_CRYPTO_USER_API_SKCIPHER=y
CONFIG_CRYPTO_USER_API_RNG=y
CONFIG_CRYPTO_USER_API_AEAD=y
CONFIG_CRYPTO_USER_API_ENABLE_OBSOLETE=y
CONFIG_CRYPTO_STATS=y

CONFIG_CRYPTO_HASH_INFO=y

CONFIG_CRYPTO_CURVE25519_X86=y
CONFIG_CRYPTO_AES_NI_INTEL=y
CONFIG_CRYPTO_CAMELLIA_X86_64=y
CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64=y
CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64=y
CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64=y
CONFIG_CRYPTO_SM4_AESNI_AVX2_X86_64=y
CONFIG_CRYPTO_CHACHA20_X86_64=y
CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL=y
CONFIG_CRYPTO_CRC32C_INTEL=m

CONFIG_CRYPTO_HW=y
CONFIG_CRYPTO_DEV_CHELSIO=m
CONFIG_ASYMMETRIC_KEY_TYPE=y
CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE=y
CONFIG_X509_CERTIFICATE_PARSER=y
CONFIG_PKCS7_MESSAGE_PARSER=y
CONFIG_PKCS7_TEST_KEY=y
CONFIG_SIGNED_PE_FILE_VERIFICATION=y

CONFIG_MODULE_SIG_KEY="certs/signing_key.pem"
CONFIG_MODULE_SIG_KEY_TYPE_RSA=y
CONFIG_SYSTEM_TRUSTED_KEYRING=y
CONFIG_SYSTEM_TRUSTED_KEYS=""
CONFIG_SYSTEM_EXTRA_CERTIFICATE=y
CONFIG_SYSTEM_EXTRA_CERTIFICATE_SIZE=4096
CONFIG_SECONDARY_TRUSTED_KEYRING=y
CONFIG_SYSTEM_BLACKLIST_KEYRING=y
CONFIG_SYSTEM_BLACKLIST_HASH_LIST="/data/modsign/blacklist"
CONFIG_SYSTEM_REVOCATION_LIST=y
CONFIG_SYSTEM_REVOCATION_KEYS=""

CONFIG_BINARY_PRINTF=y

CONFIG_RAID6_PQ=y
CONFIG_RAID6_PQ_BENCHMARK=y
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
CONFIG_RATIONAL=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_ARCH_USE_SYM_ANNOTATIONS=y

CONFIG_CRYPTO_LIB_UTILS=y
CONFIG_CRYPTO_LIB_AES=y
CONFIG_CRYPTO_LIB_ARC4=m
CONFIG_CRYPTO_LIB_GF128MUL=y
CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y
CONFIG_CRYPTO_ARCH_HAVE_LIB_CHACHA=y
CONFIG_CRYPTO_LIB_CHACHA_GENERIC=y
CONFIG_CRYPTO_LIB_CHACHA=y
CONFIG_CRYPTO_ARCH_HAVE_LIB_CURVE25519=y
CONFIG_CRYPTO_LIB_CURVE25519_GENERIC=y
CONFIG_CRYPTO_LIB_DES=y
CONFIG_CRYPTO_LIB_POLY1305_RSIZE=11
CONFIG_CRYPTO_LIB_POLY1305_GENERIC=y
CONFIG_CRYPTO_LIB_POLY1305=y
CONFIG_CRYPTO_LIB_CHACHA20POLY1305=y
CONFIG_CRYPTO_LIB_SHA1=y
CONFIG_CRYPTO_LIB_SHA256=y

CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
CONFIG_CRC32_SLICEBY8=y
CONFIG_LIBCRC32C=y
CONFIG_XXHASH=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_ZSTD_COMMON=y
CONFIG_ZSTD_COMPRESS=y
CONFIG_ZSTD_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_BCJ=y
CONFIG_DECOMPRESS_GZIP=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_INTERVAL_TREE=y
CONFIG_XARRAY_MULTI=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
CONFIG_DMA_OPS=y
CONFIG_NEED_SG_DMA_FLAGS=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_SWIOTLB=y
CONFIG_SGL_ALLOC=y
CONFIG_IOMMU_HELPER=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
CONFIG_NLATTR=y
CONFIG_LRU_CACHE=m
CONFIG_CLZ_TAB=y
CONFIG_IRQ_POLL=y
CONFIG_MPILIB=y
CONFIG_DIMLIB=y
CONFIG_OID_REGISTRY=y
CONFIG_UCS2_STRING=y
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_GETTIMEOFDAY=y
CONFIG_GENERIC_VDSO_TIME_NS=y
CONFIG_FONT_SUPPORT=y
CONFIG_FONT_8x16=y
CONFIG_FONT_AUTOSELECT=y
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION=y
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE=y
CONFIG_ARCH_HAS_COPY_MC=y
CONFIG_ARCH_STACKWALK=y
CONFIG_STACKDEPOT=y
CONFIG_SBITMAP=y

CONFIG_ASN1_ENCODER=y
CONFIG_FIRMWARE_TABLE=y


CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_CONSOLE_LOGLEVEL_QUIET=4
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
CONFIG_SYMBOLIC_ERRNAME=y
CONFIG_DEBUG_BUGVERBOSE=y

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y

CONFIG_DEBUG_INFO=y
CONFIG_AS_HAS_NON_CONST_LEB128=y
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
CONFIG_FRAME_WARN=2048
CONFIG_HEADERS_INSTALL=y
CONFIG_DEBUG_SECTION_MISMATCH=y
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
CONFIG_OBJTOOL=y

CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
CONFIG_HAVE_ARCH_KCSAN=y
CONFIG_HAVE_KCSAN_COMPILER=y


CONFIG_PAGE_EXTENSION=y
CONFIG_SLUB_DEBUG=y
CONFIG_PAGE_TABLE_CHECK=y
CONFIG_PAGE_TABLE_CHECK_ENFORCED=y
CONFIG_DEBUG_PAGE_REF=y
CONFIG_ARCH_HAS_DEBUG_WX=y
CONFIG_GENERIC_PTDUMP=y
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
CONFIG_ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP=y
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
CONFIG_HAVE_ARCH_KFENCE=y
CONFIG_HAVE_ARCH_KMSAN=y


CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
CONFIG_WQ_WATCHDOG=y

CONFIG_SCHED_INFO=y


CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y

CONFIG_STACKTRACE=y



CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0

CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_RETHOOK=y
CONFIG_RETHOOK=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_RETVAL=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_NO_PATCHABLE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_OBJTOOL_MCOUNT=y
CONFIG_HAVE_OBJTOOL_NOP_MCOUNT=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_HAVE_BUILDTIME_MCOUNT_SORT=y
CONFIG_BUILDTIME_MCOUNT_SORT=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_BRANCH_PROFILE_NONE=y
CONFIG_KPROBE_EVENTS=y
CONFIG_BPF_EVENTS=y
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_MCOUNT_USE_CC=y
CONFIG_SAMPLES=y
CONFIG_SAMPLE_VFS=y
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT=y
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT_MULTI=y
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y

CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_0X80=y
CONFIG_UNWINDER_ORC=y

CONFIG_KUNIT=y
CONFIG_KUNIT_DEFAULT_ENABLED=y
CONFIG_FUNCTION_ERROR_INJECTION=y
CONFIG_ARCH_HAS_KCOV=y
CONFIG_CC_HAS_SANCOV_TRACE_PC=y
CONFIG_RUNTIME_TESTING_MENU=y
CONFIG_TEST_IOV_ITER=m
CONFIG_ARCH_USE_MEMTEST=y

2023-11-15 21:44:01

by David Howells

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

I get:

LD arch/x86/purgatory/purgatory.chk
ld: arch/x86/purgatory/purgatory.ro:(.altinstr_replacement+0x1): undefined reference to `rep_movs_alternative'
ld: arch/x86/purgatory/purgatory.ro:(.altinstr_replacement+0x6): undefined reference to `rep_movs_alternative'

The symbol is available in the arch lib directory:

warthog>nm build3/arch/x86/lib/copy_user_64.o
0000000000000000 r __export_symbol_rep_movs_alternative
0000000000000000 T rep_movs_alternative
U __x86_return_thunk

so I'm not sure what's going on there.

David

2023-11-15 21:50:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, 15 Nov 2023 at 16:43, David Howells <[email protected]> wrote:
>
> LD arch/x86/purgatory/purgatory.chk
> ld: arch/x86/purgatory/purgatory.ro:(.altinstr_replacement+0x1): undefined reference to `rep_movs_alternative'
> ld: arch/x86/purgatory/purgatory.ro:(.altinstr_replacement+0x6): undefined reference to `rep_movs_alternative'
>
> The symbol is available in the arch lib directory:

That patch of mine ends up exposing the fact that we have a lot of
special boot-time code and similar that isn't real kernel code, but is
built with kernel headers.

Normally not that noticeable, but when it modifies something as core
as memcpy(), it shows up in a big way.

Sadly, we don't seem to have any obvious #define for "this is not real
kernel code". We just have a lot of ad-hoc tricks, like removing
compiler flags and disabling things like KASAN etc on a file-by-file
(or directory) basis.

The purgatory code isn't exactly boot-time code, but it's very similar
- it's kind of a limited environment that runs at crash time to load a
new kernel.

Linus

2023-11-15 21:59:57

by Borislav Petkov

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, Nov 15, 2023 at 04:50:06PM -0500, Linus Torvalds wrote:
> Sadly, we don't seem to have any obvious #define for "this is not real
> kernel code". We just have a lot of ad-hoc tricks, like removing
> compiler flags and disabling things like KASAN etc on a file-by-file
> (or directory) basis.

Yeah, "untangling" memcpy() has always been a PITA, every time I tried
it. So I'd need to come up with a somewhat sensible scheme of "use this
special memcpy() only in kernel code proper and fall back to the library
version or gcc builtin otherwise"...
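
A minimal sketch of what such a split might look like (the
__KERNEL_PROPER__ define is purely hypothetical - nothing like it
exists today):

/* Hypothetical: only kernel code proper gets the special memcpy(). */
#ifdef __KERNEL_PROPER__
void *memcpy(void *to, const void *from, size_t len); /* alternatives-based */
#else
#define memcpy(d, s, n) __builtin_memcpy((d), (s), (n)) /* plain builtin */
#endif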

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-15 23:00:31

by David Howells

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

Linus Torvalds <[email protected]> wrote:

>
> The purgatory code isn't exactly boot-time code, but it's very similar
> - it's kind of a limited environment that runs at crash time to load a
> new kernel.

Yeah - I can get rid of that by disabling KEXEC/CRASH handling, but then I see
errors in the compressed boot wrapper.

Instead, I tried replacing the call to memcpy() in memcpy_to_iter() and
memcpy_from_iter() with calls to __memcpy() to force the out-of-line memcpy.
Using my benchmarking patch, with what's upstream I see:

iov_kunit_benchmark_bvec: avg 3185 uS, stddev 19 uS
iov_kunit_benchmark_bvec: avg 3186 uS, stddev 9 uS
iov_kunit_benchmark_bvec: avg 3244 uS, stddev 153 uS
iov_kunit_benchmark_bvec_split: avg 3397 uS, stddev 16 uS
iov_kunit_benchmark_bvec_split: avg 3400 uS, stddev 16 uS
iov_kunit_benchmark_bvec_split: avg 3402 uS, stddev 34 uS
iov_kunit_benchmark_kvec: avg 2818 uS, stddev 550 uS
iov_kunit_benchmark_kvec: avg 2906 uS, stddev 21 uS
iov_kunit_benchmark_kvec: avg 2923 uS, stddev 1496 uS
iov_kunit_benchmark_xarray: avg 3564 uS, stddev 6 uS
iov_kunit_benchmark_xarray: avg 3573 uS, stddev 17 uS
iov_kunit_benchmark_xarray: avg 3575 uS, stddev 58 uS
iov_kunit_benchmark_xarray_to_bvec: avg 3929 uS, stddev 9 uS
iov_kunit_benchmark_xarray_to_bvec: avg 3930 uS, stddev 6 uS
iov_kunit_benchmark_xarray_to_bvec: avg 3930 uS, stddev 7 uS

And using __memcpy() rather than memcpy():

iov_kunit_benchmark_bvec: avg 9977 uS, stddev 26 uS
iov_kunit_benchmark_bvec: avg 9979 uS, stddev 12 uS
iov_kunit_benchmark_bvec: avg 9980 uS, stddev 10 uS
iov_kunit_benchmark_bvec_split: avg 9834 uS, stddev 31 uS
iov_kunit_benchmark_bvec_split: avg 9840 uS, stddev 22 uS
iov_kunit_benchmark_bvec_split: avg 9848 uS, stddev 55 uS
iov_kunit_benchmark_kvec: avg 10010 uS, stddev 3253 uS
iov_kunit_benchmark_kvec: avg 10017 uS, stddev 4400 uS
iov_kunit_benchmark_kvec: avg 10095 uS, stddev 4282 uS
iov_kunit_benchmark_xarray: avg 10611 uS, stddev 7 uS
iov_kunit_benchmark_xarray: avg 10611 uS, stddev 9 uS
iov_kunit_benchmark_xarray: avg 10616 uS, stddev 13 uS
iov_kunit_benchmark_xarray_to_bvec: avg 10523 uS, stddev 143 uS
iov_kunit_benchmark_xarray_to_bvec: avg 10524 uS, stddev 197 uS
iov_kunit_benchmark_xarray_to_bvec: avg 10527 uS, stddev 186 uS

A disassembly of _copy_from_iter() for the latter is attached. Note that the
UBUF/IOVEC cases still use "rep movsb", and that the calls show up as calls to
memcpy() because __memcpy() has the same address.

Switching to CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE instead of FOR_SIZE makes the
calls to memcpy() out of line, giving numbers like the second set, with or
without the patch.

I wonder how much of the performance reduction there is due to the mitigations
for spectre or suchlike.

David
---
<+0>: push %r15
<+2>: push %r14
<+4>: push %r13
<+6>: push %r12
<+8>: push %rbp
<+9>: push %rbx
<+10>: sub $0x58,%rsp
<+14>: mov %rdi,0x8(%rsp)
<+19>: mov %gs:0x28,%rax
<+28>: mov %rax,0x50(%rsp)
<+33>: xor %eax,%eax
<+35>: cmpb $0x0,0x3(%rdx)
<+39>: je 0xffffffff8176b766 <_copy_from_iter+52>
<+41>: cmpb $0x0,0x1(%rdx)
<+45>: mov %rdx,%r15
<+48>: je 0xffffffff8176b79b <_copy_from_iter+105>
<+50>: jmp 0xffffffff8176b76f <_copy_from_iter+61>
<+52>: ud2
<+54>: xor %ebx,%ebx
<+56>: jmp 0xffffffff8176baf9 <_copy_from_iter+967>
<+61>: mov 0x50(%rsp),%rax
<+66>: sub %gs:0x28,%rax
<+75>: jne 0xffffffff8176bb09 <_copy_from_iter+983>
<+81>: mov 0x8(%rsp),%rdi
<+86>: add $0x58,%rsp
<+90>: pop %rbx
<+91>: pop %rbp
<+92>: pop %r12
<+94>: pop %r13
<+96>: pop %r14
<+98>: pop %r15
<+100>: jmp 0xffffffff8176a1f3 <__copy_from_iter_mc>
<+105>: mov 0x18(%rdx),%rax
<+109>: cmp %rax,%rsi
<+112>: cmova %rax,%rsi
<+116>: test %rsi,%rsi
<+119>: mov %rsi,%rbx
<+122>: je 0xffffffff8176baf9 <_copy_from_iter+967>

<+128>: mov (%rdx),%dl
<+130>: test %dl,%dl # ITER_UBUF
<+132>: jne 0xffffffff8176b805 <_copy_from_iter+211>
<+134>: mov 0x8(%r15),%rsi
<+138>: mov %rbx,%rcx
<+141>: add 0x10(%r15),%rsi
<+145>: mov %rsi,%rdx
<+148>: mov %rbx,%rsi
<+151>: mov %rdx,%rdi
<+154>: call 0xffffffff8176a052 <__access_ok>
<+159>: test %al,%al
<+161>: je 0xffffffff8176b7e8 <_copy_from_iter+182>
<+163>: nop
<+164>: nop
<+165>: nop
<+166>: mov 0x8(%rsp),%rdi
<+171>: mov %rdx,%rsi
<+174>: rep movsb %ds:(%rsi),%es:(%rdi)
<+176>: nop
<+177>: nop
<+178>: nop
<+179>: nop
<+180>: nop
<+181>: nop
<+182>: mov %rbx,%rax
<+185>: sub %rcx,%rax
<+188>: add 0x18(%r15),%rcx
<+192>: add %rax,0x8(%r15)
<+196>: sub %rbx,%rcx
<+199>: mov %rax,%rbx
<+202>: mov %rcx,0x18(%r15)
<+206>: jmp 0xffffffff8176baf9 <_copy_from_iter+967>

<+211>: cmp $0x1,%dl # ITER_IOVEC
<+214>: jne 0xffffffff8176b8a3 <_copy_from_iter+369>
<+220>: mov 0x10(%r15),%rbp
<+224>: mov %rsi,%r10
<+227>: xor %ebx,%ebx
<+229>: mov 0x8(%r15),%r12
<+233>: mov 0x8(%rbp),%rdx
<+237>: sub %r12,%rdx
<+240>: cmp %r10,%rdx
<+243>: cmova %r10,%rdx
<+247>: test %rdx,%rdx
<+250>: je 0xffffffff8176b876 <_copy_from_iter+324>
<+252>: mov 0x0(%rbp),%r11
<+256>: mov %rdx,%rsi
<+259>: mov %rdx,%rcx
<+262>: add %r12,%r11
<+265>: mov %r11,%rdi
<+268>: call 0xffffffff8176a052 <__access_ok>
<+273>: test %al,%al
<+275>: je 0xffffffff8176b85e <_copy_from_iter+300>
<+277>: nop
<+278>: nop
<+279>: nop
<+280>: mov 0x8(%rsp),%rax
<+285>: mov %r11,%rsi
<+288>: lea (%rax,%rbx,1),%rdi
<+292>: rep movsb %ds:(%rsi),%es:(%rdi)
<+294>: nop
<+295>: nop
<+296>: nop
<+297>: nop
<+298>: nop
<+299>: nop
<+300>: mov %rdx,%rax
<+303>: sub %rdx,%r10
<+306>: sub %rcx,%rax
<+309>: add %rcx,%r10
<+312>: add %rax,%rbx
<+315>: add %rax,%r12
<+318>: cmp 0x8(%rbp),%r12
<+322>: jb 0xffffffff8176b884 <_copy_from_iter+338>
<+324>: add $0x10,%rbp
<+328>: xor %r12d,%r12d
<+331>: test %r10,%r10
<+334>: jne 0xffffffff8176b81b <_copy_from_iter+233>
<+336>: jmp 0xffffffff8176b887 <_copy_from_iter+341>
<+338>: mov %r12,%r10
<+341>: mov %rbp,%rax
<+344>: sub 0x10(%r15),%rax
<+348>: mov %r10,0x8(%r15)
<+352>: mov %rbp,0x10(%r15)
<+356>: sar $0x4,%rax
<+360>: sub %rax,0x20(%r15)
<+364>: jmp 0xffffffff8176baec <_copy_from_iter+954>

<+369>: cmp $0x2,%dl # ITER_BVEC
<+372>: jne 0xffffffff8176b946 <_copy_from_iter+532>
<+378>: mov 0x10(%r15),%r13
<+382>: mov %rsi,%r12
<+385>: xor %ebx,%ebx
<+387>: mov 0x8(%r15),%r14
<+391>: mov 0xc(%r13),%ecx
<+395>: add %r14,%rcx
<+398>: mov %rcx,%rdi
<+401>: and $0xfff,%ecx
<+407>: shr $0xc,%rdi
<+411>: shl $0x6,%rdi
<+415>: add 0x0(%r13),%rdi
<+419>: call 0xffffffff81769ae2 <kmap_local_page>
<+424>: mov 0x8(%r13),%ebp
<+428>: mov $0x1000,%edx
<+433>: lea (%rax,%rcx,1),%rsi
<+437>: mov 0x8(%rsp),%rax
<+442>: sub %r14,%rbp
<+445>: lea (%rax,%rbx,1),%rdi
<+449>: cmp %r12,%rbp
<+452>: cmova %r12,%rbp
<+456>: sub %rcx,%rdx
<+459>: cmp %rdx,%rbp
<+462>: cmova %rdx,%rbp
<+466>: mov %rbp,%rdx
<+469>: add %rbp,%r14
<+472>: sub %rbp,%r12
<+475>: call 0xffffffff81d63980 <memcpy>
<+480>: mov 0x8(%r13),%eax
<+484>: add %rbp,%rbx
<+487>: cmp %rax,%r14
<+490>: jb 0xffffffff8176b925 <_copy_from_iter+499>
<+492>: add $0x10,%r13
<+496>: xor %r14d,%r14d
<+499>: test %r12,%r12
<+502>: jne 0xffffffff8176b8b9 <_copy_from_iter+391>
<+504>: mov %r13,%rax
<+507>: sub 0x10(%r15),%rax
<+511>: mov %r14,0x8(%r15)
<+515>: mov %r13,0x10(%r15)
<+519>: sar $0x4,%rax
<+523>: sub %rax,0x20(%r15)
<+527>: jmp 0xffffffff8176baec <_copy_from_iter+954>

<+532>: cmp $0x3,%dl # ITER_KVEC
<+535>: jne 0xffffffff8176b9bf <_copy_from_iter+653>
<+537>: mov 0x10(%r15),%r13
<+541>: mov %rsi,%r12
<+544>: xor %ebx,%ebx
<+546>: mov 0x8(%r15),%r14
<+550>: mov 0x8(%r13),%rbp
<+554>: sub %r14,%rbp
<+557>: cmp %r12,%rbp
<+560>: cmova %r12,%rbp
<+564>: test %rbp,%rbp
<+567>: je 0xffffffff8176b992 <_copy_from_iter+608>
<+569>: mov 0x0(%r13),%rsi
<+573>: mov %rbp,%rdx
<+576>: sub %rbp,%r12
<+579>: mov 0x8(%rsp),%rax
<+584>: add %r14,%rsi
<+587>: add %rbp,%r14
<+590>: lea (%rax,%rbx,1),%rdi
<+594>: add %rbp,%rbx
<+597>: call 0xffffffff81d63980 <memcpy>
<+602>: cmp 0x8(%r13),%r14
<+606>: jb 0xffffffff8176b9a0 <_copy_from_iter+622>
<+608>: add $0x10,%r13
<+612>: xor %r14d,%r14d
<+615>: test %r12,%r12
<+618>: jne 0xffffffff8176b958 <_copy_from_iter+550>
<+620>: jmp 0xffffffff8176b9a3 <_copy_from_iter+625>
<+622>: mov %r14,%r12
<+625>: mov %r13,%rax
<+628>: sub 0x10(%r15),%rax
<+632>: mov %r12,0x8(%r15)
<+636>: mov %r13,0x10(%r15)
<+640>: sar $0x4,%rax
<+644>: sub %rax,0x20(%r15)
<+648>: jmp 0xffffffff8176baec <_copy_from_iter+954>

<+653>: cmp $0x4,%dl # ITER_XARRAY
<+656>: jne 0xffffffff8176baf2 <_copy_from_iter+960>
<+662>: movq $0x3,0x30(%rsp)
<+671>: mov 0x10(%r15),%rax
<+675>: xor %edx,%edx
<+677>: mov 0x8(%r15),%r14
<+681>: mov %rdx,0x38(%rsp)
<+686>: add 0x20(%r15),%r14
<+690>: mov %rdx,0x40(%rsp)
<+695>: mov %rdx,0x48(%rsp)
<+700>: mov %rax,0x18(%rsp)
<+705>: mov %r14,%rax
<+708>: shr $0xc,%rax
<+712>: mov %rax,0x20(%rsp)
<+717>: xor %eax,%eax
<+719>: mov %eax,0x28(%rsp)
<+723>: lea 0x18(%rsp),%rdi
<+728>: or $0xffffffffffffffff,%rsi
<+732>: mov %rbx,%r13
<+735>: call 0xffffffff81d5ca1d <xas_find>
<+740>: xor %ebx,%ebx
<+742>: mov %rax,(%rsp)
<+746>: cmpq $0x0,(%rsp)
<+751>: je 0xffffffff8176bae8 <_copy_from_iter+950>
<+757>: mov (%rsp),%rsi
<+761>: lea 0x18(%rsp),%rdi
<+766>: call 0xffffffff81769b5c <xas_retry>
<+771>: test %al,%al
<+773>: jne 0xffffffff8176bad5 <_copy_from_iter+931>
<+779>: testb $0x1,(%rsp)
<+783>: jne 0xffffffff8176ba7c <_copy_from_iter+842>
<+785>: mov (%rsp),%rdi
<+789>: call 0xffffffff81769a58 <folio_test_hugetlb>
<+794>: test %al,%al
<+796>: jne 0xffffffff8176ba80 <_copy_from_iter+846>
<+798>: mov (%rsp),%rdi
<+802>: lea (%r14,%rbx,1),%rdx
<+806>: call 0xffffffff81769ac8 <folio_size>
<+811>: mov (%rsp),%rdi
<+815>: lea -0x1(%rax),%r12
<+819>: and %rdx,%r12
<+822>: call 0xffffffff81769ac8 <folio_size>
<+827>: sub %r12,%rax
<+830>: cmp %r13,%rax
<+833>: cmova %r13,%rax
<+837>: mov %rax,%rbp
<+840>: jmp 0xffffffff8176bad0 <_copy_from_iter+926>
<+842>: ud2
<+844>: jmp 0xffffffff8176bae8 <_copy_from_iter+950>
<+846>: ud2
<+848>: jmp 0xffffffff8176bae8 <_copy_from_iter+950>
<+850>: mov (%rsp),%rdi
<+854>: mov %r12,%rsi
<+857>: call 0xffffffff81769b07 <kmap_local_folio>
<+862>: mov $0x1000,%edx
<+867>: mov %rax,%rsi
<+870>: mov %r12,%rax
<+873>: and $0xfff,%eax
<+878>: sub %rax,%rdx
<+881>: mov 0x8(%rsp),%rax
<+886>: cmp %rbp,%rdx
<+889>: cmova %rbp,%rdx
<+893>: lea (%rax,%rbx,1),%rdi
<+897>: mov %rdx,0x10(%rsp)
<+902>: call 0xffffffff81d63980 <memcpy>
<+907>: mov 0x10(%rsp),%rdx
<+912>: add %rdx,%rbx
<+915>: sub %rdx,%r13
<+918>: je 0xffffffff8176bae8 <_copy_from_iter+950>
<+920>: sub %rdx,%rbp
<+923>: add %rdx,%r12
<+926>: test %rbp,%rbp
<+929>: jne 0xffffffff8176ba84 <_copy_from_iter+850>
<+931>: lea 0x18(%rsp),%rdi
<+936>: call 0xffffffff8176a6bb <xas_next_entry>
<+941>: mov %rax,(%rsp)
<+945>: jmp 0xffffffff8176ba1c <_copy_from_iter+746>
<+950>: add %rbx,0x8(%r15)
<+954>: sub %rbx,0x18(%r15)
<+958>: jmp 0xffffffff8176baf9 <_copy_from_iter+967>

# ITER_DISCARD
<+960>: sub %rsi,%rax
<+963>: mov %rax,0x18(%r15)
<+967>: mov 0x50(%rsp),%rax
<+972>: sub %gs:0x28,%rax
<+981>: je 0xffffffff8176bb0e <_copy_from_iter+988>
<+983>: call 0xffffffff81d62fcd <__stack_chk_fail>
<+988>: add $0x58,%rsp
<+992>: mov %rbx,%rax
<+995>: pop %rbx
<+996>: pop %rbp
<+997>: pop %r12
<+999>: pop %r13
<+1001>: pop %r14
<+1003>: pop %r15
<+1005>: jmp 0xffffffff81d6f7e0 <__x86_return_thunk>

2023-11-16 03:28:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, 15 Nov 2023 at 18:00, David Howells <[email protected]> wrote:
>
> And using __memcpy() rather than memcpy():

Yeah, that's just sad. It might indeed be that you're running on a
Haswell core, and the retpoline overhead just kills that entirely. You
could try building the kernel without mitigations (or booting with
them off, which isn't quite as good) to verify.

> A disassembly of _copy_from_iter() for the latter is attached. Note that the
> UBUF/IOVEC still uses "rep movsb"

Well, yes and no.

User copies do that X86_FEATURE_FSRM alternatives dance, so the code
gets generated with "rep movs", but you'll note that there are several
'nops' after it.

Some of the nops are because we'll be inserting STAC/CLAC (three bytes
each, I think) instructions around user accesses for SMAP-capable
CPU's.

But some of the nops are because we'll be rewriting that "rep stosb"
(two bytes, iirc) as "call rep_stos_alternative" (5 bytes) on CPU's
that don't do FSRM like yours. So your CPU won't actually be executing
that 'rep stosb' sequence.
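
For reference, this is roughly how the user-copy path is put together
(lightly simplified from arch/x86/include/asm/uaccess_64.h; exact
details vary by kernel version) - the FSRM alternative is the rewrite
being described:

static __always_inline __must_check unsigned long
copy_user_generic(void *to, const void *from, unsigned long len)
{
	stac();				/* becomes STAC on SMAP-capable CPUs */
	asm volatile(
		"1:\n\t"
		/* "rep movsb" on FSRM parts, patched to a call otherwise */
		ALTERNATIVE("rep movsb",
			    "call rep_movs_alternative",
			    ALT_NOT(X86_FEATURE_FSRM))
		"2:\n"
		_ASM_EXTABLE_UA(1b, 2b)
		: "+c" (len), "+D" (to), "+S" (from)
		: : "memory", "rax");
	clac();
	return len;			/* bytes not copied */
}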

And yes, the '__x86_return_thunk' overhead can be pretty horrific. It
will get rewritten to the appropriate thing by "apply_returns". But
like the "rep movs" and the missing STAC/CLAC, you won't see that in
the objdump, you only see it in the final binary.

Linus

2023-11-16 10:08:40

by David Laight

[permalink] [raw]
Subject: RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: Linus Torvalds
> Sent: 15 November 2023 20:07
...
> - our current "memcpy_orig" fallback does unrolled copy loops, and
> the rep_movs_alternative fallback obviously doesn't.
>
> It's not clear that the unrolled copy loops matter for the in-kernel
> kinds of copies, but who knows. The memcpy_orig code is definitely
> trying to be smarter in some other ways too. So the fallback should
> try a *bit* harder than I did, and not just with the whole "don't try
> to handle exceptions" issue I mentioned.

I'm pretty sure the unrolled copy (and other unrolled loops)
just wastes I-cache and slows things down cold-cache.

With out-of-order execution on most x86 cpus (except atoms) you
don't really have to worry about the memory latency.
So get the loop control instructions to run in parallel with
the memory access ones and you can copy one word every clock.
I never managed a single clock loop, but you can get a two
clock loop (with 2 reads and 2 writes in it).

So unrolling once is typically enough.
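
In C, the loop shape I mean is roughly this (a hypothetical helper,
not kernel code - two reads then two writes per iteration, so the
loop control runs in parallel with the memory accesses):

/* Sketch: copy n qwords, n assumed even, unrolled once. */
static void copy_qwords(unsigned long *dst, const unsigned long *src, long n)
{
	long i;

	for (i = 0; i < n; i += 2) {
		unsigned long a = src[i];	/* two reads... */
		unsigned long b = src[i + 1];

		dst[i] = a;			/* ...then two writes */
		dst[i + 1] = b;
	}
}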

You can also ignore alignment; the extra cost is minimal (on
Intel cpus at least). I think it requires an extra u-op when
the copy crosses a cache line boundary.

On haswell (which is now quite old) both 'rep movsb' and
'rep movsq' copy 16 bytes/clock unless the destination
is 32-byte aligned, in which case they copy 32 bytes/clock.
Source alignment makes no difference, and neither does byte
alignment.

Another -Os stupidity is 'push $x; pop %reg' to load
a signed byte constant.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-11-16 10:14:49

by David Howells

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

David Laight <[email protected]> wrote:

> On haswell (which is now quite old) both 'rep movsb' and
> 'rep movsq' copy 16 bytes/clock unless the destination
> is 32-byte aligned, in which case they copy 32 bytes/clock.
> Source alignment makes no difference, and neither does byte
> alignment.

I think the i3-4170 cpu I'm using is Haswell. Does that mean for my
particular cpu, just using inline "rep movsb" is the best choice?

David

2023-11-16 11:38:25

by David Laight

[permalink] [raw]
Subject: RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: David Howells
> Sent: 16 November 2023 10:14
>
> David Laight <[email protected]> wrote:
>
> > On haswell (which is now quite old) both 'rep movsb' and
> > 'rep movsq' copy 16 bytes/clock unless the destination
> > is 32-byte aligned, in which case they copy 32 bytes/clock.
> > Source alignment makes no difference, and neither does byte
> > alignment.
>
> I think the i3-4170 cpu I'm using is Haswell. Does that mean for my
> particular cpu, just using inline "rep movsb" is the best choice?

I've just looked at a slight old copy of the instruction timing
doc from https://www.agner.org/optimize

Apart from P4 (130 clock setup!) the setup cost for 'rep movs'
is relatively small.
I think everything since sandy bridge and bulldozer (except atoms,
but including silvermont) does fast copies for 'rep movsb'.
(But the C2758 atom we use claims erms.)

I'd bet that the overhead for using 'rep movsb' for a short copy
is less than that of the mispredicted branch (or two) to select
the required code.

That rather implies always using 'rep movsb' is best unless
someone is compiling explicitly for an old cpu.
And apart from P4 an explicit 'rep movsl' will be fastest then
because the setup cost is minimal/zero.

The cutoff for using 'rep movsb' for constant sized copies
is probably also a lot less than you might expect.
Especially assuming cold cache.

This all makes that POS that gcc is inlining even more stupid.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-11-16 15:44:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, Nov 15, 2023 at 02:26:02PM -0500, Linus Torvalds wrote:
> So the real issue is that we don't want an inlined memcpy at all,
> unless it's the simple constant-sized case that has been turned into
> individual moves with no loop.
>
> Or it's a "rep movsb" with FSRM as a CPUID-based alternative, of course.

Reportedly and apparently, this pretty much addresses the issue at hand.
However, I'd still like for the compiler to handle the small length
cases by issuing plain MOVs instead of blindly doing "call memcpy".

Lemme see how it would work with your patch...


diff --git a/Makefile b/Makefile
index ede0bd241056..94d93070d54a 100644
--- a/Makefile
+++ b/Makefile
@@ -996,6 +996,8 @@ endif
# change __FILE__ to the relative path from the srctree
KBUILD_CPPFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)

+KBUILD_CFLAGS += $(call cc-option,-mstringop-strategy=libcall)
+
# include additional Makefiles when needed
include-y := scripts/Makefile.extrawarn
include-$(CONFIG_DEBUG_INFO) += scripts/Makefile.debug

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-16 16:44:28

by David Howells

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

Borislav Petkov <[email protected]> wrote:

> diff --git a/Makefile b/Makefile
> index ede0bd241056..94d93070d54a 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -996,6 +996,8 @@ endif
> # change __FILE__ to the relative path from the srctree
> KBUILD_CPPFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
>
> +KBUILD_CFLAGS += $(call cc-option,-mstringop-strategy=libcall)
> +
> # include additional Makefiles when needed
> include-y := scripts/Makefile.extrawarn
> include-$(CONFIG_DEBUG_INFO) += scripts/Makefile.debug

If you wanted this patch tried, here are the numbers I get:

iov_kunit_benchmark_bvec: avg 9950 uS, stddev 40 uS
iov_kunit_benchmark_bvec: avg 9950 uS, stddev 53 uS
iov_kunit_benchmark_bvec: avg 9973 uS, stddev 69 uS
iov_kunit_benchmark_bvec_split: avg 9793 uS, stddev 28 uS
iov_kunit_benchmark_bvec_split: avg 9800 uS, stddev 41 uS
iov_kunit_benchmark_bvec_split: avg 9804 uS, stddev 16 uS
iov_kunit_benchmark_kvec: avg 10122 uS, stddev 4403 uS
iov_kunit_benchmark_kvec: avg 9757 uS, stddev 1516 uS
iov_kunit_benchmark_kvec: avg 9909 uS, stddev 2694 uS
iov_kunit_benchmark_xarray: avg 10526 uS, stddev 52 uS
iov_kunit_benchmark_xarray: avg 10529 uS, stddev 48 uS
iov_kunit_benchmark_xarray: avg 10532 uS, stddev 63 uS
iov_kunit_benchmark_xarray_to_bvec: avg 10468 uS, stddev 23 uS
iov_kunit_benchmark_xarray_to_bvec: avg 10469 uS, stddev 157 uS
iov_kunit_benchmark_xarray_to_bvec: avg 10471 uS, stddev 163 uS

I'm using the 6 patches here:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-kunit

Set CONFIG_TEST_IOV_ITER=m and then load the kunit_iov_iter.ko module. It'll
dump its benchmarks to dmesg.

David

2023-11-16 16:49:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Thu, 16 Nov 2023 at 10:44, Borislav Petkov <[email protected]> wrote:
>
> Reportedly and apparently, this pretty much addresses the issue at hand.
> However, I'd still like for the compiler to handle the small length
> cases by issuing plain MOVs instead of blindly doing "call memcpy".
>
> Lemme see how it would work with your patch...

Hmm. I know about the '-mstringop-strategy' flag because of the fairly
recently discussed bug where gcc would create a byte-by-byte copy in
some crazy circumstances with the address space attributes:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657

But I incorrectly thought that "-mstringop-strategy=libcall" would
then *always* do library calls.

So I decided to test, and that shows that gcc still ends up doing the
"expand small constant size copies inline" even with that option, and
doesn't force library calls for those cases.

IOW, my assumption was just broken, and using
"-mstringop-strategy=libcall" may well be the right thing to do.

Of course, it's also possible that with all the function call overhead
introduced by the CPU mitigations on older CPU's, we should just say
"rep movsb" is always correct - if you have a new CPU with FSRM it's
good, and if you have an old CPU it's no worse than the horrendous CPU
mitigation overhead for function call/returns.

I really hate the mitigations. Oh well.

Anyway, maybe your patch is the RightThing(tm). Or maybe we should use
'rep_byte' instead of 'libcall'. Who knows..

Linus

2023-11-16 16:56:25

by David Laight

[permalink] [raw]
Subject: RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: Linus Torvalds
> Sent: 16 November 2023 03:27
>
> On Wed, 15 Nov 2023 at 18:00, David Howells <[email protected]> wrote:
...
> > A disassembly of _copy_from_iter() for the latter is attached. Note that the
> > UBUF/IOVEC still uses "rep movsb"
>
> Well, yes and no.
>
> User copies do that X86_FEATURE_FSRM alternatives dance, so the code
> gets generated with "rep movs", but you'll note that there are several
> 'nops' after it.
>
> Some of the nops are because we'll be inserting STAC/CLAC (three bytes
> each, I think) instructions around user accesses for SMAP-capable
> CPU's.
>
> But some of the nops are because we'll be rewriting that "rep stosb"
> (two bytes, iirc) as "call rep_stos_alternative" (5 bytes) on CPU's
> that don't do FSRM like yours. So your CPU won't actually be executing
> that 'rep stosb' sequence.

I presume lack of coffee is responsible for the s/movs/stos/ :-)

How much difference does FSRM actually make?
Especially when compared to the cost of a function call (even
without the horrid return thunk).

For small %cx I think non-FSRM modern cpu are ~2 clocks/byte
(no fixed overhead).
Which means 'rep movsb' wins for both short and long copies.
I wonder what sizes the function call (with all its size
based compares at the top) is actually a win.

There has to be some mileage in getting the compiler to generate
'call memcpy' (for non-constant sizes) and then run-time patching
the 5-byte 'call offset' into 'mov %edx,%ecx; rep movsb'.
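
In bytes, the idea is something like this (purely illustrative - it
assumes the length is in %edx and that the SysV argument registers
are still live at the call site):

/* Hypothetical: a 5-byte call site and its same-size replacement. */
static const unsigned char call_memcpy[5] = {
	0xe8, 0x00, 0x00, 0x00, 0x00,	/* call rel32 (offset filled in) */
};
static const unsigned char inline_movsb[5] = {
	0x89, 0xd1,			/* mov %edx,%ecx */
	0xf3, 0xa4,			/* rep movsb */
	0x90,				/* nop - pad to 5 bytes */
};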

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-11-16 16:58:58

by David Laight

[permalink] [raw]
Subject: RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

...
> Of course, it's also possible that with all the function call overhead
> introduced by the CPU mitigations on older CPU's, we should just say
> "rep movsb" is always correct - if you have a new CPU with FSRM it's
> good, and if you have an old CPU it's no worse than the horrendous CPU
> mitigation overhead for function call/returns.

Unless you are stupid enough to use a P4 :-)

I actually doubt anyone cares (esp. for 64bit) about any
cpu that doesn't optimise long 'rep movsb' (pre sandy bridge).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-11-16 17:25:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Thu, 16 Nov 2023 at 11:55, David Laight <[email protected]> wrote:
>
> I presume lack of coffee is responsible for the s/movs/stos/ :-)

Yes.

> How much difference does FSRM actually make?
> Especially when compared to the cost of a function call (even
> without the horrid return thunk).

It can be a big deal. The subject line here is an example. On that
machine, using the call to 'memcpy_orig' clearly performs *noticeably*
better. So that 16% regression was apparently at least partly
because of

-11.0 perf-profile.self.cycles-pp.memcpy_orig
+14.7 perf-profile.self.cycles-pp.copy_page_from_iter_atomic

where that inlined copy (that used 'rep movsq' and other things around
it) was noticeably worse than just calling memcpy_orig that does a
basic unrolled loop.

Now, *why* it matters a lot is unclear. Some machines literally have
the "fast rep string" code disabled, and then "rep movsb" is just
horrendous. That's arguably a machine setup issue, but people have
been known to do those things because of problems (most recently
"reptar").

And in most older microarchitectures it's not just the cycles in the
repeat thing, it is also a pipeline stall and I think it's also a
(partial? full?) barrier for OoO execution. That pipeline stall was
most noticeable on P4, but it's most definitely there on other cores
too.

And the OoO execution barrier can mean that it *benchmarks* fairly well
when you just do "rep movs" in a loop to test, but then if you have
code *around* it, it causes problems for the instructions around it.

I have this memory from my "push for -Os" (which is from over a decade
ago, so take my memory with a pinch of salt) of seeing "rep movsb"
followed by a load of the result causing a horrid stall on the load.

A regular load-store loop will have the store data forwarded to any
subsequent load, but "rep movs" might not do that and if it works on a
cacheline level you might lose out on those kinds of things.

Don't get me wrong - I really like the rep string instructions, and
while they have issues I'd *love* for CPU's to basically do "memcpy"
and "memset" without any library call overhead. The security
mitigations have made indirect calls much worse, but they have made
regular function call overhead worse too (and there's the I$ footprint
thing etc etc).

So I like "rep movs" a lot when it works well, but it most definitely
does not work well everywhere.

Of course, while the kernel test robot doesn't seem to like the
inlined "rep movsq", clearly the machine David is on absolutely
*hates* the call to memcpy_orig. Possibly due to mitigation overhead.

The problem with code generation at this level is that you win some,
you lose some. You can seldom make everybody happy.

Linus

2023-11-16 21:14:04

by David Howells

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

Linus Torvalds <[email protected]> wrote:

> You could try building the kernel without mitigations (or booting with them
> off, which isn't quite as good) to verify.

Okay, I disabled RETPOLINE, which seems like it should be the important one.
With inlined memcpy:

iov_kunit_benchmark_bvec: avg 3160 uS, stddev 17 uS
iov_kunit_benchmark_bvec_split: avg 3380 uS, stddev 29 uS
iov_kunit_benchmark_kvec: avg 2940 uS, stddev 978 uS
iov_kunit_benchmark_xarray: avg 3599 uS, stddev 8 uS
iov_kunit_benchmark_xarray_to_bvec: avg 3964 uS, stddev 16 uS

Directly calling __memcpy():

iov_kunit_benchmark_bvec: avg 9947 uS, stddev 61 uS
iov_kunit_benchmark_bvec_split: avg 9790 uS, stddev 13 uS
iov_kunit_benchmark_kvec: avg 9565 uS, stddev 758 uS
iov_kunit_benchmark_xarray: avg 10498 uS, stddev 24 uS
iov_kunit_benchmark_xarray_to_bvec: avg 10459 uS, stddev 188 uS

I created a duplicate of __memcpy() (called __movsb_memcpy) without the
"alternative" statement and made the iov_iter code call that:

iov_kunit_benchmark_bvec: avg 3177 uS, stddev 7 uS
iov_kunit_benchmark_bvec_split: avg 3393 uS, stddev 10 uS
iov_kunit_benchmark_kvec: avg 2813 uS, stddev 385 uS
iov_kunit_benchmark_xarray: avg 3651 uS, stddev 7 uS
iov_kunit_benchmark_xarray_to_bvec: avg 3946 uS, stddev 8 uS

And then I made it call memcpy_orig() directly:

iov_kunit_benchmark_bvec: avg 9942 uS, stddev 17 uS
iov_kunit_benchmark_bvec_split: avg 9802 uS, stddev 29 uS
iov_kunit_benchmark_kvec: avg 9547 uS, stddev 598 uS
iov_kunit_benchmark_xarray: avg 10486 uS, stddev 13 uS
iov_kunit_benchmark_xarray_to_bvec: avg 10438 uS, stddev 12 uS

(See attached patch)

David
---
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 0ae2e1712e2e..df1ebbe345e2 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -43,7 +43,7 @@ EXPORT_SYMBOL(__memcpy)
SYM_FUNC_ALIAS_MEMFUNC(memcpy, __memcpy)
EXPORT_SYMBOL(memcpy)

-SYM_FUNC_START_LOCAL(memcpy_orig)
+SYM_TYPED_FUNC_START(memcpy_orig)
movq %rdi, %rax

cmpq $0x20, %rdx
@@ -169,4 +169,12 @@ SYM_FUNC_START_LOCAL(memcpy_orig)
.Lend:
RET
SYM_FUNC_END(memcpy_orig)
+EXPORT_SYMBOL(memcpy_orig)

+SYM_TYPED_FUNC_START(__movsb_memcpy)
+ movq %rdi, %rax
+ movq %rdx, %rcx
+ rep movsb
+ RET
+SYM_FUNC_END(__movsb_memcpy)
+EXPORT_SYMBOL(__movsb_memcpy)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index de7d11cf4c63..620cd6356a5b 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -58,11 +58,18 @@ size_t copy_from_user_iter(void __user *iter_from, size_t progress,
return res;
}

+extern void *__movsb_memcpy(void *, const void *, size_t);
+extern void *memcpy_orig(void *, const void *, size_t);
+
static __always_inline
size_t memcpy_to_iter(void *iter_to, size_t progress,
size_t len, void *from, void *priv2)
{
- memcpy(iter_to, from + progress, len);
+#if 0
+ __movsb_memcpy(iter_to, from + progress, len);
+#else
+ memcpy_orig(iter_to, from + progress, len);
+#endif
return 0;
}

@@ -70,7 +77,11 @@ static __always_inline
size_t memcpy_from_iter(void *iter_from, size_t progress,
size_t len, void *to, void *priv2)
{
- memcpy(to + progress, iter_from, len);
+#if 0
+ __movsb_memcpy(to + progress, iter_from, len);
+#else
+ memcpy_orig(to + progress, iter_from, len);
+#endif
return 0;
}

2023-11-16 22:37:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Thu, 16 Nov 2023 at 16:13, David Howells <[email protected]> wrote:
>
>
> Okay, I disabled RETPOLINE, which seems like it should be the important one.
> With inlined memcpy:

Yeah, your machine really seems to hate the out-of-line call version.

It is also not unlikely that the benchmark is the perfect example of
that kind of "bad memory copy benchmark" where the actual results of
the copy are never used or touched. It's one case that sometimes makes
"rep movs" look (somewhat artificially) good, just because the
optimized rep string will do cacheline copies in L2. So if you never
touch the source or the destination of the copy, it never even gets
brought into the L1.

Linus

2023-11-16 22:54:04

by David Laight

[permalink] [raw]
Subject: RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: Linus Torvalds
> Sent: 16 November 2023 17:25
...
> > How much difference does FSRM actually make?
> > Especially when compared to the cost of a function call (even
> > without the horrid return thunk).
>
> It can be a big deal. The subject line here is an example. On that
> machine, using the call to 'memcpy_orig' clearly performs *noticeably*
> better. So that 16% regression was apparently at least partly
> because of
>
> -11.0 perf-profile.self.cycles-pp.memcpy_orig
> +14.7 perf-profile.self.cycles-pp.copy_page_from_iter_atomic
>
> where that inlined copy (that used 'rep movsq' and other things around
> it) was noticeably worse than just calling memcpy_orig that does a
> basic unrolled loop.

Wasn't that the stupid PoS inlined memcpy that was absolutely
horrendous?
I've also not seen any obvious statement about the lengths of the
copies.

> Now, *why* it matters a lot is unclear. Some machines literally have
> the "fast rep string" code disabled, and then "rep movsb" is just
> horrendous. That's arguably a machine setup issue, but people have
> been known to do those things because of problems (most recently
> "reptar").

They get what they deserve :-)

I've just done some measurements on an i7-7700.
cpuinfo:flags has erms but not fsrm (as I'd expect).

The test code path is:
	rdpmc
	lfence
then 10 copies of:
	mov	%r13,%rdi
	mov	%r14,%rsi
	mov	%r15,%rcx
	rep movsb
followed by:
	lfence
	rdpmc

which I run through 5 times.
The first pass is cold-cache and expected to be slow.
The other 4 pretty much take the same number of clocks.
(Which is what I've seen before using the same basic program
to time the ip-checksum code.)
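
For reference, a user-space sketch of that harness - illustrative only,
using rdtsc rather than rdpmc so it needs no counter setup, and with
made-up buffer sizes and copy lengths:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Read the TSC with a preceding lfence, as in the loop above. */
static inline uint64_t rdtsc_fenced(void)
{
	uint32_t lo, hi;
	__asm__ volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	static unsigned char src[4096], dst[4096];
	size_t len = 32;			/* copy length under test */

	for (int pass = 0; pass < 5; pass++) {	/* first pass is cold-cache */
		uint64_t t0 = rdtsc_fenced();
		for (int i = 0; i < 10; i++) {	/* 10 copies per sample */
			void *d = dst;
			const void *s = src;
			size_t n = len;
			__asm__ volatile("rep movsb"
					 : "+D"(d), "+S"(s), "+c"(n)
					 :: "memory");
		}
		uint64_t t1 = rdtsc_fenced();
		printf("pass %d: ~%llu clocks/copy\n", pass,
		       (unsigned long long)((t1 - t0) / 10));
	}
	return 0;
}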

At first sight it appears that each 'rep movsb' takes about
32 clocks for short copies and only starts increasing above
(about) 32 bytes - and then increases very slowly.

But something very odd is going on.
For length 1 (to ~32) the first pass is ~4500 clocks and the others ~320.
For longer length the clock count increases slowly.
But length 0 reports ~600 for all 5 passes.

The cache should be the same in both cases.
So the effect must be an artifact of the instruction decoder.
The loop is short enough to fit in the cpu loop buffer
(of decoded u-ops) - I could try padding it with lots of nops.

This rather implies that the decode of 'rep movs' is taking
something horrid like 450 clocks, but it gets saved somewhere.
OTOH if the count is zero the decode+execute is only ~60 clocks
but it isn't saved.

If that is true (and I doubt Intel would admit it) you pretty
much never want to use 'rep movs' in any form unless you are
going to execute the instruction multiple times or the
length is significant.

This wasn't the conclusion I expected to come to...

It also means that while 'rep movs' will copy at 16 bytes/clock
(or 32 if the destination is aligned) it is possible that it
will always be slower than a register copy loop (8 bytes/clock)
unless the copy is significantly longer than most of the kernel
memcpy() can ever be.

...
> I have this memory from my "push for -Os" (which is from over a decade
> ago, to take my memory with a pinch of salt) of seeing "rep movsb"
> followed by a load of the result causing a horrid stall on the load.

I added some (unrelated) memory accesses between the 'rep movsb' copies.
Didn't see any significant delays.

The systems you were using a decade ago were likely very different
to the current ones - especially if they were Intel and pre-dated
sandy bridge.

> A regular load-store loop will have the store data forwarded to any
> subsequent load, but "rep movs" might not do that and if it works on a
> cacheline level you might lose out on those kinds of things.

That probably doesn't matter for data buffer copies.
You are unlikely to access them again that quickly.

> Don't get me wrong - I really like the rep string instructions, and
> while they have issues I'd *love* for CPU's to basically do "memcpy"
> and "memset" without any library call overhead. The security
> mitigations have made indirect calls much worse, but they have made
> regular function call overhead worse too (and there's the I$ footprint
> thing etc etc).
>
> So I like "rep movs" a lot when it works well, but it most definitely
> does not work well everywhere.

Yes, it is a real shame that everything since (probably) the 486
has executed 'rep anything' rather slower than you might expect.

Intel also f*cked up the 'loop' (dec %cx, jnz) instruction.
Even on cpus with adcx and adox you can't use 'loop'.

...
> The problem with code generation at this level is that you win some,
> you lose some. You can seldom make everybody happy.

Trying to second guess a workable model for the x86 cpu is hard.
For arithmetic instructions the register dependency chains seem
to give a reasonable model.
If the code flow doesn't depend on the data then the 'out of order'
execute will process data (from cache) when the relevant memory
instructions finally complete.
So I actually got pretty much the expected timings for my ip-csum
code loops (somewhat better than the current version).

But give me a nice simple cpu like the NiosII soft cpu.
The instruction and local memory timings are absolutely
well defined - and you can look at the fpga internals and
check!

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-11-17 11:37:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Thu, Nov 16, 2023 at 04:44:06PM +0000, David Howells wrote:
> If you wanted this patch trying, I get the following numbers:
>
> iov_kunit_benchmark_bvec: avg 9950 uS, stddev 40 uS
> iov_kunit_benchmark_bvec: avg 9950 uS, stddev 53 uS
> iov_kunit_benchmark_bvec: avg 9973 uS, stddev 69 uS
> iov_kunit_benchmark_bvec_split: avg 9793 uS, stddev 28 uS
> iov_kunit_benchmark_bvec_split: avg 9800 uS, stddev 41 uS
> iov_kunit_benchmark_bvec_split: avg 9804 uS, stddev 16 uS
> iov_kunit_benchmark_kvec: avg 10122 uS, stddev 4403 uS
> iov_kunit_benchmark_kvec: avg 9757 uS, stddev 1516 uS
> iov_kunit_benchmark_kvec: avg 9909 uS, stddev 2694 uS
> iov_kunit_benchmark_xarray: avg 10526 uS, stddev 52 uS
> iov_kunit_benchmark_xarray: avg 10529 uS, stddev 48 uS
> iov_kunit_benchmark_xarray: avg 10532 uS, stddev 63 uS
> iov_kunit_benchmark_xarray_to_bvec: avg 10468 uS, stddev 23 uS
> iov_kunit_benchmark_xarray_to_bvec: avg 10469 uS, stddev 157 uS
> iov_kunit_benchmark_xarray_to_bvec: avg 10471 uS, stddev 163 uS

Hmm, stupid question: are those results better than without that
oneliner?

I don't see those same numbers done without this patch - maybe I can't
find them in the thread or so...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-17 11:45:06

by Borislav Petkov

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

Might as well Cc toolchains...

On Thu, Nov 16, 2023 at 11:48:18AM -0500, Linus Torvalds wrote:
> Hmm. I know about the '-mstringop-strategy' flag because of the fairly
> recently discussed bug where gcc would create a byte-by-byte copy in
> some crazy circumstances with the address space attributes:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657

I hear those stringop strategy heuristics are interesting. :)

> But I incorrectly thought that "-mstringop-strategy=libcall" would
> then *always* do library calls.

That's how I understood it too. BUT, reportedly, small and known sizes
are still optimized, which is exactly what we want.

> So I decided to test, and that shows that gcc still ends up doing the
> "expand small constant size copies inline" even with that option, and
> doesn't force library calls for those cases.

And you've confirmed it.

> IOW, my assumption was just broken, and using
> "-mstringop-strategy=libcall" may well be the right thing to do.

And here's where I'm wondering whether we should enable it for x86 only
or globally. I think globally because those stringop heuristics happen,
AFAIU, in the general optimization stage and thus target agnostic.

> Of course, it's also possible that with all the function call overhead
> introduced by the CPU mitigations on older CPU's, we should just say
> "rep movsb" is always correct - if you have a new CPU with FSRM it's
> good, and if you have an old CPU it's no worse than the horrendous CPU
> mitigation overhead for function call/returns.

Yeah, I think we should measure the libcall thing and then try to get
the inlined "rep movsb" working and see which one is better. You do have
a point about that RET overhead after each CALL.

> I really hate the mitigations. Oh well.

Tell me about it.

> Ayway, maybe your patch is the RightThing(tm). Or maybe we should use
> 'rep_byte' instead of 'libcall'. Who knows..

Yeah, lemme keep playing with this.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-17 12:10:32

by Jakub Jelinek

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Fri, Nov 17, 2023 at 12:44:21PM +0100, Borislav Petkov wrote:
> Might as well Cc toolchains...
>
> On Thu, Nov 16, 2023 at 11:48:18AM -0500, Linus Torvalds wrote:
> > Hmm. I know about the '-mstringop-strategy' flag because of the fairly
> > recently discussed bug where gcc would create a byte-by-byte copy in
> > some crazy circumstances with the address space attributes:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657
>
> I hear those stringop strategy heuristics are interesting. :)
>
> > But I incorrectly thought that "-mstringop-strategy=libcall" would
> > then *always* do library calls.
>
> That's how I understood it too. BUT, reportedly, small and known sizes
> are still optimized, which is exactly what we want.

Sure. -mstringop-strategy affects only x86 expansion of the stringops
from GIMPLE to RTL, while for small constant sizes some folding can happen
far earlier in generic code. Similarly, the copy/store by pieces generic
handling (straight-line code expansion of the builtins) is done in some
cases without invoking the backend optabs which is the only expansion
affected by the strategy.
Note, the default strategy depends on the sizes, -mtune= in effect,
whether it is -Os or -O2 etc. And the argument for -mmemcpy-strategy=
or -mmemset-strategy= can include details on what sizes should be handled
by which algorithm, not everything needs to be done the same.
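
For example, something along these lines (illustrative - check the gcc
manual for your version for the exact algorithm names and triplet
syntax):

	gcc ... -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign

i.e. expand copies up to 256 bytes as an inline unrolled loop and emit
a library call for everything larger.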

> > IOW, my assumption was just broken, and using
> > "-mstringop-strategy=libcall" may well be the right thing to do.
>
> And here's where I'm wondering whether we should enable it for x86 only
> or globally. I think globally because those stringop heuristics happen,
> AFAIU, in the general optimization stage and thus target agnostic.

-mstringop-strategy= option is x86 specific, so I don't see how you could
enable it on other architectures.

Anyway, if you are just trying to work-around bugs in specific compilers,
please limit it to the affected compilers, overriding kernel makefiles
forever with the workaround would mean you force perhaps suboptimal
expansion in various cases.

Jakub

2023-11-17 12:19:14

by Borislav Petkov

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

+ SUSE gcc folks.

On Fri, Nov 17, 2023 at 01:09:55PM +0100, Jakub Jelinek wrote:
> Sure. -mstringop-strategy affects only x86 expansion of the stringops
> from GIMPLE to RTL, while for small constant sizes some folding can happen
> far earlier in generic code. Similarly, the copy/store by pieces generic
> handling (straight-line code expansion of the builtins) is done in some
> cases without invoking the backend optabs which is the only expansion
> affected by the strategy.
> Note, the default strategy depends on the sizes, -mtune= in effect,
> whether it is -Os or -O2 etc. And the argument for -mmemcpy-strategy=
> or -mmemset-strategy= can include details on what sizes should be handled
> by which algorithm, not everything needs to be done the same.

Good to know, I might experiment with those. Thx.

> > > IOW, my assumption was just broken, and using
> > > "-mstringop-strategy=libcall" may well be the right thing to do.
> >
> > And here's where I'm wondering whether we should enable it for x86 only
> > or globally. I think globally because those stringop heuristics happen,
> > AFAIU, in the general optimization stage and thus target agnostic.
>
> -mstringop-strategy= option is x86 specific, so I don't see how you could
> enable it on other architectures.

Yeah, Richi just explained to me the same on another thread. To which
I had the question:

"Ah, it even says so in the manpage:

x86 Options ... -mstringop-strategy=alg

Ok, so how would the same option be implemented for ARM or some other
backend?

Also -mstringop-strategy=alg but it would have effect when generating
ARM code, right?

Which means, if I have it in the generic Makefile, it'll get
automatically used on ARM too when gcc implements it.

Which then begs the question whether we want that or we should let ARM
folks decide when that time comes."

I.e., what happens if we have this option in the generic Makefile and
-mstringop-strategy starts affecting ARM expansion of the stringops from
GIMPLE to RTL? Does that even make sense?

> Anyway, if you are just trying to work-around bugs in specific compilers,
> please limit it to the affected compilers, overriding kernel makefiles
> forever with the workaround would mean you force perhaps suboptimal
> expansion in various cases.

Yeah, perhaps a good idea. gcc 13 for now, I guess...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-17 13:10:07

by David Laight

[permalink] [raw]
Subject: RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: Borislav Petkov
> Sent: 17 November 2023 11:44
...
> Yeah, I think we should measure the libcall thing and then try to get
> the inlined "rep movsb" working and see which one is better. You do have
> a point about that RET overhead after each CALL.

You might be able to use the relocation list for memcpy()
to change the 5 byte call instruction into the inline
'mov %rdx,%rcx; rep movsb' sequence.
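
Both sequences are exactly five bytes (assuming the length arrives in
%rdx per the calling convention), so the patch would fit in place:

	e8 xx xx xx xx		call   memcpy		# 5 bytes
	48 89 d1		mov    %rdx,%rcx	# 3 bytes
	f3 a4			rep movsb		# 2 bytes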

I've spent all morning (on holiday) trying to understand the strange
timings I'm seeing for 'rep movsb' on an i7-7700.

The fixed overhead is very strange.

The first 'rep movsb' I do in a process takes an extra 5000 clocks or so.
But it doesn't seem to matter when I do it!
I can do it on entry to main() with several system calls before
the timing loop.

After that the fixed overhead for the 'rep movsb' is fairly small.
I've a few extra register moves between the 'rep movsb' copies but
I'd guess at about 30 clocks.
All sizes up to (at least) 32 bytes execute in the same time.
After that it increases at much the rate you'd expect.

Zero length copies are different, they always take ~60 clocks.

My current guess for the 5000 clocks is that the logic to
decode 'rep movsb' is loaded into a buffer that is also used
to decode some other instructions.
So if it still contains the 'rep movsb' decoder it is fast, otherwise
it is slow.

No idea what other instructions might be using the same logic
(microcode?) buffer.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-11-17 13:36:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Fri, 17 Nov 2023 at 08:09, David Laight <[email protected]> wrote:
>
> Zero length copies are different, they always take ~60 clocks.

That zero-length thing is some odd microcode implementation issue, and
I think intel actually made a FZRM cpuid bit available for it ("Fast
Zero-size Rep Movs").

I don't think we care in the kernel, but somebody else did (or maybe
Intel added a flag for "we fixed it" just because they noticed).

I at some point did some profiling, and we do have zero-length memcpy
cases occasionally (at least for user copies, which was what I was
looking at), but they aren't common enough to worry about some small
extra strange overhead.

(In case you care, it was for things like an ioctl doing "copy the
base part of the ioctl data, then copy the rest separately". Where
"the rest" was then often nothing at all).

> My current guess for the 5000 clocks is that the logic to
> decode 'rep movsb' is loaded into a buffer that is also used
> to decode some other instructions.

Unlikely.

I would guess it's the "power up the AVX2 side". The memory copy uses
those same resources internally.

You could try to see if "first AVX memory access" (or similar) has the
same extra initial cpu cycle issue.

Anyway, the CPU you are testing is new enough to have ERMS - that's
the "we do pretty well on string instructions" flag. It does indeed do
pretty well on string instructions, but has a few oddities in addition
to the zero-sized thing.

The other bad cases tend to be along the line of "it falls flat on its
face when the source and destination address are not mutually aligned,
but they are the same virtual address modulo 4096".

Or something like that. I forget the exact details. The details do
exist, but I forget where (I suspect either Agner Fog or some footnote
in some Intel architecture manual).

So it's very much not as simple as "fixed initial cost and then a
fairly fixed cost per 32B", even if that is *one* pattern.

Linus

2023-11-17 14:12:34

by David Howells

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

Borislav Petkov <[email protected]> wrote:

> Hmm, stupid question: are those results better than without that
> oneliner?

Without any changes, I see something like:

iov_kunit_benchmark_bvec: avg 3185 uS, stddev 19 uS
iov_kunit_benchmark_bvec: avg 3186 uS, stddev 9 uS
iov_kunit_benchmark_bvec: avg 3244 uS, stddev 153 uS
iov_kunit_benchmark_bvec_split: avg 3397 uS, stddev 16 uS
iov_kunit_benchmark_bvec_split: avg 3400 uS, stddev 16 uS
iov_kunit_benchmark_bvec_split: avg 3402 uS, stddev 34 uS
iov_kunit_benchmark_kvec: avg 2818 uS, stddev 550 uS
iov_kunit_benchmark_kvec: avg 2906 uS, stddev 21 uS
iov_kunit_benchmark_kvec: avg 2923 uS, stddev 1496 uS
iov_kunit_benchmark_xarray: avg 3564 uS, stddev 6 uS
iov_kunit_benchmark_xarray: avg 3573 uS, stddev 17 uS
iov_kunit_benchmark_xarray: avg 3575 uS, stddev 58 uS
iov_kunit_benchmark_xarray_to_bvec: avg 3929 uS, stddev 9 uS
iov_kunit_benchmark_xarray_to_bvec: avg 3930 uS, stddev 6 uS
iov_kunit_benchmark_xarray_to_bvec: avg 3930 uS, stddev 7 uS

David

2023-11-17 15:20:58

by David Laight

[permalink] [raw]
Subject: RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: Linus Torvalds
> Sent: 17 November 2023 13:36
>
> On Fri, 17 Nov 2023 at 08:09, David Laight <[email protected]> wrote:
> >
> > Zero length copies are different, they always take ~60 clocks.
>
> That zero-length thing is some odd microcode implementation issue, and
> I think intel actually made a FZRM cpuid bit available for it ("Fast
> Zero-size Rep Movs").
>
> I don't think we care in the kernel, but somebody else did (or maybe
> Intel added a flag for "we fixed it" just because they noticed)

I wasn't really worried about it - but it was an oddity.

> I at some point did some profiling, and we do have zero-length memcpy
> cases occasionally (at least for user copies, which was what I was
> looking at), but they aren't common enough to worry about some small
> extra strange overhead.

For user copies avoiding the stac/clac might make it worthwhile.
But I doubt you'd want to add the 'jcxz .+n' in the copy code
itself because the mispredicted branch might make a bigger
difference.
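
I.e. something like the following, with 'jrcxz' in 64-bit code since
it tests %rcx:

	jrcxz	1f		# skip the string op for zero counts
	rep movsb
1: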

I have tested writev() with lots of zero length fragments.
But that isn't a normal case.

> (In case you care, it was for things like an ioctl doing "copy the
> base part of the ioctl data, then copy the rest separately". Where
> "the rest" was then often nothing at all).

That specific code where a zero length copy is quite likely
would probably benefit from a test in the source.

> > My current guess for the 5000 clocks is that the logic to
> > decode 'rep movsb' is loaded into a buffer that is also used
> > to decode some other instructions.
>
> Unlikely.
>
> I would guess it's the "power up the AVX2 side". The memory copy uses
> those same resources internally.

That would make more sense - and have much the same effect.
If the kernel used 'rep movsb' internally and for user copies
it pretty much wouldn't ever get powered down.

> You could try to see if "first AVX memory access" (or similar) has the
> same extra initial cpu cycle issue.

Spot on.
vpbroadcast %xmm1,%xmm2
does the trick as well.

> Anyway, the CPU you are testing is new enough to have ERMS - that's
> the "we do pretty well on string instructions" flag. It does indeed do
> pretty well on string instructions, but has a few oddities in addition
> to the zero-sized thing.

From what I looked at pretty much everything anyone cares about
probably has ERMS.
You need to be running on something older than sandy bridge.
So basically 'core 2' or 'core 2 duo' (or P4 netburst).
The amd cpus are similarly old.

> The other bad cases tend to be along the line of "it falls flat on its
> face when the source and destination address are not mutually aligned,
> but they are the same virtual address modulo 4096".

There is a similar condition that very often stops the cpu from
ever actually doing two memory reads in one clock.
Could easily be related.

> Or something like that. I forget the exact details. The details do
> exist, but I forget where (I suspect either Agner Fog or some footnote
> in some Intel architecture manual).

If Intel have published it, it will be in an unlit basement
behind a locked door and a broken staircase!

Unless 'page copy' hits it I wonder if it really matters
for a normal workload.
Yes, you can conspire to hit it, but mostly you won't.

Wasn't it one of the atoms where the data cache prefetch
managed to completely destroy a forwards data copy?
To the point where it was worth taking the hit of a
backwards copy?

> So it's very much not as simple as "fixed initial cost and then a
> fairly fixed cost per 32B", even if that is *one* pattern.

True, but it is the most common one.
And if it is bad the whole thing isn't worth using at all.

I'll try my test on an ivy bridge later.
(I don't have anything older that actually boots.)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-11-17 16:10:23

by Borislav Petkov

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Fri, Nov 17, 2023 at 02:12:17PM +0000, David Howells wrote:
> Borislav Petkov <[email protected]> wrote:
>
> > Hmm, stupid question: are those results better than without that
> > oneliner?
>
> Without any changes, I see something like:
>
> iov_kunit_benchmark_bvec: avg 3185 uS, stddev 19 uS
> iov_kunit_benchmark_bvec: avg 3186 uS, stddev 9 uS
> iov_kunit_benchmark_bvec: avg 3244 uS, stddev 153 uS
> iov_kunit_benchmark_bvec_split: avg 3397 uS, stddev 16 uS
> iov_kunit_benchmark_bvec_split: avg 3400 uS, stddev 16 uS
> iov_kunit_benchmark_bvec_split: avg 3402 uS, stddev 34 uS
> iov_kunit_benchmark_kvec: avg 2818 uS, stddev 550 uS
> iov_kunit_benchmark_kvec: avg 2906 uS, stddev 21 uS
> iov_kunit_benchmark_kvec: avg 2923 uS, stddev 1496 uS
> iov_kunit_benchmark_xarray: avg 3564 uS, stddev 6 uS
> iov_kunit_benchmark_xarray: avg 3573 uS, stddev 17 uS
> iov_kunit_benchmark_xarray: avg 3575 uS, stddev 58 uS
> iov_kunit_benchmark_xarray_to_bvec: avg 3929 uS, stddev 9 uS
> iov_kunit_benchmark_xarray_to_bvec: avg 3930 uS, stddev 6 uS
> iov_kunit_benchmark_xarray_to_bvec: avg 3930 uS, stddev 7 uS

Which looks like those added memcpy calls add a lot of overhead due to
the mitigations crap.

You could verify that if you boot a kernel with the oneliner but add
"mitigations=off" on the cmdline.

Or profile the workload and check whether the return thunks appear
higher in the profile... I'd say.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-17 16:33:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Fri, 17 Nov 2023 at 11:10, Borislav Petkov <[email protected]> wrote:
>
> Which looks like those added memcpy calls add a lot of overhead due to
> the mitigations crap.

No, you missed where I thought that too and asked David to test
without mitigations.

That load really loves "rep movsb" on his machine (and that includes
the gcc-generated odd inlined "one word by hand, and then 'rep movsq'
for the rest").

It's probably because it's a benchmark that doesn't actually touch the
data, and does page-sized copies. It's pretty much the optimal case
for ERMS.

The "do one word by hand, the rest with 'rep movsq'" model that gcc
uses (but only in this particular code generation case) probably ends
up being quite reasonable in general - the one word by hand allows for
unaligned counts, but it also brings in the beginning of the copy into
the cache (which is *often* the part used later - not in this
benchmark, but in general), and then the rest ends up being done as L2
cacheline copies at least when we have those nice page-aligned
patterns.
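
Roughly, the pattern in question has this shape (illustrative, not
gcc's exact output):

	movq	(%rsi), %rax	# one word copied by hand
	movq	%rax, (%rdi)
	...			# adjust the pointers and compute the
				# remaining quadword count in %rcx
	rep movsq		# the rest, 8 bytes at a time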

Of course, this whole thread started because the kernel test robot
then has exactly the opposite reaction - it seems to really *hate*
that inlined code generation by gcc. Whether that is because it's a
very different microarchitecture, or it's because it's just a very
different access pattern than the one that David's random KUnit test
pattern is, I don't know.

That kernel test robot case is on a Cooper Lake Xeon, which is (I
think) just Skylake server. Random Intel codenames...

So that test robot has ERMS too, but no FSRM, so we do the old
"memcpy_orig()" with the regular memcpy loop.

And on that Xeon, it really does seem to be the right thing to do.

But the profile is so noisy with other changes that it's not like I
can guarantee that that is the main issue here. The reason I zeroed in
on the memcpy thing was really just that (a) it does show up in the
profiles and (b) the commit that introduced that 16% regression
doesn't really seem to do anything else than reorganize things just
enough that gcc seems to do that alternate memcpy implementation.

The test case seems to be (from the profile) just a simple

do_iter_readv_writev ->
shmem_file_write_iter ->
generic_perform_write ->
copy_page_from_iter_atomic ->
memcpy_from_iter_mc

and that's then where the new code generation matters (ie does it do
that "inline with rep movsq" or "call memcpy_orig").

For David, the rep movsq is great. For the kernel test robot, it's bad.

Linus

2023-11-17 16:45:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Fri, 17 Nov 2023 at 11:32, Linus Torvalds
<[email protected]> wrote:
>
> The test case seems to be (from the profile) just a simple

Just to clarify: that's the test robot test case. David's test-case is
just the simple KUnit test.

> do_iter_readv_writev ->
> shmem_file_write_iter ->
> generic_perform_write ->
> copy_page_from_iter_atomic ->
> memcpy_from_iter_mc

.. and more details: this is *not* the actual normal "write()" path,
which would copy memory from user space, and that uses our
"copy_user()" function etc.

No, the path leading up to this seems to be

worker_thread ->
process_one_work ->
loop_process_work ->
lo_write_simple ->
do_iter_write

which is why it uses a regular memcpy (ie it's a kernel buffer due to
loop block device).

So the test robot load is kind of odd.

Not that I think that David's KUnit test is necessarily much of a real
load either, so...

Linus

2023-11-17 19:13:50

by Borislav Petkov

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Fri, Nov 17, 2023 at 11:44:10AM -0500, Linus Torvalds wrote:
> So the test robot load is kind of odd.

So looking at that. IINM, its documentation says:

https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/Documentation

case-msync:
Create N sparse files, each with a size of $MemTotal. For each sparse file,
start a process to write 1/2N of the sparse file's size. After the write,
do a msync to make sure the change in memory has reached the file.

Is that something userspace usually does?

Some distributed, shared thing logging to the same file?

I obviously have no effing clue what userspace does...

> Not that I think that David's KUnit test is necessarily much of a real
> load either. so...

Which begs the question: what are our priorities here?

I wouldn't want to optimize some weird loads. Especially if you have
weird loads which perform differently depending on what uarch
"optimizations" they sport.

I guess optimizing for the majority of machines - modern FSRM ones which
can do "rep; movsb" just fine - is one way to put it. And the rest is
best effort.

Hmmm.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-17 21:58:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Fri, 17 Nov 2023 at 11:13, Borislav Petkov <[email protected]> wrote:
>
> I wouldn't want to optimize some weird loads. Especially if you have
> weird loads which perform differently depending on what uarch
> "optimizations" they sport.
>
> I guess optimizing for the majority of machines - modern FSRM ones which
> can do "rep; movsb" just fine - is one way to put it. And the rest is
> best effort.

Yeah, we shouldn't optimize for microbenchmarks in particular.

The kernel robot performance reports have been interesting, because
they do end up often pointing to real issues. But we've had these
kinds of things too, where the benchmark is just odd and clearly
happens to trigger something that is just very machine-specific.

So I don't think we should use either of these benchmarks as a "we
need to optimize for *this*", but it is another example of how much
memcpy() does matter. Even if the end result is then "but different
microarchitectures react so differently that we can't please
everybody".

Linus

2023-11-20 11:53:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Wed, Nov 15, 2023 at 04:50:06PM -0500, Linus Torvalds wrote:
> Sadly, we don't seem to have any obvious #define for "this is not real
> kernel code". We just have a lot of ad-hoc tricks, like removing
> compiler flags and disabling things like KASAN etc on a file-by-file
> (or directory) basis.
>
> The purgatory code isn't exactly boot-time code, but it's very similar
> - it's kind of a limited environment that runs at crash time to load a
> new kernel.

So I've been trying to do a proper, clean split of kernel proper from
"other objects" by not allowing kernel-proper crap to get linked into
those other objects.

Because the same thing happens here: the sha256.o object gets included
from kernel proper:

$(obj)/sha256.o: $(srctree)/lib/crypto/sha256.c FORCE
$(call if_changed_rule,cc_o_c)

and that object has the alternatives sections:

[ 7] .altinstructions PROGBITS 0000000000000000 00000be8
000000000000002a 0000000000000000 A 0 0 1
[ 8] .rela.altins[...] RELA 0000000000000000 00001190
0000000000000090 0000000000000018 I 16 7 8
[ 9] .altinstr_re[...] PROGBITS 0000000000000000 00000c12
000000000000000f 0000000000000000 AX 0 0 1

and in it there are calls to rep_movs_alternative which the linker tries
to resolve, leading to that failure.

And we don't need those damn sections and we can simply do objcopy
--remove-section but then more sh*t happens, see end of mail.
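
Something like this, that is (section names as in the readelf output
above):

	objcopy --remove-section=.altinstructions \
		--remove-section=.altinstr_replacement \
		arch/x86/purgatory/sha256.o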

So I'd really like for us to have a rule (or a couple of rules) which
clarify how other objects should include kernel proper stuff. Because it
is a mess currently. Pretty much every time we try to include some stuff
from kernel proper, those other builds fail.

I've been trying to disentangle the compressed kernel on x86 too, this
way, and we have had some success in defining separate facilities which
can be shared by others:

arch/x86/include/asm/shared/

but it needs a lot of work and policing... :-\

So I'm open to ideas...

Thx.

arch/x86/purgatory/purgatory.ro: in function `sha256_transform_blocks':
/home/boris/kernel/5th/linux/lib/crypto/sha256.c:123:(.text+0x801): undefined reference to `__fentry__'
ld: /home/boris/kernel/5th/linux/lib/crypto/sha256.c:132:(.text+0xfde): undefined reference to `__x86_return_thunk'
ld: arch/x86/purgatory/purgatory.ro: in function `sha256_update':
/home/boris/kernel/5th/linux/lib/crypto/sha256.c:135:(.text+0x1005): undefined reference to `__fentry__'
ld: /home/boris/kernel/5th/linux/lib/crypto/sha256.c:137:(.text+0x1071): undefined reference to `__x86_return_thunk'
ld: arch/x86/purgatory/purgatory.ro: in function `__sha256_final':
/home/boris/kernel/5th/linux/lib/crypto/sha256.c:141:(.text+0x10f1): undefined reference to `__fentry__'
ld: /home/boris/kernel/5th/linux/lib/crypto/sha256.c:144:(.text+0x1204): undefined reference to `__x86_return_thunk'
ld: arch/x86/purgatory/purgatory.ro: in function `sha256_final':
/home/boris/kernel/5th/linux/lib/crypto/sha256.c:147:(.text+0x1295): undefined reference to `__fentry__'
ld: arch/x86/purgatory/purgatory.ro: in function `sha224_final':
/home/boris/kernel/5th/linux/lib/crypto/sha256.c:153:(.text+0x12c5): undefined reference to `__fentry__'
ld: arch/x86/purgatory/purgatory.ro: in function `sha256':
/home/boris/kernel/5th/linux/lib/crypto/sha256.c:159:(.text+0x12f5): undefined reference to `__fentry__'
ld: /home/boris/kernel/5th/linux/lib/crypto/sha256.c:165:(.text+0x13bd): undefined reference to `__x86_return_thunk'
ld: /home/boris/kernel/5th/linux/lib/crypto/sha256.c:165:(.text+0x13f7): undefined reference to `__stack_chk_fail'
make[4]: *** [arch/x86/purgatory/Makefile:92: arch/x86/purgatory/purgatory.chk] Error 1
make[3]: *** [scripts/Makefile.build:480: arch/x86/purgatory] Error 2
make[2]: *** [scripts/Makefile.build:480: arch/x86] Error 2

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-20 13:33:24

by David Howells

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

Linus Torvalds <[email protected]> wrote:

> So I don't think we should use either of these benchmarks as a "we
> need to optimize for *this*", but it is another example of how much
> memcpy() does matter. Even if the end result is then "but different
> microarchitectures react so differently that we can't please
> everybody".

So what, if anything, should I change? Should I make it directly call
__memcpy? Or should we just leave it to the compiler? I would prefer to
leave memcpy_from_iter() and memcpy_to_iter() as __always_inline to eliminate
the function pointer call we otherwise end up with and to eliminate the return
value (which is always 0 in this case).

How about something along the attached lines? (With the inline asm
appropriately pushed out to an arch header file.)

On the other hand, it might be better to have memcpy_to/from_iter() just call
__memcpy() as it doesn't seem to make much difference to the time taken and
the inline func can still return a constant 0 return value that can be
optimised away.

David
---

diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 0ae2e1712e2e..7354982dc433 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -43,7 +43,7 @@ EXPORT_SYMBOL(__memcpy)
SYM_FUNC_ALIAS_MEMFUNC(memcpy, __memcpy)
EXPORT_SYMBOL(memcpy)

-SYM_FUNC_START_LOCAL(memcpy_orig)
+SYM_TYPED_FUNC_START(memcpy_orig)
movq %rdi, %rax

cmpq $0x20, %rdx
@@ -169,4 +169,4 @@ SYM_FUNC_START_LOCAL(memcpy_orig)
.Lend:
RET
SYM_FUNC_END(memcpy_orig)
-
+EXPORT_SYMBOL(memcpy_orig)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index de7d11cf4c63..de73edb9ffcc 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -62,7 +62,17 @@ static __always_inline
size_t memcpy_to_iter(void *iter_to, size_t progress,
size_t len, void *from, void *priv2)
{
- memcpy(iter_to, from + progress, len);
+ size_t len2 = len;
+ from += progress;
+ /*
+ * If CPU has FSRM feature, use 'rep movs'.
+ * Otherwise, call memcpy_orig().
+ */
+ asm volatile(
+ ALTERNATIVE("rep movsb",
+ "call memcpy_orig", ALT_NOT(X86_FEATURE_FSRM))
+ :"+D" (iter_to), "+S" (from), "+d" (len), "+c"(len2), ASM_CALL_CONSTRAINT
+ :: "memory", "rax", "r8", "r9", "r10", "r11");
return 0;
}

@@ -70,7 +80,18 @@ static __always_inline
size_t memcpy_from_iter(void *iter_from, size_t progress,
size_t len, void *to, void *priv2)
{
- memcpy(to + progress, iter_from, len);
+ size_t len2 = len;
+ to += progress;
+ /*
+ * If CPU has FSRM feature, use 'rep movs'.
+ * Otherwise, call memcpy_orig().
+ */
+ asm volatile(
+ ALTERNATIVE("rep movsb",
+ "call memcpy_orig",
+ ALT_NOT(X86_FEATURE_FSRM))
+ :"+D" (to), "+S" (iter_from), "+d" (len), "+c" (len2), ASM_CALL_CONSTRAINT
+ :: "memory", "rax", "r8", "r9", "r10", "r11");
return 0;
}


2023-11-20 16:07:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

On Mon, 20 Nov 2023 at 05:33, David Howells <[email protected]> wrote:
>
> So what, if anything, should I change?

I don't think you need to worry about this.

Not that I like ignoring kernel robot report performance regressions,
because they've often been useful to find unexpected issues. But this
one seems to clearly be just a random code choice issue by the
compiler, and be very CPU-specific anyway.

We'll figure out some good way to make memcpy() a bit more reliable
wrt the code it generates, but it's not the iov_iter code that should
worry about it.

Linus

2023-11-20 16:08:55

by David Laight

[permalink] [raw]
Subject: RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: David Howells
> Sent: 20 November 2023 13:33
>
> Linus Torvalds <[email protected]> wrote:
>
> > So I don't think we should use either of these benchmarks as a "we
> > need to optimize for *this*", but it is another example of how much
> > memcpy() does matter. Even if the end result is then "but different
> > microarchitectrues react so differently that we can't please
> > everybody".
>
> So what, if anything, should I change? Should I make it directly call
> __memcpy? Or should we just leave it to the compiler? I would prefer to
> leave memcpy_from_iter() and memcpy_to_iter() as __always_inline to eliminate
> the function pointer call we otherwise end up with and to eliminate the return
> value (which is always 0 in this case).

I'd have thought you'd just want to call memcpy() (or xxxx_memcpy()).
Anything that matters here is likely to make more difference elsewhere.

I wonder if the kernel ever uses the return value from memcpy().
I suspect it only exists for very historic reasons.

The wrapper:
#define memcpy(d, s, l) ({ \
	void *dd = d; \
	memcpy_void(dd, s, l); \
	dd; \
})
would save all the asm implementations from saving the result.

I did some more measurements over the weekend.
A quick summary - I've not quite finished (and need to find some
more test systems - newer and amd).
I'm now thinking that the 5k clocks is a TLB miss.
In any case it is a feature of my test not the instruction.
I'm also subtracting off a baseline that has 'nop; nop' not 'rep movsb'.

I'm not entirely certain about the fractional clocks!
I'm counting 10 operations and getting pretty consistent counts.
I suspect they are end effects.

These measurements are also for 4k aligned src and dest.

An ivy bridge i7-3xxx seems to do:
	0	41.4 clocks
	1-64	31.5 clocks
	65-128	44.3 clocks
	129-191	55.1 clocks
	192	47.4 clocks
	193-255	58.8 clocks
then an extra 3 clocks every 64 bytes.

Whereas kaby lake i7-7xxx does:
	0	51.5 clocks
	1-64	22.9 clocks
	65-95	25.3 clocks
	96	30.5 clocks
	97-127	34.1 clocks
	128	31.5 clocks
then an extra clock every 32 bytes (if dest aligned).

Note that this system is somewhat slower if the destination
is less than (iirc) 48 bytes before the source (mod 4k).
There are several different slow speeds; the worst is about half
the speed.

I might be able to find a newer system with fsrm.

I was going to measure memcpy_orig() and also see what I can write.
Both those cpus can do a read and write every clock.
So a 64bit copy loop can execute at 8 bytes/clock.
It should be possible to get a 2 clock loop copying 16 bytes.
But that will need a few instructions to setup.
You need to use negative offsets from the end so that only
one register is changed and the 'add' sets Z for the jump.
It can be written in C - but gcc will pessimise it for you.
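
Something like this, for the shape of it (names illustrative;
strict-aliasing caveats aside; assumes a non-zero length that is
a multiple of 16):

#include <stddef.h>
#include <stdint.h>

static void copy_neg16(void *dst, const void *src, size_t len)
{
	unsigned char *d = (unsigned char *)dst + len;
	const unsigned char *s = (const unsigned char *)src + len;
	long off = -(long)len;

	do {
		*(uint64_t *)(d + off) = *(const uint64_t *)(s + off);
		*(uint64_t *)(d + off + 8) = *(const uint64_t *)(s + off + 8);
		off += 16;	/* in asm this 'add' sets ZF for the branch */
	} while (off != 0);
}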

You also need a conditional branch for short copies (< 16 bytes)
that could easily be mispredicted pretty much 50% of the time.
(IIRC no static prediction on recent x86 cpu.)
And probably a separate test for 0.
It is hard to generate a sane clock count for short copies because
the mispredicted branches kill you.
Trouble is any benchmark measurement is likely to train the
branch predictor.
It might actually be hard to reliably beat the ~20 clocks
for 'rep movsb' on kaby lake.

This graph is from the fsrm patch:

Time (cycles) for memmove() sizes 1..31 with neither source nor
destination in cache.

1800 +-+-------+--------+---------+---------+---------+--------+-------+-+
+ + + + + + + +
1600 +-+ 'memmove-fsrm' *******-+
| ###### 'memmove-orig' ####### |
1400 +-+ # ##################### +-+
| # ############ |
1200 +-+# ################## +-+
| # |
1000 +-+# +-+
| # |
| # |
800 +-# +-+
| # |
600 +-*********************** +-+
| ***************************** |
400 +-+ ******* +-+
| |
200 +-+ +-+
+ + + + + + + +
0 +-+-------+--------+---------+---------+---------+--------+-------+-+
0 5 10 15 20 25 30 35

I don't know what that was measured on.
600 clocks seems a lot - could be dominated by loading the cache.
I'd have thought short buffers are actually likely to be in the cache
and/or wanted in it.

There is also the lack of fast short 'rep movsb' on erms-only cpus.

David



-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)