2023-05-04 08:27:24

by Yujie Liu

Subject: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

Hello,

kernel test robot noticed a -10.9% regression of blogbench.read_score on:

commit: adfcf4231b8cbc2d9c1e7bfaa965b907e60639eb ("x86: don't use REP_GOOD or ERMS for user memory copies")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

testcase: blogbench
test machine: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz (Cascade Lake) with 128G memory
parameters:

disk: 1SSD
fs: xfs
cpufreq_governor: performance

test-description: Blogbench is a portable filesystem benchmark that tries to reproduce the load of a real-world busy file server.
test-url: https://www.pureftpd.org/project/blogbench

In addition, the commit also has an impact on the following tests:

+------------------+------------------------------------------------------------------------------------------------------------+
| testcase: change | phoronix-test-suite: phoronix-test-suite.aio-stress.RandomWrite.mb_s 21.4% improvement |
| test machine | 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory |
| test parameters | cpufreq_governor=performance |
| | test=aio-stress-1.1.1 |
+------------------+------------------------------------------------------------------------------------------------------------+
| testcase: change | phoronix-test-suite: phoronix-test-suite.incompact3d.input.i3d129CellsPerDirection.seconds 3.6% regression |
| test machine | 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory |
| test parameters | cpufreq_governor=performance |
| | option_a=input.i3d 129 Cells Per Direction [Intended For Small Systems] |
| | test=incompact3d-2.0.2 |
+------------------+------------------------------------------------------------------------------------------------------------+
| testcase: change | phoronix-test-suite: phoronix-test-suite.ior.2MB.DefaultTestDirectory.mb_s 25.7% improvement |
| test machine | 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory |
| test parameters | cpufreq_governor=performance |
| | test=ior-1.1.1 |
+------------------+------------------------------------------------------------------------------------------------------------+
| testcase: change | phoronix-test-suite: phoronix-test-suite.iozone.1MB.512MB.WritePerformance.mb_s 20.4% improvement |
| test machine | 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory |
| test parameters | cpufreq_governor=performance |
| | option_a=1MB |
| | option_b=512MB |
| | option_c=Write Performance |
| | test=iozone-1.9.5 |
+------------------+------------------------------------------------------------------------------------------------------------+


Since there are both improvements and regressions in various micro
benchmarks, we are sending this report to provide some data and hope
this can be helpful for further investigation. Thanks.


If you fix the issue, kindly add the following tags
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


Details are as below:

=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/rootfs/tbox_group/testcase:
gcc-11/performance/1SSD/xfs/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-csl-2sp9/blogbench

commit:
20f3337d35 ("x86: don't use REP_GOOD or ERMS for small memory clearing")
adfcf4231b ("x86: don't use REP_GOOD or ERMS for user memory copies")

20f3337d350c4e1b adfcf4231b8cbc2d9c1e7bfaa96
---------------- ---------------------------
%stddev %change %stddev
\ | \
2071761 ? 4% -10.9% 1846004 blogbench.read_score
8916 ? 6% -11.0% 7932 ? 8% blogbench.time.minor_page_faults
215.23 ? 4% +12.6% 242.40 ? 2% blogbench.time.user_time
0.87 ? 4% +11.9% 0.98 ? 2% iostat.cpu.user
2.25 ? 10% -0.4 1.85 ? 6% mpstat.cpu.all.soft%
16.03 ? 8% -33.3% 10.69 ? 15% sched_debug.cfs_rq:/.load_avg.min
18.59 +8.4% 20.16 turbostat.RAMWatt
180697 -3.0% 175305 vmstat.io.bo
15025 ? 4% +6.8% 16048 ? 3% proc-vmstat.nr_active_anon
20720 ? 3% +6.0% 21962 ? 2% proc-vmstat.nr_shmem
306941 -2.6% 298909 proc-vmstat.nr_slab_unreclaimable
10016536 -3.3% 9689882 proc-vmstat.nr_written
15025 ? 4% +6.8% 16048 ? 3% proc-vmstat.nr_zone_active_anon
22736515 ? 2% -5.3% 21542328 proc-vmstat.numa_hit
22656328 ? 2% -5.3% 21460298 proc-vmstat.numa_local
31365890 ? 2% -7.0% 29177083 proc-vmstat.pgalloc_normal
18590396 ? 4% -10.5% 16641000 proc-vmstat.pgfree
54574610 -2.9% 52984224 proc-vmstat.pgpgout
15.65 ? 2% +66.7% 26.09 perf-stat.i.MPKI
6.909e+09 -8.9% 6.296e+09 perf-stat.i.branch-instructions
0.41 ? 2% +0.0 0.46 perf-stat.i.branch-miss-rate%
74.77 -14.7 60.06 perf-stat.i.cache-miss-rate%
3.627e+08 ? 3% +37.6% 4.992e+08 ? 2% perf-stat.i.cache-misses
4.842e+08 ? 3% +71.3% 8.297e+08 ? 2% perf-stat.i.cache-references
7.87 -2.9% 7.64 perf-stat.i.cpi
916.29 ? 2% -4.4% 875.61 perf-stat.i.cpu-migrations
673.07 ? 4% -27.6% 487.55 perf-stat.i.cycles-between-cache-misses
8.042e+09 +10.5% 8.884e+09 perf-stat.i.dTLB-loads
0.02 ? 25% -0.0 0.01 ? 11% perf-stat.i.dTLB-store-miss-rate%
2.369e+09 ? 3% +67.1% 3.958e+09 perf-stat.i.dTLB-stores
3.092e+10 +2.8% 3.179e+10 perf-stat.i.instructions
0.13 +3.1% 0.14 perf-stat.i.ipc
0.11 ? 16% -31.5% 0.08 ? 19% perf-stat.i.major-faults
202.45 +12.1% 226.90 perf-stat.i.metric.M/sec
73.58 +1.4 74.93 perf-stat.i.node-load-miss-rate%
83206825 ? 3% -56.1% 36533214 perf-stat.i.node-load-misses
29812010 ? 4% -59.1% 12206523 ? 2% perf-stat.i.node-loads
72.86 +20.2 93.08 perf-stat.i.node-store-miss-rate%
13767301 ? 3% +415.9% 71022712 ? 2% perf-stat.i.node-store-misses
15.66 ? 2% +66.7% 26.10 perf-stat.overall.MPKI
0.41 ? 2% +0.0 0.46 perf-stat.overall.branch-miss-rate%
74.90 -14.7 60.16 perf-stat.overall.cache-miss-rate%
7.87 -2.8% 7.65 perf-stat.overall.cpi
672.29 ? 4% -27.5% 487.55 perf-stat.overall.cycles-between-cache-misses
0.02 ? 25% -0.0 0.01 ? 11% perf-stat.overall.dTLB-store-miss-rate%
0.13 +2.9% 0.13 perf-stat.overall.ipc
73.62 +1.3 74.95 perf-stat.overall.node-load-miss-rate%
72.98 +20.3 93.32 perf-stat.overall.node-store-miss-rate%
6.887e+09 -8.9% 6.277e+09 perf-stat.ps.branch-instructions
3.615e+08 ? 3% +37.6% 4.975e+08 ? 2% perf-stat.ps.cache-misses
4.827e+08 ? 3% +71.3% 8.27e+08 ? 2% perf-stat.ps.cache-references
913.07 ? 2% -4.4% 872.48 perf-stat.ps.cpu-migrations
8.017e+09 +10.5% 8.857e+09 perf-stat.ps.dTLB-loads
2.362e+09 ? 3% +67.1% 3.946e+09 perf-stat.ps.dTLB-stores
3.082e+10 +2.8% 3.169e+10 perf-stat.ps.instructions
0.11 ? 16% -31.7% 0.08 ? 19% perf-stat.ps.major-faults
82933317 ? 3% -56.1% 36413295 perf-stat.ps.node-load-misses
29716685 ? 4% -59.0% 12171311 ? 2% perf-stat.ps.node-loads
13721351 ? 3% +415.9% 70784762 ? 2% perf-stat.ps.node-store-misses
9.266e+12 +3.0% 9.547e+12 perf-stat.total.instructions
50.60 -6.5 44.11 perf-profile.calltrace.cycles-pp.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__open64
50.58 -6.5 44.10 perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__open64
50.69 -6.5 44.22 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__open64
50.68 -6.5 44.20 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__open64
50.78 -6.4 44.33 perf-profile.calltrace.cycles-pp.__open64
41.42 -5.8 35.60 perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64
41.75 -5.8 35.93 perf-profile.calltrace.cycles-pp.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
41.20 -5.8 35.40 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat
36.32 -5.5 30.85 perf-profile.calltrace.cycles-pp.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
35.94 -5.5 30.47 perf-profile.calltrace.cycles-pp._raw_spin_lock.close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
36.11 -5.5 30.64 perf-profile.calltrace.cycles-pp.close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
35.74 -5.4 30.30 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.close_fd.__x64_sys_close.do_syscall_64
37.31 -5.4 31.87 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
37.32 -5.4 31.89 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__close
37.39 -5.4 31.96 perf-profile.calltrace.cycles-pp.__close
5.56 ? 2% -0.7 4.84 perf-profile.calltrace.cycles-pp._raw_spin_lock.put_unused_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64
5.58 ? 2% -0.7 4.86 perf-profile.calltrace.cycles-pp.put_unused_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
5.52 ? 2% -0.7 4.81 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.put_unused_fd.do_sys_openat2.__x64_sys_openat
2.47 ? 6% -0.4 2.07 ? 6% perf-profile.calltrace.cycles-pp.smpboot_thread_fn.kthread.ret_from_fork
2.46 ? 6% -0.4 2.06 ? 6% perf-profile.calltrace.cycles-pp.run_ksoftirqd.smpboot_thread_fn.kthread.ret_from_fork
2.45 ? 6% -0.4 2.06 ? 6% perf-profile.calltrace.cycles-pp.__do_softirq.run_ksoftirqd.smpboot_thread_fn.kthread.ret_from_fork
2.45 ? 6% -0.4 2.06 ? 6% perf-profile.calltrace.cycles-pp.rcu_core.__do_softirq.run_ksoftirqd.smpboot_thread_fn.kthread
2.44 ? 6% -0.4 2.05 ? 6% perf-profile.calltrace.cycles-pp.rcu_do_batch.rcu_core.__do_softirq.run_ksoftirqd.smpboot_thread_fn
1.84 ? 5% -0.3 1.54 ? 5% perf-profile.calltrace.cycles-pp.file_free_rcu.rcu_do_batch.rcu_core.__do_softirq.run_ksoftirqd
0.00 +2.9 2.87 ? 5% perf-profile.calltrace.cycles-pp.__might_resched.__might_fault._copy_to_iter.copy_page_to_iter.filemap_read
0.00 +2.9 2.94 ? 5% perf-profile.calltrace.cycles-pp.__might_fault._copy_to_iter.copy_page_to_iter.filemap_read.xfs_file_buffered_read
4.73 ? 3% +9.0 13.76 ? 4% perf-profile.calltrace.cycles-pp.copyout._copy_to_iter.copy_page_to_iter.filemap_read.xfs_file_buffered_read
5.22 ? 4% +11.7 16.96 ? 4% perf-profile.calltrace.cycles-pp._copy_to_iter.copy_page_to_iter.filemap_read.xfs_file_buffered_read.xfs_file_read_iter
5.27 ? 4% +11.8 17.06 ? 4% perf-profile.calltrace.cycles-pp.copy_page_to_iter.filemap_read.xfs_file_buffered_read.xfs_file_read_iter.vfs_read
6.44 ? 4% +12.2 18.63 ? 4% perf-profile.calltrace.cycles-pp.xfs_file_buffered_read.xfs_file_read_iter.vfs_read.ksys_read.do_syscall_64
6.28 ? 4% +12.2 18.48 ? 4% perf-profile.calltrace.cycles-pp.filemap_read.xfs_file_buffered_read.xfs_file_read_iter.vfs_read.ksys_read
6.60 ? 4% +12.2 18.81 ? 4% perf-profile.calltrace.cycles-pp.xfs_file_read_iter.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
7.16 ? 3% +12.2 19.38 ? 4% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_read
7.20 ? 3% +12.2 19.42 ? 4% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_read
7.22 ? 3% +12.2 19.44 ? 4% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__libc_read
6.87 ? 3% +12.2 19.12 ? 4% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_read
7.32 ? 3% +12.3 19.59 ? 4% perf-profile.calltrace.cycles-pp.__libc_read
0.00 +13.6 13.63 ? 4% perf-profile.calltrace.cycles-pp.copy_user_generic_unrolled.copyout._copy_to_iter.copy_page_to_iter.filemap_read
83.12 -12.0 71.10 perf-profile.children.cycles-pp._raw_spin_lock
82.49 -12.0 70.54 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
50.60 -6.5 44.11 perf-profile.children.cycles-pp.__x64_sys_openat
50.58 -6.5 44.10 perf-profile.children.cycles-pp.do_sys_openat2
50.80 -6.5 44.34 perf-profile.children.cycles-pp.__open64
41.75 -5.8 35.93 perf-profile.children.cycles-pp.alloc_fd
36.32 -5.5 30.85 perf-profile.children.cycles-pp.__x64_sys_close
36.11 -5.5 30.64 perf-profile.children.cycles-pp.close_fd
37.40 -5.4 31.97 perf-profile.children.cycles-pp.__close
5.58 ? 2% -0.7 4.86 perf-profile.children.cycles-pp.put_unused_fd
2.80 ? 5% -0.4 2.40 ? 5% perf-profile.children.cycles-pp.__do_softirq
2.74 ? 6% -0.4 2.35 ? 5% perf-profile.children.cycles-pp.rcu_core
2.47 ? 6% -0.4 2.07 ? 6% perf-profile.children.cycles-pp.smpboot_thread_fn
2.46 ? 6% -0.4 2.06 ? 6% perf-profile.children.cycles-pp.run_ksoftirqd
2.73 ? 6% -0.4 2.34 ? 6% perf-profile.children.cycles-pp.rcu_do_batch
2.06 ? 5% -0.3 1.74 ? 5% perf-profile.children.cycles-pp.file_free_rcu
0.11 ? 28% -0.1 0.04 ? 75% perf-profile.children.cycles-pp.record__finish_output
0.65 -0.0 0.62 perf-profile.children.cycles-pp.hrtimer_interrupt
0.65 -0.0 0.63 perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.20 ? 3% -0.0 0.18 ? 4% perf-profile.children.cycles-pp.down_read
0.12 ? 4% -0.0 0.11 ? 5% perf-profile.children.cycles-pp.xfs_ilock
0.58 -0.0 0.57 perf-profile.children.cycles-pp.__hrtimer_run_queues
0.07 ? 6% +0.0 0.09 ? 4% perf-profile.children.cycles-pp.touch_atime
0.06 ? 9% +0.0 0.08 ? 8% perf-profile.children.cycles-pp.__call_rcu_common
0.06 ? 14% +0.0 0.08 ? 5% perf-profile.children.cycles-pp.get_obj_cgroup_from_current
0.00 +0.1 0.05 perf-profile.children.cycles-pp.walk_component
0.00 +0.1 0.05 perf-profile.children.cycles-pp.atime_needs_update
0.00 +0.1 0.05 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.00 +0.1 0.05 ? 7% perf-profile.children.cycles-pp.__snprintf_chk
0.47 ? 4% +0.1 0.54 ? 4% perf-profile.children.cycles-pp.kmem_cache_alloc
0.00 +0.1 0.10 ? 9% perf-profile.children.cycles-pp.folio_mark_accessed
0.00 +0.1 0.10 ? 4% perf-profile.children.cycles-pp.copy_user_short_string
0.06 ? 9% +0.2 0.23 ? 3% perf-profile.children.cycles-pp.__might_sleep
95.96 +0.3 96.30 perf-profile.children.cycles-pp.do_syscall_64
96.00 +0.3 96.34 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
0.12 ? 4% +2.8 2.96 ? 5% perf-profile.children.cycles-pp.__might_resched
0.08 ? 8% +2.9 2.96 ? 5% perf-profile.children.cycles-pp.__might_fault
4.73 ? 4% +9.0 13.76 ? 4% perf-profile.children.cycles-pp.copyout
5.23 ? 4% +11.7 16.96 ? 4% perf-profile.children.cycles-pp._copy_to_iter
5.27 ? 4% +11.8 17.07 ? 4% perf-profile.children.cycles-pp.copy_page_to_iter
6.44 ? 4% +12.2 18.63 ? 4% perf-profile.children.cycles-pp.xfs_file_buffered_read
6.28 ? 4% +12.2 18.49 ? 4% perf-profile.children.cycles-pp.filemap_read
6.60 ? 4% +12.2 18.81 ? 4% perf-profile.children.cycles-pp.xfs_file_read_iter
7.18 ? 3% +12.2 19.40 ? 4% perf-profile.children.cycles-pp.ksys_read
6.89 ? 3% +12.2 19.14 ? 4% perf-profile.children.cycles-pp.vfs_read
7.36 ? 3% +12.3 19.63 ? 4% perf-profile.children.cycles-pp.__libc_read
0.00 +13.7 13.70 ? 4% perf-profile.children.cycles-pp.copy_user_generic_unrolled
81.72 -11.8 69.91 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
2.05 ? 5% -0.3 1.74 ? 5% perf-profile.self.cycles-pp.file_free_rcu
0.43 ? 4% -0.2 0.26 ? 2% perf-profile.self.cycles-pp._copy_to_iter
0.62 ? 2% -0.1 0.56 ? 4% perf-profile.self.cycles-pp._raw_spin_lock
0.20 ? 6% -0.0 0.17 ? 2% perf-profile.self.cycles-pp.fput
0.10 ? 7% -0.0 0.08 ? 5% perf-profile.self.cycles-pp.alloc_empty_file
0.05 +0.0 0.07 ? 5% perf-profile.self.cycles-pp.kmem_cache_free
0.16 ? 3% +0.0 0.18 ? 4% perf-profile.self.cycles-pp.xfs_file_read_iter
0.09 ? 7% +0.0 0.12 ? 4% perf-profile.self.cycles-pp.kmem_cache_alloc
0.00 +0.1 0.05 perf-profile.self.cycles-pp.syscall_return_via_sysret
0.00 +0.1 0.06 ? 7% perf-profile.self.cycles-pp.folio_mark_accessed
0.02 ? 99% +0.1 0.10 ? 4% perf-profile.self.cycles-pp.copy_page_to_iter
0.00 +0.1 0.10 ? 4% perf-profile.self.cycles-pp.copy_user_short_string
0.05 ? 7% +0.2 0.22 ? 4% perf-profile.self.cycles-pp.__might_sleep
0.34 ? 5% +0.3 0.64 ? 3% perf-profile.self.cycles-pp.filemap_read
0.12 ? 4% +2.6 2.76 ? 5% perf-profile.self.cycles-pp.__might_resched
0.00 +13.5 13.52 ? 4% perf-profile.self.cycles-pp.copy_user_generic_unrolled


***************************************************************************************************
lkp-csl-2sp7: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase:
gcc-11/performance/x86_64-rhel-8.3/debian-x86_64-phoronix/lkp-csl-2sp7/aio-stress-1.1.1/phoronix-test-suite

commit:
20f3337d35 ("x86: don't use REP_GOOD or ERMS for small memory clearing")
adfcf4231b ("x86: don't use REP_GOOD or ERMS for user memory copies")

20f3337d350c4e1b adfcf4231b8cbc2d9c1e7bfaa96
---------------- ---------------------------
%stddev %change %stddev
\ | \
3087 +21.4% 3747 phoronix-test-suite.aio-stress.RandomWrite.mb_s
1415581 ? 3% -3.9% 1360489 proc-vmstat.numa_local
18259 ? 34% -86.3% 2501 ?135% proc-vmstat.numa_pte_updates
434049 -2.1% 425149 proc-vmstat.pgfault
5.43 +2.0 7.44 ? 4% perf-stat.i.cache-miss-rate%
3626266 +66.6% 6042976 ? 6% perf-stat.i.cache-misses
69186100 +3.7% 71773613 perf-stat.i.cache-references
1917 +3.1% 1977 perf-stat.i.context-switches
2.524e+08 +12.4% 2.837e+08 ? 2% perf-stat.i.dTLB-stores
69.45 -1.9 67.54 perf-stat.i.iTLB-load-miss-rate%
906278 -2.9% 879960 perf-stat.i.iTLB-load-misses
387815 +5.4% 408574 perf-stat.i.iTLB-loads
2.204e+09 +3.6% 2.283e+09 ? 2% perf-stat.i.instructions
2397 +5.2% 2522 ? 3% perf-stat.i.instructions-per-iTLB-miss
132455 ? 3% -9.0% 120495 ? 9% perf-stat.i.node-loads
245539 ? 4% +623.2% 1775744 ? 13% perf-stat.i.node-stores
3.81 ?109% -3.8 0.00 perf-profile.calltrace.cycles-pp.vsnprintf.seq_printf.show_interrupts.seq_read_iter.proc_reg_read_iter
4.98 ?126% -3.1 1.88 ?142% perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
4.98 ? 81% -3.1 1.88 ?142% perf-profile.calltrace.cycles-pp.asm_exc_page_fault
4.00 ?110% -2.1 1.88 ?142% perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault
4.00 ?110% -2.1 1.88 ?142% perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
4.00 ?110% -2.1 1.88 ?142% perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
3.81 ?109% -2.1 1.76 ?141% perf-profile.calltrace.cycles-pp.seq_printf.show_interrupts.seq_read_iter.proc_reg_read_iter.vfs_read
6.94 ? 90% -5.1 1.88 ?142% perf-profile.children.cycles-pp.asm_exc_page_fault
5.96 ? 98% -4.1 1.88 ?142% perf-profile.children.cycles-pp.exc_page_fault
3.81 ?109% -3.8 0.00 perf-profile.children.cycles-pp.vsnprintf
4.98 ?126% -3.1 1.88 ?142% perf-profile.children.cycles-pp.do_user_addr_fault
4.98 ?126% -3.1 1.88 ?142% perf-profile.children.cycles-pp.__handle_mm_fault
4.98 ?126% -3.1 1.88 ?142% perf-profile.children.cycles-pp.handle_mm_fault
3.32 ?101% -2.5 0.83 ?223% perf-profile.children.cycles-pp.dput
3.81 ?109% -2.1 1.76 ?141% perf-profile.children.cycles-pp.seq_printf
3.24 ?101% -3.2 0.00 perf-profile.self.cycles-pp.release_pages



***************************************************************************************************
lkp-csl-2sp7: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase:
gcc-11/performance/x86_64-rhel-8.3/input.i3d 129 Cells Per Direction [Intended For Small Systems]/debian-x86_64-phoronix/lkp-csl-2sp7/incompact3d-2.0.2/phoronix-test-suite

commit:
20f3337d35 ("x86: don't use REP_GOOD or ERMS for small memory clearing")
adfcf4231b ("x86: don't use REP_GOOD or ERMS for user memory copies")

20f3337d350c4e1b adfcf4231b8cbc2d9c1e7bfaa96
---------------- ---------------------------
%stddev %change %stddev
\ | \
10.21 +3.6% 10.58 phoronix-test-suite.incompact3d.input.i3d129CellsPerDirection.seconds
2634 +1.5% 2675 phoronix-test-suite.time.percent_of_cpu_this_job_got
125.60 +18.0% 148.24 phoronix-test-suite.time.system_time
1365 +2.2% 1395 phoronix-test-suite.time.user_time
2.45 +0.4 2.80 mpstat.cpu.all.sys%
4716584 +1.3% 4778690 proc-vmstat.numa_hint_faults
591104 +1.3% 598784 proc-vmstat.unevictable_pgs_scanned
1484 ? 2% -3.1% 1438 vmstat.io.bi
6485 -2.7% 6310 vmstat.system.cs
59.03 ? 13% -26.8% 43.20 ? 15% sched_debug.cfs_rq:/.load_avg.avg
203.26 ? 9% -17.4% 167.94 ? 11% sched_debug.cfs_rq:/.load_avg.stddev
41.33 ? 17% -33.5% 27.48 ? 29% sched_debug.cfs_rq:/.removed.load_avg.avg
195.00 ? 9% -17.4% 161.00 ? 14% sched_debug.cfs_rq:/.removed.load_avg.stddev
1256 ? 8% +13.9% 1430 ? 4% sched_debug.cpu.nr_switches.stddev
11.80 ? 76% -8.1 3.68 ?157% perf-profile.calltrace.cycles-pp.bprm_execve.do_execveat_common.__x64_sys_execve.do_syscall_64.entry_SYSCALL_64_after_hwframe
8.70 ? 86% -5.0 3.68 ?157% perf-profile.calltrace.cycles-pp.exec_binprm.bprm_execve.do_execveat_common.__x64_sys_execve.do_syscall_64
8.70 ? 86% -5.0 3.68 ?157% perf-profile.calltrace.cycles-pp.search_binary_handler.exec_binprm.bprm_execve.do_execveat_common.__x64_sys_execve
8.70 ? 86% -5.0 3.68 ?157% perf-profile.calltrace.cycles-pp.load_elf_binary.search_binary_handler.exec_binprm.bprm_execve.do_execveat_common
6.32 ?101% -3.8 2.56 ?223% perf-profile.calltrace.cycles-pp.begin_new_exec.load_elf_binary.search_binary_handler.exec_binprm.bprm_execve
4.08 ?106% -2.8 1.28 ?223% perf-profile.calltrace.cycles-pp.__mmput.exec_mmap.begin_new_exec.load_elf_binary.search_binary_handler
4.08 ?106% -2.8 1.28 ?223% perf-profile.calltrace.cycles-pp.exit_mmap.__mmput.exec_mmap.begin_new_exec.load_elf_binary
3.82 ?109% -2.5 1.28 ?223% perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap
3.82 ?109% -2.5 1.28 ?223% perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap.__mmput
3.88 ? 73% -2.0 1.91 ?141% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
3.88 ? 73% -2.0 1.91 ?141% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
4.08 ?106% -1.5 2.56 ?223% perf-profile.calltrace.cycles-pp.exec_mmap.begin_new_exec.load_elf_binary.search_binary_handler.exec_binprm
11.80 ? 76% -8.1 3.68 ?157% perf-profile.children.cycles-pp.bprm_execve
8.70 ? 86% -5.0 3.68 ?157% perf-profile.children.cycles-pp.exec_binprm
8.70 ? 86% -5.0 3.68 ?157% perf-profile.children.cycles-pp.search_binary_handler
8.70 ? 86% -5.0 3.68 ?157% perf-profile.children.cycles-pp.load_elf_binary
6.32 ?101% -3.8 2.56 ?223% perf-profile.children.cycles-pp.begin_new_exec
3.88 ? 73% -3.0 0.93 ?223% perf-profile.children.cycles-pp.seq_read_iter
3.82 ?109% -2.5 1.28 ?223% perf-profile.children.cycles-pp.zap_pte_range
3.82 ?109% -2.5 1.28 ?223% perf-profile.children.cycles-pp.unmap_vmas
3.82 ?109% -2.5 1.28 ?223% perf-profile.children.cycles-pp.unmap_page_range
3.82 ?109% -2.5 1.28 ?223% perf-profile.children.cycles-pp.zap_pmd_range
3.88 ? 73% -2.0 1.91 ?141% perf-profile.children.cycles-pp.ksys_read
3.88 ? 73% -2.0 1.91 ?141% perf-profile.children.cycles-pp.vfs_read
4.08 ?106% -1.5 2.56 ?223% perf-profile.children.cycles-pp.exec_mmap
4.167e+09 +5.0% 4.374e+09 ? 3% perf-stat.i.branch-instructions
36.75 +1.1 37.81 perf-stat.i.cache-miss-rate%
3.508e+08 +3.8% 3.643e+08 perf-stat.i.cache-misses
6338 -1.7% 6231 perf-stat.i.context-switches
7.944e+10 +1.4% 8.056e+10 perf-stat.i.cpu-cycles
1.236e+10 +2.9% 1.272e+10 perf-stat.i.dTLB-loads
0.09 -0.0 0.09 perf-stat.i.dTLB-store-miss-rate%
4.764e+09 +7.2% 5.109e+09 perf-stat.i.dTLB-stores
57.38 -1.1 56.31 perf-stat.i.iTLB-load-miss-rate%
1477373 +3.2% 1525290 perf-stat.i.iTLB-loads
4.633e+10 +2.8% 4.761e+10 perf-stat.i.instructions
831488 +1.3% 842562 perf-stat.i.metric.GHz
2.307e+08 +4.1% 2.401e+08 perf-stat.i.metric.M/sec
42.72 ? 2% -5.6 37.15 ? 2% perf-stat.i.node-load-miss-rate%
6830797 -47.3% 3600340 perf-stat.i.node-load-misses
20259060 -11.1% 18007910 perf-stat.i.node-loads
5175098 +4.9% 5428994 perf-stat.i.node-store-misses
64485672 +13.0% 72841822 perf-stat.i.node-stores
13.91 -2.4% 13.58 perf-stat.overall.MPKI
0.79 -0.0 0.75 ? 3% perf-stat.overall.branch-miss-rate%
54.45 +1.9 56.35 perf-stat.overall.cache-miss-rate%
226.37 -2.3% 221.12 perf-stat.overall.cycles-between-cache-misses
0.02 ? 2% -0.0 0.02 perf-stat.overall.dTLB-load-miss-rate%
0.07 -0.0 0.07 perf-stat.overall.dTLB-store-miss-rate%
53.87 -1.0 52.88 perf-stat.overall.iTLB-load-miss-rate%
26883 +3.5% 27836 perf-stat.overall.instructions-per-iTLB-miss
25.21 -8.6 16.66 perf-stat.overall.node-load-miss-rate%
7.43 -0.5 6.94 perf-stat.overall.node-store-miss-rate%
4.104e+09 +5.0% 4.309e+09 ? 3% perf-stat.ps.branch-instructions
3.456e+08 +3.8% 3.589e+08 perf-stat.ps.cache-misses
6219 -1.6% 6121 perf-stat.ps.context-switches
7.825e+10 +1.4% 7.936e+10 perf-stat.ps.cpu-cycles
1.218e+10 +2.9% 1.253e+10 perf-stat.ps.dTLB-loads
4.692e+09 +7.3% 5.033e+09 perf-stat.ps.dTLB-stores
1453670 +3.3% 1501333 perf-stat.ps.iTLB-loads
4.564e+10 +2.8% 4.691e+10 perf-stat.ps.instructions
6728891 -47.3% 3546712 perf-stat.ps.node-load-misses
19959395 -11.1% 17743546 perf-stat.ps.node-loads
5098250 +4.9% 5348852 perf-stat.ps.node-store-misses
63535396 +13.0% 71772875 perf-stat.ps.node-stores
2.592e+12 +4.8% 2.717e+12 perf-stat.total.instructions



***************************************************************************************************
lkp-icl-2sp5: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
cpu/gcc-11/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp5/getrandom/stress-ng/60s

commit:
20f3337d35 ("x86: don't use REP_GOOD or ERMS for small memory clearing")
adfcf4231b ("x86: don't use REP_GOOD or ERMS for user memory copies")

20f3337d350c4e1b adfcf4231b8cbc2d9c1e7bfaa96
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
2:3 -67% :3 dmesg.RIP:_copy_to_iter
1:3 -33% :3 dmesg.RIP:chacha_block_generic
:3 33% 1:3 dmesg.RIP:chacha_permute
:3 67% 2:3 dmesg.RIP:copy_user_fast_string
1:3 -33% :3 dmesg.WARNING:CPU:#PID:#at_arch/x86/evperf:interrupt_took_too_long(#>#),lowering_kernel.perf_event_max_sample_rate_to
%stddev %change %stddev
\ | \
7725056 ? 12% -14.5% 6605482 meminfo.DirectMap2M
3025 ? 3% +8.0% 3268 proc-vmstat.nr_page_table_pages
48.73 ? 3% -3.6% 46.99 ? 2% turbostat.RAMWatt
0.40 -4.2% 0.39 perf-stat.i.MPKI
173.78 +1.4% 176.18 perf-stat.i.cpu-migrations
0.00 -0.0 0.00 perf-stat.i.dTLB-store-miss-rate%
169.53 +1.0% 171.29 perf-stat.ps.cpu-migrations
7.99 -8.0 0.00 perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string.copyout._copy_to_iter.get_random_bytes_user.__x64_sys_getrandom
19.23 -0.1 19.09 perf-profile.calltrace.cycles-pp._copy_to_iter.get_random_bytes_user.__x64_sys_getrandom.do_syscall_64.entry_SYSCALL_64_after_hwframe
10.66 -0.1 10.54 perf-profile.calltrace.cycles-pp.copyout._copy_to_iter.get_random_bytes_user.__x64_sys_getrandom.do_syscall_64
75.51 +0.1 75.64 perf-profile.calltrace.cycles-pp.chacha_block_generic.get_random_bytes_user.__x64_sys_getrandom.do_syscall_64.entry_SYSCALL_64_after_hwframe
66.37 +0.2 66.52 perf-profile.calltrace.cycles-pp.chacha_permute.chacha_block_generic.get_random_bytes_user.__x64_sys_getrandom.do_syscall_64
0.00 +7.9 7.87 perf-profile.calltrace.cycles-pp.copy_user_fast_string.copyout._copy_to_iter.get_random_bytes_user.__x64_sys_getrandom
9.17 -9.2 0.00 perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
20.05 -0.1 19.91 perf-profile.children.cycles-pp._copy_to_iter
10.11 -0.1 9.99 perf-profile.children.cycles-pp.copyout
0.09 ? 5% +0.0 0.10 perf-profile.children.cycles-pp.ktime_get_update_offsets_now
0.37 ? 3% +0.0 0.40 perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
77.15 +0.1 77.26 perf-profile.children.cycles-pp.chacha_block_generic
67.64 +0.2 67.79 perf-profile.children.cycles-pp.chacha_permute
0.00 +9.0 9.05 perf-profile.children.cycles-pp.copy_user_fast_string
8.92 -8.9 0.00 perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
6.23 -0.0 6.20 perf-profile.self.cycles-pp._copy_to_iter
0.08 ? 5% +0.0 0.10 ? 4% perf-profile.self.cycles-pp.ktime_get_update_offsets_now
0.31 +0.0 0.33 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
67.15 +0.1 67.29 perf-profile.self.cycles-pp.chacha_permute
0.00 +8.8 8.79 perf-profile.self.cycles-pp.copy_user_fast_string
0.25 ? 78% +117.4% 0.55 ? 2% sched_debug.cfs_rq:/.h_nr_running.avg
3187 ? 35% +183.7% 9043 ? 38% sched_debug.cfs_rq:/.load.avg
1296 ? 29% -40.1% 775.67 ? 22% sched_debug.cfs_rq:/.load_avg.max
205.63 ? 22% -33.8% 136.07 ? 4% sched_debug.cfs_rq:/.load_avg.stddev
1405113 ?137% +195.0% 4144508 sched_debug.cfs_rq:/.min_vruntime.avg
1433314 ?135% +192.5% 4191972 sched_debug.cfs_rq:/.min_vruntime.max
977377 ?137% +194.8% 2880861 sched_debug.cfs_rq:/.min_vruntime.min
0.25 ? 79% +117.7% 0.54 sched_debug.cfs_rq:/.nr_running.avg
853.67 ? 28% -40.0% 512.00 sched_debug.cfs_rq:/.removed.load_avg.max
163.07 ? 30% -36.6% 103.40 ? 23% sched_debug.cfs_rq:/.removed.load_avg.stddev
372.81 ? 48% +71.4% 638.84 sched_debug.cfs_rq:/.runnable_avg.avg
1161 ? 6% +36.9% 1589 ? 9% sched_debug.cfs_rq:/.runnable_avg.max
132.33 ?141% +244.1% 455.33 ? 11% sched_debug.cfs_rq:/.runnable_avg.min
4861 ?272% +1135.9% 60082 ? 20% sched_debug.cfs_rq:/.spread0.avg
32740 ? 92% +230.0% 108041 ? 11% sched_debug.cfs_rq:/.spread0.max
371.90 ? 48% +70.2% 632.85 sched_debug.cfs_rq:/.util_avg.avg
99.71 ?110% +155.3% 254.61 sched_debug.cfs_rq:/.util_est_enqueued.avg
88286 ?137% +215.2% 278294 ? 4% sched_debug.cpu.avg_idle.min
6.10 ? 37% +58.2% 9.65 ? 17% sched_debug.cpu.clock.stddev
1268 ? 99% +147.4% 3137 sched_debug.cpu.curr->pid.avg
4681 ? 15% +22.9% 5754 sched_debug.cpu.curr->pid.max
972.83 ?141% +200.4% 2922 sched_debug.cpu.curr->pid.min
644790 ? 5% -10.0% 580100 sched_debug.cpu.max_idle_balance_cost.max
21765 ? 26% -40.9% 12856 ? 11% sched_debug.cpu.max_idle_balance_cost.stddev
0.00 ? 26% +51.9% 0.00 ? 13% sched_debug.cpu.next_balance.stddev
0.25 ? 80% +123.8% 0.55 ? 2% sched_debug.cpu.nr_running.avg
1.00 +50.0% 1.50 sched_debug.cpu.nr_running.max
232.33 ? 21% +56.1% 362.67 sched_debug.cpu.nr_switches.min



***************************************************************************************************
lkp-csl-2sp7: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase:
gcc-11/performance/x86_64-rhel-8.3/debian-x86_64-phoronix/lkp-csl-2sp7/ior-1.1.1/phoronix-test-suite

commit:
20f3337d35 ("x86: don't use REP_GOOD or ERMS for small memory clearing")
adfcf4231b ("x86: don't use REP_GOOD or ERMS for user memory copies")

20f3337d350c4e1b adfcf4231b8cbc2d9c1e7bfaa96
---------------- ---------------------------
%stddev %change %stddev
\ | \
2762 +24.7% 3444 phoronix-test-suite.ior.2MB./opt/rootfs.mb_s
2737 +25.7% 3441 phoronix-test-suite.ior.2MB.DefaultTestDirectory.mb_s
40826 ? 13% -21.1% 32218 ? 11% sched_debug.cfs_rq:/.min_vruntime.avg
51961 ? 10% -18.8% 42191 ? 7% sched_debug.cfs_rq:/.min_vruntime.max
4451 ? 11% -32.1% 3022 ? 18% sched_debug.cfs_rq:/.min_vruntime.stddev
4452 ? 11% -32.0% 3026 ? 18% sched_debug.cfs_rq:/.spread0.stddev
811283 ? 26% -22.1% 631725 ? 6% sched_debug.cpu.max_idle_balance_cost.max
5.24 ?124% -3.6 1.67 ?223% perf-profile.calltrace.cycles-pp.asm_exc_page_fault
5.24 ?124% -3.6 1.67 ?223% perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault
5.24 ?124% -3.6 1.67 ?223% perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
5.24 ?124% -3.6 1.67 ?223% perf-profile.children.cycles-pp.asm_exc_page_fault
5.24 ?124% -3.6 1.67 ?223% perf-profile.children.cycles-pp.exc_page_fault
5.24 ?124% -3.6 1.67 ?223% perf-profile.children.cycles-pp.do_user_addr_fault
3939447 ? 2% +5.8% 4169072 ? 3% perf-stat.i.cache-misses
1787 -3.2% 1730 perf-stat.i.cycles-between-cache-misses
67.42 -1.9 65.54 perf-stat.i.iTLB-load-miss-rate%
1009539 -3.7% 972226 perf-stat.i.iTLB-load-misses
644541 ? 2% +9.1% 703307 ? 4% perf-stat.i.iTLB-loads
142633 ? 3% +13.0% 161225 ? 8% perf-stat.i.node-store-misses
205884 ? 4% +17.7% 242357 ? 10% perf-stat.i.node-stores


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


Attachments:
config-6.3.0-rc7-00003-gadfcf4231b8c (159.65 kB)
job-script (7.98 kB)
job.yaml (5.68 kB)
reproduce (459.00 B)

2023-05-04 18:04:29

by Linus Torvalds

Subject: Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

On Thu, May 4, 2023 at 1:12 AM kernel test robot <[email protected]> wrote:
>
> Since there are both improvements and regressions in various micro
> benchmarks, we are sending this report to provide some data and hope
> this can be helpful for further investigation. Thanks.

Heh, yes. As that makes clear, the 10% regression on one load was a
25% improvement on another.

That said, while I was expecting these kinds of results, the absolute
percentages are much higher than I expected. I would have expected a
lot of "couple of percent either way".

Even with memory copies being as important as they are, moving the
needle by -10% to +25% is a much bigger swing than expected.

> Details are as below:

One of the issues with the details is that these benchmarks are
running on different machines, so there's a mix of

Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake)
Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake)

in those details, and it might matter.

That said, the big picture is probably independent - both have ERMS
with 'rep movs' doing some cacheline optimizations, and the big
changes in the perf data are in things like 'perf-stat.ps.dTLB-stores'
and 'perf-stat.ps.node-store-misses', which is probably exactly about
that.

And I suspect that the 'blogbench.read_score' thing probably does
mostly big reads, and really really liked the cacheline optimizations
of 'rep movs' even when it was slow in other things.

Let me think about how ugly it would be to re-introduce that case just
for large copies.
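
Just to sketch the shape of that (purely illustrative, not actual kernel
code: the helper name and the 4096-byte cutoff below are made up), the
idea would be a size-gated dispatch, roughly:

/* Illustrative sketch only -- x86-64 with GCC/Clang inline asm.
 * Keep the cheap copy for small sizes and only fall back to
 * 'rep movsb' (ERMS) above some hypothetical threshold.
 */
#include <string.h>
#include <stddef.h>

static void *copy_large_or_small(void *dst, const void *src, size_t len)
{
        if (len >= 4096) {              /* made-up threshold */
                void *d = dst;
                const void *s = src;
                size_t n = len;

                asm volatile("rep movsb"
                             : "+D" (d), "+S" (s), "+c" (n)
                             : : "memory");
                return dst;
        }
        return memcpy(dst, src, len);   /* small copies stay as they are */
}

Whether that ends up acceptable probably depends on where the threshold
has to sit before the 'rep movs' startup cost is amortized.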

Linus

2023-05-04 19:03:13

by Linus Torvalds

Subject: Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

On Thu, May 4, 2023 at 10:39 AM Linus Torvalds
<[email protected]> wrote:
>
> And I suspect that the 'blogbench.read_score' thing probably does
> mostly big reads, and really really liked the cacheline optimizations
> of 'rep movs' even when it was slow in other things.

Oh, and I went to look at the blogbench code, and it's a
*horrifically* bad benchmark for this case.

It may actually be a good benchmark for other things - the "look up
random filenames" etc might be somewhat indicative of real performance
on some loads.

But the read() system call part is very much *not* good.

It looks like the reason blogbench likes 'rep movs' is the traditional
reason why memory copy benchmarks tend to like uncached copies: the
benchmark doesn't actually even look at the data it reads.

It's a common pattern, but it's a bad pattern. It's not very different
from doing CPU benchmarks using an empty loop:

for (int i = 0; i < 1000000; i++) /*nothing*/;

and then being surprised when the compiler optimized the loop away
because it doesn't actually *do* anything.

Of course, unlike the empty loop, the kernel will still do the reads,
but the kernel will do the reads with the assumption that the result
matters.

And for blogbench, that is *doubly* not the case.

Because it looks like blogbench does a loop over

read_dummy_file(file_name);

which will open that file, and then loop over

#define READ_CHUNK_SIZE 65536
...
static char garbage[READ_CHUNK_SIZE];
..
loop:
if ((readen = read(fd, garbage, sizeof garbage))

and that variable is very appropriately named indeed. It is very much garbage.

So it really looks like what is going on is that you have

(a) a threaded application (looks like N=100 for reads by default)

(b) doing 'read()' calls into a *shared* buffer in parallel

(c) and never using the buffer for anything else

which means that a memory copy that does non-temporal writes is going
to look artificially really good, because nobody wants to ever see the
end result.

Now, the "nobody wants to ever see the end result", might actually be
at least _partially_ valid in some historical setting, if you were to
use DMA to then push things out to a network. Back in the olden days,
that was actually a valid reason to use non-temporal writes, because
the DMA would have to invalidate the caches anyway.
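
(To make "non-temporal writes" concrete -- a purely illustrative
userspace sketch, nothing from the kernel: streaming stores that bypass
the cache, assuming 16-byte-aligned buffers and a length that is a
multiple of 16.)

#include <immintrin.h>
#include <stddef.h>

/* Copy with non-temporal (streaming) stores: the destination lines are
 * written without being pulled into the cache, which is exactly why a
 * benchmark that never reads the result makes this look so good.
 */
static void copy_nt(void *dst, const void *src, size_t len)
{
        __m128i *d = dst;
        const __m128i *s = src;
        size_t i;

        for (i = 0; i < len / 16; i++)
                _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        _mm_sfence();           /* order the streaming stores */
}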

So that part is wrong, but isn't necessarily *horrifically* wrong.
It's just bad.

But using one shared destination buffer for N readers is *completely*
bogus. Any cache-lines would bounce around like crazy for the
(pointless) sharing. I'd like to call it "false sharing", but it's not
really false: it's intentionally just using one big shared buffer as a
sink for all these read() calls.

End result: I think the benchmark scores for reading are entirely
pointless random numbers, and trying to optimize the kernel for this
benchmark is exactly the wrong thing to do.

The benchmark doesn't actually do anything half-way valid.

At a *minimum* that benchmark should have different threads using
different read() buffers.

I suspect it might be a good idea to also touch some of the data,
because that would be the normal situation (even with old-style
zero-copy DMA you'd probably not do raw file data and push it out to
the network unmodified).
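
(Again purely as an illustration, not the actual blogbench code: a
reader along these lines gives each thread its own buffer and actually
consumes the bytes; the checksum is just a stand-in for doing real work
with the data.)

#include <fcntl.h>
#include <unistd.h>

#define READ_CHUNK_SIZE 65536

static unsigned long read_and_touch(const char *file_name)
{
        char buf[READ_CHUNK_SIZE];      /* per-thread: on this thread's own stack */
        unsigned long sum = 0;
        ssize_t got;
        int fd;

        if ((fd = open(file_name, O_RDONLY)) == -1)
                return 0;
        while ((got = read(fd, buf, sizeof buf)) > 0) {
                ssize_t i;

                for (i = 0; i < got; i++)
                        sum += (unsigned char) buf[i];  /* actually look at the data */
        }
        close(fd);
        return sum;
}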

Because otherwise you will always find that objectively bad memory
copies will do better than a good memory copy that makes sense.

Anyway, that was a very long-winded way of saying that I will not use
that benchmark as a reason to touch the kernel "copy_to_user()"
implementation.

Of course, that's not to say that this might not be a real regression
on real loads, but I'd want to see those other numbers.

It looks like this is an old benchmark that hasn't been touched in
years (just going by the github activity), but I'm cc'ing the author
anyway, and will add a pointer to this email as a github issue. Maybe
the developer cares, maybe he doesn't, but no reason not to at least
mention this and see if maybe the benchmark can be improved to at
least use per-thread buffers for the reads.

(The same issues are true for the writing side too, of course, but the
writing side has less parallelism, and a shared *source* is not
problematic in the same way from a cache pattern standpoint, so it probably
isn't going to have anything like the same effect from any memcpy
implementation).

Linus

2023-05-09 12:35:59

by Feng Tang

[permalink] [raw]
Subject: Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

Hi Linus,

On Thu, May 04, 2023 at 11:56:12AM -0700, Linus Torvalds wrote:
> On Thu, May 4, 2023 at 10:39 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > And I suspect that the 'blogbench.read_score' thing probably does
> > mostly big reads, and really really liked the cacheline optimizations
> > of 'rep movs' even when it was slow in other things.
>
> Oh, and I went to look at the blogbench code, and it's a
> *horrifically* bad benchmark for this case.
>
> It may actually be a good benchmark for other things - the "look up
> random filenames" etc might be somewhat indicative of real performance
> on some loads.
>
> But the read() system call part is very much *not* good.
>
> It looks like the reason blogbench likes 'rep movs' is the traditional
> reason why memory copy benchmarks tend to like uncached copies: it
> looks like the benchmark doesn't actually even look at the data it
> reads.
>
> It's a common pattern, but it's a bad pattern. It's not very different
> from doing CPU benchmarks using an empty loop:
>
> for (int i = 0; i < 1000000; i++) /*nothing*/;
>
> and then being surprised when the compiler optimized the loop away
> because it doesn't actually *do* anything.
>
> Of course, unlike the empty loop, the kernel will still do the reads,
> but the kernel will do the reads with the assumption that the result
> matters.
>
> And for blogbench, that is *doubly* not the case.
>
> Because it looks like blogbench does a loop over
>
> read_dummy_file(file_name);
>
> which will open that file, and then loop over
>
> #define READ_CHUNK_SIZE 65536
> ...
> static char garbage[READ_CHUNK_SIZE];
> ..
> loop:
> if ((readen = read(fd, garbage, sizeof garbage))
>
> and that variable is very appropriately named indeed. It is very much garbage.
>
> So it really looks like what is going on is that you have
>
> (a) a threaded application (looks like N=100 for reads by default)
>
> (b) doing 'read()' calls into a *shared* buffer in parallel
>
> (c) and never using the buffer for anything else
>
> which means that a memory copy that does non-temporal writes is going
> to look artificially really good, because nobody wants to ever see the
> end result.
>
> Now, the "nobody wants to ever see the end result", might actually be
> at least _partially_ valid in some historical setting, if you were to
> use DMA to then push things out to a network. Back in the olden days,
> that was actually a valid reason to use non-temporal writes, because
> the DMA would have to invalidate the caches anyway.
>
> So that part is wrong, but isn't necessarily *horrifically* wrong.
> It's just bad.
>
> But using one shared destination buffer for N readers is *completely*
> bogus. Any cache-lines would bounce around like crazy for the
> (pointless) sharing. I'd like to call it "false sharing", but it's not
> really false: it's intentionally just using one big shared buffer as a
> sink for all these read() calls.
>
> End result: I think the benchmark scores for reading are entirely
> pointless random numbers, and trying to optimize the kernel for this
> benchmark is exactly the wrong thing to do.
>
> The benchmark doesn't actually do anything half-way valid.
>
> At a *minimum* that benchmark should have different threads using
> different read() buffers.

We tried a debug patch which allocates a dedicated buffer for
each reader thread, ran it on the same Cascade Lake platform, and
the regression is _gone_, after the noise of cache false sharing is
reduced.

20f3337d350c4e1b adfcf4231b8cbc2d9c1e7bfaa96
---------------- ---------------------------
2119572 ± 2% +1.7% 2155704 ± 2% blogbench.read_score

3700996 ± 9% -7.4% 3427562 ± 16% perf-stat.ps.dTLB-load-misses
8.171e+09 +25.3% 1.024e+10 perf-stat.ps.dTLB-loads
431224 ± 9% -13.1% 374903 ± 12% perf-stat.ps.dTLB-store-misses
2.399e+09 +87.5% 4.497e+09 perf-stat.ps.dTLB-stores
8495507 ± 7% -8.2% 7794738 perf-stat.ps.iTLB-load-misses

3.07 -3.1 0.00 perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
0.00 +3.4 3.40 perf-profile.self.cycles-pp.copy_user_generic_unrolled

We also ran the original blogbench binary on an Ice Lake platform
(which has both ERMS and FSRM), and there is almost no performance
change between the two commits.

Please let us know if you need more profiling info or want us to run
more tests.

Thanks,
Feng

---
The debug patch for blogbench is:

diff --git a/src/blogbench.h b/src/blogbench.h
index bf538a8..d99d5bb 100644
--- a/src/blogbench.h
+++ b/src/blogbench.h
@@ -178,6 +178,8 @@ void *rewriter(void * const fodder);
void *reader(void * const fodder);
void *commenter(void * const fodder);

+extern pthread_key_t read_buf_key;
+
#include "globals.h"

#endif
diff --git a/src/helpers.c b/src/helpers.c
index 4c17e73..85631ec 100644
--- a/src/helpers.c
+++ b/src/helpers.c
@@ -107,15 +107,18 @@ int create_atomic_file(const char * const file_name,

int read_dummy_file(const char * const file_name)
{
- static char garbage[READ_CHUNK_SIZE];
ssize_t readen;
+ void * read_buf;
int fd;

if ((fd = open(file_name, O_RDONLY)) == -1) {
return -1;
}
+
+ read_buf = pthread_getspecific(read_buf_key);
+
do {
- if ((readen = read(fd, garbage, sizeof garbage))
+ if ((readen = read(fd, read_buf, READ_CHUNK_SIZE))
< (ssize_t) 0) {
if (errno == EINTR) {
continue;
diff --git a/src/process.c b/src/process.c
index a83a980..04f1411 100644
--- a/src/process.c
+++ b/src/process.c
@@ -58,9 +58,13 @@ static int wait_rewriters(void)
return 0;
}

+pthread_key_t read_buf_key = 0;
+
static int spawn_readers(void)
{
unsigned int t = nb_readers;
+
+ pthread_key_create(&read_buf_key, NULL);

do {
t--;
diff --git a/src/reader.c b/src/reader.c
index 91bab7a..2c0efdb 100644
--- a/src/reader.c
+++ b/src/reader.c
@@ -66,6 +66,11 @@ static int read_random_blog(void)
void *reader(void * const fodder)
{
(void) fodder;
+
+ void *read_buf;
+
+ read_buf = malloc(READ_CHUNK_SIZE);
+ pthread_setspecific(read_buf_key, read_buf);

do {
if (read_random_blog() != 0) {
@@ -75,6 +80,8 @@ void *reader(void * const fodder)
usleep(USLEEP_READER);
#endif
} while (stop == (sig_atomic_t) 0);
+
+ free(read_buf);

return NULL;
}

> I suspect it might be a good idea to also touch some of the data,
> because that would be the normal situation (even with old-style
> zero-copy DMA you'd probably not do raw file data and push it out to
> the network unmodified).
>
> Because otherwise you will always find that objectively bad memory
> copies will do better than a good memory copy that makes sense.
>
> Anyway, that was a very long-winded way of saying that I will not use
> that benchmark as a reason to touch the kernel "copy_to_user()"
> implementation.
>
> Of course, that's not to say that this might not be a real regression
> on real loads, but I'd want to see those other numbers.
>
> It looks like this is an old benchmark that hasn't been touched in
> years (just going by the github activity), but I'm cc'ing the author
> anyway, and will add a pointer to this email as a github issue. Maybe
> the developer cares, maybe he doesn't, but no reason not to at least
> mention this and see if maybe the benchmark can be improved to at
> least use a thread buffers for the reads.
>
> (The same issues are true for the writing side too, of course, but the
> writing side has less parallelism and a shared *source* is not the
> same kind of problematic from a cache pattern standpoint, so probably
> isn't going to have anything like the same effect from any memcpy
> implementation).
>
> Linus

2023-05-09 17:33:37

by Linus Torvalds

Subject: Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

On Tue, May 9, 2023 at 5:30 AM Feng Tang <[email protected]> wrote:
>
> > At a *minimum* that benchmark should have different threads using
> > different read() buffers.
>
> We tried a debug patch which allocates a dedicated buffer for
> each reader thread, run it on the same Cacade Lake platform, and
> the regression is _gone_, after the noise of cache false sharing is
> reduced.

Thanks for verifying.

I was pretty sure that was the issue, but it's always good to
double-check that there wasn't something else going on.

Linus