LinuxLists.cc - [linus:master] [x86] adfcf4231b: blogbench.read

2023-05-04 08:27:24

Subject: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

Hello,

kernel test robot
commit: adfcf4231b8cbc2d9c In addition to that,
+------------------+------ | testcase: change | test machine | test parameters | +------------------+------ | testcase: change | test machine | test parameters | | +------------------+------ | testcase: change | test machine | test parameters | | test=ior-1.1.1 +------------------+------ | testcase: change | test machine | test parameters | | option_a=1MB | | option_b=512MB | | +------------------+------

Since there are both benchmarks, we are this can be helpful

If you fix the issue, | Reported-by: kernel | Link: 1e7bfaa965b907e60639eb ("x86: don't use REP_GOOD or ERMS for user memory copies")
el.org/cgit/linux/kernel/git/torvalds/linux.git">https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz (Cascade Lake) with 128G memory
Blogbench is a portable filesystem benchmark that tries to reproduce the load of a real-world busy file server.
//www.pureftpd.org/project/blogbench">https://www.pureftpd.org/project/blogbench
the commit also has impact on the following tests:
------------------------------------------------------------------------------------------------------+
| phoronix-test-suite: phoronix-test-suite.aio-stress.RandomWrite.mb_s 21.4% improvement |
| 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory |
| cpufreq_governor=performance |
| test=aio-stress-1.1.1 |
------------------------------------------------------------------------------------------------------+
| phoronix-test-suite: phoronix-test-suite.incompact3d.input.i3d129CellsPerDirection.seconds 3.6% regression |
| 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory |
| cpufreq_governor=performance |
| option_a=input.i3d 129 Cells Per Direction [Intended For Small Systems] |
| test=incompact3d-2.0.2 |
------------------------------------------------------------------------------------------------------+
| phoronix-test-suite: phoronix-test-suite.ior.2MB.DefaultTestDirectory.mb_s 25.7% improvement |
| 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory |
| cpufreq_governor=performance |
|
------------------------------------------------------------------------------------------------------+
| phoronix-test-suite: phoronix-test-suite.iozone.1MB.512MB.WritePerformance.mb_s 20.4% improvement |
| 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory |
| cpufreq_governor=performance |
|
|
| option_c=Write Performance |
| test=iozone-1.9.5 |
------------------------------------------------------------------------------------------------------+
improvements and regressions in various micro
sending this report to provide some data and hope
for further investigation. Thanks.
kindly add following tag
test robot <[email protected]>
lore.kernel.org/oe-lkp/202305041446.71d46724-yujie.liu@intel.com">https://lore.kernel.org/oe-lkp/[email protected]
===============================================================
disk/fs/kconfig/rootfs/tbox_group/testcase:
xfs/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-csl-2sp9/blogbench
don't use REP_GOOD or ERMS for small memory clearing")
don't use REP_GOOD or ERMS for user memory copies")
adfcf4231b8cbc2d9c1e7bfaa96
---------------------------
%stddev
\
-10.9% 1846004 blogbench.read_score
7932 ? 8% blogbench.time.minor_page_faults
+12.6% 242.40 ? 2% blogbench.time.user_time
0.98 ? 2% iostat.cpu.user
1.85 ? 6% mpstat.cpu.all.soft%
10.69 ? 15% sched_debug.cfs_rq:/.load_avg.min
+8.4% 20.16 turbostat.RAMWatt
-3.0% 175305 vmstat.io.bo
+6.8% 16048 ? 3% proc-vmstat.nr_active_anon
+6.0% 21962 ? 2% proc-vmstat.nr_shmem
-2.6% 298909 proc-vmstat.nr_slab_unreclaimable
-3.3% 9689882 proc-vmstat.nr_written
+6.8% 16048 ? 3% proc-vmstat.nr_zone_active_anon
-5.3% 21542328 proc-vmstat.numa_hit
-5.3% 21460298 proc-vmstat.numa_local
-7.0% 29177083 proc-vmstat.pgalloc_normal
-10.5% 16641000 proc-vmstat.pgfree
-2.9% 52984224 proc-vmstat.pgpgout
26.09 perf-stat.i.MPKI
-8.9% 6.296e+09 perf-stat.i.branch-instructions
0.46 perf-stat.i.branch-miss-rate%
60.06 perf-stat.i.cache-miss-rate%
+37.6% 4.992e+08 ? 2% perf-stat.i.cache-misses
+71.3% 8.297e+08 ? 2% perf-stat.i.cache-references
7.64 perf-stat.i.cpi
-4.4% 875.61 perf-stat.i.cpu-migrations
-27.6% 487.55 perf-stat.i.cycles-between-cache-misses
+10.5% 8.884e+09 perf-stat.i.dTLB-loads
0.01 ? 11% perf-stat.i.dTLB-store-miss-rate%
+67.1% 3.958e+09 perf-stat.i.dTLB-stores
+2.8% 3.179e+10 perf-stat.i.instructions
0.14 perf-stat.i.ipc
0.08 ? 19% perf-stat.i.major-faults
+12.1% 226.90 perf-stat.i.metric.M/sec
74.93 perf-stat.i.node-load-miss-rate%
-56.1% 36533214 perf-stat.i.node-load-misses
-59.1% 12206523 ? 2% perf-stat.i.node-loads
93.08 perf-stat.i.node-store-miss-rate%
+415.9% 71022712 ? 2% perf-stat.i.node-store-misses
26.10 perf-stat.overall.MPKI
0.46 perf-stat.overall.branch-miss-rate%
60.16 perf-stat.overall.cache-miss-rate%
7.65 perf-stat.overall.cpi
-27.5% 487.55 perf-stat.overall.cycles-between-cache-misses
0.01 ? 11% perf-stat.overall.dTLB-store-miss-rate%
0.13 perf-stat.overall.ipc
74.95 perf-stat.overall.node-load-miss-rate%
93.32 perf-stat.overall.node-store-miss-rate%
-8.9% 6.277e+09 perf-stat.ps.branch-instructions
+37.6% 4.975e+08 ? 2% perf-stat.ps.cache-misses
+71.3% 8.27e+08 ? 2% perf-stat.ps.cache-references
-4.4% 872.48 perf-stat.ps.cpu-migrations
+10.5% 8.857e+09 perf-stat.ps.dTLB-loads
+67.1% 3.946e+09 perf-stat.ps.dTLB-stores
+2.8% 3.169e+10 perf-stat.ps.instructions
0.08 ? 19% perf-stat.ps.major-faults
-56.1% 36413295 perf-stat.ps.node-load-misses
-59.0% 12171311 ? 2% perf-stat.ps.node-loads
+415.9% 70784762 ? 2% perf-stat.ps.node-store-misses
+3.0% 9.547e+12 perf-stat.total.instructions
44.11 perf-profile.calltrace.cycles-pp.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__open64
44.10 perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__open64
44.22 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__open64
44.20 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__open64
44.33 perf-profile.calltrace.cycles-pp.__open64
35.60 perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64
35.93 perf-profile.calltrace.cycles-pp.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
35.40 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat
30.85 perf-profile.calltrace.cycles-pp.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
30.47 perf-profile.calltrace.cycles-pp._raw_spin_lock.close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
30.64 perf-profile.calltrace.cycles-pp.close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
30.30 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.close_fd.__x64_sys_close.do_syscall_64
31.87 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
31.89 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__close
31.96 perf-profile.calltrace.cycles-pp.__close
4.84 perf-profile.calltrace.cycles-pp._raw_spin_lock.put_unused_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64
4.86 perf-profile.calltrace.cycles-pp.put_unused_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
4.81 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.put_unused_fd.do_sys_openat2.__x64_sys_openat
2.07 ? 6% perf-profile.calltrace.cycles-pp.smpboot_thread_fn.kthread.ret_from_fork
2.06 ? 6% perf-profile.calltrace.cycles-pp.run_ksoftirqd.smpboot_thread_fn.kthread.ret_from_fork
2.06 ? 6% perf-profile.calltrace.cycles-pp.__do_softirq.run_ksoftirqd.smpboot_thread_fn.kthread.ret_from_fork
2.06 ? 6% perf-profile.calltrace.cycles-pp.rcu_core.__do_softirq.run_ksoftirqd.smpboot_thread_fn.kthread
2.05 ? 6% perf-profile.calltrace.cycles-pp.rcu_do_batch.rcu_core.__do_softirq.run_ksoftirqd.smpboot_thread_fn
1.54 ? 5% perf-profile.calltrace.cycles-pp.file_free_rcu.rcu_do_batch.rcu_core.__do_softirq.run_ksoftirqd
2.87 ? 5% perf-profile.calltrace.cycles-pp.__might_resched.__might_fault._copy_to_iter.copy_page_to_iter.filemap_read
2.94 ? 5% perf-profile.calltrace.cycles-pp.__might_fault._copy_to_iter.copy_page_to_iter.filemap_read.xfs_file_buffered_read
13.76 ? 4% perf-profile.calltrace.cycles-pp.copyout._copy_to_iter.copy_page_to_iter.filemap_read.xfs_file_buffered_read
16.96 ? 4% perf-profile.calltrace.cycles-pp._copy_to_iter.copy_page_to_iter.filemap_read.xfs_file_buffered_read.xfs_file_read_iter
17.06 ? 4% perf-profile.calltrace.cycles-pp.copy_page_to_iter.filemap_read.xfs_file_buffered_read.xfs_file_read_iter.vfs_read
18.63 ? 4% perf-profile.calltrace.cycles-pp.xfs_file_buffered_read.xfs_file_read_iter.vfs_read.ksys_read.do_syscall_64
18.48 ? 4% perf-profile.calltrace.cycles-pp.filemap_read.xfs_file_buffered_read.xfs_file_read_iter.vfs_read.ksys_read
18.81 ? 4% perf-profile.calltrace.cycles-pp.xfs_file_read_iter.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
19.38 ? 4% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_read
19.42 ? 4% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_read
19.44 ? 4% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__libc_read
19.12 ? 4% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_read
19.59 ? 4% perf-profile.calltrace.cycles-pp.__libc_read
13.63 ? 4% perf-profile.calltrace.cycles-pp.copy_user_generic_unrolled.copyout._copy_to_iter.copy_page_to_iter.filemap_read
71.10 perf-profile.children.cycles-pp._raw_spin_lock
70.54 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
44.11 perf-profile.children.cycles-pp.__x64_sys_openat
44.10 perf-profile.children.cycles-pp.do_sys_openat2
44.34 perf-profile.children.cycles-pp.__open64
35.93 perf-profile.children.cycles-pp.alloc_fd
30.85 perf-profile.children.cycles-pp.__x64_sys_close
30.64 perf-profile.children.cycles-pp.close_fd
31.97 perf-profile.children.cycles-pp.__close
4.86 perf-profile.children.cycles-pp.put_unused_fd
2.40 ? 5% perf-profile.children.cycles-pp.__do_softirq
2.35 ? 5% perf-profile.children.cycles-pp.rcu_core
2.07 ? 6% perf-profile.children.cycles-pp.smpboot_thread_fn
2.06 ? 6% perf-profile.children.cycles-pp.run_ksoftirqd
2.34 ? 6% perf-profile.children.cycles-pp.rcu_do_batch
1.74 ? 5% perf-profile.children.cycles-pp.file_free_rcu
0.04 ? 75% perf-profile.children.cycles-pp.record__finish_output
0.62 perf-profile.children.cycles-pp.hrtimer_interrupt
0.63 perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.18 ? 4% perf-profile.children.cycles-pp.down_read
0.11 ? 5% perf-profile.children.cycles-pp.xfs_ilock
0.57 perf-profile.children.cycles-pp.__hrtimer_run_queues
0.09 ? 4% perf-profile.children.cycles-pp.touch_atime
0.08 ? 8% perf-profile.children.cycles-pp.__call_rcu_common
0.08 ? 5% perf-profile.children.cycles-pp.get_obj_cgroup_from_current
0.05 perf-profile.children.cycles-pp.walk_component
0.05 perf-profile.children.cycles-pp.atime_needs_update
0.05 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.05 ? 7% perf-profile.children.cycles-pp.__snprintf_chk
0.54 ? 4% perf-profile.children.cycles-pp.kmem_cache_alloc
0.10 ? 9% perf-profile.children.cycles-pp.folio_mark_accessed
0.10 ? 4% perf-profile.children.cycles-pp.copy_user_short_string
0.23 ? 3% perf-profile.children.cycles-pp.__might_sleep
96.30 perf-profile.children.cycles-pp.do_syscall_64
96.34 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
2.96 ? 5% perf-profile.children.cycles-pp.__might_resched
2.96 ? 5% perf-profile.children.cycles-pp.__might_fault
13.76 ? 4% perf-profile.children.cycles-pp.copyout
16.96 ? 4% perf-profile.children.cycles-pp._copy_to_iter
17.07 ? 4% perf-profile.children.cycles-pp.copy_page_to_iter
18.63 ? 4% perf-profile.children.cycles-pp.xfs_file_buffered_read
18.49 ? 4% perf-profile.children.cycles-pp.filemap_read
18.81 ? 4% perf-profile.children.cycles-pp.xfs_file_read_iter
19.40 ? 4% perf-profile.children.cycles-pp.ksys_read
19.14 ? 4% perf-profile.children.cycles-pp.vfs_read
19.63 ? 4% perf-profile.children.cycles-pp.__libc_read
13.70 ? 4% perf-profile.children.cycles-pp.copy_user_generic_unrolled
69.91 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
1.74 ? 5% perf-profile.self.cycles-pp.file_free_rcu
0.26 ? 2% perf-profile.self.cycles-pp._copy_to_iter
0.56 ? 4% perf-profile.self.cycles-pp._raw_spin_lock
0.17 ? 2% perf-profile.self.cycles-pp.fput
0.08 ? 5% perf-profile.self.cycles-pp.alloc_empty_file
0.07 ? 5% perf-profile.self.cycles-pp.kmem_cache_free
0.18 ? 4% perf-profile.self.cycles-pp.xfs_file_read_iter
0.12 ? 4% perf-profile.self.cycles-pp.kmem_cache_alloc
0.05 perf-profile.self.cycles-pp.syscall_return_via_sysret
0.06 ? 7% perf-profile.self.cycles-pp.folio_mark_accessed
0.10 ? 4% perf-profile.self.cycles-pp.copy_page_to_iter
0.10 ? 4% perf-profile.self.cycles-pp.copy_user_short_string
0.22 ? 4% perf-profile.self.cycles-pp.__might_sleep
0.64 ? 3% perf-profile.self.cycles-pp.filemap_read
2.76 ? 5% perf-profile.self.cycles-pp.__might_resched
13.52 ? 4% perf-profile.self.cycles-pp.copy_user_generic_unrolled
*************************************************************************
threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
===============================================================
kconfig/rootfs/tbox_group/test/testcase:
4-rhel-8.3/debian-x86_64-phoronix/lkp-csl-2sp7/aio-stress-1.1.1/phoronix-test-suite
don't use REP_GOOD or ERMS for small memory clearing")
don't use REP_GOOD or ERMS for user memory copies")
adfcf4231b8cbc2d9c1e7bfaa96
---------------------------
%stddev
\
3747 phoronix-test-suite.aio-stress.RandomWrite.mb_s
-3.9% 1360489 proc-vmstat.numa_local
2501 ?135% proc-vmstat.numa_pte_updates
-2.1% 425149 proc-vmstat.pgfault
7.44 ? 4% perf-stat.i.cache-miss-rate%
+66.6% 6042976 ? 6% perf-stat.i.cache-misses
+3.7% 71773613 perf-stat.i.cache-references
1977 perf-stat.i.context-switches
+12.4% 2.837e+08 ? 2% perf-stat.i.dTLB-stores
67.54 perf-stat.i.iTLB-load-miss-rate%
-2.9% 879960 perf-stat.i.iTLB-load-misses
+5.4% 408574 perf-stat.i.iTLB-loads
+3.6% 2.283e+09 ? 2% perf-stat.i.instructions
2522 ? 3% perf-stat.i.instructions-per-iTLB-miss
-9.0% 120495 ? 9% perf-stat.i.node-loads
+623.2% 1775744 ? 13% perf-stat.i.node-stores
0.00 perf-profile.calltrace.cycles-pp.vsnprintf.seq_printf.show_interrupts.seq_read_iter.proc_reg_read_iter
1.88 ?142% perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
1.88 ?142% perf-profile.calltrace.cycles-pp.asm_exc_page_fault
1.88 ?142% perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault
1.88 ?142% perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
1.88 ?142% perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
1.76 ?141% perf-profile.calltrace.cycles-pp.seq_printf.show_interrupts.seq_read_iter.proc_reg_read_iter.vfs_read
1.88 ?142% perf-profile.children.cycles-pp.asm_exc_page_fault
1.88 ?142% perf-profile.children.cycles-pp.exc_page_fault
0.00 perf-profile.children.cycles-pp.vsnprintf
1.88 ?142% perf-profile.children.cycles-pp.do_user_addr_fault
1.88 ?142% perf-profile.children.cycles-pp.__handle_mm_fault
1.88 ?142% perf-profile.children.cycles-pp.handle_mm_fault
0.83 ?223% perf-profile.children.cycles-pp.dput
1.76 ?141% perf-profile.children.cycles-pp.seq_printf
0.00 perf-profile.self.cycles-pp.release_pages
*************************************************************************
threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
===============================================================
kconfig/option_a/rootfs/tbox_group/test/testcase:
4-rhel-8.3/input.i3d 129 Cells Per Direction [Intended For Small Systems]/debian-x86_64-phoronix/lkp-csl-2sp7/incompact3d-2.0.2/phoronix-test-suite
don't use REP_GOOD or ERMS for small memory clearing")
don't use REP_GOOD or ERMS for user memory copies")
adfcf4231b8cbc2d9c1e7bfaa96
---------------------------
%stddev
\
+3.6% 10.58 phoronix-test-suite.incompact3d.input.i3d129CellsPerDirection.seconds
2675 phoronix-test-suite.time.percent_of_cpu_this_job_got
+18.0% 148.24 phoronix-test-suite.time.system_time
1395 phoronix-test-suite.time.user_time
2.80 mpstat.cpu.all.sys%
+1.3% 4778690 proc-vmstat.numa_hint_faults
+1.3% 598784 proc-vmstat.unevictable_pgs_scanned
1438 vmstat.io.bi
6310 vmstat.system.cs
43.20 ? 15% sched_debug.cfs_rq:/.load_avg.avg
-17.4% 167.94 ? 11% sched_debug.cfs_rq:/.load_avg.stddev
27.48 ? 29% sched_debug.cfs_rq:/.removed.load_avg.avg
-17.4% 161.00 ? 14% sched_debug.cfs_rq:/.removed.load_avg.stddev
1430 ? 4% sched_debug.cpu.nr_switches.stddev
3.68 ?157% perf-profile.calltrace.cycles-pp.bprm_execve.do_execveat_common.__x64_sys_execve.do_syscall_64.entry_SYSCALL_64_after_hwframe
3.68 ?157% perf-profile.calltrace.cycles-pp.exec_binprm.bprm_execve.do_execveat_common.__x64_sys_execve.do_syscall_64
3.68 ?157% perf-profile.calltrace.cycles-pp.search_binary_handler.exec_binprm.bprm_execve.do_execveat_common.__x64_sys_execve
3.68 ?157% perf-profile.calltrace.cycles-pp.load_elf_binary.search_binary_handler.exec_binprm.bprm_execve.do_execveat_common
2.56 ?223% perf-profile.calltrace.cycles-pp.begin_new_exec.load_elf_binary.search_binary_handler.exec_binprm.bprm_execve
1.28 ?223% perf-profile.calltrace.cycles-pp.__mmput.exec_mmap.begin_new_exec.load_elf_binary.search_binary_handler
1.28 ?223% perf-profile.calltrace.cycles-pp.exit_mmap.__mmput.exec_mmap.begin_new_exec.load_elf_binary
1.28 ?223% perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap
1.28 ?223% perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap.__mmput
1.91 ?141% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
1.91 ?141% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
2.56 ?223% perf-profile.calltrace.cycles-pp.exec_mmap.begin_new_exec.load_elf_binary.search_binary_handler.exec_binprm
3.68 ?157% perf-profile.children.cycles-pp.bprm_execve
3.68 ?157% perf-profile.children.cycles-pp.exec_binprm
3.68 ?157% perf-profile.children.cycles-pp.search_binary_handler
3.68 ?157% perf-profile.children.cycles-pp.load_elf_binary
2.56 ?223% perf-profile.children.cycles-pp.begin_new_exec
0.93 ?223% perf-profile.children.cycles-pp.seq_read_iter
1.28 ?223% perf-profile.children.cycles-pp.zap_pte_range
1.28 ?223% perf-profile.children.cycles-pp.unmap_vmas
1.28 ?223% perf-profile.children.cycles-pp.unmap_page_range
1.28 ?223% perf-profile.children.cycles-pp.zap_pmd_range
1.91 ?141% perf-profile.children.cycles-pp.ksys_read
1.91 ?141% perf-profile.children.cycles-pp.vfs_read
2.56 ?223% perf-profile.children.cycles-pp.exec_mmap
+5.0% 4.374e+09 ? 3% perf-stat.i.branch-instructions
37.81 perf-stat.i.cache-miss-rate%
+3.8% 3.643e+08 perf-stat.i.cache-misses
6231 perf-stat.i.context-switches
+1.4% 8.056e+10 perf-stat.i.cpu-cycles
+2.9% 1.272e+10 perf-stat.i.dTLB-loads
0.09 perf-stat.i.dTLB-store-miss-rate%
+7.2% 5.109e+09 perf-stat.i.dTLB-stores
56.31 perf-stat.i.iTLB-load-miss-rate%
+3.2% 1525290 perf-stat.i.iTLB-loads
+2.8% 4.761e+10 perf-stat.i.instructions
+1.3% 842562 perf-stat.i.metric.GHz
+4.1% 2.401e+08 perf-stat.i.metric.M/sec
37.15 ? 2% perf-stat.i.node-load-miss-rate%
-47.3% 3600340 perf-stat.i.node-load-misses
-11.1% 18007910 perf-stat.i.node-loads
+4.9% 5428994 perf-stat.i.node-store-misses
+13.0% 72841822 perf-stat.i.node-stores
-2.4% 13.58 perf-stat.overall.MPKI
0.75 ? 3% perf-stat.overall.branch-miss-rate%
56.35 perf-stat.overall.cache-miss-rate%
-2.3% 221.12 perf-stat.overall.cycles-between-cache-misses
0.02 perf-stat.overall.dTLB-load-miss-rate%
0.07 perf-stat.overall.dTLB-store-miss-rate%
52.88 perf-stat.overall.iTLB-load-miss-rate%
+3.5% 27836 perf-stat.overall.instructions-per-iTLB-miss
16.66 perf-stat.overall.node-load-miss-rate%
6.94 perf-stat.overall.node-store-miss-rate%
+5.0% 4.309e+09 ? 3% perf-stat.ps.branch-instructions
+3.8% 3.589e+08 perf-stat.ps.cache-misses
6121 perf-stat.ps.context-switches
+1.4% 7.936e+10 perf-stat.ps.cpu-cycles
+2.9% 1.253e+10 perf-stat.ps.dTLB-loads
+7.3% 5.033e+09 perf-stat.ps.dTLB-stores
+3.3% 1501333 perf-stat.ps.iTLB-loads
+2.8% 4.691e+10 perf-stat.ps.instructions
-47.3% 3546712 perf-stat.ps.node-load-misses
-11.1% 17743546 perf-stat.ps.node-loads
+4.9% 5348852 perf-stat.ps.node-store-misses
+13.0% 71772875 perf-stat.ps.node-stores
+4.8% 2.717e+12 perf-stat.total.instructions
*************************************************************************
threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
===============================================================
ernor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp5/getrandom/stress-ng/60s
don't use REP_GOOD or ERMS for small memory clearing")
don't use REP_GOOD or ERMS for user memory copies")
adfcf4231b8cbc2d9c1e7bfaa96
---------------------------
fail:runs
|
:3 dmesg.RIP:_copy_to_iter
:3 dmesg.RIP:chacha_block_generic
1:3 dmesg.RIP:chacha_permute
2:3 dmesg.RIP:copy_user_fast_string
:3 dmesg.WARNING:CPU:#PID:#at_arch/x86/evperf:interrupt_took_too_long(#>#),lowering_kernel.perf_event_max_sample_rate_to
%stddev
\
-14.5% 6605482 meminfo.DirectMap2M
3268 proc-vmstat.nr_page_table_pages
-3.6% 46.99 ? 2% turbostat.RAMWatt
0.39 perf-stat.i.MPKI
+1.4% 176.18 perf-stat.i.cpu-migrations
0.00 perf-stat.i.dTLB-store-miss-rate%
+1.0% 171.29 perf-stat.ps.cpu-migrations
0.00 perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string.copyout._copy_to_iter.get_random_bytes_user.__x64_sys_getrandom
19.09 perf-profile.calltrace.cycles-pp._copy_to_iter.get_random_bytes_user.__x64_sys_getrandom.do_syscall_64.entry_SYSCALL_64_after_hwframe
10.54 perf-profile.calltrace.cycles-pp.copyout._copy_to_iter.get_random_bytes_user.__x64_sys_getrandom.do_syscall_64
75.64 perf-profile.calltrace.cycles-pp.chacha_block_generic.get_random_bytes_user.__x64_sys_getrandom.do_syscall_64.entry_SYSCALL_64_after_hwframe
66.52 perf-profile.calltrace.cycles-pp.chacha_permute.chacha_block_generic.get_random_bytes_user.__x64_sys_getrandom.do_syscall_64
7.87 perf-profile.calltrace.cycles-pp.copy_user_fast_string.copyout._copy_to_iter.get_random_bytes_user.__x64_sys_getrandom
0.00 perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
19.91 perf-profile.children.cycles-pp._copy_to_iter
9.99 perf-profile.children.cycles-pp.copyout
0.10 perf-profile.children.cycles-pp.ktime_get_update_offsets_now
0.40 perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
77.26 perf-profile.children.cycles-pp.chacha_block_generic
67.79 perf-profile.children.cycles-pp.chacha_permute
9.05 perf-profile.children.cycles-pp.copy_user_fast_string
0.00 perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
6.20 perf-profile.self.cycles-pp._copy_to_iter
0.10 ? 4% perf-profile.self.cycles-pp.ktime_get_update_offsets_now
0.33 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
67.29 perf-profile.self.cycles-pp.chacha_permute
8.79 perf-profile.self.cycles-pp.copy_user_fast_string
0.55 ? 2% sched_debug.cfs_rq:/.h_nr_running.avg
9043 ? 38% sched_debug.cfs_rq:/.load.avg
-40.1% 775.67 ? 22% sched_debug.cfs_rq:/.load_avg.max
-33.8% 136.07 ? 4% sched_debug.cfs_rq:/.load_avg.stddev
+195.0% 4144508 sched_debug.cfs_rq:/.min_vruntime.avg
+192.5% 4191972 sched_debug.cfs_rq:/.min_vruntime.max
+194.8% 2880861 sched_debug.cfs_rq:/.min_vruntime.min
0.54 sched_debug.cfs_rq:/.nr_running.avg
-40.0% 512.00 sched_debug.cfs_rq:/.removed.load_avg.max
-36.6% 103.40 ? 23% sched_debug.cfs_rq:/.removed.load_avg.stddev
+71.4% 638.84 sched_debug.cfs_rq:/.runnable_avg.avg
1589 ? 9% sched_debug.cfs_rq:/.runnable_avg.max
455.33 ? 11% sched_debug.cfs_rq:/.runnable_avg.min
60082 ? 20% sched_debug.cfs_rq:/.spread0.avg
108041 ? 11% sched_debug.cfs_rq:/.spread0.max
+70.2% 632.85 sched_debug.cfs_rq:/.util_avg.avg
254.61 sched_debug.cfs_rq:/.util_est_enqueued.avg
278294 ? 4% sched_debug.cpu.avg_idle.min
9.65 ? 17% sched_debug.cpu.clock.stddev
3137 sched_debug.cpu.curr->pid.avg
5754 sched_debug.cpu.curr->pid.max
2922 sched_debug.cpu.curr->pid.min
-10.0% 580100 sched_debug.cpu.max_idle_balance_cost.max
12856 ? 11% sched_debug.cpu.max_idle_balance_cost.stddev
0.00 ? 13% sched_debug.cpu.next_balance.stddev
0.55 ? 2% sched_debug.cpu.nr_running.avg
1.50 sched_debug.cpu.nr_running.max
+56.1% 362.67 sched_debug.cpu.nr_switches.min
*************************************************************************
threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
===============================================================
kconfig/rootfs/tbox_group/test/testcase:
4-rhel-8.3/debian-x86_64-phoronix/lkp-csl-2sp7/ior-1.1.1/phoronix-test-suite
don't use REP_GOOD or ERMS for small memory clearing")
don't use REP_GOOD or ERMS for user memory copies")
adfcf4231b8cbc2d9c1e7bfaa96
---------------------------
%stddev
\
3444 phoronix-test-suite.ior.2MB./opt/rootfs.mb_s
3441 phoronix-test-suite.ior.2MB.DefaultTestDirectory.mb_s
32218 ? 11% sched_debug.cfs_rq:/.min_vruntime.avg
42191 ? 7% sched_debug.cfs_rq:/.min_vruntime.max
3022 ? 18% sched_debug.cfs_rq:/.min_vruntime.stddev
3026 ? 18% sched_debug.cfs_rq:/.spread0.stddev
-22.1% 631725 ? 6% sched_debug.cpu.max_idle_balance_cost.max
1.67 ?223% perf-profile.calltrace.cycles-pp.asm_exc_page_fault
1.67 ?223% perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault
1.67 ?223% perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
1.67 ?223% perf-profile.children.cycles-pp.asm_exc_page_fault
1.67 ?223% perf-profile.children.cycles-pp.exc_page_fault
1.67 ?223% perf-profile.children.cycles-pp.do_user_addr_fault
+5.8% 4169072 ? 3% perf-stat.i.cache-misses
1730 perf-stat.i.cycles-between-cache-misses
65.54 perf-stat.i.iTLB-load-miss-rate%
-3.7% 972226 perf-stat.i.iTLB-load-misses
+9.1% 703307 ? 4% perf-stat.i.iTLB-loads
+13.0% 161225 ? 8% perf-stat.i.node-store-misses
+17.7% 242357 ? 10% perf-stat.i.node-stores
estimated based on internal Intel analysis and are provided
purposes only. Any difference in system hardware or software
may affect actual performance.
om/intel/lkp-tests">https://github.com/intel/lkp-tests

Attachments:

(No filename) (41.78 kB)
config-6.3.0-rc7-00003-gadfcf4231b8c (159.65 kB)
job-script (7.98 kB)
job.yaml (5.68 kB)
reproduce (459.00 B)
Download all attachments

2023-05-04 18:04:29

by Linus Torvalds

[permalink] [raw]

Subject: Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

On Thu, May 4, 2023 at 1:12 AM kernel test robot <[email protected]> wrote:
>
> Since there are both improvements and regressions in various micro
> benchmarks, we are sending this report to provide some data and hope
> this can be helpful for further investigation. Thanks.

Heh, yes. As that makes clear, the 10% regression on one load was a
25% improvement on another.

That said, while I was expecting these kinds of results, the absolute
percentages are much higher than I expected. I would have expected a
lot of "couple of percent either way".

Even with memory copies being as important as they are, moving the
needle by -10% to +25% is a much bigger swing than expected.

> Details are as below:

One of the issues with the details is that these benchmarks are
running on different machines, so there's a mix of

Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake)
Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake)

in those details, and it might matter.

That said, the big picture is probably independent - both have ERMS
with 'rep movs' doing some cacheline optimizations, and the big
changes in the perf data is in things like 'perf-stat.ps.dTLB-stores'
and 'perf-stat.ps.node-store-misses' which is probably exactly about
that.

And I suspect that the 'blogbench.read_score' thing probably does
mostly big reads, and really really liked the cacheline optimizations
of 'rep movs' even when it was slow in other things.

Let me think about how ugly it would be to re-introduce that case just
for large copies.

Linus

2023-05-04 19:03:13

by Linus Torvalds

[permalink] [raw]

Subject: Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

On Thu, May 4, 2023 at 10:39 AM Linus Torvalds
<[email protected]> wrote:
>
> And I suspect that the 'blogbench.read_score' thing probably does
> mostly big reads, and really really liked the cacheline optimizations
> of 'rep movs' even when it was slow in other things.

Oh, and I went to look at the blogbench code, and it's a
*horrifically* bad benchmark for this case.

It may actually be a good benchmark for other things - the "look up
random filenames" etc might be somewhat indicative of real performance
on some loads.

But the read() system call part is very much *not* good.

It looks like the reason blogbench likes 'rep movs' is the traditional
reason why memory copy benchmarks tend to like uncached copies: it
looks like the benchmark doesn't actually even look at the data it
reads.

It's a common pattern, but it's a bad pattern. It's not very different
from doing CPU benchmarks using an empty loop:

for (int i = 0; i < 1000000; i++) /*nothing*/;

and then being surprised when the compiler optimized the loop away
because it doesn't actually *do* anything.

Of course, unlike the empty loop, the kernel will still do the reads,
but the kernel will do the reads with the assumption that the result
matters.

And for blogbench, that is *doubly* not the case.

Because it looks like blogbench does a loop over

read_dummy_file(file_name);

which will open that file, and then loop over

#define READ_CHUNK_SIZE 65536
...
static char garbage[READ_CHUNK_SIZE];
..
loop:
if ((readen = read(fd, garbage, sizeof garbage))

and that variable is very appropriately named indeed. It is very much garbage.

So it really looks like what is going on is that you have

(a) a threaded application (looks like N=100 for reads by default)

(b) doing 'read()' calls into a *shared* buffer in parallel

(c) and never using the buffer for anything else

which means that a memory copy that does non-temporal writes is going
to look artificially really good, because nobody wants to ever see the
end result.

Now, the "nobody wants to ever see the end result", might actually be
at least _partially_ valid in some historical setting, if you were to
use DMA to then push things out to a network. Back in the olden days,
that was actually a valid reason to use non-temporal writes, because
the DMA would have to invalidate the caches anyway.

So that part is wrong, but isn't necessarily *horrifically* wrong.
It's just bad.

But using one shared destination buffer for N readers is *completely*
bogus. Any cache-lines would bounce around like crazy for the
(pointless) sharing. I'd like to call it "false sharing", but it's not
really false: it's intentionally just using one big shared buffer as a
sink for all these read() calls.

End result: I think the benchmark scores for reading are entirely
pointless random numbers, and trying to optimize the kernel for this
benchmark is exactly the wrong thing to do.

The benchmark doesn't actually do anything half-way valid.

At a *minimum* that benchmark should have different threads using
different read() buffers.

I suspect it might be a good idea to also touch some of the data,
because that would be the normal situation (even with old-style
zero-copy DMA you'd probably not do raw file data and push it out to
the network unmodified).

Because otherwise you will always find that objectively bad memory
copies will do better than a good memory copy that makes sense.

Anyway, that was a very long-winded way of saying that I will not use
that benchmark as a reason to touch the kernel "copy_to_user()"
implementation.

Of course, that's not to say that this might not be a real regression
on real loads, but I'd want to see those other numbers.

It looks like this is an old benchmark that hasn't been touched in
years (just going by the github activity), but I'm cc'ing the author
anyway, and will add a pointer to this email as a github issue. Maybe
the developer cares, maybe he doesn't, but no reason not to at least
mention this and see if maybe the benchmark can be improved to at
least use a thread buffers for the reads.

(The same issues are true for the writing side too, of course, but the
writing side has less parallelism and a shared *source* is not the
same kind of problematic from a cache pattern standpoint, so probably
isn't going to have anything like the same effect from any memcpy
implementation).

Linus

2023-05-09 12:35:59

by Feng Tang

[permalink] [raw]

Subject: Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

Hi Linus,

On Thu, May 04, 2023 at 11:56:12AM -0700, Linus Torvalds wrote:
> On Thu, May 4, 2023 at 10:39 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > And I suspect that the 'blogbench.read_score' thing probably does
> > mostly big reads, and really really liked the cacheline optimizations
> > of 'rep movs' even when it was slow in other things.
>
> Oh, and I went to look at the blogbench code, and it's a
> *horrifically* bad benchmark for this case.
>
> It may actually be a good benchmark for other things - the "look up
> random filenames" etc might be somewhat indicative of real performance
> on some loads.
>
> But the read() system call part is very much *not* good.
>
> It looks like the reason blogbench likes 'rep movs' is the traditional
> reason why memory copy benchmarks tend to like uncached copies: it
> looks like the benchmark doesn't actually even look at the data it
> reads.
>
> It's a common pattern, but it's a bad pattern. It's not very different
> from doing CPU benchmarks using an empty loop:
>
> for (int i = 0; i < 1000000; i++) /*nothing*/;
>
> and then being surprised when the compiler optimized the loop away
> because it doesn't actually *do* anything.
>
> Of course, unlike the empty loop, the kernel will still do the reads,
> but the kernel will do the reads with the assumption that the result
> matters.
>
> And for blogbench, that is *doubly* not the case.
>
> Because it looks like blogbench does a loop over
>
> read_dummy_file(file_name);
>
> which will open that file, and then loop over
>
> #define READ_CHUNK_SIZE 65536
> ...
> static char garbage[READ_CHUNK_SIZE];
> ..
> loop:
> if ((readen = read(fd, garbage, sizeof garbage))
>
> and that variable is very appropriately named indeed. It is very much garbage.
>
> So it really looks like what is going on is that you have
>
> (a) a threaded application (looks like N=100 for reads by default)
>
> (b) doing 'read()' calls into a *shared* buffer in parallel
>
> (c) and never using the buffer for anything else
>
> which means that a memory copy that does non-temporal writes is going
> to look artificially really good, because nobody wants to ever see the
> end result.
>
> Now, the "nobody wants to ever see the end result", might actually be
> at least _partially_ valid in some historical setting, if you were to
> use DMA to then push things out to a network. Back in the olden days,
> that was actually a valid reason to use non-temporal writes, because
> the DMA would have to invalidate the caches anyway.
>
> So that part is wrong, but isn't necessarily *horrifically* wrong.
> It's just bad.
>
> But using one shared destination buffer for N readers is *completely*
> bogus. Any cache-lines would bounce around like crazy for the
> (pointless) sharing. I'd like to call it "false sharing", but it's not
> really false: it's intentionally just using one big shared buffer as a
> sink for all these read() calls.
>
> End result: I think the benchmark scores for reading are entirely
> pointless random numbers, and trying to optimize the kernel for this
> benchmark is exactly the wrong thing to do.
>
> The benchmark doesn't actually do anything half-way valid.
>
> At a *minimum* that benchmark should have different threads using
> different read() buffers.

We tried a debug patch which allocates a dedicated buffer for
each reader thread, run it on the same Cacade Lake platform, and
the regression is _gone_, after the noise of cache false sharing is
reduced.

20f3337d350c4e1b adfcf4231b8cbc2d9c1e7bfaa96
---------------- ---------------------------
2119572 ± 2% +1.7% 2155704 ± 2% blogbench.read_score

3700996 ± 9% -7.4% 3427562 ± 16% perf-stat.ps.dTLB-load-misses
8.171e+09 +25.3% 1.024e+10 perf-stat.ps.dTLB-loads
431224 ± 9% -13.1% 374903 ± 12% perf-stat.ps.dTLB-store-misses
2.399e+09 +87.5% 4.497e+09 perf-stat.ps.dTLB-stores
8495507 ± 7% -8.2% 7794738 perf-stat.ps.iTLB-load-misses

3.07 -3.1 0.00 perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
0.00 +3.4 3.40 perf-profile.self.cycles-pp.copy_user_generic_unrolled

We also run the original blogbench binary on a IceLake platform
(which has both ERMS and FSRM), and there is almost no peformance
change for the 2 commits.

Please let us know if you need more profiling info or want to run
more tests.

Thanks,
Feng

---
The debug patch for blobbench is:

diff --git a/src/blogbench.h b/src/blogbench.h
index bf538a8..d99d5bb 100644
--- a/src/blogbench.h
+++ b/src/blogbench.h
@@ -178,6 +178,8 @@ void *rewriter(void * const fodder);
void *reader(void * const fodder);
void *commenter(void * const fodder);

+extern pthread_key_t read_buf_key;
+
#include "globals.h"

#endif
diff --git a/src/helpers.c b/src/helpers.c
index 4c17e73..85631ec 100644
--- a/src/helpers.c
+++ b/src/helpers.c
@@ -107,15 +107,18 @@ int create_atomic_file(const char * const file_name,

int read_dummy_file(const char * const file_name)
{
- static char garbage[READ_CHUNK_SIZE];
ssize_t readen;
+ void * read_buf;
int fd;

if ((fd = open(file_name, O_RDONLY)) == -1) {
return -1;
}
+
+ read_buf = pthread_getspecific(read_buf_key);
+
do {
- if ((readen = read(fd, garbage, sizeof garbage))
+ if ((readen = read(fd, read_buf, READ_CHUNK_SIZE))
< (ssize_t) 0) {
if (errno == EINTR) {
continue;
diff --git a/src/process.c b/src/process.c
index a83a980..04f1411 100644
--- a/src/process.c
+++ b/src/process.c
@@ -58,9 +58,13 @@ static int wait_rewriters(void)
return 0;
}

+pthread_key_t read_buf_key = 0;
+
static int spawn_readers(void)
{
unsigned int t = nb_readers;
+
+ pthread_key_create(&read_buf_key, NULL);

do {
t--;
diff --git a/src/reader.c b/src/reader.c
index 91bab7a..2c0efdb 100644
--- a/src/reader.c
+++ b/src/reader.c
@@ -66,6 +66,11 @@ static int read_random_blog(void)
void *reader(void * const fodder)
{
(void) fodder;
+
+ void *read_buf;
+
+ read_buf = malloc(READ_CHUNK_SIZE);
+ pthread_setspecific(read_buf_key, read_buf);

do {
if (read_random_blog() != 0) {
@@ -75,6 +80,8 @@ void *reader(void * const fodder)
usleep(USLEEP_READER);
#endif
} while (stop == (sig_atomic_t) 0);
+
+ free(read_buf);

return NULL;
}

> I suspect it might be a good idea to also touch some of the data,
> because that would be the normal situation (even with old-style
> zero-copy DMA you'd probably not do raw file data and push it out to
> the network unmodified).
>
> Because otherwise you will always find that objectively bad memory
> copies will do better than a good memory copy that makes sense.
>
> Anyway, that was a very long-winded way of saying that I will not use
> that benchmark as a reason to touch the kernel "copy_to_user()"
> implementation.
>
> Of course, that's not to say that this might not be a real regression
> on real loads, but I'd want to see those other numbers.
>
> It looks like this is an old benchmark that hasn't been touched in
> years (just going by the github activity), but I'm cc'ing the author
> anyway, and will add a pointer to this email as a github issue. Maybe
> the developer cares, maybe he doesn't, but no reason not to at least
> mention this and see if maybe the benchmark can be improved to at
> least use a thread buffers for the reads.
>
> (The same issues are true for the writing side too, of course, but the
> writing side has less parallelism and a shared *source* is not the
> same kind of problematic from a cache pattern standpoint, so probably
> isn't going to have anything like the same effect from any memcpy
> implementation).
>
> Linus

2023-05-09 17:33:37

by Linus Torvalds

[permalink] [raw]

Subject: Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

On Tue, May 9, 2023 at 5:30 AM Feng Tang <[email protected]> wrote:
>
> > At a *minimum* that benchmark should have different threads using
> > different read() buffers.
>
> We tried a debug patch which allocates a dedicated buffer for
> each reader thread, run it on the same Cacade Lake platform, and
> the regression is _gone_, after the noise of cache false sharing is
> reduced.

Thanks for verifying.

I was pretty sure that was the issue, but it's always good to
double-check that there wasn't something else going on,

Linus