2021-05-21 20:09:08

by kernel test robot

Subject: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression



Greetings,

FYI, we noticed a -9.5% regression of stress-ng.lockbus.ops_per_sec due to commit:


commit: 8901ecc2315b850f35a7b8c1b73b12388b72aa78 ("clocksource: Retry clock read if long delays detected")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master


in testcase: stress-ng
on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
with following parameters:

nr_threads: 100%
disk: 1HDD
testtime: 60s
class: memory
test: lockbus
cpufreq_governor: performance
ucode: 0x5003006


please note the following in dmesg.xz (attached):
[ 28.110351]
[ 28.302357] hrtimer: interrupt took 1878423 ns
[ 29.690760] clocksource: timekeeping watchdog on CPU53: hpet read-back delay of 169583ns, attempt 4, marking unstable
[ 29.860306] tsc: Marking TSC unstable due to clocksource watchdog
[ 30.559390] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 30.726282] sched_clock: Marking unstable (30052964508, 499342225)<-(30915547410, -363240730)
[ 31.620401] clocksource: Switched to clocksource hpet



If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
bin/lkp run generated-yaml-file

=========================================================================================
class/compiler/cpufreq_governor/disk/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime/ucode:
memory/gcc-9/performance/1HDD/x86_64-rhel-8.3/100%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2sp5/lockbus/stress-ng/60s/0x5003006

commit:
v5.13-rc1
8901ecc231 ("clocksource: Retry clock read if long delays detected")

v5.13-rc1 8901ecc2315b850f35a7b8c1b73
---------------- ---------------------------
%stddev %change %stddev
\ | \
248755 -9.5% 225157 stress-ng.lockbus.ops
4136 -9.5% 3742 stress-ng.lockbus.ops_per_sec
6580 ? 5% +28.7% 8470 stress-ng.time.percent_of_cpu_this_job_got
200.39 ? 7% +61.5% 323.71 ? 7% stress-ng.time.system_time
3892 ? 5% +27.1% 4946 stress-ng.time.user_time
13.38 -6.7% 12.48 ? 5% boot-time.dhcp
208168 ? 14% +676.6% 1616642 ? 24% cpuidle.POLL.time
674.50 ? 12% -17.4% 557.33 ? 10% interrupts.CPU95.CAL:Function_call_interrupts
170270 ? 7% +22.1% 207860 ? 6% softirqs.RCU
225.06 +1.3% 227.89 turbostat.PkgWatt
61.83 ? 6% +28.0% 79.17 vmstat.cpu.us
28.52 ? 13% -61.1% 11.10 ? 6% iostat.cpu.system
63.82 ? 5% +27.6% 81.41 iostat.cpu.user
16.30 ? 21% -15.5 0.83 ? 42% mpstat.cpu.all.irq%
57.65 ? 7% +17.1 74.73 mpstat.cpu.all.usr%
0.00 ? 75% +286.7% 0.01 ? 35% perf-sched.sch_delay.avg.ms.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.01 ? 14% +140.0% 0.02 ? 13% perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_select
0.00 ? 72% +255.0% 0.01 ? 5% perf-sched.sch_delay.max.ms.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.01 ? 6% +180.0% 0.02 ? 60% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_select
9.00 ? 30% -48.1% 4.67 ? 45% perf-sched.wait_and_delay.count.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64
550561 +5.5% 580618 proc-vmstat.nr_file_pages
307417 +10.0% 338160 ? 2% proc-vmstat.nr_mapped
2993 +8.4% 3246 proc-vmstat.nr_page_table_pages
303607 +9.9% 333614 ? 3% proc-vmstat.nr_shmem
694035 +3.0% 714653 proc-vmstat.numa_hit
607455 +3.4% 628114 proc-vmstat.numa_local
513499 +5.4% 541036 proc-vmstat.pgfault
4823 ? 4% +47.7% 7124 ? 4% proc-vmstat.pgreuse
839921 ? 8% +13.1% 950002 ? 4% numa-meminfo.node0.Inactive
839800 ? 8% +13.1% 949852 ? 4% numa-meminfo.node0.Inactive(anon)
53189 ? 8% +9.2% 58066 ? 6% numa-meminfo.node0.KReclaimable
629217 ? 3% +12.2% 705752 ? 2% numa-meminfo.node0.Mapped
7941 ? 24% +21.9% 9680 ? 3% numa-meminfo.node0.PageTables
53189 ? 8% +9.2% 58066 ? 6% numa-meminfo.node0.SReclaimable
599179 ? 4% +12.7% 675027 ? 3% numa-meminfo.node0.Shmem
605295 ? 3% +12.5% 680806 ? 2% numa-meminfo.node1.Mapped
622631 ? 3% +11.1% 691846 ? 2% numa-meminfo.node1.Shmem
193696 ? 8% +19.0% 230483 ? 3% numa-vmstat.node0.nr_inactive_anon
142337 ? 2% +19.5% 170125 ? 2% numa-vmstat.node0.nr_mapped
1868 ? 25% +25.7% 2348 ? 3% numa-vmstat.node0.nr_page_table_pages
136818 ? 2% +18.9% 162611 ? 2% numa-vmstat.node0.nr_shmem
13219 ? 8% +9.5% 14473 ? 6% numa-vmstat.node0.nr_slab_reclaimable
193666 ? 8% +19.0% 230467 ? 3% numa-vmstat.node0.nr_zone_inactive_anon
1132156 ? 12% +11.2% 1258864 ? 5% numa-vmstat.node0.numa_hit
143857 +15.9% 166725 numa-vmstat.node1.nr_file_pages
135723 ? 3% +20.1% 163039 numa-vmstat.node1.nr_mapped
142136 +17.3% 166704 numa-vmstat.node1.nr_shmem
5.12 ? 20% -68.1% 1.63 ? 24% perf-stat.i.MPKI
7.441e+08 ? 11% +61.5% 1.202e+09 ? 8% perf-stat.i.branch-instructions
2.27 ? 7% -1.3 0.94 ? 8% perf-stat.i.branch-miss-rate%
26460369 ? 13% -46.8% 14082434 ? 6% perf-stat.i.branch-misses
2567959 ? 10% -27.7% 1857035 ? 5% perf-stat.i.cache-misses
2820 ? 7% -14.3% 2416 ? 2% perf-stat.i.context-switches
303.17 ? 9% -83.0% 51.67 ? 11% perf-stat.i.cpi
129563 ? 5% -16.9% 107705 perf-stat.i.cpu-clock
190.21 ? 4% -18.4% 155.23 ? 2% perf-stat.i.cpu-migrations
298667 ? 8% -29.0% 212034 ? 4% perf-stat.i.cycles-between-cache-misses
0.20 ? 9% +0.1 0.26 ? 11% perf-stat.i.dTLB-store-miss-rate%
3.276e+08 ? 11% -41.6% 1.913e+08 ? 3% perf-stat.i.dTLB-stores
77.27 ? 3% +7.3 84.53 perf-stat.i.iTLB-load-miss-rate%
560579 ? 19% -46.9% 297416 ? 10% perf-stat.i.iTLB-loads
3.622e+09 ? 11% +52.4% 5.52e+09 ? 8% perf-stat.i.instructions
3174 ? 8% +134.9% 7456 ? 4% perf-stat.i.instructions-per-iTLB-miss
0.26 ? 11% -49.4% 0.13 ? 7% perf-stat.i.ipc
89.57 ? 12% -53.7% 41.43 ? 7% perf-stat.i.major-faults
1.77 ? 6% +32.7% 2.35 perf-stat.i.metric.GHz
14.33 ? 11% +45.7% 20.88 ? 8% perf-stat.i.metric.M/sec
5011 ? 6% -34.8% 3265 ? 4% perf-stat.i.minor-faults
73.21 ? 3% +9.3 82.54 perf-stat.i.node-load-miss-rate%
121230 ? 8% -34.7% 79105 ? 3% perf-stat.i.node-loads
50.05 ? 11% +25.9 75.94 ? 3% perf-stat.i.node-store-miss-rate%
365780 ? 12% -50.4% 181434 ? 5% perf-stat.i.node-stores
5100 ? 6% -35.2% 3306 ? 4% perf-stat.i.page-faults
129586 ? 5% -16.9% 107712 perf-stat.i.task-clock
4.80 ? 24% -72.4% 1.33 ? 20% perf-stat.overall.MPKI
2.74 ? 8% -2.2 0.57 ? 8% perf-stat.overall.branch-miss-rate%
213.44 ? 11% -75.3% 52.81 ? 12% perf-stat.overall.cpi
259389 ? 11% -21.7% 203189 ? 3% perf-stat.overall.cycles-between-cache-misses
80.09 ? 2% +5.0 85.10 perf-stat.overall.iTLB-load-miss-rate%
2124 ? 7% +238.2% 7184 ? 4% perf-stat.overall.instructions-per-iTLB-miss
0.00 ? 11% +304.5% 0.02 ? 11% perf-stat.overall.ipc
72.33 ? 3% +10.3 82.64 perf-stat.overall.node-load-miss-rate%
35.64 ? 16% +29.0 64.68 ? 2% perf-stat.overall.node-store-miss-rate%
1.982e+08 ? 11% +393.9% 9.789e+08 ? 10% perf-stat.ps.branch-instructions
783739 ? 6% +45.5% 1140674 ? 6% perf-stat.ps.cache-misses
2.27e+08 ? 10% +170.4% 6.138e+08 ? 9% perf-stat.ps.dTLB-loads
451286 ? 10% +36.6% 616231 ? 8% perf-stat.ps.iTLB-load-misses
9.571e+08 ? 11% +363.0% 4.432e+09 ? 10% perf-stat.ps.instructions
17.47 ? 14% -29.4% 12.33 perf-stat.ps.major-faults
1452 ? 5% +24.8% 1812 ? 5% perf-stat.ps.minor-faults
94781 ? 7% +130.4% 218412 ? 7% perf-stat.ps.node-load-misses
36077 ? 5% +26.9% 45791 ? 5% perf-stat.ps.node-loads
45982 ? 21% +163.9% 121369 ? 7% perf-stat.ps.node-store-misses
82536 ? 12% -20.0% 66048 perf-stat.ps.node-stores
1469 ? 5% +24.2% 1824 ? 5% perf-stat.ps.page-faults
4.04e+10 ? 8% +567.9% 2.698e+11 ? 9% perf-stat.total.instructions
39.23 ? 6% -23.6 15.65 ? 27% perf-profile.calltrace.cycles-pp.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
38.89 ? 6% -23.3 15.56 ? 27% perf-profile.calltrace.cycles-pp.update_process_times.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt
40.52 ? 6% -20.9 19.65 ? 26% perf-profile.calltrace.cycles-pp.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
47.38 ? 4% -17.4 29.94 ? 9% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
30.70 ? 7% -16.7 14.02 ? 27% perf-profile.calltrace.cycles-pp.scheduler_tick.update_process_times.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues
21.38 ? 7% -14.2 7.16 ? 30% perf-profile.calltrace.cycles-pp.task_tick_fair.scheduler_tick.update_process_times.tick_sched_handle.tick_sched_timer
5.98 ? 28% -5.3 0.63 ? 12% perf-profile.calltrace.cycles-pp.perf_event_task_tick.scheduler_tick.update_process_times.tick_sched_handle.tick_sched_timer
6.22 ? 12% -3.9 2.27 ? 31% perf-profile.calltrace.cycles-pp.update_load_avg.task_tick_fair.scheduler_tick.update_process_times.tick_sched_handle
5.58 ? 14% -3.9 1.71 ? 28% perf-profile.calltrace.cycles-pp.update_curr.task_tick_fair.scheduler_tick.update_process_times.tick_sched_handle
4.50 ? 10% -3.0 1.51 ? 31% perf-profile.calltrace.cycles-pp.hrtimer_active.task_tick_fair.scheduler_tick.update_process_times.tick_sched_handle
2.50 ? 14% -2.2 0.31 ?100% perf-profile.calltrace.cycles-pp.trigger_load_balance.update_process_times.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues
2.43 ? 8% -2.1 0.35 ?107% perf-profile.calltrace.cycles-pp.rcu_sched_clock_irq.update_process_times.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues
1.97 ? 8% -1.4 0.59 ? 54% perf-profile.calltrace.cycles-pp.update_cfs_group.task_tick_fair.scheduler_tick.update_process_times.tick_sched_handle
1.78 ? 13% -1.3 0.47 ? 80% perf-profile.calltrace.cycles-pp.__update_load_avg_cfs_rq.update_load_avg.task_tick_fair.scheduler_tick.update_process_times
1.77 ? 15% -1.1 0.69 ? 55% perf-profile.calltrace.cycles-pp.__update_load_avg_se.update_load_avg.task_tick_fair.scheduler_tick.update_process_times
2.21 ? 12% -0.9 1.33 ? 14% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
2.21 ? 12% -0.9 1.33 ? 14% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.85 ? 7% -0.8 2.00 ? 10% perf-profile.calltrace.cycles-pp.load_balance.rebalance_domains.__softirqentry_text_start.irq_exit_rcu.sysvec_apic_timer_interrupt
3.54 ? 6% -0.8 2.70 ? 9% perf-profile.calltrace.cycles-pp.rebalance_domains.__softirqentry_text_start.irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
1.60 ? 10% -0.6 0.96 ? 12% perf-profile.calltrace.cycles-pp.execve
1.59 ? 10% -0.6 0.96 ? 12% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.execve
1.59 ? 10% -0.6 0.96 ? 12% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.execve
1.95 ? 8% -0.6 1.31 ? 9% perf-profile.calltrace.cycles-pp.find_busiest_group.load_balance.rebalance_domains.__softirqentry_text_start.irq_exit_rcu
1.59 ? 11% -0.6 0.96 ? 12% perf-profile.calltrace.cycles-pp.do_execveat_common.__x64_sys_execve.do_syscall_64.entry_SYSCALL_64_after_hwframe.execve
1.59 ? 10% -0.6 0.96 ? 12% perf-profile.calltrace.cycles-pp.__x64_sys_execve.do_syscall_64.entry_SYSCALL_64_after_hwframe.execve
1.63 ? 13% -0.6 1.02 ? 16% perf-profile.calltrace.cycles-pp.update_sd_lb_stats.find_busiest_group.load_balance.rebalance_domains.__softirqentry_text_start
1.31 ? 10% -0.5 0.79 ? 13% perf-profile.calltrace.cycles-pp.bprm_execve.do_execveat_common.__x64_sys_execve.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.81 ? 24% -0.5 0.29 ?100% perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork
1.19 ? 10% -0.5 0.70 ? 13% perf-profile.calltrace.cycles-pp.exec_binprm.bprm_execve.do_execveat_common.__x64_sys_execve.do_syscall_64
1.15 ? 10% -0.5 0.67 ? 13% perf-profile.calltrace.cycles-pp.load_elf_binary.exec_binprm.bprm_execve.do_execveat_common.__x64_sys_execve
0.86 ? 9% -0.4 0.41 ? 71% perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
1.13 ? 8% -0.4 0.77 ? 12% perf-profile.calltrace.cycles-pp.asm_exc_page_fault
1.20 ? 5% -0.4 0.85 ? 16% perf-profile.calltrace.cycles-pp.smpboot_thread_fn.kthread.ret_from_fork
0.74 ? 8% -0.3 0.39 ? 72% perf-profile.calltrace.cycles-pp.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.74 ? 8% -0.3 0.39 ? 72% perf-profile.calltrace.cycles-pp.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.74 ? 8% -0.3 0.39 ? 72% perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.08 ? 8% -0.3 0.74 ? 13% perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault
1.07 ? 9% -0.3 0.73 ? 13% perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
0.91 ? 9% -0.3 0.61 ? 14% perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
0.92 ? 11% -0.3 0.62 ? 13% perf-profile.calltrace.cycles-pp.__libc_fork
0.78 ? 16% -0.3 0.51 ? 46% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.update_sd_lb_stats.find_busiest_group.load_balance.rebalance_domains
1.11 ? 15% +0.5 1.64 ? 11% perf-profile.calltrace.cycles-pp.read
1.09 ? 15% +0.5 1.62 ? 11% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read
1.09 ? 15% +0.5 1.62 ? 11% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
1.07 ? 15% +0.5 1.60 ? 11% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
1.08 ? 15% +0.5 1.61 ? 11% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
0.97 ? 18% +0.6 1.54 ? 12% perf-profile.calltrace.cycles-pp.new_sync_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.17 ?141% +0.6 0.77 ? 21% perf-profile.calltrace.cycles-pp.__irqentry_text_end
0.59 ? 45% +0.7 1.33 ? 11% perf-profile.calltrace.cycles-pp.seq_read_iter.new_sync_read.vfs_read.ksys_read.do_syscall_64
0.50 ? 45% +0.8 1.25 ? 12% perf-profile.calltrace.cycles-pp.show_stat.seq_read_iter.new_sync_read.vfs_read.ksys_read
0.77 ? 19% +0.8 1.56 ? 31% perf-profile.calltrace.cycles-pp.rcu_core.__softirqentry_text_start.irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
0.00 +3.4 3.37 ? 20% perf-profile.calltrace.cycles-pp.read_hpet.ktime_get.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt
0.00 +3.7 3.67 ? 21% perf-profile.calltrace.cycles-pp.ktime_get.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
76.25 ? 2% +4.6 80.88 ? 2% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt
0.00 +4.7 4.65 ? 23% perf-profile.calltrace.cycles-pp.read_hpet.ktime_get.sched_clock_tick.scheduler_tick.update_process_times
0.00 +5.2 5.18 ? 25% perf-profile.calltrace.cycles-pp.ktime_get.sched_clock_tick.scheduler_tick.update_process_times.tick_sched_handle
0.00 +5.3 5.29 ? 25% perf-profile.calltrace.cycles-pp.sched_clock_tick.scheduler_tick.update_process_times.tick_sched_handle.tick_sched_timer
3.85 ? 19% +5.4 9.24 ? 26% perf-profile.calltrace.cycles-pp.perf_mux_hrtimer_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
1.11 ? 11% +7.7 8.77 ? 25% perf-profile.calltrace.cycles-pp.ktime_get.perf_mux_hrtimer_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
0.00 +8.2 8.16 ? 25% perf-profile.calltrace.cycles-pp.read_hpet.ktime_get.perf_mux_hrtimer_handler.__hrtimer_run_queues.hrtimer_interrupt
66.60 ? 3% +11.5 78.13 ? 3% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
3.32 ? 8% +12.4 15.75 ? 20% perf-profile.calltrace.cycles-pp.ktime_get_update_offsets_now.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
59.52 ? 3% +12.8 72.32 ? 5% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
59.20 ? 3% +13.0 72.23 ? 5% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
0.00 +14.6 14.64 ? 21% perf-profile.calltrace.cycles-pp.read_hpet.ktime_get_update_offsets_now.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
6.36 ? 12% +19.2 25.58 ? 6% perf-profile.calltrace.cycles-pp.clockevents_program_event.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
0.00 +20.6 20.55 ? 6% perf-profile.calltrace.cycles-pp.read_hpet.ktime_get.clockevents_program_event.hrtimer_interrupt.__sysvec_apic_timer_interrupt
1.27 ? 25% +21.2 22.46 ? 5% perf-profile.calltrace.cycles-pp.ktime_get.clockevents_program_event.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
44.12 ? 5% -26.6 17.57 ? 25% perf-profile.children.cycles-pp.tick_sched_handle
43.72 ? 5% -26.3 17.47 ? 25% perf-profile.children.cycles-pp.update_process_times
45.65 ? 6% -23.5 22.13 ? 23% perf-profile.children.cycles-pp.tick_sched_timer
53.67 ? 3% -20.2 33.48 ? 8% perf-profile.children.cycles-pp.__hrtimer_run_queues
34.46 ? 7% -18.8 15.70 ? 25% perf-profile.children.cycles-pp.scheduler_tick
23.76 ? 6% -15.8 7.95 ? 28% perf-profile.children.cycles-pp.task_tick_fair
11.74 ? 42% -7.5 4.26 ? 77% perf-profile.children.cycles-pp.asm_exc_nmi
6.25 ? 21% -5.5 0.72 ? 11% perf-profile.children.cycles-pp.perf_event_task_tick
7.19 ? 11% -4.4 2.77 ? 25% perf-profile.children.cycles-pp.update_load_avg
6.06 ? 9% -4.0 2.11 ? 23% perf-profile.children.cycles-pp.update_curr
4.70 ? 16% -3.8 0.86 ? 44% perf-profile.children.cycles-pp.ghes_notify_nmi
4.97 ? 9% -3.3 1.66 ? 29% perf-profile.children.cycles-pp.hrtimer_active
3.57 ? 21% -3.3 0.28 ? 32% perf-profile.children.cycles-pp.__intel_pmu_enable_all
2.84 ? 14% -2.3 0.58 ? 20% perf-profile.children.cycles-pp.trigger_load_balance
2.79 ? 8% -2.1 0.65 ? 34% perf-profile.children.cycles-pp.rcu_sched_clock_irq
3.22 ? 7% -2.1 1.12 ? 50% perf-profile.children.cycles-pp.native_sched_clock
8.93 ? 7% -1.9 7.03 ? 11% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
8.92 ? 7% -1.9 7.02 ? 11% perf-profile.children.cycles-pp.do_syscall_64
2.35 ? 8% -1.7 0.69 ? 21% perf-profile.children.cycles-pp.irqtime_account_irq
2.19 ? 8% -1.5 0.73 ? 31% perf-profile.children.cycles-pp.update_cfs_group
1.78 ? 6% -1.3 0.44 ? 9% perf-profile.children.cycles-pp._raw_spin_lock
1.59 ? 15% -1.3 0.28 ? 17% perf-profile.children.cycles-pp.update_rq_clock
2.09 ? 12% -1.3 0.82 ? 25% perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
1.35 ? 25% -1.3 0.09 ? 73% perf-profile.children.cycles-pp.nmi_handle
4.19 ? 5% -1.2 3.02 ? 9% perf-profile.children.cycles-pp.rebalance_domains
2.08 ? 14% -1.1 0.95 ? 26% perf-profile.children.cycles-pp.__update_load_avg_se
3.36 ? 7% -1.1 2.27 ? 9% perf-profile.children.cycles-pp.load_balance
1.61 ? 8% -1.0 0.56 ? 50% perf-profile.children.cycles-pp.native_flush_tlb_one_user
1.26 ? 23% -1.0 0.24 ? 30% perf-profile.children.cycles-pp.arch_scale_freq_tick
1.36 ? 41% -1.0 0.40 ? 67% perf-profile.children.cycles-pp.paranoid_entry
1.18 ? 10% -0.9 0.25 ? 17% perf-profile.children.cycles-pp.sched_clock_cpu
1.07 ? 10% -0.9 0.19 ? 60% perf-profile.children.cycles-pp.exc_nmi
1.13 ? 34% -0.8 0.29 ? 70% perf-profile.children.cycles-pp.repeat_nmi
2.29 ? 8% -0.8 1.47 ? 10% perf-profile.children.cycles-pp.find_busiest_group
2.16 ? 9% -0.8 1.37 ? 9% perf-profile.children.cycles-pp.update_sd_lb_stats
0.92 ? 24% -0.7 0.18 ? 30% perf-profile.children.cycles-pp.sync_regs
1.09 ? 10% -0.7 0.35 ? 7% perf-profile.children.cycles-pp.hrtimer_update_next_event
1.70 ? 11% -0.7 0.97 ? 12% perf-profile.children.cycles-pp.__x64_sys_execve
1.70 ? 11% -0.7 0.97 ? 12% perf-profile.children.cycles-pp.do_execveat_common
0.93 ? 7% -0.7 0.26 ? 8% perf-profile.children.cycles-pp.__remove_hrtimer
1.60 ? 10% -0.6 0.96 ? 12% perf-profile.children.cycles-pp.execve
0.77 ? 15% -0.6 0.15 ? 30% perf-profile.children.cycles-pp.update_irq_load_avg
0.93 ? 11% -0.6 0.32 ? 31% perf-profile.children.cycles-pp.acpi_os_read_memory
1.40 ? 10% -0.6 0.80 ? 13% perf-profile.children.cycles-pp.bprm_execve
2.13 ? 4% -0.6 1.57 ? 10% perf-profile.children.cycles-pp.asm_exc_page_fault
1.27 ? 10% -0.6 0.71 ? 13% perf-profile.children.cycles-pp.exec_binprm
2.00 ? 4% -0.5 1.46 ? 10% perf-profile.children.cycles-pp.exc_page_fault
1.22 ? 10% -0.5 0.68 ? 13% perf-profile.children.cycles-pp.load_elf_binary
2.00 ? 4% -0.5 1.45 ? 9% perf-profile.children.cycles-pp.do_user_addr_fault
1.72 ? 4% -0.5 1.20 ? 10% perf-profile.children.cycles-pp.handle_mm_fault
0.87 ? 12% -0.5 0.36 ? 31% perf-profile.children.cycles-pp.reweight_entity
1.62 ? 4% -0.5 1.12 ? 10% perf-profile.children.cycles-pp.__handle_mm_fault
0.69 ? 29% -0.5 0.19 ? 10% perf-profile.children.cycles-pp.native_irq_return_iret
0.79 ? 8% -0.5 0.29 ? 10% perf-profile.children.cycles-pp.enqueue_hrtimer
0.74 ? 13% -0.5 0.25 ? 31% perf-profile.children.cycles-pp.update_min_vruntime
0.68 ? 8% -0.5 0.19 ? 6% perf-profile.children.cycles-pp.timerqueue_del
0.83 ? 7% -0.5 0.38 ? 16% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.71 ? 43% -0.4 0.27 ? 76% perf-profile.children.cycles-pp.first_nmi
1.17 ? 9% -0.4 0.73 ? 14% perf-profile.children.cycles-pp.exit_mmap
1.17 ? 9% -0.4 0.74 ? 15% perf-profile.children.cycles-pp.mmput
0.68 ? 9% -0.4 0.24 ? 9% perf-profile.children.cycles-pp.timerqueue_add
0.55 ? 17% -0.4 0.14 ? 28% perf-profile.children.cycles-pp.account_user_time
1.05 ? 4% -0.4 0.67 ? 19% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.83 ? 6% -0.4 0.48 ? 11% perf-profile.children.cycles-pp.do_filp_open
0.56 ? 11% -0.4 0.20 ? 8% perf-profile.children.cycles-pp.__hrtimer_next_event_base
0.82 ? 6% -0.3 0.47 ? 10% perf-profile.children.cycles-pp.path_openat
1.20 ? 4% -0.3 0.85 ? 16% perf-profile.children.cycles-pp.smpboot_thread_fn
0.84 ? 5% -0.3 0.50 ? 11% perf-profile.children.cycles-pp.do_sys_open
0.83 ? 5% -0.3 0.49 ? 11% perf-profile.children.cycles-pp.do_sys_openat2
0.49 ? 12% -0.3 0.18 ? 8% perf-profile.children.cycles-pp.__hrtimer_get_next_event
0.92 ? 10% -0.3 0.63 ? 14% perf-profile.children.cycles-pp.__libc_fork
1.08 ? 7% -0.3 0.78 ? 12% perf-profile.children.cycles-pp.finish_task_switch
0.65 ? 12% -0.3 0.35 ? 12% perf-profile.children.cycles-pp.begin_new_exec
0.81 ? 23% -0.3 0.52 ? 12% perf-profile.children.cycles-pp.worker_thread
0.37 ? 9% -0.3 0.09 ? 49% perf-profile.children.cycles-pp.default_do_nmi
0.59 ? 14% -0.3 0.31 ? 12% perf-profile.children.cycles-pp.vm_mmap_pgoff
0.56 ? 15% -0.3 0.29 ? 12% perf-profile.children.cycles-pp.do_mmap
0.68 ? 11% -0.3 0.43 ? 16% perf-profile.children.cycles-pp.copy_process
0.33 ? 15% -0.3 0.08 ? 38% perf-profile.children.cycles-pp.ghes_copy_tofrom_phys
0.71 ? 26% -0.3 0.46 ? 13% perf-profile.children.cycles-pp.process_one_work
0.71 ? 11% -0.2 0.46 ? 15% perf-profile.children.cycles-pp.__do_sys_clone
0.71 ? 11% -0.2 0.46 ? 15% perf-profile.children.cycles-pp.kernel_clone
0.83 ? 5% -0.2 0.59 ? 10% perf-profile.children.cycles-pp.do_fault
0.64 ? 10% -0.2 0.40 ? 20% perf-profile.children.cycles-pp.try_to_wake_up
0.28 ? 8% -0.2 0.03 ?100% perf-profile.children.cycles-pp.rb_erase
0.50 ? 15% -0.2 0.27 ? 13% perf-profile.children.cycles-pp.mmap_region
0.30 ? 10% -0.2 0.07 ? 23% perf-profile.children.cycles-pp.run_posix_cpu_timers
0.58 ? 9% -0.2 0.36 ? 15% perf-profile.children.cycles-pp.unmap_vmas
0.30 ? 8% -0.2 0.08 ? 16% perf-profile.children.cycles-pp.tick_program_event
0.33 ? 6% -0.2 0.12 ? 16% perf-profile.children.cycles-pp.hrtimer_forward
0.25 ? 20% -0.2 0.03 ?100% perf-profile.children.cycles-pp.account_process_tick
0.46 ? 15% -0.2 0.24 ? 7% perf-profile.children.cycles-pp.ksys_mmap_pgoff
0.46 ? 8% -0.2 0.25 ? 12% perf-profile.children.cycles-pp.kmem_cache_alloc
0.75 ? 8% -0.2 0.55 ? 16% perf-profile.children.cycles-pp.__x64_sys_exit_group
0.75 ? 8% -0.2 0.55 ? 16% perf-profile.children.cycles-pp.do_group_exit
0.75 ? 8% -0.2 0.55 ? 16% perf-profile.children.cycles-pp.do_exit
0.33 ? 16% -0.2 0.14 ? 66% perf-profile.children.cycles-pp.intel_pmu_handle_irq
0.28 ? 7% -0.2 0.09 ? 51% perf-profile.children.cycles-pp.__native_set_fixmap
0.49 ? 9% -0.2 0.29 ? 8% perf-profile.children.cycles-pp.__alloc_pages
0.47 ? 10% -0.2 0.28 ? 20% perf-profile.children.cycles-pp.dup_mm
0.42 ? 9% -0.2 0.23 ? 20% perf-profile.children.cycles-pp.walk_component
0.51 ? 10% -0.2 0.33 ? 15% perf-profile.children.cycles-pp.unmap_page_range
0.27 ? 12% -0.2 0.09 ? 54% perf-profile.children.cycles-pp.calc_global_load_tick
0.44 ? 11% -0.2 0.26 ? 20% perf-profile.children.cycles-pp.dup_mmap
0.41 ? 9% -0.2 0.24 ? 20% perf-profile.children.cycles-pp.link_path_walk
0.23 ? 38% -0.2 0.06 ?118% perf-profile.children.cycles-pp.intel_bts_disable_local
0.61 ? 6% -0.2 0.45 ? 16% perf-profile.children.cycles-pp.run_ksoftirqd
0.47 ? 9% -0.2 0.30 ? 15% perf-profile.children.cycles-pp.zap_pte_range
0.33 ? 14% -0.2 0.16 ? 22% perf-profile.children.cycles-pp.__calc_delta
0.22 ? 36% -0.2 0.06 ?119% perf-profile.children.cycles-pp.__intel_pmu_disable_all
0.55 ? 6% -0.2 0.39 ? 13% perf-profile.children.cycles-pp.filemap_map_pages
0.31 ? 12% -0.2 0.15 ? 28% perf-profile.children.cycles-pp.__accumulate_pelt_segments
0.40 ? 9% -0.1 0.26 ? 12% perf-profile.children.cycles-pp.setlocale
0.20 ? 16% -0.1 0.06 ? 46% perf-profile.children.cycles-pp.sched_slice
0.17 ? 7% -0.1 0.04 ? 75% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
0.33 ? 7% -0.1 0.20 ? 11% perf-profile.children.cycles-pp.get_page_from_freelist
0.16 ? 16% -0.1 0.04 ?104% perf-profile.children.cycles-pp.native_set_fixmap
0.33 ? 14% -0.1 0.20 ? 15% perf-profile.children.cycles-pp.free_pgtables
0.16 ? 21% -0.1 0.04 ? 72% perf-profile.children.cycles-pp.schedule_idle
0.19 ? 38% -0.1 0.07 ? 18% perf-profile.children.cycles-pp.__poll
0.32 ? 16% -0.1 0.20 ? 15% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.30 ? 15% -0.1 0.19 ? 19% perf-profile.children.cycles-pp.call_timer_fn
0.19 ? 9% -0.1 0.08 ? 6% perf-profile.children.cycles-pp.rb_insert_color
0.29 ? 9% -0.1 0.18 ? 21% perf-profile.children.cycles-pp.alloc_pages_vma
0.22 ? 6% -0.1 0.11 ? 27% perf-profile.children.cycles-pp.filename_lookup
0.24 ? 19% -0.1 0.13 ? 10% perf-profile.children.cycles-pp.__do_munmap
0.27 ? 10% -0.1 0.17 ? 11% perf-profile.children.cycles-pp.__open64_nocancel
0.14 ? 34% -0.1 0.03 ?108% perf-profile.children.cycles-pp.__evlist__enable
0.21 ? 7% -0.1 0.11 ? 24% perf-profile.children.cycles-pp.path_lookupat
0.30 ? 4% -0.1 0.21 ? 12% perf-profile.children.cycles-pp.next_uptodate_page
0.18 ? 7% -0.1 0.08 ? 24% perf-profile.children.cycles-pp.__do_sys_newstat
0.19 ? 4% -0.1 0.10 ? 16% perf-profile.children.cycles-pp.step_into
0.31 ? 4% -0.1 0.21 ? 19% perf-profile.children.cycles-pp.wp_page_copy
0.17 ? 6% -0.1 0.08 ? 24% perf-profile.children.cycles-pp.vfs_statx
0.19 ? 12% -0.1 0.10 ? 18% perf-profile.children.cycles-pp.elf_map
0.14 ? 34% -0.1 0.06 ? 46% perf-profile.children.cycles-pp.do_sys_poll
0.17 ? 8% -0.1 0.09 ? 12% perf-profile.children.cycles-pp.dput
0.20 ? 15% -0.1 0.12 ? 11% perf-profile.children.cycles-pp.__split_vma
0.14 ? 34% -0.1 0.06 ? 47% perf-profile.children.cycles-pp.__x64_sys_poll
0.15 ? 13% -0.1 0.07 ? 19% perf-profile.children.cycles-pp.vma_interval_tree_insert
0.17 ? 20% -0.1 0.09 ? 31% perf-profile.children.cycles-pp.enqueue_task_fair
0.17 ? 23% -0.1 0.09 ? 31% perf-profile.children.cycles-pp.ttwu_do_activate
0.17 ? 30% -0.1 0.09 ? 15% perf-profile.children.cycles-pp.__pagevec_lru_add
0.15 ? 20% -0.1 0.07 ? 21% perf-profile.children.cycles-pp.vma_link
0.21 ? 15% -0.1 0.13 ? 14% perf-profile.children.cycles-pp.do_anonymous_page
0.18 ? 23% -0.1 0.10 ? 25% perf-profile.children.cycles-pp.proc_reg_read_iter
0.17 ? 13% -0.1 0.09 ? 15% perf-profile.children.cycles-pp.__alloc_file
0.17 ? 11% -0.1 0.10 ? 15% perf-profile.children.cycles-pp.alloc_empty_file
0.20 ? 12% -0.1 0.13 ? 16% perf-profile.children.cycles-pp.tlb_finish_mmu
0.19 ? 12% -0.1 0.12 ? 17% perf-profile.children.cycles-pp.tlb_flush_mmu
0.17 ? 14% -0.1 0.09 ? 15% perf-profile.children.cycles-pp.__mmap
0.26 ? 8% -0.1 0.19 ? 18% perf-profile.children.cycles-pp.write
0.32 ? 8% -0.1 0.24 ? 14% perf-profile.children.cycles-pp.kmem_cache_free
0.20 ? 22% -0.1 0.13 ? 12% perf-profile.children.cycles-pp.release_pages
0.16 ? 17% -0.1 0.09 ? 16% perf-profile.children.cycles-pp.copy_strings
0.17 ? 14% -0.1 0.09 ? 14% perf-profile.children.cycles-pp.__vma_adjust
0.31 ? 10% -0.1 0.24 ? 16% perf-profile.children.cycles-pp.cpumask_next_and
0.16 ? 15% -0.1 0.09 ? 22% perf-profile.children.cycles-pp.do_mprotect_pkey
0.10 ? 9% -0.1 0.03 ?101% perf-profile.children.cycles-pp.get_obj_cgroup_from_current
0.09 ? 14% -0.1 0.03 ?100% perf-profile.children.cycles-pp.__xstat64
0.17 ? 13% -0.1 0.10 ? 21% perf-profile.children.cycles-pp.__x64_sys_mprotect
0.13 ? 10% -0.1 0.07 ? 18% perf-profile.children.cycles-pp.rmqueue
0.14 ? 11% -0.1 0.08 ? 16% perf-profile.children.cycles-pp.do_open
0.16 ? 6% -0.1 0.10 ? 23% perf-profile.children.cycles-pp.__lookup_slow
0.16 ? 7% -0.1 0.10 ? 11% perf-profile.children.cycles-pp.prep_new_page
0.15 ? 13% -0.1 0.09 ? 15% perf-profile.children.cycles-pp.unlink_anon_vmas
0.14 ? 12% -0.1 0.09 ? 14% perf-profile.children.cycles-pp.lookup_fast
0.15 ? 15% -0.1 0.09 ? 21% perf-profile.children.cycles-pp.mprotect_fixup
0.11 ? 19% -0.1 0.06 ? 50% perf-profile.children.cycles-pp.__get_user_pages_remote
0.10 ? 11% -0.1 0.04 ? 75% perf-profile.children.cycles-pp.anon_vma_fork
0.09 ? 9% -0.1 0.03 ? 70% perf-profile.children.cycles-pp.__get_free_pages
0.08 ? 31% -0.1 0.03 ? 99% perf-profile.children.cycles-pp.strnlen_user
0.08 ? 12% -0.1 0.03 ? 99% perf-profile.children.cycles-pp.page_counter_try_charge
0.14 ? 21% -0.1 0.08 ? 21% perf-profile.children.cycles-pp.unlink_file_vma
0.11 ? 17% -0.1 0.06 ? 50% perf-profile.children.cycles-pp.__get_user_pages
0.17 ? 7% -0.1 0.11 ? 12% perf-profile.children.cycles-pp.___might_sleep
0.10 ? 15% -0.1 0.05 ? 46% perf-profile.children.cycles-pp.remove_vma
0.09 ? 18% -0.1 0.04 ? 71% perf-profile.children.cycles-pp.mutex_lock
0.18 ? 11% -0.1 0.12 ? 15% perf-profile.children.cycles-pp.page_remove_rmap
0.17 ? 10% -0.1 0.12 ? 13% perf-profile.children.cycles-pp.select_task_rq_fair
0.08 ? 14% -0.1 0.03 ?100% perf-profile.children.cycles-pp.__check_object_size
0.10 ? 22% -0.0 0.05 ? 46% perf-profile.children.cycles-pp.unmap_region
0.09 ? 5% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.getname_flags
0.13 ? 5% -0.0 0.08 ? 20% perf-profile.children.cycles-pp.d_alloc_parallel
0.10 ? 14% -0.0 0.06 ? 19% perf-profile.children.cycles-pp.do_open_execat
0.09 ? 8% -0.0 0.04 ? 73% perf-profile.children.cycles-pp.__pte_alloc
0.15 ? 12% -0.0 0.10 ? 19% perf-profile.children.cycles-pp.vmstat_update
0.10 ? 10% -0.0 0.05 ? 46% perf-profile.children.cycles-pp.do_dentry_open
0.13 ? 14% -0.0 0.08 ? 17% perf-profile.children.cycles-pp.select_idle_sibling
0.12 ? 22% -0.0 0.08 ? 21% perf-profile.children.cycles-pp.__cgroup_account_cputime_field
0.08 ? 8% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.__dentry_kill
0.07 ? 12% -0.0 0.03 ?100% perf-profile.children.cycles-pp.__memcg_kmem_charge_page
0.10 ? 17% -0.0 0.05 ? 48% perf-profile.children.cycles-pp.lru_add_drain
0.11 ? 14% -0.0 0.07 ? 14% perf-profile.children.cycles-pp.open64
0.08 ? 14% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
0.13 ? 8% -0.0 0.09 ? 10% perf-profile.children.cycles-pp.clear_page_erms
0.11 ? 13% -0.0 0.07 ? 21% perf-profile.children.cycles-pp.get_arg_page
0.13 ? 8% -0.0 0.08 ? 24% perf-profile.children.cycles-pp.__wake_up_common_lock
0.08 ? 13% -0.0 0.04 ? 72% perf-profile.children.cycles-pp.lru_add_drain_cpu
0.12 ? 12% -0.0 0.08 ? 14% perf-profile.children.cycles-pp.__fput
0.11 ? 15% -0.0 0.07 ? 16% perf-profile.children.cycles-pp.__clear_user
0.07 ? 10% -0.0 0.03 ?100% perf-profile.children.cycles-pp.__d_alloc
0.09 ? 12% -0.0 0.05 ? 47% perf-profile.children.cycles-pp.obj_cgroup_charge_pages
0.13 ? 11% -0.0 0.09 ? 20% perf-profile.children.cycles-pp._dl_addr
0.10 ? 7% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.__mod_lruvec_page_state
0.13 ? 16% -0.0 0.09 ? 22% perf-profile.children.cycles-pp.perf_event_mmap
0.09 ? 6% -0.0 0.05 ? 50% perf-profile.children.cycles-pp.___perf_sw_event
0.10 ? 9% -0.0 0.07 ? 14% perf-profile.children.cycles-pp.pte_alloc_one
0.09 ? 27% -0.0 0.05 ? 50% perf-profile.children.cycles-pp.vm_area_dup
0.09 ? 20% -0.0 0.06 ? 13% perf-profile.children.cycles-pp.down_write
0.09 ? 14% -0.0 0.06 ? 18% perf-profile.children.cycles-pp.malloc
0.09 ? 10% -0.0 0.06 ? 9% perf-profile.children.cycles-pp.sum_zone_numa_state
0.10 ? 13% -0.0 0.07 ? 11% perf-profile.children.cycles-pp.sysfs_kf_seq_show
0.10 ? 15% -0.0 0.07 ? 11% perf-profile.children.cycles-pp.dev_attr_show
0.08 ? 10% -0.0 0.06 ? 8% perf-profile.children.cycles-pp.__might_sleep
0.09 ? 11% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.__perf_sw_event
0.07 ? 45% +0.0 0.11 ? 13% perf-profile.children.cycles-pp.seq_put_decimal_ull_width
0.04 ? 71% +0.1 0.09 ? 25% perf-profile.children.cycles-pp.ksoftirqd_running
0.00 +0.1 0.06 ? 13% perf-profile.children.cycles-pp.num_to_str
0.48 ? 8% +0.3 0.77 ? 21% perf-profile.children.cycles-pp.__irqentry_text_end
0.07 ? 14% +0.3 0.39 ? 10% perf-profile.children.cycles-pp.get_cpu_idle_time_us
0.07 ? 14% +0.3 0.39 ? 11% perf-profile.children.cycles-pp.get_idle_time
0.00 +0.4 0.42 ? 11% perf-profile.children.cycles-pp.get_cpu_iowait_time_us
0.00 +0.4 0.42 ? 11% perf-profile.children.cycles-pp.get_iowait_time
0.00 +0.5 0.47 ? 21% perf-profile.children.cycles-pp.sched_clock_local
1.39 ? 17% +0.5 1.87 ? 11% perf-profile.children.cycles-pp.ksys_read
1.33 ? 16% +0.5 1.82 ? 11% perf-profile.children.cycles-pp.vfs_read
1.11 ? 15% +0.5 1.64 ? 11% perf-profile.children.cycles-pp.read
1.03 ? 18% +0.5 1.57 ? 12% perf-profile.children.cycles-pp.new_sync_read
0.93 ? 17% +0.6 1.49 ? 11% perf-profile.children.cycles-pp.seq_read_iter
0.55 ? 23% +0.7 1.26 ? 11% perf-profile.children.cycles-pp.show_stat
1.00 ? 16% +0.9 1.86 ? 29% perf-profile.children.cycles-pp.rcu_core
4.49 ? 21% +5.7 10.15 ? 26% perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
0.00 +6.0 5.98 ? 22% perf-profile.children.cycles-pp.sched_clock_tick
85.42 ? 2% +6.4 91.85 ? 3% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
75.45 ? 2% +11.3 86.74 ? 2% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
67.36 ? 2% +13.1 80.42 ? 4% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
67.01 ? 2% +13.3 80.31 ? 4% perf-profile.children.cycles-pp.hrtimer_interrupt
3.66 ? 8% +13.8 17.51 ? 19% perf-profile.children.cycles-pp.ktime_get_update_offsets_now
6.67 ? 10% +21.5 28.15 ? 6% perf-profile.children.cycles-pp.clockevents_program_event
3.16 ? 11% +42.0 45.11 ? 3% perf-profile.children.cycles-pp.ktime_get
0.00 +57.3 57.25 ? 5% perf-profile.children.cycles-pp.read_hpet
11.64 ? 42% -7.4 4.24 ? 77% perf-profile.self.cycles-pp.asm_exc_nmi
4.70 ? 16% -3.8 0.86 ? 44% perf-profile.self.cycles-pp.ghes_notify_nmi
3.68 ? 14% -3.1 0.57 ? 8% perf-profile.self.cycles-pp.perf_event_task_tick
2.82 ? 22% -2.6 0.25 ? 27% perf-profile.self.cycles-pp.__intel_pmu_enable_all
4.08 ? 16% -2.5 1.55 ? 30% perf-profile.self.cycles-pp.hrtimer_active
3.57 ? 9% -2.2 1.41 ? 24% perf-profile.self.cycles-pp.update_curr
3.10 ? 8% -1.9 1.16 ? 19% perf-profile.self.cycles-pp.ktime_get_update_offsets_now
2.93 ? 5% -1.8 1.09 ? 49% perf-profile.self.cycles-pp.native_sched_clock
2.35 ? 13% -1.6 0.71 ? 30% perf-profile.self.cycles-pp.task_tick_fair
2.12 ? 18% -1.6 0.52 ? 21% perf-profile.self.cycles-pp.trigger_load_balance
2.16 ? 11% -1.6 0.59 ? 33% perf-profile.self.cycles-pp.rcu_sched_clock_irq
1.71 ? 21% -1.5 0.21 ? 22% perf-profile.self.cycles-pp.update_process_times
2.31 ? 10% -1.3 0.98 ? 23% perf-profile.self.cycles-pp.update_load_avg
1.35 ? 25% -1.3 0.09 ? 73% perf-profile.self.cycles-pp.nmi_handle
1.75 ? 15% -1.1 0.68 ? 33% perf-profile.self.cycles-pp.update_cfs_group
1.61 ? 8% -1.0 0.56 ? 50% perf-profile.self.cycles-pp.native_flush_tlb_one_user
1.36 ? 41% -1.0 0.40 ? 67% perf-profile.self.cycles-pp.paranoid_entry
1.05 ? 10% -0.9 0.19 ? 60% perf-profile.self.cycles-pp.exc_nmi
1.18 ? 8% -0.8 0.34 ? 11% perf-profile.self.cycles-pp._raw_spin_lock
1.13 ? 34% -0.8 0.29 ? 70% perf-profile.self.cycles-pp.repeat_nmi
1.25 ? 14% -0.8 0.49 ? 16% perf-profile.self.cycles-pp.irqtime_account_irq
0.89 ? 25% -0.7 0.18 ? 29% perf-profile.self.cycles-pp.sync_regs
1.38 ? 14% -0.7 0.68 ? 29% perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
0.83 ? 20% -0.6 0.21 ? 31% perf-profile.self.cycles-pp.arch_scale_freq_tick
0.93 ? 11% -0.6 0.32 ? 31% perf-profile.self.cycles-pp.acpi_os_read_memory
1.34 ? 15% -0.6 0.77 ? 29% perf-profile.self.cycles-pp.__update_load_avg_se
0.68 ? 29% -0.5 0.15 ? 10% perf-profile.self.cycles-pp.native_irq_return_iret
0.71 ? 43% -0.4 0.27 ? 76% perf-profile.self.cycles-pp.first_nmi
0.58 ? 8% -0.4 0.15 ? 21% perf-profile.self.cycles-pp.scheduler_tick
0.62 ? 15% -0.4 0.23 ? 10% perf-profile.self.cycles-pp.__hrtimer_run_queues
0.51 ? 16% -0.4 0.12 ? 12% perf-profile.self.cycles-pp.update_irq_load_avg
0.56 ? 13% -0.3 0.22 ? 32% perf-profile.self.cycles-pp.update_min_vruntime
0.68 ? 21% -0.3 0.37 ? 8% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.37 ? 9% -0.3 0.09 ? 49% perf-profile.self.cycles-pp.default_do_nmi
0.89 ? 8% -0.3 0.61 ? 8% perf-profile.self.cycles-pp.update_sd_lb_stats
0.36 ? 15% -0.3 0.10 ? 21% perf-profile.self.cycles-pp.tick_sched_timer
0.33 ? 15% -0.3 0.08 ? 38% perf-profile.self.cycles-pp.ghes_copy_tofrom_phys
0.53 ? 11% -0.2 0.28 ? 34% perf-profile.self.cycles-pp.reweight_entity
0.54 ? 9% -0.2 0.32 ? 6% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.26 ? 18% -0.2 0.05 ? 76% perf-profile.self.cycles-pp.account_user_time
0.33 ? 15% -0.2 0.12 ? 21% perf-profile.self.cycles-pp.perf_mux_hrtimer_handler
0.33 ? 16% -0.2 0.14 ? 66% perf-profile.self.cycles-pp.intel_pmu_handle_irq
0.28 ? 7% -0.2 0.09 ? 51% perf-profile.self.cycles-pp.__native_set_fixmap
0.36 ? 13% -0.2 0.17 ? 7% perf-profile.self.cycles-pp.__hrtimer_next_event_base
0.33 ? 9% -0.2 0.15 ? 4% perf-profile.self.cycles-pp.timerqueue_add
0.23 ? 38% -0.2 0.06 ?118% perf-profile.self.cycles-pp.intel_bts_disable_local
0.26 ? 23% -0.2 0.09 ? 11% perf-profile.self.cycles-pp.__sysvec_apic_timer_interrupt
0.19 ? 21% -0.2 0.03 ?101% perf-profile.self.cycles-pp.account_process_tick
0.21 ? 35% -0.2 0.06 ?119% perf-profile.self.cycles-pp.__intel_pmu_disable_all
0.22 ? 12% -0.1 0.09 ? 55% perf-profile.self.cycles-pp.calc_global_load_tick
0.23 ? 10% -0.1 0.10 ? 7% perf-profile.self.cycles-pp.hrtimer_forward
0.16 ? 16% -0.1 0.04 ?104% perf-profile.self.cycles-pp.native_set_fixmap
0.19 ? 14% -0.1 0.07 ? 21% perf-profile.self.cycles-pp.tick_program_event
0.18 ? 8% -0.1 0.07 ? 25% perf-profile.self.cycles-pp.run_posix_cpu_timers
0.18 ? 13% -0.1 0.07 ? 15% perf-profile.self.cycles-pp.__remove_hrtimer
0.20 ? 15% -0.1 0.09 ? 13% perf-profile.self.cycles-pp.__hrtimer_get_next_event
0.26 ? 13% -0.1 0.16 ? 22% perf-profile.self.cycles-pp.__calc_delta
0.13 ? 15% -0.1 0.03 ?100% perf-profile.self.cycles-pp.sched_slice
0.21 ? 13% -0.1 0.12 ? 5% perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
0.12 ? 16% -0.1 0.04 ? 75% perf-profile.self.cycles-pp.perf_trace_sched_stat_runtime
0.17 ? 9% -0.1 0.10 ? 11% perf-profile.self.cycles-pp.timerqueue_del
0.19 ? 14% -0.1 0.13 ? 32% perf-profile.self.cycles-pp.__accumulate_pelt_segments
0.12 ? 15% -0.1 0.07 ? 11% perf-profile.self.cycles-pp.rb_insert_color
0.09 ? 11% -0.0 0.05 ? 47% perf-profile.self.cycles-pp.kmem_cache_alloc
0.04 ? 71% +0.0 0.07 ? 10% perf-profile.self.cycles-pp.rcu_nocb_flush_deferred_wakeup
0.03 ? 70% +0.0 0.07 ? 26% perf-profile.self.cycles-pp.update_group_capacity
0.00 +0.1 0.07 ? 18% perf-profile.self.cycles-pp.ksoftirqd_running
0.14 ? 13% +0.1 0.22 ? 10% perf-profile.self.cycles-pp.exit_to_user_mode_prepare
0.05 ? 45% +0.1 0.14 ? 9% perf-profile.self.cycles-pp.run_rebalance_domains
0.17 ? 19% +0.1 0.29 ? 10% perf-profile.self.cycles-pp.irq_exit_rcu
0.02 ?141% +0.2 0.17 ? 32% perf-profile.self.cycles-pp.rcu_core
0.24 ? 20% +0.2 0.45 ? 14% perf-profile.self.cycles-pp.idle_cpu
0.00 +0.2 0.25 ? 21% perf-profile.self.cycles-pp.sched_clock_local
0.32 ? 7% +0.3 0.63 ? 10% perf-profile.self.cycles-pp.__irqentry_text_end
0.14 ? 16% +0.3 0.49 ? 14% perf-profile.self.cycles-pp.__softirqentry_text_start
2.15 ? 12% +1.5 3.64 ? 5% perf-profile.self.cycles-pp.ktime_get
0.00 +54.2 54.23 ? 8% perf-profile.self.cycles-pp.read_hpet



stress-ng.time.user_time

5200 +--------------------------------------------------------------------+
| OO O O O O O O O |
5000 |-+OO OO OO O O O O O O OO O OO O O O O OO O |
4800 |O+ O O O O O O |
| |
4600 |-+ |
4400 |-+ |
| |
4200 |-+ + + + + + + +. + |
4000 |-+ +. + ++ :: :+ :: : + + :.+ : : + : |
| + +. : + : : : :: : : :: : :+ ++ + : : : : : : : |
3800 |:.+ + :.+ + .+ : : :: : + :: : + :: + :.+ :: :.+ ::|
3600 |++ + + + +.+ + + : + : + : + :|
| + + + +|
3400 +--------------------------------------------------------------------+


stress-ng.time.percent_of_cpu_this_job_got

9000 +--------------------------------------------------------------------+
| |
8500 |-+ OO O O OOO OO O O O OO O O OO O |
| OO OO O O OO O O OO O O O OO O |
|O O O |
8000 |-+ |
| |
7500 |-+ |
| |
7000 |-+ + + + + + + + |
| +. + :: :: :: : .+ : :+ : |
| + +. : + : ++ :: : + :: : :.+++ +.+ : : : + : : |
6500 |:.+ + :.+ + + : ::+ + + :: + + :: + :.+ :: :.+ ::|
|+ + + :+ +.: + + : + : + : + :|
6000 +--------------------------------------------------------------------+


stress-ng.lockbus.ops

250000 +------------------------------------------------------------------+
| ++ +.++.+++.++.+ +++ + ++.+ ++ ++ + +.++ + ++.+|
245000 |-+ |
| |
| |
240000 |-+ |
| |
235000 |-+ |
| O |
230000 |-+ O O O O O |
|O O O O O O O O O |
| O O O O O |
225000 |-+O OO O O O OO O O O O OO |
| O O O O O O |
220000 +------------------------------------------------------------------+


stress-ng.lockbus.ops_per_sec

4150 +--------------------------------------------------------------------+
4100 |-+ +.+ ++ ++ + +.+ +.+ +.+ |
| |
4050 |-+ |
4000 |-+ |
| |
3950 |-+ |
3900 |-+ |
3850 |-+ O O |
|O O O O O O |
3800 |-+ O O O O O O O |
3750 |-+ O OO O O O |
| O OO O O O O O OO O O |
3700 |-+ OO O O O O O |
3650 +--------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (56.49 kB)
config-5.13.0-rc1-00001-g8901ecc2315b (176.78 kB)
job-script (8.28 kB)
job.yaml (5.74 kB)
reproduce (350.00 B)
dmesg.xz (27.83 kB)

2021-05-26 11:25:38

by Feng Tang

Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Sat, May 22, 2021 at 09:08:27AM -0700, Paul E. McKenney wrote:
> On Fri, May 21, 2021 at 06:56:17AM -0700, Paul E. McKenney wrote:
> > On Fri, May 21, 2021 at 04:33:22PM +0800, kernel test robot wrote:
> > >
> > >
> > > Greeting,
> > >
> > > FYI, we noticed a -9.5% regression of stress-ng.lockbus.ops_per_sec due to commit:
> > >
> > >
> > > commit: 8901ecc2315b850f35a7b8c1b73b12388b72aa78 ("clocksource: Retry clock read if long delays detected")
> > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > >
> > >
> > > in testcase: stress-ng
> > > on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
> > > with following parameters:
> > >
> > > nr_threads: 100%
> > > disk: 1HDD
> > > testtime: 60s
> > > class: memory
> > > test: lockbus
> > > cpufreq_governor: performance
> > > ucode: 0x5003006
> > >
> > >
> > > please be noted below in dmesg.xz (attached)
> > > [ 28.110351]
> > > [ 28.302357] hrtimer: interrupt took 1878423 ns
> > > [ 29.690760] clocksource: timekeeping watchdog on CPU53: hpet read-back delay of 169583ns, attempt 4, marking unstable
> > > [ 29.860306] tsc: Marking TSC unstable due to clocksource watchdog
> > > [ 30.559390] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > > [ 30.726282] sched_clock: Marking unstable (30052964508, 499342225)<-(30915547410, -363240730)
> > > [ 31.620401] clocksource: Switched to clocksource hpet
> >
> > If I am reading the dmesg correctly, there were many interrupts that
> > prevented a good clock read. This sound to me like a bug that the
> > clocksource watchdog located, but please let me know if this is not
> > the case.
> >
> > There are also the later "perf: interrupt took too long" messages.
>
> And of course, increasing the clocksource.max_cswd_read_retries module
> boot parameter (or clocksource.max_read_retries in the earlier commits,
> which I will fix) can work around short bursts of NMIs. Or long bursts
> of NMIs, if you set this kernel boot parameter large enough.

I reproduced it on a borrowed bare-metal 4-node, 96C/192T Xeon, with
the latest stress-ng code from https://github.com/ColinIanKing/stress-ng.git.
(A 2-socket box should also be able to reproduce it.)

This 'lockbus' sub-testcase seems to be an extreme stress case: it sits
in a loop doing "lock" operations:

c8: f0 83 00 01 lock addl $0x1,(%rax)
cc: f0 83 40 04 01 lock addl $0x1,0x4(%rax)
d1: f0 83 40 08 01 lock addl $0x1,0x8(%rax)
d6: f0 83 40 0c 01 lock addl $0x1,0xc(%rax)
db: f0 83 40 10 01 lock addl $0x1,0x10(%rax)
e0: f0 83 40 14 01 lock addl $0x1,0x14(%rax)
e5: f0 83 40 18 01 lock addl $0x1,0x18(%rax)
ea: f0 83 40 1c 01 lock addl $0x1,0x1c(%rax)
ef: f0 83 01 00 lock addl $0x0,(%rcx)
f3: f0 83 01 00 lock addl $0x0,(%rcx)
f7: f0 83 01 00 lock addl $0x0,(%rcx)
fb: f0 83 01 00 lock addl $0x0,(%rcx)
ff: f0 83 01 00 lock addl $0x0,(%rcx)
103: f0 83 01 00 lock addl $0x0,(%rcx)
107: f0 83 01 00 lock addl $0x0,(%rcx)
...

(The C source file and objdump output are attached FYI.)

So the watchdog read (read_hpet() here) sometimes does take a very
long time (hundreds of microseconds), which trips this sanity
read check and causes the TSC to be marked 'unstable'.
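
To make the mechanism concrete, below is a minimal userspace sketch of
that read-back check. This is only an illustrative approximation, not
the kernel's clocksource code: the names (now_ns, watchdog_read,
WD_READ_DELAY_NS, MAX_READ_RETRIES), the 100us threshold, and the use of
clock_gettime() in place of the HPET and TSC reads are all stand-ins,
and the retry bound merely plays the role of the
clocksource.max_cswd_read_retries parameter Paul mentions above.

/* wd-sketch.c: build with gcc -O2 -o wd-sketch wd-sketch.c */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Illustrative values only. */
#define WD_READ_DELAY_NS	100000ULL
#define MAX_READ_RETRIES	4

static uint64_t now_ns(clockid_t id)
{
	struct timespec ts;

	clock_gettime(id, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Read the "watchdog" clock, then the clock under test, then the
 * "watchdog" again.  If the two watchdog reads are too far apart, the
 * middle read cannot be trusted, so retry; after too many retries,
 * give up and report the clock as unstable.  Here CLOCK_MONOTONIC
 * stands in for the HPET watchdog and CLOCK_REALTIME for the TSC.
 */
static int watchdog_read(uint64_t *csnow, uint64_t *wdnow)
{
	for (int attempt = 1; attempt <= MAX_READ_RETRIES; attempt++) {
		uint64_t wd_start = now_ns(CLOCK_MONOTONIC);	/* watchdog read */
		*csnow = now_ns(CLOCK_REALTIME);		/* clock under test */
		*wdnow = now_ns(CLOCK_MONOTONIC);		/* watchdog read-back */

		if (*wdnow - wd_start < WD_READ_DELAY_NS)
			return 1;	/* reads were close together; trust them */

		fprintf(stderr, "read-back delay of %lluns, attempt %d\n",
			(unsigned long long)(*wdnow - wd_start), attempt);
	}
	return 0;	/* persistent long delays: caller marks the clock unstable */
}

int main(void)
{
	uint64_t cs, wd;

	if (watchdog_read(&cs, &wd))
		printf("good read: cs=%llu wd=%llu\n",
		       (unsigned long long)cs, (unsigned long long)wd);
	else
		fprintf(stderr, "marking clocksource unstable\n");
	return 0;
}

When every attempt exceeds the threshold, as happens under the lockbus
load, the loop runs out of retries and the clock is declared unstable,
which matches the "attempt 4, marking unstable" line in the dmesg above;
raising clocksource.max_cswd_read_retries as Paul suggests only widens
that retry budget.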

As the man page of stress-ng puts it:
"Use stress-ng with caution as some of the tests can make a system
run hot on poorly designed hardware and also can cause excessive
system thrashing which may be difficult to stop"

I don't think this 'lockbus' is close to any real-world usage.

Thanks,
Feng


Attachments:
(No filename) (3.67 kB)
stress-lockbus.c (5.05 kB)
objdump_lockbus.log (11.87 kB)

2021-05-26 13:57:37

by Paul E. McKenney

Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Wed, May 26, 2021 at 02:49:22PM +0800, Feng Tang wrote:
> On Sat, May 22, 2021 at 09:08:27AM -0700, Paul E. McKenney wrote:
> > On Fri, May 21, 2021 at 06:56:17AM -0700, Paul E. McKenney wrote:
> > > On Fri, May 21, 2021 at 04:33:22PM +0800, kernel test robot wrote:
> > > >
> > > >
> > > > Greeting,
> > > >
> > > > FYI, we noticed a -9.5% regression of stress-ng.lockbus.ops_per_sec due to commit:
> > > >
> > > >
> > > > commit: 8901ecc2315b850f35a7b8c1b73b12388b72aa78 ("clocksource: Retry clock read if long delays detected")
> > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > >
> > > >
> > > > in testcase: stress-ng
> > > > on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
> > > > with following parameters:
> > > >
> > > > nr_threads: 100%
> > > > disk: 1HDD
> > > > testtime: 60s
> > > > class: memory
> > > > test: lockbus
> > > > cpufreq_governor: performance
> > > > ucode: 0x5003006
> > > >
> > > >
> > > > please be noted below in dmesg.xz (attached)
> > > > [ 28.110351]
> > > > [ 28.302357] hrtimer: interrupt took 1878423 ns
> > > > [ 29.690760] clocksource: timekeeping watchdog on CPU53: hpet read-back delay of 169583ns, attempt 4, marking unstable
> > > > [ 29.860306] tsc: Marking TSC unstable due to clocksource watchdog
> > > > [ 30.559390] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > > > [ 30.726282] sched_clock: Marking unstable (30052964508, 499342225)<-(30915547410, -363240730)
> > > > [ 31.620401] clocksource: Switched to clocksource hpet
> > >
> > > If I am reading the dmesg correctly, there were many interrupts that
> > > prevented a good clock read. This sound to me like a bug that the
> > > clocksource watchdog located, but please let me know if this is not
> > > the case.
> > >
> > > There are also the later "perf: interrupt took too long" messages.
> >
> > And of course, increasing the clocksource.max_cswd_read_retries module
> > boot parameter (or clocksource.max_read_retries in the earlier commits,
> > which I will fix) can work around short bursts of NMIs. Or long bursts
> > of NMIs, if you set this kernel boot parameter large enough.
>
> I reproduced it on a borrowed baremetal 4 nodes, 96C/192T Xeon, with
> latest stress-ng code https://github.com/ColinIanKing/stress-ng.git.
> (2 sockets box should also be able to reproduce it)
>
> Seems this sub testcase 'lockbus' is a extreme stress case, by loop
> doing "lock" operation:
>
> c8: f0 83 00 01 lock addl $0x1,(%rax)
> cc: f0 83 40 04 01 lock addl $0x1,0x4(%rax)
> d1: f0 83 40 08 01 lock addl $0x1,0x8(%rax)
> d6: f0 83 40 0c 01 lock addl $0x1,0xc(%rax)
> db: f0 83 40 10 01 lock addl $0x1,0x10(%rax)
> e0: f0 83 40 14 01 lock addl $0x1,0x14(%rax)
> e5: f0 83 40 18 01 lock addl $0x1,0x18(%rax)
> ea: f0 83 40 1c 01 lock addl $0x1,0x1c(%rax)
> ef: f0 83 01 00 lock addl $0x0,(%rcx)
> f3: f0 83 01 00 lock addl $0x0,(%rcx)
> f7: f0 83 01 00 lock addl $0x0,(%rcx)
> fb: f0 83 01 00 lock addl $0x0,(%rcx)
> ff: f0 83 01 00 lock addl $0x0,(%rcx)
> 103: f0 83 01 00 lock addl $0x0,(%rcx)
> 107: f0 83 01 00 lock addl $0x0,(%rcx)
> ...
>
> (The source c file and objdump are attached fyi)
>
> So the watchdog read (read_hpet() here) sometimes does take very
> long time (hundreds of microseconds) which breaks this sanity
> read check, and cause 'unstable' tsc.
>
> As from the man page of stress-ng:
> "Use stress-ng with caution as some of the tests can make a system
> run hot on poorly designed hardware and also can cause excessive
> system thrashing which may be difficult to stop"
>
> I don't think this 'lockbus' is close to any real-world usage.

Heh! In the past, I have had to adjust Linux-kernel RCU in order to
avoid having too many locked operations. So, yes, I agree that this
test result should not require a change to the clocksource watchdog.

I just rebased to eliminate the pointless name change in the middle
of the series from max_read_retries to max_cswd_read_retries, and will
repost later today or tomorrow.

Thanx, Paul

> Thanks,
> Feng

> /*
> * Copyright (C) 2013-2021 Canonical, Ltd.
> *
> * This program is free software; you can redistribute it and/or
> * modify it under the terms of the GNU General Public License
> * as published by the Free Software Foundation; either version 2
> * of the License, or (at your option) any later version.
> *
> * This program is distributed in the hope that it will be useful,
> * but WITHOUT ANY WARRANTY; without even the implied warranty of
> * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> * GNU General Public License for more details.
> *
> * You should have received a copy of the GNU General Public License
> * along with this program; if not, write to the Free Software
> * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
> *
> * This code is a complete clean re-write of the stress tool by
> * Colin Ian King <[email protected]> and attempts to be
> * backwardly compatible with the stress tool by Amos Waterland
> * <[email protected]> but has more stress tests and more
> * functionality.
> *
> */
> #include "stress-ng.h"
>
> static const stress_help_t help[] = {
>         { NULL, "lockbus N",     "start N workers locking a memory increment" },
>         { NULL, "lockbus-ops N", "stop after N lockbus bogo operations" },
>         { NULL, NULL,            NULL }
> };
>
> #if (((defined(__GNUC__) || defined(__clang__)) && \
>        defined(STRESS_ARCH_X86)) || \
>       (defined(__GNUC__) && \
>        defined(HAVE_ATOMIC_ADD_FETCH) && \
>        defined(__ATOMIC_SEQ_CST) && \
>        NEED_GNUC(4,7,0) && \
>        defined(STRESS_ARCH_ARM)))
>
> #if defined(HAVE_ATOMIC_ADD_FETCH)
> #define MEM_LOCK(ptr, inc)                                      \
>         do {                                                    \
>                 __atomic_add_fetch(ptr, inc, __ATOMIC_SEQ_CST); \
>         } while (0)
> #else
> #define MEM_LOCK(ptr, inc)                                      \
>         do {                                                    \
>                 asm volatile("lock addl %1,%0" :                \
>                              "+m" (*ptr) :                      \
>                              "ir" (inc));                       \
>         } while (0)
> #endif
>
> #define BUFFER_SIZE (1024 * 1024 * 16)
> #define CHUNK_SIZE (64 * 4)
>
> #define MEM_LOCK_AND_INC(ptr, inc)              \
>         do {                                    \
>                 MEM_LOCK(ptr, inc);             \
>                 ptr++;                          \
>         } while (0)
>
> #define MEM_LOCK_AND_INCx8(ptr, inc)            \
>         do {                                    \
>                 MEM_LOCK_AND_INC(ptr, inc);     \
>                 MEM_LOCK_AND_INC(ptr, inc);     \
>                 MEM_LOCK_AND_INC(ptr, inc);     \
>                 MEM_LOCK_AND_INC(ptr, inc);     \
>                 MEM_LOCK_AND_INC(ptr, inc);     \
>                 MEM_LOCK_AND_INC(ptr, inc);     \
>                 MEM_LOCK_AND_INC(ptr, inc);     \
>                 MEM_LOCK_AND_INC(ptr, inc);     \
>         } while (0)
>
> #define MEM_LOCKx8(ptr)                         \
>         do {                                    \
>                 MEM_LOCK(ptr, 0);               \
>                 MEM_LOCK(ptr, 0);               \
>                 MEM_LOCK(ptr, 0);               \
>                 MEM_LOCK(ptr, 0);               \
>                 MEM_LOCK(ptr, 0);               \
>                 MEM_LOCK(ptr, 0);               \
>                 MEM_LOCK(ptr, 0);               \
>                 MEM_LOCK(ptr, 0);               \
>         } while (0)
>
> #if defined(STRESS_ARCH_X86)
> static sigjmp_buf jmp_env;
> static bool do_splitlock;
>
> static void NORETURN MLOCKED_TEXT stress_sigbus_handler(int signum)
> {
>         (void)signum;
>
>         do_splitlock = false;
>
>         siglongjmp(jmp_env, 1);
> }
> #endif
>
> /*
> * stress_lockbus()
> * stress memory with lock and increment
> */
> static int stress_lockbus(const stress_args_t *args)
> {
>         uint32_t *buffer;
>         int flags = MAP_ANONYMOUS | MAP_SHARED;
> #if defined(STRESS_ARCH_X86)
>         uint32_t *splitlock_ptr1, *splitlock_ptr2;
>
>         if (stress_sighandler(args->name, SIGBUS, stress_sigbus_handler, NULL) < 0)
>                 return EXIT_FAILURE;
> #endif
>
> #if defined(MAP_POPULATE)
>         flags |= MAP_POPULATE;
> #endif
>         buffer = (uint32_t*)mmap(NULL, BUFFER_SIZE, PROT_READ | PROT_WRITE, flags, -1, 0);
>         if (buffer == MAP_FAILED) {
>                 int rc = exit_status(errno);
>
>                 pr_err("%s: mmap failed\n", args->name);
>                 return rc;
>         }
>
> #if defined(STRESS_ARCH_X86)
>         /* Split lock on a page boundary */
>         splitlock_ptr1 = (uint32_t *)(((uint8_t *)buffer) + args->page_size - (sizeof(uint32_t) >> 1));
>         /* Split lock on a cache boundary */
>         splitlock_ptr2 = (uint32_t *)(((uint8_t *)buffer) + 64 - (sizeof(uint32_t) >> 1));
>         do_splitlock = true;
>         if (sigsetjmp(jmp_env, 1) && !keep_stressing(args))
>                 goto done;
> #endif
>         stress_set_proc_state(args->name, STRESS_STATE_RUN);
>
>         do {
>                 uint32_t *ptr0 = buffer + ((stress_mwc32() % (BUFFER_SIZE - CHUNK_SIZE)) >> 2);
> #if defined(STRESS_ARCH_X86)
>                 uint32_t *ptr1 = do_splitlock ? splitlock_ptr1 : ptr0;
>                 uint32_t *ptr2 = do_splitlock ? splitlock_ptr2 : ptr0;
> #else
>                 uint32_t *ptr1 = ptr0;
>                 uint32_t *ptr2 = ptr0;
> #endif
>                 const uint32_t inc = 1;
>
>                 MEM_LOCK_AND_INCx8(ptr0, inc);
>                 MEM_LOCKx8(ptr1);
>                 MEM_LOCKx8(ptr2);
>                 MEM_LOCK_AND_INCx8(ptr0, inc);
>                 MEM_LOCKx8(ptr1);
>                 MEM_LOCKx8(ptr2);
>                 MEM_LOCK_AND_INCx8(ptr0, inc);
>                 MEM_LOCKx8(ptr1);
>                 MEM_LOCKx8(ptr2);
>                 MEM_LOCK_AND_INCx8(ptr0, inc);
>                 MEM_LOCKx8(ptr1);
>                 MEM_LOCKx8(ptr2);
>
>                 inc_counter(args);
>         } while (keep_stressing(args));
>
> #if defined(STRESS_ARCH_X86)
> done:
> #endif
>         stress_set_proc_state(args->name, STRESS_STATE_DEINIT);
>
>         (void)munmap((void *)buffer, BUFFER_SIZE);
>
>         return EXIT_SUCCESS;
> }
>
> stressor_info_t stress_lockbus_info = {
>         .stressor = stress_lockbus,
>         .class = CLASS_CPU_CACHE | CLASS_MEMORY,
>         .help = help
> };
> #else
> stressor_info_t stress_lockbus_info = {
>         .stressor = stress_not_implemented,
>         .class = CLASS_CPU_CACHE | CLASS_MEMORY,
>         .help = help
> };
> #endif

>
> stress-lockbus.o: file format elf64-x86-64
>
>
> Disassembly of section .text:
>
> 0000000000000000 <stress_lockbus>:
> 0: 53 push %rbx
> 1: 48 8d 15 00 00 00 00 lea 0x0(%rip),%rdx # 8 <stress_lockbus+0x8>
> 8: 31 c9 xor %ecx,%ecx
> a: be 07 00 00 00 mov $0x7,%esi
> f: bb 01 00 00 00 mov $0x1,%ebx
> 14: 48 83 ec 20 sub $0x20,%rsp
> 18: 48 89 3c 24 mov %rdi,(%rsp)
> 1c: 48 8b 7f 10 mov 0x10(%rdi),%rdi
> 20: e8 00 00 00 00 callq 25 <stress_lockbus+0x25>
> 25: 85 c0 test %eax,%eax
> 27: 0f 88 e9 02 00 00 js 316 <stress_lockbus+0x316>
> 2d: 45 31 c9 xor %r9d,%r9d
> 30: 31 ff xor %edi,%edi
> 32: 41 b8 ff ff ff ff mov $0xffffffff,%r8d
> 38: b9 21 80 00 00 mov $0x8021,%ecx
> 3d: ba 03 00 00 00 mov $0x3,%edx
> 42: be 00 00 00 01 mov $0x1000000,%esi
> 47: e8 00 00 00 00 callq 4c <stress_lockbus+0x4c>
> 4c: 48 83 f8 ff cmp $0xffffffffffffffff,%rax
> 50: 48 89 44 24 08 mov %rax,0x8(%rsp)
> 55: 0f 84 85 02 00 00 je 2e0 <stress_lockbus+0x2e0>
> 5b: 48 8b 04 24 mov (%rsp),%rax
> 5f: 48 8b 5c 24 08 mov 0x8(%rsp),%rbx
> 64: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 6b <stress_lockbus+0x6b>
> 6b: be 01 00 00 00 mov $0x1,%esi
> 70: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 77 <stress_lockbus+0x77>
> 77: 48 8b 40 30 mov 0x30(%rax),%rax
> 7b: 48 8d 44 03 fe lea -0x2(%rbx,%rax,1),%rax
> 80: 48 89 44 24 18 mov %rax,0x18(%rsp)
> 85: 48 8d 43 3e lea 0x3e(%rbx),%rax
> 89: 48 89 44 24 10 mov %rax,0x10(%rsp)
> 8e: e8 00 00 00 00 callq 93 <stress_lockbus+0x93>
> 93: 85 c0 test %eax,%eax
> 95: 0f 85 85 02 00 00 jne 320 <stress_lockbus+0x320>
> 9b: 48 8b 04 24 mov (%rsp),%rax
> 9f: be 02 00 00 00 mov $0x2,%esi
> a4: 48 8b 78 10 mov 0x10(%rax),%rdi
> a8: e8 00 00 00 00 callq ad <stress_lockbus+0xad>
> ad: 48 8b 5c 24 18 mov 0x18(%rsp),%rbx
> b2: e9 da 01 00 00 jmpq 291 <stress_lockbus+0x291>
> b7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
> be: 00 00
> c0: 48 8b 54 24 10 mov 0x10(%rsp),%rdx
> c5: 48 89 d9 mov %rbx,%rcx
> c8: f0 83 00 01 lock addl $0x1,(%rax)
> cc: f0 83 40 04 01 lock addl $0x1,0x4(%rax)
> d1: f0 83 40 08 01 lock addl $0x1,0x8(%rax)
> d6: f0 83 40 0c 01 lock addl $0x1,0xc(%rax)
> db: f0 83 40 10 01 lock addl $0x1,0x10(%rax)
> e0: f0 83 40 14 01 lock addl $0x1,0x14(%rax)
> e5: f0 83 40 18 01 lock addl $0x1,0x18(%rax)
> ea: f0 83 40 1c 01 lock addl $0x1,0x1c(%rax)
> ef: f0 83 01 00 lock addl $0x0,(%rcx)
> f3: f0 83 01 00 lock addl $0x0,(%rcx)
> f7: f0 83 01 00 lock addl $0x0,(%rcx)
> fb: f0 83 01 00 lock addl $0x0,(%rcx)
> ff: f0 83 01 00 lock addl $0x0,(%rcx)
> 103: f0 83 01 00 lock addl $0x0,(%rcx)
> 107: f0 83 01 00 lock addl $0x0,(%rcx)
> 10b: f0 83 01 00 lock addl $0x0,(%rcx)
> 10f: f0 83 02 00 lock addl $0x0,(%rdx)
> 113: f0 83 02 00 lock addl $0x0,(%rdx)
> 117: f0 83 02 00 lock addl $0x0,(%rdx)
> 11b: f0 83 02 00 lock addl $0x0,(%rdx)
> 11f: f0 83 02 00 lock addl $0x0,(%rdx)
> 123: f0 83 02 00 lock addl $0x0,(%rdx)
> 127: f0 83 02 00 lock addl $0x0,(%rdx)
> 12b: f0 83 02 00 lock addl $0x0,(%rdx)
> 12f: f0 83 40 20 01 lock addl $0x1,0x20(%rax)
> 134: f0 83 40 24 01 lock addl $0x1,0x24(%rax)
> 139: f0 83 40 28 01 lock addl $0x1,0x28(%rax)
> 13e: f0 83 40 2c 01 lock addl $0x1,0x2c(%rax)
> 143: f0 83 40 30 01 lock addl $0x1,0x30(%rax)
> 148: f0 83 40 34 01 lock addl $0x1,0x34(%rax)
> 14d: f0 83 40 38 01 lock addl $0x1,0x38(%rax)
> 152: f0 83 40 3c 01 lock addl $0x1,0x3c(%rax)
> 157: f0 83 01 00 lock addl $0x0,(%rcx)
> 15b: f0 83 01 00 lock addl $0x0,(%rcx)
> 15f: f0 83 01 00 lock addl $0x0,(%rcx)
> 163: f0 83 01 00 lock addl $0x0,(%rcx)
> 167: f0 83 01 00 lock addl $0x0,(%rcx)
> 16b: f0 83 01 00 lock addl $0x0,(%rcx)
> 16f: f0 83 01 00 lock addl $0x0,(%rcx)
> 173: f0 83 01 00 lock addl $0x0,(%rcx)
> 177: f0 83 02 00 lock addl $0x0,(%rdx)
> 17b: f0 83 02 00 lock addl $0x0,(%rdx)
> 17f: f0 83 02 00 lock addl $0x0,(%rdx)
> 183: f0 83 02 00 lock addl $0x0,(%rdx)
> 187: f0 83 02 00 lock addl $0x0,(%rdx)
> 18b: f0 83 02 00 lock addl $0x0,(%rdx)
> 18f: f0 83 02 00 lock addl $0x0,(%rdx)
> 193: f0 83 02 00 lock addl $0x0,(%rdx)
> 197: f0 83 40 40 01 lock addl $0x1,0x40(%rax)
> 19c: f0 83 40 44 01 lock addl $0x1,0x44(%rax)
> 1a1: f0 83 40 48 01 lock addl $0x1,0x48(%rax)
> 1a6: f0 83 40 4c 01 lock addl $0x1,0x4c(%rax)
> 1ab: f0 83 40 50 01 lock addl $0x1,0x50(%rax)
> 1b0: f0 83 40 54 01 lock addl $0x1,0x54(%rax)
> 1b5: f0 83 40 58 01 lock addl $0x1,0x58(%rax)
> 1ba: f0 83 40 5c 01 lock addl $0x1,0x5c(%rax)
> 1bf: f0 83 01 00 lock addl $0x0,(%rcx)
> 1c3: f0 83 01 00 lock addl $0x0,(%rcx)
> 1c7: f0 83 01 00 lock addl $0x0,(%rcx)
> 1cb: f0 83 01 00 lock addl $0x0,(%rcx)
> 1cf: f0 83 01 00 lock addl $0x0,(%rcx)
> 1d3: f0 83 01 00 lock addl $0x0,(%rcx)
> 1d7: f0 83 01 00 lock addl $0x0,(%rcx)
> 1db: f0 83 01 00 lock addl $0x0,(%rcx)
> 1df: f0 83 02 00 lock addl $0x0,(%rdx)
> 1e3: f0 83 02 00 lock addl $0x0,(%rdx)
> 1e7: f0 83 02 00 lock addl $0x0,(%rdx)
> 1eb: f0 83 02 00 lock addl $0x0,(%rdx)
> 1ef: f0 83 02 00 lock addl $0x0,(%rdx)
> 1f3: f0 83 02 00 lock addl $0x0,(%rdx)
> 1f7: f0 83 02 00 lock addl $0x0,(%rdx)
> 1fb: f0 83 02 00 lock addl $0x0,(%rdx)
> 1ff: f0 83 40 60 01 lock addl $0x1,0x60(%rax)
> 204: f0 83 40 64 01 lock addl $0x1,0x64(%rax)
> 209: f0 83 40 68 01 lock addl $0x1,0x68(%rax)
> 20e: f0 83 40 6c 01 lock addl $0x1,0x6c(%rax)
> 213: f0 83 40 70 01 lock addl $0x1,0x70(%rax)
> 218: f0 83 40 74 01 lock addl $0x1,0x74(%rax)
> 21d: f0 83 40 78 01 lock addl $0x1,0x78(%rax)
> 222: f0 83 40 7c 01 lock addl $0x1,0x7c(%rax)
> 227: f0 83 01 00 lock addl $0x0,(%rcx)
> 22b: f0 83 01 00 lock addl $0x0,(%rcx)
> 22f: f0 83 01 00 lock addl $0x0,(%rcx)
> 233: f0 83 01 00 lock addl $0x0,(%rcx)
> 237: f0 83 01 00 lock addl $0x0,(%rcx)
> 23b: f0 83 01 00 lock addl $0x0,(%rcx)
> 23f: f0 83 01 00 lock addl $0x0,(%rcx)
> 243: f0 83 01 00 lock addl $0x0,(%rcx)
> 247: f0 83 02 00 lock addl $0x0,(%rdx)
> 24b: f0 83 02 00 lock addl $0x0,(%rdx)
> 24f: f0 83 02 00 lock addl $0x0,(%rdx)
> 253: f0 83 02 00 lock addl $0x0,(%rdx)
> 257: f0 83 02 00 lock addl $0x0,(%rdx)
> 25b: f0 83 02 00 lock addl $0x0,(%rdx)
> 25f: f0 83 02 00 lock addl $0x0,(%rdx)
> 263: f0 83 02 00 lock addl $0x0,(%rdx)
> 267: 48 8b 3c 24 mov (%rsp),%rdi
> 26b: 48 8b 47 08 mov 0x8(%rdi),%rax
> 26f: c6 00 00 movb $0x0,(%rax)
> 272: 48 8b 07 mov (%rdi),%rax
> 275: 48 83 00 01 addq $0x1,(%rax)
> 279: 48 8b 47 08 mov 0x8(%rdi),%rax
> 27d: c6 00 01 movb $0x1,(%rax)
> 280: 48 8d 77 18 lea 0x18(%rdi),%rsi
> 284: e8 00 00 00 00 callq 289 <stress_lockbus+0x289>
> 289: 84 c0 test %al,%al
> 28b: 0f 84 a4 00 00 00 je 335 <stress_lockbus+0x335>
> 291: e8 00 00 00 00 callq 296 <stress_lockbus+0x296>
> 296: 89 c1 mov %eax,%ecx
> 298: 48 89 ca mov %rcx,%rdx
> 29b: 48 c1 e2 10 shl $0x10,%rdx
> 29f: 48 01 ca add %rcx,%rdx
> 2a2: 48 c1 e2 0f shl $0xf,%rdx
> 2a6: 48 01 ca add %rcx,%rdx
> 2a9: 48 c1 ea 37 shr $0x37,%rdx
> 2ad: 69 d2 00 ff ff 00 imul $0xffff00,%edx,%edx
> 2b3: 48 29 d0 sub %rdx,%rax
> 2b6: 83 e0 fc and $0xfffffffc,%eax
> 2b9: 48 03 44 24 08 add 0x8(%rsp),%rax
> 2be: 80 3d 00 00 00 00 00 cmpb $0x0,0x0(%rip) # 2c5 <stress_lockbus+0x2c5>
> 2c5: 0f 85 f5 fd ff ff jne c0 <stress_lockbus+0xc0>
> 2cb: 48 89 c1 mov %rax,%rcx
> 2ce: 48 89 c2 mov %rax,%rdx
> 2d1: e9 f2 fd ff ff jmpq c8 <stress_lockbus+0xc8>
> 2d6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
> 2dd: 00 00 00
> 2e0: e8 00 00 00 00 callq 2e5 <stress_lockbus+0x2e5>
> 2e5: 8b 00 mov (%rax),%eax
> 2e7: 83 f8 1c cmp $0x1c,%eax
> 2ea: 74 74 je 360 <stress_lockbus+0x360>
> 2ec: 83 f8 26 cmp $0x26,%eax
> 2ef: bb 04 00 00 00 mov $0x4,%ebx
> 2f4: 74 0a je 300 <stress_lockbus+0x300>
> 2f6: 83 f8 0c cmp $0xc,%eax
> 2f9: 74 65 je 360 <stress_lockbus+0x360>
> 2fb: bb 01 00 00 00 mov $0x1,%ebx
> 300: 48 8b 04 24 mov (%rsp),%rax
> 304: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 30b <stress_lockbus+0x30b>
> 30b: 48 8b 70 10 mov 0x10(%rax),%rsi
> 30f: 31 c0 xor %eax,%eax
> 311: e8 00 00 00 00 callq 316 <stress_lockbus+0x316>
> 316: 48 83 c4 20 add $0x20,%rsp
> 31a: 89 d8 mov %ebx,%eax
> 31c: 5b pop %rbx
> 31d: c3 retq
> 31e: 66 90 xchg %ax,%ax
> 320: 48 8b 3c 24 mov (%rsp),%rdi
> 324: 48 8d 77 18 lea 0x18(%rdi),%rsi
> 328: e8 00 00 00 00 callq 32d <stress_lockbus+0x32d>
> 32d: 84 c0 test %al,%al
> 32f: 0f 85 66 fd ff ff jne 9b <stress_lockbus+0x9b>
> 335: 48 8b 04 24 mov (%rsp),%rax
> 339: be 03 00 00 00 mov $0x3,%esi
> 33e: 31 db xor %ebx,%ebx
> 340: 48 8b 78 10 mov 0x10(%rax),%rdi
> 344: e8 00 00 00 00 callq 349 <stress_lockbus+0x349>
> 349: 48 8b 7c 24 08 mov 0x8(%rsp),%rdi
> 34e: be 00 00 00 01 mov $0x1000000,%esi
> 353: e8 00 00 00 00 callq 358 <stress_lockbus+0x358>
> 358: 48 83 c4 20 add $0x20,%rsp
> 35c: 89 d8 mov %ebx,%eax
> 35e: 5b pop %rbx
> 35f: c3 retq
> 360: bb 03 00 00 00 mov $0x3,%ebx
> 365: eb 99 jmp 300 <stress_lockbus+0x300>
>
> Disassembly of section mlocked_text:
>
> 0000000000000000 <stress_sigbus_handler>:
> 0: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 7 <stress_sigbus_handler+0x7>
> 7: 48 83 ec 08 sub $0x8,%rsp
> b: be 01 00 00 00 mov $0x1,%esi
> 10: c6 05 00 00 00 00 00 movb $0x0,0x0(%rip) # 17 <stress_sigbus_handler+0x17>
> 17: e8 00 00 00 00 callq 1c <stress_sigbus_handler+0x1c>
>
> Disassembly of section .text.hot:
>
> 0000000000000000 <keep_stressing.isra.0>:
> 0: 0f b6 05 00 00 00 00 movzbl 0x0(%rip),%eax # 7 <keep_stressing.isra.0+0x7>
> 7: 84 c0 test %al,%al
> 9: 74 08 je 13 <keep_stressing.isra.0+0x13>
> b: 48 8b 16 mov (%rsi),%rdx
> e: 48 85 d2 test %rdx,%rdx
> 11: 75 05 jne 18 <keep_stressing.isra.0+0x18>
> 13: f3 c3 repz retq
> 15: 0f 1f 00 nopl (%rax)
> 18: 48 8b 07 mov (%rdi),%rax
> 1b: 48 3b 10 cmp (%rax),%rdx
> 1e: 0f 97 c0 seta %al
> 21: c3 retq

2021-05-27 23:51:19

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Wed, May 26, 2021 at 06:49:11AM -0700, Paul E. McKenney wrote:
> On Wed, May 26, 2021 at 02:49:22PM +0800, Feng Tang wrote:
> > On Sat, May 22, 2021 at 09:08:27AM -0700, Paul E. McKenney wrote:
> > > On Fri, May 21, 2021 at 06:56:17AM -0700, Paul E. McKenney wrote:
> > > > On Fri, May 21, 2021 at 04:33:22PM +0800, kernel test robot wrote:
> > > > >
> > > > >
> > > > > Greeting,
> > > > >
> > > > > FYI, we noticed a -9.5% regression of stress-ng.lockbus.ops_per_sec due to commit:
> > > > >
> > > > >
> > > > > commit: 8901ecc2315b850f35a7b8c1b73b12388b72aa78 ("clocksource: Retry clock read if long delays detected")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > > >
> > > > >
> > > > > in testcase: stress-ng
> > > > > on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
> > > > > with following parameters:
> > > > >
> > > > > nr_threads: 100%
> > > > > disk: 1HDD
> > > > > testtime: 60s
> > > > > class: memory
> > > > > test: lockbus
> > > > > cpufreq_governor: performance
> > > > > ucode: 0x5003006
> > > > >
> > > > >
> > > > > please be noted below in dmesg.xz (attached)
> > > > > [ 28.110351]
> > > > > [ 28.302357] hrtimer: interrupt took 1878423 ns
> > > > > [ 29.690760] clocksource: timekeeping watchdog on CPU53: hpet read-back delay of 169583ns, attempt 4, marking unstable
> > > > > [ 29.860306] tsc: Marking TSC unstable due to clocksource watchdog
> > > > > [ 30.559390] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > > > > [ 30.726282] sched_clock: Marking unstable (30052964508, 499342225)<-(30915547410, -363240730)
> > > > > [ 31.620401] clocksource: Switched to clocksource hpet
> > > >
> > > > If I am reading the dmesg correctly, there were many interrupts that
> > > > prevented a good clock read. This sound to me like a bug that the
> > > > clocksource watchdog located, but please let me know if this is not
> > > > the case.
> > > >
> > > > There are also the later "perf: interrupt took too long" messages.
> > >
> > > And of course, increasing the clocksource.max_cswd_read_retries module
> > > boot parameter (or clocksource.max_read_retries in the earlier commits,
> > > which I will fix) can work around short bursts of NMIs. Or long bursts
> > > of NMIs, if you set this kernel boot parameter large enough.
> >
> > I reproduced it on a borrowed bare-metal 4-node, 96C/192T Xeon, with the
> > latest stress-ng code from https://github.com/ColinIanKing/stress-ng.git.
> > (A 2-socket box should also be able to reproduce it.)
> >
> > Seems this sub testcase 'lockbus' is an extreme stress case, doing
> > "lock" operations in a loop:
> >
> > c8: f0 83 00 01 lock addl $0x1,(%rax)
> > cc: f0 83 40 04 01 lock addl $0x1,0x4(%rax)
> > d1: f0 83 40 08 01 lock addl $0x1,0x8(%rax)
> > d6: f0 83 40 0c 01 lock addl $0x1,0xc(%rax)
> > db: f0 83 40 10 01 lock addl $0x1,0x10(%rax)
> > e0: f0 83 40 14 01 lock addl $0x1,0x14(%rax)
> > e5: f0 83 40 18 01 lock addl $0x1,0x18(%rax)
> > ea: f0 83 40 1c 01 lock addl $0x1,0x1c(%rax)
> > ef: f0 83 01 00 lock addl $0x0,(%rcx)
> > f3: f0 83 01 00 lock addl $0x0,(%rcx)
> > f7: f0 83 01 00 lock addl $0x0,(%rcx)
> > fb: f0 83 01 00 lock addl $0x0,(%rcx)
> > ff: f0 83 01 00 lock addl $0x0,(%rcx)
> > 103: f0 83 01 00 lock addl $0x0,(%rcx)
> > 107: f0 83 01 00 lock addl $0x0,(%rcx)
> > ...
> >
> > (The source c file and objdump are attached fyi)
> >
> > So the watchdog read (read_hpet() here) sometimes takes a very long
> > time (hundreds of microseconds), which breaks this sanity read check
> > and causes the tsc to be marked 'unstable'.
> >
> > As from the man page of stress-ng:
> > "Use stress-ng with caution as some of the tests can make a system
> > run hot on poorly designed hardware and also can cause excessive
> > system thrashing which may be difficult to stop"
> >
> > I don't think this 'lockbus' is close to any real-world usage.
>
> Heh! In the past, I have had to adjust Linux-kernel RCU in order to
> avoid having too many locked operations. So, yes, I agree that this
> test result should not require a change to the clocksource watchdog.
>
> I just rebased to eliminate the pointless name change in the middle
> of the series from max_read_retries to max_cswd_read_retries, and will
> repost later today or tomorrow.

In addition, please see below for a just-in-case out-of-tree patch that
takes a different approach in response to persistent long-latency reads.
It falls back to using the old 62.5-millisecond skew threshold and also
marks the offending clocksource for reinitialization.

Again, I believe that the current less-subtle approach will serve us well,
especially during hardware bringup, but just in case...

Thanx, Paul

------------------------------------------------------------------------

commit 48ebcfbfd877f5d9cddcc03c91352a8ca7b190af
Author: Paul E. McKenney <[email protected]>
Date: Thu May 27 11:03:28 2021 -0700

clocksource: Forgive repeated long-latency watchdog clocksource reads

Currently, the clocksource watchdog reacts to repeated long-latency
clocksource reads by marking that clocksource unstable on the theory that
these long-latency reads are a sign of a serious problem. And this theory
does in fact have real-world support in the form of firmware issues [1].

However, it is also possible to trigger this using stress-ng on what
the stress-ng man page terms "poorly designed hardware" [2]. And it
is not necessarily a bad thing for the kernel to diagnose cases where
high-stress workloads are being run on hardware that is not designed
for this sort of use.

Nevertheless, it is quite possible that real-world use will result in
some situation requiring that high-stress workloads run on hardware
not designed to accommodate them, and also requiring that the kernel
refrain from marking clocksources unstable.

Therefore, provide an out-of-tree patch that reacts to this situation
by leaving the clocksource alone, but using the old 62.5-millisecond
skew-detection threshold in response to persistent long-latency reads.
In addition, the offending clocksource is marked for re-initialization
in this case, which both restarts that clocksource with a clean bill of
health and avoids false-positive skew reports on later watchdog checks.

Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
index 01df12395c0e..b72a969f7b93 100644
--- a/kernel/time/clocksource-wdtest.c
+++ b/kernel/time/clocksource-wdtest.c
@@ -146,13 +146,12 @@ static int wdtest_func(void *arg)
else if (i <= max_cswd_read_retries)
s = ", expect message";
else
- s = ", expect clock skew";
+ s = ", expect coarse-grained clock skew check and re-initialization";
pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
WRITE_ONCE(wdtest_ktime_read_ndelays, i);
schedule_timeout_uninterruptible(2 * HZ);
WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
- WARN_ON_ONCE((i <= max_cswd_read_retries) !=
- !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
+ WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
wdtest_ktime_clocksource_reset();
}

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 4485635b69f5..6c0820779bd3 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -225,13 +225,13 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
smp_processor_id(), watchdog->name, nretries);
}
- return true;
+ return false;
}
}

- pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
smp_processor_id(), watchdog->name, wd_delay, nretries);
- return false;
+ return true;
}

static u64 csnow_mid;
@@ -355,6 +355,7 @@ static void clocksource_watchdog(struct timer_list *unused)
int next_cpu, reset_pending;
int64_t wd_nsec, cs_nsec;
struct clocksource *cs;
+ bool coarse;
u32 md;

spin_lock(&watchdog_lock);
@@ -372,11 +373,7 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}

- if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
- /* Clock readout unreliable, so give it up. */
- __clocksource_unstable(cs);
- continue;
- }
+ coarse = cs_watchdog_read(cs, &csnow, &wdnow);

/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
@@ -402,7 +399,13 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;

/* Check the deviation from the watchdog clocksource. */
- md = cs->uncertainty_margin + watchdog->uncertainty_margin;
+ if (coarse) {
+ md = 62500 * NSEC_PER_USEC;
+ cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+ pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
+ } else {
+ md = cs->uncertainty_margin + watchdog->uncertainty_margin;
+ }
if (abs(cs_nsec - wd_nsec) > md) {
pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
smp_processor_id(), cs->name);

2021-05-27 23:55:28

by Andi Kleen

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression


>
> Nevertheless, it is quite possible that real-world use will result in
> some situation requiring that high-stress workloads run on hardware
> not designed to accommodate them, and also requiring that the kernel
> refrain from marking clocksources unstable.
>
> Therefore, provide an out-of-tree patch that reacts to this situation


out-of-tree means it will not be submitted?


I think it would make sense upstream, but perhaps guarded with some option.

2021-08-02 06:13:46

by Chao Gao

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

[snip]
>commit 48ebcfbfd877f5d9cddcc03c91352a8ca7b190af
>Author: Paul E. McKenney <[email protected]>
>Date: Thu May 27 11:03:28 2021 -0700
>
> clocksource: Forgive repeated long-latency watchdog clocksource reads
>
> Currently, the clocksource watchdog reacts to repeated long-latency
> clocksource reads by marking that clocksource unstable on the theory that
> these long-latency reads are a sign of a serious problem. And this theory
> does in fact have real-world support in the form of firmware issues [1].
>
> However, it is also possible to trigger this using stress-ng on what
> the stress-ng man page terms "poorly designed hardware" [2]. And it
> is not necessarily a bad thing for the kernel to diagnose cases where
> high-stress workloads are being run on hardware that is not designed
> for this sort of use.
>
> Nevertheless, it is quite possible that real-world use will result in
> some situation requiring that high-stress workloads run on hardware
> not designed to accommodate them, and also requiring that the kernel
> refrain from marking clocksources unstable.
>
> Therefore, provide an out-of-tree patch that reacts to this situation
> by leaving the clocksource alone, but using the old 62.5-millisecond
> skew-detection threshold in response persistent long-latency reads.
> In addition, the offending clocksource is marked for re-initialization
> in this case, which both restarts that clocksource with a clean bill of
> health and avoids false-positive skew reports on later watchdog checks.

Hi Paul,

Sorry to dig out this old thread.

I am testing with this patch in a VM, but I find that sometimes the
re-initialization after a coarse-grained skew check doesn't happen as
expected because ...

>
> Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
> Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
> Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
> Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
> Signed-off-by: Paul E. McKenney <[email protected]>
>
>diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
>index 01df12395c0e..b72a969f7b93 100644
>--- a/kernel/time/clocksource-wdtest.c
>+++ b/kernel/time/clocksource-wdtest.c
>@@ -146,13 +146,12 @@ static int wdtest_func(void *arg)
> else if (i <= max_cswd_read_retries)
> s = ", expect message";
> else
>- s = ", expect clock skew";
>+ s = ", expect coarse-grained clock skew check and re-initialization";
> pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
> WRITE_ONCE(wdtest_ktime_read_ndelays, i);
> schedule_timeout_uninterruptible(2 * HZ);
> WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
>- WARN_ON_ONCE((i <= max_cswd_read_retries) !=
>- !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
>+ WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
> wdtest_ktime_clocksource_reset();
> }
>
>diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>index 4485635b69f5..6c0820779bd3 100644
>--- a/kernel/time/clocksource.c
>+++ b/kernel/time/clocksource.c
>@@ -225,13 +225,13 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> smp_processor_id(), watchdog->name, nretries);
> }
>- return true;
>+ return false;
> }
> }
>
>- pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
>+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
> smp_processor_id(), watchdog->name, wd_delay, nretries);
>- return false;
>+ return true;
> }
>
> static u64 csnow_mid;
>@@ -355,6 +355,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> int next_cpu, reset_pending;
> int64_t wd_nsec, cs_nsec;
> struct clocksource *cs;
>+ bool coarse;
> u32 md;
>
> spin_lock(&watchdog_lock);
>@@ -372,11 +373,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> continue;
> }
>
>- if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
>- /* Clock readout unreliable, so give it up. */
>- __clocksource_unstable(cs);
>- continue;
>- }
>+ coarse = cs_watchdog_read(cs, &csnow, &wdnow);
>
> /* Clocksource initialized ? */
> if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
>@@ -402,7 +399,13 @@ static void clocksource_watchdog(struct timer_list *unused)
> continue;
>
> /* Check the deviation from the watchdog clocksource. */
>- md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>+ if (coarse) {
>+ md = 62500 * NSEC_PER_USEC;
>+ cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
>+ pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);

... this message on CPU5 doesn't show up in the kernel logs below.
Do you think it is a bug? If so, any idea how to resolve it?

[ 498.571086] clocksource: timekeeping watchdog on CPU1: hpet read-back delay of 432490ns, attempt 4, coarse-grained skew check followed by re-initialization
[ 498.572867] clocksource: timekeeping watchdog on CPU1: hpet coarse-grained 62.500 ms clock-skew check followed by re-initialization
[ 504.071959] clocksource: timekeeping watchdog on CPU4: hpet read-back delay of 1679880ns, attempt 4, coarse-grained skew check followed by re-initialization
[ 504.073817] clocksource: timekeeping watchdog on CPU4: hpet coarse-grained 62.500 ms clock-skew check followed by re-initialization
[ 504.568821] clocksource: timekeeping watchdog on CPU5: hpet read-back delay of 554880ns, attempt 4, coarse-grained skew check followed by re-initialization
[ 505.067666] clocksource: timekeeping watchdog on CPU6: hpet retried 3 times before success
[ 505.068593] clocksource: timekeeping watchdog on CPU6: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 505.069596] clocksource: 'hpet' wd_nsec: 499376790 wd_now: be2f200d wd_last: bb3522fe mask: ffffffff
[ 505.071131] clocksource: 'tsc' cs_nsec: 498867307 cs_now: 103895c060a cs_last: 1034aea96ea mask: ffffffffffffffff
[ 505.072994] clocksource: 'tsc' is current clocksource.
[ 505.074748] tsc: Marking TSC unstable due to clocksource watchdog

Thanks
-Chao

>+ } else {
>+ md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>+ }
> if (abs(cs_nsec - wd_nsec) > md) {
> pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
> smp_processor_id(), cs->name);

2021-08-02 17:04:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Mon, Aug 02, 2021 at 02:20:09PM +0800, Chao Gao wrote:
> [snip]
> >commit 48ebcfbfd877f5d9cddcc03c91352a8ca7b190af
> >Author: Paul E. McKenney <[email protected]>
> >Date: Thu May 27 11:03:28 2021 -0700
> >
> > clocksource: Forgive repeated long-latency watchdog clocksource reads
> >
> > Currently, the clocksource watchdog reacts to repeated long-latency
> > clocksource reads by marking that clocksource unstable on the theory that
> > these long-latency reads are a sign of a serious problem. And this theory
> > does in fact have real-world support in the form of firmware issues [1].
> >
> > However, it is also possible to trigger this using stress-ng on what
> > the stress-ng man page terms "poorly designed hardware" [2]. And it
> > is not necessarily a bad thing for the kernel to diagnose cases where
> > high-stress workloads are being run on hardware that is not designed
> > for this sort of use.
> >
> > Nevertheless, it is quite possible that real-world use will result in
> > some situation requiring that high-stress workloads run on hardware
> > not designed to accommodate them, and also requiring that the kernel
> > refrain from marking clocksources unstable.
> >
> > Therefore, provide an out-of-tree patch that reacts to this situation
> > by leaving the clocksource alone, but using the old 62.5-millisecond
> > skew-detection threshold in response persistent long-latency reads.
> > In addition, the offending clocksource is marked for re-initialization
> > in this case, which both restarts that clocksource with a clean bill of
> > health and avoids false-positive skew reports on later watchdog checks.
>
> Hi Paul,
>
> Sorry to dig out this old thread.

Not a problem, especially given that this is still an experimental patch
(marked with "EXP" in -rcu). So one remaining question is "what is this
patch really supposed to do, if anything?"

> I am testing with this patch in a VM, but I find that sometimes the
> re-initialization after a coarse-grained skew check doesn't happen as
> expected because ...
>
> >
> > Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
> > Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
> > Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
> > Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> >diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
> >index 01df12395c0e..b72a969f7b93 100644
> >--- a/kernel/time/clocksource-wdtest.c
> >+++ b/kernel/time/clocksource-wdtest.c
> >@@ -146,13 +146,12 @@ static int wdtest_func(void *arg)
> > else if (i <= max_cswd_read_retries)
> > s = ", expect message";
> > else
> >- s = ", expect clock skew";
> >+ s = ", expect coarse-grained clock skew check and re-initialization";
> > pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
> > WRITE_ONCE(wdtest_ktime_read_ndelays, i);
> > schedule_timeout_uninterruptible(2 * HZ);
> > WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
> >- WARN_ON_ONCE((i <= max_cswd_read_retries) !=
> >- !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
> >+ WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
> > wdtest_ktime_clocksource_reset();
> > }
> >
> >diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> >index 4485635b69f5..6c0820779bd3 100644
> >--- a/kernel/time/clocksource.c
> >+++ b/kernel/time/clocksource.c
> >@@ -225,13 +225,13 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> > smp_processor_id(), watchdog->name, nretries);
> > }
> >- return true;
> >+ return false;
> > }
> > }
> >
> >- pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> >+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
> > smp_processor_id(), watchdog->name, wd_delay, nretries);
> >- return false;
> >+ return true;
> > }
> >
> > static u64 csnow_mid;
> >@@ -355,6 +355,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > int next_cpu, reset_pending;
> > int64_t wd_nsec, cs_nsec;
> > struct clocksource *cs;
> >+ bool coarse;
> > u32 md;
> >
> > spin_lock(&watchdog_lock);
> >@@ -372,11 +373,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > continue;
> > }
> >
> >- if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> >- /* Clock readout unreliable, so give it up. */
> >- __clocksource_unstable(cs);
> >- continue;
> >- }
> >+ coarse = cs_watchdog_read(cs, &csnow, &wdnow);
> >
> > /* Clocksource initialized ? */
> > if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> >@@ -402,7 +399,13 @@ static void clocksource_watchdog(struct timer_list *unused)
> > continue;
> >
> > /* Check the deviation from the watchdog clocksource. */
> >- md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> >+ if (coarse) {
> >+ md = 62500 * NSEC_PER_USEC;
> >+ cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> >+ pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
>
> ... this message on CPU5 doesn't show up in the kernel logs below.
> Do you think it is a bug? If so, any idea how to resolve it?
>
> [ 498.571086] clocksource: timekeeping watchdog on CPU1: hpet read-back delay of 432490ns, attempt 4, coarse-grained skew check followed by re-initialization
> [ 498.572867] clocksource: timekeeping watchdog on CPU1: hpet coarse-grained 62.500 ms clock-skew check followed by re-initialization
> [ 504.071959] clocksource: timekeeping watchdog on CPU4: hpet read-back delay of 1679880ns, attempt 4, coarse-grained skew check followed by re-initialization
> [ 504.073817] clocksource: timekeeping watchdog on CPU4: hpet coarse-grained 62.500 ms clock-skew check followed by re-initialization
> [ 504.568821] clocksource: timekeeping watchdog on CPU5: hpet read-back delay of 554880ns, attempt 4, coarse-grained skew check followed by re-initialization

Up to this point, the clocksource passed the coarse-grained checks.
So at the very least, the "followed by re-initialization" is misleading.
I will change this message.

And yes, I would have expected the additional "62.500 ms clock-skew check"
message from CPU5, like we see from CPU1 and CPU4 above. However, this
message will be omitted if there is a watchdog reset pending or if the
clocksource has not yet been initialized. Which could well have happened
in this case.

> [ 505.067666] clocksource: timekeeping watchdog on CPU6: hpet retried 3 times before success
> [ 505.068593] clocksource: timekeeping watchdog on CPU6: Marking clocksource 'tsc' as unstable because the skew is too large:
> [ 505.069596] clocksource: 'hpet' wd_nsec: 499376790 wd_now: be2f200d wd_last: bb3522fe mask: ffffffff
> [ 505.071131] clocksource: 'tsc' cs_nsec: 498867307 cs_now: 103895c060a cs_last: 1034aea96ea mask: ffffffffffffffff
> [ 505.072994] clocksource: 'tsc' is current clocksource.
> [ 505.074748] tsc: Marking TSC unstable due to clocksource watchdog

And here the clocksource failed the coarse-grained check and marked
the clocksource as unstable. Perhaps because the previous read
forced a coarse-grained check. Except that this should have forced
a reinitialization. Ah, it looks like I need to suppress setting
CLOCK_SOURCE_WATCHDOG if coarse-grained checks have been enabled.
That could cause false-positive failure for the next check, after all.

And perhaps make cs_watchdog_read() modify its print if there is
a watchdog reset pending or if the current clocksource has the
CLOCK_SOURCE_WATCHDOG flag cleared.

Perhaps as shown in the additional patch below, to be folded into the
original?

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index cfa992250c388..62da2485fd574 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -230,8 +230,13 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
}
}

- pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
- smp_processor_id(), watchdog->name, wd_delay, nretries);
+ if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
+ smp_processor_id(), watchdog->name, wd_delay, nretries);
+ } else {
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
+ smp_processor_id(), watchdog->name, wd_delay, nretries);
+ }
return true;
}

@@ -379,7 +384,8 @@ static void clocksource_watchdog(struct timer_list *unused)
/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
atomic_read(&watchdog_reset_pending)) {
- cs->flags |= CLOCK_SOURCE_WATCHDOG;
+ if (!coarse)
+ cs->flags |= CLOCK_SOURCE_WATCHDOG;
cs->wd_last = wdnow;
cs->cs_last = csnow;
continue;

2021-08-03 08:51:30

by Chao Gao

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Mon, Aug 02, 2021 at 10:02:57AM -0700, Paul E. McKenney wrote:
>On Mon, Aug 02, 2021 at 02:20:09PM +0800, Chao Gao wrote:
>> [snip]
>> >commit 48ebcfbfd877f5d9cddcc03c91352a8ca7b190af
>> >Author: Paul E. McKenney <[email protected]>
>> >Date: Thu May 27 11:03:28 2021 -0700
>> >
>> > clocksource: Forgive repeated long-latency watchdog clocksource reads
>> >
>> > Currently, the clocksource watchdog reacts to repeated long-latency
>> > clocksource reads by marking that clocksource unstable on the theory that
>> > these long-latency reads are a sign of a serious problem. And this theory
>> > does in fact have real-world support in the form of firmware issues [1].
>> >
>> > However, it is also possible to trigger this using stress-ng on what
>> > the stress-ng man page terms "poorly designed hardware" [2]. And it
>> > is not necessarily a bad thing for the kernel to diagnose cases where
>> > high-stress workloads are being run on hardware that is not designed
>> > for this sort of use.
>> >
>> > Nevertheless, it is quite possible that real-world use will result in
>> > some situation requiring that high-stress workloads run on hardware
>> > not designed to accommodate them, and also requiring that the kernel
>> > refrain from marking clocksources unstable.
>> >
>> > Therefore, provide an out-of-tree patch that reacts to this situation
>> > by leaving the clocksource alone, but using the old 62.5-millisecond
>> > skew-detection threshold in response persistent long-latency reads.
>> > In addition, the offending clocksource is marked for re-initialization
>> > in this case, which both restarts that clocksource with a clean bill of
>> > health and avoids false-positive skew reports on later watchdog checks.
>>
>> Hi Paul,
>>
>> Sorry to dig out this old thread.
>
>Not a problem, especially given that this is still an experimental patch
>(marked with "EXP" in -rcu). So one remaining question is "what is this
>patch really supposed to do, if anything?".

We are testing with TDX [1] and analyzing why the kernel in a TD, or Trust
Domain, sometimes spots a large TSC skew. We have inspected the TSC
hardware/ucode/TDX module to rule out a hardware issue, and also ported
tsc_sync.c to a userspace tool so that it can constantly check whether the
TSC stays synchronized while some workload is running. In the end, we
believe that the large TSC skew spotted by the TD kernel is a false
positive.

Your patches (the ones already merged) have improved the clocksource
watchdog a lot and reduced false positives. But due to the nature of TDX,
switching between the TD and the host takes more time. The time window
between the two reads of the watchdog clocksource in cs_watchdog_read()
therefore increases, and so does the probability of those two reads being
interrupted by whatever is running on the host. Then, sometimes, especially
when there are heavy workloads in both the host and the TD, the maximum
number of retries in cs_watchdog_read() is exceeded and the TSC is marked
unstable.

Then we applied this out-of-tree patch, and it helps to further reduce
false positives. But the TD kernel still observes TSC skew in some cases.
After a close look at the kernel logs, we found a pattern in those cases:
an expected re-initialization somehow doesn't happen. That's why we are
raising this issue and asking for your advice.

[1]: https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
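
As a back-of-the-envelope illustration (this is only a toy userspace model,
not the kernel code, and the probabilities are made up): if each attempt in
cs_watchdog_read() is disturbed with some independent probability p, the
check degrades only when every one of the (retries + 1) attempts is
disturbed, so the chance of that shrinks geometrically as the retry count
grows:

#include <math.h>
#include <stdio.h>

/*
 * Toy model only: assume each of the (retries + 1) read attempts in
 * cs_watchdog_read() is independently disturbed with probability p.
 * The whole check degrades only if every attempt is disturbed.
 */
int main(void)
{
	const double p[] = { 0.01, 0.10, 0.30 };   /* per-attempt disturbance probability */
	const int retries[] = { 3, 5, 10 };        /* clocksource.max_cswd_read_retries */

	for (unsigned int i = 0; i < sizeof(p) / sizeof(p[0]); i++)
		for (unsigned int j = 0; j < sizeof(retries) / sizeof(retries[0]); j++)
			printf("p=%.2f retries=%2d -> all attempts disturbed: %.2e\n",
			       p[i], retries[j], pow(p[i], retries[j] + 1));
	return 0;
}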

>And here the clocksource failed the coarse-grained check and marked
>the clocksource as unstable. Perhaps because the previous read
>forced a coarse-grained check. Except that this should have forced
>a reinitialization. Ah, it looks like I need to suppress setting
>CLOCK_SOURCE_WATCHDOG if coarse-grained checks have been enabled.
>That could cause false-positive failure for the next check, after all.
>
>And perhaps make cs_watchdog_read() modify its print if there is
>a watchdog reset pending or if the current clocksource has the
>CLOCK_SOURCE_WATCHDOG flag cleared.
>
>Perhaps as shown in the additional patch below, to be folded into the
>original?

Thanks. Will test with below patch applied.

Thanks
Chao
>
> Thanx, Paul
>
>------------------------------------------------------------------------
>
>diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>index cfa992250c388..62da2485fd574 100644
>--- a/kernel/time/clocksource.c
>+++ b/kernel/time/clocksource.c
>@@ -230,8 +230,13 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> }
> }
>
>- pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
>- smp_processor_id(), watchdog->name, wd_delay, nretries);
>+ if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
>+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
>+ smp_processor_id(), watchdog->name, wd_delay, nretries);
>+ } else {
>+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
>+ smp_processor_id(), watchdog->name, wd_delay, nretries);
>+ }
> return true;
> }
>
>@@ -379,7 +384,8 @@ static void clocksource_watchdog(struct timer_list *unused)
> /* Clocksource initialized ? */
> if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> atomic_read(&watchdog_reset_pending)) {
>- cs->flags |= CLOCK_SOURCE_WATCHDOG;
>+ if (!coarse)
>+ cs->flags |= CLOCK_SOURCE_WATCHDOG;
> cs->wd_last = wdnow;
> cs->cs_last = csnow;
> continue;

2021-08-03 13:52:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Tue, Aug 03, 2021 at 04:58:00PM +0800, Chao Gao wrote:
> On Mon, Aug 02, 2021 at 10:02:57AM -0700, Paul E. McKenney wrote:
> >On Mon, Aug 02, 2021 at 02:20:09PM +0800, Chao Gao wrote:
> >> [snip]
> >> >commit 48ebcfbfd877f5d9cddcc03c91352a8ca7b190af
> >> >Author: Paul E. McKenney <[email protected]>
> >> >Date: Thu May 27 11:03:28 2021 -0700
> >> >
> >> > clocksource: Forgive repeated long-latency watchdog clocksource reads
> >> >
> >> > Currently, the clocksource watchdog reacts to repeated long-latency
> >> > clocksource reads by marking that clocksource unstable on the theory that
> >> > these long-latency reads are a sign of a serious problem. And this theory
> >> > does in fact have real-world support in the form of firmware issues [1].
> >> >
> >> > However, it is also possible to trigger this using stress-ng on what
> >> > the stress-ng man page terms "poorly designed hardware" [2]. And it
> >> > is not necessarily a bad thing for the kernel to diagnose cases where
> >> > high-stress workloads are being run on hardware that is not designed
> >> > for this sort of use.
> >> >
> >> > Nevertheless, it is quite possible that real-world use will result in
> >> > some situation requiring that high-stress workloads run on hardware
> >> > not designed to accommodate them, and also requiring that the kernel
> >> > refrain from marking clocksources unstable.
> >> >
> >> > Therefore, provide an out-of-tree patch that reacts to this situation
> >> > by leaving the clocksource alone, but using the old 62.5-millisecond
> >> > skew-detection threshold in response persistent long-latency reads.
> >> > In addition, the offending clocksource is marked for re-initialization
> >> > in this case, which both restarts that clocksource with a clean bill of
> >> > health and avoids false-positive skew reports on later watchdog checks.
> >>
> >> Hi Paul,
> >>
> >> Sorry to dig out this old thread.
> >
> >Not a problem, especially given that this is still an experimental patch
> >(marked with "EXP" in -rcu). So one remaining question is "what is this
> >patch really supposed to do, if anything?".
>
> We are testing with TDX [1] and analyzing why kernel in a TD, or Trust Domain,
> sometimes spots a large TSC skew. We have inspected tsc hardware/ucode/tdx
> module to ensure no hardware issue, and also ported tsc_sync.c to a userspace
> tool such that this tool can help to constantly check if tsc is synchronized
> when some workload is running. Finally, we believe that the large TSC skew
> spotted by TD kernel is a false positive.
>
> Your patches (those are merged) have improved clocksource watchdog a lot to
> reduce false-positives. But due to the nature of TDX, switching between TD
> and host takes more time. Then, the time window between two reads from
> watchdog clocksource in cs_watchdog_read() increases, so does the
> probability of the two reads being interrupted by whatever on host. Then,
> sometimes, especially when there are heavy workloads in both host and TD,
> the maximum number of retries in cs_watchdog_read() is exceeded and tsc is
> marked unstable.
>
> Then we apply this out-of-tree patch, it helps to further reduce
> false-positives. But TD kernel still observes TSC skew in some cases. After
> a close look into kernel logs, we find patterns in those cases: an expected
> re-initialization somehow doesn't happen. That's why we raise this issue
> and ask for your advice.

I am glad that the patch at least helps. ;-)

> [1]: https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
>
> >And here the clocksource failed the coarse-grained check and marked
> >the clocksource as unstable. Perhaps because the previous read
> >forced a coarse-grained check. Except that this should have forced
> >a reinitialization. Ah, it looks like I need to suppress setting
> >CLOCK_SOURCE_WATCHDOG if coarse-grained checks have been enabled.
> >That could cause false-positive failure for the next check, after all.
> >
> >And perhaps make cs_watchdog_read() modify its print if there is
> >a watchdog reset pending or if the current clocksource has the
> >CLOCK_SOURCE_WATCHDOG flag cleared.
> >
> >Perhaps as shown in the additional patch below, to be folded into the
> >original?
>
> Thanks. Will test with below patch applied.

If this patch helps, but problems remain, another thing to try is to
increase the clocksource.max_cswd_read_retries kernel boot parameter
above its default value of 3. Maybe to 5 or 10?
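For example, booting with clocksource.max_cswd_read_retries=10 on the
kernel command line would allow up to ten retries per watchdog check
before falling back.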

If this patch does not help, please let me know. In that case, there
are probably more fixes required.

Thanx, Paul

> Thanks
> Chao
> >
> > Thanx, Paul
> >
> >------------------------------------------------------------------------
> >
> >diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> >index cfa992250c388..62da2485fd574 100644
> >--- a/kernel/time/clocksource.c
> >+++ b/kernel/time/clocksource.c
> >@@ -230,8 +230,13 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > }
> > }
> >
> >- pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
> >- smp_processor_id(), watchdog->name, wd_delay, nretries);
> >+ if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
> >+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
> >+ smp_processor_id(), watchdog->name, wd_delay, nretries);
> >+ } else {
> >+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
> >+ smp_processor_id(), watchdog->name, wd_delay, nretries);
> >+ }
> > return true;
> > }
> >
> >@@ -379,7 +384,8 @@ static void clocksource_watchdog(struct timer_list *unused)
> > /* Clocksource initialized ? */
> > if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> > atomic_read(&watchdog_reset_pending)) {
> >- cs->flags |= CLOCK_SOURCE_WATCHDOG;
> >+ if (!coarse)
> >+ cs->flags |= CLOCK_SOURCE_WATCHDOG;
> > cs->wd_last = wdnow;
> > cs->cs_last = csnow;
> > continue;

2021-08-05 04:53:45

by Chao Gao

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Tue, Aug 03, 2021 at 06:48:16AM -0700, Paul E. McKenney wrote:
>On Tue, Aug 03, 2021 at 04:58:00PM +0800, Chao Gao wrote:
>> On Mon, Aug 02, 2021 at 10:02:57AM -0700, Paul E. McKenney wrote:
>> >On Mon, Aug 02, 2021 at 02:20:09PM +0800, Chao Gao wrote:
>> >> [snip]
>> >> >commit 48ebcfbfd877f5d9cddcc03c91352a8ca7b190af
>> >> >Author: Paul E. McKenney <[email protected]>
>> >> >Date: Thu May 27 11:03:28 2021 -0700
>> >> >
>> >> > clocksource: Forgive repeated long-latency watchdog clocksource reads
>> >> >
>> >> > Currently, the clocksource watchdog reacts to repeated long-latency
>> >> > clocksource reads by marking that clocksource unstable on the theory that
>> >> > these long-latency reads are a sign of a serious problem. And this theory
>> >> > does in fact have real-world support in the form of firmware issues [1].
>> >> >
>> >> > However, it is also possible to trigger this using stress-ng on what
>> >> > the stress-ng man page terms "poorly designed hardware" [2]. And it
>> >> > is not necessarily a bad thing for the kernel to diagnose cases where
>> >> > high-stress workloads are being run on hardware that is not designed
>> >> > for this sort of use.
>> >> >
>> >> > Nevertheless, it is quite possible that real-world use will result in
>> >> > some situation requiring that high-stress workloads run on hardware
>> >> > not designed to accommodate them, and also requiring that the kernel
>> >> > refrain from marking clocksources unstable.
>> >> >
>> >> > Therefore, provide an out-of-tree patch that reacts to this situation
>> >> > by leaving the clocksource alone, but using the old 62.5-millisecond
>> >> > skew-detection threshold in response persistent long-latency reads.
>> >> > In addition, the offending clocksource is marked for re-initialization
>> >> > in this case, which both restarts that clocksource with a clean bill of
>> >> > health and avoids false-positive skew reports on later watchdog checks.
>> >>
>> >> Hi Paul,
>> >>
>> >> Sorry to dig out this old thread.
>> >
>> >Not a problem, especially given that this is still an experimental patch
>> >(marked with "EXP" in -rcu). So one remaining question is "what is this
>> >patch really supposed to do, if anything?".
>>
>> We are testing with TDX [1] and analyzing why kernel in a TD, or Trust Domain,
>> sometimes spots a large TSC skew. We have inspected tsc hardware/ucode/tdx
>> module to ensure no hardware issue, and also ported tsc_sync.c to a userspace
>> tool such that this tool can help to constantly check if tsc is synchronized
>> when some workload is running. Finally, we believe that the large TSC skew
>> spotted by TD kernel is a false positive.
>>
>> Your patches (those are merged) have improved clocksource watchdog a lot to
>> reduce false-positives. But due to the nature of TDX, switching between TD
>> and host takes more time. Then, the time window between two reads from
>> watchdog clocksource in cs_watchdog_read() increases, so does the
>> probability of the two reads being interrupted by whatever on host. Then,
>> sometimes, especially when there are heavy workloads in both host and TD,
>> the maximum number of retries in cs_watchdog_read() is exceeded and tsc is
>> marked unstable.
>>
>> Then we apply this out-of-tree patch, it helps to further reduce
>> false-positives. But TD kernel still observes TSC skew in some cases. After
>> a close look into kernel logs, we find patterns in those cases: an expected
>> re-initialization somehow doesn't happen. That's why we raise this issue
>> and ask for your advice.
>
>I am glad that the patch at least helps. ;-)
>
>> [1]: https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
>>
>> >And here the clocksource failed the coarse-grained check and marked
>> >the clocksource as unstable. Perhaps because the previous read
>> >forced a coarse-grained check. Except that this should have forced
>> >a reinitialization. Ah, it looks like I need to suppress setting
>> >CLOCK_SOURCE_WATCHDOG if coarse-grained checks have been enabled.
>> >That could cause false-positive failure for the next check, after all.
>> >
>> >And perhaps make cs_watchdog_read() modify its print if there is
>> >a watchdog reset pending or if the current clocksource has the
>> >CLOCK_SOURCE_WATCHDOG flag cleared.
>> >
>> >Perhaps as shown in the additional patch below, to be folded into the
>> >original?
>>
>> Thanks. Will test with below patch applied.
>
>If this patch helps, but problems remain, another thing to try is to
>increase the clocksource.max_cswd_read_retries kernel boot parameter
>above its default value of 3. Maybe to 5 or 10?
>
>If this patch does not help, please let me know. In that case, there
>are probably more fixes required.

This patch works well; no false-positive (marking TSC unstable) in a
10hr stress test.

Thanks
Chao

2021-08-05 05:15:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Thu, Aug 05, 2021 at 10:16:48AM +0800, Chao Gao wrote:
> On Tue, Aug 03, 2021 at 06:48:16AM -0700, Paul E. McKenney wrote:
> >On Tue, Aug 03, 2021 at 04:58:00PM +0800, Chao Gao wrote:
> >> On Mon, Aug 02, 2021 at 10:02:57AM -0700, Paul E. McKenney wrote:
> >> >On Mon, Aug 02, 2021 at 02:20:09PM +0800, Chao Gao wrote:
> >> >> [snip]
> >> >> >commit 48ebcfbfd877f5d9cddcc03c91352a8ca7b190af
> >> >> >Author: Paul E. McKenney <[email protected]>
> >> >> >Date: Thu May 27 11:03:28 2021 -0700
> >> >> >
> >> >> > clocksource: Forgive repeated long-latency watchdog clocksource reads
> >> >> >
> >> >> > Currently, the clocksource watchdog reacts to repeated long-latency
> >> >> > clocksource reads by marking that clocksource unstable on the theory that
> >> >> > these long-latency reads are a sign of a serious problem. And this theory
> >> >> > does in fact have real-world support in the form of firmware issues [1].
> >> >> >
> >> >> > However, it is also possible to trigger this using stress-ng on what
> >> >> > the stress-ng man page terms "poorly designed hardware" [2]. And it
> >> >> > is not necessarily a bad thing for the kernel to diagnose cases where
> >> >> > high-stress workloads are being run on hardware that is not designed
> >> >> > for this sort of use.
> >> >> >
> >> >> > Nevertheless, it is quite possible that real-world use will result in
> >> >> > some situation requiring that high-stress workloads run on hardware
> >> >> > not designed to accommodate them, and also requiring that the kernel
> >> >> > refrain from marking clocksources unstable.
> >> >> >
> >> >> > Therefore, provide an out-of-tree patch that reacts to this situation
> >> >> > by leaving the clocksource alone, but using the old 62.5-millisecond
> >> >> > skew-detection threshold in response persistent long-latency reads.
> >> >> > In addition, the offending clocksource is marked for re-initialization
> >> >> > in this case, which both restarts that clocksource with a clean bill of
> >> >> > health and avoids false-positive skew reports on later watchdog checks.
> >> >>
> >> >> Hi Paul,
> >> >>
> >> >> Sorry to dig out this old thread.
> >> >
> >> >Not a problem, especially given that this is still an experimental patch
> >> >(marked with "EXP" in -rcu). So one remaining question is "what is this
> >> >patch really supposed to do, if anything?".
> >>
> >> We are testing with TDX [1] and analyzing why kernel in a TD, or Trust Domain,
> >> sometimes spots a large TSC skew. We have inspected tsc hardware/ucode/tdx
> >> module to ensure no hardware issue, and also ported tsc_sync.c to a userspace
> >> tool such that this tool can help to constantly check if tsc is synchronized
> >> when some workload is running. Finally, we believe that the large TSC skew
> >> spotted by TD kernel is a false positive.
> >>
> >> Your patches (those are merged) have improved clocksource watchdog a lot to
> >> reduce false-positives. But due to the nature of TDX, switching between TD
> >> and host takes more time. Then, the time window between two reads from
> >> watchdog clocksource in cs_watchdog_read() increases, so does the
> >> probability of the two reads being interrupted by whatever on host. Then,
> >> sometimes, especially when there are heavy workloads in both host and TD,
> >> the maximum number of retries in cs_watchdog_read() is exceeded and tsc is
> >> marked unstable.
> >>
> >> Then we apply this out-of-tree patch, it helps to further reduce
> >> false-positives. But TD kernel still observes TSC skew in some cases. After
> >> a close look into kernel logs, we find patterns in those cases: an expected
> >> re-initialization somehow doesn't happen. That's why we raise this issue
> >> and ask for your advice.
> >
> >I am glad that the patch at least helps. ;-)
> >
> >> [1]: https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
> >>
> >> >And here the clocksource failed the coarse-grained check and was marked
> >> >unstable. Perhaps because the previous read
> >> >forced a coarse-grained check. Except that this should have forced
> >> >a reinitialization. Ah, it looks like I need to suppress setting
> >> >CLOCK_SOURCE_WATCHDOG if coarse-grained checks have been enabled.
> >> >That could cause false-positive failure for the next check, after all.
> >> >
> >> >And perhaps make cs_watchdog_read() modify its print if there is
> >> >a watchdog reset pending or if the current clocksource has the
> >> >CLOCK_SOURCE_WATCHDOG flag cleared.
> >> >
> >> >Perhaps as shown in the additional patch below, to be folded into the
> >> >original?
> >>
> >> Thanks. Will test with below patch applied.
> >
> >If this patch helps, but problems remain, another thing to try is to
> >increase the clocksource.max_cswd_read_retries kernel boot parameter
> >above its default value of 3. Maybe to 5 or 10?
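[For example, assuming the parameter is passed on the kernel command line in the
usual way, that would be an extra boot-line entry such as:

        clocksource.max_cswd_read_retries=10
]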
> >
> >If this patch does not help, please let me know. In that case, there
> >are probably more fixes required.
>
> This patch works well; no false-positive (marking TSC unstable) in a
> 10hr stress test.

Very good, thank you! May I add your Tested-by?

I expect that I will need to modify the patch a bit more to check for
a system where it is -never- able to get a good fine-grained read from
the clock. And it might be that your test run ended up in that state.

My current thought is that if more than (say) 100 consecutive attempts
to read the clocksource get hit with excessive delays, it is time to at
least do a WARN_ON(), and maybe also time to disable the clocksource
due to skew. The reason is that if reading the clocksource -always-
sees excessive delays, perhaps the clock driver or hardware is to blame.

Thoughts?

Thanx, Paul
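[Sketched in userspace terms, the proposal above amounts to something like the
following; the names, and fprintf() standing in for WARN_ON(), are placeholders
for illustration only:

/*
 * Track consecutive delayed (i.e. coarse-grained) watchdog passes and
 * complain once the run gets suspiciously long.
 */
#include <stdio.h>
#include <stdbool.h>

#define MAX_CONSECUTIVE_COARSE  100     /* the "(say) 100" above */

static unsigned int consecutive_coarse;

static void account_watchdog_pass(bool read_was_delayed)
{
        if (!read_was_delayed) {
                consecutive_coarse = 0;         /* a good fine-grained read */
                return;
        }

        if (++consecutive_coarse == MAX_CONSECUTIVE_COARSE) {
                /* The kernel might WARN_ON() and/or disable the clocksource. */
                fprintf(stderr,
                        "%u consecutive delayed reads; suspect the clock driver or hardware\n",
                        consecutive_coarse);
        }
}

int main(void)
{
        /* Simulate 150 delayed reads in a row: the complaint fires once, at 100. */
        for (int i = 0; i < 150; i++)
                account_watchdog_pass(true);
        return 0;
}
]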

2021-08-05 05:17:17

by Andi Kleen

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression


> My current thought is that if more than (say) 100 consecutive attempts
> to read the clocksource get hit with excessive delays, it is time to at
> least do a WARN_ON(), and maybe also time to disable the clocksource
> due to skew. The reason is that if reading the clocksource -always-
> sees excessive delays, perhaps the clock driver or hardware is to blame.
>
> Thoughts?

On TDX this would be fatal because we don't have a usable fallback source
(just jiffies). Better to try as hard as possible.


-Andi

2021-08-05 07:47:44

by Chao Gao

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

[snip]
>> This patch works well; no false-positive (marking TSC unstable) in a
>> 10hr stress test.
>
>Very good, thank you! May I add your Tested-by?

sure.
Tested-by: Chao Gao <[email protected]>

>
>I expect that I will need to modify the patch a bit more to check for
>a system where it is -never- able to get a good fine-grained read from
>the clock.

Agreed.

>And it might be that your test run ended up in that state.

Not the case, judging from the kernel logs. The coarse-grained check happened 6475
times in 43k seconds (counted by grepping "coarse-grained skew check" in the kernel
logs). So many checks were still fine-grained.

>
>My current thought is that if more than (say) 100 consecutive attempts
>to read the clocksource get hit with excessive delays, it is time to at
>least do a WARN_ON(), and maybe also time to disable the clocksource
>due to skew. The reason is that if reading the clocksource -always-
>sees excessive delays, perhaps the clock driver or hardware is to blame.
>
>Thoughts?

It makes sense to me.

Thanks
Chao

2021-08-05 15:35:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Wed, Aug 04, 2021 at 09:34:13PM -0700, Andi Kleen wrote:
>
> > My current thought is that if more than (say) 100 consecutive attempts
> > to read the clocksource get hit with excessive delays, it is time to at
> > least do a WARN_ON(), and maybe also time to disable the clocksource
> > due to skew. The reason is that if reading the clocksource -always-
> > sees excessive delays, perhaps the clock driver or hardware is to blame.
> >
> > Thoughts?
>
> On TDX this would be fatal because we don't have a usable fallback source
>
> (just jiffies). Better try as hard as possible.

At some point, won't the system's suffering in silence become quite the
disservice to its users?

One alternative would be to give a warning splat, but avoid reporting
skew. Unless there is the traditional 62.5ms of skew, of course.

Thanx, Paul

2021-08-05 22:12:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Thu, Aug 05, 2021 at 01:39:40PM +0800, Chao Gao wrote:
> [snip]
> >> This patch works well; no false-positive (marking TSC unstable) in a
> >> 10hr stress test.
> >
> >Very good, thank you! May I add your Tested-by?
>
> sure.
> Tested-by: Chao Gao <[email protected]>

Very good, thank you! I will apply this on the next rebase.

> >I expect that I will need to modify the patch a bit more to check for
> >a system where it is -never- able to get a good fine-grained read from
> >the clock.
>
> Agreed.
>
> >And it might be that your test run ended up in that state.
>
> Not that case judging from kernel logs. Coarse-grained check happened 6475
> times in 43k seconds (by grep "coarse-grained skew check" in kernel logs).
> So, still many checks were fine-grained.

Whew! ;-)

So about once per 13 clocksource watchdog checks.
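[That ratio follows if one assumes the watchdog's default half-second check
interval: 43,000 s is roughly 86,000 checks, and 86,000 / 6,475 is about 13.3.]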

To Andi's point, do you have enough information in your console log to
work out the longest run of coarse-grained clocksource checks?

> >My current thought is that if more than (say) 100 consecutive attempts
> >to read the clocksource get hit with excessive delays, it is time to at
> >least do a WARN_ON(), and maybe also time to disable the clocksource
> >due to skew. The reason is that if reading the clocksource -always-
> >sees excessive delays, perhaps the clock driver or hardware is to blame.
> >
> >Thoughts?
>
> It makes sense to me.

Sounds good!

Thanx, Paul

2021-08-06 07:42:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Fri, Aug 06, 2021 at 10:10:00AM +0800, Chao Gao wrote:
> On Thu, Aug 05, 2021 at 08:37:27AM -0700, Paul E. McKenney wrote:
> >On Thu, Aug 05, 2021 at 01:39:40PM +0800, Chao Gao wrote:
> >> [snip]
> >> >> This patch works well; no false-positive (marking TSC unstable) in a
> >> >> 10hr stress test.
> >> >
> >> >Very good, thank you! May I add your Tested-by?
> >>
> >> sure.
> >> Tested-by: Chao Gao <[email protected]>
> >
> >Very good, thank you! I will apply this on the next rebase.
> >
> >> >I expect that I will need to modify the patch a bit more to check for
> >> >a system where it is -never- able to get a good fine-grained read from
> >> >the clock.
> >>
> >> Agreed.
> >>
> >> >And it might be that your test run ended up in that state.
> >>
> >> Not that case judging from kernel logs. Coarse-grained check happened 6475
> >> times in 43k seconds (by grep "coarse-grained skew check" in kernel logs).
> >> So, still many checks were fine-grained.
> >
> >Whew! ;-)
> >
> >So about once per 13 clocksource watchdog checks.
> >
> >To Andi's point, do you have enough information in your console log to
> >work out the longest run of coarse-grained clocksource checks?
>
> Yes: 5 consecutive coarse-grained clocksource checks. Note that,
> considering the reinitialization after a coarse-grained check, in my
> calculation two coarse-grained checks are considered consecutive if
> they happen within 1s (+/- 0.3s).

Very good, thank you!

So it seems eminently reasonable to have the clocksource watchdog complain
bitterly for more than (say) 100 consecutive coarse-grained checks.

I am thinking in terms of a separate patch for this purpose.

Thoughts?

Thanx, Paul

2021-08-06 11:02:46

by Chao Gao

[permalink] [raw]
Subject: Re: [clocksource] 8901ecc231: stress-ng.lockbus.ops_per_sec -9.5% regression

On Thu, Aug 05, 2021 at 08:37:27AM -0700, Paul E. McKenney wrote:
>On Thu, Aug 05, 2021 at 01:39:40PM +0800, Chao Gao wrote:
>> [snip]
>> >> This patch works well; no false-positive (marking TSC unstable) in a
>> >> 10hr stress test.
>> >
>> >Very good, thank you! May I add your Tested-by?
>>
>> sure.
>> Tested-by: Chao Gao <[email protected]>
>
>Very good, thank you! I will apply this on the next rebase.
>
>> >I expect that I will need to modify the patch a bit more to check for
>> >a system where it is -never- able to get a good fine-grained read from
>> >the clock.
>>
>> Agreed.
>>
>> >And it might be that your test run ended up in that state.
>>
>> Not that case judging from kernel logs. Coarse-grained check happened 6475
>> times in 43k seconds (by grep "coarse-grained skew check" in kernel logs).
>> So, still many checks were fine-grained.
>
>Whew! ;-)
>
>So about once per 13 clocksource watchdog checks.
>
>To Andi's point, do you have enough information in your console log to
>work out the longest run of coarse-grained clocksource checks?

Yes: 5 consecutive coarse-grained clocksource checks. Note that,
considering the reinitialization after a coarse-grained check, in my
calculation two coarse-grained checks are considered consecutive if
they happen within 1s (+/- 0.3s).

Thanks
Chao
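
[For reference, the run-length rule described in the last message (two coarse-grained
checks count as consecutive if they land about 1 s apart, +/- 0.3 s) can be evaluated
with a small standalone helper; it is not part of the original mails and assumes the
dmesg timestamps of the matching "coarse-grained skew check" lines have already been
extracted, one seconds value per line on stdin:

/*
 * Read one timestamp in seconds per line and report the longest run of
 * checks that arrive roughly 1 s apart (1 s +/- 0.3 s).
 */
#include <stdio.h>

int main(void)
{
        double t, prev = 0.0;
        int have_prev = 0, run = 1, longest = 0;

        while (scanf("%lf", &t) == 1) {
                double gap = t - prev;

                if (have_prev && gap >= 0.7 && gap <= 1.3)
                        run++;          /* consecutive coarse-grained check */
                else
                        run = 1;        /* first sample, or the run was broken */
                if (run > longest)
                        longest = run;
                prev = t;
                have_prev = 1;
        }

        printf("longest run of consecutive coarse-grained checks: %d\n", longest);
        return 0;
}
]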