2022-02-06 13:07:49

by Oliver Sang

Subject: [sched/pelt] 2d02fa8cc2: stress-ng.pipeherd.ops_per_sec -9.7% regression



Greeting,

FYI, we noticed a -9.7% regression of stress-ng.pipeherd.ops_per_sec due to commit:


commit: 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync of load_sum with load_avg")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: stress-ng
on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
with following parameters:

nr_threads: 100%
testtime: 60s
class: memory
test: pipeherd
cpufreq_governor: performance
ucode: 0xd000280
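
With these parameters the run boils down to something like
"stress-ng --pipeherd 128 --timeout 60s" with one pipeherd worker per
hardware thread (an assumed invocation for illustration only; the exact
command line is generated from the attached job.yaml), i.e. a pipe-based
workload dominated by context switches and wakeups.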




If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# If you come across any failure that blocks the test,
# please remove the ~/.lkp and /lkp directories to run from a clean state.

=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime/ucode:
memory/gcc-9/performance/x86_64-rhel-8.3/100%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp6/pipeherd/stress-ng/60s/0xd000280

commit:
95246d1ec8 ("sched/pelt: Relax the sync of runnable_sum with runnable_avg")
2d02fa8cc2 ("sched/pelt: Relax the sync of load_sum with load_avg")

95246d1ec80b8d19 2d02fa8cc21a93da35cfba462bf
---------------- ---------------------------
%stddev %change %stddev
\ | \
0.21 +11.0% 0.24 ± 2% stress-ng.pipeherd.context_switches_per_bogo_op
3.869e+09 -9.7% 3.494e+09 stress-ng.pipeherd.ops
64412021 -9.7% 58171101 stress-ng.pipeherd.ops_per_sec
442.37 -7.2% 410.54 stress-ng.time.user_time
5.49 ± 2% -0.5 4.94 ± 4% mpstat.cpu.all.usr%
80705 ± 7% +26.7% 102266 ± 17% numa-meminfo.node1.Active
80705 ± 7% +26.7% 102266 ± 17% numa-meminfo.node1.Active(anon)
12324 ± 3% -22.1% 9603 ± 25% softirqs.CPU106.RCU
12703 ± 4% -23.1% 9769 ± 24% softirqs.CPU27.RCU
4.814e+10 -7.8% 4.438e+10 ± 4% perf-stat.i.branch-instructions
1.58 +10.6% 1.75 ± 6% perf-stat.i.cpi
7.485e+10 -8.2% 6.869e+10 ± 4% perf-stat.i.dTLB-loads
4.451e+10 -8.7% 4.062e+10 ± 4% perf-stat.i.dTLB-stores
2.475e+11 -7.9% 2.28e+11 ± 4% perf-stat.i.instructions
0.64 -7.8% 0.59 ± 2% perf-stat.i.ipc
1305 -9.3% 1184 ± 5% perf-stat.i.metric.M/sec
0.57 +0.0 0.61 perf-stat.overall.branch-miss-rate%
1.58 +6.0% 1.67 perf-stat.overall.cpi
0.05 ± 2% +0.0 0.05 ± 8% perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 18% +0.0 0.00 ± 8% perf-stat.overall.dTLB-store-miss-rate%
0.63 -5.7% 0.60 perf-stat.overall.ipc
4.719e+10 -7.3% 4.373e+10 ± 3% perf-stat.ps.branch-instructions
7.341e+10 -7.8% 6.77e+10 ± 3% perf-stat.ps.dTLB-loads
4.366e+10 -8.3% 4.004e+10 ± 3% perf-stat.ps.dTLB-stores
2.426e+11 -7.4% 2.247e+11 ± 3% perf-stat.ps.instructions
1.557e+13 -5.0% 1.479e+13 perf-stat.total.instructions
2.96 ± 2% -0.3 2.68 perf-profile.calltrace.cycles-pp.copy_page_to_iter.pipe_read.new_sync_read.vfs_read.ksys_read
2.80 -0.2 2.56 perf-profile.calltrace.cycles-pp.copy_page_from_iter.pipe_write.new_sync_write.vfs_write.ksys_write
1.81 -0.2 1.63 perf-profile.calltrace.cycles-pp.__entry_text_start.read
1.80 -0.2 1.65 perf-profile.calltrace.cycles-pp.__entry_text_start.write
1.93 -0.1 1.80 ± 2% perf-profile.calltrace.cycles-pp.mutex_lock.pipe_read.new_sync_read.vfs_read.ksys_read
0.94 ± 2% -0.1 0.82 ± 2% perf-profile.calltrace.cycles-pp.security_file_permission.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.37 -0.1 1.25 ± 2% perf-profile.calltrace.cycles-pp.mutex_lock.pipe_write.new_sync_write.vfs_write.ksys_write
1.23 ± 2% -0.1 1.12 ± 2% perf-profile.calltrace.cycles-pp.copyin.copy_page_from_iter.pipe_write.new_sync_write.vfs_write
1.46 ± 3% -0.1 1.36 ± 3% perf-profile.calltrace.cycles-pp.copyout.copy_page_to_iter.pipe_read.new_sync_read.vfs_read
1.40 -0.1 1.30 ± 2% perf-profile.calltrace.cycles-pp.__mutex_unlock_slowpath.pipe_read.new_sync_read.vfs_read.ksys_read
1.04 ± 2% -0.1 0.95 ± 3% perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string.copyin.copy_page_from_iter.pipe_write.new_sync_write
1.03 ± 2% -0.1 0.94 ± 2% perf-profile.calltrace.cycles-pp.__mutex_unlock_slowpath.pipe_write.new_sync_write.vfs_write.ksys_write
1.27 ± 3% -0.1 1.18 ± 3% perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string.copyout.copy_page_to_iter.pipe_read.new_sync_read
0.92 -0.1 0.83 perf-profile.calltrace.cycles-pp.touch_atime.pipe_read.new_sync_read.vfs_read.ksys_read
1.07 ± 2% -0.1 0.98 perf-profile.calltrace.cycles-pp.security_file_permission.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.78 -0.1 0.70 perf-profile.calltrace.cycles-pp.atime_needs_update.touch_atime.pipe_read.new_sync_read.vfs_read
0.72 ± 2% -0.1 0.64 perf-profile.calltrace.cycles-pp.file_update_time.pipe_write.new_sync_write.vfs_write.ksys_write
0.60 ± 3% -0.1 0.52 ± 2% perf-profile.calltrace.cycles-pp.common_file_perm.security_file_permission.vfs_read.ksys_read.do_syscall_64
2.60 -0.1 2.54 perf-profile.calltrace.cycles-pp.select_idle_sibling.select_task_rq_fair.try_to_wake_up.autoremove_wake_function.__wake_up_common
0.68 -0.1 0.62 perf-profile.calltrace.cycles-pp.ttwu_do_wakeup.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
0.64 -0.1 0.58 ± 2% perf-profile.calltrace.cycles-pp.check_preempt_curr.ttwu_do_wakeup.try_to_wake_up.autoremove_wake_function.__wake_up_common
0.73 ± 2% -0.1 0.68 perf-profile.calltrace.cycles-pp.common_file_perm.security_file_permission.vfs_write.ksys_write.do_syscall_64
0.64 ± 3% -0.0 0.59 perf-profile.calltrace.cycles-pp.__fdget_pos.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
0.59 ± 2% -0.0 0.54 ± 2% perf-profile.calltrace.cycles-pp.wake_up_q.__mutex_unlock_slowpath.pipe_read.new_sync_read.vfs_read
0.60 -0.0 0.56 ± 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__wake_up_common_lock.pipe_write.new_sync_write.vfs_write
0.73 ± 3% -0.0 0.69 ± 3% perf-profile.calltrace.cycles-pp.update_rq_clock.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
0.57 ± 3% -0.0 0.53 perf-profile.calltrace.cycles-pp.__fget_light.__fdget_pos.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.82 ± 3% +0.1 0.89 perf-profile.calltrace.cycles-pp.update_load_avg.dequeue_entity.dequeue_task_fair.__schedule.schedule
0.58 ± 2% +0.1 0.64 perf-profile.calltrace.cycles-pp.update_load_avg.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.autoremove_wake_function
0.67 ± 2% +0.1 0.78 ± 3% perf-profile.calltrace.cycles-pp.update_load_avg.enqueue_entity.enqueue_task_fair.ttwu_do_activate.try_to_wake_up
40.69 +0.2 40.87 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.write
40.28 +0.2 40.52 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
50.44 +0.3 50.71 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read
49.94 +0.3 50.23 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
39.38 +0.3 39.69 perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
1.02 ± 15% +0.3 1.33 ± 14% perf-profile.calltrace.cycles-pp.set_task_cpu.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
48.24 +0.4 48.60 perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
38.39 +0.4 38.78 perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
47.40 +0.4 47.83 perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
13.88 +0.5 14.42 perf-profile.calltrace.cycles-pp.schedule.pipe_read.new_sync_read.vfs_read.ksys_read
36.68 +0.5 37.22 perf-profile.calltrace.cycles-pp.new_sync_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
13.70 +0.5 14.25 perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_read.new_sync_read.vfs_read
45.72 +0.6 46.31 perf-profile.calltrace.cycles-pp.new_sync_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
35.80 +0.6 36.42 perf-profile.calltrace.cycles-pp.pipe_write.new_sync_write.vfs_write.ksys_write.do_syscall_64
44.85 +0.7 45.51 perf-profile.calltrace.cycles-pp.pipe_read.new_sync_read.vfs_read.ksys_read.do_syscall_64
6.15 +0.7 6.84 ± 2% perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.pipe_read.new_sync_read
1.98 ± 9% +0.7 2.70 ± 9% perf-profile.calltrace.cycles-pp.update_cfs_group.dequeue_task_fair.__schedule.schedule.pipe_read
2.03 ± 8% +0.7 2.78 ± 9% perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.autoremove_wake_function
15.96 +1.0 16.95 perf-profile.calltrace.cycles-pp.prepare_to_wait_event.pipe_read.new_sync_read.vfs_read.ksys_read
6.67 +1.0 7.68 ± 2% perf-profile.calltrace.cycles-pp.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common
6.77 +1.0 7.79 ± 2% perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
14.46 +1.0 15.48 ± 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_read.new_sync_read.vfs_read
13.73 +1.1 14.79 ± 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_read.new_sync_read
26.95 +1.4 28.34 perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_write.new_sync_write.vfs_write.ksys_write
25.85 +1.5 27.32 perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_write.new_sync_write.vfs_write
25.18 +1.5 26.69 perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write.new_sync_write
24.61 +1.5 26.14 perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write
3.12 ± 2% -0.3 2.82 perf-profile.children.cycles-pp.copy_page_to_iter
2.98 -0.3 2.71 perf-profile.children.cycles-pp.copy_page_from_iter
3.47 -0.3 3.21 ± 2% perf-profile.children.cycles-pp.mutex_lock
2.25 -0.2 2.04 perf-profile.children.cycles-pp.syscall_return_via_sysret
2.10 ± 2% -0.2 1.89 perf-profile.children.cycles-pp.security_file_permission
2.60 ± 2% -0.2 2.40 ± 2% perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
2.48 -0.2 2.29 ± 2% perf-profile.children.cycles-pp.__mutex_unlock_slowpath
2.08 -0.2 1.90 perf-profile.children.cycles-pp.__entry_text_start
1.86 -0.2 1.71 perf-profile.children.cycles-pp.__might_resched
1.45 ± 2% -0.1 1.31 ± 2% perf-profile.children.cycles-pp.common_file_perm
1.28 ± 2% -0.1 1.17 ± 2% perf-profile.children.cycles-pp.copyin
1.51 ± 3% -0.1 1.40 ± 3% perf-profile.children.cycles-pp.copyout
1.21 ± 2% -0.1 1.11 perf-profile.children.cycles-pp.__fdget_pos
1.07 ± 2% -0.1 0.96 ± 2% perf-profile.children.cycles-pp.__might_sleep
0.99 -0.1 0.89 perf-profile.children.cycles-pp.touch_atime
0.95 ± 2% -0.1 0.86 ± 2% perf-profile.children.cycles-pp.current_time
1.04 ± 2% -0.1 0.96 perf-profile.children.cycles-pp.__fget_light
0.83 -0.1 0.74 perf-profile.children.cycles-pp.atime_needs_update
0.90 -0.1 0.81 ± 2% perf-profile.children.cycles-pp.__cond_resched
1.00 ± 2% -0.1 0.91 ± 2% perf-profile.children.cycles-pp.wake_up_q
0.81 -0.1 0.73 ± 2% perf-profile.children.cycles-pp.__might_fault
0.78 ± 2% -0.1 0.71 ± 2% perf-profile.children.cycles-pp.file_update_time
1.24 -0.1 1.17 ± 2% perf-profile.children.cycles-pp.mutex_unlock
2.68 -0.1 2.61 perf-profile.children.cycles-pp.select_idle_sibling
1.81 ± 2% -0.1 1.74 perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.66 -0.1 0.60 perf-profile.children.cycles-pp.check_preempt_curr
1.15 -0.1 1.09 perf-profile.children.cycles-pp.available_idle_cpu
0.69 -0.1 0.63 perf-profile.children.cycles-pp.ttwu_do_wakeup
0.50 -0.1 0.44 ± 2% perf-profile.children.cycles-pp.check_preempt_wakeup
0.50 ± 3% -0.0 0.45 ± 11% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.58 ± 2% -0.0 0.53 ± 2% perf-profile.children.cycles-pp.wake_q_add
0.45 ± 2% -0.0 0.40 ± 2% perf-profile.children.cycles-pp.syscall_enter_from_user_mode
0.47 ± 3% -0.0 0.42 ± 2% perf-profile.children.cycles-pp.aa_file_perm
0.46 ± 3% -0.0 0.42 ± 3% perf-profile.children.cycles-pp.kill_fasync
1.82 -0.0 1.78 perf-profile.children.cycles-pp.select_idle_cpu
0.38 ± 2% -0.0 0.34 ± 2% perf-profile.children.cycles-pp.rcu_all_qs
0.35 ± 2% -0.0 0.32 ± 2% perf-profile.children.cycles-pp.timestamp_truncate
0.24 ± 2% -0.0 0.22 ± 2% perf-profile.children.cycles-pp.rw_verify_area
0.21 -0.0 0.19 perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.19 ± 3% -0.0 0.16 ± 3% perf-profile.children.cycles-pp.iov_iter_init
0.23 -0.0 0.21 ± 3% perf-profile.children.cycles-pp.anon_pipe_buf_release
0.13 -0.0 0.12 ± 4% perf-profile.children.cycles-pp.apparmor_file_permission
0.15 -0.0 0.14 ± 2% perf-profile.children.cycles-pp.pick_next_entity
3.25 +0.2 3.49 perf-profile.children.cycles-pp.update_load_avg
39.46 +0.3 39.77 perf-profile.children.cycles-pp.ksys_write
1.03 ± 15% +0.3 1.35 ± 14% perf-profile.children.cycles-pp.set_task_cpu
48.31 +0.4 48.66 perf-profile.children.cycles-pp.ksys_read
38.52 +0.4 38.90 perf-profile.children.cycles-pp.vfs_write
47.48 +0.4 47.90 perf-profile.children.cycles-pp.vfs_read
91.24 +0.5 91.69 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
90.53 +0.5 91.04 perf-profile.children.cycles-pp.do_syscall_64
13.98 +0.5 14.51 perf-profile.children.cycles-pp.schedule
36.77 +0.5 37.30 perf-profile.children.cycles-pp.new_sync_write
13.88 +0.5 14.42 perf-profile.children.cycles-pp.__schedule
45.80 +0.6 46.39 perf-profile.children.cycles-pp.new_sync_read
36.10 +0.6 36.70 perf-profile.children.cycles-pp.pipe_write
45.16 +0.6 45.81 perf-profile.children.cycles-pp.pipe_read
6.21 +0.7 6.90 ± 2% perf-profile.children.cycles-pp.dequeue_task_fair
16.24 +1.0 17.22 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
16.02 +1.0 17.01 perf-profile.children.cycles-pp.prepare_to_wait_event
6.74 +1.0 7.76 ± 2% perf-profile.children.cycles-pp.enqueue_task_fair
6.83 +1.0 7.86 ± 2% perf-profile.children.cycles-pp.ttwu_do_activate
15.12 +1.1 16.22 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
27.07 +1.4 28.44 perf-profile.children.cycles-pp.__wake_up_common_lock
25.88 +1.5 27.35 perf-profile.children.cycles-pp.__wake_up_common
25.48 +1.5 26.96 perf-profile.children.cycles-pp.try_to_wake_up
4.24 ± 8% +1.5 5.73 ± 9% perf-profile.children.cycles-pp.update_cfs_group
25.19 +1.5 26.71 perf-profile.children.cycles-pp.autoremove_wake_function
2.24 -0.2 2.04 perf-profile.self.cycles-pp.syscall_return_via_sysret
2.54 ± 2% -0.2 2.34 ± 3% perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
2.15 ± 2% -0.1 2.01 ± 3% perf-profile.self.cycles-pp.mutex_lock
1.70 -0.1 1.57 perf-profile.self.cycles-pp.__might_resched
1.44 -0.1 1.30 perf-profile.self.cycles-pp.pipe_write
2.08 -0.1 1.95 ± 2% perf-profile.self.cycles-pp.pipe_read
1.54 -0.1 1.44 ± 3% perf-profile.self.cycles-pp._raw_spin_lock_irq
1.02 ± 2% -0.1 0.91 ± 2% perf-profile.self.cycles-pp.common_file_perm
0.79 ± 2% -0.1 0.69 perf-profile.self.cycles-pp.copy_page_to_iter
2.24 -0.1 2.15 ± 3% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.99 ± 2% -0.1 0.91 perf-profile.self.cycles-pp.__fget_light
0.87 ± 2% -0.1 0.79 ± 2% perf-profile.self.cycles-pp.__might_sleep
0.78 -0.1 0.70 ± 3% perf-profile.self.cycles-pp.copy_page_from_iter
0.92 ± 2% -0.1 0.84 ± 3% perf-profile.self.cycles-pp.write
0.76 ± 2% -0.1 0.70 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
1.17 -0.1 1.10 ± 2% perf-profile.self.cycles-pp.mutex_unlock
0.59 ± 2% -0.1 0.53 perf-profile.self.cycles-pp.vfs_write
1.02 -0.1 0.96 ± 2% perf-profile.self.cycles-pp._raw_spin_lock
1.10 -0.1 1.05 perf-profile.self.cycles-pp.available_idle_cpu
0.57 -0.1 0.52 perf-profile.self.cycles-pp.new_sync_write
0.89 ± 2% -0.1 0.83 ± 3% perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
0.38 ± 2% -0.1 0.33 ± 2% perf-profile.self.cycles-pp.check_preempt_wakeup
0.59 ± 3% -0.1 0.53 perf-profile.self.cycles-pp.new_sync_read
0.57 ± 2% -0.1 0.51 perf-profile.self.cycles-pp.security_file_permission
0.66 -0.1 0.61 ± 2% perf-profile.self.cycles-pp.__mutex_unlock_slowpath
0.55 ± 2% -0.0 0.50 perf-profile.self.cycles-pp.__entry_text_start
0.53 ± 2% -0.0 0.49 ± 2% perf-profile.self.cycles-pp.do_syscall_64
0.47 -0.0 0.43 perf-profile.self.cycles-pp.__cond_resched
0.63 ± 2% -0.0 0.60 perf-profile.self.cycles-pp.vfs_read
0.55 ± 2% -0.0 0.51 ± 3% perf-profile.self.cycles-pp.wake_q_add
0.37 ± 4% -0.0 0.33 ± 5% perf-profile.self.cycles-pp.kill_fasync
0.39 ± 2% -0.0 0.36 ± 2% perf-profile.self.cycles-pp.aa_file_perm
0.38 ± 2% -0.0 0.35 ± 2% perf-profile.self.cycles-pp.syscall_enter_from_user_mode
0.34 ± 2% -0.0 0.30 ± 2% perf-profile.self.cycles-pp.atime_needs_update
0.36 ± 3% -0.0 0.32 ± 2% perf-profile.self.cycles-pp.current_time
0.31 ± 2% -0.0 0.28 ± 2% perf-profile.self.cycles-pp.timestamp_truncate
0.36 ± 3% -0.0 0.33 ± 2% perf-profile.self.cycles-pp.exit_to_user_mode_prepare
0.34 ± 2% -0.0 0.32 ± 2% perf-profile.self.cycles-pp.__wake_up_common_lock
0.30 -0.0 0.28 perf-profile.self.cycles-pp.file_update_time
0.37 ± 3% -0.0 0.34 ± 3% perf-profile.self.cycles-pp.enqueue_entity
0.33 ± 2% -0.0 0.31 ± 2% perf-profile.self.cycles-pp.wake_up_q
0.26 ± 2% -0.0 0.24 perf-profile.self.cycles-pp.rcu_all_qs
0.28 -0.0 0.26 perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.30 ± 2% -0.0 0.27 ± 2% perf-profile.self.cycles-pp.ksys_write
0.18 ± 2% -0.0 0.16 ± 2% perf-profile.self.cycles-pp.__fdget_pos
0.53 -0.0 0.51 perf-profile.self.cycles-pp.select_idle_sibling
0.18 ± 2% -0.0 0.16 ± 3% perf-profile.self.cycles-pp.rw_verify_area
0.18 ± 2% -0.0 0.16 ± 2% perf-profile.self.cycles-pp.touch_atime
0.20 ± 2% -0.0 0.18 ± 2% perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
0.14 ± 4% -0.0 0.12 ± 3% perf-profile.self.cycles-pp.iov_iter_init
0.15 ± 4% -0.0 0.13 ± 4% perf-profile.self.cycles-pp.__might_fault
0.10 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.copyout
0.10 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.copyin
0.36 ± 2% +0.2 0.53 ± 2% perf-profile.self.cycles-pp.enqueue_task_fair
1.43 ± 2% +0.3 1.69 ± 2% perf-profile.self.cycles-pp.update_load_avg
0.85 ± 18% +0.3 1.17 ± 16% perf-profile.self.cycles-pp.set_task_cpu
9.31 ± 2% +0.5 9.81 perf-profile.self.cycles-pp.try_to_wake_up
15.10 +1.1 16.21 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
4.21 ± 8% +1.5 5.71 ± 9% perf-profile.self.cycles-pp.update_cfs_group




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
config-5.16.0-08306-g2d02fa8cc21a (176.76 kB)
job-script (8.08 kB)
job.yaml (5.55 kB)
reproduce (352.00 B)

2022-04-01 09:30:09

by Yu Chen

Subject: Re: [sched/pelt] 2d02fa8cc2: stress-ng.pipeherd.ops_per_sec -9.7% regression

Hi Vincent,

On Wed, Feb 9, 2022 at 1:17 PM kernel test robot <[email protected]> wrote:
>
>
>
> Greeting,
>
> FYI, we noticed a -9.7% regression of stress-ng.pipeherd.ops_per_sec due to commit:
>
>
> commit: 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync of load_sum with load_avg")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: stress-ng
> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> with following parameters:
>
> nr_threads: 100%
> testtime: 60s
> class: memory
> test: pipeherd
> cpufreq_governor: performance
> ucode: 0xd000280
>
This week we re-ran the test and it seems that this regression is
still there.
As we are evaluating whether this report is valid or whether the
slowdown is expected, we would appreciate your suggestions on further
steps:

1. If I understand correctly,
2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync
of load_sum with load_avg")
fixed the calculation of load_sum. Before this patch, part of the
contribution would be 'skipped', which caused load_sum to be lower
than expected.
2. If the above is true, the load_sum becomes higher after this patch.
Is there a scenario in which a higher load_sum added to one cfs_rq
brings more imbalance between this group and other sched_groups,
and thus more task migrations/wakeups? (Because in the perf result
below, it seems that there are slightly more task wakeups with this
patch applied.)
3. Considering that the 9.7% regression is not that high, do you think
the lkp team should continue to track this issue, or just close it
as documented?

Best,
Yu
>
> commit:
> 95246d1ec8 ("sched/pelt: Relax the sync of runnable_sum with runnable_avg")
> 2d02fa8cc2 ("sched/pelt: Relax the sync of load_sum with load_avg")
>
> 95246d1ec80b8d19 2d02fa8cc21a93da35cfba462bf
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 0.21 +11.0% 0.24 ± 2% stress-ng.pipeherd.context_switches_per_bogo_op
> 3.869e+09 -9.7% 3.494e+09 stress-ng.pipeherd.ops
> 64412021 -9.7% 58171101 stress-ng.pipeherd.ops_per_sec
> 442.37 -7.2% 410.54 stress-ng.time.user_time
> 5.49 ± 2% -0.5 4.94 ± 4% mpstat.cpu.all.usr%
> 80705 ± 7% +26.7% 102266 ± 17% numa-meminfo.node1.Active
> 80705 ± 7% +26.7% 102266 ± 17% numa-meminfo.node1.Active(anon)
> 12324 ± 3% -22.1% 9603 ± 25% softirqs.CPU106.RCU
> 12703 ± 4% -23.1% 9769 ± 24% softirqs.CPU27.RCU
> 15.96 +1.0 16.95 perf-profile.calltrace.cycles-pp.prepare_to_wait_event.pipe_read.new_sync_read.vfs_read.ksys_read
> 6.67 +1.0 7.68 ± 2% perf-profile.calltrace.cycles-pp.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common
> 6.77 +1.0 7.79 ± 2% perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
> 14.46 +1.0 15.48 ± 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_read.new_sync_read.vfs_read
> 13.73 +1.1 14.79 ± 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_read.new_sync_read
> 26.95 +1.4 28.34 perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_write.new_sync_write.vfs_write.ksys_write
> 25.85 +1.5 27.32 perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_write.new_sync_write.vfs_write
> 25.18 +1.5 26.69 perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write.new_sync_write
> 24.61 +1.5 26.14 perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write

2022-04-01 10:37:30

by Vincent Guittot

Subject: Re: [sched/pelt] 2d02fa8cc2: stress-ng.pipeherd.ops_per_sec -9.7% regression

On Thu, 31 Mar 2022 at 16:19, Chen Yu <[email protected]> wrote:
>
> Hi Vincent,
>
> On Wed, Feb 9, 2022 at 1:17 PM kernel test robot <[email protected]> wrote:
> >
> >
> >
> > Greeting,
> >
> > FYI, we noticed a -9.7% regression of stress-ng.pipeherd.ops_per_sec due to commit:
> >
> >
> > commit: 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync of load_sum with load_avg")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > in testcase: stress-ng
> > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> > with following parameters:
> >
> > nr_threads: 100%
> > testtime: 60s
> > class: memory
> > test: pipeherd
> > cpufreq_governor: performance
> > ucode: 0xd000280
> >
> This week we have re-run the test result and it seems that this
> regression is still there.
> As we are evaluating whether this report is valid or if the
> downgrading is expected, appreciated
> if you could give suggestion on further steps:
>
> 1. If I understand correctly,
> 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync
> of load_sum with load_avg")
> fixed the calculating of load_sum. Before this patch the
> contribution part would be 'skipped' and caused the load_sum
> to be lower than expected.

Yes, you understand it correctly

> 2. If above is true, after this patch, the load_sum becomes higher. Is
> there a scenario that higher load_sum added to 1 cfs_rq brings
> more imbalance between this group and other sched_group, thus
> brings more task migration/wake up? (because in below perf result,
> it seems that, with this patch applied, there are slightly more
> take wake up)

This change should not impact load balancing, as that only does
comparisons, and I expect the load increase to happen on all cfs_rqs.
The only place that could be impacted would be wake_affine_weight(),
because it removes the task's load from the previous cfs_rq's load
before comparing.
The task's load was not impacted by the underestimate, which means
that the load of the prev cfs_rq could be seen as lower than the
current cfs_rq after subtracting the task's load, whereas both cfs_rqs
were similarly underestimated.
Now the load of the prev cfs_rq is no longer underestimated; it
becomes comparable to, or slightly higher than, the current cfs_rq,
and the task migrates to the current cfs_rq instead of staying on the
prev one at wakeup.

One possible test would be to run the benchmark with the WA_WEIGHT
scheduler feature disabled and check whether there is still a
difference.
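On a kernel of this vintage the feature can presumably be toggled at
runtime through the scheduler features interface (assuming
CONFIG_SCHED_DEBUG is enabled and debugfs is mounted), e.g. with
"echo NO_WA_WEIGHT > /sys/kernel/debug/sched/features", and re-enabled
by echoing WA_WEIGHT into the same file.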

> 3. Consider the 9.7% downgrading is not that high, do you think if
> lkp team should continue track this issue or just close it
> as documented?
>
> Best,
> Yu
> >
> > commit:
> > 95246d1ec8 ("sched/pelt: Relax the sync of runnable_sum with runnable_avg")
> > 2d02fa8cc2 ("sched/pelt: Relax the sync of load_sum with load_avg")
> >
> > 95246d1ec80b8d19 2d02fa8cc21a93da35cfba462bf
> > ---------------- ---------------------------
> > %stddev %change %stddev
> > \ | \
> > 0.21 +11.0% 0.24 ± 2% stress-ng.pipeherd.context_switches_per_bogo_op
> > 3.869e+09 -9.7% 3.494e+09 stress-ng.pipeherd.ops
> > 64412021 -9.7% 58171101 stress-ng.pipeherd.ops_per_sec
> > 442.37 -7.2% 410.54 stress-ng.time.user_time
> > 5.49 ± 2% -0.5 4.94 ± 4% mpstat.cpu.all.usr%
> > 80705 ± 7% +26.7% 102266 ± 17% numa-meminfo.node1.Active
> > 80705 ± 7% +26.7% 102266 ± 17% numa-meminfo.node1.Active(anon)
> > 12324 ± 3% -22.1% 9603 ± 25% softirqs.CPU106.RCU
> > 12703 ± 4% -23.1% 9769 ± 24% softirqs.CPU27.RCU
> > 15.96 +1.0 16.95 perf-profile.calltrace.cycles-pp.prepare_to_wait_event.pipe_read.new_sync_read.vfs_read.ksys_read
> > 6.67 +1.0 7.68 ± 2% perf-profile.calltrace.cycles-pp.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common
> > 6.77 +1.0 7.79 ± 2% perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
> > 14.46 +1.0 15.48 ± 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_read.new_sync_read.vfs_read
> > 13.73 +1.1 14.79 ± 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_read.new_sync_read
> > 26.95 +1.4 28.34 perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_write.new_sync_write.vfs_write.ksys_write
> > 25.85 +1.5 27.32 perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_write.new_sync_write.vfs_write
> > 25.18 +1.5 26.69 perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write.new_sync_write
> > 24.61 +1.5 26.14 perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write

2022-04-03 06:46:54

by Yu Chen

Subject: Re: [sched/pelt] 2d02fa8cc2: stress-ng.pipeherd.ops_per_sec -9.7% regression

On Fri, Apr 1, 2022 at 12:17 AM Vincent Guittot
<[email protected]> wrote:
>
> On Thu, 31 Mar 2022 at 16:19, Chen Yu <[email protected]> wrote:
> >
> > Hi Vincent,
> >
> > On Wed, Feb 9, 2022 at 1:17 PM kernel test robot <[email protected]> wrote:
> > >
> > >
> > >
> > > Greeting,
> > >
> > > FYI, we noticed a -9.7% regression of stress-ng.pipeherd.ops_per_sec due to commit:
> > >
> > >
> > > commit: 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync of load_sum with load_avg")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > in testcase: stress-ng
> > > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> > > with following parameters:
> > >
> > > nr_threads: 100%
> > > testtime: 60s
> > > class: memory
> > > test: pipeherd
> > > cpufreq_governor: performance
> > > ucode: 0xd000280
> > >
> > This week we have re-run the test result and it seems that this
> > regression is still there.
> > As we are evaluating whether this report is valid or if the
> > downgrading is expected, appreciated
> > if you could give suggestion on further steps:
> >
> > 1. If I understand correctly,
> > 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync
> > of load_sum with load_avg")
> > fixed the calculating of load_sum. Before this patch the
> > contribution part would be 'skipped' and caused the load_sum
> > to be lower than expected.
>
> Yes, you understand it correctly
>
> > 2. If above is true, after this patch, the load_sum becomes higher. Is
> > there a scenario that higher load_sum added to 1 cfs_rq brings
> > more imbalance between this group and other sched_group, thus
> > brings more task migration/wake up? (because in below perf result,
> > it seems that, with this patch applied, there are slightly more
> > take wake up)
>
> This change should not impact load balance as it only does comparison
> and I expect the load increase to happen on all cfs rq.
> The only place that could be impacted, would be wake_affine_weight()
> because it removes task load from previous cfs rq load before
> comparing.
> The task's load was not impacted by the underestimate which means that
> the load of prev cfs might be seen lower than current cfs after
> subtracting the task's load whereas both cfs rqs were similarly
> underestimated.
> Now the load of prev cfs rq is not underestimated and becomes
> comparable or slightly higher than the current cfs and the task
> migrate on current cfs instead of staying on prev one at wakeup
>
Could you please elaborate a little more on this scenario? Since both
the current and previous cfs_rqs were underestimated, how could the
previous cfs_rq have a lower load than the current one before this
patch was applied?

Say the previous cfs_rq has a load of L1, the current cfs_rq has a
load of L2, and the woken task has a load of h. Then
wake_affine_weight() compares L1 - h with L2 + h, and when L1 < L2 + 2h
the task will remain on the previous CPU. Since L1 and L2 were
underestimated on the same scale, I'm not quite sure how this patch
would affect the choice between the prev and current CPU.
> One possible test would be to run the test with WA_WEIGHT features
> disable and check if there is still a difference
>
Yes, after disabling WA_WEIGHT, the performance came back.
The following scores are the stress-ng.pipeherd.ops_per_sec output:
                   WA_WEIGHT: yes     WA_WEIGHT: no
  -------------------------------------------------
  patched: yes     58069733.01        69940547.7*
  patched: no      64591593.69        73503396.9
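
So, if I read these numbers right, disabling WA_WEIGHT helps both
kernels, and with it disabled the patched kernel is within roughly 5%
of the unpatched one, versus the roughly 10% gap seen with WA_WEIGHT
enabled.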

--
Thanks,
Chenyu

2022-04-05 01:36:07

by Vincent Guittot

Subject: Re: [sched/pelt] 2d02fa8cc2: stress-ng.pipeherd.ops_per_sec -9.7% regression

On Fri, 1 Apr 2022 at 20:32, Chen Yu <[email protected]> wrote:
>
> On Fri, Apr 1, 2022 at 12:17 AM Vincent Guittot
> <[email protected]> wrote:
> >
> > On Thu, 31 Mar 2022 at 16:19, Chen Yu <[email protected]> wrote:
> > >
> > > Hi Vincent,
> > >
> > > On Wed, Feb 9, 2022 at 1:17 PM kernel test robot <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > Greeting,
> > > >
> > > > FYI, we noticed a -9.7% regression of stress-ng.pipeherd.ops_per_sec due to commit:
> > > >
> > > >
> > > > commit: 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync of load_sum with load_avg")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > >
> > > > in testcase: stress-ng
> > > > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> > > > with following parameters:
> > > >
> > > > nr_threads: 100%
> > > > testtime: 60s
> > > > class: memory
> > > > test: pipeherd
> > > > cpufreq_governor: performance
> > > > ucode: 0xd000280
> > > >
> > > This week we have re-run the test result and it seems that this
> > > regression is still there.
> > > As we are evaluating whether this report is valid or if the
> > > downgrading is expected, appreciated
> > > if you could give suggestion on further steps:
> > >
> > > 1. If I understand correctly,
> > > 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync
> > > of load_sum with load_avg")
> > > fixed the calculating of load_sum. Before this patch the
> > > contribution part would be 'skipped' and caused the load_sum
> > > to be lower than expected.
> >
> > Yes, you understand it correctly
> >
> > > 2. If above is true, after this patch, the load_sum becomes higher. Is
> > > there a scenario that higher load_sum added to 1 cfs_rq brings
> > > more imbalance between this group and other sched_group, thus
> > > brings more task migration/wake up? (because in below perf result,
> > > it seems that, with this patch applied, there are slightly more
> > > take wake up)
> >
> > This change should not impact load balance as it only does comparison
> > and I expect the load increase to happen on all cfs rq.
> > The only place that could be impacted, would be wake_affine_weight()
> > because it removes task load from previous cfs rq load before
> > comparing.
> > The task's load was not impacted by the underestimate which means that
> > the load of prev cfs might be seen lower than current cfs after
> > subtracting the task's load whereas both cfs rqs were similarly
> > underestimated.
> > Now the load of prev cfs rq is not underestimated and becomes
> > comparable or slightly higher than the current cfs and the task
> > migrate on current cfs instead of staying on prev one at wakeup
> >
> Could you please elaborate a little more on this scenario, since both current
> and previous cfs rqs were underestimated, how could previous cfs rq has
> lower load than the current one before applying this patch?
>
> Say, suppose the previous cfs rq has a load of L1, and current cfs rq has
> a load of L2, the waken task has a load of h, then wake_affine_weight()
> compares L1 - h with L2 + h , when L1 < L2 + 2h, the task will remain on
> previous CPU. Since L1 and L2 were underestimated in the same scale,
> I'm not quite sure how this patch would affect the choice between
> prev and current CPU.

Let's take the example of this_cpu load L1 = 0 and prev_cpu load
L2 = 2h'+d, where h' reflects h in the cpu load and d is a small delta
load. The task will migrate if we have the condition below:

h < 2h'-h+d

With this patch, we assume that h' == h, as we no longer underestimate
the load of the cfs_rqs. The condition for migrating the task becomes:
h < h+d
and the task will migrate to this CPU as soon as there is a small load
on prev_cpu in addition to the 2h.

Without the patch, the loads of the cfs_rqs are underestimated, which
means that the task's load is underestimated in the cfs_rq. This can
be described as h' == h-U, U being the underestimated part. In this
case the condition to migrate the task becomes:
h < h-2U+d
The task will migrate to this CPU only if d is large enough to
compensate for the underestimation, so we will migrate less often.
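
To make this concrete, here is a small stand-alone sketch of that
comparison. It is not the kernel's wake_affine_weight(): capacity
scaling, WA_BIAS and the sync case are ignored, and the values h = 100,
U = 20 and d = 30 are made up purely for illustration.

/*
 * Simplified model of the wake_affine_weight() comparison discussed
 * above: the wakee's load h would be added to the waking CPU and
 * removed from the CPU it last ran on, and the smaller effective
 * load wins.  All values are arbitrary PELT-like load units.
 */
#include <stdio.h>
#include <stdbool.h>

static bool migrates(long this_load, long prev_load, long task_load)
{
    long this_eff_load = this_load + task_load; /* wakee added here */
    long prev_eff_load = prev_load - task_load; /* wakee removed there */

    return this_eff_load < prev_eff_load;
}

int main(void)
{
    long h = 100, d = 30, U = 20; /* made-up example values */

    /* Patched: prev cfs_rq load is 2h'+d with h' == h. */
    printf("patched:   migrate = %d\n", migrates(0, 2 * h + d, h));

    /* Unpatched: each task underestimated by U, so h' == h - U. */
    printf("unpatched: migrate = %d\n", migrates(0, 2 * (h - U) + d, h));

    return 0;
}

With these made-up numbers the patched case migrates (100 < 130) while
the unpatched case does not (100 < 90 is false), which is consistent
with the extra wakeups/migrations visible in the profile above.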

> > One possible test would be to run the test with WA_WEIGHT features
> > disable and check if there is still a difference
> >
> Yes, after disabling WA_WEIGHT, the performance came back.
> The following score is the output of stress-ng.pipeherd.ops_per_sec
>
> WA_WEIGHT yes no
> -------------------
> patched
> yes 58069733.01 69940547.7*
> no 64591593.69 73503396.9
>
> --
> Thanks,
> Chenyu

2022-04-06 13:14:58

by Yu Chen

Subject: Re: [sched/pelt] 2d02fa8cc2: stress-ng.pipeherd.ops_per_sec -9.7% regression

On Mon, Apr 4, 2022 at 5:53 PM Vincent Guittot
<[email protected]> wrote:
>
> On Fri, 1 Apr 2022 at 20:32, Chen Yu <[email protected]> wrote:
> >
> > On Fri, Apr 1, 2022 at 12:17 AM Vincent Guittot
> > <[email protected]> wrote:
> > >
> > > On Thu, 31 Mar 2022 at 16:19, Chen Yu <[email protected]> wrote:
> > > >
> > > > Hi Vincent,
> > > >
> > > > On Wed, Feb 9, 2022 at 1:17 PM kernel test robot <[email protected]> wrote:
> > > > >
> > > > >
> > > > >
> > > > > Greeting,
> > > > >
> > > > > FYI, we noticed a -9.7% regression of stress-ng.pipeherd.ops_per_sec due to commit:
> > > > >
> > > > >
> > > > > commit: 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync of load_sum with load_avg")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > >
> > > > > in testcase: stress-ng
> > > > > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> > > > > with following parameters:
> > > > >
> > > > > nr_threads: 100%
> > > > > testtime: 60s
> > > > > class: memory
> > > > > test: pipeherd
> > > > > cpufreq_governor: performance
> > > > > ucode: 0xd000280
> > > > >
> > > > This week we have re-run the test result and it seems that this
> > > > regression is still there.
> > > > As we are evaluating whether this report is valid or if the
> > > > downgrading is expected, appreciated
> > > > if you could give suggestion on further steps:
> > > >
> > > > 1. If I understand correctly,
> > > > 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync
> > > > of load_sum with load_avg")
> > > > fixed the calculating of load_sum. Before this patch the
> > > > contribution part would be 'skipped' and caused the load_sum
> > > > to be lower than expected.
> > >
> > > Yes, you understand it correctly
> > >
> > > > 2. If above is true, after this patch, the load_sum becomes higher. Is
> > > > there a scenario that higher load_sum added to 1 cfs_rq brings
> > > > more imbalance between this group and other sched_group, thus
> > > > brings more task migration/wake up? (because in below perf result,
> > > > it seems that, with this patch applied, there are slightly more
> > > > take wake up)
> > >
> > > This change should not impact load balance as it only does comparison
> > > and I expect the load increase to happen on all cfs rq.
> > > The only place that could be impacted, would be wake_affine_weight()
> > > because it removes task load from previous cfs rq load before
> > > comparing.
> > > The task's load was not impacted by the underestimate which means that
> > > the load of prev cfs might be seen lower than current cfs after
> > > subtracting the task's load whereas both cfs rqs were similarly
> > > underestimated.
> > > Now the load of prev cfs rq is not underestimated and becomes
> > > comparable or slightly higher than the current cfs and the task
> > > migrate on current cfs instead of staying on prev one at wakeup
> > >
> > Could you please elaborate a little more on this scenario, since both current
> > and previous cfs rqs were underestimated, how could previous cfs rq has
> > lower load than the current one before applying this patch?
> >
> > Say, suppose the previous cfs rq has a load of L1, and current cfs rq has
> > a load of L2, the waken task has a load of h, then wake_affine_weight()
> > compares L1 - h with L2 + h , when L1 < L2 + 2h, the task will remain on
> > previous CPU. Since L1 and L2 were underestimated in the same scale,
> > I'm not quite sure how this patch would affect the choice between
> > prev and current CPU.
>
> Let's take the example of this_cpu load L1 = 0 and prev_cpu load L2 =
> 2h'+d. h' reflects h in the cpu load and d is a small delta load. The
> task will migrate if we have the condition below:
>
> h < 2h'-h+d
>
> With this patch, we assume that h' == h as we don't underestimate the
> load of cfs rqs anymore. The condition for migrating the task is :
> h < h+d
> And the task will migrate on this cpu as soon as there is a small load
> on prev_cpu in addition to the 2h.
>
> Without the patch, the load of cfs_rqs are underestimated which means
> that the task's load is underestimated in the cfs rq. This can be
> described as h' == h-U. U being the underestimated part. In this case
> the condition to migrate the task becomes:
> h < h-2U+d
> The task will migrate on this cpu is d is large enough to compensate
> the underestimation so we will migrate less often
>
I see. Thanks for this example! So in this scenario, when the previous
CPU has a somewhat higher load than the current CPU, the scheduler
without this patch would have more chances to keep the task on the
previous CPU, and thus fewer migrations and less rq lock contention
(according to the perf result).
I don't have a good idea in mind on how to deal with this case, other
than disabling WA_WEIGHT or WA_BIAS (which prefers 'this' CPU, I
suppose).

--
Thanks,
Chenyu

2022-04-08 10:34:54

by Vincent Guittot

Subject: Re: [sched/pelt] 2d02fa8cc2: stress-ng.pipeherd.ops_per_sec -9.7% regression

On Tue, 5 Apr 2022 at 16:23, Chen Yu <[email protected]> wrote:
>
> On Mon, Apr 4, 2022 at 5:53 PM Vincent Guittot
> <[email protected]> wrote:
> >
> > On Fri, 1 Apr 2022 at 20:32, Chen Yu <[email protected]> wrote:
> > >
> > > On Fri, Apr 1, 2022 at 12:17 AM Vincent Guittot
> > > <[email protected]> wrote:
> > > >
> > > > On Thu, 31 Mar 2022 at 16:19, Chen Yu <[email protected]> wrote:
> > > > >
> > > > > Hi Vincent,
> > > > >
> > > > > On Wed, Feb 9, 2022 at 1:17 PM kernel test robot <[email protected]> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > Greeting,
> > > > > >
> > > > > > FYI, we noticed a -9.7% regression of stress-ng.pipeherd.ops_per_sec due to commit:
> > > > > >
> > > > > >
> > > > > > commit: 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync of load_sum with load_avg")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > > >
> > > > > > in testcase: stress-ng
> > > > > > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> > > > > > with following parameters:
> > > > > >
> > > > > > nr_threads: 100%
> > > > > > testtime: 60s
> > > > > > class: memory
> > > > > > test: pipeherd
> > > > > > cpufreq_governor: performance
> > > > > > ucode: 0xd000280
> > > > > >
> > > > > This week we have re-run the test result and it seems that this
> > > > > regression is still there.
> > > > > As we are evaluating whether this report is valid or if the
> > > > > downgrading is expected, appreciated
> > > > > if you could give suggestion on further steps:
> > > > >
> > > > > 1. If I understand correctly,
> > > > > 2d02fa8cc21a93da35cfba462bf8ab87bf2db651 ("sched/pelt: Relax the sync
> > > > > of load_sum with load_avg")
> > > > > fixed the calculating of load_sum. Before this patch the
> > > > > contribution part would be 'skipped' and caused the load_sum
> > > > > to be lower than expected.
> > > >
> > > > Yes, you understand it correctly
> > > >
> > > > > 2. If above is true, after this patch, the load_sum becomes higher. Is
> > > > > there a scenario that higher load_sum added to 1 cfs_rq brings
> > > > > more imbalance between this group and other sched_group, thus
> > > > > brings more task migration/wake up? (because in below perf result,
> > > > > it seems that, with this patch applied, there are slightly more
> > > > > take wake up)
> > > >
> > > > This change should not impact load balance as it only does comparison
> > > > and I expect the load increase to happen on all cfs rq.
> > > > The only place that could be impacted, would be wake_affine_weight()
> > > > because it removes task load from previous cfs rq load before
> > > > comparing.
> > > > The task's load was not impacted by the underestimate which means that
> > > > the load of prev cfs might be seen lower than current cfs after
> > > > subtracting the task's load whereas both cfs rqs were similarly
> > > > underestimated.
> > > > Now the load of prev cfs rq is not underestimated and becomes
> > > > comparable or slightly higher than the current cfs and the task
> > > > migrate on current cfs instead of staying on prev one at wakeup
> > > >
> > > Could you please elaborate a little more on this scenario, since both current
> > > and previous cfs rqs were underestimated, how could previous cfs rq has
> > > lower load than the current one before applying this patch?
> > >
> > > Say, suppose the previous cfs rq has a load of L1, and current cfs rq has
> > > a load of L2, the waken task has a load of h, then wake_affine_weight()
> > > compares L1 - h with L2 + h , when L1 < L2 + 2h, the task will remain on
> > > previous CPU. Since L1 and L2 were underestimated in the same scale,
> > > I'm not quite sure how this patch would affect the choice between
> > > prev and current CPU.
> >
> > Let's take the example of this_cpu load L1 = 0 and prev_cpu load L2 =
> > 2h'+d. h' reflects h in the cpu load and d is a small delta load. The
> > task will migrate if we have the condition below:
> >
> > h < 2h'-h+d
> >
> > With this patch, we assume that h' == h as we don't underestimate the
> > load of cfs rqs anymore. The condition for migrating the task is :
> > h < h+d
> > And the task will migrate on this cpu as soon as there is a small load
> > on prev_cpu in addition to the 2h.
> >
> > Without the patch, the load of cfs_rqs are underestimated which means
> > that the task's load is underestimated in the cfs rq. This can be
> > described as h' == h-U. U being the underestimated part. In this case
> > the condition to migrate the task becomes:
> > h < h-2U+d
> > The task will migrate on this cpu is d is large enough to compensate
> > the underestimation so we will migrate less often
> >
> I see. Thanks for this example! So in this scenario when previous CPU
> has some higher load than the current CPU, without this patch applied,
> the OS would have more chances to keep the task on the previous CPU,
> thus less migration and less rq lock contention(according to the perf result).
> I don't have a good idea in mind on how to deal with this case, except that
> by disabling WA_WEIGHT or WA_BIAS(prefer 'this' CPU I suppose).

I don't think that there is any good solution for this benchmark. It
was taking advantage of the underestimated load_avg because it seems
that it doesn't like migrating to the local CPU; on the other hand,
some benchmarks will take advantage of this migration.

Thanks
Vincent
>
> --
> Thanks,
> Chenyu