2019-02-18 08:11:57

by Chen, Rong A

[permalink] [raw]
Subject: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

Greeting,

FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:


commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

in testcase: will-it-scale
on test machine: 288 threads Knights Mill with 80G memory
with following parameters:

nr_task: 100%
mode: thread
test: unlink2
cpufreq_governor: performance

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale

In addition to that, the commit also has significant impact on the following tests:

+------------------+---------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
| test machine | 288 threads Knights Mill with 80G memory |
| test parameters | cpufreq_governor=performance |
| | mode=thread |
| | nr_task=100% |
| | test=signal1 |
+------------------+---------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
| test machine | 288 threads Knights Mill with 80G memory |
| test parameters | cpufreq_governor=performance |
| | mode=thread |
| | nr_task=100% |
| | test=open1 |
+------------------+---------------------------------------------------------------+


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-7/performance/x86_64-rhel-7.2/thread/100%/debian-x86_64-2018-04-03.cgz/lkp-knm01/unlink2/will-it-scale

commit:
4bd4e92cfe ("sysfs: fix blank line coding style warning")
570d020012 ("driver core: move device->knode_class to device_private")

4bd4e92cfe6d2af7 570d0200123fb4f809aa2f6226
---------------- --------------------------
%stddev %change %stddev
\ | \
937.00 -12.2% 823.00 will-it-scale.per_thread_ops
906625 -1.7% 891438 will-it-scale.time.involuntary_context_switches
269989 -12.2% 237096 will-it-scale.workload
17104244 -12.3% 15001702 softirqs.RCU
6886 -1.7% 6768 vmstat.system.cs
2.088e+08 ± 17% -56.9% 90059912 ± 57% cpuidle.C1.time
683051 ± 15% -42.5% 392628 ± 40% cpuidle.C1.usage
4.519e+08 ± 6% +27.8% 5.777e+08 ± 9% cpuidle.C6.time
648490 ± 14% -42.7% 371724 ± 43% turbostat.C1
0.23 ± 18% -0.1 0.10 ± 62% turbostat.C1%
0.42 ± 9% +0.1 0.55 ± 11% turbostat.C6%
0.41 ± 5% -18.1% 0.34 ± 16% turbostat.CPU%c1
0.30 ± 6% +27.9% 0.39 ± 13% turbostat.CPU%c6
5.22 ± 13% -1.9 3.35 ± 4% perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
5.27 ± 13% -1.9 3.40 ± 3% perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_open
4.71 ± 14% -1.8 2.92 ± 4% perf-profile.calltrace.cycles-pp.unlink
4.58 ± 15% -1.8 2.81 ± 4% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.unlink
4.56 ± 15% -1.8 2.80 ± 4% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.unlink
4.40 ± 15% -1.8 2.64 ± 5% perf-profile.calltrace.cycles-pp.do_unlinkat.do_syscall_64.entry_SYSCALL_64_after_hwframe.unlink
3.18 ± 23% -1.6 1.53 ± 9% perf-profile.calltrace.cycles-pp.shmem_mknod.path_openat.do_filp_open.do_sys_open.do_syscall_64
1.70 ± 19% -0.7 0.95 ± 14% perf-profile.calltrace.cycles-pp.shmem_get_inode.shmem_mknod.path_openat.do_filp_open.do_sys_open
1.47 ± 20% -0.7 0.76 ± 18% perf-profile.calltrace.cycles-pp.evict.do_unlinkat.do_syscall_64.entry_SYSCALL_64_after_hwframe.unlink
1.43 ± 23% -0.7 0.73 ± 18% perf-profile.calltrace.cycles-pp.new_inode.shmem_get_inode.shmem_mknod.path_openat.do_filp_open
1.57 ± 5% -0.5 1.04 ± 57% perf-profile.calltrace.cycles-pp.perf_mux_hrtimer_handler.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt
0.58 -0.1 0.52 ± 2% perf-profile.calltrace.cycles-pp.filename_parentat.do_unlinkat.do_syscall_64.entry_SYSCALL_64_after_hwframe.unlink
2.31 ± 16% +0.7 3.00 ± 9% perf-profile.calltrace.cycles-pp.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt
2.04 ± 17% +0.7 2.76 ± 7% perf-profile.calltrace.cycles-pp.update_process_times.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt
2.09 ± 17% +0.7 2.82 ± 7% perf-profile.calltrace.cycles-pp.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt
44.55 +1.7 46.29 perf-profile.calltrace.cycles-pp.__GI___libc_close
44.26 +1.8 46.07 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__GI___libc_close
44.24 +1.8 46.05 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_close
43.67 +1.8 45.49 perf-profile.calltrace.cycles-pp.__close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_close
43.80 +1.9 45.67 perf-profile.calltrace.cycles-pp.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_close
43.67 +1.9 45.61 perf-profile.calltrace.cycles-pp.__alloc_fd.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_open
43.11 +2.0 45.13 perf-profile.calltrace.cycles-pp._raw_spin_lock.__close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
43.17 +2.0 45.22 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.__close_fd.__x64_sys_close.do_syscall_64
43.08 +2.1 45.17 perf-profile.calltrace.cycles-pp._raw_spin_lock.__alloc_fd.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
43.13 +2.1 45.26 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.__alloc_fd.do_sys_open.do_syscall_64
12.29 -10.0% 11.06 ± 2% perf-stat.i.MPKI
8.564e+09 -1.7% 8.421e+09 perf-stat.i.branch-instructions
1.59 -0.1 1.46 ± 2% perf-stat.i.branch-miss-rate%
1.356e+08 -9.7% 1.225e+08 perf-stat.i.branch-misses
16.95 +1.1 18.02 perf-stat.i.cache-miss-rate%
74122780 -6.2% 69544118 perf-stat.i.cache-misses
4.377e+08 -11.8% 3.862e+08 perf-stat.i.cache-references
6523 -1.6% 6416 perf-stat.i.context-switches
12.02 +1.6% 12.21 perf-stat.i.cpi
5778 +6.2% 6134 perf-stat.i.cycles-between-cache-misses
0.33 -0.0 0.29 ± 2% perf-stat.i.iTLB-load-miss-rate%
1.185e+08 -15.6% 1e+08 ± 2% perf-stat.i.iTLB-load-misses
3.555e+10 -2.0% 3.484e+10 perf-stat.i.iTLB-loads
3.563e+10 -2.0% 3.492e+10 perf-stat.i.instructions
300.74 +16.2% 349.51 ± 2% perf-stat.i.instructions-per-iTLB-miss
0.08 -1.6% 0.08 perf-stat.i.ipc
12.28 -10.0% 11.06 ± 2% perf-stat.overall.MPKI
1.58 -0.1 1.45 ± 2% perf-stat.overall.branch-miss-rate%
16.94 +1.1 18.01 perf-stat.overall.cache-miss-rate%
12.02 +1.6% 12.21 perf-stat.overall.cpi
5776 +6.2% 6132 perf-stat.overall.cycles-between-cache-misses
0.33 -0.0 0.29 ± 2% perf-stat.overall.iTLB-load-miss-rate%
300.68 +16.2% 349.34 ± 2% perf-stat.overall.instructions-per-iTLB-miss
0.08 -1.6% 0.08 perf-stat.overall.ipc
39542686 +11.7% 44150058 perf-stat.overall.path-length
8.53e+09 -1.7% 8.388e+09 perf-stat.ps.branch-instructions
1.351e+08 -9.7% 1.22e+08 perf-stat.ps.branch-misses
73835513 -6.2% 69273046 perf-stat.ps.cache-misses
4.36e+08 -11.8% 3.847e+08 perf-stat.ps.cache-references
6498 -1.7% 6390 perf-stat.ps.context-switches
1.18e+08 -15.6% 99644821 ± 2% perf-stat.ps.iTLB-load-misses
3.541e+10 -2.0% 3.47e+10 perf-stat.ps.iTLB-loads
3.549e+10 -2.0% 3.479e+10 perf-stat.ps.instructions
1.067e+13 -1.9% 1.047e+13 perf-stat.total.instructions
595740 ± 18% -43.0% 339338 ± 28% sched_debug.cfs_rq:/.MIN_vruntime.avg
4833300 ± 8% -31.2% 3327465 ± 24% sched_debug.cfs_rq:/.MIN_vruntime.stddev
15804 ± 7% -28.3% 11329 ± 22% sched_debug.cfs_rq:/.load.avg
1096777 -24.1% 832275 ± 27% sched_debug.cfs_rq:/.load.max
108529 ± 4% -27.8% 78378 ± 26% sched_debug.cfs_rq:/.load.stddev
12.36 ± 3% -11.3% 10.96 sched_debug.cfs_rq:/.load_avg.avg
522.40 ± 8% -32.1% 354.80 ± 23% sched_debug.cfs_rq:/.load_avg.max
35.08 ± 7% -28.4% 25.13 ± 21% sched_debug.cfs_rq:/.load_avg.stddev
595740 ± 18% -43.0% 339338 ± 28% sched_debug.cfs_rq:/.max_vruntime.avg
4833300 ± 8% -31.2% 3327465 ± 24% sched_debug.cfs_rq:/.max_vruntime.stddev
0.12 ± 9% -21.9% 0.10 ± 18% sched_debug.cfs_rq:/.nr_running.stddev
77.95 ± 6% +37.6% 107.25 ± 19% sched_debug.cfs_rq:/.nr_spread_over.max
7.10 ± 2% +20.5% 8.55 ± 15% sched_debug.cfs_rq:/.nr_spread_over.stddev
4.39 -15.0% 3.73 ± 8% sched_debug.cfs_rq:/.runnable_load_avg.avg
468.00 -35.1% 303.60 ± 30% sched_debug.cfs_rq:/.runnable_load_avg.max
28.56 -35.1% 18.54 ± 29% sched_debug.cfs_rq:/.runnable_load_avg.stddev
15797 ± 7% -28.3% 11323 ± 22% sched_debug.cfs_rq:/.runnable_weight.avg
1096533 -24.1% 832050 ± 27% sched_debug.cfs_rq:/.runnable_weight.max
108522 ± 4% -27.8% 78365 ± 26% sched_debug.cfs_rq:/.runnable_weight.stddev
772.59 ± 6% -28.1% 555.30 ± 5% sched_debug.cpu.clock.stddev
772.58 ± 6% -28.1% 555.30 ± 5% sched_debug.cpu.clock_task.stddev
4.43 -15.1% 3.76 ± 8% sched_debug.cpu.cpu_load[0].avg
28.58 -33.9% 18.88 ± 27% sched_debug.cpu.cpu_load[0].stddev
4.51 ± 3% -16.6% 3.76 ± 8% sched_debug.cpu.cpu_load[1].avg
29.07 ± 6% -35.8% 18.66 ± 28% sched_debug.cpu.cpu_load[1].stddev
4.45 ± 2% -15.7% 3.75 ± 8% sched_debug.cpu.cpu_load[2].avg
28.36 ± 4% -34.7% 18.53 ± 28% sched_debug.cpu.cpu_load[2].stddev
4.40 -15.0% 3.73 ± 8% sched_debug.cpu.cpu_load[3].avg
27.95 ± 3% -34.1% 18.43 ± 28% sched_debug.cpu.cpu_load[3].stddev
4.34 ± 2% -14.3% 3.72 ± 8% sched_debug.cpu.cpu_load[4].avg
456.00 ± 3% -34.2% 300.10 ± 29% sched_debug.cpu.cpu_load[4].max
27.74 ± 2% -33.6% 18.41 ± 28% sched_debug.cpu.cpu_load[4].stddev
15602 ± 7% -20.4% 12420 ± 19% sched_debug.cpu.load.avg
106214 ± 3% -19.2% 85804 ± 18% sched_debug.cpu.load.stddev
0.00 ± 4% -24.5% 0.00 ± 3% sched_debug.cpu.next_balance.stddev
0.50 ± 26% -54.8% 0.23 ± 79% sched_debug.rt_rq:/.rt_time.avg
143.90 ± 26% -54.8% 65.01 ± 79% sched_debug.rt_rq:/.rt_time.max
8.46 ± 26% -54.8% 3.82 ± 79% sched_debug.rt_rq:/.rt_time.stddev



will-it-scale.per_thread_ops

1000 +-+------------------------------------------------------------------+
| |
|.. +.. .+ +.. .+.. .+ + |
950 +-+ + +. + .. .+..+.. .+. + + +.. .. |
| + + + + + + + |
| + |
900 +-+ |
| O O |
850 +-+ O O O O |
O O O O O O O O O |
| O O O O O O O
800 +-+O O O |
| O |
| O |
750 +-+------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample

***************************************************************************************************
lkp-knm01: 288 threads Knights Mill with 80G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-7/performance/x86_64-rhel-7.2/thread/100%/debian-x86_64-2018-04-03.cgz/lkp-knm01/signal1/will-it-scale

commit:
4bd4e92cfe ("sysfs: fix blank line coding style warning")
570d020012 ("driver core: move device->knode_class to device_private")

4bd4e92cfe6d2af7 570d0200123fb4f809aa2f6226
---------------- --------------------------
fail:runs %reproduction fail:runs
| | |
:4 25% 1:4 dmesg.WARNING:at#for_ip_interrupt_entry/0x
%stddev %change %stddev
\ | \
266.75 ± 2% -29.9% 187.00 will-it-scale.per_thread_ops
131.40 ± 6% -29.3% 92.91 ± 3% will-it-scale.time.user_time
76931 ± 2% -29.9% 53963 will-it-scale.workload
788152 ± 8% +17.8% 928405 ± 5% cpuidle.C6.usage
33255 ± 4% -8.2% 30514 ± 2% numa-meminfo.node1.Slab
34856762 +14.5% 39908744 softirqs.TIMER
0.00 ± 26% +0.0 0.01 ± 67% mpstat.cpu.soft%
0.28 ± 3% -0.0 0.23 ± 2% mpstat.cpu.usr%
981.00 ± 7% -12.8% 855.00 ± 2% slabinfo.skbuff_ext_cache.active_objs
981.00 ± 7% -12.8% 855.00 ± 2% slabinfo.skbuff_ext_cache.num_objs
714927 ± 11% +21.5% 868290 ± 5% turbostat.C6
1.71 ± 5% +39.9% 2.39 turbostat.RAMWatt
1121 ± 2% -23.8% 854.06 ± 4% sched_debug.cfs_rq:/.exec_clock.stddev
20.90 ± 11% +42.0% 29.68 ± 17% sched_debug.cfs_rq:/.nr_spread_over.avg
54.08 ± 4% +18.2% 63.93 ± 7% sched_debug.cfs_rq:/.util_est_enqueued.stddev
349403 ± 73% +94.9% 680874 ± 2% sched_debug.cpu.avg_idle.min
860.45 ± 4% -37.1% 540.88 ± 7% sched_debug.cpu.clock.stddev
860.45 ± 4% -37.1% 540.88 ± 7% sched_debug.cpu.clock_task.stddev
5355 ± 14% +21.7% 6517 sched_debug.cpu.curr->pid.max
0.00 ± 3% -36.6% 0.00 ± 6% sched_debug.cpu.next_balance.stddev
16.16 -0.2 16.00 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__set_current_blocked.signal_setup_done.do_signal.exit_to_usermode_loop
16.14 -0.2 15.99 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.__set_current_blocked.signal_setup_done.do_signal
32.16 -0.1 32.04 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.__set_current_blocked.sigprocmask.__x64_sys_rt_sigprocmask
16.12 -0.1 16.01 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.get_signal.do_signal.exit_to_usermode_loop
16.13 -0.1 16.01 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.get_signal.do_signal.exit_to_usermode_loop.do_syscall_64
32.17 -0.1 32.05 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__set_current_blocked.sigprocmask.__x64_sys_rt_sigprocmask.do_syscall_64
16.11 -0.1 16.00 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.__set_current_blocked.__x64_sys_rt_sigreturn.do_syscall_64
16.12 -0.1 16.01 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__set_current_blocked.__x64_sys_rt_sigreturn.do_syscall_64.entry_SYSCALL_64_after_hwframe
16.63 -0.1 16.54 perf-profile.calltrace.cycles-pp.handler
16.49 -0.1 16.40 perf-profile.calltrace.cycles-pp.__set_current_blocked.signal_setup_done.do_signal.exit_to_usermode_loop.do_syscall_64
16.57 -0.1 16.48 perf-profile.calltrace.cycles-pp.do_signal.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.handler
16.58 -0.1 16.49 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.handler
16.58 -0.1 16.49 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.handler
16.57 -0.1 16.49 perf-profile.calltrace.cycles-pp.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.handler
16.52 -0.1 16.44 perf-profile.calltrace.cycles-pp.signal_setup_done.do_signal.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
16.63 -0.1 16.56 perf-profile.calltrace.cycles-pp.do_signal.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.raise
16.65 -0.1 16.58 perf-profile.calltrace.cycles-pp.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.raise
16.52 -0.1 16.46 perf-profile.calltrace.cycles-pp.get_signal.do_signal.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.59 ± 4% +0.1 0.71 ± 2% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.__set_current_blocked.sigprocmask.__x64_sys_rt_sigprocmask
0.64 ± 5% +0.1 0.77 ± 2% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.__set_current_blocked.sigprocmask.__x64_sys_rt_sigprocmask.do_syscall_64
0.27 ±100% +0.3 0.59 ± 2% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.__set_current_blocked.sigprocmask
4.55 ± 2% -8.7% 4.15 perf-stat.i.MPKI
0.52 -0.1 0.47 perf-stat.i.branch-miss-rate%
47148684 -8.9% 42963996 perf-stat.i.branch-misses
35569495 ± 2% -4.9% 33832492 perf-stat.i.cache-misses
1.66e+08 -7.7% 1.533e+08 perf-stat.i.cache-references
12.38 -1.2% 12.24 perf-stat.i.cpi
12726 ± 2% +5.1% 13374 perf-stat.i.cycles-between-cache-misses
0.09 ± 4% -0.0 0.07 perf-stat.i.iTLB-load-miss-rate%
32470934 ± 5% -14.8% 27651795 perf-stat.i.iTLB-load-misses
1128 ± 4% +18.6% 1339 perf-stat.i.instructions-per-iTLB-miss
4.54 ± 2% -8.7% 4.15 perf-stat.overall.MPKI
0.52 -0.1 0.47 ± 2% perf-stat.overall.branch-miss-rate%
12.38 -1.2% 12.24 perf-stat.overall.cpi
12722 ± 2% +5.1% 13367 perf-stat.overall.cycles-between-cache-misses
0.09 ± 4% -0.0 0.07 perf-stat.overall.iTLB-load-miss-rate%
1127 ± 4% +18.5% 1336 perf-stat.overall.instructions-per-iTLB-miss
1.414e+08 ± 2% +43.9% 2.035e+08 ± 2% perf-stat.overall.path-length
46584988 -9.3% 42262634 perf-stat.ps.branch-misses
35172756 ± 2% -5.0% 33400721 perf-stat.ps.cache-misses
1.641e+08 -7.8% 1.513e+08 perf-stat.ps.cache-references
32116706 ± 5% -15.0% 27304782 perf-stat.ps.iTLB-load-misses



***************************************************************************************************
lkp-knm01: 288 threads Knights Mill with 80G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-7/performance/x86_64-rhel-7.2/thread/100%/debian-x86_64-2018-04-03.cgz/lkp-knm01/open1/will-it-scale

commit:
4bd4e92cfe ("sysfs: fix blank line coding style warning")
570d020012 ("driver core: move device->knode_class to device_private")

4bd4e92cfe6d2af7 570d0200123fb4f809aa2f6226
---------------- --------------------------
%stddev %change %stddev
\ | \
945.00 -16.5% 788.75 will-it-scale.per_thread_ops
568697 ± 7% -21.3% 447754 will-it-scale.time.involuntary_context_switches
303.55 ± 2% -7.2% 281.60 ± 2% will-it-scale.time.user_time
272400 -16.6% 227268 will-it-scale.workload
0.08 ± 9% -0.0 0.05 ± 5% mpstat.cpu.soft%
192099 -11.9% 169161 ± 3% slabinfo.filp.active_objs
5043716 -12.1% 4431600 ± 2% softirqs.RCU
2.31 ± 2% +11.2% 2.58 turbostat.RAMWatt
4620 ± 6% -19.2% 3732 vmstat.system.cs
5458 +1.4% 5532 proc-vmstat.nr_inactive_anon
5458 +1.4% 5532 proc-vmstat.nr_zone_inactive_anon
7.27 ± 4% -0.8 6.50 ± 4% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.native_queued_spin_lock_slowpath._raw_spin_lock
6.02 ± 5% -0.6 5.40 ± 5% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.native_queued_spin_lock_slowpath
1.22 ± 2% -0.2 1.07 ± 2% perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_open
1.16 ± 2% -0.1 1.02 ± 2% perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
50.11 -0.1 49.98 perf-profile.calltrace.cycles-pp.__GI___libc_open
47.70 +0.2 47.86 perf-profile.calltrace.cycles-pp._raw_spin_lock.__alloc_fd.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
47.89 +0.2 48.12 perf-profile.calltrace.cycles-pp._raw_spin_lock.__close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
306780 ± 31% -46.2% 165114 ± 27% sched_debug.cfs_rq:/.MIN_vruntime.avg
37943697 ± 12% -23.6% 28986155 ± 20% sched_debug.cfs_rq:/.MIN_vruntime.max
3199728 ± 21% -33.9% 2114259 ± 22% sched_debug.cfs_rq:/.MIN_vruntime.stddev
1368 ± 7% -17.0% 1135 ± 9% sched_debug.cfs_rq:/.exec_clock.stddev
10913 ± 25% -29.9% 7650 ± 10% sched_debug.cfs_rq:/.load.avg
76190 ± 22% -21.5% 59825 ± 7% sched_debug.cfs_rq:/.load.stddev
306780 ± 31% -46.2% 165114 ± 27% sched_debug.cfs_rq:/.max_vruntime.avg
37943697 ± 12% -23.6% 28986155 ± 20% sched_debug.cfs_rq:/.max_vruntime.max
3199728 ± 21% -33.9% 2114259 ± 22% sched_debug.cfs_rq:/.max_vruntime.stddev
10906 ± 25% -29.9% 7642 ± 10% sched_debug.cfs_rq:/.runnable_weight.avg
76186 ± 22% -21.5% 59800 ± 7% sched_debug.cfs_rq:/.runnable_weight.stddev
234073 ± 39% +134.5% 548907 ± 71% sched_debug.cfs_rq:/.spread0.avg
1117 ± 11% -26.5% 821.87 ± 5% sched_debug.cpu.clock.stddev
1117 ± 11% -26.5% 821.87 ± 5% sched_debug.cpu.clock_task.stddev
0.00 ± 8% -25.6% 0.00 ± 6% sched_debug.cpu.next_balance.stddev
2333 ± 3% -14.3% 1998 sched_debug.cpu.nr_switches.avg
1323 ± 4% -20.1% 1057 sched_debug.cpu.nr_switches.min
2160 ± 4% -12.5% 1891 ± 2% sched_debug.cpu.sched_count.avg
1179 ± 5% -17.0% 978.15 sched_debug.cpu.sched_count.min
1.55 ± 66% -100.0% 0.00 sched_debug.cpu.sched_goidle.min
912.99 ± 4% -18.1% 747.65 sched_debug.cpu.ttwu_count.avg
590.45 ± 5% -21.5% 463.60 ± 2% sched_debug.cpu.ttwu_count.min
835.45 ± 5% -19.8% 670.09 sched_debug.cpu.ttwu_local.avg
561.95 ± 5% -22.5% 435.70 ± 3% sched_debug.cpu.ttwu_local.min
6.47 -8.2% 5.93 ± 3% perf-stat.i.MPKI
8.137e+09 +1.8% 8.282e+09 perf-stat.i.branch-instructions
0.85 -0.1 0.77 ± 2% perf-stat.i.branch-miss-rate%
69084658 -7.5% 63892585 ± 2% perf-stat.i.branch-misses
44409176 -6.8% 41391448 ± 3% perf-stat.i.cache-misses
2.151e+08 -6.9% 2.003e+08 ± 3% perf-stat.i.cache-references
4250 ± 7% -19.1% 3436 perf-stat.i.context-switches
13.05 -1.8% 12.82 perf-stat.i.cpi
9785 +7.2% 10490 ± 3% perf-stat.i.cycles-between-cache-misses
0.20 -0.0 0.18 ± 6% perf-stat.i.iTLB-load-miss-rate%
67639103 -11.5% 59865979 ± 6% perf-stat.i.iTLB-load-misses
3.324e+10 +1.4% 3.371e+10 perf-stat.i.iTLB-loads
3.328e+10 +1.5% 3.379e+10 perf-stat.i.instructions
492.13 +15.2% 566.96 ± 6% perf-stat.i.instructions-per-iTLB-miss
0.08 +1.8% 0.08 perf-stat.i.ipc
6.46 -8.3% 5.93 ± 3% perf-stat.overall.MPKI
0.85 -0.1 0.77 ± 2% perf-stat.overall.branch-miss-rate%
13.05 -1.8% 12.82 perf-stat.overall.cpi
9782 +7.1% 10480 ± 3% perf-stat.overall.cycles-between-cache-misses
0.20 -0.0 0.18 ± 6% perf-stat.overall.iTLB-load-miss-rate%
492.04 +15.2% 566.73 ± 6% perf-stat.overall.instructions-per-iTLB-miss
0.08 +1.8% 0.08 perf-stat.overall.ipc
36640385 +21.7% 44597360 perf-stat.overall.path-length
8.103e+09 +1.8% 8.249e+09 perf-stat.ps.branch-instructions
68811380 -7.5% 63646780 ± 2% perf-stat.ps.branch-misses
44229668 -6.8% 41229829 ± 3% perf-stat.ps.cache-misses
2.142e+08 -6.9% 1.995e+08 ± 3% perf-stat.ps.cache-references
4233 ± 7% -19.1% 3423 perf-stat.ps.context-switches
67364890 -11.5% 59634016 ± 6% perf-stat.ps.iTLB-load-misses
3.315e+10 +1.5% 3.365e+10 perf-stat.ps.instructions
9.98e+12 +1.5% 1.013e+13 perf-stat.total.instructions





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen


Attachments:
(No filename) (29.57 kB)
config-5.0.0-rc2-00005-g570d020 (171.86 kB)
job.yaml (4.76 kB)
reproduce (321.00 B)
Download all attachments

2019-02-19 12:20:15

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
> On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
> >Greeting,
> >
> >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
> >
> >
> >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
> >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >
>
> This is interesting.
>
> I didn't expect the move of this field will impact the performance.
>
> The reason is struct device is a hotter memory than device->device_private?
>
> >in testcase: will-it-scale
> >on test machine: 288 threads Knights Mill with 80G memory
> >with following parameters:
> >
> > nr_task: 100%
> > mode: thread
> > test: unlink2
> > cpufreq_governor: performance
> >
> >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> >test-url: https://github.com/antonblanchard/will-it-scale
> >
> >In addition to that, the commit also has significant impact on the following tests:
> >
> >+------------------+---------------------------------------------------------------+
> >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
> >| test machine | 288 threads Knights Mill with 80G memory |
> >| test parameters | cpufreq_governor=performance |
> >| | mode=thread |
> >| | nr_task=100% |
> >| | test=signal1 |

Ok, I'm going to blame your testing system, or something here, and not
the above patch.

All this test does is call raise(3). That does not touch the driver
core at all.

> >+------------------+---------------------------------------------------------------+
> >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
> >| test machine | 288 threads Knights Mill with 80G memory |
> >| test parameters | cpufreq_governor=performance |
> >| | mode=thread |
> >| | nr_task=100% |
> >| | test=open1 |
> >+------------------+---------------------------------------------------------------+

Same here, open1 just calls open/close a lot. No driver core
interaction at all there either.

So are you _sure_ this is the offending patch?

thanks,

greg k-h

2019-02-21 03:11:39

by Chen, Rong A

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
> > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
> > >Greeting,
> > >
> > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
> > >
> > >
> > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
> > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > >
> >
> > This is interesting.
> >
> > I didn't expect the move of this field will impact the performance.
> >
> > The reason is struct device is a hotter memory than device->device_private?
> >
> > >in testcase: will-it-scale
> > >on test machine: 288 threads Knights Mill with 80G memory
> > >with following parameters:
> > >
> > > nr_task: 100%
> > > mode: thread
> > > test: unlink2
> > > cpufreq_governor: performance
> > >
> > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> > >test-url: https://github.com/antonblanchard/will-it-scale
> > >
> > >In addition to that, the commit also has significant impact on the following tests:
> > >
> > >+------------------+---------------------------------------------------------------+
> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
> > >| test machine | 288 threads Knights Mill with 80G memory |
> > >| test parameters | cpufreq_governor=performance |
> > >| | mode=thread |
> > >| | nr_task=100% |
> > >| | test=signal1 |
>
> Ok, I'm going to blame your testing system, or something here, and not
> the above patch.
>
> All this test does is call raise(3). That does not touch the driver
> core at all.
>
> > >+------------------+---------------------------------------------------------------+
> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
> > >| test machine | 288 threads Knights Mill with 80G memory |
> > >| test parameters | cpufreq_governor=performance |
> > >| | mode=thread |
> > >| | nr_task=100% |
> > >| | test=open1 |
> > >+------------------+---------------------------------------------------------------+
>
> Same here, open1 just calls open/close a lot. No driver core
> interaction at all there either.
>
> So are you _sure_ this is the offending patch?

Hi Greg,

We did an experiment, recovered the layout of struct device. and we
found the regression is gone. I guess the regession is not from the
patch but related to the struct layout.


tests: 1
testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01

570d0200123fb4f8 a36dc70b810afe9183de2ea18f
---------------- --------------------------
%stddev change %stddev
\ | \
237096 14% 270789 will-it-scale.workload
823 14% 939 will-it-scale.per_thread_ops


tests: 1
testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01

570d0200123fb4f8 a36dc70b810afe9183de2ea18f
---------------- --------------------------
%stddev change %stddev
\ | \
93.51 ± 3% 48% 138.53 ± 3% will-it-scale.time.user_time
186 40% 261 will-it-scale.per_thread_ops
53909 40% 75507 will-it-scale.workload


tests: 1
testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01

570d0200123fb4f8 a36dc70b810afe9183de2ea18f
---------------- --------------------------
%stddev change %stddev
\ | \
447722 22% 546258 ± 10% will-it-scale.time.involuntary_context_switches
226995 19% 269751 will-it-scale.workload
787 19% 936 will-it-scale.per_thread_ops



commit a36dc70b810afe9183de2ea18faa4c0939c139ac
Author: 0day robot <[email protected]>
Date: Wed Feb 20 14:21:19 2019 +0800

backfile klist_node in struct device for debugging

Signed-off-by: 0day robot <[email protected]>

diff --git a/include/linux/device.h b/include/linux/device.h
index d0e452fd0bff2..31666cb72b3ba 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1035,6 +1035,7 @@ struct device {
spinlock_t devres_lock;
struct list_head devres_head;

+ struct klist_node knode_class_test_by_rongc;
struct class *class;
const struct attribute_group **groups; /* optional groups */

Best Regards,
Rong Chen

2019-02-21 03:48:32

by Wei Yang

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

On Thu, Feb 21, 2019 at 11:10:49AM +0800, kernel test robot wrote:
>On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
>> On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
>> > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
>> > >Greeting,
>> > >
>> > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
>> > >
>> > >
>> > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
>> > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>> > >
>> >
>> > This is interesting.
>> >
>> > I didn't expect the move of this field will impact the performance.
>> >
>> > The reason is struct device is a hotter memory than device->device_private?
>> >
>> > >in testcase: will-it-scale
>> > >on test machine: 288 threads Knights Mill with 80G memory
>> > >with following parameters:
>> > >
>> > > nr_task: 100%
>> > > mode: thread
>> > > test: unlink2
>> > > cpufreq_governor: performance
>> > >
>> > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>> > >test-url: https://github.com/antonblanchard/will-it-scale
>> > >
>> > >In addition to that, the commit also has significant impact on the following tests:
>> > >
>> > >+------------------+---------------------------------------------------------------+
>> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
>> > >| test machine | 288 threads Knights Mill with 80G memory |
>> > >| test parameters | cpufreq_governor=performance |
>> > >| | mode=thread |
>> > >| | nr_task=100% |
>> > >| | test=signal1 |
>>
>> Ok, I'm going to blame your testing system, or something here, and not
>> the above patch.
>>
>> All this test does is call raise(3). That does not touch the driver
>> core at all.
>>
>> > >+------------------+---------------------------------------------------------------+
>> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
>> > >| test machine | 288 threads Knights Mill with 80G memory |
>> > >| test parameters | cpufreq_governor=performance |
>> > >| | mode=thread |
>> > >| | nr_task=100% |
>> > >| | test=open1 |
>> > >+------------------+---------------------------------------------------------------+
>>
>> Same here, open1 just calls open/close a lot. No driver core
>> interaction at all there either.
>>
>> So are you _sure_ this is the offending patch?
>
>Hi Greg,
>
>We did an experiment, recovered the layout of struct device. and we
>found the regression is gone. I guess the regession is not from the
>patch but related to the struct layout.
>
>
>tests: 1
>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01
>
>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>---------------- --------------------------
> %stddev change %stddev
> \ | \
> 237096 14% 270789 will-it-scale.workload
> 823 14% 939 will-it-scale.per_thread_ops
>

Do you have the comparison between a36dc70b810afe9183de2ea18f and the one
before 570d020012?

>
>tests: 1
>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01
>
>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>---------------- --------------------------
> %stddev change %stddev
> \ | \
> 93.51 ? 3% 48% 138.53 ? 3% will-it-scale.time.user_time
> 186 40% 261 will-it-scale.per_thread_ops
> 53909 40% 75507 will-it-scale.workload
>
>
>tests: 1
>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01
>
>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>---------------- --------------------------
> %stddev change %stddev
> \ | \
> 447722 22% 546258 ? 10% will-it-scale.time.involuntary_context_switches
> 226995 19% 269751 will-it-scale.workload
> 787 19% 936 will-it-scale.per_thread_ops
>
>
>
>commit a36dc70b810afe9183de2ea18faa4c0939c139ac
>Author: 0day robot <[email protected]>
>Date: Wed Feb 20 14:21:19 2019 +0800
>
> backfile klist_node in struct device for debugging
>
> Signed-off-by: 0day robot <[email protected]>
>
>diff --git a/include/linux/device.h b/include/linux/device.h
>index d0e452fd0bff2..31666cb72b3ba 100644
>--- a/include/linux/device.h
>+++ b/include/linux/device.h
>@@ -1035,6 +1035,7 @@ struct device {
> spinlock_t devres_lock;
> struct list_head devres_head;
>
>+ struct klist_node knode_class_test_by_rongc;
> struct class *class;
> const struct attribute_group **groups; /* optional groups */

Hmm... because this is not properly aligned?

struct klist_node {
void *n_klist; /* never access directly */
struct list_head n_node;
struct kref n_ref;
};

Except struct kref has one "int" type, others are pointers.

But... I am still confused.

>
>Best Regards,
>Rong Chen

--
Wei Yang
Help you, Help me

2019-02-21 04:47:11

by Huang, Ying

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

Wei Yang <[email protected]> writes:

> On Thu, Feb 21, 2019 at 11:10:49AM +0800, kernel test robot wrote:
>>On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
>>> On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
>>> > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
>>> > >Greeting,
>>> > >
>>> > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
>>> > >
>>> > >
>>> > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
>>> > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>> > >
>>> >
>>> > This is interesting.
>>> >
>>> > I didn't expect the move of this field will impact the performance.
>>> >
>>> > The reason is struct device is a hotter memory than device->device_private?
>>> >
>>> > >in testcase: will-it-scale
>>> > >on test machine: 288 threads Knights Mill with 80G memory
>>> > >with following parameters:
>>> > >
>>> > > nr_task: 100%
>>> > > mode: thread
>>> > > test: unlink2
>>> > > cpufreq_governor: performance
>>> > >
>>> > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>>> > >test-url: https://github.com/antonblanchard/will-it-scale
>>> > >
>>> > >In addition to that, the commit also has significant impact on the following tests:
>>> > >
>>> > >+------------------+---------------------------------------------------------------+
>>> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
>>> > >| test machine | 288 threads Knights Mill with 80G memory |
>>> > >| test parameters | cpufreq_governor=performance |
>>> > >| | mode=thread |
>>> > >| | nr_task=100% |
>>> > >| | test=signal1 |
>>>
>>> Ok, I'm going to blame your testing system, or something here, and not
>>> the above patch.
>>>
>>> All this test does is call raise(3). That does not touch the driver
>>> core at all.
>>>
>>> > >+------------------+---------------------------------------------------------------+
>>> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
>>> > >| test machine | 288 threads Knights Mill with 80G memory |
>>> > >| test parameters | cpufreq_governor=performance |
>>> > >| | mode=thread |
>>> > >| | nr_task=100% |
>>> > >| | test=open1 |
>>> > >+------------------+---------------------------------------------------------------+
>>>
>>> Same here, open1 just calls open/close a lot. No driver core
>>> interaction at all there either.
>>>
>>> So are you _sure_ this is the offending patch?
>>
>>Hi Greg,
>>
>>We did an experiment, recovered the layout of struct device. and we
>>found the regression is gone. I guess the regession is not from the
>>patch but related to the struct layout.
>>
>>
>>tests: 1
>>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01
>>
>>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>---------------- --------------------------
>> %stddev change %stddev
>> \ | \
>> 237096 14% 270789 will-it-scale.workload
>> 823 14% 939 will-it-scale.per_thread_ops
>>
>
> Do you have the comparison between a36dc70b810afe9183de2ea18f and the one
> before 570d020012?
>
>>
>>tests: 1
>>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01
>>
>>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>---------------- --------------------------
>> %stddev change %stddev
>> \ | \
>> 93.51 3% 48% 138.53 3% will-it-scale.time.user_time
>> 186 40% 261 will-it-scale.per_thread_ops
>> 53909 40% 75507 will-it-scale.workload
>>
>>
>>tests: 1
>>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01
>>
>>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>---------------- --------------------------
>> %stddev change %stddev
>> \ | \
>> 447722 22% 546258 10% will-it-scale.time.involuntary_context_switches
>> 226995 19% 269751 will-it-scale.workload
>> 787 19% 936 will-it-scale.per_thread_ops
>>
>>
>>
>>commit a36dc70b810afe9183de2ea18faa4c0939c139ac
>>Author: 0day robot <[email protected]>
>>Date: Wed Feb 20 14:21:19 2019 +0800
>>
>> backfile klist_node in struct device for debugging
>>
>> Signed-off-by: 0day robot <[email protected]>
>>
>>diff --git a/include/linux/device.h b/include/linux/device.h
>>index d0e452fd0bff2..31666cb72b3ba 100644
>>--- a/include/linux/device.h
>>+++ b/include/linux/device.h
>>@@ -1035,6 +1035,7 @@ struct device {
>> spinlock_t devres_lock;
>> struct list_head devres_head;
>>
>>+ struct klist_node knode_class_test_by_rongc;
>> struct class *class;
>> const struct attribute_group **groups; /* optional groups */
>
> Hmm... because this is not properly aligned?
>
> struct klist_node {
> void *n_klist; /* never access directly */
> struct list_head n_node;
> struct kref n_ref;
> };
>
> Except struct kref has one "int" type, others are pointers.
>
> But... I am still confused.

I guess because the size of struct device is changed, it influences some
alignment changes in the system. Thus influence the benchmark score.

Best Regards,
Huang, Ying

>>
>>Best Regards,
>>Rong Chen

2019-02-21 05:47:21

by Chen, Rong A

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

On Thu, Feb 21, 2019 at 11:46:12AM +0800, Wei Yang wrote:
> On Thu, Feb 21, 2019 at 11:10:49AM +0800, kernel test robot wrote:
> >On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
> >> On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
> >> > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
> >> > >Greeting,
> >> > >
> >> > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
> >> > >
> >> > >
> >> > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
> >> > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >> > >
> >> >
> >> > This is interesting.
> >> >
> >> > I didn't expect the move of this field will impact the performance.
> >> >
> >> > The reason is struct device is a hotter memory than device->device_private?
> >> >
> >> > >in testcase: will-it-scale
> >> > >on test machine: 288 threads Knights Mill with 80G memory
> >> > >with following parameters:
> >> > >
> >> > > nr_task: 100%
> >> > > mode: thread
> >> > > test: unlink2
> >> > > cpufreq_governor: performance
> >> > >
> >> > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> >> > >test-url: https://github.com/antonblanchard/will-it-scale
> >> > >
> >> > >In addition to that, the commit also has significant impact on the following tests:
> >> > >
> >> > >+------------------+---------------------------------------------------------------+
> >> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
> >> > >| test machine | 288 threads Knights Mill with 80G memory |
> >> > >| test parameters | cpufreq_governor=performance |
> >> > >| | mode=thread |
> >> > >| | nr_task=100% |
> >> > >| | test=signal1 |
> >>
> >> Ok, I'm going to blame your testing system, or something here, and not
> >> the above patch.
> >>
> >> All this test does is call raise(3). That does not touch the driver
> >> core at all.
> >>
> >> > >+------------------+---------------------------------------------------------------+
> >> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
> >> > >| test machine | 288 threads Knights Mill with 80G memory |
> >> > >| test parameters | cpufreq_governor=performance |
> >> > >| | mode=thread |
> >> > >| | nr_task=100% |
> >> > >| | test=open1 |
> >> > >+------------------+---------------------------------------------------------------+
> >>
> >> Same here, open1 just calls open/close a lot. No driver core
> >> interaction at all there either.
> >>
> >> So are you _sure_ this is the offending patch?
> >
> >Hi Greg,
> >
> >We did an experiment, recovered the layout of struct device. and we
> >found the regression is gone. I guess the regession is not from the
> >patch but related to the struct layout.
> >
> >
> >tests: 1
> >testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01
> >
> >570d0200123fb4f8 a36dc70b810afe9183de2ea18f
> >---------------- --------------------------
> > %stddev change %stddev
> > \ | \
> > 237096 14% 270789 will-it-scale.workload
> > 823 14% 939 will-it-scale.per_thread_ops
> >
>
> Do you have the comparison between a36dc70b810afe9183de2ea18f and the one
> before 570d020012?

testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01

4bd4e92cfe6d2af7 a36dc70b810afe9183de2ea18f
---------------- --------------------------
%stddev %change %stddev
\ | \
937.00 +0.2% 939.33 will-it-scale.per_thread_ops
269989 +0.3% 270789 will-it-scale.workload

> >
> >tests: 1
> >testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01
> >
> >570d0200123fb4f8 a36dc70b810afe9183de2ea18f
> >---------------- --------------------------
> > %stddev change %stddev
> > \ | \
> > 93.51 ± 3% 48% 138.53 ± 3% will-it-scale.time.user_time
> > 186 40% 261 will-it-scale.per_thread_ops
> > 53909 40% 75507 will-it-scale.workload
> >

testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01

4bd4e92cfe6d2af7 a36dc70b810afe9183de2ea18f
---------------- --------------------------
%stddev %change %stddev
\ | \
266.00 ± 2% -1.6% 261.67 will-it-scale.per_thread_ops
76699 ± 2% -1.6% 75507 will-it-scale.workload

> >
> >tests: 1
> >testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01
> >
> >570d0200123fb4f8 a36dc70b810afe9183de2ea18f
> >---------------- --------------------------
> > %stddev change %stddev
> > \ | \
> > 447722 22% 546258 ± 10% will-it-scale.time.involuntary_context_switches
> > 226995 19% 269751 will-it-scale.workload
> > 787 19% 936 will-it-scale.per_thread_ops
> >
> >

testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01

4bd4e92cfe6d2af7 a36dc70b810afe9183de2ea18f
---------------- --------------------------
%stddev %change %stddev
\ | \
944.60 -0.9% 936.00 will-it-scale.per_thread_ops
272252 -0.9% 269751 will-it-scale.workload

Best Regards,
Rong Chen

2019-02-21 06:03:25

by Wei Yang

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

On Thu, Feb 21, 2019 at 12:46:18PM +0800, Huang, Ying wrote:
>Wei Yang <[email protected]> writes:
>
>> On Thu, Feb 21, 2019 at 11:10:49AM +0800, kernel test robot wrote:
>>>On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
>>>> On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
>>>> > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
>>>> > >Greeting,
>>>> > >
>>>> > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
>>>> > >
>>>> > >
>>>> > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
>>>> > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>>> > >
>>>> >
>>>> > This is interesting.
>>>> >
>>>> > I didn't expect the move of this field will impact the performance.
>>>> >
>>>> > The reason is struct device is a hotter memory than device->device_private?
>>>> >
>>>> > >in testcase: will-it-scale
>>>> > >on test machine: 288 threads Knights Mill with 80G memory
>>>> > >with following parameters:
>>>> > >
>>>> > > nr_task: 100%
>>>> > > mode: thread
>>>> > > test: unlink2
>>>> > > cpufreq_governor: performance
>>>> > >
>>>> > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>>>> > >test-url: https://github.com/antonblanchard/will-it-scale
>>>> > >
>>>> > >In addition to that, the commit also has significant impact on the following tests:
>>>> > >
>>>> > >+------------------+---------------------------------------------------------------+
>>>> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
>>>> > >| test machine | 288 threads Knights Mill with 80G memory |
>>>> > >| test parameters | cpufreq_governor=performance |
>>>> > >| | mode=thread |
>>>> > >| | nr_task=100% |
>>>> > >| | test=signal1 |
>>>>
>>>> Ok, I'm going to blame your testing system, or something here, and not
>>>> the above patch.
>>>>
>>>> All this test does is call raise(3). That does not touch the driver
>>>> core at all.
>>>>
>>>> > >+------------------+---------------------------------------------------------------+
>>>> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
>>>> > >| test machine | 288 threads Knights Mill with 80G memory |
>>>> > >| test parameters | cpufreq_governor=performance |
>>>> > >| | mode=thread |
>>>> > >| | nr_task=100% |
>>>> > >| | test=open1 |
>>>> > >+------------------+---------------------------------------------------------------+
>>>>
>>>> Same here, open1 just calls open/close a lot. No driver core
>>>> interaction at all there either.
>>>>
>>>> So are you _sure_ this is the offending patch?
>>>
>>>Hi Greg,
>>>
>>>We did an experiment, recovered the layout of struct device. and we
>>>found the regression is gone. I guess the regession is not from the
>>>patch but related to the struct layout.
>>>
>>>
>>>tests: 1
>>>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01
>>>
>>>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>>---------------- --------------------------
>>> %stddev change %stddev
>>> \ | \
>>> 237096 14% 270789 will-it-scale.workload
>>> 823 14% 939 will-it-scale.per_thread_ops
>>>
>>
>> Do you have the comparison between a36dc70b810afe9183de2ea18f and the one
>> before 570d020012?
>>
>>>
>>>tests: 1
>>>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01
>>>
>>>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>>---------------- --------------------------
>>> %stddev change %stddev
>>> \ | \
>>> 93.51 3% 48% 138.53 3% will-it-scale.time.user_time
>>> 186 40% 261 will-it-scale.per_thread_ops
>>> 53909 40% 75507 will-it-scale.workload
>>>
>>>
>>>tests: 1
>>>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01
>>>
>>>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>>---------------- --------------------------
>>> %stddev change %stddev
>>> \ | \
>>> 447722 22% 546258 10% will-it-scale.time.involuntary_context_switches
>>> 226995 19% 269751 will-it-scale.workload
>>> 787 19% 936 will-it-scale.per_thread_ops
>>>
>>>
>>>
>>>commit a36dc70b810afe9183de2ea18faa4c0939c139ac
>>>Author: 0day robot <[email protected]>
>>>Date: Wed Feb 20 14:21:19 2019 +0800
>>>
>>> backfile klist_node in struct device for debugging
>>>
>>> Signed-off-by: 0day robot <[email protected]>
>>>
>>>diff --git a/include/linux/device.h b/include/linux/device.h
>>>index d0e452fd0bff2..31666cb72b3ba 100644
>>>--- a/include/linux/device.h
>>>+++ b/include/linux/device.h
>>>@@ -1035,6 +1035,7 @@ struct device {
>>> spinlock_t devres_lock;
>>> struct list_head devres_head;
>>>
>>>+ struct klist_node knode_class_test_by_rongc;
>>> struct class *class;
>>> const struct attribute_group **groups; /* optional groups */
>>
>> Hmm... because this is not properly aligned?
>>
>> struct klist_node {
>> void *n_klist; /* never access directly */
>> struct list_head n_node;
>> struct kref n_ref;
>> };
>>
>> Except struct kref has one "int" type, others are pointers.
>>
>> But... I am still confused.
>
>I guess because the size of struct device is changed, it influences some
>alignment changes in the system. Thus influence the benchmark score.
>

That's interesting.

I wrote a module to see the exact size of these two structure on my x86_64.

sizeof(struct device) = 736 = 8 * 92
sizeof(struct device_private) = 160 = 8 * 20
sizeof(struct klist_node) = 32 = 8 * 4

Even klist_node has one 4 byte field, c complier would pack the structure to
make it aligned. Which system alignment it would affect?

After the patch, size would change like this:

struct device 736 -> 704
struce device_private 160 -> 192

Would this size change affect system?

>Best Regards,
>Huang, Ying
>
>>>
>>>Best Regards,
>>>Rong Chen

--
Wei Yang
Help you, Help me

2019-02-21 06:30:22

by Huang, Ying

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

Wei Yang <[email protected]> writes:

> On Thu, Feb 21, 2019 at 12:46:18PM +0800, Huang, Ying wrote:
>>Wei Yang <[email protected]> writes:
>>
>>> On Thu, Feb 21, 2019 at 11:10:49AM +0800, kernel test robot wrote:
>>>>On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
>>>>> On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
>>>>> > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
>>>>> > >Greeting,
>>>>> > >
>>>>> > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
>>>>> > >
>>>>> > >
>>>>> > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
>>>>> > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>>>> > >
>>>>> >
>>>>> > This is interesting.
>>>>> >
>>>>> > I didn't expect the move of this field will impact the performance.
>>>>> >
>>>>> > The reason is struct device is a hotter memory than device->device_private?
>>>>> >
>>>>> > >in testcase: will-it-scale
>>>>> > >on test machine: 288 threads Knights Mill with 80G memory
>>>>> > >with following parameters:
>>>>> > >
>>>>> > > nr_task: 100%
>>>>> > > mode: thread
>>>>> > > test: unlink2
>>>>> > > cpufreq_governor: performance
>>>>> > >
>>>>> > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>>>>> > >test-url: https://github.com/antonblanchard/will-it-scale
>>>>> > >
>>>>> > >In addition to that, the commit also has significant impact on the following tests:
>>>>> > >
>>>>> > >+------------------+---------------------------------------------------------------+
>>>>> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
>>>>> > >| test machine | 288 threads Knights Mill with 80G memory |
>>>>> > >| test parameters | cpufreq_governor=performance |
>>>>> > >| | mode=thread |
>>>>> > >| | nr_task=100% |
>>>>> > >| | test=signal1 |
>>>>>
>>>>> Ok, I'm going to blame your testing system, or something here, and not
>>>>> the above patch.
>>>>>
>>>>> All this test does is call raise(3). That does not touch the driver
>>>>> core at all.
>>>>>
>>>>> > >+------------------+---------------------------------------------------------------+
>>>>> > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
>>>>> > >| test machine | 288 threads Knights Mill with 80G memory |
>>>>> > >| test parameters | cpufreq_governor=performance |
>>>>> > >| | mode=thread |
>>>>> > >| | nr_task=100% |
>>>>> > >| | test=open1 |
>>>>> > >+------------------+---------------------------------------------------------------+
>>>>>
>>>>> Same here, open1 just calls open/close a lot. No driver core
>>>>> interaction at all there either.
>>>>>
>>>>> So are you _sure_ this is the offending patch?
>>>>
>>>>Hi Greg,
>>>>
>>>>We did an experiment, recovered the layout of struct device. and we
>>>>found the regression is gone. I guess the regession is not from the
>>>>patch but related to the struct layout.
>>>>
>>>>
>>>>tests: 1
>>>>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01
>>>>
>>>>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>>>---------------- --------------------------
>>>> %stddev change %stddev
>>>> \ | \
>>>> 237096 14% 270789 will-it-scale.workload
>>>> 823 14% 939 will-it-scale.per_thread_ops
>>>>
>>>
>>> Do you have the comparison between a36dc70b810afe9183de2ea18f and the one
>>> before 570d020012?
>>>
>>>>
>>>>tests: 1
>>>>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01
>>>>
>>>>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>>>---------------- --------------------------
>>>> %stddev change %stddev
>>>> \ | \
>>>> 93.51 3% 48% 138.53 3% will-it-scale.time.user_time
>>>> 186 40% 261 will-it-scale.per_thread_ops
>>>> 53909 40% 75507 will-it-scale.workload
>>>>
>>>>
>>>>tests: 1
>>>>testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01
>>>>
>>>>570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>>>---------------- --------------------------
>>>> %stddev change %stddev
>>>> \ | \
>>>> 447722 22% 546258 10% will-it-scale.time.involuntary_context_switches
>>>> 226995 19% 269751 will-it-scale.workload
>>>> 787 19% 936 will-it-scale.per_thread_ops
>>>>
>>>>
>>>>
>>>>commit a36dc70b810afe9183de2ea18faa4c0939c139ac
>>>>Author: 0day robot <[email protected]>
>>>>Date: Wed Feb 20 14:21:19 2019 +0800
>>>>
>>>> backfile klist_node in struct device for debugging
>>>>
>>>> Signed-off-by: 0day robot <[email protected]>
>>>>
>>>>diff --git a/include/linux/device.h b/include/linux/device.h
>>>>index d0e452fd0bff2..31666cb72b3ba 100644
>>>>--- a/include/linux/device.h
>>>>+++ b/include/linux/device.h
>>>>@@ -1035,6 +1035,7 @@ struct device {
>>>> spinlock_t devres_lock;
>>>> struct list_head devres_head;
>>>>
>>>>+ struct klist_node knode_class_test_by_rongc;
>>>> struct class *class;
>>>> const struct attribute_group **groups; /* optional groups */
>>>
>>> Hmm... because this is not properly aligned?
>>>
>>> struct klist_node {
>>> void *n_klist; /* never access directly */
>>> struct list_head n_node;
>>> struct kref n_ref;
>>> };
>>>
>>> Except struct kref has one "int" type, others are pointers.
>>>
>>> But... I am still confused.
>>
>>I guess because the size of struct device is changed, it influences some
>>alignment changes in the system. Thus influence the benchmark score.
>>
>
> That's interesting.
>
> I wrote a module to see the exact size of these two structure on my x86_64.
>
> sizeof(struct device) = 736 = 8 * 92
> sizeof(struct device_private) = 160 = 8 * 20
> sizeof(struct klist_node) = 32 = 8 * 4
>
> Even klist_node has one 4 byte field, c complier would pack the structure to
> make it aligned. Which system alignment it would affect?
>
> After the patch, size would change like this:
>
> struct device 736 -> 704
> struce device_private 160 -> 192
>
> Would this size change affect system?

Yes. I guess these size change may affect system performance. Some
other objects may share slab page with these objects.

Best Regards,
Huang, Ying

>>Best Regards,
>>Huang, Ying
>>
>>>>
>>>>Best Regards,
>>>>Rong Chen

2019-02-21 07:11:06

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

On Thu, Feb 21, 2019 at 11:10:49AM +0800, kernel test robot wrote:
> On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
> > On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
> > > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
> > > >Greeting,
> > > >
> > > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
> > > >
> > > >
> > > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
> > > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > >
> > >
> > > This is interesting.
> > >
> > > I didn't expect the move of this field will impact the performance.
> > >
> > > The reason is struct device is a hotter memory than device->device_private?
> > >
> > > >in testcase: will-it-scale
> > > >on test machine: 288 threads Knights Mill with 80G memory
> > > >with following parameters:
> > > >
> > > > nr_task: 100%
> > > > mode: thread
> > > > test: unlink2
> > > > cpufreq_governor: performance
> > > >
> > > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> > > >test-url: https://github.com/antonblanchard/will-it-scale
> > > >
> > > >In addition to that, the commit also has significant impact on the following tests:
> > > >
> > > >+------------------+---------------------------------------------------------------+
> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
> > > >| test machine | 288 threads Knights Mill with 80G memory |
> > > >| test parameters | cpufreq_governor=performance |
> > > >| | mode=thread |
> > > >| | nr_task=100% |
> > > >| | test=signal1 |
> >
> > Ok, I'm going to blame your testing system, or something here, and not
> > the above patch.
> >
> > All this test does is call raise(3). That does not touch the driver
> > core at all.
> >
> > > >+------------------+---------------------------------------------------------------+
> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
> > > >| test machine | 288 threads Knights Mill with 80G memory |
> > > >| test parameters | cpufreq_governor=performance |
> > > >| | mode=thread |
> > > >| | nr_task=100% |
> > > >| | test=open1 |
> > > >+------------------+---------------------------------------------------------------+
> >
> > Same here, open1 just calls open/close a lot. No driver core
> > interaction at all there either.
> >
> > So are you _sure_ this is the offending patch?
>
> Hi Greg,
>
> We did an experiment, recovered the layout of struct device. and we
> found the regression is gone. I guess the regession is not from the
> patch but related to the struct layout.
>
>
> tests: 1
> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01
>
> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
> ---------------- --------------------------
> %stddev change %stddev
> \ | \
> 237096 14% 270789 will-it-scale.workload
> 823 14% 939 will-it-scale.per_thread_ops
>
>
> tests: 1
> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01
>
> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
> ---------------- --------------------------
> %stddev change %stddev
> \ | \
> 93.51 ? 3% 48% 138.53 ? 3% will-it-scale.time.user_time
> 186 40% 261 will-it-scale.per_thread_ops
> 53909 40% 75507 will-it-scale.workload
>
>
> tests: 1
> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01
>
> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
> ---------------- --------------------------
> %stddev change %stddev
> \ | \
> 447722 22% 546258 ? 10% will-it-scale.time.involuntary_context_switches
> 226995 19% 269751 will-it-scale.workload
> 787 19% 936 will-it-scale.per_thread_ops
>
>
>
> commit a36dc70b810afe9183de2ea18faa4c0939c139ac
> Author: 0day robot <[email protected]>
> Date: Wed Feb 20 14:21:19 2019 +0800
>
> backfile klist_node in struct device for debugging
>
> Signed-off-by: 0day robot <[email protected]>
>
> diff --git a/include/linux/device.h b/include/linux/device.h
> index d0e452fd0bff2..31666cb72b3ba 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -1035,6 +1035,7 @@ struct device {
> spinlock_t devres_lock;
> struct list_head devres_head;
>
> + struct klist_node knode_class_test_by_rongc;
> struct class *class;
> const struct attribute_group **groups; /* optional groups */

While this is fun to worry about alignment and structure size of 'struct
device' I find it odd given that the syscalls and userspace load of
those test programs have nothing to do with 'struct device' at all.

So I can work on fixing up the alignment of struct device, as that's a
nice thing to do for systems with 30k of these in memory, but that
shouldn't affect a workload of a constant string of signal calls.

thanks,

greg k-h

2019-02-21 07:19:07

by Huang, Ying

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

Greg Kroah-Hartman <[email protected]> writes:

> On Thu, Feb 21, 2019 at 11:10:49AM +0800, kernel test robot wrote:
>> On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
>> > On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
>> > > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
>> > > >Greeting,
>> > > >
>> > > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
>> > > >
>> > > >
>> > > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
>> > > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>> > > >
>> > >
>> > > This is interesting.
>> > >
>> > > I didn't expect the move of this field will impact the performance.
>> > >
>> > > The reason is struct device is a hotter memory than device->device_private?
>> > >
>> > > >in testcase: will-it-scale
>> > > >on test machine: 288 threads Knights Mill with 80G memory
>> > > >with following parameters:
>> > > >
>> > > > nr_task: 100%
>> > > > mode: thread
>> > > > test: unlink2
>> > > > cpufreq_governor: performance
>> > > >
>> > > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>> > > >test-url: https://github.com/antonblanchard/will-it-scale
>> > > >
>> > > >In addition to that, the commit also has significant impact on the following tests:
>> > > >
>> > > >+------------------+---------------------------------------------------------------+
>> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
>> > > >| test machine | 288 threads Knights Mill with 80G memory |
>> > > >| test parameters | cpufreq_governor=performance |
>> > > >| | mode=thread |
>> > > >| | nr_task=100% |
>> > > >| | test=signal1 |
>> >
>> > Ok, I'm going to blame your testing system, or something here, and not
>> > the above patch.
>> >
>> > All this test does is call raise(3). That does not touch the driver
>> > core at all.
>> >
>> > > >+------------------+---------------------------------------------------------------+
>> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
>> > > >| test machine | 288 threads Knights Mill with 80G memory |
>> > > >| test parameters | cpufreq_governor=performance |
>> > > >| | mode=thread |
>> > > >| | nr_task=100% |
>> > > >| | test=open1 |
>> > > >+------------------+---------------------------------------------------------------+
>> >
>> > Same here, open1 just calls open/close a lot. No driver core
>> > interaction at all there either.
>> >
>> > So are you _sure_ this is the offending patch?
>>
>> Hi Greg,
>>
>> We did an experiment, recovered the layout of struct device. and we
>> found the regression is gone. I guess the regession is not from the
>> patch but related to the struct layout.
>>
>>
>> tests: 1
>> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01
>>
>> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>> ---------------- --------------------------
>> %stddev change %stddev
>> \ | \
>> 237096 14% 270789 will-it-scale.workload
>> 823 14% 939 will-it-scale.per_thread_ops
>>
>>
>> tests: 1
>> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01
>>
>> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>> ---------------- --------------------------
>> %stddev change %stddev
>> \ | \
>> 93.51 3% 48% 138.53 3% will-it-scale.time.user_time
>> 186 40% 261 will-it-scale.per_thread_ops
>> 53909 40% 75507 will-it-scale.workload
>>
>>
>> tests: 1
>> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01
>>
>> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>> ---------------- --------------------------
>> %stddev change %stddev
>> \ | \
>> 447722 22% 546258 10% will-it-scale.time.involuntary_context_switches
>> 226995 19% 269751 will-it-scale.workload
>> 787 19% 936 will-it-scale.per_thread_ops
>>
>>
>>
>> commit a36dc70b810afe9183de2ea18faa4c0939c139ac
>> Author: 0day robot <[email protected]>
>> Date: Wed Feb 20 14:21:19 2019 +0800
>>
>> backfile klist_node in struct device for debugging
>>
>> Signed-off-by: 0day robot <[email protected]>
>>
>> diff --git a/include/linux/device.h b/include/linux/device.h
>> index d0e452fd0bff2..31666cb72b3ba 100644
>> --- a/include/linux/device.h
>> +++ b/include/linux/device.h
>> @@ -1035,6 +1035,7 @@ struct device {
>> spinlock_t devres_lock;
>> struct list_head devres_head;
>>
>> + struct klist_node knode_class_test_by_rongc;
>> struct class *class;
>> const struct attribute_group **groups; /* optional groups */
>
> While this is fun to worry about alignment and structure size of 'struct
> device' I find it odd given that the syscalls and userspace load of
> those test programs have nothing to do with 'struct device' at all.
>
> So I can work on fixing up the alignment of struct device, as that's a
> nice thing to do for systems with 30k of these in memory, but that
> shouldn't affect a workload of a constant string of signal calls.

Hi, Greg,

I don't think this is an issues of struct device. As you said, struct
device isn't access much during test. Struct device may share slab page
with some other data structures (signal related, or fd related (as in
some other test cases)), so that the alignment of these data structures
are affected, so caused the performance regression.

Best Regards,
Huang, Ying

> thanks,
>
> greg k-h

2019-02-21 07:36:14

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

On Thu, Feb 21, 2019 at 03:18:22PM +0800, Huang, Ying wrote:
> Greg Kroah-Hartman <[email protected]> writes:
>
> > On Thu, Feb 21, 2019 at 11:10:49AM +0800, kernel test robot wrote:
> >> On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
> >> > On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
> >> > > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
> >> > > >Greeting,
> >> > > >
> >> > > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
> >> > > >
> >> > > >
> >> > > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
> >> > > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >> > > >
> >> > >
> >> > > This is interesting.
> >> > >
> >> > > I didn't expect the move of this field will impact the performance.
> >> > >
> >> > > The reason is struct device is a hotter memory than device->device_private?
> >> > >
> >> > > >in testcase: will-it-scale
> >> > > >on test machine: 288 threads Knights Mill with 80G memory
> >> > > >with following parameters:
> >> > > >
> >> > > > nr_task: 100%
> >> > > > mode: thread
> >> > > > test: unlink2
> >> > > > cpufreq_governor: performance
> >> > > >
> >> > > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> >> > > >test-url: https://github.com/antonblanchard/will-it-scale
> >> > > >
> >> > > >In addition to that, the commit also has significant impact on the following tests:
> >> > > >
> >> > > >+------------------+---------------------------------------------------------------+
> >> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
> >> > > >| test machine | 288 threads Knights Mill with 80G memory |
> >> > > >| test parameters | cpufreq_governor=performance |
> >> > > >| | mode=thread |
> >> > > >| | nr_task=100% |
> >> > > >| | test=signal1 |
> >> >
> >> > Ok, I'm going to blame your testing system, or something here, and not
> >> > the above patch.
> >> >
> >> > All this test does is call raise(3). That does not touch the driver
> >> > core at all.
> >> >
> >> > > >+------------------+---------------------------------------------------------------+
> >> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
> >> > > >| test machine | 288 threads Knights Mill with 80G memory |
> >> > > >| test parameters | cpufreq_governor=performance |
> >> > > >| | mode=thread |
> >> > > >| | nr_task=100% |
> >> > > >| | test=open1 |
> >> > > >+------------------+---------------------------------------------------------------+
> >> >
> >> > Same here, open1 just calls open/close a lot. No driver core
> >> > interaction at all there either.
> >> >
> >> > So are you _sure_ this is the offending patch?
> >>
> >> Hi Greg,
> >>
> >> We did an experiment, recovered the layout of struct device. and we
> >> found the regression is gone. I guess the regession is not from the
> >> patch but related to the struct layout.
> >>
> >>
> >> tests: 1
> >> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01
> >>
> >> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
> >> ---------------- --------------------------
> >> %stddev change %stddev
> >> \ | \
> >> 237096 14% 270789 will-it-scale.workload
> >> 823 14% 939 will-it-scale.per_thread_ops
> >>
> >>
> >> tests: 1
> >> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01
> >>
> >> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
> >> ---------------- --------------------------
> >> %stddev change %stddev
> >> \ | \
> >> 93.51 3% 48% 138.53 3% will-it-scale.time.user_time
> >> 186 40% 261 will-it-scale.per_thread_ops
> >> 53909 40% 75507 will-it-scale.workload
> >>
> >>
> >> tests: 1
> >> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01
> >>
> >> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
> >> ---------------- --------------------------
> >> %stddev change %stddev
> >> \ | \
> >> 447722 22% 546258 10% will-it-scale.time.involuntary_context_switches
> >> 226995 19% 269751 will-it-scale.workload
> >> 787 19% 936 will-it-scale.per_thread_ops
> >>
> >>
> >>
> >> commit a36dc70b810afe9183de2ea18faa4c0939c139ac
> >> Author: 0day robot <[email protected]>
> >> Date: Wed Feb 20 14:21:19 2019 +0800
> >>
> >> backfile klist_node in struct device for debugging
> >>
> >> Signed-off-by: 0day robot <[email protected]>
> >>
> >> diff --git a/include/linux/device.h b/include/linux/device.h
> >> index d0e452fd0bff2..31666cb72b3ba 100644
> >> --- a/include/linux/device.h
> >> +++ b/include/linux/device.h
> >> @@ -1035,6 +1035,7 @@ struct device {
> >> spinlock_t devres_lock;
> >> struct list_head devres_head;
> >>
> >> + struct klist_node knode_class_test_by_rongc;
> >> struct class *class;
> >> const struct attribute_group **groups; /* optional groups */
> >
> > While this is fun to worry about alignment and structure size of 'struct
> > device' I find it odd given that the syscalls and userspace load of
> > those test programs have nothing to do with 'struct device' at all.
> >
> > So I can work on fixing up the alignment of struct device, as that's a
> > nice thing to do for systems with 30k of these in memory, but that
> > shouldn't affect a workload of a constant string of signal calls.
>
> Hi, Greg,
>
> I don't think this is an issues of struct device. As you said, struct
> device isn't access much during test. Struct device may share slab page
> with some other data structures (signal related, or fd related (as in
> some other test cases)), so that the alignment of these data structures
> are affected, so caused the performance regression.

But allocation of a structure should always be "properly" aligned, no
matter what something else did in the system as that is what kmalloc
ensures. If not, then we have problems in our memory allocator :)

So something is odd here, but I don't think that is it...

greg k-h

2019-02-21 07:54:23

by Wei Yang

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

On Thu, Feb 21, 2019 at 03:18:22PM +0800, Huang, Ying wrote:
>Greg Kroah-Hartman <[email protected]> writes:
>
>> On Thu, Feb 21, 2019 at 11:10:49AM +0800, kernel test robot wrote:
>>> On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
>>> > On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
>>> > > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
>>> > > >Greeting,
>>> > > >
>>> > > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
>>> > > >
>>> > > >
>>> > > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
>>> > > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>> > > >
>>> > >
>>> > > This is interesting.
>>> > >
>>> > > I didn't expect the move of this field will impact the performance.
>>> > >
>>> > > The reason is struct device is a hotter memory than device->device_private?
>>> > >
>>> > > >in testcase: will-it-scale
>>> > > >on test machine: 288 threads Knights Mill with 80G memory
>>> > > >with following parameters:
>>> > > >
>>> > > > nr_task: 100%
>>> > > > mode: thread
>>> > > > test: unlink2
>>> > > > cpufreq_governor: performance
>>> > > >
>>> > > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>>> > > >test-url: https://github.com/antonblanchard/will-it-scale
>>> > > >
>>> > > >In addition to that, the commit also has significant impact on the following tests:
>>> > > >
>>> > > >+------------------+---------------------------------------------------------------+
>>> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
>>> > > >| test machine | 288 threads Knights Mill with 80G memory |
>>> > > >| test parameters | cpufreq_governor=performance |
>>> > > >| | mode=thread |
>>> > > >| | nr_task=100% |
>>> > > >| | test=signal1 |
>>> >
>>> > Ok, I'm going to blame your testing system, or something here, and not
>>> > the above patch.
>>> >
>>> > All this test does is call raise(3). That does not touch the driver
>>> > core at all.
>>> >
>>> > > >+------------------+---------------------------------------------------------------+
>>> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
>>> > > >| test machine | 288 threads Knights Mill with 80G memory |
>>> > > >| test parameters | cpufreq_governor=performance |
>>> > > >| | mode=thread |
>>> > > >| | nr_task=100% |
>>> > > >| | test=open1 |
>>> > > >+------------------+---------------------------------------------------------------+
>>> >
>>> > Same here, open1 just calls open/close a lot. No driver core
>>> > interaction at all there either.
>>> >
>>> > So are you _sure_ this is the offending patch?
>>>
>>> Hi Greg,
>>>
>>> We did an experiment, recovered the layout of struct device. and we
>>> found the regression is gone. I guess the regession is not from the
>>> patch but related to the struct layout.
>>>
>>>
>>> tests: 1
>>> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01
>>>
>>> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>> ---------------- --------------------------
>>> %stddev change %stddev
>>> \ | \
>>> 237096 14% 270789 will-it-scale.workload
>>> 823 14% 939 will-it-scale.per_thread_ops
>>>
>>>
>>> tests: 1
>>> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01
>>>
>>> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>> ---------------- --------------------------
>>> %stddev change %stddev
>>> \ | \
>>> 93.51 3% 48% 138.53 3% will-it-scale.time.user_time
>>> 186 40% 261 will-it-scale.per_thread_ops
>>> 53909 40% 75507 will-it-scale.workload
>>>
>>>
>>> tests: 1
>>> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01
>>>
>>> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>> ---------------- --------------------------
>>> %stddev change %stddev
>>> \ | \
>>> 447722 22% 546258 10% will-it-scale.time.involuntary_context_switches
>>> 226995 19% 269751 will-it-scale.workload
>>> 787 19% 936 will-it-scale.per_thread_ops
>>>
>>>
>>>
>>> commit a36dc70b810afe9183de2ea18faa4c0939c139ac
>>> Author: 0day robot <[email protected]>
>>> Date: Wed Feb 20 14:21:19 2019 +0800
>>>
>>> backfile klist_node in struct device for debugging
>>>
>>> Signed-off-by: 0day robot <[email protected]>
>>>
>>> diff --git a/include/linux/device.h b/include/linux/device.h
>>> index d0e452fd0bff2..31666cb72b3ba 100644
>>> --- a/include/linux/device.h
>>> +++ b/include/linux/device.h
>>> @@ -1035,6 +1035,7 @@ struct device {
>>> spinlock_t devres_lock;
>>> struct list_head devres_head;
>>>
>>> + struct klist_node knode_class_test_by_rongc;
>>> struct class *class;
>>> const struct attribute_group **groups; /* optional groups */
>>
>> While this is fun to worry about alignment and structure size of 'struct
>> device' I find it odd given that the syscalls and userspace load of
>> those test programs have nothing to do with 'struct device' at all.
>>
>> So I can work on fixing up the alignment of struct device, as that's a
>> nice thing to do for systems with 30k of these in memory, but that
>> shouldn't affect a workload of a constant string of signal calls.
>
>Hi, Greg,
>
>I don't think this is an issues of struct device. As you said, struct
>device isn't access much during test. Struct device may share slab page
>with some other data structures (signal related, or fd related (as in
>some other test cases)), so that the alignment of these data structures
>are affected, so caused the performance regression.
>

I didn't get the point here neither.

slab allocator ask memory from page allocator Page by Page and split the page
into pre-defined size. For example, 128B, 512B... Just as shown in
/proc/slabinfo.

Per my understanding, each struct device / device_private will sits in its own
aligned space. struct device would sits in 1K slab and struct device_private
would sits in 256B slab, both before and after this patch if I am correct.

Hmm... I am just curious about how this alignment is affected. Maybe I lost
some point?

>Best Regards,
>Huang, Ying
>
>> thanks,
>>
>> greg k-h

--
Wei Yang
Help you, Help me

2019-02-21 08:31:12

by Huang, Ying

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

Greg Kroah-Hartman <[email protected]> writes:

> On Thu, Feb 21, 2019 at 03:18:22PM +0800, Huang, Ying wrote:
>> Greg Kroah-Hartman <[email protected]> writes:
>>
>> > On Thu, Feb 21, 2019 at 11:10:49AM +0800, kernel test robot wrote:
>> >> On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
>> >> > On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
>> >> > > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
>> >> > > >Greeting,
>> >> > > >
>> >> > > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
>> >> > > >
>> >> > > >
>> >> > > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
>> >> > > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>> >> > > >
>> >> > >
>> >> > > This is interesting.
>> >> > >
>> >> > > I didn't expect the move of this field will impact the performance.
>> >> > >
>> >> > > The reason is struct device is a hotter memory than device->device_private?
>> >> > >
>> >> > > >in testcase: will-it-scale
>> >> > > >on test machine: 288 threads Knights Mill with 80G memory
>> >> > > >with following parameters:
>> >> > > >
>> >> > > > nr_task: 100%
>> >> > > > mode: thread
>> >> > > > test: unlink2
>> >> > > > cpufreq_governor: performance
>> >> > > >
>> >> > > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>> >> > > >test-url: https://github.com/antonblanchard/will-it-scale
>> >> > > >
>> >> > > >In addition to that, the commit also has significant impact on the following tests:
>> >> > > >
>> >> > > >+------------------+---------------------------------------------------------------+
>> >> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
>> >> > > >| test machine | 288 threads Knights Mill with 80G memory |
>> >> > > >| test parameters | cpufreq_governor=performance |
>> >> > > >| | mode=thread |
>> >> > > >| | nr_task=100% |
>> >> > > >| | test=signal1 |
>> >> >
>> >> > Ok, I'm going to blame your testing system, or something here, and not
>> >> > the above patch.
>> >> >
>> >> > All this test does is call raise(3). That does not touch the driver
>> >> > core at all.
>> >> >
>> >> > > >+------------------+---------------------------------------------------------------+
>> >> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
>> >> > > >| test machine | 288 threads Knights Mill with 80G memory |
>> >> > > >| test parameters | cpufreq_governor=performance |
>> >> > > >| | mode=thread |
>> >> > > >| | nr_task=100% |
>> >> > > >| | test=open1 |
>> >> > > >+------------------+---------------------------------------------------------------+
>> >> >
>> >> > Same here, open1 just calls open/close a lot. No driver core
>> >> > interaction at all there either.
>> >> >
>> >> > So are you _sure_ this is the offending patch?
>> >>
>> >> Hi Greg,
>> >>
>> >> We did an experiment, recovered the layout of struct device. and we
>> >> found the regression is gone. I guess the regession is not from the
>> >> patch but related to the struct layout.
>> >>
>> >>
>> >> tests: 1
>> >> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01
>> >>
>> >> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>> >> ---------------- --------------------------
>> >> %stddev change %stddev
>> >> \ | \
>> >> 237096 14% 270789 will-it-scale.workload
>> >> 823 14% 939 will-it-scale.per_thread_ops
>> >>
>> >>
>> >> tests: 1
>> >> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01
>> >>
>> >> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>> >> ---------------- --------------------------
>> >> %stddev change %stddev
>> >> \ | \
>> >> 93.51 3% 48% 138.53 3% will-it-scale.time.user_time
>> >> 186 40% 261 will-it-scale.per_thread_ops
>> >> 53909 40% 75507 will-it-scale.workload
>> >>
>> >>
>> >> tests: 1
>> >> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01
>> >>
>> >> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>> >> ---------------- --------------------------
>> >> %stddev change %stddev
>> >> \ | \
>> >> 447722 22% 546258 10% will-it-scale.time.involuntary_context_switches
>> >> 226995 19% 269751 will-it-scale.workload
>> >> 787 19% 936 will-it-scale.per_thread_ops
>> >>
>> >>
>> >>
>> >> commit a36dc70b810afe9183de2ea18faa4c0939c139ac
>> >> Author: 0day robot <[email protected]>
>> >> Date: Wed Feb 20 14:21:19 2019 +0800
>> >>
>> >> backfile klist_node in struct device for debugging
>> >>
>> >> Signed-off-by: 0day robot <[email protected]>
>> >>
>> >> diff --git a/include/linux/device.h b/include/linux/device.h
>> >> index d0e452fd0bff2..31666cb72b3ba 100644
>> >> --- a/include/linux/device.h
>> >> +++ b/include/linux/device.h
>> >> @@ -1035,6 +1035,7 @@ struct device {
>> >> spinlock_t devres_lock;
>> >> struct list_head devres_head;
>> >>
>> >> + struct klist_node knode_class_test_by_rongc;
>> >> struct class *class;
>> >> const struct attribute_group **groups; /* optional groups */
>> >
>> > While this is fun to worry about alignment and structure size of 'struct
>> > device' I find it odd given that the syscalls and userspace load of
>> > those test programs have nothing to do with 'struct device' at all.
>> >
>> > So I can work on fixing up the alignment of struct device, as that's a
>> > nice thing to do for systems with 30k of these in memory, but that
>> > shouldn't affect a workload of a constant string of signal calls.
>>
>> Hi, Greg,
>>
>> I don't think this is an issues of struct device. As you said, struct
>> device isn't access much during test. Struct device may share slab page
>> with some other data structures (signal related, or fd related (as in
>> some other test cases)), so that the alignment of these data structures
>> are affected, so caused the performance regression.
>
> But allocation of a structure should always be "properly" aligned, no
> matter what something else did in the system as that is what kmalloc
> ensures. If not, then we have problems in our memory allocator :)
>
> So something is odd here, but I don't think that is it...

If all these data structure are allocated with kmalloc() instead of
kmem_cache_alloc(), then my guessing above seems incorrect ...

Best Regards,
Huang, Ying

2019-02-21 22:33:26

by Wei Yang

[permalink] [raw]
Subject: Re: [LKP] [driver core] 570d020012: will-it-scale.per_thread_ops -12.2% regression

On Thu, Feb 21, 2019 at 03:53:13PM +0800, Wei Yang wrote:
>On Thu, Feb 21, 2019 at 03:18:22PM +0800, Huang, Ying wrote:
>>Greg Kroah-Hartman <[email protected]> writes:
>>
>>> On Thu, Feb 21, 2019 at 11:10:49AM +0800, kernel test robot wrote:
>>>> On Tue, Feb 19, 2019 at 01:19:04PM +0100, Greg Kroah-Hartman wrote:
>>>> > On Tue, Feb 19, 2019 at 08:59:45AM +0800, Wei Yang wrote:
>>>> > > On Mon, Feb 18, 2019 at 03:54:42PM +0800, kernel test robot wrote:
>>>> > > >Greeting,
>>>> > > >
>>>> > > >FYI, we noticed a -12.2% regression of will-it-scale.per_thread_ops due to commit:
>>>> > > >
>>>> > > >
>>>> > > >commit: 570d0200123fb4f809aa2f6226e93a458d664d70 ("driver core: move device->knode_class to device_private")
>>>> > > >https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>>> > > >
>>>> > >
>>>> > > This is interesting.
>>>> > >
>>>> > > I didn't expect the move of this field will impact the performance.
>>>> > >
>>>> > > The reason is struct device is a hotter memory than device->device_private?
>>>> > >
>>>> > > >in testcase: will-it-scale
>>>> > > >on test machine: 288 threads Knights Mill with 80G memory
>>>> > > >with following parameters:
>>>> > > >
>>>> > > > nr_task: 100%
>>>> > > > mode: thread
>>>> > > > test: unlink2
>>>> > > > cpufreq_governor: performance
>>>> > > >
>>>> > > >test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>>>> > > >test-url: https://github.com/antonblanchard/will-it-scale
>>>> > > >
>>>> > > >In addition to that, the commit also has significant impact on the following tests:
>>>> > > >
>>>> > > >+------------------+---------------------------------------------------------------+
>>>> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -29.9% regression |
>>>> > > >| test machine | 288 threads Knights Mill with 80G memory |
>>>> > > >| test parameters | cpufreq_governor=performance |
>>>> > > >| | mode=thread |
>>>> > > >| | nr_task=100% |
>>>> > > >| | test=signal1 |
>>>> >
>>>> > Ok, I'm going to blame your testing system, or something here, and not
>>>> > the above patch.
>>>> >
>>>> > All this test does is call raise(3). That does not touch the driver
>>>> > core at all.
>>>> >
>>>> > > >+------------------+---------------------------------------------------------------+
>>>> > > >| testcase: change | will-it-scale: will-it-scale.per_thread_ops -16.5% regression |
>>>> > > >| test machine | 288 threads Knights Mill with 80G memory |
>>>> > > >| test parameters | cpufreq_governor=performance |
>>>> > > >| | mode=thread |
>>>> > > >| | nr_task=100% |
>>>> > > >| | test=open1 |
>>>> > > >+------------------+---------------------------------------------------------------+
>>>> >
>>>> > Same here, open1 just calls open/close a lot. No driver core
>>>> > interaction at all there either.
>>>> >
>>>> > So are you _sure_ this is the offending patch?
>>>>
>>>> Hi Greg,
>>>>
>>>> We did an experiment, recovered the layout of struct device. and we
>>>> found the regression is gone. I guess the regession is not from the
>>>> patch but related to the struct layout.
>>>>
>>>>
>>>> tests: 1
>>>> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-unlink2/lkp-knm01
>>>>
>>>> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>>> ---------------- --------------------------
>>>> %stddev change %stddev
>>>> \ | \
>>>> 237096 14% 270789 will-it-scale.workload
>>>> 823 14% 939 will-it-scale.per_thread_ops
>>>>
>>>>
>>>> tests: 1
>>>> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-signal1/lkp-knm01
>>>>
>>>> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>>> ---------------- --------------------------
>>>> %stddev change %stddev
>>>> \ | \
>>>> 93.51 3% 48% 138.53 3% will-it-scale.time.user_time
>>>> 186 40% 261 will-it-scale.per_thread_ops
>>>> 53909 40% 75507 will-it-scale.workload
>>>>
>>>>
>>>> tests: 1
>>>> testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-open1/lkp-knm01
>>>>
>>>> 570d0200123fb4f8 a36dc70b810afe9183de2ea18f
>>>> ---------------- --------------------------
>>>> %stddev change %stddev
>>>> \ | \
>>>> 447722 22% 546258 10% will-it-scale.time.involuntary_context_switches
>>>> 226995 19% 269751 will-it-scale.workload
>>>> 787 19% 936 will-it-scale.per_thread_ops
>>>>
>>>>
>>>>
>>>> commit a36dc70b810afe9183de2ea18faa4c0939c139ac
>>>> Author: 0day robot <[email protected]>
>>>> Date: Wed Feb 20 14:21:19 2019 +0800
>>>>
>>>> backfile klist_node in struct device for debugging
>>>>
>>>> Signed-off-by: 0day robot <[email protected]>
>>>>
>>>> diff --git a/include/linux/device.h b/include/linux/device.h
>>>> index d0e452fd0bff2..31666cb72b3ba 100644
>>>> --- a/include/linux/device.h
>>>> +++ b/include/linux/device.h
>>>> @@ -1035,6 +1035,7 @@ struct device {
>>>> spinlock_t devres_lock;
>>>> struct list_head devres_head;
>>>>
>>>> + struct klist_node knode_class_test_by_rongc;
>>>> struct class *class;
>>>> const struct attribute_group **groups; /* optional groups */
>>>
>>> While this is fun to worry about alignment and structure size of 'struct
>>> device' I find it odd given that the syscalls and userspace load of
>>> those test programs have nothing to do with 'struct device' at all.
>>>
>>> So I can work on fixing up the alignment of struct device, as that's a
>>> nice thing to do for systems with 30k of these in memory, but that
>>> shouldn't affect a workload of a constant string of signal calls.
>>
>>Hi, Greg,
>>
>>I don't think this is an issues of struct device. As you said, struct
>>device isn't access much during test. Struct device may share slab page
>>with some other data structures (signal related, or fd related (as in
>>some other test cases)), so that the alignment of these data structures
>>are affected, so caused the performance regression.
>>
>
>I didn't get the point here neither.
>
>slab allocator ask memory from page allocator Page by Page and split the page
>into pre-defined size. For example, 128B, 512B... Just as shown in
>/proc/slabinfo.
>
>Per my understanding, each struct device / device_private will sits in its own
>aligned space. struct device would sits in 1K slab and struct device_private
>would sits in 256B slab, both before and after this patch if I am correct.
>

As Greg mentioned, device is embedded in other structure. My analysis
here is not correct. The change in size of device, may affect the size
of struct who wraps device.

In case this struct is allocated by kmem_cache, this may affect the
number of objects allocated each time, or even the number of pages
allocated each time.

This means with this patch may affect the system in the following two
aspects:

* the times for struct allocation
* the order of page it asks from page allocator

One place we may take a look is the /proc/slabinfo. To see whether the
change in size of device affect kmem_cach objects.

Hmm... while as Greg mentioned, those cases will not involve struct
device allocation. So the above aspects may not take effect?

>Hmm... I am just curious about how this alignment is affected. Maybe I lost
>some point?
>
>>Best Regards,
>>Huang, Ying
>>
>>> thanks,
>>>
>>> greg k-h
>
>--
>Wei Yang
>Help you, Help me

--
Wei Yang
Help you, Help me