2020-11-02 09:17:48

by Chen, Rong A

Subject: [mm/gup] a308c71bf1: stress-ng.vm-splice.ops_per_sec -95.6% regression

Greeting,

FYI, we noticed a -95.6% regression of stress-ng.vm-splice.ops_per_sec due to commit:


commit: a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a ("mm/gup: Remove enfornced COW mechanism")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master


in testcase: stress-ng
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
with the following parameters:

nr_threads: 100%
disk: 1HDD
testtime: 30s
class: pipe
cpufreq_governor: performance
ucode: 0x5002f01




If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
class/compiler/cpufreq_governor/disk/kconfig/nr_threads/rootfs/tbox_group/testcase/testtime/ucode:
pipe/gcc-9/performance/1HDD/x86_64-rhel-8.3/100%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2sp5/stress-ng/30s/0x5002f01

commit:
1a0cf26323 ("mm/ksm: Remove reuse_ksm_page()")
a308c71bf1 ("mm/gup: Remove enfornced COW mechanism")

1a0cf26323c80e2f a308c71bf1e6e19cc2e4ced3185
---------------- ---------------------------
%stddev %change %stddev
\ | \
18804 +2.1% 19191 stress-ng.time.system_time
1059 -35.9% 678.99 ± 4% stress-ng.time.user_time
1.556e+08 ± 7% +10.3% 1.716e+08 ± 4% stress-ng.time.voluntary_context_switches
3.406e+09 -95.6% 1.49e+08 stress-ng.vm-splice.ops
1.135e+08 -95.6% 4965911 stress-ng.vm-splice.ops_per_sec
5.11 -1.8 3.26 ± 4% mpstat.cpu.all.usr%
91.93 +2.0% 93.77 iostat.cpu.system
5.07 -36.2% 3.23 ± 4% iostat.cpu.user
28348 +6.7% 30235 ± 2% proc-vmstat.nr_kernel_stack
12148 ± 5% +11.2% 13513 ± 5% proc-vmstat.nr_page_table_pages
2767 ± 4% +8.0% 2987 ± 5% slabinfo.PING.active_objs
2767 ± 4% +8.0% 2987 ± 5% slabinfo.PING.num_objs
91.25 +1.9% 93.00 vmstat.cpu.sy
994917 ± 9% +11.4% 1108222 ± 4% vmstat.system.cs
13547 ± 6% -14.9% 11524 ± 9% numa-vmstat.node0.nr_slab_reclaimable
40767 ±112% +128.4% 93118 ± 61% numa-vmstat.node0.numa_other
12466 ± 6% +15.9% 14448 ± 6% numa-vmstat.node1.nr_slab_reclaimable
54181 ± 6% -14.9% 46101 ± 9% numa-meminfo.node0.KReclaimable
54181 ± 6% -14.9% 46101 ± 9% numa-meminfo.node0.SReclaimable
6689 ± 52% +73.2% 11588 ± 34% numa-meminfo.node1.Inactive
6689 ± 52% +73.2% 11588 ± 34% numa-meminfo.node1.Inactive(anon)
49867 ± 6% +15.9% 57793 ± 6% numa-meminfo.node1.KReclaimable
49867 ± 6% +15.9% 57793 ± 6% numa-meminfo.node1.SReclaimable
79.25 ± 12% -35.0% 51.50 ± 8% sched_debug.cfs_rq:/.nr_spread_over.max
9.76 ± 10% -22.7% 7.55 ± 9% sched_debug.cfs_rq:/.nr_spread_over.stddev
276.75 ± 11% -19.1% 224.01 ± 11% sched_debug.cfs_rq:/.runnable_avg.stddev
414265 ± 27% -44.1% 231543 ± 45% sched_debug.cfs_rq:/.spread0.avg
1199728 ± 14% -28.1% 863115 ± 23% sched_debug.cfs_rq:/.spread0.max
20.22 ± 49% -48.0% 10.52 ± 18% sched_debug.cpu.clock.stddev
1410577 ± 10% +13.1% 1594913 ± 5% sched_debug.cpu.nr_switches.avg
1409041 ± 10% +13.1% 1593368 ± 5% sched_debug.cpu.sched_count.avg
980469 ± 9% +12.5% 1102701 ± 4% sched_debug.cpu.ttwu_count.avg
866170 ± 10% +13.9% 986450 ± 5% sched_debug.cpu.ttwu_local.avg
38.20 ± 7% -6.3 31.86 ± 16% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
37.36 ± 7% -6.2 31.12 ± 16% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
33.38 ± 7% -5.7 27.65 ± 15% perf-profile.calltrace.cycles-pp.new_sync_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
32.56 ± 7% -5.6 26.93 ± 15% perf-profile.calltrace.cycles-pp.pipe_read.new_sync_read.vfs_read.ksys_read.do_syscall_64
26.32 ± 7% -5.1 21.23 ± 14% perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_read.new_sync_read.vfs_read.ksys_read
26.00 ± 7% -5.1 20.92 ± 14% perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_read.new_sync_read.vfs_read
25.98 ± 7% -5.1 20.89 ± 14% perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_read.new_sync_read
25.96 ± 7% -5.1 20.88 ± 14% perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_read
24.74 ± 8% -5.0 19.70 ± 15% perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
24.73 ± 8% -5.0 19.70 ± 15% perf-profile.calltrace.cycles-pp.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common
24.30 ± 8% -5.0 19.32 ± 15% perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.autoremove_wake_function
1.00 ± 23% +0.4 1.40 ± 16% perf-profile.calltrace.cycles-pp.arch_stack_walk.stack_trace_save_tsk.__account_scheduler_latency.update_stats_enqueue_sleeper.enqueue_entity
1.05 ± 23% +0.4 1.48 ± 16% perf-profile.calltrace.cycles-pp.stack_trace_save_tsk.__account_scheduler_latency.update_stats_enqueue_sleeper.enqueue_entity.enqueue_task_fair
11.09 +0.5 11.60 ± 3% perf-profile.calltrace.cycles-pp.do_select.core_sys_select.kern_select.__x64_sys_select.do_syscall_64
38.23 ± 7% -6.3 31.88 ± 16% perf-profile.children.cycles-pp.ksys_read
37.45 ± 7% -6.3 31.20 ± 16% perf-profile.children.cycles-pp.vfs_read
33.41 ± 7% -5.7 27.68 ± 15% perf-profile.children.cycles-pp.new_sync_read
32.63 ± 7% -5.6 26.99 ± 15% perf-profile.children.cycles-pp.pipe_read
25.98 ± 7% -5.1 20.90 ± 14% perf-profile.children.cycles-pp.autoremove_wake_function
0.09 ± 7% +0.0 0.11 ± 4% perf-profile.children.cycles-pp.pick_next_entity
0.27 ± 4% +0.0 0.31 ± 3% perf-profile.children.cycles-pp.load_new_mm_cr3
0.39 ± 4% +0.1 0.45 ± 3% perf-profile.children.cycles-pp.switch_mm_irqs_off
0.50 ± 4% +0.1 0.56 ± 2% perf-profile.children.cycles-pp.pick_next_task_fair
0.06 ± 60% +0.1 0.11 ± 21% perf-profile.children.cycles-pp.__mutex_unlock_slowpath
2.11 ± 4% +0.3 2.40 ± 4% perf-profile.children.cycles-pp.schedule
11.21 +0.5 11.71 ± 2% perf-profile.children.cycles-pp.do_select
0.05 +0.0 0.06 perf-profile.self.cycles-pp.finish_task_switch
0.05 ± 8% +0.0 0.07 ± 7% perf-profile.self.cycles-pp.dequeue_task_fair
0.23 ± 3% +0.0 0.26 ± 4% perf-profile.self.cycles-pp.__sched_text_start
0.27 ± 4% +0.0 0.31 ± 3% perf-profile.self.cycles-pp.load_new_mm_cr3
11.52 ± 3% +8.8% 12.53 perf-stat.i.MPKI
3.307e+10 -20.3% 2.636e+10 ± 2% perf-stat.i.branch-instructions
0.73 +0.0 0.75 perf-stat.i.branch-miss-rate%
1.957e+08 -27.3% 1.423e+08 ± 3% perf-stat.i.branch-misses
987322 ± 9% +13.0% 1115216 ± 4% perf-stat.i.context-switches
4.95 ± 3% +43.7% 7.11 perf-stat.i.cpi
4.722e+10 -20.9% 3.737e+10 ± 2% perf-stat.i.dTLB-loads
2.789e+10 -22.3% 2.168e+10 ± 3% perf-stat.i.dTLB-stores
1.659e+08 ± 3% -29.2% 1.175e+08 ± 5% perf-stat.i.iTLB-load-misses
1.654e+11 -21.1% 1.305e+11 ± 2% perf-stat.i.instructions
0.67 -20.7% 0.53 ± 2% perf-stat.i.ipc
1.19 ± 5% -23.8% 0.91 ± 5% perf-stat.i.metric.K/sec
1130 -21.0% 893.12 ± 3% perf-stat.i.metric.M/sec
91.95 +0.9 92.88 perf-stat.i.node-store-miss-rate%
2.03 ± 5% +31.1% 2.66 ± 5% perf-stat.overall.MPKI
0.59 -0.0 0.54 perf-stat.overall.branch-miss-rate%
1.54 +28.4% 1.98 ± 2% perf-stat.overall.cpi
0.00 ± 3% +0.0 0.00 ± 6% perf-stat.overall.dTLB-store-miss-rate%
97.08 -1.4 95.70 perf-stat.overall.iTLB-load-miss-rate%
991.26 ± 3% +10.9% 1099 ± 2% perf-stat.overall.instructions-per-iTLB-miss
0.65 -22.0% 0.51 ± 2% perf-stat.overall.ipc
87.39 +2.0 89.39 perf-stat.overall.node-store-miss-rate%
3.271e+10 -20.8% 2.592e+10 ± 3% perf-stat.ps.branch-instructions
1.945e+08 -27.4% 1.412e+08 ± 3% perf-stat.ps.branch-misses
1004771 ± 9% +11.5% 1119933 ± 4% perf-stat.ps.context-switches
4.669e+10 -21.3% 3.674e+10 ± 3% perf-stat.ps.dTLB-loads
2.756e+10 -22.7% 2.131e+10 ± 3% perf-stat.ps.dTLB-stores
1.652e+08 ± 3% -29.3% 1.168e+08 ± 5% perf-stat.ps.iTLB-load-misses
1.635e+11 -21.6% 1.283e+11 ± 3% perf-stat.ps.instructions
3.497e+13 -21.6% 2.741e+13 ± 3% perf-stat.total.instructions
6739 ± 14% +33.4% 8988 ± 12% softirqs.CPU14.SCHED
6775 ± 15% +36.0% 9217 ± 8% softirqs.CPU19.SCHED
7359 ± 17% +29.1% 9500 ± 14% softirqs.CPU2.SCHED
9708 ± 10% -19.5% 7811 ± 5% softirqs.CPU24.SCHED
9651 ± 8% -18.2% 7890 ± 3% softirqs.CPU25.SCHED
9444 ± 6% -18.1% 7737 ± 10% softirqs.CPU27.SCHED
9578 ± 10% -18.3% 7826 ± 5% softirqs.CPU30.SCHED
9469 ± 8% -16.8% 7882 ± 5% softirqs.CPU31.SCHED
9826 ± 8% -16.1% 8246 ± 7% softirqs.CPU32.SCHED
9319 ± 9% -15.0% 7922 ± 5% softirqs.CPU34.SCHED
10167 ± 10% -20.1% 8123 ± 3% softirqs.CPU35.SCHED
9208 ± 8% -15.6% 7772 ± 5% softirqs.CPU36.SCHED
9641 ± 6% -17.8% 7921 ± 7% softirqs.CPU37.SCHED
9399 ± 7% -16.0% 7896 ± 6% softirqs.CPU38.SCHED
9275 ± 6% -15.4% 7851 ± 6% softirqs.CPU39.SCHED
9937 ± 12% -21.6% 7786 ± 4% softirqs.CPU40.SCHED
9259 ± 6% -16.2% 7758 ± 6% softirqs.CPU41.SCHED
9490 ± 6% -16.7% 7901 ± 4% softirqs.CPU42.SCHED
9504 ± 8% -18.6% 7736 ± 5% softirqs.CPU43.SCHED
9602 ± 8% -17.4% 7926 ± 2% softirqs.CPU44.SCHED
9564 ± 7% -18.5% 7796 ± 5% softirqs.CPU46.SCHED
6581 ± 12% +26.9% 8353 ± 14% softirqs.CPU50.SCHED
6864 ± 17% +27.7% 8762 ± 16% softirqs.CPU55.SCHED
6495 ± 10% +31.4% 8537 ± 12% softirqs.CPU61.SCHED
6294 ± 2% +42.8% 8985 ± 20% softirqs.CPU69.SCHED
9312 ± 8% -20.1% 7439 ± 8% softirqs.CPU72.SCHED
9408 ± 6% -19.1% 7615 ± 5% softirqs.CPU74.SCHED
9348 ± 8% -15.9% 7863 ± 5% softirqs.CPU75.SCHED
9506 ± 7% -17.4% 7855 ± 5% softirqs.CPU76.SCHED
9433 ± 8% -16.8% 7851 ± 5% softirqs.CPU77.SCHED
9585 ± 10% -19.3% 7737 ± 5% softirqs.CPU78.SCHED
9381 ± 9% -17.2% 7765 ± 4% softirqs.CPU79.SCHED
9559 ± 7% -16.7% 7964 ± 5% softirqs.CPU80.SCHED
9357 ± 6% -15.3% 7927 ± 6% softirqs.CPU81.SCHED
9710 ± 9% -19.4% 7825 ± 6% softirqs.CPU82.SCHED
9761 ± 9% -19.8% 7827 ± 5% softirqs.CPU83.SCHED
9433 ± 6% -16.7% 7856 ± 5% softirqs.CPU84.SCHED
9325 ± 7% -15.7% 7863 ± 5% softirqs.CPU85.SCHED
9609 ± 8% -18.9% 7796 ± 4% softirqs.CPU88.SCHED
9290 ± 7% -17.6% 7654 ± 4% softirqs.CPU89.SCHED
9595 ± 9% -18.5% 7821 ± 5% softirqs.CPU90.SCHED
9528 ± 9% -18.6% 7752 ± 4% softirqs.CPU91.SCHED
9660 ± 8% -19.5% 7773 ± 5% softirqs.CPU92.SCHED
9641 ± 8% -17.5% 7949 ± 5% softirqs.CPU93.SCHED
9710 ± 4% -18.9% 7875 ± 5% softirqs.CPU94.SCHED
10107 ± 9% -22.6% 7821 ± 6% softirqs.CPU95.SCHED
1731 ±114% -88.0% 208.50 ±173% interrupts.43:PCI-MSI.31981576-edge.i40e-eth0-TxRx-7
5625 ± 41% +73.0% 9730 ± 28% interrupts.CPU11.CAL:Function_call_interrupts
3192 ± 51% +500.0% 19155 ± 99% interrupts.CPU12.CAL:Function_call_interrupts
5652 ± 59% +387.9% 27579 ± 96% interrupts.CPU13.CAL:Function_call_interrupts
3389 ± 92% +206.3% 10382 ± 17% interrupts.CPU16.CAL:Function_call_interrupts
3212 ± 64% +781.8% 28328 ± 55% interrupts.CPU18.CAL:Function_call_interrupts
5503 ± 51% +261.3% 19884 ± 65% interrupts.CPU19.CAL:Function_call_interrupts
4305 ± 6% +37.2% 5908 ± 24% interrupts.CPU20.RES:Rescheduling_interrupts
3653 ± 36% +248.2% 12720 ± 12% interrupts.CPU21.CAL:Function_call_interrupts
4937 ± 36% +91.9% 9477 ± 52% interrupts.CPU22.CAL:Function_call_interrupts
23314 ± 20% -75.5% 5720 ± 41% interrupts.CPU25.CAL:Function_call_interrupts
22533 ± 8% -71.8% 6360 ± 35% interrupts.CPU26.CAL:Function_call_interrupts
24106 ± 13% -62.7% 8987 ± 57% interrupts.CPU27.CAL:Function_call_interrupts
6111 ± 17% +367.9% 28593 ± 55% interrupts.CPU27.RES:Rescheduling_interrupts
28249 ± 32% -69.3% 8675 ± 50% interrupts.CPU28.CAL:Function_call_interrupts
5584 ± 26% +402.4% 28060 ± 26% interrupts.CPU28.RES:Rescheduling_interrupts
19155 ± 12% -52.9% 9015 ± 61% interrupts.CPU29.CAL:Function_call_interrupts
4946 ± 18% +368.8% 23186 ± 70% interrupts.CPU29.RES:Rescheduling_interrupts
25158 ± 27% -73.4% 6694 ± 69% interrupts.CPU30.CAL:Function_call_interrupts
20595 ± 28% -63.0% 7620 ± 63% interrupts.CPU31.CAL:Function_call_interrupts
6658 ± 23% -30.6% 4620 ± 33% interrupts.CPU32.NMI:Non-maskable_interrupts
6658 ± 23% -30.6% 4620 ± 33% interrupts.CPU32.PMI:Performance_monitoring_interrupts
19923 ± 27% -70.4% 5906 ± 43% interrupts.CPU33.CAL:Function_call_interrupts
21664 ± 35% -68.5% 6813 ± 44% interrupts.CPU35.CAL:Function_call_interrupts
18049 ± 22% -63.7% 6560 ± 47% interrupts.CPU36.CAL:Function_call_interrupts
22299 ± 15% -60.0% 8913 ± 44% interrupts.CPU37.CAL:Function_call_interrupts
16997 ± 16% -63.7% 6173 ± 56% interrupts.CPU38.CAL:Function_call_interrupts
22528 ± 43% -62.8% 8375 ± 41% interrupts.CPU39.CAL:Function_call_interrupts
5373 ± 39% +100.0% 10743 ± 19% interrupts.CPU4.CAL:Function_call_interrupts
20817 ± 16% -63.8% 7530 ± 62% interrupts.CPU40.CAL:Function_call_interrupts
18927 ± 33% -67.6% 6125 ± 35% interrupts.CPU41.CAL:Function_call_interrupts
28780 ± 37% -66.3% 9688 ± 58% interrupts.CPU42.CAL:Function_call_interrupts
23586 ± 20% -72.2% 6561 ± 37% interrupts.CPU43.CAL:Function_call_interrupts
21970 ± 14% -72.1% 6136 ± 53% interrupts.CPU44.CAL:Function_call_interrupts
21667 ± 21% -47.7% 11326 ± 36% interrupts.CPU47.CAL:Function_call_interrupts
4314 ± 42% +170.1% 11653 ± 44% interrupts.CPU5.CAL:Function_call_interrupts
3969 ± 69% +176.8% 10988 ± 33% interrupts.CPU53.CAL:Function_call_interrupts
8525 ± 37% -33.5% 5672 ± 10% interrupts.CPU58.RES:Rescheduling_interrupts
4521 ± 28% +255.7% 16080 ± 32% interrupts.CPU61.CAL:Function_call_interrupts
3924 ± 50% +242.6% 13445 ± 39% interrupts.CPU63.CAL:Function_call_interrupts
3298 ± 45% +255.7% 11732 ± 44% interrupts.CPU69.CAL:Function_call_interrupts
1730 ±114% -88.0% 208.25 ±173% interrupts.CPU7.43:PCI-MSI.31981576-edge.i40e-eth0-TxRx-7
21881 ± 31% -64.7% 7726 ± 47% interrupts.CPU73.CAL:Function_call_interrupts
23394 ± 27% -64.5% 8308 ± 66% interrupts.CPU74.CAL:Function_call_interrupts
6742 ± 17% +312.2% 27796 ± 38% interrupts.CPU75.RES:Rescheduling_interrupts
28453 ± 45% -72.8% 7728 ± 49% interrupts.CPU76.CAL:Function_call_interrupts
5517 ± 43% +354.3% 25062 ± 20% interrupts.CPU76.RES:Rescheduling_interrupts
4801 ± 16% +433.6% 25624 ± 55% interrupts.CPU77.RES:Rescheduling_interrupts
25911 ± 15% -67.8% 8348 ± 58% interrupts.CPU78.CAL:Function_call_interrupts
21248 ± 16% -67.7% 6863 ± 40% interrupts.CPU80.CAL:Function_call_interrupts
21098 ± 23% -66.8% 6996 ± 34% interrupts.CPU83.CAL:Function_call_interrupts
19567 ± 20% -52.4% 9306 ± 64% interrupts.CPU84.CAL:Function_call_interrupts
16578 ± 20% -58.4% 6902 ± 56% interrupts.CPU87.CAL:Function_call_interrupts
21823 ± 28% -75.4% 5363 ± 41% interrupts.CPU89.CAL:Function_call_interrupts
24409 ± 22% -52.0% 11718 ± 49% interrupts.CPU90.CAL:Function_call_interrupts
20699 ± 21% -74.0% 5389 ± 40% interrupts.CPU91.CAL:Function_call_interrupts
18159 ± 14% -63.7% 6587 ± 67% interrupts.CPU92.CAL:Function_call_interrupts
22992 ± 23% -56.8% 9925 ± 38% interrupts.CPU93.CAL:Function_call_interrupts
5656 ± 20% +368.6% 26505 ± 47% interrupts.CPU93.RES:Rescheduling_interrupts
21995 ± 26% -64.6% 7782 ± 68% interrupts.CPU94.CAL:Function_call_interrupts
868396 ± 3% +20.9% 1050234 ± 14% interrupts.RES:Rescheduling_interrupts



stress-ng.time.system_time

19300 +-------------------------------------------------------------------+
| |
19200 |-+ O O O O |
| O O O O O O |
| O O |
19100 |-+O O O O |
| O O |
19000 |-+ |
| |
18900 |-+ |
| |
| ..+.. .+...+..+... |
18800 |.. ..+..+...+.. +... .+. +. +..|
| +...+..+...+.. .+. .. +. |
18700 +-------------------------------------------------------------------+


stress-ng.vm-splice.ops

3.5e+09 +-----------------------------------------------------------------+
| |
3e+09 |-+ |
| |
2.5e+09 |-+ |
| |
2e+09 |-+ |
| |
1.5e+09 |-+ |
| |
1e+09 |-+ |
| |
5e+08 |-+ |
| O O O O O O O O O O O O O O O O O O |
0 +-----------------------------------------------------------------+


stress-ng.vm-splice.ops_per_sec

1.2e+08 +-----------------------------------------------------------------+
|..+...+..+..+...+..+..+..+...+..+..+...+..+..+...+..+..+..+...+..|
1e+08 |-+ |
| |
| |
8e+07 |-+ |
| |
6e+07 |-+ |
| |
4e+07 |-+ |
| |
| |
2e+07 |-+ |
| O O O O O O O O O O O O O O O O O O |
0 +-----------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen


Attachments:
config-5.8.0-00003-ga308c71bf1e6e (172.07 kB)
job-script (7.95 kB)
job.yaml (5.59 kB)
reproduce (400.00 B)

2020-11-04 20:35:58

by Linus Torvalds

Subject: Re: [mm/gup] a308c71bf1: stress-ng.vm-splice.ops_per_sec -95.6% regression

On Mon, Nov 2, 2020 at 1:15 AM kernel test robot <[email protected]> wrote:
>
> Greeting,
>
> FYI, we noticed a -95.6% regression of stress-ng.vm-splice.ops_per_sec due to commit:
>
> commit: a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a ("mm/gup: Remove enfornced COW mechanism")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

Note that this is just the reverse of the previous 2000% improvement
reported by the test robot here:

https://lore.kernel.org/lkml/20200611040453.GK12456@shao2-debian/

and the explanation seems to remain the same:

https://lore.kernel.org/lkml/CAG48ez1v1b4X5LgFya6nvi33-TWwqna_dc5jGFVosqQhdn_Nkg@mail.gmail.com/

IOW, this is testing a special case (zero page lookup) that the "force
COW" patches happened to turn into a regular case (COW creating a
regular page from the zero page).
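
(As a concrete illustration of that special case, and explicitly not the
stress-ng source: vmsplice() of an untouched anonymous mapping hands the
pipe pages that are all backed by the shared zero page, so a loop around
something like the sketch below spends essentially all of its time in
exactly this lookup.)

    /*
     * Sketch only: splice an untouched, read-only anonymous mapping into
     * a pipe.  The buffer is never written, so every page GUP sees is the
     * shared zero page.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>      /* vmsplice() */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 16 * 4096;   /* fits the default 64 KiB pipe buffer */
            char drain[4096];
            int fds[2];
            ssize_t n;

            char *buf = mmap(NULL, len, PROT_READ,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED || pipe(fds) < 0) {
                    perror("setup");
                    return 1;
            }

            struct iovec iov = { .iov_base = buf, .iov_len = len };

            /* Hand the zero-page-backed buffer to the pipe, then drain it. */
            n = vmsplice(fds[1], &iov, 1, 0);
            if (n < 0) {
                    perror("vmsplice");
                    return 1;
            }
            while (n > 0)
                    n -= read(fds[0], drain, sizeof(drain));
            return 0;
    }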

The question is whether we should care about the zero page for gup_fast lookup.

If we do care, then the proper fix is likely simply to allow the zero
page in fast-gup, the same way we already do in slow-gup.

ENTIRELY UNTESTED PATCH ATTACHED.
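
(The patch travels as an attachment and is not inlined in this archive.
As a rough sketch of the idea only, and not necessarily the attached
change: the special-PTE bail-out in gup_pte_range() would be relaxed so
the shared zero page is allowed through, mirroring the slow path,
something like:)

        else if (pte_special(pte)) {
                /*
                 * Sketch, not the attached patch: keep refusing special
                 * PTEs in fast-gup, but let the shared zero page through,
                 * as slow-gup already does.
                 */
                if (!is_zero_pfn(pte_pfn(pte)))
                        goto pte_unmap;
        }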

Rong - mind testing this? I don't think the zero-page _should_ be
something that real loads care about, but hey, maybe people do want to
do things like splice zeroes very efficiently..

And note the "untested" part of the patch. It _looks_ fairly obvious,
but maybe I'm missing something.

Linus


Attachments:
patch (632.00 B)

2020-11-05 08:31:20

by Xing Zhengjun

Subject: Re: [LKP] Re: [mm/gup] a308c71bf1: stress-ng.vm-splice.ops_per_sec -95.6% regression



On 11/5/2020 2:29 AM, Linus Torvalds wrote:
> On Mon, Nov 2, 2020 at 1:15 AM kernel test robot <[email protected]> wrote:
>>
>> Greeting,
>>
>> FYI, we noticed a -95.6% regression of stress-ng.vm-splice.ops_per_sec due to commit:
>>
>> commit: a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a ("mm/gup: Remove enfornced COW mechanism")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> Note that this is just the reverse of the previous 2000% improvement
> reported by the test robot here:
>
> https://lore.kernel.org/lkml/20200611040453.GK12456@shao2-debian/
>
> and the explanation seems to remain the same:
>
> https://lore.kernel.org/lkml/CAG48ez1v1b4X5LgFya6nvi33-TWwqna_dc5jGFVosqQhdn_Nkg@mail.gmail.com/
>
> IOW, this is testing a special case (zero page lookup) that the "force
> COW" patches happened to turn into a regular case (COW creating a
> regular page from the zero page).
>
> The question is whether we should care about the zero page for gup_fast lookup.
>
> If we do care, then the proper fix is likely simply to allow the zero
> page in fast-gup, the same way we already do in slow-gup.
>
> ENTIRELY UNTESTED PATCH ATTACHED.
>
> Rong - mind testing this? I don't think the zero-page _should_ be
> something that real loads care about, but hey, maybe people do want to
> do things like splice zeroes very efficiently..

I tested the patch; the regression still exists.

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/disk/testtime/class/cpufreq_governor/ucode:

lkp-csl-2sp5/stress-ng/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/100%/1HDD/30s/pipe/performance/0x5002f01

commit:
1a0cf26323c80e2f1c58fc04f15686de61bfab0c
a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a
da5ba9980aa2211c1e2a89fc814abab2fea6f69d (debug patch)

1a0cf26323c80e2f a308c71bf1e6e19cc2e4ced3185 da5ba9980aa2211c1e2a89fc814
---------------- --------------------------- ---------------------------
     %stddev      %change      %stddev      %change      %stddev
   3.406e+09       -95.6%     1.49e+08       -96.4%    1.213e+08    stress-ng.vm-splice.ops
   1.135e+08       -95.6%      4965911       -96.4%      4041777    stress-ng.vm-splice.ops_per_sec

>
> And note the "untested" part of the patch. It _looks_ fairly obvious,
> but maybe I'm missing something.
>
> Linus
>

--
Zhengjun Xing

2020-11-05 18:39:40

by Linus Torvalds

Subject: Re: [LKP] Re: [mm/gup] a308c71bf1: stress-ng.vm-splice.ops_per_sec -95.6% regression

On Thu, Nov 5, 2020 at 12:29 AM Xing Zhengjun
<[email protected]> wrote:
>
> > Rong - mind testing this? I don't think the zero-page _should_ be
> > something that real loads care about, but hey, maybe people do want to
> > do things like splice zeroes very efficiently..
>
> I tested the patch; the regression still exists.

Thanks.

So Jann's suspicion seems interesting but apparently not the reason
for this particular case.

For being such a _huge_ difference (20x improvement followed by a 20x
regression), it's surprising how little the numbers give a clue. The
big changes are things like
"interrupts.CPU19.CAL:Function_call_interrupts", but while those
change by hundreds of percent, most of the changes seem to just be
about them moving to different CPU's. IOW, we have things like

   5652 ± 59%   +387.9%   27579 ± 96%   interrupts.CPU13.CAL:Function_call_interrupts
  28249 ± 32%    -69.3%    8675 ± 50%   interrupts.CPU28.CAL:Function_call_interrupts

which isn't really much of a change at all despite the changes looking
very big - it's just the stats jumping from one CPU to another.

Maybe there's some actual change in there, but it's very well hidden if so.

Yes, some of the numbers get worse:

 868396 ±  3%    +20.9%  1050234 ± 14%   interrupts.RES:Rescheduling_interrupts

so that's a 20% increase in rescheduling interrupts. But it's a 20%
increase, not a 500% one. So the fact that performance changes by 20x
is still very unclear to me.

We do have a lot of those numa-meminfo changes, but they could just
come from allocation patterns.

That said - another difference between the fast-gup code and the
regular gup code is that the fast-gup code does

if (pte_protnone(pte))
goto pte_unmap;

and the regular slow case does

if ((flags & FOLL_NUMA) && pte_protnone(pte))
goto no_page;

now, FOLL_NUMA is always set in the slow case if we don't have
FOLL_FORCE set, so this difference isn't "real", but it's one of those
cases where the zero-page might be marked for NUMA faulting, and doing
the forced COW might then cause it to be accessible.
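
(From memory, the slow path sets that up roughly as follows in
__get_user_pages(); shown only to make the FOLL_NUMA point concrete:)

        /* honour NUMA hinting faults unless the caller forces access */
        if (!(gup_flags & FOLL_FORCE))
                gup_flags |= FOLL_NUMA;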

Just out of curiosity, do the numbers change enormously if you just remove that

if (pte_protnone(pte))
goto pte_unmap;

test from the fast-gup case (top of the loop in gup_pte_range()) -
effectively making fast-gup basically act like FOLL_FORCE wrt numa
placement..

I'm not convinced that's a valid change in general, so this is just a
"to debug the odd performance numbers" issue.

Also out of curiosity: is the performance profile limited to just the
load, or is it a system profile (ie do you have "-a" on the perf
record line or not).

Linus

2020-11-06 05:18:01

by Xing Zhengjun

Subject: Re: [LKP] Re: [mm/gup] a308c71bf1: stress-ng.vm-splice.ops_per_sec -95.6% regression



On 11/6/2020 2:37 AM, Linus Torvalds wrote:
> On Thu, Nov 5, 2020 at 12:29 AM Xing Zhengjun
> <[email protected]> wrote:
>>
>>> Rong - mind testing this? I don't think the zero-page _should_ be
>>> something that real loads care about, but hey, maybe people do want to
>>> do things like splice zeroes very efficiently..
>>
>> I tested the patch; the regression still exists.
>
> Thanks.
>
> So Jann's suspicion seems interesting but apparently not the reason
> for this particular case.
>
> For being such a _huge_ difference (20x improvement followed by a 20x
> regression), it's surprising how little the numbers give a clue. The
> big changes are things like
> "interrupts.CPU19.CAL:Function_call_interrupts", but while those
> change by hundreds of percent, most of the changes seem to just be
> about them moving to different CPU's. IOW, we have things like
>
>    5652 ± 59%   +387.9%   27579 ± 96%   interrupts.CPU13.CAL:Function_call_interrupts
>   28249 ± 32%    -69.3%    8675 ± 50%   interrupts.CPU28.CAL:Function_call_interrupts
>
> which isn't really much of a change at all despite the changes looking
> very big - it's just the stats jumping from one CPU to another.
>
> Maybe there's some actual change in there, but it's very well hidden if so.
>
> Yes, some of the numbers get worse:
>
>  868396 ±  3%    +20.9%  1050234 ± 14%   interrupts.RES:Rescheduling_interrupts
>
> so that's a 20% increase in rescheduling interrupts. But it's a 20%
> increase, not a 500% one. So the fact that performance changes by 20x
> is still very unclear to me.
>
> We do have a lot of those numa-meminfo changes, but they could just
> come from allocation patterns.
>
> That said - another difference between the fast-gup code and the
> regular gup code is that the fast-gup code does
>
> if (pte_protnone(pte))
> goto pte_unmap;
>
> and the regular slow case does
>
> if ((flags & FOLL_NUMA) && pte_protnone(pte))
> goto no_page;
>
> now, FOLL_NUMA is always set in the slow case if we don't have
> FOLL_FORCE set, so this difference isn't "real", but it's one of those
> cases where the zero-page might be marked for NUMA faulting, and doing
> the forced COW might then cause it to be accessible.
>
> Just out of curiosity, do the numbers change enormously if you just remove that
>
> if (pte_protnone(pte))
> goto pte_unmap;
>
> test from the fast-gup case (top of the loop in gup_pte_range()) -
> effectively making fast-gup basically act like FOLL_FORCE wrt numa
> placement..

Based on the last debug patch, I removed the two lines of code at the
top of the loop in gup_pte_range() as you mentioned; the regression
still exists.

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/disk/testtime/class/cpufreq_governor/ucode:

lkp-csl-2sp5/stress-ng/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/100%/1HDD/30s/pipe/performance/0x5002f01

commit:
1a0cf26323c80e2f1c58fc04f15686de61bfab0c
a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a
da5ba9980aa2211c1e2a89fc814abab2fea6f69d (last debug patch)
8803d304738b52f66f6b683be38c4f8b9cf4bff5 (to debug the odd performance numbers)

1a0cf26323c80e2f a308c71bf1e6e19cc2e4ced3185 da5ba9980aa2211c1e2a89fc814 8803d304738b52f66f6b683be38
---------------- --------------------------- --------------------------- ---------------------------
     %stddev      %change      %stddev      %change      %stddev      %change      %stddev
   3.406e+09       -95.6%     1.49e+08       -96.4%    1.213e+08       -96.5%    1.201e+08    stress-ng.vm-splice.ops
   1.135e+08       -95.6%      4965911       -96.4%      4041777       -96.5%      4002572    stress-ng.vm-splice.ops_per_sec


>
> I'm not convinced that's a valid change in general, so this is just a
> "to debug the odd performance numbers" issue.
>
> Also out of curiosity: is the performance profile limited to just the
> load, or is it a system profile (ie do you have "-a" on the perf
> record line or not).
>

In our test, "-a" is enabled on the perf record line.

> Linus
>

--
Zhengjun Xing