LinuxLists.cc - [mm, thp] 85b9f46e8e: vm-scalability.throughput -8.7% regression

2020-10-04 13:34:41

Subject: [mm, thp] 85b9f46e8e: vm-scalability.throughput -8.7% regression

Greeting,

FYI, we

commit: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
machine: 104 threads Skylake with 192G memory
The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
href="https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/">https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
the issue, kindly add following tag
kernel test robot <[email protected]>
--------------------------------------------------------------------------------->
href="https://github.com/intel/lkp-tests.git">https://github.com/intel/lkp-tests.git
bin/lkp install job.yaml # job file is attached in this email
========================================================================
governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
ce/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/1T/lkp-skl-fpga01/lru-shm/vm-scalability/0x2006906
("mm, shmem: add vmstat for hugepage fallback")
("mm, thp: track fallbacks due to failed memcg charges separately")
85b9f46e8ea451633ccd60a7d8c
---------------------------
%reproduction fail:runs
| |
24% 2:4 perf-profile.calltrace.cycles-pp.sync_regs.error_entry.do_access
53% 5:4 perf-profile.calltrace.cycles-pp.error_entry.do_access
-27% 8:4 perf-profile.children.cycles-pp.error_entry
-10% 4:4 perf-profile.self.cycles-pp.error_entry
%change %stddev
| \
-9.1% 434041 vm-scalability.median
-8.7% 45476799 vm-scalability.throughput
+1.6% 227.36 vm-scalability.time.elapsed_time
+1.6% 227.36 vm-scalability.time.elapsed_time.max
+24.1% 62482 ± 10% vm-scalability.time.involuntary_context_switches
+7.8% 2412 vm-scalability.time.percent_of_cpu_this_job_got
+18.2% 3646 vm-scalability.time.system_time
-4.2% 1839 vm-scalability.time.user_time
+2.2 15.86 mpstat.cpu.all.sys%
-47.0% 15114 ± 79% numa-numastat.node0.other_node
-19.4% 115000 ± 17% numa-meminfo.node0.AnonPages
+8.8% 12150 ± 5% numa-meminfo.node1.PageTables
-1.6% 74.75 vmstat.cpu.id
-1.9% 3555 vmstat.system.cs
-96.6% 75321 ± 7% cpuidle.C1.usage
-18.0% 164861 ± 7% cpuidle.POLL.time
-16.7% 43866 ± 10% cpuidle.POLL.usage
-19.4% 28754 ± 17% numa-vmstat.node0.nr_anon_pages
+10.9% 1384236 numa-vmstat.node1.nr_mapped
+10.6% 3011 ± 5% numa-vmstat.node1.nr_page_table_pages
+7.0% 2741267 proc-vmstat.nr_mapped
+4.7% 5707 proc-vmstat.nr_page_table_pages
+38.0% 7872 ± 6% proc-vmstat.pgactivate
-9.5% 10984 ± 6% softirqs.CPU13.RCU
-8.3% 11167 ± 2% softirqs.CPU39.RCU
-11.6% 10377 ± 4% softirqs.CPU65.RCU
-12.7% 11453 ± 4% softirqs.CPU71.RCU
-7.9% 11079 ± 3% softirqs.CPU73.RCU
-16.3% 11090 ± 4% softirqs.CPU88.RCU
-12.0% 21602 ± 9% softirqs.CPU91.SCHED
-16.0% 641.83 ± 7% sched_debug.cfs_rq:/.load_avg.max
-53.7% 0.15 ± 66% sched_debug.cfs_rq:/.nr_running.avg
-51.1% 4.41 ± 85% sched_debug.cfs_rq:/.removed.load_avg.avg
-51.0% 4.41 ± 85% sched_debug.cfs_rq:/.removed.runnable_avg.avg
-42.1% 282.24 ± 33% sched_debug.cfs_rq:/.runnable_avg.avg
-53.7% 163.58 ± 60% sched_debug.cfs_rq:/.util_avg.avg
-30.4% 16.61 ± 14% sched_debug.cfs_rq:/.util_est_enqueued.avg
-37.3% 2.81 ± 33% sched_debug.cpu.clock.stddev
-67.3% 965.56 ±126% sched_debug.cpu.curr->pid.avg
-12.7% 520369 sched_debug.cpu.max_idle_balance_cost.max
-80.1% 2099 ± 44% sched_debug.cpu.max_idle_balance_cost.stddev
-25.2% 0.00 ± 19% sched_debug.cpu.next_balance.stddev
-57.8% 0.13 ± 88% sched_debug.cpu.nr_running.avg
-10.7% 0.24 ± 5% sched_debug.cpu.nr_running.stddev
-1.0% 1.471e+10 perf-stat.i.branch-instructions
-2.0% 3535 perf-stat.i.context-switches
+5.6% 1.19 perf-stat.i.cpi
+8.5% 7.123e+10 perf-stat.i.cpu-cycles
+2.2% 1244 perf-stat.i.cycles-between-cache-misses
-0.9% 1.485e+10 perf-stat.i.dTLB-loads
-1.3% 4.002e+09 perf-stat.i.dTLB-stores
-5.4% 0.86 ± 2% perf-stat.i.ipc
+8.4% 0.68 perf-stat.i.metric.GHz
-1.4% 2274501 perf-stat.i.minor-faults
+14.3% 2355494 ± 3% perf-stat.i.node-load-misses
+5.7% 1356763 perf-stat.i.node-loads
-2.5 37.25 ± 3% perf-stat.i.node-store-miss-rate%
-2.0% 8468484 perf-stat.i.node-stores
-1.4% 2274501 perf-stat.i.page-faults
+9.5% 1.36 perf-stat.overall.cpi
+8.0% 1282 perf-stat.overall.cycles-between-cache-misses
-8.7% 0.73 perf-stat.overall.ipc
-1.2% 1.468e+10 perf-stat.ps.branch-instructions
-2.1% 3523 perf-stat.ps.context-switches
+8.3% 7.111e+10 perf-stat.ps.cpu-cycles
-1.1% 1.482e+10 perf-stat.ps.dTLB-loads
-1.5% 3.993e+09 perf-stat.ps.dTLB-stores
-1.1% 5.223e+10 perf-stat.ps.instructions
-1.6% 2270962 perf-stat.ps.minor-faults
+14.1% 2349536 ± 3% perf-stat.ps.node-load-misses
+5.5% 1353397 perf-stat.ps.node-loads
-2.2% 8455419 perf-stat.ps.node-stores
-1.6% 2270962 perf-stat.ps.page-faults
+46.2% 2154 ± 5% interrupts.CPU0.NMI:Non-maskable_interrupts
+46.2% 2154 ± 5% interrupts.CPU0.PMI:Performance_monitoring_interrupts
+8.0% 565.75 ± 5% interrupts.CPU1.CAL:Function_call_interrupts
-85.7% 372.00 ± 97% interrupts.CPU1.RES:Rescheduling_interrupts
+92.5% 2467 ± 26% interrupts.CPU102.NMI:Non-maskable_interrupts
+92.5% 2467 ± 26% interrupts.CPU102.PMI:Performance_monitoring_interrupts
+41.9% 2093 interrupts.CPU103.NMI:Non-maskable_interrupts
+41.9% 2093 interrupts.CPU103.PMI:Performance_monitoring_interrupts
-79.8% 148.75 ± 17% interrupts.CPU13.RES:Rescheduling_interrupts
+122.5% 2292 ± 10% interrupts.CPU17.NMI:Non-maskable_interrupts
+122.5% 2292 ± 10% interrupts.CPU17.PMI:Performance_monitoring_interrupts
+73.2% 2059 ± 4% interrupts.CPU18.NMI:Non-maskable_interrupts
+73.2% 2059 ± 4% interrupts.CPU18.PMI:Performance_monitoring_interrupts
+90.0% 1946 ± 8% interrupts.CPU22.NMI:Non-maskable_interrupts
+90.0% 1946 ± 8% interrupts.CPU22.PMI:Performance_monitoring_interrupts
+67.5% 242.00 ± 37% interrupts.CPU22.RES:Rescheduling_interrupts
+80.3% 2244 ± 16% interrupts.CPU23.NMI:Non-maskable_interrupts
+80.3% 2244 ± 16% interrupts.CPU23.PMI:Performance_monitoring_interrupts
+62.5% 2332 ± 27% interrupts.CPU25.NMI:Non-maskable_interrupts
+62.5% 2332 ± 27% interrupts.CPU25.PMI:Performance_monitoring_interrupts
+56.4% 2135 ± 3% interrupts.CPU28.NMI:Non-maskable_interrupts
+56.4% 2135 ± 3% interrupts.CPU28.PMI:Performance_monitoring_interrupts
+84.9% 2115 ± 2% interrupts.CPU30.NMI:Non-maskable_interrupts
+84.9% 2115 ± 2% interrupts.CPU30.PMI:Performance_monitoring_interrupts
+139.8% 3088 ± 56% interrupts.CPU35.NMI:Non-maskable_interrupts
+139.8% 3088 ± 56% interrupts.CPU35.PMI:Performance_monitoring_interrupts
+1361.1% 4712 ± 89% interrupts.CPU36.RES:Rescheduling_interrupts
+87.1% 2059 ± 2% interrupts.CPU37.NMI:Non-maskable_interrupts
+87.1% 2059 ± 2% interrupts.CPU37.PMI:Performance_monitoring_interrupts
+113.9% 2072 interrupts.CPU38.NMI:Non-maskable_interrupts
+113.9% 2072 interrupts.CPU38.PMI:Performance_monitoring_interrupts
+440.5% 56.75 ± 62% interrupts.CPU4.TLB:TLB_shootdowns
+81.4% 2079 interrupts.CPU45.NMI:Non-maskable_interrupts
+81.4% 2079 interrupts.CPU45.PMI:Performance_monitoring_interrupts
+59.8% 2070 interrupts.CPU46.NMI:Non-maskable_interrupts
+59.8% 2070 interrupts.CPU46.PMI:Performance_monitoring_interrupts
+175.4% 2282 ± 18% interrupts.CPU47.NMI:Non-maskable_interrupts
+175.4% 2282 ± 18% interrupts.CPU47.PMI:Performance_monitoring_interrupts
+92.8% 2053 interrupts.CPU48.NMI:Non-maskable_interrupts
+92.8% 2053 interrupts.CPU48.PMI:Performance_monitoring_interrupts
+54.9% 2052 ± 2% interrupts.CPU51.NMI:Non-maskable_interrupts
+54.9% 2052 ± 2% interrupts.CPU51.PMI:Performance_monitoring_interrupts
+81.8% 2386 ± 18% interrupts.CPU52.NMI:Non-maskable_interrupts
+81.8% 2386 ± 18% interrupts.CPU52.PMI:Performance_monitoring_interrupts
+34.2% 1889 ± 9% interrupts.CPU54.NMI:Non-maskable_interrupts
+34.2% 1889 ± 9% interrupts.CPU54.PMI:Performance_monitoring_interrupts
+35.4% 1921 ± 9% interrupts.CPU58.NMI:Non-maskable_interrupts
+35.4% 1921 ± 9% interrupts.CPU58.PMI:Performance_monitoring_interrupts
+55.6% 2111 ± 11% interrupts.CPU59.NMI:Non-maskable_interrupts
+55.6% 2111 ± 11% interrupts.CPU59.PMI:Performance_monitoring_interrupts
+79.9% 2171 ± 17% interrupts.CPU60.NMI:Non-maskable_interrupts
+79.9% 2171 ± 17% interrupts.CPU60.PMI:Performance_monitoring_interrupts
+43.6% 1994 ± 11% interrupts.CPU61.NMI:Non-maskable_interrupts
+43.6% 1994 ± 11% interrupts.CPU61.PMI:Performance_monitoring_interrupts
+46.7% 2065 ± 7% interrupts.CPU64.NMI:Non-maskable_interrupts
+46.7% 2065 ± 7% interrupts.CPU64.PMI:Performance_monitoring_interrupts
+66.2% 2017 ± 14% interrupts.CPU65.NMI:Non-maskable_interrupts
+66.2% 2017 ± 14% interrupts.CPU65.PMI:Performance_monitoring_interrupts
+69.6% 2055 ± 14% interrupts.CPU67.NMI:Non-maskable_interrupts
+69.6% 2055 ± 14% interrupts.CPU67.PMI:Performance_monitoring_interrupts
+124.4% 2698 ± 23% interrupts.CPU69.NMI:Non-maskable_interrupts
+124.4% 2698 ± 23% interrupts.CPU69.PMI:Performance_monitoring_interrupts
+114.5% 2269 ± 24% interrupts.CPU7.NMI:Non-maskable_interrupts
+114.5% 2269 ± 24% interrupts.CPU7.PMI:Performance_monitoring_interrupts
+44.6% 1965 ± 9% interrupts.CPU70.NMI:Non-maskable_interrupts
+44.6% 1965 ± 9% interrupts.CPU70.PMI:Performance_monitoring_interrupts
+67.2% 2037 ± 10% interrupts.CPU72.NMI:Non-maskable_interrupts
+67.2% 2037 ± 10% interrupts.CPU72.PMI:Performance_monitoring_interrupts
-31.3% 96.00 ± 11% interrupts.CPU74.RES:Rescheduling_interrupts
+97.2% 2331 ± 25% interrupts.CPU8.NMI:Non-maskable_interrupts
+97.2% 2331 ± 25% interrupts.CPU8.PMI:Performance_monitoring_interrupts
-50.1% 109.75 ± 12% interrupts.CPU8.RES:Rescheduling_interrupts
+47.7% 2012 ± 6% interrupts.CPU9.NMI:Non-maskable_interrupts
+47.7% 2012 ± 6% interrupts.CPU9.PMI:Performance_monitoring_interrupts
+1584.2% 4379 ±158% interrupts.CPU90.RES:Rescheduling_interrupts
+133.7% 475.00 ± 65% interrupts.CPU94.RES:Rescheduling_interrupts
+80.0% 2047 ± 2% interrupts.CPU98.NMI:Non-maskable_interrupts
+80.0% 2047 ± 2% interrupts.CPU98.PMI:Performance_monitoring_interrupts
+43.0% 204005 ± 6% interrupts.NMI:Non-maskable_interrupts
+43.0% 204005 ± 6% interrupts.PMI:Performance_monitoring_interrupts
-16.4 13.46 ±173% perf-profile.calltrace.cycles-pp.page_fault
-16.2 13.31 ±173% perf-profile.calltrace.cycles-pp.do_user_addr_fault.page_fault
-15.7 13.04 ±173% perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.page_fault
-15.2 12.72 ±173% perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.page_fault
-7.0 7.39 ± 12% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64
-7.0 7.39 ± 12% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64
-7.0 7.39 ± 12% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
-7.0 7.57 ± 14% perf-profile.calltrace.cycles-pp.secondary_startup_64
-6.7 6.87 ± 11% perf-profile.calltrace.cycles-pp.cpuidle_enter.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
-6.6 6.76 ± 11% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry.start_secondary
-5.7 5.33 ± 9% perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry
-2.8 2.43 ± 44% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.unlinkat
-2.8 2.43 ± 44% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.unlinkat
-2.8 2.43 ± 44% perf-profile.calltrace.cycles-pp.evict.do_unlinkat.do_syscall_64.entry_SYSCALL_64_after_hwframe.unlinkat
-2.8 2.43 ± 44% perf-profile.calltrace.cycles-pp.unlinkat
-2.8 2.43 ± 44% perf-profile.calltrace.cycles-pp.do_unlinkat.do_syscall_64.entry_SYSCALL_64_after_hwframe.unlinkat
-2.5 2.73 ± 32% perf-profile.calltrace.cycles-pp.shmem_evict_inode.evict.do_unlinkat.do_syscall_64.entry_SYSCALL_64_after_hwframe
-2.5 2.73 ± 32% perf-profile.calltrace.cycles-pp.shmem_truncate_range.shmem_evict_inode.evict.do_unlinkat.do_syscall_64
-2.5 2.72 ± 32% perf-profile.calltrace.cycles-pp.shmem_undo_range.shmem_truncate_range.shmem_evict_inode.evict.do_unlinkat
-2.4 2.54 ± 2% perf-profile.calltrace.cycles-pp.mem_cgroup_try_charge_delay.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
-2.4 2.42 ± 3% perf-profile.calltrace.cycles-pp.mem_cgroup_try_charge.mem_cgroup_try_charge_delay.shmem_getpage_gfp.shmem_fault.__do_fault
-1.3 1.16 ± 4% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.mem_cgroup_try_charge.mem_cgroup_try_charge_delay.shmem_getpage_gfp.shmem_fault
-1.3 1.06 ± 45% perf-profile.calltrace.cycles-pp.__pagevec_release.shmem_undo_range.shmem_truncate_range.shmem_evict_inode.evict
-1.3 1.04 ± 45% perf-profile.calltrace.cycles-pp.release_pages.__pagevec_release.shmem_undo_range.shmem_truncate_range.shmem_evict_inode
-1.2 0.87 ± 45% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
-1.2 0.87 ± 45% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
-1.0 0.93 ± 42% perf-profile.calltrace.cycles-pp.truncate_inode_page.shmem_undo_range.shmem_truncate_range.shmem_evict_inode.evict
-1.0 0.68 ± 69% perf-profile.calltrace.cycles-pp.delete_from_page_cache.truncate_inode_page.shmem_undo_range.shmem_truncate_range.shmem_evict_inode
-1.0 1.38 ± 18% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry
-0.9 1.25 ± 19% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.do_idle
-0.8 0.37 ±108% perf-profile.calltrace.cycles-pp.__delete_from_page_cache.delete_from_page_cache.truncate_inode_page.shmem_undo_range.shmem_truncate_range
-0.8 1.85 ± 20% perf-profile.calltrace.cycles-pp.write
-0.8 1.85 ± 20% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.write
-0.8 1.85 ± 20% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
-0.8 1.85 ± 20% perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
-0.8 1.85 ± 20% perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
-0.8 1.83 ± 20% perf-profile.calltrace.cycles-pp.vprintk_emit.devkmsg_emit.devkmsg_write.cold.new_sync_write.vfs_write
-0.8 1.85 ± 20% perf-profile.calltrace.cycles-pp.new_sync_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
-0.8 1.82 ± 20% perf-profile.calltrace.cycles-pp.console_unlock.vprintk_emit.devkmsg_emit.devkmsg_write.cold.new_sync_write
-0.8 1.83 ± 20% perf-profile.calltrace.cycles-pp.devkmsg_write.cold.new_sync_write.vfs_write.ksys_write.do_syscall_64
-0.8 1.83 ± 20% perf-profile.calltrace.cycles-pp.devkmsg_emit.devkmsg_write.cold.new_sync_write.vfs_write.ksys_write
-0.7 1.39 ± 19% perf-profile.calltrace.cycles-pp.drm_fb_helper_dirty_work.process_one_work.worker_thread.kthread.ret_from_fork
-0.7 1.42 ± 19% perf-profile.calltrace.cycles-pp.ret_from_fork
-0.7 1.42 ± 19% perf-profile.calltrace.cycles-pp.kthread.ret_from_fork
-0.7 1.46 ± 20% perf-profile.calltrace.cycles-pp.serial8250_console_write.console_unlock.vprintk_emit.devkmsg_emit.devkmsg_write.cold
-0.7 1.40 ± 19% perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork
-0.7 1.40 ± 19% perf-profile.calltrace.cycles-pp.process_one_work.worker_thread.kthread.ret_from_fork
-0.7 4.97 ± 12% perf-profile.calltrace.cycles-pp.filemap_map_pages.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
-0.7 4.53 ± 12% perf-profile.calltrace.cycles-pp.clear_page_erms.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
-0.7 1.35 ± 19% perf-profile.calltrace.cycles-pp.memcpy_erms.drm_fb_helper_dirty_work.process_one_work.worker_thread.kthread
-0.6 1.38 ± 20% perf-profile.calltrace.cycles-pp.uart_console_write.serial8250_console_write.console_unlock.vprintk_emit.devkmsg_emit
-0.6 1.26 ± 20% perf-profile.calltrace.cycles-pp.wait_for_xmitr.serial8250_console_putchar.uart_console_write.serial8250_console_write.console_unlock
-0.6 1.26 ± 20% perf-profile.calltrace.cycles-pp.serial8250_console_putchar.uart_console_write.serial8250_console_write.console_unlock.vprintk_emit
-0.5 0.78 ± 19% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
-0.5 0.98 ± 20% perf-profile.calltrace.cycles-pp.io_serial_in.wait_for_xmitr.serial8250_console_putchar.uart_console_write.serial8250_console_write
+10.7 23.87 ± 10% perf-profile.calltrace.cycles-pp.__lru_cache_add.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
+10.8 21.78 ± 10% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.shmem_getpage_gfp.shmem_fault
+10.9 21.73 ± 10% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.shmem_getpage_gfp
+10.9 23.73 ± 10% perf-profile.calltrace.cycles-pp.pagevec_lru_move_fn.__lru_cache_add.shmem_getpage_gfp.shmem_fault.__do_fault
-7.0 7.39 ± 12% perf-profile.children.cycles-pp.start_secondary
-7.0 7.57 ± 14% perf-profile.children.cycles-pp.secondary_startup_64
-7.0 7.57 ± 14% perf-profile.children.cycles-pp.cpu_startup_entry
-7.0 7.57 ± 14% perf-profile.children.cycles-pp.do_idle
-6.7 7.05 ± 13% perf-profile.children.cycles-pp.cpuidle_enter
-6.7 7.05 ± 13% perf-profile.children.cycles-pp.cpuidle_enter_state
-5.6 5.48 ± 12% perf-profile.children.cycles-pp.intel_idle
-3.7 6.52 ± 18% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
-3.7 6.52 ± 18% perf-profile.children.cycles-pp.do_syscall_64
-2.8 2.43 ± 44% perf-profile.children.cycles-pp.unlinkat
-2.4 2.83 ± 29% perf-profile.children.cycles-pp.evict
-2.4 2.83 ± 29% perf-profile.children.cycles-pp.do_unlinkat
-2.4 2.83 ± 29% perf-profile.children.cycles-pp.shmem_evict_inode
-2.4 2.83 ± 29% perf-profile.children.cycles-pp.shmem_truncate_range
-2.4 2.83 ± 29% perf-profile.children.cycles-pp.shmem_undo_range
-2.4 2.68 ± 6% perf-profile.children.cycles-pp.mem_cgroup_try_charge_delay
-2.4 2.53 ± 6% perf-profile.children.cycles-pp.mem_cgroup_try_charge
-1.4 1.22 ± 5% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
-1.2 2.02 ± 17% perf-profile.children.cycles-pp.apic_timer_interrupt
-1.1 1.49 ± 27% perf-profile.children.cycles-pp.release_pages
-1.1 1.79 ± 17% perf-profile.children.cycles-pp.smp_apic_timer_interrupt
-1.1 1.24 ± 30% perf-profile.children.cycles-pp.__pagevec_release
-0.9 1.09 ± 28% perf-profile.children.cycles-pp.truncate_inode_page
-0.9 4.61 ± 12% perf-profile.children.cycles-pp.clear_page_erms
-0.9 1.90 ± 20% perf-profile.children.cycles-pp.console_unlock
-0.8 1.90 ± 20% perf-profile.children.cycles-pp.vprintk_emit
-0.8 1.85 ± 20% perf-profile.children.cycles-pp.write
-0.8 1.86 ± 20% perf-profile.children.cycles-pp.ksys_write
-0.8 1.86 ± 20% perf-profile.children.cycles-pp.vfs_write
-0.8 5.10 ± 12% perf-profile.children.cycles-pp.filemap_map_pages
-0.8 1.85 ± 20% perf-profile.children.cycles-pp.new_sync_write
-0.8 1.83 ± 20% perf-profile.children.cycles-pp.devkmsg_write.cold
-0.8 1.83 ± 20% perf-profile.children.cycles-pp.devkmsg_emit
-0.8 0.94 ± 28% perf-profile.children.cycles-pp.delete_from_page_cache
-0.7 1.22 ± 17% perf-profile.children.cycles-pp.hrtimer_interrupt
-0.7 1.52 ± 20% perf-profile.children.cycles-pp.serial8250_console_write
-0.7 1.39 ± 19% perf-profile.children.cycles-pp.memcpy_erms
-0.7 1.39 ± 19% perf-profile.children.cycles-pp.drm_fb_helper_dirty_work
-0.7 1.42 ± 19% perf-profile.children.cycles-pp.ret_from_fork
-0.7 1.42 ± 19% perf-profile.children.cycles-pp.kthread
-0.7 1.40 ± 19% perf-profile.children.cycles-pp.worker_thread
-0.7 1.40 ± 19% perf-profile.children.cycles-pp.process_one_work
-0.7 1.44 ± 20% perf-profile.children.cycles-pp.uart_console_write
-0.7 1.40 ± 20% perf-profile.children.cycles-pp.wait_for_xmitr
-0.6 1.32 ± 20% perf-profile.children.cycles-pp.serial8250_console_putchar
-0.6 2.89 ± 10% perf-profile.children.cycles-pp.shmem_alloc_and_acct_page
-0.6 0.63 ± 31% perf-profile.children.cycles-pp.free_unref_page_list
-0.6 0.68 ± 29% perf-profile.children.cycles-pp.__delete_from_page_cache
-0.5 1.09 ± 20% perf-profile.children.cycles-pp.io_serial_in
-0.5 0.76 ± 19% perf-profile.children.cycles-pp.__hrtimer_run_queues
-0.4 3.67 ± 11% perf-profile.children.cycles-pp.swapgs_restore_regs_and_return_to_usermode
-0.4 2.11 ± 11% perf-profile.children.cycles-pp.shmem_alloc_page
-0.4 0.44 ± 32% perf-profile.children.cycles-pp.free_pcppages_bulk
-0.4 1.88 ± 11% perf-profile.children.cycles-pp.alloc_pages_vma
-0.3 0.55 ± 17% perf-profile.children.cycles-pp.tick_sched_timer
-0.3 1.65 ± 10% perf-profile.children.cycles-pp.__alloc_pages_nodemask
-0.3 0.42 ± 12% perf-profile.children.cycles-pp.irq_exit
-0.3 0.60 ± 20% perf-profile.children.cycles-pp.xas_store
-0.3 0.41 ± 11% perf-profile.children.cycles-pp.ktime_get
-0.3 0.33 ± 36% perf-profile.children.cycles-pp.__free_one_page
-0.3 0.47 ± 21% perf-profile.children.cycles-pp.tick_sched_handle
-0.3 0.45 ± 19% perf-profile.children.cycles-pp.update_process_times
-0.3 1.33 ± 13% perf-profile.children.cycles-pp.shmem_add_to_page_cache
-0.3 0.40 ± 21% perf-profile.children.cycles-pp.menu_select
-0.2 1.26 ± 9% perf-profile.children.cycles-pp.get_page_from_freelist
-0.2 0.25 ± 28% perf-profile.children.cycles-pp.find_get_entries
-0.2 0.83 ± 17% perf-profile.children.cycles-pp.xas_load
-0.2 0.33 ± 25% perf-profile.children.cycles-pp.__list_del_entry_valid
-0.2 0.32 ± 13% perf-profile.children.cycles-pp.clockevents_program_event
-0.2 1.03 ± 10% perf-profile.children.cycles-pp.rmqueue
-0.2 0.25 ± 17% perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
-0.2 0.80 ± 17% perf-profile.children.cycles-pp.unlock_page
-0.2 0.21 ± 27% perf-profile.children.cycles-pp.xas_init_marks
-0.2 0.26 ± 19% perf-profile.children.cycles-pp.__softirqentry_text_start
-0.2 0.21 ± 15% perf-profile.children.cycles-pp.tick_nohz_next_event
-0.2 1.04 ± 12% perf-profile.children.cycles-pp.sync_regs
-0.2 0.47 ± 15% perf-profile.children.cycles-pp.__mod_node_page_state
-0.1 0.28 ± 16% perf-profile.children.cycles-pp.scheduler_tick
-0.1 0.15 ± 30% perf-profile.children.cycles-pp.xas_clear_mark
-0.1 0.37 ± 21% perf-profile.children.cycles-pp.vt_console_print
-0.1 0.36 ± 21% perf-profile.children.cycles-pp.fbcon_redraw
-0.1 0.60 ± 10% perf-profile.children.cycles-pp.rmqueue_bulk
-0.1 0.37 ± 21% perf-profile.children.cycles-pp.lf
-0.1 0.37 ± 21% perf-profile.children.cycles-pp.con_scroll
-0.1 0.37 ± 21% perf-profile.children.cycles-pp.fbcon_scroll
-0.1 1.08 ± 10% perf-profile.children.cycles-pp.__count_memcg_events
-0.1 0.15 ± 31% perf-profile.children.cycles-pp.mem_cgroup_uncharge_list
-0.1 0.36 ± 21% perf-profile.children.cycles-pp.fbcon_putcs
-0.1 0.36 ± 20% perf-profile.children.cycles-pp.bit_putcs
-0.1 0.33 ± 20% perf-profile.children.cycles-pp.drm_fb_helper_sys_imageblit
-0.1 0.33 ± 20% perf-profile.children.cycles-pp.sys_imageblit
-0.1 0.10 ± 25% perf-profile.children.cycles-pp.tick_nohz_irq_exit
-0.1 0.63 ± 14% perf-profile.children.cycles-pp.__perf_sw_event
-0.1 0.12 ± 25% perf-profile.children.cycles-pp.unaccount_page_cache_page
-0.1 0.68 ± 14% perf-profile.children.cycles-pp.xas_find
-0.1 0.48 ± 14% perf-profile.children.cycles-pp.___perf_sw_event
-0.1 0.39 ± 10% perf-profile.children.cycles-pp._raw_spin_lock
-0.1 0.04 ±106% perf-profile.children.cycles-pp.timekeeping_max_deferment
-0.1 0.27 ± 20% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
-0.1 0.10 ± 19% perf-profile.children.cycles-pp.rebalance_domains
-0.1 0.12 ± 26% perf-profile.children.cycles-pp.get_next_timer_interrupt
-0.1 0.08 ± 23% perf-profile.children.cycles-pp.irq_work_run_list
-0.1 0.09 ± 27% perf-profile.children.cycles-pp.uncharge_batch
-0.1 0.12 ± 20% perf-profile.children.cycles-pp.io_serial_out
-0.1 0.11 ± 19% perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
-0.1 0.15 ± 29% perf-profile.children.cycles-pp.page_mapping
-0.1 0.07 ± 26% perf-profile.children.cycles-pp.page_cache_free_page
-0.1 0.20 ± 14% perf-profile.children.cycles-pp.percpu_counter_add_batch
-0.1 0.03 ±100% perf-profile.children.cycles-pp.free_unref_page_prepare
-0.1 0.15 ± 15% perf-profile.children.cycles-pp.__list_add_valid
-0.1 0.11 ± 13% perf-profile.children.cycles-pp.task_tick_fair
-0.1 0.07 ± 23% perf-profile.children.cycles-pp.free_unref_page_commit
-0.1 0.08 ± 13% perf-profile.children.cycles-pp.irqtime_account_irq
-0.1 0.08 ± 15% perf-profile.children.cycles-pp.lapic_next_deadline
-0.1 0.04 ± 58% perf-profile.children.cycles-pp.read
-0.1 0.04 ± 63% perf-profile.children.cycles-pp.uncharge_page
-0.0 0.05 ± 64% perf-profile.children.cycles-pp.rcu_sched_clock_irq
-0.0 0.09 ± 24% perf-profile.children.cycles-pp.__next_timer_interrupt
-0.0 0.05 ± 62% perf-profile.children.cycles-pp._find_next_bit
-0.0 0.21 ± 18% perf-profile.children.cycles-pp.xas_start
-0.0 0.10 ± 25% perf-profile.children.cycles-pp.__mod_zone_page_state
-0.0 0.08 ± 26% perf-profile.children.cycles-pp.read_tsc
-0.0 0.10 ± 23% perf-profile.children.cycles-pp.__indirect_thunk_start
-0.0 0.06 ± 26% perf-profile.children.cycles-pp.load_balance
-0.0 0.07 ± 17% perf-profile.children.cycles-pp.irq_work_interrupt
-0.0 0.07 ± 17% perf-profile.children.cycles-pp.smp_irq_work_interrupt
-0.0 0.07 ± 17% perf-profile.children.cycles-pp.irq_work_run
-0.0 0.07 ± 17% perf-profile.children.cycles-pp.printk
-0.0 0.11 ± 7% perf-profile.children.cycles-pp.PageHuge
-0.0 0.06 ± 20% perf-profile.children.cycles-pp.rcu_core
-0.0 0.07 ± 24% perf-profile.children.cycles-pp.sched_clock
-0.0 0.14 ± 13% perf-profile.children.cycles-pp.___might_sleep
+0.2 0.27 ± 17% perf-profile.children.cycles-pp.mem_cgroup_page_lruvec
+1.2 1.21 ± 49% perf-profile.children.cycles-pp.__munmap
+10.6 21.96 ± 10% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
+10.7 23.88 ± 10% perf-profile.children.cycles-pp.__lru_cache_add
+10.7 21.87 ± 10% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
+10.7 23.76 ± 10% perf-profile.children.cycles-pp.pagevec_lru_move_fn
-5.6 5.48 ± 12% perf-profile.self.cycles-pp.intel_idle
-1.4 1.21 ± 6% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
-1.0 5.24 ± 12% perf-profile.self.cycles-pp.shmem_getpage_gfp
-0.9 4.58 ± 12% perf-profile.self.cycles-pp.clear_page_erms
-0.8 0.52 ± 6% perf-profile.self.cycles-pp.mem_cgroup_try_charge
-0.7 1.38 ± 19% perf-profile.self.cycles-pp.memcpy_erms
-0.5 3.40 ± 11% perf-profile.self.cycles-pp.filemap_map_pages
-0.5 1.09 ± 20% perf-profile.self.cycles-pp.io_serial_in
-0.3 0.34 ± 10% perf-profile.self.cycles-pp.ktime_get
-0.2 0.41 ± 20% perf-profile.self.cycles-pp.release_pages
-0.2 0.32 ± 26% perf-profile.self.cycles-pp.__list_del_entry_valid
-0.2 0.22 ± 27% perf-profile.self.cycles-pp.find_get_entries
-0.2 0.23 ± 35% perf-profile.self.cycles-pp.__free_one_page
-0.2 1.03 ± 13% perf-profile.self.cycles-pp.swapgs_restore_regs_and_return_to_usermode
-0.2 0.75 ± 17% perf-profile.self.cycles-pp.unlock_page
-0.2 1.03 ± 12% perf-profile.self.cycles-pp.sync_regs
-0.2 0.62 ± 17% perf-profile.self.cycles-pp.xas_load
-0.2 0.46 ± 15% perf-profile.self.cycles-pp.__mod_node_page_state
-0.1 0.15 ± 30% perf-profile.self.cycles-pp.shmem_undo_range
-0.1 1.07 ± 11% perf-profile.self.cycles-pp.__count_memcg_events
-0.1 0.13 ± 30% perf-profile.self.cycles-pp.xas_clear_mark
-0.1 0.19 ± 14% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
-0.1 0.49 ± 17% perf-profile.self.cycles-pp.__pagevec_lru_add_fn
-0.1 0.33 ± 20% perf-profile.self.cycles-pp.sys_imageblit
-0.1 0.28 ± 15% perf-profile.self.cycles-pp.xas_store
-0.1 0.17 ± 18% perf-profile.self.cycles-pp.cpuidle_enter_state
-0.1 0.31 ± 10% perf-profile.self.cycles-pp.try_charge
-0.1 0.15 ± 24% perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
-0.1 0.04 ±106% perf-profile.self.cycles-pp.timekeeping_max_deferment
-0.1 0.40 ± 12% perf-profile.self.cycles-pp.___perf_sw_event
-0.1 0.08 ± 23% perf-profile.self.cycles-pp.free_pcppages_bulk
-0.1 0.39 ± 11% perf-profile.self.cycles-pp.rmqueue_bulk
-0.1 0.12 ± 20% perf-profile.self.cycles-pp.io_serial_out
-0.1 0.03 ±102% perf-profile.self.cycles-pp.free_unref_page_commit
-0.1 0.09 ± 37% perf-profile.self.cycles-pp.__delete_from_page_cache
-0.1 0.07 ± 23% perf-profile.self.cycles-pp.page_cache_free_page
-0.1 0.05 ± 61% perf-profile.self.cycles-pp.unaccount_page_cache_page
-0.1 0.04 ± 63% perf-profile.self.cycles-pp.free_unref_page_list
-0.1 0.13 ± 18% perf-profile.self.cycles-pp.__list_add_valid
-0.1 0.24 ± 9% perf-profile.self.cycles-pp.__alloc_pages_nodemask
-0.1 0.14 ± 28% perf-profile.self.cycles-pp.page_mapping
-0.1 0.03 ±100% perf-profile.self.cycles-pp.irqtime_account_irq
-0.1 0.18 ± 15% perf-profile.self.cycles-pp.percpu_counter_add_batch
-0.1 0.03 ±105% perf-profile.self.cycles-pp.rcu_sched_clock_irq
-0.1 0.29 ± 11% perf-profile.self.cycles-pp.handle_mm_fault
-0.1 0.08 ± 15% perf-profile.self.cycles-pp.lapic_next_deadline
-0.0 0.07 ± 17% perf-profile.self.cycles-pp.xas_init_marks
-0.0 0.10 ± 23% perf-profile.self.cycles-pp.__mod_zone_page_state
-0.0 0.05 ± 62% perf-profile.self.cycles-pp._find_next_bit
-0.0 0.20 ± 18% perf-profile.self.cycles-pp.xas_start
-0.0 0.08 ± 24% perf-profile.self.cycles-pp.read_tsc
-0.0 0.06 ± 26% perf-profile.self.cycles-pp.native_sched_clock
-0.0 0.14 ± 11% perf-profile.self.cycles-pp.___might_sleep
+0.2 0.26 ± 18% perf-profile.self.cycles-pp.mem_cgroup_page_lruvec
+0.4 0.93 ± 12% perf-profile.self.cycles-pp.page_add_file_rmap
+10.7 21.87 ± 10% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath

vm-scalability.time.system_time

-----------------------------------------------------------+
O O O O O O O O O O O O |
O O O O O O O O O O |
.+.+.+.+.+..+.+.+.+.+.+..+.+.+.+.+.+.+..+.+. +.+..+.+.+.|
+ : |
: : |
: : |
: : |
: : |
: : |
: : |
: : |
: |
: |
-----------------------------------------------------------+

vm-scalability.throughput

+-------------------------------------------------------------------+
|
.+.+.+.+.+.+.+.. .+. .+ +..+.+.+.+.|
|.+.+.+..+ O O O O O O O +.+.+.+.+.+.+.+..+.+.+ O + : O : O O |
: : O |
: : |
: : |
: : |
: : |
: : |
: : |
:: |
: |
: |
+-------------------------------------------------------------------+

have been estimated based on internal Intel analysis and are provided
purposes only. Any difference in system hardware or software
configuration may affect actual performance.

Attachments:

(No filename) (45.88 kB)
config-5.6.0-11466-g85b9f46e8ea45 (159.20 kB)
job-script (7.57 kB)
job.yaml (5.25 kB)
reproduce (342.14 kB)
Download all attachments

2020-10-04 19:08:51

by David Rientjes

[permalink] [raw]

Subject: Re: [mm, thp] 85b9f46e8e: vm-scalability.throughput -8.7% regression

On Sun, 4 Oct 2020, kernel test robot wrote:

> Greeting,
>
> FYI, we noticed a -8.7% regression of vm-scalability.throughput due to commit:
>
>
> commit: 85b9f46e8ea451633ccd60a7d8cacbfff9f34047 ("mm, thp: track fallbacks due to failed memcg charges separately")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
>
> in testcase: vm-scalability
> on test machine: 104 threads Skylake with 192G memory
> with following parameters:
>
> runtime: 300s
> size: 1T
> test: lru-shm
> cpufreq_governor: performance
> ucode: 0x2006906
>
> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <[email protected]>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml # job file is attached in this email
> bin/lkp run job.yaml
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
> gcc-9/performance/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/1T/lkp-skl-fpga01/lru-shm/vm-scalability/0x2006906
>
> commit:
> dcdf11ee14 ("mm, shmem: add vmstat for hugepage fallback")
> 85b9f46e8e ("mm, thp: track fallbacks due to failed memcg charges separately")
>
> dcdf11ee14413332 85b9f46e8ea451633ccd60a7d8c
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> 1:4 24% 2:4 perf-profile.calltrace.cycles-pp.sync_regs.error_entry.do_access
> 3:4 53% 5:4 perf-profile.calltrace.cycles-pp.error_entry.do_access
> 9:4 -27% 8:4 perf-profile.children.cycles-pp.error_entry
> 4:4 -10% 4:4 perf-profile.self.cycles-pp.error_entry
> %stddev %change %stddev
> \ | \
> 477291 -9.1% 434041 vm-scalability.median
> 49791027 -8.7% 45476799 vm-scalability.throughput
> 223.67 +1.6% 227.36 vm-scalability.time.elapsed_time
> 223.67 +1.6% 227.36 vm-scalability.time.elapsed_time.max
> 50364 ± 6% +24.1% 62482 ± 10% vm-scalability.time.involuntary_context_switches
> 2237 +7.8% 2412 vm-scalability.time.percent_of_cpu_this_job_got
> 3084 +18.2% 3646 vm-scalability.time.system_time
> 1921 -4.2% 1839 vm-scalability.time.user_time
> 13.68 +2.2 15.86 mpstat.cpu.all.sys%
> 28535 ± 30% -47.0% 15114 ± 79% numa-numastat.node0.other_node
> 142734 ± 11% -19.4% 115000 ± 17% numa-meminfo.node0.AnonPages
> 11168 ± 3% +8.8% 12150 ± 5% numa-meminfo.node1.PageTables
> 76.00 -1.6% 74.75 vmstat.cpu.id
> 3626 -1.9% 3555 vmstat.system.cs
> 2214928 ±166% -96.6% 75321 ± 7% cpuidle.C1.usage
> 200981 ± 7% -18.0% 164861 ± 7% cpuidle.POLL.time
> 52675 ± 3% -16.7% 43866 ± 10% cpuidle.POLL.usage
> 35659 ± 11% -19.4% 28754 ± 17% numa-vmstat.node0.nr_anon_pages
> 1248014 ± 3% +10.9% 1384236 numa-vmstat.node1.nr_mapped
> 2722 ± 4% +10.6% 3011 ± 5% numa-vmstat.node1.nr_page_table_pages

I'm not sure that I'm reading this correctly, but I suspect that this just
happens because of NUMA: memory affinity will obviously impact
vm-scalability.throughput quite substantially, but I don't think the
bisected commit can be to be blame. Commit 85b9f46e8ea4 ("mm, thp: track
fallbacks due to failed memcg charges separately") simply adds new
count_vm_event() calls in a couple areas to track thp fallback due to
memcg limits separate from fragmentation.

It's likely a question about the testing methodology in general: for
memory intensive benchmarks, I suggest it is configured in a manner that
we can expect consistent memory access latency at the hardware level when
running on a NUMA system.

2020-10-20 09:14:40

by Huang, Ying

[permalink] [raw]

Subject: Re: [mm, thp] 85b9f46e8e: vm-scalability.throughput -8.7% regression

David Rientjes <[email protected]> writes:

> On Sun, 4 Oct 2020, kernel test robot wrote:
>
>> Greeting,
>>
>> FYI, we noticed a -8.7% regression of vm-scalability.throughput due to commit:
>>
>>
>> commit: 85b9f46e8ea451633ccd60a7d8cacbfff9f34047 ("mm, thp: track fallbacks due to failed memcg charges separately")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>>
>> in testcase: vm-scalability
>> on test machine: 104 threads Skylake with 192G memory
>> with following parameters:
>>
>> runtime: 300s
>> size: 1T
>> test: lru-shm
>> cpufreq_governor: performance
>> ucode: 0x2006906
>>
>> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
>> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>>
>>
>>
>> If you fix the issue, kindly add following tag
>> Reported-by: kernel test robot <[email protected]>
>>
>>
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>>
>>
>> To reproduce:
>>
>> git clone https://github.com/intel/lkp-tests.git
>> cd lkp-tests
>> bin/lkp install job.yaml # job file is attached in this email
>> bin/lkp run job.yaml
>>
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
>> gcc-9/performance/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/1T/lkp-skl-fpga01/lru-shm/vm-scalability/0x2006906
>>
>> commit:
>> dcdf11ee14 ("mm, shmem: add vmstat for hugepage fallback")
>> 85b9f46e8e ("mm, thp: track fallbacks due to failed memcg charges separately")
>>
>> dcdf11ee14413332 85b9f46e8ea451633ccd60a7d8c
>> ---------------- ---------------------------
>> fail:runs %reproduction fail:runs
>> | | |
>> 1:4 24% 2:4 perf-profile.calltrace.cycles-pp.sync_regs.error_entry.do_access
>> 3:4 53% 5:4 perf-profile.calltrace.cycles-pp.error_entry.do_access
>> 9:4 -27% 8:4 perf-profile.children.cycles-pp.error_entry
>> 4:4 -10% 4:4 perf-profile.self.cycles-pp.error_entry
>> %stddev %change %stddev
>> \ | \
>> 477291 -9.1% 434041 vm-scalability.median
>> 49791027 -8.7% 45476799 vm-scalability.throughput
>> 223.67 +1.6% 227.36 vm-scalability.time.elapsed_time
>> 223.67 +1.6% 227.36 vm-scalability.time.elapsed_time.max
>> 50364 ± 6% +24.1% 62482 ± 10% vm-scalability.time.involuntary_context_switches
>> 2237 +7.8% 2412 vm-scalability.time.percent_of_cpu_this_job_got
>> 3084 +18.2% 3646 vm-scalability.time.system_time
>> 1921 -4.2% 1839 vm-scalability.time.user_time
>> 13.68 +2.2 15.86 mpstat.cpu.all.sys%
>> 28535 ± 30% -47.0% 15114 ± 79% numa-numastat.node0.other_node
>> 142734 ± 11% -19.4% 115000 ± 17% numa-meminfo.node0.AnonPages
>> 11168 ± 3% +8.8% 12150 ± 5% numa-meminfo.node1.PageTables
>> 76.00 -1.6% 74.75 vmstat.cpu.id
>> 3626 -1.9% 3555 vmstat.system.cs
>> 2214928 ±166% -96.6% 75321 ± 7% cpuidle.C1.usage
>> 200981 ± 7% -18.0% 164861 ± 7% cpuidle.POLL.time
>> 52675 ± 3% -16.7% 43866 ± 10% cpuidle.POLL.usage
>> 35659 ± 11% -19.4% 28754 ± 17% numa-vmstat.node0.nr_anon_pages
>> 1248014 ± 3% +10.9% 1384236 numa-vmstat.node1.nr_mapped
>> 2722 ± 4% +10.6% 3011 ± 5% numa-vmstat.node1.nr_page_table_pages
>
> I'm not sure that I'm reading this correctly, but I suspect that this just
> happens because of NUMA: memory affinity will obviously impact
> vm-scalability.throughput quite substantially, but I don't think the
> bisected commit can be to be blame. Commit 85b9f46e8ea4 ("mm, thp: track
> fallbacks due to failed memcg charges separately") simply adds new
> count_vm_event() calls in a couple areas to track thp fallback due to
> memcg limits separate from fragmentation.
>
> It's likely a question about the testing methodology in general: for
> memory intensive benchmarks, I suggest it is configured in a manner that
> we can expect consistent memory access latency at the hardware level when
> running on a NUMA system.

So you think it's better to bind processes to NUMA node or CPU? But we
want to use this test case to capture NUMA/CPU placement/balance issue
too.

0day solve the problem in another way. We run the test case
multiple-times and calculate the average and standard deviation, then
compare.

For this specific regression, I found something strange,

10.93 ± 15% +10.8 21.78 ± 10% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.shmem_getpage_gfp.shmem_fault

It appears the lock contention becomes heavier with the patch. But I
cannot understand why too.

Best Regards,
Huang, Ying

2020-10-21 09:08:46

by David Rientjes

[permalink] [raw]

Subject: Re: [mm, thp] 85b9f46e8e: vm-scalability.throughput -8.7% regression

On Tue, 20 Oct 2020, Huang, Ying wrote:

> >> =========================================================================================
> >> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
> >> gcc-9/performance/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/1T/lkp-skl-fpga01/lru-shm/vm-scalability/0x2006906
> >>
> >> commit:
> >> dcdf11ee14 ("mm, shmem: add vmstat for hugepage fallback")
> >> 85b9f46e8e ("mm, thp: track fallbacks due to failed memcg charges separately")
> >>
> >> dcdf11ee14413332 85b9f46e8ea451633ccd60a7d8c
> >> ---------------- ---------------------------
> >> fail:runs %reproduction fail:runs
> >> | | |
> >> 1:4 24% 2:4 perf-profile.calltrace.cycles-pp.sync_regs.error_entry.do_access
> >> 3:4 53% 5:4 perf-profile.calltrace.cycles-pp.error_entry.do_access
> >> 9:4 -27% 8:4 perf-profile.children.cycles-pp.error_entry
> >> 4:4 -10% 4:4 perf-profile.self.cycles-pp.error_entry
> >> %stddev %change %stddev
> >> \ | \
> >> 477291 -9.1% 434041 vm-scalability.median
> >> 49791027 -8.7% 45476799 vm-scalability.throughput
> >> 223.67 +1.6% 227.36 vm-scalability.time.elapsed_time
> >> 223.67 +1.6% 227.36 vm-scalability.time.elapsed_time.max
> >> 50364 ± 6% +24.1% 62482 ± 10% vm-scalability.time.involuntary_context_switches
> >> 2237 +7.8% 2412 vm-scalability.time.percent_of_cpu_this_job_got
> >> 3084 +18.2% 3646 vm-scalability.time.system_time
> >> 1921 -4.2% 1839 vm-scalability.time.user_time
> >> 13.68 +2.2 15.86 mpstat.cpu.all.sys%
> >> 28535 ± 30% -47.0% 15114 ± 79% numa-numastat.node0.other_node
> >> 142734 ± 11% -19.4% 115000 ± 17% numa-meminfo.node0.AnonPages
> >> 11168 ± 3% +8.8% 12150 ± 5% numa-meminfo.node1.PageTables
> >> 76.00 -1.6% 74.75 vmstat.cpu.id
> >> 3626 -1.9% 3555 vmstat.system.cs
> >> 2214928 ±166% -96.6% 75321 ± 7% cpuidle.C1.usage
> >> 200981 ± 7% -18.0% 164861 ± 7% cpuidle.POLL.time
> >> 52675 ± 3% -16.7% 43866 ± 10% cpuidle.POLL.usage
> >> 35659 ± 11% -19.4% 28754 ± 17% numa-vmstat.node0.nr_anon_pages
> >> 1248014 ± 3% +10.9% 1384236 numa-vmstat.node1.nr_mapped
> >> 2722 ± 4% +10.6% 3011 ± 5% numa-vmstat.node1.nr_page_table_pages
> >
> > I'm not sure that I'm reading this correctly, but I suspect that this just
> > happens because of NUMA: memory affinity will obviously impact
> > vm-scalability.throughput quite substantially, but I don't think the
> > bisected commit can be to be blame. Commit 85b9f46e8ea4 ("mm, thp: track
> > fallbacks due to failed memcg charges separately") simply adds new
> > count_vm_event() calls in a couple areas to track thp fallback due to
> > memcg limits separate from fragmentation.
> >
> > It's likely a question about the testing methodology in general: for
> > memory intensive benchmarks, I suggest it is configured in a manner that
> > we can expect consistent memory access latency at the hardware level when
> > running on a NUMA system.
>
> So you think it's better to bind processes to NUMA node or CPU? But we
> want to use this test case to capture NUMA/CPU placement/balance issue
> too.
>

No, because binding to a specific socket may cause other performance
"improvements" or "degradations" depending on how fragmented local memory
is, or whether or not it's under memory pressure. Is the system rebooted
before testing so that we have a consistent state of memory availability
and fragmentation across sockets?

> 0day solve the problem in another way. We run the test case
> multiple-times and calculate the average and standard deviation, then
> compare.
>

Depending on fragmentation or memory availability, any benchmark that
assesses performance may be adversely affected if its results can be
impacted by hugepage backing.

2020-10-21 11:56:24

by Huang, Ying

[permalink] [raw]

Subject: Re: [mm, thp] 85b9f46e8e: vm-scalability.throughput -8.7% regression

David Rientjes <[email protected]> writes:

> On Tue, 20 Oct 2020, Huang, Ying wrote:
>
>> >> =========================================================================================
>> >> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
>> >> gcc-9/performance/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/1T/lkp-skl-fpga01/lru-shm/vm-scalability/0x2006906
>> >>
>> >> commit:
>> >> dcdf11ee14 ("mm, shmem: add vmstat for hugepage fallback")
>> >> 85b9f46e8e ("mm, thp: track fallbacks due to failed memcg charges separately")
>> >>
>> >> dcdf11ee14413332 85b9f46e8ea451633ccd60a7d8c
>> >> ---------------- ---------------------------
>> >> fail:runs %reproduction fail:runs
>> >> | | |
>> >> 1:4 24% 2:4 perf-profile.calltrace.cycles-pp.sync_regs.error_entry.do_access
>> >> 3:4 53% 5:4 perf-profile.calltrace.cycles-pp.error_entry.do_access
>> >> 9:4 -27% 8:4 perf-profile.children.cycles-pp.error_entry
>> >> 4:4 -10% 4:4 perf-profile.self.cycles-pp.error_entry
>> >> %stddev %change %stddev
>> >> \ | \
>> >> 477291 -9.1% 434041 vm-scalability.median
>> >> 49791027 -8.7% 45476799 vm-scalability.throughput
>> >> 223.67 +1.6% 227.36 vm-scalability.time.elapsed_time
>> >> 223.67 +1.6% 227.36 vm-scalability.time.elapsed_time.max
>> >> 50364 ± 6% +24.1% 62482 ± 10% vm-scalability.time.involuntary_context_switches
>> >> 2237 +7.8% 2412 vm-scalability.time.percent_of_cpu_this_job_got
>> >> 3084 +18.2% 3646 vm-scalability.time.system_time
>> >> 1921 -4.2% 1839 vm-scalability.time.user_time
>> >> 13.68 +2.2 15.86 mpstat.cpu.all.sys%
>> >> 28535 ± 30% -47.0% 15114 ± 79% numa-numastat.node0.other_node
>> >> 142734 ± 11% -19.4% 115000 ± 17% numa-meminfo.node0.AnonPages
>> >> 11168 ± 3% +8.8% 12150 ± 5% numa-meminfo.node1.PageTables
>> >> 76.00 -1.6% 74.75 vmstat.cpu.id
>> >> 3626 -1.9% 3555 vmstat.system.cs
>> >> 2214928 ±166% -96.6% 75321 ± 7% cpuidle.C1.usage
>> >> 200981 ± 7% -18.0% 164861 ± 7% cpuidle.POLL.time
>> >> 52675 ± 3% -16.7% 43866 ± 10% cpuidle.POLL.usage
>> >> 35659 ± 11% -19.4% 28754 ± 17% numa-vmstat.node0.nr_anon_pages
>> >> 1248014 ± 3% +10.9% 1384236 numa-vmstat.node1.nr_mapped
>> >> 2722 ± 4% +10.6% 3011 ± 5% numa-vmstat.node1.nr_page_table_pages
>> >
>> > I'm not sure that I'm reading this correctly, but I suspect that this just
>> > happens because of NUMA: memory affinity will obviously impact
>> > vm-scalability.throughput quite substantially, but I don't think the
>> > bisected commit can be to be blame. Commit 85b9f46e8ea4 ("mm, thp: track
>> > fallbacks due to failed memcg charges separately") simply adds new
>> > count_vm_event() calls in a couple areas to track thp fallback due to
>> > memcg limits separate from fragmentation.
>> >
>> > It's likely a question about the testing methodology in general: for
>> > memory intensive benchmarks, I suggest it is configured in a manner that
>> > we can expect consistent memory access latency at the hardware level when
>> > running on a NUMA system.
>>
>> So you think it's better to bind processes to NUMA node or CPU? But we
>> want to use this test case to capture NUMA/CPU placement/balance issue
>> too.
>>
>
> No, because binding to a specific socket may cause other performance
> "improvements" or "degradations" depending on how fragmented local memory
> is, or whether or not it's under memory pressure. Is the system rebooted
> before testing so that we have a consistent state of memory availability
> and fragmentation across sockets?

Yes. System is rebooted before testing (0day uses kexec to accelerate
rebooting).

>> 0day solve the problem in another way. We run the test case
>> multiple-times and calculate the average and standard deviation, then
>> compare.
>>
>
> Depending on fragmentation or memory availability, any benchmark that
> assesses performance may be adversely affected if its results can be
> impacted by hugepage backing.

Best Regards,
Huang, Ying