2019-05-20 08:44:09

by Chen, Rong A

Subject: [mm] 42a3003535: will-it-scale.per_process_ops -25.9% regression

Greetings,

FYI, we noticed a -25.9% regression of will-it-scale.per_process_ops due to commit:


commit: 42a300353577ccc17ecc627b8570a89fa1678bec ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: will-it-scale
on test machine: 192 threads Skylake-SP with 256G memory
with the following parameters:

nr_task: 100%
mode: process
test: page_fault3
cpufreq_governor: performance

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process-based and a threads-based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale



Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-7/performance/x86_64-rhel-7.6/process/100%/debian-x86_64-2018-04-03.cgz/lkp-skl-4sp1/page_fault3/will-it-scale

commit:
db9adbcbe7 ("mm: memcontrol: move stat/event counting functions out-of-line")
42a3003535 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")

db9adbcbe740e098 42a300353577ccc17ecc627b857
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
44:4 -283% 33:4 perf-profile.calltrace.cycles-pp.error_entry.testcase
23:4 -151% 17:4 perf-profile.calltrace.cycles-pp.sync_regs.error_entry.testcase
46:4 -294% 34:4 perf-profile.children.cycles-pp.error_entry
0:4 -2% 0:4 perf-profile.children.cycles-pp.error_exit
21:4 -138% 16:4 perf-profile.self.cycles-pp.error_entry
0:4 -2% 0:4 perf-profile.self.cycles-pp.error_exit
%stddev %change %stddev
\ | \
457395 -25.9% 339092 will-it-scale.per_process_ops
87819982 -25.9% 65105780 will-it-scale.workload
14796 ± 54% +95.8% 28964 ± 24% numa-meminfo.node1.Inactive(anon)
38.02 +6.1% 40.35 ± 5% boot-time.boot
7275 +4.7% 7618 ± 5% boot-time.idle
58843 ± 96% -93.8% 3646 ± 12% cpuidle.C1.usage
735515 ± 55% -77.2% 167576 ± 4% cpuidle.C6.usage
269480 ± 10% +15.7% 311822 ± 3% sched_debug.cfs_rq:/.min_vruntime.stddev
268385 ± 10% +15.7% 310417 ± 2% sched_debug.cfs_rq:/.spread0.stddev
79.00 +3.2% 81.50 vmstat.cpu.sy
17.75 ± 2% -8.5% 16.25 ± 2% vmstat.cpu.us
7938 -2.5% 7741 vmstat.system.cs
462347 -4.4% 441932 vmstat.system.in
60227822 -22.5% 46651173 proc-vmstat.numa_hit
60082905 -22.6% 46506548 proc-vmstat.numa_local
60343904 -22.5% 46759361 proc-vmstat.pgalloc_normal
2.646e+10 -25.9% 1.962e+10 proc-vmstat.pgfault
60301946 -22.5% 46714581 proc-vmstat.pgfree
57689 ± 99% -96.0% 2309 ± 15% turbostat.C1
692737 ± 61% -83.4% 114720 ± 10% turbostat.C6
0.30 ± 67% -68.1% 0.10 ± 22% turbostat.Pkg%pc2
524.21 -6.8% 488.32 turbostat.PkgWatt
122.63 -7.6% 113.33 turbostat.RAMWatt
15022686 -18.4% 12260958 ± 2% numa-numastat.node0.local_node
15051138 -18.4% 12280459 ± 2% numa-numastat.node0.numa_hit
14900790 -24.7% 11213127 numa-numastat.node1.local_node
14941317 -24.7% 11258221 numa-numastat.node1.numa_hit
14987489 ± 2% -24.5% 11320760 ± 2% numa-numastat.node2.local_node
15022530 ± 2% -24.3% 11368910 ± 2% numa-numastat.node2.numa_hit
15145257 -22.8% 11685459 numa-numastat.node3.local_node
15186130 -22.8% 11717320 numa-numastat.node3.numa_hit
1.04 +11.4% 1.16 irq_exception_noise.__do_page_fault.60th
1.07 +22.5% 1.32 irq_exception_noise.__do_page_fault.70th
1.11 +40.9% 1.57 irq_exception_noise.__do_page_fault.80th
1.19 +65.2% 1.96 irq_exception_noise.__do_page_fault.90th
1.29 +93.5% 2.50 ± 2% irq_exception_noise.__do_page_fault.95th
1.64 +202.5% 4.96 ± 4% irq_exception_noise.__do_page_fault.99th
1.07 +28.0% 1.37 irq_exception_noise.__do_page_fault.avg
214103 +28.0% 274033 irq_exception_noise.__do_page_fault.sum
372.75 ± 3% +17.8% 439.25 ± 2% irq_exception_noise.softirq_nr
188.50 ± 11% +26.7% 238.75 ± 10% irq_exception_noise.softirq_time
8826971 ± 3% -15.6% 7452852 ± 3% numa-vmstat.node0.numa_hit
8798559 ± 3% -15.5% 7433083 ± 4% numa-vmstat.node0.numa_local
3634 ± 55% +100.6% 7291 ± 24% numa-vmstat.node1.nr_inactive_anon
3638 ± 55% +100.4% 7291 ± 24% numa-vmstat.node1.nr_zone_inactive_anon
8654591 -21.8% 6764571 numa-vmstat.node1.numa_hit
8529741 -22.2% 6635474 numa-vmstat.node1.numa_local
8789111 ± 2% -22.2% 6838852 ± 3% numa-vmstat.node2.numa_hit
8669691 ± 3% -22.6% 6706417 ± 3% numa-vmstat.node2.numa_local
8870528 ± 2% -21.2% 6989990 numa-vmstat.node3.numa_hit
8745356 ± 2% -21.4% 6873917 numa-vmstat.node3.numa_local
5.1e+10 -25.3% 3.807e+10 perf-stat.i.branch-instructions
4.024e+08 -24.3% 3.046e+08 perf-stat.i.branch-misses
70.00 -8.4 61.59 perf-stat.i.cache-miss-rate%
2.932e+08 -6.7% 2.737e+08 perf-stat.i.cache-misses
4.174e+08 +6.3% 4.439e+08 perf-stat.i.cache-references
7791 -3.4% 7529 perf-stat.i.context-switches
1.69 ± 2% +32.2% 2.23 perf-stat.i.cpi
1435 +6.9% 1534 perf-stat.i.cycles-between-cache-misses
7.381e+10 -25.2% 5.518e+10 perf-stat.i.dTLB-loads
1.447e+09 -26.0% 1.072e+09 perf-stat.i.dTLB-store-misses
4.045e+10 -25.1% 3.031e+10 perf-stat.i.dTLB-stores
68.74 -27.5 41.27 perf-stat.i.iTLB-load-miss-rate%
1.882e+08 -24.9% 1.413e+08 perf-stat.i.iTLB-load-misses
94586813 ± 5% +147.3% 2.339e+08 perf-stat.i.iTLB-loads
2.512e+11 -25.2% 1.879e+11 perf-stat.i.instructions
0.60 -25.5% 0.45 perf-stat.i.ipc
87236715 -25.5% 64948201 perf-stat.i.minor-faults
1.27 ± 34% +3.5 4.75 ± 5% perf-stat.i.node-load-miss-rate%
371727 ± 8% +291.3% 1454574 ± 2% perf-stat.i.node-load-misses
40821938 -27.1% 29771040 perf-stat.i.node-loads
7.55 ± 5% +8.2 15.80 perf-stat.i.node-store-miss-rate%
6885297 +79.9% 12389289 perf-stat.i.node-store-misses
89036136 -25.5% 66289356 perf-stat.i.node-stores
87237611 -25.5% 64948420 perf-stat.i.page-faults
1.66 +42.2% 2.36 perf-stat.overall.MPKI
0.79 +0.0 0.80 perf-stat.overall.branch-miss-rate%
70.23 -8.6 61.65 perf-stat.overall.cache-miss-rate%
1.65 +34.8% 2.22 perf-stat.overall.cpi
1414 +8.0% 1527 perf-stat.overall.cycles-between-cache-misses
0.01 ± 13% +0.0 0.01 ± 16% perf-stat.overall.dTLB-load-miss-rate%
3.45 -0.0 3.41 perf-stat.overall.dTLB-store-miss-rate%
66.57 ± 2% -28.9 37.66 perf-stat.overall.iTLB-load-miss-rate%
0.61 -25.8% 0.45 perf-stat.overall.ipc
0.90 ± 9% +3.8 4.66 ± 2% perf-stat.overall.node-load-miss-rate%
7.18 +8.6 15.75 perf-stat.overall.node-store-miss-rate%
5.079e+10 -25.3% 3.793e+10 perf-stat.ps.branch-instructions
4.011e+08 -24.3% 3.037e+08 perf-stat.ps.branch-misses
2.92e+08 -6.6% 2.726e+08 perf-stat.ps.cache-misses
4.158e+08 +6.4% 4.423e+08 perf-stat.ps.cache-references
7755 -3.2% 7509 perf-stat.ps.context-switches
7.351e+10 -25.2% 5.498e+10 perf-stat.ps.dTLB-loads
1.441e+09 -25.9% 1.068e+09 perf-stat.ps.dTLB-store-misses
4.028e+10 -25.0% 3.02e+10 perf-stat.ps.dTLB-stores
1.874e+08 -24.9% 1.407e+08 perf-stat.ps.iTLB-load-misses
94208530 ± 5% +147.3% 2.33e+08 perf-stat.ps.iTLB-loads
2.502e+11 -25.2% 1.872e+11 perf-stat.ps.instructions
86898473 -25.5% 64708711 perf-stat.ps.minor-faults
371049 ± 8% +290.7% 1449665 ± 2% perf-stat.ps.node-load-misses
40664993 -27.1% 29662127 perf-stat.ps.node-loads
6859094 +80.0% 12344092 perf-stat.ps.node-store-misses
88694004 -25.5% 66047901 perf-stat.ps.node-stores
86899384 -25.5% 64708780 perf-stat.ps.page-faults
7.618e+13 -25.5% 5.673e+13 perf-stat.total.instructions
13965 ± 7% -12.3% 12250 ± 3% softirqs.CPU104.RCU
13818 ± 6% -11.9% 12167 ± 4% softirqs.CPU105.RCU
13944 ± 6% -12.3% 12231 ± 4% softirqs.CPU106.RCU
13932 ± 8% -12.2% 12237 ± 4% softirqs.CPU107.RCU
13838 ± 6% -10.9% 12330 ± 4% softirqs.CPU109.RCU
13524 ± 6% -10.0% 12174 ± 4% softirqs.CPU111.RCU
15248 ± 10% -14.3% 13068 ± 8% softirqs.CPU112.RCU
14832 ± 7% -13.7% 12799 ± 4% softirqs.CPU113.RCU
14741 ± 8% -11.0% 13117 ± 6% softirqs.CPU114.RCU
14729 ± 8% -14.3% 12616 ± 3% softirqs.CPU115.RCU
15234 ± 6% -16.8% 12676 ± 5% softirqs.CPU116.RCU
14968 ± 7% -15.4% 12656 ± 5% softirqs.CPU117.RCU
15220 ± 8% -17.8% 12515 ± 4% softirqs.CPU118.RCU
15680 ± 7% -13.5% 13559 ± 4% softirqs.CPU120.RCU
14621 ± 8% -12.3% 12821 ± 4% softirqs.CPU123.RCU
16075 ± 11% -19.2% 12983 ± 4% softirqs.CPU13.RCU
13799 ± 10% -14.3% 11823 ± 3% softirqs.CPU133.RCU
13752 ± 9% -15.2% 11659 ± 5% softirqs.CPU134.RCU
13855 ± 9% -14.0% 11915 ± 4% softirqs.CPU138.RCU
13601 ± 9% -15.5% 11499 ± 4% softirqs.CPU142.RCU
13840 ± 11% -19.4% 11154 ± 14% softirqs.CPU145.RCU
13154 ± 7% -12.5% 11505 ± 3% softirqs.CPU147.RCU
13396 ± 7% -13.2% 11622 ± 3% softirqs.CPU148.RCU
13392 ± 7% -13.4% 11595 ± 4% softirqs.CPU150.RCU
13335 ± 6% -13.0% 11607 ± 4% softirqs.CPU151.RCU
13446 ± 6% -18.1% 11013 ± 13% softirqs.CPU152.RCU
13447 ± 6% -16.9% 11179 ± 3% softirqs.CPU153.RCU
13557 ± 9% -14.3% 11617 ± 3% softirqs.CPU154.RCU
13282 ± 8% -13.1% 11547 ± 3% softirqs.CPU155.RCU
13098 ± 8% -12.2% 11502 ± 3% softirqs.CPU158.RCU
13096 ± 8% -12.0% 11528 ± 3% softirqs.CPU159.RCU
14572 ± 9% -13.0% 12683 ± 5% softirqs.CPU160.RCU
14327 ± 5% -12.7% 12510 ± 5% softirqs.CPU161.RCU
14348 ± 6% -11.1% 12751 ± 4% softirqs.CPU162.RCU
14213 ± 6% -10.7% 12692 ± 4% softirqs.CPU163.RCU
14249 ± 6% -12.1% 12524 ± 5% softirqs.CPU164.RCU
14234 ± 6% -12.0% 12523 ± 6% softirqs.CPU165.RCU
14315 ± 5% -12.2% 12561 ± 5% softirqs.CPU166.RCU
13999 ± 6% -10.2% 12574 ± 5% softirqs.CPU167.RCU
15604 ± 6% -12.7% 13627 ± 3% softirqs.CPU17.RCU
15087 ± 10% -15.8% 12706 ± 4% softirqs.CPU174.RCU
13327 ± 8% -11.6% 11776 ± 7% softirqs.CPU185.RCU
13851 ± 9% -19.4% 11162 ± 4% softirqs.CPU187.RCU
15629 ± 6% -12.4% 13685 ± 4% softirqs.CPU20.RCU
15723 ± 6% -11.9% 13857 ± 6% softirqs.CPU21.RCU
16114 ± 7% -17.6% 13275 ± 3% softirqs.CPU22.RCU
15767 ± 6% -17.7% 12972 ± 8% softirqs.CPU23.RCU
16746 ± 8% -13.9% 14417 ± 3% softirqs.CPU24.RCU
14317 ± 5% -11.5% 12667 ± 3% softirqs.CPU3.RCU
14121 ± 9% -12.9% 12299 ± 4% softirqs.CPU32.RCU
14184 ± 8% -13.7% 12242 ± 4% softirqs.CPU38.RCU
14239 ± 9% -13.0% 12394 ± 4% softirqs.CPU40.RCU
15062 ± 8% -17.0% 12497 ± 4% softirqs.CPU41.RCU
14218 ± 8% -13.5% 12299 ± 4% softirqs.CPU42.RCU
14305 ± 8% -11.8% 12623 softirqs.CPU45.RCU
14295 ± 8% -15.5% 12080 ± 8% softirqs.CPU46.RCU
14199 ± 9% -12.7% 12397 ± 4% softirqs.CPU47.RCU
16386 ± 20% -21.3% 12892 ± 2% softirqs.CPU5.RCU
14976 ± 5% -13.0% 13026 ± 5% softirqs.CPU65.RCU
14809 ± 5% -12.6% 12942 ± 4% softirqs.CPU66.RCU
15051 ± 5% -14.5% 12865 ± 4% softirqs.CPU68.RCU
15246 ± 4% -15.4% 12900 ± 3% softirqs.CPU69.RCU
14766 ± 5% -14.6% 12617 ± 7% softirqs.CPU70.RCU
14623 ± 6% -11.4% 12949 ± 5% softirqs.CPU71.RCU
15793 ± 11% -17.6% 13009 ± 3% softirqs.CPU8.RCU
13773 ± 7% -12.6% 12032 ± 4% softirqs.CPU82.RCU
14870 ± 6% -13.3% 12894 ± 3% softirqs.CPU9.RCU
14727 ± 10% -17.6% 12135 ± 4% softirqs.CPU91.RCU
14321 ± 5% -14.0% 12313 ± 5% softirqs.CPU92.RCU
81.90 -4.5 77.37 perf-profile.calltrace.cycles-pp.page_fault.testcase
9.87 -1.1 8.80 perf-profile.calltrace.cycles-pp.__do_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault
9.50 -1.0 8.52 perf-profile.calltrace.cycles-pp.shmem_fault.__do_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault
8.17 -0.7 7.49 perf-profile.calltrace.cycles-pp.shmem_getpage_gfp.shmem_fault.__do_fault.__handle_mm_fault.handle_mm_fault
5.31 -0.6 4.68 perf-profile.calltrace.cycles-pp.find_get_entry.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault
7.06 -0.6 6.45 perf-profile.calltrace.cycles-pp.swapgs_restore_regs_and_return_to_usermode.testcase
2.44 -0.6 1.83 perf-profile.calltrace.cycles-pp.xas_load.find_get_entry.find_lock_entry.shmem_getpage_gfp.shmem_fault
2.60 -0.5 2.10 perf-profile.calltrace.cycles-pp.fault_dirty_shared_page.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault
1.82 -0.5 1.34 perf-profile.calltrace.cycles-pp.__do_page_fault.return_to_handler.testcase
1.82 -0.5 1.34 perf-profile.calltrace.cycles-pp.ftrace_graph_caller.__do_page_fault.return_to_handler.testcase
1.82 -0.5 1.34 perf-profile.calltrace.cycles-pp.return_to_handler.testcase
1.81 -0.5 1.33 perf-profile.calltrace.cycles-pp.prepare_ftrace_return.ftrace_graph_caller.__do_page_fault.return_to_handler.testcase
1.75 -0.5 1.28 perf-profile.calltrace.cycles-pp.function_graph_enter.prepare_ftrace_return.ftrace_graph_caller.__do_page_fault.return_to_handler
1.76 -0.4 1.35 perf-profile.calltrace.cycles-pp.__perf_sw_event.__do_page_fault.do_page_fault.page_fault.testcase
7.17 -0.4 6.75 perf-profile.calltrace.cycles-pp.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault.__handle_mm_fault
5.85 -0.4 5.43 perf-profile.calltrace.cycles-pp.ftrace_graph_caller.__do_page_fault.do_page_fault.page_fault.testcase
1.36 -0.3 1.04 perf-profile.calltrace.cycles-pp.___perf_sw_event.__perf_sw_event.__do_page_fault.do_page_fault.page_fault
0.84 -0.3 0.53 perf-profile.calltrace.cycles-pp.set_page_dirty.fault_dirty_shared_page.__handle_mm_fault.handle_mm_fault.__do_page_fault
1.22 -0.3 0.92 perf-profile.calltrace.cycles-pp.prepare_exit_to_usermode.swapgs_restore_regs_and_return_to_usermode.testcase
1.11 -0.3 0.81 perf-profile.calltrace.cycles-pp.trace_clock_local.function_graph_enter.prepare_ftrace_return.ftrace_graph_caller.__do_page_fault
5.35 -0.3 5.07 perf-profile.calltrace.cycles-pp.prepare_ftrace_return.ftrace_graph_caller.__do_page_fault.do_page_fault.page_fault
0.98 -0.3 0.71 perf-profile.calltrace.cycles-pp.sched_clock.trace_clock_local.function_graph_enter.prepare_ftrace_return.ftrace_graph_caller
0.99 -0.3 0.73 perf-profile.calltrace.cycles-pp.native_sched_clock.sched_clock.trace_clock_local.function_graph_enter.prepare_ftrace_return
1.23 ± 2% -0.2 0.98 ± 2% perf-profile.calltrace.cycles-pp.___perf_sw_event.__perf_sw_event.do_page_fault.page_fault.testcase
1.02 -0.2 0.81 perf-profile.calltrace.cycles-pp.vmacache_find.find_vma.__do_page_fault.do_page_fault.page_fault
5.03 -0.2 4.83 perf-profile.calltrace.cycles-pp.function_graph_enter.prepare_ftrace_return.ftrace_graph_caller.__do_page_fault.do_page_fault
1.12 -0.2 0.94 perf-profile.calltrace.cycles-pp.find_vma.__do_page_fault.do_page_fault.page_fault.testcase
1.20 -0.2 1.01 perf-profile.calltrace.cycles-pp.file_update_time.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault
0.85 -0.2 0.67 perf-profile.calltrace.cycles-pp.__set_page_dirty_no_writeback.fault_dirty_shared_page.__handle_mm_fault.handle_mm_fault.__do_page_fault
0.71 -0.2 0.54 perf-profile.calltrace.cycles-pp.__set_page_dirty_no_writeback.unmap_page_range.unmap_vmas.unmap_region.__do_munmap
1.73 -0.1 1.58 perf-profile.calltrace.cycles-pp.__perf_sw_event.do_page_fault.page_fault.testcase
0.76 -0.1 0.64 ± 2% perf-profile.calltrace.cycles-pp.tlb_flush_mmu.unmap_page_range.unmap_vmas.unmap_region.__do_munmap
4.64 -0.1 4.55 perf-profile.calltrace.cycles-pp.trace_graph_entry.function_graph_enter.prepare_ftrace_return.ftrace_graph_caller.__do_page_fault
0.68 ± 3% -0.1 0.61 perf-profile.calltrace.cycles-pp.current_time.file_update_time.__handle_mm_fault.handle_mm_fault.__do_page_fault
0.55 +0.1 0.64 perf-profile.calltrace.cycles-pp.unlock_page.fault_dirty_shared_page.__handle_mm_fault.handle_mm_fault.__do_page_fault
0.70 +0.3 0.97 perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_set_pte.finish_fault.__handle_mm_fault.handle_mm_fault
0.61 +0.4 1.03 perf-profile.calltrace.cycles-pp.up_read.__do_page_fault.do_page_fault.page_fault.testcase
0.00 +0.8 0.82 perf-profile.calltrace.cycles-pp.lock_page_memcg.page_remove_rmap.unmap_page_range.unmap_vmas.unmap_region
0.59 +1.2 1.75 perf-profile.calltrace.cycles-pp.down_read_trylock.__do_page_fault.do_page_fault.page_fault.testcase
79.30 +1.2 80.50 perf-profile.calltrace.cycles-pp.testcase
0.00 +2.7 2.67 perf-profile.calltrace.cycles-pp.lock_page_memcg.page_add_file_rmap.alloc_set_pte.finish_fault.__handle_mm_fault
22.92 +3.4 26.37 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
0.54 +3.7 4.26 perf-profile.calltrace.cycles-pp.__count_memcg_events.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
1.30 +3.9 5.17 perf-profile.calltrace.cycles-pp.__mod_lruvec_state.page_add_file_rmap.alloc_set_pte.finish_fault.__handle_mm_fault
7.93 +4.0 11.88 perf-profile.calltrace.cycles-pp.munmap
7.88 +4.0 11.84 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap
7.88 +4.0 11.84 perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap
7.89 +4.0 11.85 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.munmap
7.89 +4.0 11.85 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap
7.88 +4.0 11.84 perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
7.88 +4.0 11.84 perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
7.86 +4.0 11.82 perf-profile.calltrace.cycles-pp.unmap_vmas.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
7.78 +4.0 11.77 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.unmap_region.__do_munmap.__vm_munmap
1.22 +4.0 5.26 perf-profile.calltrace.cycles-pp.__mod_lruvec_state.page_remove_rmap.unmap_page_range.unmap_vmas.unmap_region
0.53 +4.1 4.58 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_state.__mod_lruvec_state.page_remove_rmap.unmap_page_range.unmap_vmas
0.00 +4.4 4.44 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_state.__mod_lruvec_state.page_add_file_rmap.alloc_set_pte.finish_fault
2.47 +4.8 7.29 perf-profile.calltrace.cycles-pp.page_remove_rmap.unmap_page_range.unmap_vmas.unmap_region.__do_munmap
5.32 +6.1 11.46 perf-profile.calltrace.cycles-pp.finish_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault
5.01 +6.2 11.24 perf-profile.calltrace.cycles-pp.alloc_set_pte.finish_fault.__handle_mm_fault.handle_mm_fault.__do_page_fault
40.44 +6.3 46.71 perf-profile.calltrace.cycles-pp.do_page_fault.page_fault.testcase
2.52 +6.3 8.84 perf-profile.calltrace.cycles-pp.page_add_file_rmap.alloc_set_pte.finish_fault.__handle_mm_fault.handle_mm_fault
38.29 +6.5 44.81 perf-profile.calltrace.cycles-pp.__do_page_fault.do_page_fault.page_fault.testcase
25.85 +6.6 32.41 perf-profile.calltrace.cycles-pp.handle_mm_fault.__do_page_fault.do_page_fault.page_fault.testcase
92.00 -4.0 88.05 perf-profile.children.cycles-pp.testcase
6.03 -1.6 4.48 perf-profile.children.cycles-pp.sync_regs
9.89 -1.1 8.81 perf-profile.children.cycles-pp.__do_fault
9.53 -1.0 8.54 perf-profile.children.cycles-pp.shmem_fault
7.71 -0.9 6.80 perf-profile.children.cycles-pp.ftrace_graph_caller
7.21 -0.8 6.43 perf-profile.children.cycles-pp.prepare_ftrace_return
8.18 -0.7 7.50 perf-profile.children.cycles-pp.shmem_getpage_gfp
6.81 -0.7 6.13 perf-profile.children.cycles-pp.function_graph_enter
5.36 -0.6 4.72 perf-profile.children.cycles-pp.find_get_entry
2.53 -0.6 1.91 perf-profile.children.cycles-pp.xas_load
7.07 -0.6 6.45 perf-profile.children.cycles-pp.swapgs_restore_regs_and_return_to_usermode
2.65 -0.6 2.08 perf-profile.children.cycles-pp.___perf_sw_event
3.51 -0.6 2.94 perf-profile.children.cycles-pp.__perf_sw_event
2.69 -0.5 2.16 perf-profile.children.cycles-pp.fault_dirty_shared_page
1.48 -0.5 0.97 perf-profile.children.cycles-pp.set_page_dirty
1.82 -0.5 1.34 perf-profile.children.cycles-pp.return_to_handler
7.24 -0.4 6.82 perf-profile.children.cycles-pp.find_lock_entry
1.46 -0.4 1.08 perf-profile.children.cycles-pp.page_mapping
1.60 -0.4 1.24 perf-profile.children.cycles-pp.__set_page_dirty_no_writeback
1.25 -0.3 0.94 perf-profile.children.cycles-pp.prepare_exit_to_usermode
1.13 -0.3 0.83 perf-profile.children.cycles-pp.trace_clock_local
1.03 -0.3 0.75 perf-profile.children.cycles-pp.sched_clock
1.00 -0.3 0.73 perf-profile.children.cycles-pp.native_sched_clock
1.01 -0.2 0.76 ± 2% perf-profile.children.cycles-pp.__mod_node_page_state
1.05 -0.2 0.82 perf-profile.children.cycles-pp.vmacache_find
1.23 -0.2 1.04 perf-profile.children.cycles-pp.file_update_time
1.15 -0.2 0.97 perf-profile.children.cycles-pp.find_vma
7.79 -0.2 7.61 perf-profile.children.cycles-pp.native_irq_return_iret
0.65 -0.1 0.50 perf-profile.children.cycles-pp.pmd_devmap_trans_unstable
4.79 -0.1 4.66 perf-profile.children.cycles-pp.trace_graph_entry
0.54 ± 2% -0.1 0.41 ± 3% perf-profile.children.cycles-pp.xas_start
0.47 -0.1 0.35 perf-profile.children.cycles-pp.__trace_graph_entry
0.77 -0.1 0.65 ± 2% perf-profile.children.cycles-pp.tlb_flush_mmu
0.23 ± 9% -0.1 0.12 ± 5% perf-profile.children.cycles-pp.retint_user
0.45 -0.1 0.34 perf-profile.children.cycles-pp.__x86_indirect_thunk_rax
0.35 -0.1 0.25 perf-profile.children.cycles-pp.trace_buffer_lock_reserve
0.70 -0.1 0.60 perf-profile.children.cycles-pp.___might_sleep
0.37 -0.1 0.28 perf-profile.children.cycles-pp.mark_page_accessed
0.27 ± 18% -0.1 0.17 ± 2% perf-profile.children.cycles-pp.mem_cgroup_from_task
0.39 ± 2% -0.1 0.30 perf-profile.children.cycles-pp._cond_resched
0.35 ± 2% -0.1 0.27 perf-profile.children.cycles-pp.ftrace_lookup_ip
0.23 ± 3% -0.1 0.15 ± 4% perf-profile.children.cycles-pp.__tlb_remove_page_size
0.72 ± 3% -0.1 0.65 perf-profile.children.cycles-pp.current_time
0.28 -0.1 0.20 ± 2% perf-profile.children.cycles-pp.__might_sleep
0.20 -0.1 0.14 ± 3% perf-profile.children.cycles-pp._vm_normal_page
0.56 -0.1 0.50 ± 2% perf-profile.children.cycles-pp.release_pages
0.22 -0.1 0.16 ± 4% perf-profile.children.cycles-pp.ring_buffer_lock_reserve
0.23 -0.1 0.18 ± 2% perf-profile.children.cycles-pp.fpregs_assert_state_consistent
0.18 ± 2% -0.1 0.12 ± 4% perf-profile.children.cycles-pp.pmd_pfn
0.22 ± 3% -0.0 0.17 ± 4% perf-profile.children.cycles-pp.perf_exclude_event
0.21 ± 2% -0.0 0.17 ± 3% perf-profile.children.cycles-pp.pmd_page_vaddr
0.20 ± 2% -0.0 0.15 ± 3% perf-profile.children.cycles-pp.page_rmapping
0.20 -0.0 0.15 ± 2% perf-profile.children.cycles-pp.free_pages_and_swap_cache
0.15 ± 2% -0.0 0.11 ± 4% perf-profile.children.cycles-pp.PageHuge
0.13 ± 3% -0.0 0.09 ± 4% perf-profile.children.cycles-pp.native_iret
0.17 -0.0 0.14 perf-profile.children.cycles-pp.rcu_all_qs
0.10 ± 4% -0.0 0.08 ± 5% perf-profile.children.cycles-pp.perf_swevent_event
0.09 ± 5% -0.0 0.07 ± 7% perf-profile.children.cycles-pp.pte_alloc_one
0.07 ± 5% -0.0 0.05 ± 9% perf-profile.children.cycles-pp.get_page_from_freelist
0.08 ± 5% -0.0 0.06 perf-profile.children.cycles-pp.__alloc_pages_nodemask
0.56 +0.1 0.64 perf-profile.children.cycles-pp.unlock_page
0.12 ± 4% +0.1 0.27 perf-profile.children.cycles-pp.__unlock_page_memcg
0.74 +0.3 1.01 perf-profile.children.cycles-pp._raw_spin_lock
0.61 +0.4 1.03 perf-profile.children.cycles-pp.up_read
61.67 +0.7 62.41 perf-profile.children.cycles-pp.page_fault
0.61 +1.2 1.77 perf-profile.children.cycles-pp.down_read_trylock
0.35 +3.2 3.51 perf-profile.children.cycles-pp.lock_page_memcg
23.05 +3.4 26.47 perf-profile.children.cycles-pp.__handle_mm_fault
0.54 +3.7 4.26 perf-profile.children.cycles-pp.__count_memcg_events
7.93 +4.0 11.88 perf-profile.children.cycles-pp.munmap
7.93 +4.0 11.89 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
7.93 +4.0 11.89 perf-profile.children.cycles-pp.do_syscall_64
7.88 +4.0 11.84 perf-profile.children.cycles-pp.__x64_sys_munmap
7.88 +4.0 11.84 perf-profile.children.cycles-pp.__vm_munmap
7.88 +4.0 11.84 perf-profile.children.cycles-pp.__do_munmap
7.88 +4.0 11.84 perf-profile.children.cycles-pp.unmap_region
7.86 +4.0 11.82 perf-profile.children.cycles-pp.unmap_vmas
7.86 +4.0 11.82 perf-profile.children.cycles-pp.unmap_page_range
2.50 +4.8 7.33 perf-profile.children.cycles-pp.page_remove_rmap
40.28 +6.0 46.28 perf-profile.children.cycles-pp.__do_page_fault
5.36 +6.1 11.49 perf-profile.children.cycles-pp.finish_fault
5.13 +6.2 11.32 perf-profile.children.cycles-pp.alloc_set_pte
40.60 +6.2 46.82 perf-profile.children.cycles-pp.do_page_fault
2.55 +6.3 8.87 perf-profile.children.cycles-pp.page_add_file_rmap
26.02 +6.5 32.55 perf-profile.children.cycles-pp.handle_mm_fault
2.61 +7.8 10.45 perf-profile.children.cycles-pp.__mod_lruvec_state
1.01 +8.0 9.03 ± 2% perf-profile.children.cycles-pp.__mod_memcg_state
22.09 -5.7 16.41 perf-profile.self.cycles-pp.testcase
5.95 -1.5 4.42 perf-profile.self.cycles-pp.sync_regs
3.44 -0.8 2.62 perf-profile.self.cycles-pp.__handle_mm_fault
2.15 -0.6 1.59 perf-profile.self.cycles-pp.handle_mm_fault
2.35 -0.5 1.85 perf-profile.self.cycles-pp.___perf_sw_event
2.03 -0.5 1.53 perf-profile.self.cycles-pp.xas_load
1.58 -0.4 1.15 perf-profile.self.cycles-pp.__do_page_fault
1.40 -0.4 1.04 perf-profile.self.cycles-pp.page_mapping
1.45 -0.3 1.13 perf-profile.self.cycles-pp.__set_page_dirty_no_writeback
1.16 -0.3 0.84 perf-profile.self.cycles-pp.alloc_set_pte
1.30 -0.3 0.99 perf-profile.self.cycles-pp.shmem_fault
5.81 -0.3 5.51 perf-profile.self.cycles-pp.swapgs_restore_regs_and_return_to_usermode
0.99 -0.3 0.72 perf-profile.self.cycles-pp.function_graph_enter
0.97 -0.3 0.71 perf-profile.self.cycles-pp.native_sched_clock
0.90 -0.2 0.67 ± 6% perf-profile.self.cycles-pp.shmem_getpage_gfp
1.01 -0.2 0.79 perf-profile.self.cycles-pp.vmacache_find
2.39 -0.2 2.17 perf-profile.self.cycles-pp.unmap_page_range
0.92 -0.2 0.70 perf-profile.self.cycles-pp.prepare_exit_to_usermode
0.53 -0.2 0.31 perf-profile.self.cycles-pp.set_page_dirty
0.96 -0.2 0.75 ± 2% perf-profile.self.cycles-pp.__mod_node_page_state
7.76 -0.2 7.60 perf-profile.self.cycles-pp.native_irq_return_iret
0.59 -0.1 0.45 perf-profile.self.cycles-pp.page_fault
0.60 -0.1 0.47 perf-profile.self.cycles-pp.pmd_devmap_trans_unstable
0.49 -0.1 0.36 ± 3% perf-profile.self.cycles-pp.ftrace_graph_caller
0.52 ± 2% -0.1 0.40 perf-profile.self.cycles-pp.file_update_time
0.47 ± 2% -0.1 0.35 ± 4% perf-profile.self.cycles-pp.xas_start
0.23 ± 10% -0.1 0.12 ± 5% perf-profile.self.cycles-pp.retint_user
0.42 -0.1 0.32 perf-profile.self.cycles-pp.__x86_indirect_thunk_rax
0.69 -0.1 0.59 perf-profile.self.cycles-pp.___might_sleep
0.35 -0.1 0.26 perf-profile.self.cycles-pp.prepare_ftrace_return
0.32 -0.1 0.23 perf-profile.self.cycles-pp.do_page_fault
0.26 ± 19% -0.1 0.17 ± 2% perf-profile.self.cycles-pp.mem_cgroup_from_task
0.34 ± 2% -0.1 0.25 perf-profile.self.cycles-pp.ftrace_lookup_ip
0.34 -0.1 0.26 perf-profile.self.cycles-pp.mark_page_accessed
0.27 -0.1 0.20 ± 2% perf-profile.self.cycles-pp.fault_dirty_shared_page
0.21 ± 3% -0.1 0.14 ± 3% perf-profile.self.cycles-pp.__tlb_remove_page_size
0.55 -0.1 0.48 ± 2% perf-profile.self.cycles-pp.release_pages
0.25 ± 2% -0.1 0.18 ± 2% perf-profile.self.cycles-pp.__might_sleep
0.32 ± 2% -0.1 0.26 ± 4% perf-profile.self.cycles-pp.current_time
0.21 ± 2% -0.1 0.15 ± 2% perf-profile.self.cycles-pp._cond_resched
0.26 -0.1 0.20 ± 3% perf-profile.self.cycles-pp.__do_fault
0.24 ± 4% -0.1 0.18 ± 2% perf-profile.self.cycles-pp.finish_fault
0.22 -0.1 0.16 ± 4% perf-profile.self.cycles-pp.ring_buffer_lock_reserve
0.18 -0.1 0.12 ± 4% perf-profile.self.cycles-pp._vm_normal_page
0.22 -0.0 0.17 ± 2% perf-profile.self.cycles-pp.fpregs_assert_state_consistent
0.16 ± 2% -0.0 0.11 ± 4% perf-profile.self.cycles-pp.pmd_pfn
0.20 ± 2% -0.0 0.15 perf-profile.self.cycles-pp.free_pages_and_swap_cache
0.20 ± 2% -0.0 0.15 ± 4% perf-profile.self.cycles-pp.pmd_page_vaddr
0.18 ± 2% -0.0 0.14 ± 3% perf-profile.self.cycles-pp.perf_exclude_event
0.16 ± 2% -0.0 0.12 ± 5% perf-profile.self.cycles-pp.page_rmapping
0.13 ± 3% -0.0 0.09 perf-profile.self.cycles-pp.native_iret
0.15 ± 2% -0.0 0.11 perf-profile.self.cycles-pp.__trace_graph_entry
0.12 ± 3% -0.0 0.09 ± 4% perf-profile.self.cycles-pp.PageHuge
0.11 ± 4% -0.0 0.08 perf-profile.self.cycles-pp.trace_buffer_lock_reserve
0.10 -0.0 0.08 ± 5% perf-profile.self.cycles-pp.trace_clock_local
0.10 -0.0 0.08 ± 5% perf-profile.self.cycles-pp.perf_swevent_event
0.13 ± 3% -0.0 0.11 perf-profile.self.cycles-pp.rcu_all_qs
0.12 ± 3% +0.0 0.17 perf-profile.self.cycles-pp.find_vma
3.94 +0.1 4.03 perf-profile.self.cycles-pp.trace_graph_entry
0.53 +0.1 0.62 perf-profile.self.cycles-pp.unlock_page
0.11 ± 4% +0.1 0.25 ± 3% perf-profile.self.cycles-pp.__unlock_page_memcg
1.05 +0.1 1.19 ± 2% perf-profile.self.cycles-pp.page_remove_rmap
0.71 +0.3 0.99 perf-profile.self.cycles-pp._raw_spin_lock
0.88 ± 2% +0.4 1.28 perf-profile.self.cycles-pp.find_lock_entry
0.59 +0.4 1.01 ± 2% perf-profile.self.cycles-pp.up_read
0.59 +1.1 1.74 perf-profile.self.cycles-pp.down_read_trylock
0.31 +3.2 3.46 perf-profile.self.cycles-pp.lock_page_memcg
0.52 +3.7 4.25 perf-profile.self.cycles-pp.__count_memcg_events
0.97 +8.0 8.96 ± 2% perf-profile.self.cycles-pp.__mod_memcg_state
871.50 ± 42% -49.3% 441.75 ± 48% interrupts.32:PCI-MSI.26738689-edge.eth0-TxRx-0
1306534 ± 7% -11.5% 1156735 interrupts.CAL:Function_call_interrupts
21692 ± 5% -99.9% 24.75 ± 78% interrupts.CPU0.TRM:Thermal_event_interrupts
6825 ± 7% -12.1% 5997 interrupts.CPU1.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU1.TRM:Thermal_event_interrupts
6794 ± 7% -11.0% 6047 interrupts.CPU10.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU10.TRM:Thermal_event_interrupts
6847 ± 7% -12.1% 6017 interrupts.CPU100.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU100.TRM:Thermal_event_interrupts
6736 ± 8% -10.7% 6018 interrupts.CPU101.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU101.TRM:Thermal_event_interrupts
6871 ± 7% -11.8% 6059 interrupts.CPU102.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU102.TRM:Thermal_event_interrupts
6849 ± 7% -11.7% 6047 interrupts.CPU103.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU103.TRM:Thermal_event_interrupts
6847 ± 7% -11.5% 6060 interrupts.CPU104.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU104.TRM:Thermal_event_interrupts
6869 ± 7% -11.8% 6058 interrupts.CPU105.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU105.TRM:Thermal_event_interrupts
6820 ± 7% -11.0% 6071 interrupts.CPU106.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU106.TRM:Thermal_event_interrupts
6868 ± 7% -11.6% 6074 interrupts.CPU107.CAL:Function_call_interrupts
21693 ± 5% -99.9% 14.25 ±108% interrupts.CPU107.TRM:Thermal_event_interrupts
6787 ± 7% -10.6% 6070 interrupts.CPU108.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU108.TRM:Thermal_event_interrupts
6830 ± 7% -11.2% 6065 interrupts.CPU109.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU109.TRM:Thermal_event_interrupts
6836 ± 8% -11.4% 6055 interrupts.CPU11.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU11.TRM:Thermal_event_interrupts
6809 ± 7% -11.1% 6054 interrupts.CPU110.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU110.TRM:Thermal_event_interrupts
6808 ± 7% -11.1% 6054 interrupts.CPU111.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU111.TRM:Thermal_event_interrupts
6819 ± 7% -11.2% 6054 interrupts.CPU112.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU112.TRM:Thermal_event_interrupts
6806 ± 7% -11.0% 6055 interrupts.CPU113.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU113.TRM:Thermal_event_interrupts
6793 ± 7% -17.1% 5631 ± 13% interrupts.CPU114.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU114.TRM:Thermal_event_interrupts
6802 ± 7% -11.6% 6014 interrupts.CPU115.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU115.TRM:Thermal_event_interrupts
6791 ± 7% -10.8% 6060 interrupts.CPU116.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU116.TRM:Thermal_event_interrupts
6801 ± 7% -11.0% 6054 interrupts.CPU117.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU117.TRM:Thermal_event_interrupts
6842 ± 7% -11.5% 6057 interrupts.CPU118.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU118.TRM:Thermal_event_interrupts
6813 ± 7% -10.9% 6067 interrupts.CPU119.CAL:Function_call_interrupts
12.50 ± 66% +1090.0% 148.75 ± 74% interrupts.CPU119.RES:Rescheduling_interrupts
21692 ± 5% -99.9% 24.75 ± 78% interrupts.CPU119.TRM:Thermal_event_interrupts
6823 ± 8% -11.4% 6048 interrupts.CPU12.CAL:Function_call_interrupts
533.00 ± 77% +171.0% 1444 ± 36% interrupts.CPU12.RES:Rescheduling_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU12.TRM:Thermal_event_interrupts
6733 ± 7% -10.1% 6055 interrupts.CPU120.CAL:Function_call_interrupts
6815 ± 7% -11.1% 6061 interrupts.CPU121.CAL:Function_call_interrupts
6817 ± 7% -10.9% 6072 interrupts.CPU122.CAL:Function_call_interrupts
6806 ± 7% -10.8% 6070 interrupts.CPU123.CAL:Function_call_interrupts
6829 ± 7% -11.0% 6075 interrupts.CPU124.CAL:Function_call_interrupts
6842 ± 6% -11.3% 6070 interrupts.CPU125.CAL:Function_call_interrupts
6847 ± 7% -11.5% 6059 interrupts.CPU126.CAL:Function_call_interrupts
6844 ± 7% -11.4% 6065 interrupts.CPU127.CAL:Function_call_interrupts
6839 ± 7% -11.4% 6056 interrupts.CPU128.CAL:Function_call_interrupts
6820 ± 6% -11.2% 6054 interrupts.CPU129.CAL:Function_call_interrupts
3852 ± 34% +61.3% 6214 interrupts.CPU129.NMI:Non-maskable_interrupts
3852 ± 34% +61.3% 6214 interrupts.CPU129.PMI:Performance_monitoring_interrupts
122.75 ±121% -97.6% 3.00 ± 23% interrupts.CPU129.RES:Rescheduling_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU13.TRM:Thermal_event_interrupts
6845 ± 7% -11.5% 6058 interrupts.CPU130.CAL:Function_call_interrupts
6835 ± 7% -11.5% 6046 interrupts.CPU131.CAL:Function_call_interrupts
6826 ± 7% -11.4% 6047 interrupts.CPU132.CAL:Function_call_interrupts
6839 ± 7% -11.5% 6050 interrupts.CPU133.CAL:Function_call_interrupts
6832 ± 7% -11.3% 6057 interrupts.CPU134.CAL:Function_call_interrupts
6835 ± 7% -11.5% 6051 interrupts.CPU135.CAL:Function_call_interrupts
6813 ± 7% -11.4% 6037 interrupts.CPU137.CAL:Function_call_interrupts
6846 ± 7% -11.8% 6035 interrupts.CPU138.CAL:Function_call_interrupts
6823 ± 7% -11.4% 6048 interrupts.CPU139.CAL:Function_call_interrupts
6852 ± 8% -11.7% 6050 interrupts.CPU14.CAL:Function_call_interrupts
245.50 ± 74% +162.5% 644.50 ± 65% interrupts.CPU14.RES:Rescheduling_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU14.TRM:Thermal_event_interrupts
6787 ± 7% -10.9% 6044 interrupts.CPU140.CAL:Function_call_interrupts
6802 ± 7% -11.2% 6042 interrupts.CPU141.CAL:Function_call_interrupts
6839 ± 7% -11.6% 6046 interrupts.CPU142.CAL:Function_call_interrupts
6823 ± 7% -11.3% 6051 interrupts.CPU143.CAL:Function_call_interrupts
6693 ± 8% -11.1% 5954 ± 2% interrupts.CPU144.CAL:Function_call_interrupts
6807 ± 7% -11.5% 6025 interrupts.CPU145.CAL:Function_call_interrupts
6825 ± 7% -11.4% 6045 interrupts.CPU147.CAL:Function_call_interrupts
6861 ± 7% -11.9% 6042 interrupts.CPU148.CAL:Function_call_interrupts
6855 ± 7% -11.8% 6050 interrupts.CPU149.CAL:Function_call_interrupts
6846 ± 8% -11.8% 6037 interrupts.CPU15.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU15.TRM:Thermal_event_interrupts
6832 ± 7% -11.5% 6046 interrupts.CPU150.CAL:Function_call_interrupts
6651 ± 6% -9.3% 6034 interrupts.CPU151.CAL:Function_call_interrupts
6860 ± 7% -12.0% 6034 interrupts.CPU153.CAL:Function_call_interrupts
6843 ± 7% -11.8% 6034 interrupts.CPU154.CAL:Function_call_interrupts
6865 ± 7% -11.9% 6050 interrupts.CPU155.CAL:Function_call_interrupts
6862 ± 7% -12.2% 6026 interrupts.CPU156.CAL:Function_call_interrupts
6882 ± 8% -12.3% 6035 interrupts.CPU157.CAL:Function_call_interrupts
6854 ± 7% -12.1% 6022 interrupts.CPU158.CAL:Function_call_interrupts
6856 ± 7% -11.8% 6046 interrupts.CPU159.CAL:Function_call_interrupts
6838 ± 8% -11.8% 6033 interrupts.CPU16.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU16.TRM:Thermal_event_interrupts
6860 ± 7% -11.9% 6044 interrupts.CPU160.CAL:Function_call_interrupts
6849 ± 7% -11.8% 6042 interrupts.CPU161.CAL:Function_call_interrupts
6836 ± 7% -11.4% 6054 interrupts.CPU162.CAL:Function_call_interrupts
6850 ± 7% -13.6% 5917 ± 3% interrupts.CPU163.CAL:Function_call_interrupts
6840 ± 7% -11.5% 6056 interrupts.CPU164.CAL:Function_call_interrupts
6843 ± 7% -11.7% 6043 interrupts.CPU165.CAL:Function_call_interrupts
6858 ± 7% -11.7% 6053 interrupts.CPU166.CAL:Function_call_interrupts
6855 ± 7% -11.8% 6043 interrupts.CPU167.CAL:Function_call_interrupts
6855 ± 7% -11.1% 6095 interrupts.CPU169.CAL:Function_call_interrupts
6826 ± 8% -11.3% 6056 interrupts.CPU17.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU17.TRM:Thermal_event_interrupts
6866 ± 7% -15.8% 5782 ± 8% interrupts.CPU170.CAL:Function_call_interrupts
6867 ± 7% -11.9% 6049 interrupts.CPU171.CAL:Function_call_interrupts
6866 ± 7% -11.9% 6052 interrupts.CPU172.CAL:Function_call_interrupts
6869 ± 7% -11.9% 6053 interrupts.CPU173.CAL:Function_call_interrupts
6884 ± 7% -12.1% 6051 interrupts.CPU174.CAL:Function_call_interrupts
6858 ± 7% -13.0% 5963 ± 3% interrupts.CPU175.CAL:Function_call_interrupts
6818 ± 7% -11.2% 6056 interrupts.CPU176.CAL:Function_call_interrupts
6887 ± 7% -12.2% 6049 interrupts.CPU177.CAL:Function_call_interrupts
6849 ± 7% -11.7% 6047 interrupts.CPU179.CAL:Function_call_interrupts
6855 ± 7% -14.3% 5877 ± 4% interrupts.CPU18.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU18.TRM:Thermal_event_interrupts
6879 ± 8% -12.1% 6044 interrupts.CPU181.CAL:Function_call_interrupts
6879 ± 8% -12.0% 6054 interrupts.CPU182.CAL:Function_call_interrupts
6915 ± 7% -12.6% 6046 interrupts.CPU183.CAL:Function_call_interrupts
6892 ± 8% -12.4% 6039 interrupts.CPU184.CAL:Function_call_interrupts
6877 ± 7% -12.1% 6042 interrupts.CPU185.CAL:Function_call_interrupts
6887 ± 8% -12.5% 6024 interrupts.CPU186.CAL:Function_call_interrupts
6851 ± 8% -11.9% 6038 interrupts.CPU187.CAL:Function_call_interrupts
6855 ± 7% -11.7% 6053 interrupts.CPU188.CAL:Function_call_interrupts
6882 ± 7% -14.5% 5887 ± 5% interrupts.CPU189.CAL:Function_call_interrupts
6843 ± 8% -19.3% 5520 ± 16% interrupts.CPU19.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU19.TRM:Thermal_event_interrupts
6852 ± 7% -11.6% 6060 interrupts.CPU190.CAL:Function_call_interrupts
6876 ± 7% -11.9% 6056 interrupts.CPU191.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU2.TRM:Thermal_event_interrupts
6848 ± 8% -12.0% 6030 interrupts.CPU20.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU20.TRM:Thermal_event_interrupts
6820 ± 8% -11.3% 6046 interrupts.CPU21.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU21.TRM:Thermal_event_interrupts
6824 ± 8% -11.3% 6050 interrupts.CPU22.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU22.TRM:Thermal_event_interrupts
6828 ± 8% -11.4% 6047 interrupts.CPU23.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU23.TRM:Thermal_event_interrupts
6856 ± 7% -13.2% 5951 ± 2% interrupts.CPU24.CAL:Function_call_interrupts
6817 ± 7% -11.3% 6046 interrupts.CPU25.CAL:Function_call_interrupts
6840 ± 7% -11.7% 6039 interrupts.CPU26.CAL:Function_call_interrupts
6824 ± 7% -12.9% 5946 ± 2% interrupts.CPU27.CAL:Function_call_interrupts
137.00 ± 74% -71.7% 38.75 ±150% interrupts.CPU27.RES:Rescheduling_interrupts
6851 ± 7% -12.2% 6018 interrupts.CPU28.CAL:Function_call_interrupts
6833 ± 7% -11.3% 6062 interrupts.CPU29.CAL:Function_call_interrupts
97.50 ± 92% -92.1% 7.75 ± 69% interrupts.CPU29.RES:Rescheduling_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU3.TRM:Thermal_event_interrupts
6814 ± 8% -11.2% 6052 interrupts.CPU30.CAL:Function_call_interrupts
234.00 ±103% -82.2% 41.75 ±120% interrupts.CPU30.RES:Rescheduling_interrupts
6811 ± 8% -11.1% 6052 interrupts.CPU31.CAL:Function_call_interrupts
6813 ± 8% -11.1% 6059 interrupts.CPU32.CAL:Function_call_interrupts
6808 ± 8% -11.0% 6059 interrupts.CPU33.CAL:Function_call_interrupts
6842 ± 8% -11.4% 6062 interrupts.CPU34.CAL:Function_call_interrupts
6823 ± 8% -11.3% 6049 interrupts.CPU35.CAL:Function_call_interrupts
6838 ± 8% -10.9% 6092 interrupts.CPU36.CAL:Function_call_interrupts
194.50 ± 41% -80.3% 38.25 ± 93% interrupts.CPU36.RES:Rescheduling_interrupts
6835 ± 8% -11.4% 6057 interrupts.CPU37.CAL:Function_call_interrupts
6802 ± 7% -11.0% 6052 interrupts.CPU38.CAL:Function_call_interrupts
211.25 ± 74% -89.7% 21.75 ± 82% interrupts.CPU38.RES:Rescheduling_interrupts
6812 ± 7% -11.1% 6055 interrupts.CPU39.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU4.TRM:Thermal_event_interrupts
6791 ± 7% -10.8% 6057 interrupts.CPU40.CAL:Function_call_interrupts
112.25 ± 21% -75.1% 28.00 ± 82% interrupts.CPU40.RES:Rescheduling_interrupts
6798 ± 7% -11.0% 6048 interrupts.CPU42.CAL:Function_call_interrupts
6829 ± 7% -11.3% 6057 interrupts.CPU44.CAL:Function_call_interrupts
586.50 ±141% -93.6% 37.50 ± 71% interrupts.CPU45.RES:Rescheduling_interrupts
6816 ± 7% -11.2% 6051 interrupts.CPU46.CAL:Function_call_interrupts
6833 ± 7% -11.4% 6055 interrupts.CPU47.CAL:Function_call_interrupts
493.00 ± 87% -92.8% 35.50 ± 92% interrupts.CPU48.RES:Rescheduling_interrupts
6800 ± 7% -23.1% 5230 ± 23% interrupts.CPU49.CAL:Function_call_interrupts
6712 ± 8% -9.8% 6057 interrupts.CPU5.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU5.TRM:Thermal_event_interrupts
6858 ± 8% -11.6% 6066 interrupts.CPU53.CAL:Function_call_interrupts
6839 ± 8% -11.4% 6060 interrupts.CPU54.CAL:Function_call_interrupts
6838 ± 8% -11.3% 6063 interrupts.CPU55.CAL:Function_call_interrupts
6825 ± 7% -11.2% 6060 interrupts.CPU56.CAL:Function_call_interrupts
6866 ± 7% -11.7% 6061 interrupts.CPU58.CAL:Function_call_interrupts
6846 ± 8% -11.5% 6062 interrupts.CPU59.CAL:Function_call_interrupts
6852 ± 7% -12.3% 6007 interrupts.CPU6.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU6.TRM:Thermal_event_interrupts
6847 ± 8% -11.5% 6062 interrupts.CPU60.CAL:Function_call_interrupts
6842 ± 7% -11.4% 6061 interrupts.CPU63.CAL:Function_call_interrupts
6870 ± 7% -11.8% 6058 interrupts.CPU64.CAL:Function_call_interrupts
6819 ± 7% -18.7% 5542 ± 16% interrupts.CPU67.CAL:Function_call_interrupts
6819 ± 7% -11.2% 6054 interrupts.CPU68.CAL:Function_call_interrupts
6862 ± 6% -11.7% 6062 interrupts.CPU69.CAL:Function_call_interrupts
6812 ± 7% -19.8% 5466 ± 18% interrupts.CPU7.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU7.TRM:Thermal_event_interrupts
6836 ± 7% -11.3% 6064 interrupts.CPU70.CAL:Function_call_interrupts
6814 ± 7% -11.1% 6060 interrupts.CPU71.CAL:Function_call_interrupts
6816 ± 8% -11.1% 6061 interrupts.CPU72.CAL:Function_call_interrupts
6843 ± 8% -11.4% 6062 interrupts.CPU73.CAL:Function_call_interrupts
6826 ± 8% -14.7% 5820 ± 7% interrupts.CPU74.CAL:Function_call_interrupts
6820 ± 8% -11.2% 6054 interrupts.CPU75.CAL:Function_call_interrupts
164.50 ± 34% +261.2% 594.25 ± 35% interrupts.CPU75.RES:Rescheduling_interrupts
6812 ± 8% -11.1% 6053 interrupts.CPU76.CAL:Function_call_interrupts
6814 ± 8% -11.3% 6045 interrupts.CPU77.CAL:Function_call_interrupts
6818 ± 8% -11.2% 6058 interrupts.CPU78.CAL:Function_call_interrupts
6791 ± 8% -10.8% 6057 interrupts.CPU79.CAL:Function_call_interrupts
871.50 ± 42% -49.3% 441.75 ± 48% interrupts.CPU8.32:PCI-MSI.26738689-edge.eth0-TxRx-0
6729 ± 8% -10.1% 6050 interrupts.CPU8.CAL:Function_call_interrupts
316.50 ± 89% +444.2% 1722 ± 67% interrupts.CPU8.RES:Rescheduling_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU8.TRM:Thermal_event_interrupts
6814 ± 8% -11.1% 6058 interrupts.CPU80.CAL:Function_call_interrupts
6826 ± 7% -11.1% 6069 interrupts.CPU81.CAL:Function_call_interrupts
6832 ± 7% -11.3% 6062 interrupts.CPU88.CAL:Function_call_interrupts
6860 ± 7% -11.8% 6050 interrupts.CPU89.CAL:Function_call_interrupts
6820 ± 7% -10.6% 6097 interrupts.CPU9.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU9.TRM:Thermal_event_interrupts
6843 ± 7% -11.4% 6062 interrupts.CPU90.CAL:Function_call_interrupts
6803 ± 7% -10.8% 6066 interrupts.CPU91.CAL:Function_call_interrupts
6877 ± 8% -11.8% 6066 interrupts.CPU92.CAL:Function_call_interrupts
6880 ± 7% -17.8% 5658 ± 12% interrupts.CPU93.CAL:Function_call_interrupts
6858 ± 7% -11.7% 6058 interrupts.CPU94.CAL:Function_call_interrupts
367.50 ± 77% -58.2% 153.50 ±119% interrupts.CPU94.RES:Rescheduling_interrupts
6813 ± 7% -11.7% 6018 interrupts.CPU95.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU96.TRM:Thermal_event_interrupts
6855 ± 7% -12.2% 6017 interrupts.CPU97.CAL:Function_call_interrupts
113.25 ±105% -86.5% 15.25 ± 57% interrupts.CPU97.RES:Rescheduling_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU97.TRM:Thermal_event_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU98.TRM:Thermal_event_interrupts
6845 ± 7% -11.6% 6049 interrupts.CPU99.CAL:Function_call_interrupts
21693 ± 5% -99.9% 24.75 ± 78% interrupts.CPU99.TRM:Thermal_event_interrupts
52257 ± 4% +12.0% 58541 ± 6% interrupts.RES:Rescheduling_interrupts
1041285 ± 5% -99.9% 1177 ± 77% interrupts.TRM:Thermal_event_interrupts



irq_exception_noise.__do_page_fault.70th

1.5 +-+------------------------------------------------------------------+
| O O O |
1.45 +-+ O OO O O |
1.4 OO+O OO O O O O OO O O O |
| O O O O O O |
1.35 +-+ O OO OO |
1.3 +-+ OO OO |
| |
1.25 +-+ |
1.2 +-+ |
| |
1.15 +-+ .++ .+ |
1.1 +-+++++.++ .++++. +++ +.++++.+++.+ ++.++ +.+++.++++.+++.++++ + |
|+ + + + + ++. |
1.05 +-+------------------------------------------------------------------+


irq_exception_noise.__do_page_fault.80th

1.8 +-+-------------------------------------------------------------------+
| |
1.7 +-+ O O O |
OO O O O OO OOOO O O O O O |
1.6 +-+ O OO O OO OO O O OOOO O O |
| O O |
1.5 +-+ |
| |
1.4 +-+ |
| |
1.3 +-+ |
| |
1.2 +-+ ++. |
|+.++++.+++.+++.++++.+ +++.++++.+++.++++.+++.+++.++++.+++.+++.++ |
1.1 +-+-------------------------------------------------------------------+


irq_exception_noise.__do_page_fault.90th

2.2 +-+-O-----------------------------------------------------------------+
2.1 +-+ O O |
OO O O O OO OOOO O O O O O |
2 +-+ O OO O OO OO O O OOOO OOO O |
1.9 +-+ |
1.8 +-+ |
1.7 +-+ |
| |
1.6 +-+ |
1.5 +-+ |
1.4 +-+ |
1.3 +-+ |
|+.++++.+++.+++.++++.+++.+++.++++.+++.++++.+++.+++.++++.+++.+++.++ |
1.2 +-+ ++.+|
1.1 +-+-------------------------------------------------------------------+


irq_exception_noise.__do_page_fault.95th

3 +-+-------------------------------------------------------------------+
| O |
2.8 +-+ O O |
2.6 O-+O O O O O O O O |
|O OO O O OOOO O O O O O O O O OO |
2.4 +-+ O O OO O O |
2.2 +-+ |
| |
2 +-+ |
1.8 +-+ |
| |
1.6 +-+ |
1.4 +-+ + |
|+.++++.+++.+++.++++.+ +.+++.++++.+++.++++.+++.+++.++++.+++.+++.++++.+|
1.2 +-+-------------------------------------------------------------------+


irq_exception_noise.__do_page_fault.99th

6 +-+-------------------------------------------------------------------+
| O |
5.5 +-+ O |
5 +-+ O O O OO O |
O O OO O O OO O O O O O |
4.5 +-+ O O O O O O O O O |
4 +O+ O O O O |
| O O |
3.5 +-+ |
3 +-+ |
| |
2.5 +-+ |
2 +-+ |
| ++.+ +. ++ ++. +++.+++.+++.++ |
1.5 +-+-------------------------------------------------------------------+


irq_exception_noise.__do_page_fault.avg

1.5 +-+-O---------------------O-O----------------------------------------+
|O O OO O |
1.45 O-+ OO OO O OO O O |
1.4 +-+O OO O O O O O O O |
| O OO OOOO OO |
1.35 +-+ |
1.3 +-+ |
| |
1.25 +-+ |
1.2 +-+ |
| + |
1.15 +-+ + + + .+ +.++++.++ |
1.1 +-+ +++.++ +++. ++ +.++++.+++.+ +.++ .+++.++ + + : |
|+.+ +.+ ++ ++ ++ ++. |
1.05 +-+------------------------------------------------------------------+


irq_exception_noise.__do_page_fault.sum

300000 +-+----------------------O-O---------------------------------------+
|O O O O O |
290000 O-+ O O OOO O O O O O |
280000 +-O OO OO OO OO O |
| O OOO OOOO O |
270000 +-+ |
260000 +-+ |
| |
250000 +-+ |
240000 +-+ |
| + |
230000 +-+ :+. + .+ ++.+++ |
220000 +-+.++++.+ +.++ .++ ++++.++++.+ ++.+ +.++++.++ + + : |
|++ ++ ++ ++ ++ +.+ |
210000 +-+----------------------------------------------------------------+


will-it-scale.per_process_ops

460000 +-+----------------------------------------------------------------+
|++.++++.++++.++++.++++.++++.++++.+++++.++++.+ +.++++.++++.+++ |
440000 +-+ |
| |
420000 +-+ |
| |
400000 +-+ |
| |
380000 +-+ |
| |
360000 +-+ |
| OO O O OO |
340000 OOO OO O O OOOO OOOO OOOO OO O OOO OOOO O |
| O |
320000 +-+----------------------------------------------------------------+


will-it-scale.workload

9e+07 +-+---------------------------------------------------------------+
|++.++++.+++++.++++.+ ++.++++.+++++.++++.+++++.+ +.++|
8.5e+07 +-+ ++ +++.+++++.+++ |
| |
| |
8e+07 +-+ |
| |
7.5e+07 +-+ |
| |
7e+07 +-+ |
| |
|O OO OOOO OOOO OOO O OOO O O O |
6.5e+07 O-O OO O O O OOOOO O O OO |
| |
6e+07 +-+---------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen


Attachments:
config-5.1.0-10831-g42a3003 (193.26 kB)
job-script (7.65 kB)
job.yaml (5.38 kB)
reproduce (326.00 B)

2019-05-20 21:55:44

by Johannes Weiner

Subject: Re: [mm] 42a3003535: will-it-scale.per_process_ops -25.9% regression

Hello,

On Mon, May 20, 2019 at 02:35:34PM +0800, kernel test robot wrote:
> Greetings,
>
> FYI, we noticed a -25.9% regression of will-it-scale.per_process_ops due to commit:
>
>
> commit: 42a300353577ccc17ecc627b8570a89fa1678bec ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: will-it-scale
> on test machine: 192 threads Skylake-SP with 256G memory
> with the following parameters:

Ouch. That has to be the additional cache footprint of the split
local/recursive stat counters, rather than the extra instructions.
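
The write-side idiom that keeps these updates cheap is to accumulate into
a counter private to the CPU and only fold the delta into the shared
atomic once it crosses a batch threshold, so the shared cache line is
dirtied rarely instead of on every update. A minimal userspace analogue
(not the kernel code; the names and batch size here are made up):

#include <stdatomic.h>
#include <stdio.h>

#define BATCH 64                        /* stand-in for MEMCG_CHARGE_BATCH */

static _Atomic long shared_stat;        /* bounces between CPUs when written */
static _Thread_local long local_delta;  /* private, stays in the local cache */

static void mod_stat(long val)
{
        local_delta += val;
        if (local_delta > BATCH || local_delta < -BATCH) {
                atomic_fetch_add(&shared_stat, local_delta);
                local_delta = 0;
        }
}

int main(void)
{
        for (int i = 0; i < 1000; i++)
                mod_stat(1);
        atomic_fetch_add(&shared_stat, local_delta);    /* flush the remainder */
        printf("%ld\n", atomic_load(&shared_stat));
        return 0;
}

With the split counters, each batch spill writes the group's local atomic
plus every ancestor's atomics, which multiplies the number of shared cache
lines dirtied per spill.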

Could you please try re-running the test on that host with the below
patch applied?

Also CCing Aaron Lu, who has previously investigated the cache layout
in the struct memcg stat counters.

> nr_task: 100%
> mode: process
> test: page_fault3
> cpufreq_governor: performance

From c9ed8f78dfa25c4d29adf5a09cf9adeeb43e8bdd Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Mon, 20 May 2019 14:18:26 -0400
Subject: [PATCH] mm: memcontrol: don't batch updates of local VM stats and
events

The kernel test robot noticed a 26% will-it-scale pagefault regression
from commit 42a300353577 ("mm: memcontrol: fix recursive statistics
correctness & scalabilty"). This appears to be caused by bouncing the
additional cachelines from the new hierarchical statistics counters.

We can fix this by getting rid of the batched local counters instead.

Originally, there were *only* group-local counters, and they were
fully maintained per cpu. A reader of a stats file high up in the
cgroup tree would have to walk the entire subtree and collect each
level's per-cpu counters to get the recursive view. This was
prohibitively expensive, and so we switched to per-cpu batched updates
of the local counters in commit a983b5ebee57 ("mm: memcontrol: fix
excessive complexity in memory.stat reporting"), reducing the
complexity from nr_subgroups * nr_cpus to nr_subgroups.

With growing machines and cgroup trees, the tree walk itself became
too expensive for monitoring top-level groups, and this is when the
culprit patch added hierarchy counters on each cgroup level. When the
per-cpu batch size would be reached, both the local and the hierarchy
counters would get batch-updated from the per-cpu delta simultaneously.

This makes local and hierarchical counter reads blazingly fast, but it
unfortunately makes the write side too cache-line intensive.

Since local counter reads were never a problem - we only centralized
them to accelerate the hierarchy walk - and use of the local counters
are becoming rarer due to replacement with hierarchical views (ongoing
rework in the page reclaim and workingset code), we can make those
local counters unbatched per-cpu counters again.

The scheme will then be as follows (see the sketch after the lists):

when a memcg statistic changes, the writer will:
- update the local counter (per-cpu)
- update the batch counter (per-cpu). If the batch is full:
- spill the batch into the group's atomic_t
- spill the batch into all ancestors' atomic_ts
- empty out the batch counter (per-cpu)

when a local memcg counter is read, the reader will:
- collect the local counter from all cpus

when a hierarchy memcg counter is read, the reader will:
- read the atomic_t
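
A minimal standalone sketch of this scheme, with the per-cpu storage
modelled as a plain array indexed by cpu, the cgroup tree as a parent
pointer, and an arbitrary batch size (not the kernel data structures):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS 4
#define BATCH   32                      /* stand-in for MEMCG_CHARGE_BATCH */

struct group {
        struct group *parent;
        long local[NR_CPUS];            /* "per-cpu" local counter, never batched */
        long batch[NR_CPUS];            /* "per-cpu" delta pending propagation */
        _Atomic long subtree;           /* hierarchical (recursive) counter */
};

static void mod_stat(struct group *g, int cpu, long val)
{
        g->local[cpu] += val;                   /* update the local counter (per-cpu) */

        long x = (g->batch[cpu] += val);        /* update the batch counter (per-cpu) */
        if (labs(x) > BATCH) {
                for (struct group *p = g; p; p = p->parent)
                        atomic_fetch_add(&p->subtree, x);  /* spill into group + ancestors */
                g->batch[cpu] = 0;              /* empty out the batch counter */
        }
}

static long read_local(struct group *g)         /* local read: collect from all cpus */
{
        long x = 0;

        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                x += g->local[cpu];
        return x;
}

static long read_subtree(struct group *g)       /* hierarchy read: a single atomic read */
{
        return atomic_load(&g->subtree);
}

int main(void)
{
        struct group root = { 0 };
        struct group child = { .parent = &root };

        for (int i = 0; i < 1000; i++)
                mod_stat(&child, i % NR_CPUS, 1);

        /* the subtree counter may lag by up to BATCH per cpu until the next spill */
        printf("local=%ld subtree=%ld\n", read_local(&child), read_subtree(&root));
        return 0;
}

The point of the asymmetry: local reads are rare enough to pay for the cpu
walk, hierarchy reads stay a single atomic read, and the write path stops
dirtying the local atomics entirely.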

We might be able to simplify this further and make the recursive
counters unbatched per-cpu counters as well (batch upward propagation,
but leave per-cpu collection to the readers), but that will require a
more in-depth analysis and testing of all the callsites. Deal with the
immediate regression for now.

Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 26 ++++++++++++++++--------
mm/memcontrol.c | 41 ++++++++++++++++++++++++++------------
2 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bc74d6a4407c..2d23ae7bd36d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -126,9 +126,12 @@ struct memcg_shrinker_map {
struct mem_cgroup_per_node {
struct lruvec lruvec;

+ /* Legacy local VM stats */
+ struct lruvec_stat __percpu *lruvec_stat_local;
+
+ /* Subtree VM stats (batched updates) */
struct lruvec_stat __percpu *lruvec_stat_cpu;
atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS];
- atomic_long_t lruvec_stat_local[NR_VM_NODE_STAT_ITEMS];

unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];

@@ -274,17 +277,18 @@ struct mem_cgroup {
atomic_t moving_account;
struct task_struct *move_lock_task;

- /* memory.stat */
+ /* Legacy local VM stats and events */
+ struct memcg_vmstats_percpu __percpu *vmstats_local;
+
+ /* Subtree VM stats and events (batched updates) */
struct memcg_vmstats_percpu __percpu *vmstats_percpu;

MEMCG_PADDING(_pad2_);

atomic_long_t vmstats[MEMCG_NR_STAT];
- atomic_long_t vmstats_local[MEMCG_NR_STAT];
-
atomic_long_t vmevents[NR_VM_EVENT_ITEMS];
- atomic_long_t vmevents_local[NR_VM_EVENT_ITEMS];

+ /* memory.events */
atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];

unsigned long socket_pressure;
@@ -576,7 +580,11 @@ static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
int idx)
{
- long x = atomic_long_read(&memcg->vmstats_local[idx]);
+ long x = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -650,13 +658,15 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
- long x;
+ long x = 0;
+ int cpu;

if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);

pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = atomic_long_read(&pn->lruvec_stat_local[idx]);
+ for_each_possible_cpu(cpu)
+ x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e50a2db5b4ff..8d42e5c7bf37 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -700,11 +700,12 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
if (mem_cgroup_disabled())
return;

+ __this_cpu_add(memcg->vmstats_local->stat[idx], val);
+
x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
struct mem_cgroup *mi;

- atomic_long_add(x, &memcg->vmstats_local[idx]);
for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
atomic_long_add(x, &mi->vmstats[idx]);
x = 0;
@@ -754,11 +755,12 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
__mod_memcg_state(memcg, idx, val);

/* Update lruvec */
+ __this_cpu_add(pn->lruvec_stat_local->count[idx], val);
+
x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
struct mem_cgroup_per_node *pi;

- atomic_long_add(x, &pn->lruvec_stat_local[idx]);
for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
atomic_long_add(x, &pi->lruvec_stat[idx]);
x = 0;
@@ -780,11 +782,12 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
if (mem_cgroup_disabled())
return;

+ __this_cpu_add(memcg->vmstats_local->events[idx], count);
+
x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
if (unlikely(x > MEMCG_CHARGE_BATCH)) {
struct mem_cgroup *mi;

- atomic_long_add(x, &memcg->vmevents_local[idx]);
for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
atomic_long_add(x, &mi->vmevents[idx]);
x = 0;
@@ -799,7 +802,12 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)

static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
{
- return atomic_long_read(&memcg->vmevents_local[event]);
+ long x = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ x += per_cpu(memcg->vmstats_local->events[event], cpu);
+ return x;
}

static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
@@ -2200,11 +2208,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
long x;

x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
- if (x) {
- atomic_long_add(x, &memcg->vmstats_local[i]);
+ if (x)
for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
atomic_long_add(x, &memcg->vmstats[i]);
- }

if (i >= NR_VM_NODE_STAT_ITEMS)
continue;
@@ -2214,12 +2220,10 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)

pn = mem_cgroup_nodeinfo(memcg, nid);
x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
- if (x) {
- atomic_long_add(x, &pn->lruvec_stat_local[i]);
+ if (x)
do {
atomic_long_add(x, &pn->lruvec_stat[i]);
} while ((pn = parent_nodeinfo(pn, nid)));
- }
}
}

@@ -2227,11 +2231,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
long x;

x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
- if (x) {
- atomic_long_add(x, &memcg->vmevents_local[i]);
+ if (x)
for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
atomic_long_add(x, &memcg->vmevents[i]);
- }
}
}

@@ -4492,8 +4494,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return 1;

+ pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
+ if (!pn->lruvec_stat_local) {
+ kfree(pn);
+ return 1;
+ }
+
pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
if (!pn->lruvec_stat_cpu) {
+ free_percpu(pn->lruvec_stat_local);
kfree(pn);
return 1;
}
@@ -4515,6 +4524,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
return;

free_percpu(pn->lruvec_stat_cpu);
+ free_percpu(pn->lruvec_stat_local);
kfree(pn);
}

@@ -4525,6 +4535,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
for_each_node(node)
free_mem_cgroup_per_node_info(memcg, node);
free_percpu(memcg->vmstats_percpu);
+ free_percpu(memcg->vmstats_local);
kfree(memcg);
}

@@ -4553,6 +4564,10 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (memcg->id.id < 0)
goto fail;

+ memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
+ if (!memcg->vmstats_local)
+ goto fail;
+
memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
if (!memcg->vmstats_percpu)
goto fail;
--
2.21.0


2019-05-21 13:48:10

by Chen, Rong A

[permalink] [raw]
Subject: Re: [LKP] [mm] 42a3003535: will-it-scale.per_process_ops -25.9% regression

On Mon, May 20, 2019 at 05:53:28PM -0400, Johannes Weiner wrote:
> Hello,
>
> On Mon, May 20, 2019 at 02:35:34PM +0800, kernel test robot wrote:
> > Greeting,
> >
> > FYI, we noticed a -25.9% regression of will-it-scale.per_process_ops due to commit:
> >
> >
> > commit: 42a300353577ccc17ecc627b8570a89fa1678bec ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > in testcase: will-it-scale
> > on test machine: 192 threads Skylake-SP with 256G memory
> > with following parameters:
>
> Ouch. That has to be the additional cache footprint of the split
> local/recursive stat counters, rather than the extra instructions.
>
> Could you please try re-running the test on that host with the below
> patch applied?

Hi,

The patch fixes the regression.

tests: 1
testcase/path_params/tbox_group/run: will-it-scale/performance-process-100%-page_fault3/lkp-skl-4sp1

db9adbcbe7 ("mm: memcontrol: move stat/event counting functions out-of-line")
8d8245997d ("mm: memcontrol: don't batch updates of local VM stats and events")

db9adbcbe740e098 8d8245997dbd17c5056094f15c
---------------- --------------------------
%stddev change %stddev
\ | \
87819982 85307742 will-it-scale.workload
457395 444310 will-it-scale.per_process_ops
7275 5% 7636 ± 5% boot-time.idle
122 120 turbostat.RAMWatt
388 392 time.voluntary_context_switches
61093 5% 64277 proc-vmstat.nr_slab_unreclaimable
60343904 58838096 proc-vmstat.pgalloc_normal
60301946 58797053 proc-vmstat.pgfree
60227822 58720057 proc-vmstat.numa_hit
60082905 58575049 proc-vmstat.numa_local
2.646e+10 2.57e+10 proc-vmstat.pgfault
94586813 ± 5% 368% 4.423e+08 perf-stat.i.iTLB-loads
94208530 ± 5% 367% 4.403e+08 perf-stat.ps.iTLB-loads
40821938 86% 75753326 perf-stat.i.node-loads
40664993 85% 75428605 perf-stat.ps.node-loads
1334 4% 1387 perf-stat.overall.instructions-per-iTLB-miss
1341 4% 1394 perf-stat.i.instructions-per-iTLB-miss
1414 4% 1464 perf-stat.overall.cycles-between-cache-misses
1435 3% 1482 perf-stat.i.cycles-between-cache-misses
1.65 1.69 perf-stat.overall.cpi
70.00 70.98 perf-stat.i.cache-miss-rate%
70.23 71.14 perf-stat.overall.cache-miss-rate%
7755 7695 perf-stat.ps.context-switches
3.44 3.40 perf-stat.i.dTLB-store-miss-rate%
4.045e+10 3.998e+10 perf-stat.i.dTLB-stores
7.381e+10 7.292e+10 perf-stat.i.dTLB-loads
4.028e+10 3.978e+10 perf-stat.ps.dTLB-stores
3.45 3.41 perf-stat.overall.dTLB-store-miss-rate%
7.351e+10 7.257e+10 perf-stat.ps.dTLB-loads
2.512e+11 2.47e+11 perf-stat.i.instructions
2.502e+11 2.458e+11 perf-stat.ps.instructions
7.618e+13 7.472e+13 perf-stat.total.instructions
7.18 7.04 perf-stat.overall.node-store-miss-rate%
0.60 0.59 perf-stat.i.ipc
0.61 0.59 perf-stat.overall.ipc
1.447e+09 1.412e+09 perf-stat.i.dTLB-store-misses
5.1e+10 4.971e+10 perf-stat.i.branch-instructions
1.441e+09 1.405e+09 perf-stat.ps.dTLB-store-misses
5.079e+10 4.947e+10 perf-stat.ps.branch-instructions
6885297 6705138 perf-stat.i.node-store-misses
1.66 1.62 perf-stat.overall.MPKI
6859094 6676984 perf-stat.ps.node-store-misses
86898473 84521835 perf-stat.ps.minor-faults
86899384 84522278 perf-stat.ps.page-faults
87236715 84845389 perf-stat.i.minor-faults
87237611 84846088 perf-stat.i.page-faults
4.024e+08 3.905e+08 perf-stat.i.branch-misses
2.932e+08 -3% 2.843e+08 perf-stat.i.cache-misses
4.011e+08 -3% 3.888e+08 perf-stat.ps.branch-misses
2.92e+08 -3% 2.829e+08 perf-stat.ps.cache-misses
4.174e+08 -4% 3.996e+08 perf-stat.i.cache-references
4.158e+08 -4% 3.977e+08 perf-stat.ps.cache-references
1.882e+08 -5% 1.779e+08 perf-stat.i.iTLB-load-misses
1.874e+08 -6% 1.771e+08 perf-stat.ps.iTLB-load-misses
1.27 ± 34% -40% 0.76 ± 17% perf-stat.i.node-load-miss-rate%
0.90 ± 9% -41% 0.53 perf-stat.overall.node-load-miss-rate%
68.74 -53% 32.36 perf-stat.i.iTLB-load-miss-rate%
66.57 -57% 28.69 perf-stat.overall.iTLB-load-miss-rate%
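
Taking the two will-it-scale rows at the top of the table, the residual slowdown
works out to roughly 3%, which is the figure quoted later in the thread:

  per_process_ops: (444310 - 457395) / 457395       = -2.9%
  workload:        (85307742 - 87819982) / 87819982 = -2.9%

versus the originally reported (339092 - 457395) / 457395 = -25.9%.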

Best Regards,
Rong Chen

>
> Also CCing Aaron Lu, who has previously investigated the cache layout
> in the struct memcg stat counters.
>
> > nr_task: 100%
> > mode: process
> > test: page_fault3
> > cpufreq_governor: performance
>
> From c9ed8f78dfa25c4d29adf5a09cf9adeeb43e8bdd Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Mon, 20 May 2019 14:18:26 -0400
> Subject: [PATCH] mm: memcontrol: don't batch updates of local VM stats and
> events
>
> The kernel test robot noticed a 26% will-it-scale pagefault regression
> from commit 42a300353577 ("mm: memcontrol: fix recursive statistics
> correctness & scalabilty"). This appears to be caused by bouncing the
> additional cachelines from the new hierarchical statistics counters.
>
> We can fix this by getting rid of the batched local counters instead.
>
> Originally, there were *only* group-local counters, and they were
> fully maintained per cpu. A reader of a stats file high up in the
> cgroup tree would have to walk the entire subtree and collect each
> level's per-cpu counters to get the recursive view. This was
> prohibitively expensive, and so we switched to per-cpu batched updates
> of the local counters during a983b5ebee57 ("mm: memcontrol: fix
> excessive complexity in memory.stat reporting"), reducing the
> complexity from nr_subgroups * nr_cpus to nr_subgroups.
>
> With growing machines and cgroup trees, the tree walk itself became
> too expensive for monitoring top-level groups, and this is when the
> culprit patch added hierarchy counters on each cgroup level. When the
> per-cpu batch size would be reached, both the local and the hierarchy
> counters would get batch-updated from the per-cpu delta simultaneously.
>
> This makes local and hierarchical counter reads blazingly fast, but it
> unfortunately makes the write-side too cache line intense.
>
> Since local counter reads were never a problem - we only centralized
> them to accelerate the hierarchy walk - and use of the local counters
> is becoming rarer due to replacement with hierarchical views (ongoing
> rework in the page reclaim and workingset code), we can make those
> local counters unbatched per-cpu counters again.
>
> The scheme will then be as such:
>
> when a memcg statistic changes, the writer will:
> - update the local counter (per-cpu)
> - update the batch counter (per-cpu). If the batch is full:
> - spill the batch into the group's atomic_t
> - spill the batch into all ancestors' atomic_ts
> - empty out the batch counter (per-cpu)
>
> when a local memcg counter is read, the reader will:
> - collect the local counter from all cpus
>
> when a hierarchy memcg counter is read, the reader will:
> - read the atomic_t
>
> We might be able to simplify this further and make the recursive
> counters unbatched per-cpu counters as well (batch upward propagation,
> but leave per-cpu collection to the readers), but that will require a
> more in-depth analysis and testing of all the callsites. Deal with the
> immediate regression for now.
>
> Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 26 ++++++++++++++++--------
> mm/memcontrol.c | 41 ++++++++++++++++++++++++++------------
> 2 files changed, 46 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bc74d6a4407c..2d23ae7bd36d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -126,9 +126,12 @@ struct memcg_shrinker_map {
> struct mem_cgroup_per_node {
> struct lruvec lruvec;
>
> + /* Legacy local VM stats */
> + struct lruvec_stat __percpu *lruvec_stat_local;
> +
> + /* Subtree VM stats (batched updates) */
> struct lruvec_stat __percpu *lruvec_stat_cpu;
> atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS];
> - atomic_long_t lruvec_stat_local[NR_VM_NODE_STAT_ITEMS];
>
> unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>
> @@ -274,17 +277,18 @@ struct mem_cgroup {
> atomic_t moving_account;
> struct task_struct *move_lock_task;
>
> - /* memory.stat */
> + /* Legacy local VM stats and events */
> + struct memcg_vmstats_percpu __percpu *vmstats_local;
> +
> + /* Subtree VM stats and events (batched updates) */
> struct memcg_vmstats_percpu __percpu *vmstats_percpu;
>
> MEMCG_PADDING(_pad2_);
>
> atomic_long_t vmstats[MEMCG_NR_STAT];
> - atomic_long_t vmstats_local[MEMCG_NR_STAT];
> -
> atomic_long_t vmevents[NR_VM_EVENT_ITEMS];
> - atomic_long_t vmevents_local[NR_VM_EVENT_ITEMS];
>
> + /* memory.events */
> atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
>
> unsigned long socket_pressure;
> @@ -576,7 +580,11 @@ static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
> static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
> int idx)
> {
> - long x = atomic_long_read(&memcg->vmstats_local[idx]);
> + long x = 0;
> + int cpu;
> +
> + for_each_possible_cpu(cpu)
> + x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -650,13 +658,15 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> enum node_stat_item idx)
> {
> struct mem_cgroup_per_node *pn;
> - long x;
> + long x = 0;
> + int cpu;
>
> if (mem_cgroup_disabled())
> return node_page_state(lruvec_pgdat(lruvec), idx);
>
> pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - x = atomic_long_read(&pn->lruvec_stat_local[idx]);
> + for_each_possible_cpu(cpu)
> + x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e50a2db5b4ff..8d42e5c7bf37 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -700,11 +700,12 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
> if (mem_cgroup_disabled())
> return;
>
> + __this_cpu_add(memcg->vmstats_local->stat[idx], val);
> +
> x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
> if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> struct mem_cgroup *mi;
>
> - atomic_long_add(x, &memcg->vmstats_local[idx]);
> for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> atomic_long_add(x, &mi->vmstats[idx]);
> x = 0;
> @@ -754,11 +755,12 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> __mod_memcg_state(memcg, idx, val);
>
> /* Update lruvec */
> + __this_cpu_add(pn->lruvec_stat_local->count[idx], val);
> +
> x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
> if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> struct mem_cgroup_per_node *pi;
>
> - atomic_long_add(x, &pn->lruvec_stat_local[idx]);
> for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
> atomic_long_add(x, &pi->lruvec_stat[idx]);
> x = 0;
> @@ -780,11 +782,12 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
> if (mem_cgroup_disabled())
> return;
>
> + __this_cpu_add(memcg->vmstats_local->events[idx], count);
> +
> x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
> if (unlikely(x > MEMCG_CHARGE_BATCH)) {
> struct mem_cgroup *mi;
>
> - atomic_long_add(x, &memcg->vmevents_local[idx]);
> for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> atomic_long_add(x, &mi->vmevents[idx]);
> x = 0;
> @@ -799,7 +802,12 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
>
> static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
> {
> - return atomic_long_read(&memcg->vmevents_local[event]);
> + long x = 0;
> + int cpu;
> +
> + for_each_possible_cpu(cpu)
> + x += per_cpu(memcg->vmstats_local->events[event], cpu);
> + return x;
> }
>
> static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
> @@ -2200,11 +2208,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
> long x;
>
> x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
> - if (x) {
> - atomic_long_add(x, &memcg->vmstats_local[i]);
> + if (x)
> for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> atomic_long_add(x, &memcg->vmstats[i]);
> - }
>
> if (i >= NR_VM_NODE_STAT_ITEMS)
> continue;
> @@ -2214,12 +2220,10 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>
> pn = mem_cgroup_nodeinfo(memcg, nid);
> x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
> - if (x) {
> - atomic_long_add(x, &pn->lruvec_stat_local[i]);
> + if (x)
> do {
> atomic_long_add(x, &pn->lruvec_stat[i]);
> } while ((pn = parent_nodeinfo(pn, nid)));
> - }
> }
> }
>
> @@ -2227,11 +2231,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
> long x;
>
> x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
> - if (x) {
> - atomic_long_add(x, &memcg->vmevents_local[i]);
> + if (x)
> for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> atomic_long_add(x, &memcg->vmevents[i]);
> - }
> }
> }
>
> @@ -4492,8 +4494,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> if (!pn)
> return 1;
>
> + pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> + if (!pn->lruvec_stat_local) {
> + kfree(pn);
> + return 1;
> + }
> +
> pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> if (!pn->lruvec_stat_cpu) {
> + free_percpu(pn->lruvec_stat_local);
> kfree(pn);
> return 1;
> }
> @@ -4515,6 +4524,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> return;
>
> free_percpu(pn->lruvec_stat_cpu);
> + free_percpu(pn->lruvec_stat_local);
> kfree(pn);
> }
>
> @@ -4525,6 +4535,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
> for_each_node(node)
> free_mem_cgroup_per_node_info(memcg, node);
> free_percpu(memcg->vmstats_percpu);
> + free_percpu(memcg->vmstats_local);
> kfree(memcg);
> }
>
> @@ -4553,6 +4564,10 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> if (memcg->id.id < 0)
> goto fail;
>
> + memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> + if (!memcg->vmstats_local)
> + goto fail;
> +
> memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> if (!memcg->vmstats_percpu)
> goto fail;
> --
> 2.21.0
>

2019-05-21 15:20:05

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH] mm: memcontrol: don't batch updates of local VM stats and events

The kernel test robot noticed a 26% will-it-scale pagefault regression
from commit 42a300353577 ("mm: memcontrol: fix recursive statistics
correctness & scalabilty"). This appears to be caused by bouncing the
additional cachelines from the new hierarchical statistics counters.

We can fix this by getting rid of the batched local counters instead.

Originally, there were *only* group-local counters, and they were
fully maintained per cpu. A reader of a stats file high up in the
cgroup tree would have to walk the entire subtree and collect each
level's per-cpu counters to get the recursive view. This was
prohibitively expensive, and so we switched to per-cpu batched updates
of the local counters during a983b5ebee57 ("mm: memcontrol: fix
excessive complexity in memory.stat reporting"), reducing the
complexity from nr_subgroups * nr_cpus to nr_subgroups.

With growing machines and cgroup trees, the tree walk itself became
too expensive for monitoring top-level groups, and this is when the
culprit patch added hierarchy counters on each cgroup level. When the
per-cpu batch size would be reached, both the local and the hierarchy
counters would get batch-updated from the per-cpu delta simultaneously.

This makes local and hierarchical counter reads blazingly fast, but it
unfortunately makes the write-side too cache line intense.

Since local counter reads were never a problem - we only centralized
them to accelerate the hierarchy walk - and use of the local counters
is becoming rarer due to replacement with hierarchical views (ongoing
rework in the page reclaim and workingset code), we can make those
local counters unbatched per-cpu counters again.

The scheme will then be as such:

when a memcg statistic changes, the writer will:
- update the local counter (per-cpu)
- update the batch counter (per-cpu). If the batch is full:
- spill the batch into the group's atomic_t
- spill the batch into all ancestors' atomic_ts
- empty out the batch counter (per-cpu)

when a local memcg counter is read, the reader will:
- collect the local counter from all cpus

when a hierarchy memcg counter is read, the reader will:
- read the atomic_t

We might be able to simplify this further and make the recursive
counters unbatched per-cpu counters as well (batch upward propagation,
but leave per-cpu collection to the readers), but that will require a
more in-depth analysis and testing of all the callsites. Deal with the
immediate regression for now.

Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Reported-by: kernel test robot <[email protected]>
Tested-by: kernel test robot <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 26 ++++++++++++++++--------
mm/memcontrol.c | 41 ++++++++++++++++++++++++++------------
2 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bc74d6a4407c..2d23ae7bd36d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -126,9 +126,12 @@ struct memcg_shrinker_map {
struct mem_cgroup_per_node {
struct lruvec lruvec;

+ /* Legacy local VM stats */
+ struct lruvec_stat __percpu *lruvec_stat_local;
+
+ /* Subtree VM stats (batched updates) */
struct lruvec_stat __percpu *lruvec_stat_cpu;
atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS];
- atomic_long_t lruvec_stat_local[NR_VM_NODE_STAT_ITEMS];

unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];

@@ -274,17 +277,18 @@ struct mem_cgroup {
atomic_t moving_account;
struct task_struct *move_lock_task;

- /* memory.stat */
+ /* Legacy local VM stats and events */
+ struct memcg_vmstats_percpu __percpu *vmstats_local;
+
+ /* Subtree VM stats and events (batched updates) */
struct memcg_vmstats_percpu __percpu *vmstats_percpu;

MEMCG_PADDING(_pad2_);

atomic_long_t vmstats[MEMCG_NR_STAT];
- atomic_long_t vmstats_local[MEMCG_NR_STAT];
-
atomic_long_t vmevents[NR_VM_EVENT_ITEMS];
- atomic_long_t vmevents_local[NR_VM_EVENT_ITEMS];

+ /* memory.events */
atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];

unsigned long socket_pressure;
@@ -576,7 +580,11 @@ static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
int idx)
{
- long x = atomic_long_read(&memcg->vmstats_local[idx]);
+ long x = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -650,13 +658,15 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
enum node_stat_item idx)
{
struct mem_cgroup_per_node *pn;
- long x;
+ long x = 0;
+ int cpu;

if (mem_cgroup_disabled())
return node_page_state(lruvec_pgdat(lruvec), idx);

pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = atomic_long_read(&pn->lruvec_stat_local[idx]);
+ for_each_possible_cpu(cpu)
+ x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e50a2db5b4ff..8d42e5c7bf37 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -700,11 +700,12 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
if (mem_cgroup_disabled())
return;

+ __this_cpu_add(memcg->vmstats_local->stat[idx], val);
+
x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
struct mem_cgroup *mi;

- atomic_long_add(x, &memcg->vmstats_local[idx]);
for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
atomic_long_add(x, &mi->vmstats[idx]);
x = 0;
@@ -754,11 +755,12 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
__mod_memcg_state(memcg, idx, val);

/* Update lruvec */
+ __this_cpu_add(pn->lruvec_stat_local->count[idx], val);
+
x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
struct mem_cgroup_per_node *pi;

- atomic_long_add(x, &pn->lruvec_stat_local[idx]);
for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
atomic_long_add(x, &pi->lruvec_stat[idx]);
x = 0;
@@ -780,11 +782,12 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
if (mem_cgroup_disabled())
return;

+ __this_cpu_add(memcg->vmstats_local->events[idx], count);
+
x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
if (unlikely(x > MEMCG_CHARGE_BATCH)) {
struct mem_cgroup *mi;

- atomic_long_add(x, &memcg->vmevents_local[idx]);
for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
atomic_long_add(x, &mi->vmevents[idx]);
x = 0;
@@ -799,7 +802,12 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)

static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
{
- return atomic_long_read(&memcg->vmevents_local[event]);
+ long x = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ x += per_cpu(memcg->vmstats_local->events[event], cpu);
+ return x;
}

static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
@@ -2200,11 +2208,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
long x;

x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
- if (x) {
- atomic_long_add(x, &memcg->vmstats_local[i]);
+ if (x)
for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
atomic_long_add(x, &memcg->vmstats[i]);
- }

if (i >= NR_VM_NODE_STAT_ITEMS)
continue;
@@ -2214,12 +2220,10 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)

pn = mem_cgroup_nodeinfo(memcg, nid);
x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
- if (x) {
- atomic_long_add(x, &pn->lruvec_stat_local[i]);
+ if (x)
do {
atomic_long_add(x, &pn->lruvec_stat[i]);
} while ((pn = parent_nodeinfo(pn, nid)));
- }
}
}

@@ -2227,11 +2231,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
long x;

x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
- if (x) {
- atomic_long_add(x, &memcg->vmevents_local[i]);
+ if (x)
for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
atomic_long_add(x, &memcg->vmevents[i]);
- }
}
}

@@ -4492,8 +4494,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return 1;

+ pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
+ if (!pn->lruvec_stat_local) {
+ kfree(pn);
+ return 1;
+ }
+
pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
if (!pn->lruvec_stat_cpu) {
+ free_percpu(pn->lruvec_stat_local);
kfree(pn);
return 1;
}
@@ -4515,6 +4524,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
return;

free_percpu(pn->lruvec_stat_cpu);
+ free_percpu(pn->lruvec_stat_local);
kfree(pn);
}

@@ -4525,6 +4535,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
for_each_node(node)
free_mem_cgroup_per_node_info(memcg, node);
free_percpu(memcg->vmstats_percpu);
+ free_percpu(memcg->vmstats_local);
kfree(memcg);
}

@@ -4553,6 +4564,10 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (memcg->id.id < 0)
goto fail;

+ memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
+ if (!memcg->vmstats_local)
+ goto fail;
+
memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
if (!memcg->vmstats_percpu)
goto fail;
--
2.21.0
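
Pulling the scattered hunks above together, the write and read paths with this
patch applied look roughly as follows. This is a simplified sketch rather than
the verbatim kernel code: the mem_cgroup_disabled() checks, unlikely() hints,
the CONFIG_SMP clamp of negative readings and the event-counter variants are
dropped, and the trailing __this_cpu_write() that stores the remaining batch
delta belongs to the pre-existing function body, not to this diff.

void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
{
	long x;

	/* local view: plain per-cpu counter, no shared cacheline touched */
	__this_cpu_add(memcg->vmstats_local->stat[idx], val);

	/* subtree view: accumulate per-cpu, spill to the atomics in batches */
	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
	if (abs(x) > MEMCG_CHARGE_BATCH) {
		struct mem_cgroup *mi;

		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
			atomic_long_add(x, &mi->vmstats[idx]);
		x = 0;
	}
	__this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
}

/* local read: sum the unbatched per-cpu counters */
unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
{
	long x = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
	return x;
}

/* recursive read: just the pre-aggregated atomic */
unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
{
	return atomic_long_read(&memcg->vmstats[idx]);
}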


2019-05-21 16:39:54

by Johannes Weiner

[permalink] [raw]
Subject: Re: [LKP] [mm] 42a3003535: will-it-scale.per_process_ops -25.9% regression

On Tue, May 21, 2019 at 09:46:46PM +0800, kernel test robot wrote:
> On Mon, May 20, 2019 at 05:53:28PM -0400, Johannes Weiner wrote:
> > Hello,
> >
> > On Mon, May 20, 2019 at 02:35:34PM +0800, kernel test robot wrote:
> > > Greeting,
> > >
> > > FYI, we noticed a -25.9% regression of will-it-scale.per_process_ops due to commit:
> > >
> > >
> > > commit: 42a300353577ccc17ecc627b8570a89fa1678bec ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > in testcase: will-it-scale
> > > on test machine: 192 threads Skylake-SP with 256G memory
> > > with following parameters:
> >
> > Ouch. That has to be the additional cache footprint of the split
> > local/recursive stat counters, rather than the extra instructions.
> >
> > Could you please try re-running the test on that host with the below
> > patch applied?
>
> Hi,
>
> The patch can fix the regression.
>
> tests: 1
> testcase/path_params/tbox_group/run: will-it-scale/performance-process-100%-page_fault3/lkp-skl-4sp1
>
> db9adbcbe7 ("mm: memcontrol: move stat/event counting functions out-of-line")
> 8d8245997d ("mm: memcontrol: don't batch updates of local VM stats and events")
>
> db9adbcbe740e098 8d8245997dbd17c5056094f15c
> ---------------- --------------------------
> %stddev change %stddev
> \ | \
> 87819982 85307742 will-it-scale.workload
> 457395 444310 will-it-scale.per_process_ops

Fantastic, thank you for verifying! I'm going to take that as a
Tested-by.

2019-05-28 16:03:31

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH] mm: memcontrol: don't batch updates of local VM stats and events

On Tue, May 21, 2019 at 8:16 AM Johannes Weiner <[email protected]> wrote:
>
> The kernel test robot noticed a 26% will-it-scale pagefault regression
> from commit 42a300353577 ("mm: memcontrol: fix recursive statistics
> correctness & scalabilty"). This appears to be caused by bouncing the
> additional cachelines from the new hierarchical statistics counters.
>
> We can fix this by getting rid of the batched local counters instead.
>
> Originally, there were *only* group-local counters, and they were
> fully maintained per cpu. A reader of a stats file high up in the
> cgroup tree would have to walk the entire subtree and collect each
> level's per-cpu counters to get the recursive view. This was
> prohibitively expensive, and so we switched to per-cpu batched updates
> of the local counters during a983b5ebee57 ("mm: memcontrol: fix
> excessive complexity in memory.stat reporting"), reducing the
> complexity from nr_subgroups * nr_cpus to nr_subgroups.
>
> With growing machines and cgroup trees, the tree walk itself became
> too expensive for monitoring top-level groups, and this is when the
> culprit patch added hierarchy counters on each cgroup level. When the
> per-cpu batch size would be reached, both the local and the hierarchy
> counters would get batch-updated from the per-cpu delta simultaneously.
>
> This makes local and hierarchical counter reads blazingly fast, but it
> unfortunately makes the write-side too cache line intense.
>
> Since local counter reads were never a problem - we only centralized
> them to accelerate the hierarchy walk - and use of the local counters
> is becoming rarer due to replacement with hierarchical views (ongoing
> rework in the page reclaim and workingset code), we can make those
> local counters unbatched per-cpu counters again.
>
> The scheme will then be as such:
>
> when a memcg statistic changes, the writer will:
> - update the local counter (per-cpu)
> - update the batch counter (per-cpu). If the batch is full:
> - spill the batch into the group's atomic_t
> - spill the batch into all ancestors' atomic_ts
> - empty out the batch counter (per-cpu)
>
> when a local memcg counter is read, the reader will:
> - collect the local counter from all cpus
>
> when a hierarchy memcg counter is read, the reader will:
> - read the atomic_t
>
> We might be able to simplify this further and make the recursive
> counters unbatched per-cpu counters as well (batch upward propagation,
> but leave per-cpu collection to the readers), but that will require a
> more in-depth analysis and testing of all the callsites. Deal with the
> immediate regression for now.
>
> Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
> Reported-by: kernel test robot <[email protected]>
> Tested-by: kernel test robot <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 26 ++++++++++++++++--------
> mm/memcontrol.c | 41 ++++++++++++++++++++++++++------------
> 2 files changed, 46 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bc74d6a4407c..2d23ae7bd36d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -126,9 +126,12 @@ struct memcg_shrinker_map {
> struct mem_cgroup_per_node {
> struct lruvec lruvec;
>
> + /* Legacy local VM stats */
> + struct lruvec_stat __percpu *lruvec_stat_local;
> +
> + /* Subtree VM stats (batched updates) */
> struct lruvec_stat __percpu *lruvec_stat_cpu;
> atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS];
> - atomic_long_t lruvec_stat_local[NR_VM_NODE_STAT_ITEMS];
>
> unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>
> @@ -274,17 +277,18 @@ struct mem_cgroup {
> atomic_t moving_account;
> struct task_struct *move_lock_task;
>
> - /* memory.stat */
> + /* Legacy local VM stats and events */
> + struct memcg_vmstats_percpu __percpu *vmstats_local;
> +
> + /* Subtree VM stats and events (batched updates) */
> struct memcg_vmstats_percpu __percpu *vmstats_percpu;
>
> MEMCG_PADDING(_pad2_);
>
> atomic_long_t vmstats[MEMCG_NR_STAT];
> - atomic_long_t vmstats_local[MEMCG_NR_STAT];
> -
> atomic_long_t vmevents[NR_VM_EVENT_ITEMS];
> - atomic_long_t vmevents_local[NR_VM_EVENT_ITEMS];
>
> + /* memory.events */
> atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
>
> unsigned long socket_pressure;
> @@ -576,7 +580,11 @@ static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
> static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg,
> int idx)
> {
> - long x = atomic_long_read(&memcg->vmstats_local[idx]);
> + long x = 0;
> + int cpu;
> +
> + for_each_possible_cpu(cpu)
> + x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -650,13 +658,15 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> enum node_stat_item idx)
> {
> struct mem_cgroup_per_node *pn;
> - long x;
> + long x = 0;
> + int cpu;
>
> if (mem_cgroup_disabled())
> return node_page_state(lruvec_pgdat(lruvec), idx);
>
> pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - x = atomic_long_read(&pn->lruvec_stat_local[idx]);
> + for_each_possible_cpu(cpu)
> + x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e50a2db5b4ff..8d42e5c7bf37 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -700,11 +700,12 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
> if (mem_cgroup_disabled())
> return;
>
> + __this_cpu_add(memcg->vmstats_local->stat[idx], val);
> +
> x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
> if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> struct mem_cgroup *mi;
>
> - atomic_long_add(x, &memcg->vmstats_local[idx]);

I suspected the following for-loop + atomic-add was causing the regression.
Why is the above atomic-add the culprit?

> for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> atomic_long_add(x, &mi->vmstats[idx]);
> x = 0;
> @@ -754,11 +755,12 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> __mod_memcg_state(memcg, idx, val);
>
> /* Update lruvec */
> + __this_cpu_add(pn->lruvec_stat_local->count[idx], val);
> +
> x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
> if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> struct mem_cgroup_per_node *pi;
>
> - atomic_long_add(x, &pn->lruvec_stat_local[idx]);
> for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
> atomic_long_add(x, &pi->lruvec_stat[idx]);
> x = 0;
> @@ -780,11 +782,12 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
> if (mem_cgroup_disabled())
> return;
>
> + __this_cpu_add(memcg->vmstats_local->events[idx], count);
> +
> x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
> if (unlikely(x > MEMCG_CHARGE_BATCH)) {
> struct mem_cgroup *mi;
>
> - atomic_long_add(x, &memcg->vmevents_local[idx]);
> for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> atomic_long_add(x, &mi->vmevents[idx]);
> x = 0;
> @@ -799,7 +802,12 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
>
> static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
> {
> - return atomic_long_read(&memcg->vmevents_local[event]);
> + long x = 0;
> + int cpu;
> +
> + for_each_possible_cpu(cpu)
> + x += per_cpu(memcg->vmstats_local->events[event], cpu);
> + return x;
> }
>
> static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
> @@ -2200,11 +2208,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
> long x;
>
> x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
> - if (x) {
> - atomic_long_add(x, &memcg->vmstats_local[i]);
> + if (x)
> for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> atomic_long_add(x, &memcg->vmstats[i]);
> - }
>
> if (i >= NR_VM_NODE_STAT_ITEMS)
> continue;
> @@ -2214,12 +2220,10 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
>
> pn = mem_cgroup_nodeinfo(memcg, nid);
> x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
> - if (x) {
> - atomic_long_add(x, &pn->lruvec_stat_local[i]);
> + if (x)
> do {
> atomic_long_add(x, &pn->lruvec_stat[i]);
> } while ((pn = parent_nodeinfo(pn, nid)));
> - }
> }
> }
>
> @@ -2227,11 +2231,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
> long x;
>
> x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
> - if (x) {
> - atomic_long_add(x, &memcg->vmevents_local[i]);
> + if (x)
> for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> atomic_long_add(x, &memcg->vmevents[i]);
> - }
> }
> }
>
> @@ -4492,8 +4494,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> if (!pn)
> return 1;
>
> + pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> + if (!pn->lruvec_stat_local) {
> + kfree(pn);
> + return 1;
> + }
> +
> pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> if (!pn->lruvec_stat_cpu) {
> + free_percpu(pn->lruvec_stat_local);
> kfree(pn);
> return 1;
> }
> @@ -4515,6 +4524,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> return;
>
> free_percpu(pn->lruvec_stat_cpu);
> + free_percpu(pn->lruvec_stat_local);
> kfree(pn);
> }
>
> @@ -4525,6 +4535,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
> for_each_node(node)
> free_mem_cgroup_per_node_info(memcg, node);
> free_percpu(memcg->vmstats_percpu);
> + free_percpu(memcg->vmstats_local);
> kfree(memcg);
> }
>
> @@ -4553,6 +4564,10 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> if (memcg->id.id < 0)
> goto fail;
>
> + memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> + if (!memcg->vmstats_local)
> + goto fail;
> +
> memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> if (!memcg->vmstats_percpu)
> goto fail;
> --
> 2.21.0
>

2019-05-28 19:17:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm: memcontrol: don't batch updates of local VM stats and events

On Tue, May 28, 2019 at 9:00 AM Shakeel Butt <[email protected]> wrote:
>
> I suspected the following for-loop + atomic-add was causing the regression.

If I read the kernel test robot reports correctly, Johannes' fix patch
does fix the regression (well - mostly. The original reported
regression was 26%, and with Johannes' fix patch it was 3% - so still
a slight performance regression, but not nearly as bad).

> Why is the above atomic-add the culprit?

I think the problem with that one is that it's cross-cpu statistics,
so you end up with lots of cacheline bounces on the local counts when
you have lots of load.

But yes, the recursive updates still do show a small regression,
probably because there's still some overhead from the looping up in
the hierarchy. You still get *those* cacheline bounces, but now they
are limited to the upper hierarchies that only get updated at batch
time.

Johannes? Am I reading this right?

Linus
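
To see the two update patterns in isolation, here is a minimal sketch with
hypothetical counters and helpers (shared_count, pcpu_count and friends are
illustrative names, not memcg code): every CPU writing the same atomic_long_t
keeps pulling its cacheline back and forth, whereas a per-cpu counter keeps
writes CPU-local and moves the cross-CPU work to the (rare) reader.

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/percpu.h>

/* Shared counter: every update dirties the same cacheline on every CPU. */
static atomic_long_t shared_count;

static inline void shared_inc(long val)
{
	atomic_long_add(val, &shared_count);	/* bounces under parallel load */
}

/* Per-cpu counter: updates stay local, readers pay the aggregation cost. */
static DEFINE_PER_CPU(long, pcpu_count);

static inline void pcpu_inc(long val)
{
	__this_cpu_add(pcpu_count, val);	/* no cross-CPU traffic */
}

static long pcpu_read(void)
{
	long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(pcpu_count, cpu);
	return sum;
}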

2019-05-28 20:38:54

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm: memcontrol: don't batch updates of local VM stats and events

On Tue, May 28, 2019 at 10:37:15AM -0700, Linus Torvalds wrote:
> On Tue, May 28, 2019 at 9:00 AM Shakeel Butt <[email protected]> wrote:
> >
> > I suspected the following for-loop + atomic-add was causing the regression.
>
> If I read the kernel test robot reports correctly, Johannes' fix patch
> does fix the regression (well - mostly. The original reported
> regression was 26%, and with Johannes' fix patch it was 3% - so still
> a slight performance regression, but not nearly as bad).
>
> > Why is the above atomic-add the culprit?
>
> I think the problem with that one is that it's cross-cpu statistics,
> so you end up with lots of cacheline bounces on the local counts when
> you have lots of load.

In this case, that's true for both of them. The workload runs at the
root cgroup level, so by definition the local and the recursive
counters at that level are identical and written to at the same
rate. Adding the new counter obviously caused the regression, but
they're contributing equally to the cost, and we could
remove/per-cpuify either of them for the fix.

So why did I unshare the old counter instead of the new one? Well, the
old counter *used* to be unshared for the longest time, and was only
made into a shared one to make recursive aggregation cheaper - before
there was a dedicated recursive counter. But now that we have that
recursive counter, there isn't much reason to keep the local counter
shared and bounce it around on updates.

Essentially, this fix-up is a revert of a983b5ebee57 ("mm: memcontrol:
fix excessive complexity in memory.stat reporting") since the problem
described in that patch is now solved from the other end.

> But yes, the recursive updates still do show a small regression,
> probably because there's still some overhead from the looping up in
> the hierarchy. You still get *those* cacheline bounces, but now they
> are limited to the upper hierarchies that only get updated at batch
> time.

Right, I reduce the *shared* data back to how it was before the patch,
but it still adds a second (per-cpu) counter that needs to get bumped,
and the loop adds a branch as well.

But while I would expect that to show up in a case like will-it-scale,
I'd be surprised if the remaining difference would be noticeable for
real workloads that actually work with the memory they allocate.
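
In rough per-update terms, the write-side cost implied by the code in this
thread is as follows; treat these counts as a reading of the diffs above, not
as a measurement, and "depth" as the number of cgroup levels up to the root:

  before 42a300353577:  one per-cpu batch accumulator; one atomic add (the
                        group's own counter) each time the batch fills up
  after  42a300353577:  same per-cpu accumulator; one atomic add to the shared
                        local counter plus one per level up to the root when
                        the batch fills
  with this fix:        an extra unbatched per-cpu add for the local counter on
                        every update; one atomic add per level up to the root
                        when the batch fills

At the root level, where the test runs (depth = 1), the fix halves the shared
atomic traffic per batch, which is what brings the regression down from about
26% to about 3%, with the extra per-cpu bump and branch accounting for the
remainder.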