2020-11-02 09:21:15

by Chen, Rong A

Subject: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

Greeting,

FYI, we noticed a -22.7% regression of will-it-scale.per_process_ops due to commit:


commit: bd0b230fe14554bfffbae54e19038716f96f5a41 ("mm/memcg: unify swap and memsw page counters")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master


in testcase: will-it-scale
on test machine: 144 threads Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz with 512G memory
with following parameters:

nr_task: 50%
mode: process
test: page_fault2
cpufreq_governor: performance
ucode: 0x16

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale



If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/process/50%/debian-10.4-x86_64-20200603.cgz/lkp-hsw-4ex1/page_fault2/will-it-scale/0x16

commit:
8d387a5f17 ("mm/memcg: simplify mem_cgroup_get_max()")
bd0b230fe1 ("mm/memcg: unify swap and memsw page counters")

8d387a5f172f26ff bd0b230fe14554bfffbae54e190
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
5:5 -23% 4:5 perf-profile.calltrace.cycles-pp.error_entry.testcase
5:5 -23% 4:5 perf-profile.children.cycles-pp.error_entry
5:5 -19% 4:5 perf-profile.self.cycles-pp.error_entry
%stddev %change %stddev
\ | \
187666 -22.7% 145157 will-it-scale.per_process_ops
13511995 -22.7% 10451324 will-it-scale.workload
1895023 ±196% -98.7% 24188 ± 4% cpuidle.POLL.time
280275 ±192% -97.3% 7540 ± 3% cpuidle.POLL.usage
34107 ± 26% -27.5% 24740 ± 10% numa-meminfo.node1.KReclaimable
4234 ± 11% -15.6% 3573 ± 2% numa-meminfo.node1.PageTables
34107 ± 26% -27.5% 24740 ± 10% numa-meminfo.node1.SReclaimable
1195173 +1.2% 1209077 proc-vmstat.nr_anon_pages
4325 -2.3% 4223 proc-vmstat.nr_page_table_pages
4.07e+09 -22.6% 3.149e+09 proc-vmstat.numa_hit
4.069e+09 -22.6% 3.149e+09 proc-vmstat.numa_local
4.072e+09 -22.6% 3.151e+09 proc-vmstat.pgalloc_normal
4.059e+09 -22.6% 3.141e+09 proc-vmstat.pgfault
4.069e+09 -22.6% 3.147e+09 proc-vmstat.pgfree
1.395e+09 -9.9e+07 1.296e+09 ± 5% syscalls.sys_mmap.noise.75%
2.512e+09 ± 17% +1.2e+09 3.667e+09 ± 13% syscalls.sys_write.noise.100%
2.526e+09 ± 17% +1.2e+09 3.684e+09 ± 13% syscalls.sys_write.noise.2%
2.523e+09 ± 17% +1.2e+09 3.681e+09 ± 13% syscalls.sys_write.noise.25%
2.525e+09 ± 17% +1.2e+09 3.684e+09 ± 13% syscalls.sys_write.noise.5%
2.52e+09 ± 17% +1.2e+09 3.677e+09 ± 13% syscalls.sys_write.noise.50%
2.516e+09 ± 17% +1.2e+09 3.672e+09 ± 13% syscalls.sys_write.noise.75%
1.029e+09 -21.7% 8.052e+08 ± 2% numa-numastat.node0.local_node
1.029e+09 -21.7% 8.053e+08 ± 2% numa-numastat.node0.numa_hit
1.02e+09 -23.0% 7.853e+08 numa-numastat.node1.local_node
1.02e+09 -23.0% 7.853e+08 numa-numastat.node1.numa_hit
1.013e+09 -22.8% 7.817e+08 numa-numastat.node2.local_node
1.013e+09 -22.8% 7.818e+08 numa-numastat.node2.numa_hit
1.011e+09 -23.1% 7.771e+08 ± 2% numa-numastat.node3.local_node
1.011e+09 -23.1% 7.772e+08 ± 2% numa-numastat.node3.numa_hit
9210 ± 8% +12.5% 10362 ± 8% softirqs.CPU13.RCU
20302 ± 8% +31.3% 26656 ± 10% softirqs.CPU142.SCHED
16688 ± 16% +52.8% 25498 ± 29% softirqs.CPU15.SCHED
17137 ± 23% +62.3% 27811 ± 12% softirqs.CPU20.SCHED
23421 ± 21% -40.4% 13969 ± 33% softirqs.CPU36.SCHED
23782 ± 7% -29.0% 16876 ± 17% softirqs.CPU70.SCHED
27401 ± 9% -34.4% 17978 ± 42% softirqs.CPU87.SCHED
25692 ± 13% -44.6% 14223 ± 20% softirqs.CPU92.SCHED
5.114e+08 -21.6% 4.012e+08 ± 2% numa-vmstat.node0.numa_hit
5.114e+08 -21.6% 4.012e+08 ± 2% numa-vmstat.node0.numa_local
1058 ± 11% -15.6% 893.00 ± 2% numa-vmstat.node1.nr_page_table_pages
8526 ± 26% -27.5% 6184 ± 10% numa-vmstat.node1.nr_slab_reclaimable
5.074e+08 -22.9% 3.91e+08 numa-vmstat.node1.numa_hit
5.073e+08 -22.9% 3.909e+08 numa-vmstat.node1.numa_local
5.04e+08 -22.7% 3.895e+08 numa-vmstat.node2.numa_hit
5.039e+08 -22.7% 3.894e+08 numa-vmstat.node2.numa_local
5.029e+08 -23.0% 3.87e+08 ± 2% numa-vmstat.node3.numa_hit
5.028e+08 -23.1% 3.869e+08 ± 2% numa-vmstat.node3.numa_local
6778 ± 54% -98.0% 135.96 ± 62% sched_debug.cfs_rq:/.exec_clock.min
29780 ± 6% +52.2% 45320 ± 6% sched_debug.cfs_rq:/.exec_clock.stddev
528699 ± 51% -94.3% 30214 ± 19% sched_debug.cfs_rq:/.min_vruntime.min
2205697 ± 6% +52.3% 3359526 ± 6% sched_debug.cfs_rq:/.min_vruntime.stddev
17.95 ± 5% -37.5% 11.22 ± 2% sched_debug.cfs_rq:/.nr_spread_over.avg
59.80 ± 28% -40.5% 35.60 ± 24% sched_debug.cfs_rq:/.nr_spread_over.max
1.80 ± 38% -100.0% 0.00 sched_debug.cfs_rq:/.nr_spread_over.min
8.78 ± 5% -16.2% 7.36 ± 7% sched_debug.cfs_rq:/.nr_spread_over.stddev
2205731 ± 6% +52.3% 3359553 ± 6% sched_debug.cfs_rq:/.spread0.stddev
23138 ± 2% +39.2% 32199 ± 18% sched_debug.cpu.nr_switches.max
3068 ± 3% +19.9% 3680 ± 10% sched_debug.cpu.nr_switches.stddev
19891 ± 4% +47.3% 29292 ± 20% sched_debug.cpu.sched_count.max
694.57 ± 4% -12.4% 608.67 ± 3% sched_debug.cpu.sched_count.min
2602 ± 4% +28.2% 3335 ± 10% sched_debug.cpu.sched_count.stddev
9769 ± 4% +48.7% 14531 ± 21% sched_debug.cpu.sched_goidle.max
25.50 ± 22% -51.2% 12.43 ± 10% sched_debug.cpu.sched_goidle.min
1315 ± 4% +29.2% 1699 ± 9% sched_debug.cpu.sched_goidle.stddev
259.50 ± 2% -16.7% 216.20 ± 5% sched_debug.cpu.ttwu_count.min
233.47 -12.9% 203.27 ± 3% sched_debug.cpu.ttwu_local.min
136.40 ± 22% -41.3% 80.00 ± 35% interrupts.CPU1.RES:Rescheduling_interrupts
445.80 ± 49% +54.3% 688.00 interrupts.CPU116.CAL:Function_call_interrupts
2384 ± 39% +77.2% 4225 ± 26% interrupts.CPU116.NMI:Non-maskable_interrupts
2384 ± 39% +77.2% 4225 ± 26% interrupts.CPU116.PMI:Performance_monitoring_interrupts
3140 ± 25% +82.5% 5732 ± 36% interrupts.CPU12.NMI:Non-maskable_interrupts
3140 ± 25% +82.5% 5732 ± 36% interrupts.CPU12.PMI:Performance_monitoring_interrupts
6641 ± 17% -48.0% 3452 ± 40% interrupts.CPU128.NMI:Non-maskable_interrupts
6641 ± 17% -48.0% 3452 ± 40% interrupts.CPU128.PMI:Performance_monitoring_interrupts
6211 ± 25% -41.3% 3643 ± 40% interrupts.CPU14.NMI:Non-maskable_interrupts
6211 ± 25% -41.3% 3643 ± 40% interrupts.CPU14.PMI:Performance_monitoring_interrupts
156.20 ± 19% -60.4% 61.80 ± 45% interrupts.CPU15.RES:Rescheduling_interrupts
450.60 ± 61% +314.9% 1869 ±110% interrupts.CPU17.CAL:Function_call_interrupts
401.20 ± 62% +105.1% 823.00 ± 28% interrupts.CPU2.CAL:Function_call_interrupts
3781 ± 29% +91.4% 7239 ± 22% interrupts.CPU23.NMI:Non-maskable_interrupts
3781 ± 29% +91.4% 7239 ± 22% interrupts.CPU23.PMI:Performance_monitoring_interrupts
131.20 ± 16% -38.3% 81.00 ± 37% interrupts.CPU23.RES:Rescheduling_interrupts
6565 ± 27% -47.8% 3430 ± 57% interrupts.CPU30.NMI:Non-maskable_interrupts
6565 ± 27% -47.8% 3430 ± 57% interrupts.CPU30.PMI:Performance_monitoring_interrupts
2524 ± 33% +53.1% 3866 ± 18% interrupts.CPU39.NMI:Non-maskable_interrupts
2524 ± 33% +53.1% 3866 ± 18% interrupts.CPU39.PMI:Performance_monitoring_interrupts
5453 ± 26% -34.5% 3569 ± 27% interrupts.CPU43.NMI:Non-maskable_interrupts
5453 ± 26% -34.5% 3569 ± 27% interrupts.CPU43.PMI:Performance_monitoring_interrupts
2524 ± 36% +65.3% 4172 ± 28% interrupts.CPU48.NMI:Non-maskable_interrupts
2524 ± 36% +65.3% 4172 ± 28% interrupts.CPU48.PMI:Performance_monitoring_interrupts
487.40 ± 58% +124.0% 1092 ± 55% interrupts.CPU79.CAL:Function_call_interrupts
585.20 ± 40% +61.9% 947.20 ± 23% interrupts.CPU80.CAL:Function_call_interrupts
487.60 ± 54% +67.9% 818.80 ± 16% interrupts.CPU81.CAL:Function_call_interrupts
15.97 +11.6% 17.83 perf-stat.i.MPKI
1.333e+10 -27.3% 9.69e+09 perf-stat.i.branch-instructions
0.48 +0.1 0.53 perf-stat.i.branch-miss-rate%
62963150 -20.1% 50310834 perf-stat.i.branch-misses
37.28 -1.9 35.40 perf-stat.i.cache-miss-rate%
4.08e+08 -21.8% 3.189e+08 perf-stat.i.cache-misses
1.092e+09 -17.7% 8.983e+08 perf-stat.i.cache-references
3.06 +35.8% 4.16 perf-stat.i.cpi
143.56 -1.4% 141.50 perf-stat.i.cpu-migrations
523.77 +28.4% 672.47 perf-stat.i.cycles-between-cache-misses
2.068e+10 -25.8% 1.534e+10 perf-stat.i.dTLB-loads
95643337 -22.7% 73905709 perf-stat.i.dTLB-store-misses
1.225e+10 -22.1% 9.544e+09 perf-stat.i.dTLB-stores
94.47 -2.1 92.36 perf-stat.i.iTLB-load-miss-rate%
40594103 -23.3% 31146350 perf-stat.i.iTLB-load-misses
2304247 ± 11% +9.1% 2513651 perf-stat.i.iTLB-loads
6.823e+10 -26.2% 5.033e+10 perf-stat.i.instructions
1692 -3.6% 1632 perf-stat.i.instructions-per-iTLB-miss
0.33 -26.3% 0.24 perf-stat.i.ipc
332.55 -25.1% 249.20 perf-stat.i.metric.M/sec
13449087 -22.7% 10398888 perf-stat.i.minor-faults
4.30 +1.7 6.01 perf-stat.i.node-load-miss-rate%
13864893 +11.3% 15438022 perf-stat.i.node-load-misses
3.294e+08 -22.9% 2.539e+08 perf-stat.i.node-loads
18.37 +0.4 18.79 perf-stat.i.node-store-miss-rate%
11546036 -22.7% 8923297 perf-stat.i.node-store-misses
51372725 -24.8% 38626751 perf-stat.i.node-stores
13449089 -22.7% 10398890 perf-stat.i.page-faults
16.00 +11.5% 17.85 perf-stat.overall.MPKI
0.47 +0.0 0.52 perf-stat.overall.branch-miss-rate%
37.36 -1.9 35.50 perf-stat.overall.cache-miss-rate%
3.06 +35.9% 4.16 perf-stat.overall.cpi
512.20 +28.3% 656.96 perf-stat.overall.cycles-between-cache-misses
94.64 -2.1 92.54 perf-stat.overall.iTLB-load-miss-rate%
1680 -3.9% 1616 perf-stat.overall.instructions-per-iTLB-miss
0.33 -26.4% 0.24 perf-stat.overall.ipc
4.04 +1.7 5.73 perf-stat.overall.node-load-miss-rate%
18.35 +0.4 18.77 perf-stat.overall.node-store-miss-rate%
1519587 -4.6% 1449452 perf-stat.overall.path-length
1.329e+10 -27.3% 9.658e+09 perf-stat.ps.branch-instructions
62761890 -20.1% 50132370 perf-stat.ps.branch-misses
4.065e+08 -21.8% 3.178e+08 perf-stat.ps.cache-misses
1.088e+09 -17.7% 8.953e+08 perf-stat.ps.cache-references
142.33 -1.6% 140.11 perf-stat.ps.cpu-migrations
2.061e+10 -25.8% 1.529e+10 perf-stat.ps.dTLB-loads
95301115 -22.7% 73646080 perf-stat.ps.dTLB-store-misses
1.221e+10 -22.1% 9.511e+09 perf-stat.ps.dTLB-stores
40451925 -23.3% 31039602 perf-stat.ps.iTLB-load-misses
2295342 ± 11% +9.1% 2503862 perf-stat.ps.iTLB-loads
6.799e+10 -26.2% 5.016e+10 perf-stat.ps.instructions
13401817 -22.7% 10363171 perf-stat.ps.minor-faults
13816941 +11.3% 15384985 perf-stat.ps.node-load-misses
3.282e+08 -22.9% 2.53e+08 perf-stat.ps.node-loads
11506624 -22.7% 8893312 perf-stat.ps.node-store-misses
51195452 -24.8% 38495647 perf-stat.ps.node-stores
13401819 -22.7% 10363173 perf-stat.ps.page-faults
2.053e+13 -26.2% 1.515e+13 perf-stat.total.instructions
11.81 ± 11% -3.5 8.26 ± 10% perf-profile.calltrace.cycles-pp.__munmap
11.81 ± 11% -3.5 8.26 ± 10% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
11.80 ± 11% -3.5 8.26 ± 10% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
11.80 ± 11% -3.5 8.26 ± 10% perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
11.80 ± 11% -3.5 8.26 ± 10% perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
11.80 ± 11% -3.5 8.26 ± 10% perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
11.80 ± 11% -3.5 8.25 ± 10% perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
12.30 ± 10% -3.4 8.86 ± 10% perf-profile.calltrace.cycles-pp.finish_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
12.25 ± 10% -3.4 8.83 ± 10% perf-profile.calltrace.cycles-pp.alloc_set_pte.finish_fault.do_fault.__handle_mm_fault.handle_mm_fault
8.17 ± 11% -3.3 4.90 ± 10% perf-profile.calltrace.cycles-pp.lru_cache_add.alloc_set_pte.finish_fault.do_fault.__handle_mm_fault
8.04 ± 11% -3.2 4.79 ± 10% perf-profile.calltrace.cycles-pp.pagevec_lru_move_fn.lru_cache_add.alloc_set_pte.finish_fault.do_fault
10.87 ± 11% -3.2 7.63 ± 10% perf-profile.calltrace.cycles-pp.unmap_vmas.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
10.87 ± 11% -3.2 7.63 ± 10% perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.unmap_region.__do_munmap.__vm_munmap
10.86 ± 11% -3.2 7.62 ± 10% perf-profile.calltrace.cycles-pp.zap_pte_range.unmap_page_range.unmap_vmas.unmap_region.__do_munmap
6.14 ± 11% -3.0 3.18 ± 9% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.pagevec_lru_move_fn.lru_cache_add.alloc_set_pte.finish_fault
6.08 ± 11% -2.9 3.14 ± 9% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.pagevec_lru_move_fn.lru_cache_add.alloc_set_pte
7.57 ± 12% -2.7 4.88 ± 9% perf-profile.calltrace.cycles-pp.tlb_flush_mmu.zap_pte_range.unmap_page_range.unmap_vmas.unmap_region
7.14 ± 13% -2.6 4.56 ± 9% perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.zap_pte_range.unmap_page_range.unmap_vmas
3.95 ± 16% -2.1 1.86 ± 7% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.release_pages.tlb_flush_mmu.zap_pte_range.unmap_page_range
3.93 ± 17% -2.1 1.85 ± 7% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.release_pages.tlb_flush_mmu.zap_pte_range
3.25 ± 10% -0.8 2.44 ± 10% perf-profile.calltrace.cycles-pp.alloc_pages_vma.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
2.86 ± 10% -0.7 2.13 ± 9% perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.alloc_pages_vma.do_fault.__handle_mm_fault.handle_mm_fault
2.62 ± 10% -0.7 1.93 ± 10% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.do_fault.__handle_mm_fault
2.36 ± 11% -0.7 1.71 ± 9% perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.do_fault
1.83 ± 11% -0.6 1.23 ± 11% perf-profile.calltrace.cycles-pp.try_charge.mem_cgroup_charge.do_fault.__handle_mm_fault.handle_mm_fault
1.89 ± 11% -0.6 1.32 ± 9% perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma
1.76 ± 9% -0.5 1.30 ± 10% perf-profile.calltrace.cycles-pp.free_unref_page_list.release_pages.tlb_flush_mmu.zap_pte_range.unmap_page_range
1.30 ± 12% -0.4 0.86 ± 11% perf-profile.calltrace.cycles-pp.page_counter_try_charge.try_charge.mem_cgroup_charge.do_fault.__handle_mm_fault
1.53 ± 9% -0.4 1.11 ± 10% perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_flush_mmu.zap_pte_range
0.92 ± 13% -0.3 0.62 ± 10% perf-profile.calltrace.cycles-pp.tlb_finish_mmu.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
0.92 ± 13% -0.3 0.62 ± 10% perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.unmap_region.__do_munmap.__vm_munmap
0.89 ± 13% -0.3 0.60 ± 10% perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.unmap_region.__do_munmap
1.02 ± 11% -0.2 0.78 ± 9% perf-profile.calltrace.cycles-pp.__list_del_entry_valid.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages_nodemask
1.03 ± 9% -0.2 0.79 ± 10% perf-profile.calltrace.cycles-pp.__irqentry_text_end.testcase
0.98 ± 12% +0.3 1.27 ± 13% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.page_add_new_anon_rmap.alloc_set_pte.finish_fault.do_fault
1.42 ± 9% +0.3 1.75 ± 9% perf-profile.calltrace.cycles-pp.shmem_fault.__do_fault.do_fault.__handle_mm_fault.handle_mm_fault
0.70 ± 13% +0.4 1.10 ± 14% perf-profile.calltrace.cycles-pp.__mod_memcg_state.__mod_memcg_lruvec_state.page_add_new_anon_rmap.alloc_set_pte.finish_fault
1.23 ± 9% +0.4 1.64 ± 10% perf-profile.calltrace.cycles-pp.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault.__handle_mm_fault
1.05 ± 9% +0.5 1.51 ± 9% perf-profile.calltrace.cycles-pp.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
0.71 ± 12% +0.5 1.18 ± 13% perf-profile.calltrace.cycles-pp.__count_memcg_events.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
0.83 ± 9% +0.5 1.34 ± 10% perf-profile.calltrace.cycles-pp.find_get_entry.find_lock_entry.shmem_getpage_gfp.shmem_fault.__do_fault
0.37 ± 81% +0.6 0.93 ± 15% perf-profile.calltrace.cycles-pp.mem_cgroup_charge_statistics.mem_cgroup_charge.do_fault.__handle_mm_fault.handle_mm_fault
0.36 ± 81% +0.6 0.91 ± 15% perf-profile.calltrace.cycles-pp.__count_memcg_events.mem_cgroup_charge_statistics.mem_cgroup_charge.do_fault.__handle_mm_fault
2.69 ± 9% +4.3 6.95 ± 14% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.mem_cgroup_charge.do_fault.__handle_mm_fault.handle_mm_fault
7.04 ± 9% +10.2 17.26 ± 13% perf-profile.calltrace.cycles-pp.mem_cgroup_charge.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
11.69 ± 13% -5.8 5.91 ± 8% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
10.58 ± 13% -5.3 5.29 ± 8% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
11.81 ± 11% -3.5 8.26 ± 10% perf-profile.children.cycles-pp.__munmap
11.80 ± 11% -3.5 8.26 ± 10% perf-profile.children.cycles-pp.__do_munmap
11.80 ± 11% -3.5 8.26 ± 10% perf-profile.children.cycles-pp.__x64_sys_munmap
11.80 ± 11% -3.5 8.26 ± 10% perf-profile.children.cycles-pp.__vm_munmap
11.80 ± 11% -3.5 8.26 ± 10% perf-profile.children.cycles-pp.unmap_region
11.88 ± 11% -3.5 8.34 ± 10% perf-profile.children.cycles-pp.do_syscall_64
11.89 ± 11% -3.5 8.35 ± 10% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
12.30 ± 10% -3.4 8.86 ± 10% perf-profile.children.cycles-pp.finish_fault
12.26 ± 10% -3.4 8.83 ± 10% perf-profile.children.cycles-pp.alloc_set_pte
8.18 ± 11% -3.3 4.90 ± 10% perf-profile.children.cycles-pp.lru_cache_add
8.05 ± 11% -3.2 4.80 ± 9% perf-profile.children.cycles-pp.pagevec_lru_move_fn
10.87 ± 11% -3.2 7.63 ± 10% perf-profile.children.cycles-pp.unmap_vmas
10.87 ± 11% -3.2 7.63 ± 10% perf-profile.children.cycles-pp.unmap_page_range
10.87 ± 11% -3.2 7.63 ± 10% perf-profile.children.cycles-pp.zap_pte_range
8.49 ± 12% -3.0 5.51 ± 9% perf-profile.children.cycles-pp.tlb_flush_mmu
8.20 ± 13% -2.9 5.29 ± 9% perf-profile.children.cycles-pp.release_pages
3.91 ± 10% -0.9 3.06 ± 9% perf-profile.children.cycles-pp._raw_spin_lock
3.27 ± 10% -0.8 2.45 ± 10% perf-profile.children.cycles-pp.alloc_pages_vma
2.95 ± 10% -0.7 2.20 ± 10% perf-profile.children.cycles-pp.__alloc_pages_nodemask
2.67 ± 10% -0.7 1.98 ± 10% perf-profile.children.cycles-pp.get_page_from_freelist
2.39 ± 11% -0.7 1.73 ± 9% perf-profile.children.cycles-pp.rmqueue
1.84 ± 11% -0.6 1.23 ± 11% perf-profile.children.cycles-pp.try_charge
1.90 ± 11% -0.6 1.33 ± 9% perf-profile.children.cycles-pp.rmqueue_bulk
2.01 ± 9% -0.5 1.49 ± 10% perf-profile.children.cycles-pp.free_unref_page_list
1.74 ± 9% -0.5 1.27 ± 10% perf-profile.children.cycles-pp.free_pcppages_bulk
1.32 ± 12% -0.4 0.87 ± 12% perf-profile.children.cycles-pp.page_counter_try_charge
1.54 ± 10% -0.3 1.22 ± 9% perf-profile.children.cycles-pp.__list_del_entry_valid
0.93 ± 13% -0.3 0.62 ± 10% perf-profile.children.cycles-pp.tlb_finish_mmu
1.03 ± 9% -0.2 0.79 ± 10% perf-profile.children.cycles-pp.__irqentry_text_end
0.43 ± 7% -0.1 0.33 ± 10% perf-profile.children.cycles-pp.free_pages_and_swap_cache
0.41 ± 7% -0.1 0.33 ± 10% perf-profile.children.cycles-pp.__perf_sw_event
0.41 ± 9% -0.1 0.33 ± 8% perf-profile.children.cycles-pp.xas_load
0.31 ± 9% -0.1 0.23 ± 6% perf-profile.children.cycles-pp.__mod_lruvec_state
0.25 ± 8% -0.1 0.19 ± 10% perf-profile.children.cycles-pp.___perf_sw_event
0.22 ± 10% -0.1 0.16 ± 6% perf-profile.children.cycles-pp.__mod_node_page_state
0.09 ± 21% -0.0 0.05 ± 50% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
0.22 ± 8% -0.0 0.17 ± 9% perf-profile.children.cycles-pp.sync_regs
0.15 ± 10% -0.0 0.11 ± 8% perf-profile.children.cycles-pp.__list_add_valid
0.13 ± 12% -0.0 0.10 ± 8% perf-profile.children.cycles-pp.memcg_check_events
0.09 ± 8% -0.0 0.06 ± 12% perf-profile.children.cycles-pp.mem_cgroup_page_lruvec
0.10 ± 9% -0.0 0.08 ± 14% perf-profile.children.cycles-pp.up_read
0.11 ± 10% -0.0 0.09 ± 12% perf-profile.children.cycles-pp.shmem_get_policy
0.10 ± 5% -0.0 0.08 ± 13% perf-profile.children.cycles-pp.unlock_page
0.09 ± 11% -0.0 0.07 ± 14% perf-profile.children.cycles-pp._cond_resched
0.09 ± 5% -0.0 0.07 ± 17% perf-profile.children.cycles-pp.find_vma
0.25 ± 8% +0.2 0.42 ± 14% perf-profile.children.cycles-pp.mem_cgroup_uncharge_list
0.11 ± 8% +0.2 0.31 ± 15% perf-profile.children.cycles-pp.uncharge_page
0.15 ± 10% +0.2 0.38 ± 12% perf-profile.children.cycles-pp.lock_page_memcg
1.42 ± 9% +0.3 1.75 ± 9% perf-profile.children.cycles-pp.shmem_fault
0.57 ± 11% +0.4 0.93 ± 15% perf-profile.children.cycles-pp.mem_cgroup_charge_statistics
1.24 ± 9% +0.4 1.65 ± 9% perf-profile.children.cycles-pp.shmem_getpage_gfp
1.07 ± 9% +0.5 1.52 ± 10% perf-profile.children.cycles-pp.find_lock_entry
0.84 ± 9% +0.5 1.35 ± 10% perf-profile.children.cycles-pp.find_get_entry
1.70 ± 11% +0.7 2.39 ± 9% perf-profile.children.cycles-pp.native_irq_return_iret
1.28 ± 11% +0.8 2.11 ± 14% perf-profile.children.cycles-pp.__count_memcg_events
2.69 ± 9% +4.3 6.98 ± 14% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
7.06 ± 9% +10.2 17.28 ± 13% perf-profile.children.cycles-pp.mem_cgroup_charge
11.69 ± 13% -5.8 5.91 ± 8% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
2.90 ± 9% -0.6 2.29 ± 10% perf-profile.self.cycles-pp.testcase
2.03 ± 8% -0.5 1.55 ± 11% perf-profile.self.cycles-pp.zap_pte_range
1.07 ± 11% -0.4 0.63 ± 11% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
1.21 ± 12% -0.4 0.79 ± 11% perf-profile.self.cycles-pp.page_counter_try_charge
1.53 ± 10% -0.3 1.21 ± 9% perf-profile.self.cycles-pp.__list_del_entry_valid
1.03 ± 9% -0.2 0.79 ± 10% perf-profile.self.cycles-pp.__irqentry_text_end
1.05 ± 8% -0.2 0.82 ± 10% perf-profile.self.cycles-pp.free_pcppages_bulk
0.53 ± 10% -0.2 0.37 ± 12% perf-profile.self.cycles-pp.try_charge
0.43 ± 8% -0.1 0.33 ± 10% perf-profile.self.cycles-pp.free_pages_and_swap_cache
0.46 ± 10% -0.1 0.36 ± 10% perf-profile.self.cycles-pp.release_pages
0.46 ± 11% -0.1 0.36 ± 8% perf-profile.self.cycles-pp.__handle_mm_fault
0.36 ± 8% -0.1 0.29 ± 9% perf-profile.self.cycles-pp.xas_load
0.18 ± 10% -0.1 0.11 ± 11% perf-profile.self.cycles-pp.shmem_fault
0.27 ± 9% -0.1 0.20 ± 11% perf-profile.self.cycles-pp.page_remove_rmap
0.29 ± 9% -0.1 0.23 ± 9% perf-profile.self.cycles-pp.handle_mm_fault
0.14 ± 10% -0.1 0.09 ± 11% perf-profile.self.cycles-pp.page_add_new_anon_rmap
0.21 ± 8% -0.1 0.16 ± 4% perf-profile.self.cycles-pp.__mod_node_page_state
0.08 ± 17% -0.0 0.03 ± 82% perf-profile.self.cycles-pp.memcg_check_events
0.16 ± 9% -0.0 0.12 ± 9% perf-profile.self.cycles-pp.shmem_getpage_gfp
0.19 ± 8% -0.0 0.15 ± 9% perf-profile.self.cycles-pp.sync_regs
0.19 ± 8% -0.0 0.14 ± 11% perf-profile.self.cycles-pp.___perf_sw_event
0.17 ± 9% -0.0 0.13 ± 10% perf-profile.self.cycles-pp.do_user_addr_fault
0.12 ± 9% -0.0 0.09 ± 10% perf-profile.self.cycles-pp.find_lock_entry
0.08 ± 11% -0.0 0.05 ± 7% perf-profile.self.cycles-pp.mem_cgroup_page_lruvec
0.09 ± 25% -0.0 0.06 ± 12% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.08 ± 6% -0.0 0.06 ± 12% perf-profile.self.cycles-pp.get_task_policy
0.11 ± 8% +0.2 0.31 ± 15% perf-profile.self.cycles-pp.uncharge_page
0.15 ± 10% +0.2 0.38 ± 13% perf-profile.self.cycles-pp.lock_page_memcg
0.43 ± 9% +0.6 1.01 ± 10% perf-profile.self.cycles-pp.find_get_entry
1.69 ± 11% +0.7 2.39 ± 9% perf-profile.self.cycles-pp.native_irq_return_iret
1.28 ± 11% +0.8 2.11 ± 14% perf-profile.self.cycles-pp.__count_memcg_events
2.68 ± 9% +4.3 6.93 ± 14% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
1.84 ± 10% +6.2 8.01 ± 13% perf-profile.self.cycles-pp.mem_cgroup_charge



will-it-scale.per_process_ops

195000 +------------------------------------------------------------------+
190000 |-+ .+.+. .+.+. +.+. .+. .|
| +. + +.+ : +. .+ +. .+ |
185000 |-+ : : : +. + |
180000 |-+ : : : |
175000 |-+ : : : |
170000 |-+ : : : |
|.+. .+.+ +.+. .+.+.+.+. .+.+ |
165000 |-+ + +. + |
160000 |-+ |
155000 |-+ |
150000 |-+ |
| O O O O O O O O O O O O |
145000 |-O O O O O O O O O O O O O O |
140000 +------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen


Attachments:
(No filename) (31.60 kB)
config-5.9.0-02768-gbd0b230fe14554 (173.22 kB)
job-script (7.86 kB)
job.yaml (5.37 kB)
reproduce (353.00 B)

2020-11-02 09:29:56

by Michal Hocko

Subject: Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Mon 02-11-20 17:15:43, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -22.7% regression of will-it-scale.per_process_ops due to commit:
>
>
> commit: bd0b230fe14554bfffbae54e19038716f96f5a41 ("mm/memcg: unify swap and memsw page counters")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

I really fail to see how this can be anything else than a data structure
layout change. There is one counter less.

btw. are cgroups configured at all? What would be the configuration?
--
Michal Hocko
SUSE Labs

2020-11-02 09:56:00

by Chen, Rong A

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression



On 11/2/20 5:27 PM, Michal Hocko wrote:
> On Mon 02-11-20 17:15:43, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed a -22.7% regression of will-it-scale.per_process_ops due to commit:
>>
>>
>> commit: bd0b230fe14554bfffbae54e19038716f96f5a41 ("mm/memcg: unify swap and memsw page counters")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> I really fail to see how this can be anything else than a data structure
> layout change. There is one counter less.
>
> btw. are cgroups configured at all? What would be the configuration?

Hi Michal,

We used the default cgroup configuration; I'm not sure what configuration
you want, could you give me more details? Here is the cgroup info of the
will-it-scale process:

$ cat /proc/3042/cgroup
12:hugetlb:/
11:memory:/system.slice/lkp-bootstrap.service
10:devices:/system.slice/lkp-bootstrap.service
9:cpuset:/
8:perf_event:/
7:rdma:/
6:freezer:/
5:pids:/system.slice/lkp-bootstrap.service
4:net_cls,net_prio:/
3:blkio:/
2:cpu,cpuacct:/
1:name=systemd:/system.slice/lkp-bootstrap.service
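
If it helps, the full set of controllers enabled on the machine can
also be listed, e.g.:

$ cat /proc/cgroups

which prints each controller name, its hierarchy ID, the number of
cgroups, and whether it is enabled.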

Best Regards,
Rong Chen

2020-11-02 10:06:42

by Michal Hocko

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Mon 02-11-20 17:53:14, Rong Chen wrote:
>
>
> On 11/2/20 5:27 PM, Michal Hocko wrote:
> > On Mon 02-11-20 17:15:43, kernel test robot wrote:
> > > Greeting,
> > >
> > > FYI, we noticed a -22.7% regression of will-it-scale.per_process_ops due to commit:
> > >
> > >
> > > commit: bd0b230fe14554bfffbae54e19038716f96f5a41 ("mm/memcg: unify swap and memsw page counters")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > I really fail to see how this can be anything else than a data structure
> > layout change. There is one counter less.
> >
> > btw. are cgroups configured at all? What would be the configuration?
>
> Hi Michal,
>
> We used the default configure of cgroups, not sure what configuration you
> want,
> could you give me more details? and here is the cgroup info of will-it-scale
> process:
>
> $ cat /proc/3042/cgroup
> 12:hugetlb:/
> 11:memory:/system.slice/lkp-bootstrap.service

OK, this means that the memory controller is enabled and in use. Btw. do you
get the original performance back if you add one phony page_counter after the
union?
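
For illustration, a minimal sketch of such a debug change against the
struct mem_cgroup layout of that time (the phony field name below is
made up):

	struct page_counter memory;		/* Both v1 & v2 */

	union {
		struct page_counter swap;	/* v2 only */
		struct page_counter memsw;	/* v1 only */
	};

	/* phony counter, only to restore the previous structure layout */
	struct page_counter __layout_pad;

	/* Legacy consumer-oriented counters */
	struct page_counter kmem;		/* v1 only */
	struct page_counter tcpmem;		/* v1 only */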
--
Michal Hocko
SUSE Labs

2020-11-04 01:22:27

by Xing Zhengjun

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression



On 11/2/2020 6:02 PM, Michal Hocko wrote:
> On Mon 02-11-20 17:53:14, Rong Chen wrote:
>>
>>
>> On 11/2/20 5:27 PM, Michal Hocko wrote:
>>> On Mon 02-11-20 17:15:43, kernel test robot wrote:
>>>> Greeting,
>>>>
>>>> FYI, we noticed a -22.7% regression of will-it-scale.per_process_ops due to commit:
>>>>
>>>>
>>>> commit: bd0b230fe14554bfffbae54e19038716f96f5a41 ("mm/memcg: unify swap and memsw page counters")
>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>> I really fail to see how this can be anything else than a data structure
>>> layout change. There is one counter less.
>>>
>>> btw. are cgroups configured at all? What would be the configuration?
>>
>> Hi Michal,
>>
>> We used the default configure of cgroups, not sure what configuration you
>> want,
>> could you give me more details? and here is the cgroup info of will-it-scale
>> process:
>>
>> $ cat /proc/3042/cgroup
>> 12:hugetlb:/
>> 11:memory:/system.slice/lkp-bootstrap.service
>
> OK, this means that memory controler is enabled and in use. Btw. do you
> get the original performance if you add one phony page_counter after the
> union?
>
I added one phony page_counter after the union and re-tested; the regression
is reduced to -1.2%. It looks like the regression is caused by the data
structure layout change.

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_governor/ucode/debug-setup:

lkp-hsw-4ex1/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/50%/process/page_fault2/performance/0x16/test1

commit:
8d387a5f172f26ff8c76096d5876b881dec6b7ce
bd0b230fe14554bfffbae54e19038716f96f5a41
b3233916ab0a883e1117397e28b723bd0e4ac1eb (debug patch adding one phony
page_counter after the union)

8d387a5f172f26ff bd0b230fe14554bfffbae54e190 b3233916ab0a883e1117397e28b
---------------- --------------------------- ---------------------------
         %stddev      %change        %stddev      %change        %stddev
             \            |              \            |              \
      187632           -22.8%       144931            -1.2%       185391    will-it-scale.per_process_ops
    13509525           -22.8%     10435073            -1.2%     13348181    will-it-scale.workload



--
Zhengjun Xing

2020-11-04 02:49:13

by Waiman Long

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On 11/3/20 8:20 PM, Xing Zhengjun wrote:
>
>
> On 11/2/2020 6:02 PM, Michal Hocko wrote:
>> On Mon 02-11-20 17:53:14, Rong Chen wrote:
>>>
>>>
>>> On 11/2/20 5:27 PM, Michal Hocko wrote:
>>>> On Mon 02-11-20 17:15:43, kernel test robot wrote:
>>>>> Greeting,
>>>>>
>>>>> FYI, we noticed a -22.7% regression of
>>>>> will-it-scale.per_process_ops due to commit:
>>>>>
>>>>>
>>>>> commit: bd0b230fe14554bfffbae54e19038716f96f5a41 ("mm/memcg: unify
>>>>> swap and memsw page counters")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git
>>>>> master
>>>> I really fail to see how this can be anything else than a data
>>>> structure
>>>> layout change. There is one counter less.
>>>>
>>>> btw. are cgroups configured at all? What would be the configuration?
>>>
>>> Hi Michal,
>>>
>>> We used the default configure of cgroups, not sure what
>>> configuration you
>>> want,
>>> could you give me more details? and here is the cgroup info of
>>> will-it-scale
>>> process:
>>>
>>> $ cat /proc/3042/cgroup
>>> 12:hugetlb:/
>>> 11:memory:/system.slice/lkp-bootstrap.service
>>
>> OK, this means that memory controler is enabled and in use. Btw. do you
>> get the original performance if you add one phony page_counter after the
>> union?
>>
> I add one phony page_counter after the union and re-test, the
> regression reduced to -1.2%. It looks like the regression caused by
> the data structure layout change.

So it looks like the regression is caused by false cacheline sharing of
two or more hot items in mem_cgroup. As the size of the page_counter is
112 bytes, eliminating one counter will shift down the cacheline
boundary by 16 bytes. We probably need to use perf to find out what
those hot items are for this particular benchmark.
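
As a toy illustration of that shift, the following userspace snippet
(stand-in structs, not the real mem_cgroup; it assumes a 112-byte
counter and 64-byte cachelines) shows how dropping one counter moves
whatever follows onto a different cacheline:

#include <stdio.h>
#include <stddef.h>

/* stand-in for a page_counter-sized member */
struct counter { unsigned long v[14]; };	/* 112 bytes on x86-64 */

/* before: separate swap and memsw; after: the two share a union */
struct before { struct counter memory, swap, memsw, kmem; long hot_field; };
struct after  { struct counter memory, swap_or_memsw, kmem; long hot_field; };

int main(void)
{
	printf("before: hot_field at offset %zu, cacheline %zu\n",
	       offsetof(struct before, hot_field),
	       offsetof(struct before, hot_field) / 64);
	printf("after:  hot_field at offset %zu, cacheline %zu\n",
	       offsetof(struct after, hot_field),
	       offsetof(struct after, hot_field) / 64);
	return 0;
}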

Cheers,
Longman

2020-11-04 08:18:35

by Michal Hocko

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Wed 04-11-20 09:20:04, Xing Zhengjun wrote:
>
>
> On 11/2/2020 6:02 PM, Michal Hocko wrote:
> > On Mon 02-11-20 17:53:14, Rong Chen wrote:
> > >
> > >
> > > On 11/2/20 5:27 PM, Michal Hocko wrote:
> > > > On Mon 02-11-20 17:15:43, kernel test robot wrote:
> > > > > Greeting,
> > > > >
> > > > > FYI, we noticed a -22.7% regression of will-it-scale.per_process_ops due to commit:
> > > > >
> > > > >
> > > > > commit: bd0b230fe14554bfffbae54e19038716f96f5a41 ("mm/memcg: unify swap and memsw page counters")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > I really fail to see how this can be anything else than a data structure
> > > > layout change. There is one counter less.
> > > >
> > > > btw. are cgroups configured at all? What would be the configuration?
> > >
> > > Hi Michal,
> > >
> > > We used the default configure of cgroups, not sure what configuration you
> > > want,
> > > could you give me more details? and here is the cgroup info of will-it-scale
> > > process:
> > >
> > > $ cat /proc/3042/cgroup
> > > 12:hugetlb:/
> > > 11:memory:/system.slice/lkp-bootstrap.service
> >
> > OK, this means that memory controler is enabled and in use. Btw. do you
> > get the original performance if you add one phony page_counter after the
> > union?
> >
> I add one phony page_counter after the union and re-test, the regression
> reduced to -1.2%. It looks like the regression caused by the data structure
> layout change.

Thanks for double-checking. Could you try to cache align the
page_counter struct? If that helps then we should figure out which counters
bounce against each other by adding the alignment between the respective
counters.

> =========================================================================================
> tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_governor/ucode/debug-setup:
>
> lkp-hsw-4ex1/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/50%/process/page_fault2/performance/0x16/test1
>
> commit:
> 8d387a5f172f26ff8c76096d5876b881dec6b7ce
> bd0b230fe14554bfffbae54e19038716f96f5a41
> b3233916ab0a883e1117397e28b723bd0e4ac1eb (debug patch add one phony
> page_counter after the union)
>
> 8d387a5f172f26ff bd0b230fe14554bfffbae54e190 b3233916ab0a883e1117397e28b
> ---------------- --------------------------- ---------------------------
> %stddev %change %stddev %change %stddev
> \ | \ | \
> 187632 -22.8% 144931 -1.2% 185391
> will-it-scale.per_process_ops
> 13509525 -22.8% 10435073 -1.2% 13348181
> will-it-scale.workload
>
>
>
> --
> Zhengjun Xing

--
Michal Hocko
SUSE Labs

2020-11-12 12:31:26

by Feng Tang

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

Hi Michal,

On Wed, Nov 04, 2020 at 09:15:46AM +0100, Michal Hocko wrote:
> > > > Hi Michal,
> > > >
> > > > We used the default configure of cgroups, not sure what configuration you
> > > > want,
> > > > could you give me more details? and here is the cgroup info of will-it-scale
> > > > process:
> > > >
> > > > $ cat /proc/3042/cgroup
> > > > 12:hugetlb:/
> > > > 11:memory:/system.slice/lkp-bootstrap.service
> > >
> > > OK, this means that memory controler is enabled and in use. Btw. do you
> > > get the original performance if you add one phony page_counter after the
> > > union?
> > >
> > I add one phony page_counter after the union and re-test, the regression
> > reduced to -1.2%. It looks like the regression caused by the data structure
> > layout change.
>
> Thanks for double checking. Could you try to cache align the
> page_counter struct? If that helps then we should figure which counters
> acks against each other by adding the alignement between the respective
> counters.

We tried the patch below to make 'page_counter' cacheline aligned.

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index bab7e57..9efa6f7 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -26,7 +26,7 @@ struct page_counter {
 	/* legacy */
 	unsigned long watermark;
 	unsigned long failcnt;
-};
+} ____cacheline_internodealigned_in_smp;

and with it, the -22.7% performance change turns into a small -1.7%, which
confirms that the regression is caused by the change in data alignment.

After the patch, the size of 'page_counter' increases from 104 bytes to 128
bytes, and the size of 'mem_cgroup' increases from 2880 bytes to 3008
bytes (with our kernel config). Another major data structure which
contains 'page_counter' is 'hugetlb_cgroup', whose size will change
from 912B to 1024B.

Should we make these page_counters aligned to reduce cacheline conflict?

Thanks,
Feng

2020-11-12 14:19:40

by Michal Hocko

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Thu 12-11-20 20:28:44, Feng Tang wrote:
> Hi Michal,
>
> On Wed, Nov 04, 2020 at 09:15:46AM +0100, Michal Hocko wrote:
> > > > > Hi Michal,
> > > > >
> > > > > We used the default configure of cgroups, not sure what configuration you
> > > > > want,
> > > > > could you give me more details? and here is the cgroup info of will-it-scale
> > > > > process:
> > > > >
> > > > > $ cat /proc/3042/cgroup
> > > > > 12:hugetlb:/
> > > > > 11:memory:/system.slice/lkp-bootstrap.service
> > > >
> > > > OK, this means that memory controler is enabled and in use. Btw. do you
> > > > get the original performance if you add one phony page_counter after the
> > > > union?
> > > >
> > > I add one phony page_counter after the union and re-test, the regression
> > > reduced to -1.2%. It looks like the regression caused by the data structure
> > > layout change.
> >
> > Thanks for double checking. Could you try to cache align the
> > page_counter struct? If that helps then we should figure which counters
> > acks against each other by adding the alignement between the respective
> > counters.
>
> We tried below patch to make the 'page_counter' aligned.
>
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index bab7e57..9efa6f7 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -26,7 +26,7 @@ struct page_counter {
> /* legacy */
> unsigned long watermark;
> unsigned long failcnt;
> -};
> +} ____cacheline_internodealigned_in_smp;
>
> and with it, the -22.7% peformance change turns to a small -1.7%, which
> confirms the performance bump is caused by the change to data alignment.
>
> After the patch, size of 'page_counter' increases from 104 bytes to 128
> bytes, and the size of 'mem_cgroup' increases from 2880 bytes to 3008
> bytes(with our kernel config). Another major data structure which
> contains 'page_counter' is 'hugetlb_cgroup', whose size will change
> from 912B to 1024B.
>
> Should we make these page_counters aligned to reduce cacheline conflict?

I would rather focus on a more effective mem_cgroup layout. It is very
likely that we are just stumbling over two counters here.

Could you try to add cache alignment of counters after memory and see
which one makes the difference? I do not expect memsw to be the one
because that one is used together with the main counter. But who knows
maybe the way it crosses the cache line has the exact effect. Hard to
tell without other numbers.
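
One way to run that experiment is to move a per-member alignment
annotation from counter to counter, e.g. (illustrative sketch, using
____cacheline_aligned_in_smp from <linux/cache.h>):

	struct page_counter memory;		/* Both v1 & v2 */

	union {
		struct page_counter swap;	/* v2 only */
		struct page_counter memsw;	/* v1 only */
	};

	/* move the annotation between the counters to see which one matters */
	struct page_counter kmem ____cacheline_aligned_in_smp;	/* v1 only */
	struct page_counter tcpmem;		/* v1 only */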

Btw. it would be great to see what the effect is on cgroup v2 as well.

Thanks for pursuing this!
--
Michal Hocko
SUSE Labs

2020-11-12 16:47:54

by Waiman Long

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On 11/12/20 9:16 AM, Michal Hocko wrote:
> On Thu 12-11-20 20:28:44, Feng Tang wrote:
>> Hi Michal,
>>
>> On Wed, Nov 04, 2020 at 09:15:46AM +0100, Michal Hocko wrote:
>>>>>> Hi Michal,
>>>>>>
>>>>>> We used the default configure of cgroups, not sure what configuration you
>>>>>> want,
>>>>>> could you give me more details? and here is the cgroup info of will-it-scale
>>>>>> process:
>>>>>>
>>>>>> $ cat /proc/3042/cgroup
>>>>>> 12:hugetlb:/
>>>>>> 11:memory:/system.slice/lkp-bootstrap.service
>>>>> OK, this means that memory controler is enabled and in use. Btw. do you
>>>>> get the original performance if you add one phony page_counter after the
>>>>> union?
>>>>>
>>>> I add one phony page_counter after the union and re-test, the regression
>>>> reduced to -1.2%. It looks like the regression caused by the data structure
>>>> layout change.
>>> Thanks for double checking. Could you try to cache align the
>>> page_counter struct? If that helps then we should figure which counters
>>> acks against each other by adding the alignement between the respective
>>> counters.
>> We tried below patch to make the 'page_counter' aligned.
>>
>> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
>> index bab7e57..9efa6f7 100644
>> --- a/include/linux/page_counter.h
>> +++ b/include/linux/page_counter.h
>> @@ -26,7 +26,7 @@ struct page_counter {
>> /* legacy */
>> unsigned long watermark;
>> unsigned long failcnt;
>> -};
>> +} ____cacheline_internodealigned_in_smp;
>>
>> and with it, the -22.7% peformance change turns to a small -1.7%, which
>> confirms the performance bump is caused by the change to data alignment.
>>
>> After the patch, size of 'page_counter' increases from 104 bytes to 128
>> bytes, and the size of 'mem_cgroup' increases from 2880 bytes to 3008
>> bytes(with our kernel config). Another major data structure which
>> contains 'page_counter' is 'hugetlb_cgroup', whose size will change
>> from 912B to 1024B.
>>
>> Should we make these page_counters aligned to reduce cacheline conflict?
> I would rather focus on a more effective mem_cgroup layout. It is very
> likely that we are just stumbling over two counters here.
>
> Could you try to add cache alignment of counters after memory and see
> which one makes the difference? I do not expect memsw to be the one
> because that one is used together with the main counter. But who knows
> maybe the way it crosses the cache line has the exact effect. Hard to
> tell without other numbers.
>
> Btw. it would be great to see what the effect is on cgroup v2 as well.
>
> Thanks for pursuing this!

The contention may be in the page counters themselves, or it can be in
other fields below the page counters. The cacheline alignment will cause
"high_work", just after the page counters, to start at a cacheline
boundary. I will try removing the cacheline alignment in the page
counter and adding it to high_work to see if there is any change in
performance. If there is no change, the performance problem will not be
in the page counters.
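
In other words, something along these lines (a sketch, assuming
high_work is still the first field after the counters):

	struct page_counter kmem;		/* v1 only */
	struct page_counter tcpmem;		/* v1 only */

	/* push the post-counter fields onto their own cacheline instead */
	struct work_struct high_work ____cacheline_aligned_in_smp;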

Cheers,
Longman

2020-11-13 07:39:16

by Feng Tang

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Thu, Nov 12, 2020 at 03:16:54PM +0100, Michal Hocko wrote:
> On Thu 12-11-20 20:28:44, Feng Tang wrote:
> > Hi Michal,
> >
> > On Wed, Nov 04, 2020 at 09:15:46AM +0100, Michal Hocko wrote:
> > > > > > Hi Michal,
> > > > > >
> > > > > > We used the default configure of cgroups, not sure what configuration you
> > > > > > want,
> > > > > > could you give me more details? and here is the cgroup info of will-it-scale
> > > > > > process:
> > > > > >
> > > > > > $ cat /proc/3042/cgroup
> > > > > > 12:hugetlb:/
> > > > > > 11:memory:/system.slice/lkp-bootstrap.service
> > > > >
> > > > > OK, this means that memory controler is enabled and in use. Btw. do you
> > > > > get the original performance if you add one phony page_counter after the
> > > > > union?
> > > > >
> > > > I add one phony page_counter after the union and re-test, the regression
> > > > reduced to -1.2%. It looks like the regression caused by the data structure
> > > > layout change.
> > >
> > > Thanks for double checking. Could you try to cache align the
> > > page_counter struct? If that helps then we should figure which counters
> > > acks against each other by adding the alignement between the respective
> > > counters.
> >
> > We tried below patch to make the 'page_counter' aligned.
> >
> > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > index bab7e57..9efa6f7 100644
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -26,7 +26,7 @@ struct page_counter {
> > /* legacy */
> > unsigned long watermark;
> > unsigned long failcnt;
> > -};
> > +} ____cacheline_internodealigned_in_smp;
> >
> > and with it, the -22.7% peformance change turns to a small -1.7%, which
> > confirms the performance bump is caused by the change to data alignment.
> >
> > After the patch, size of 'page_counter' increases from 104 bytes to 128
> > bytes, and the size of 'mem_cgroup' increases from 2880 bytes to 3008
> > bytes(with our kernel config). Another major data structure which
> > contains 'page_counter' is 'hugetlb_cgroup', whose size will change
> > from 912B to 1024B.
> >
> > Should we make these page_counters aligned to reduce cacheline conflict?
>
> I would rather focus on a more effective mem_cgroup layout. It is very
> likely that we are just stumbling over two counters here.
>
> Could you try to add cache alignment of counters after memory and see
> which one makes the difference? I do not expect memsw to be the one
> because that one is used together with the main counter. But who knows
> maybe the way it crosses the cache line has the exact effect. Hard to
> tell without other numbers.

I added some alignment changes around 'memsw', but none of them can
restore the -22.7%. Following is some log output showing how the
alignments are:

t1: memcg=0x7cd1000 memory=0x7cd10d0 memsw=0x7cd1140 kmem=0x7cd11b0 tcpmem=0x7cd1220
t2: memcg=0x7cd0000 memory=0x7cd00d0 memsw=0x7cd0140 kmem=0x7cd01c0 tcpmem=0x7cd0230

So both 'memsw' counters are aligned, but t2's 'kmem' is aligned while
t1's is not.
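
For reference, a debug print of roughly this shape can produce such a
log (a sketch; not the exact code or format used):

	pr_info("memcg=%px memory=%px memsw=%px kmem=%px tcpmem=%px\n",
		memcg, &memcg->memory, &memcg->memsw,
		&memcg->kmem, &memcg->tcpmem);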

I will check more on the perf data about detailed hotspots.

Thanks,
Feng

> Btw. it would be great to see what the effect is on cgroup v2 as well.
>
> Thanks for pursuing this!
> --
> Michal Hocko
> SUSE Labs

2020-11-13 07:41:13

by Feng Tang

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Thu, Nov 12, 2020 at 11:43:45AM -0500, Waiman Long wrote:
> >>We tried below patch to make the 'page_counter' aligned.
> >> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> >> index bab7e57..9efa6f7 100644
> >> --- a/include/linux/page_counter.h
> >> +++ b/include/linux/page_counter.h
> >> @@ -26,7 +26,7 @@ struct page_counter {
> >> /* legacy */
> >> unsigned long watermark;
> >> unsigned long failcnt;
> >> -};
> >> +} ____cacheline_internodealigned_in_smp;
> >>and with it, the -22.7% peformance change turns to a small -1.7%, which
> >>confirms the performance bump is caused by the change to data alignment.
> >>
> >>After the patch, size of 'page_counter' increases from 104 bytes to 128
> >>bytes, and the size of 'mem_cgroup' increases from 2880 bytes to 3008
> >>bytes(with our kernel config). Another major data structure which
> >>contains 'page_counter' is 'hugetlb_cgroup', whose size will change
> >>from 912B to 1024B.
> >>
> >>Should we make these page_counters aligned to reduce cacheline conflict?
> >I would rather focus on a more effective mem_cgroup layout. It is very
> >likely that we are just stumbling over two counters here.
> >
> >Could you try to add cache alignment of counters after memory and see
> >which one makes the difference? I do not expect memsw to be the one
> >because that one is used together with the main counter. But who knows
> >maybe the way it crosses the cache line has the exact effect. Hard to
> >tell without other numbers.
> >
> >Btw. it would be great to see what the effect is on cgroup v2 as well.
> >
> >Thanks for pursuing this!
>
> The contention may be in the page counters themselves or it can be in other
> fields below the page counters. The cacheline alignment will cause
> "high_work" just after the page counters to start at a cacheline boundary. I
> will try removing the cacheline alignment in the page counter and add it to
> high_work to see there is any change in performance. If there is no change,
> the performance problem will not be in the page counters.

Yes, that's a good spot to check. I even suspect it could be other members
of 'struct mem_cgroup' that affect the benchmark, as we've seen some other
performance bumps which are possibly related to it too.

Thanks,
Feng

2020-11-20 11:46:25

by Feng Tang

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Fri, Nov 13, 2020 at 03:34:36PM +0800, Feng Tang wrote:
> On Thu, Nov 12, 2020 at 03:16:54PM +0100, Michal Hocko wrote:
> > > > > I add one phony page_counter after the union and re-test, the regression
> > > > > reduced to -1.2%. It looks like the regression caused by the data structure
> > > > > layout change.
> > > >
> > > > Thanks for double checking. Could you try to cache align the
> > > > page_counter struct? If that helps then we should figure which counters
> > > > acks against each other by adding the alignement between the respective
> > > > counters.
> > >
> > > We tried below patch to make the 'page_counter' aligned.
> > >
> > > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > > index bab7e57..9efa6f7 100644
> > > --- a/include/linux/page_counter.h
> > > +++ b/include/linux/page_counter.h
> > > @@ -26,7 +26,7 @@ struct page_counter {
> > > /* legacy */
> > > unsigned long watermark;
> > > unsigned long failcnt;
> > > -};
> > > +} ____cacheline_internodealigned_in_smp;
> > >
> > > and with it, the -22.7% peformance change turns to a small -1.7%, which
> > > confirms the performance bump is caused by the change to data alignment.
> > >
> > > After the patch, size of 'page_counter' increases from 104 bytes to 128
> > > bytes, and the size of 'mem_cgroup' increases from 2880 bytes to 3008
> > > bytes(with our kernel config). Another major data structure which
> > > contains 'page_counter' is 'hugetlb_cgroup', whose size will change
> > > from 912B to 1024B.
> > >
> > > Should we make these page_counters aligned to reduce cacheline conflict?
> >
> > I would rather focus on a more effective mem_cgroup layout. It is very
> > likely that we are just stumbling over two counters here.
> >
> > Could you try to add cache alignment of counters after memory and see
> > which one makes the difference? I do not expect memsw to be the one
> > because that one is used together with the main counter. But who knows
> > maybe the way it crosses the cache line has the exact effect. Hard to
> > tell without other numbers.
>
> I added some alignments change around the 'memsw', but neither of them can
> restore the -22.7%. Following are some log showing how the alignments
> are:
>
> tl: memcg=0x7cd1000 memory=0x7cd10d0 memsw=0x7cd1140 kmem=0x7cd11b0 tcpmem=0x7cd1220
> t2: memcg=0x7cd0000 memory=0x7cd00d0 memsw=0x7cd0140 kmem=0x7cd01c0 tcpmem=0x7cd0230
>
> So both of the 'memsw' are aligned, but t2's 'kmem' is aligned while
> t1's is not.
>
> I will check more on the perf data about detailed hotspots.

Some more updates from checking on this:

Waiman's patch effectively removes one 'struct page_counter' between
'memory' and 'memsw'. And the mem_cgroup layout is:

struct mem_cgroup {

	...

	struct page_counter memory;		/* Both v1 & v2 */

	union {
		struct page_counter swap;	/* v2 only */
		struct page_counter memsw;	/* v1 only */
	};

	/* Legacy consumer-oriented counters */
	struct page_counter kmem;		/* v1 only */
	struct page_counter tcpmem;		/* v1 only */

	...
	...

	MEMCG_PADDING(_pad1_);

	atomic_t moving_account;
	struct task_struct *move_lock_task;

	...
};


I did experiments inserting a 'page_counter' between 'memory'
and 'MEMCG_PADDING(_pad1_)'; no matter where I put it, the
benchmark result recovers from 145K to 185K, which is
really confusing, as adding a 'page_counter' right before
'_pad1_' doesn't change the cache alignment of any members.

Thanks,
Feng



2020-11-20 13:21:49

by Michal Hocko

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Fri 20-11-20 19:44:24, Feng Tang wrote:
> On Fri, Nov 13, 2020 at 03:34:36PM +0800, Feng Tang wrote:
> > On Thu, Nov 12, 2020 at 03:16:54PM +0100, Michal Hocko wrote:
> > > > > > I add one phony page_counter after the union and re-test, the regression
> > > > > > reduced to -1.2%. It looks like the regression caused by the data structure
> > > > > > layout change.
> > > > >
> > > > > Thanks for double checking. Could you try to cache align the
> > > > > page_counter struct? If that helps then we should figure which counters
> > > > > acks against each other by adding the alignement between the respective
> > > > > counters.
> > > >
> > > > We tried below patch to make the 'page_counter' aligned.
> > > >
> > > > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > > > index bab7e57..9efa6f7 100644
> > > > --- a/include/linux/page_counter.h
> > > > +++ b/include/linux/page_counter.h
> > > > @@ -26,7 +26,7 @@ struct page_counter {
> > > > /* legacy */
> > > > unsigned long watermark;
> > > > unsigned long failcnt;
> > > > -};
> > > > +} ____cacheline_internodealigned_in_smp;
> > > >
> > > > and with it, the -22.7% peformance change turns to a small -1.7%, which
> > > > confirms the performance bump is caused by the change to data alignment.
> > > >
> > > > After the patch, size of 'page_counter' increases from 104 bytes to 128
> > > > bytes, and the size of 'mem_cgroup' increases from 2880 bytes to 3008
> > > > bytes(with our kernel config). Another major data structure which
> > > > contains 'page_counter' is 'hugetlb_cgroup', whose size will change
> > > > from 912B to 1024B.
> > > >
> > > > Should we make these page_counters aligned to reduce cacheline conflict?
> > >
> > > I would rather focus on a more effective mem_cgroup layout. It is very
> > > likely that we are just stumbling over two counters here.
> > >
> > > Could you try to add cache alignment of counters after memory and see
> > > which one makes the difference? I do not expect memsw to be the one
> > > because that one is used together with the main counter. But who knows
> > > maybe the way it crosses the cache line has the exact effect. Hard to
> > > tell without other numbers.
> >
> > I added some alignments change around the 'memsw', but neither of them can
> > restore the -22.7%. Following are some log showing how the alignments
> > are:
> >
> > tl: memcg=0x7cd1000 memory=0x7cd10d0 memsw=0x7cd1140 kmem=0x7cd11b0 tcpmem=0x7cd1220
> > t2: memcg=0x7cd0000 memory=0x7cd00d0 memsw=0x7cd0140 kmem=0x7cd01c0 tcpmem=0x7cd0230
> >
> > So both of the 'memsw' are aligned, but t2's 'kmem' is aligned while
> > t1's is not.
> >
> > I will check more on the perf data about detailed hotspots.
>
> Some more check updates about it:
>
> Waiman's patch is effectively removing one 'struct page_counter' between
> 'memory' and "memsw'. And the mem_cgroup is:
>
> struct mem_cgroup {
>
> ...
>
> struct page_counter memory; /* Both v1 & v2 */
>
> union {
> struct page_counter swap; /* v2 only */
> struct page_counter memsw; /* v1 only */
> };
>
> /* Legacy consumer-oriented counters */
> struct page_counter kmem; /* v1 only */
> struct page_counter tcpmem; /* v1 only */
>
> ...
> ...
>
> MEMCG_PADDING(_pad1_);
>
> atomic_t moving_account;
> struct task_struct *move_lock_task;
>
> ...
> };
>
>
> I do experiments by inserting a 'page_counter' between 'memory'
> and the 'MEMCG_PADDING(_pad1_)', no matter where I put it, the
> benchmark result can be recovered from 145K to 185K, which is
> really confusing, as adding a 'page_counter' right before the
> '_pad1_' doesn't change cache alignment of any members.

Have you checked the result of pahole before and after your modification
to see whether something stands out?
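
For example, something like the following against the two vmlinux
builds (built with debug info; the file names here are arbitrary):

$ pahole -C mem_cgroup vmlinux.before > layout.before
$ pahole -C mem_cgroup vmlinux.after > layout.after
$ diff -u layout.before layout.after

pahole prints each member's offset and size together with the
cacheline boundaries, so a shifted hot field should stand out in the
diff.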

Btw. is this reproducible on different CPU models?
--
Michal Hocko
SUSE Labs

2020-11-20 14:34:19

by Feng Tang

Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Fri, Nov 20, 2020 at 02:19:44PM +0100, Michal Hocko wrote:
> On Fri 20-11-20 19:44:24, Feng Tang wrote:
> > On Fri, Nov 13, 2020 at 03:34:36PM +0800, Feng Tang wrote:
> > > On Thu, Nov 12, 2020 at 03:16:54PM +0100, Michal Hocko wrote:
> > > > > > > I added one phony page_counter after the union and re-tested; the regression
> > > > > > > reduced to -1.2%. It looks like the regression is caused by the data structure
> > > > > > > layout change.
> > > > > >
> > > > > > Thanks for double checking. Could you try to cache align the
> > > > > > page_counter struct? If that helps then we should figure out which counters
> > > > > > act against each other by adding the alignment between the respective
> > > > > > counters.
> > > > >
> > > > > We tried below patch to make the 'page_counter' aligned.
> > > > >
> > > > > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > > > > index bab7e57..9efa6f7 100644
> > > > > --- a/include/linux/page_counter.h
> > > > > +++ b/include/linux/page_counter.h
> > > > > @@ -26,7 +26,7 @@ struct page_counter {
> > > > > /* legacy */
> > > > > unsigned long watermark;
> > > > > unsigned long failcnt;
> > > > > -};
> > > > > +} ____cacheline_internodealigned_in_smp;
> > > > >
> > > > > and with it, the -22.7% performance change turns into a small -1.7%, which
> > > > > confirms the performance drop is caused by the change in data alignment.
> > > > >
> > > > > After the patch, size of 'page_counter' increases from 104 bytes to 128
> > > > > bytes, and the size of 'mem_cgroup' increases from 2880 bytes to 3008
> > > > > bytes (with our kernel config). Another major data structure which
> > > > > contains 'page_counter' is 'hugetlb_cgroup', whose size will change
> > > > > from 912B to 1024B.
> > > > >
> > > > > Should we make these page_counters aligned to reduce cacheline conflict?
> > > >
> > > > I would rather focus on a more effective mem_cgroup layout. It is very
> > > > likely that we are just stumbling over two counters here.
> > > >
> > > > Could you try to add cache alignment of counters after memory and see
> > > > which one makes the difference? I do not expect memsw to be the one
> > > > because that one is used together with the main counter. But who knows
> > > > maybe the way it crosses the cache line has the exact effect. Hard to
> > > > tell without other numbers.
> > >
> > > I added some alignment changes around 'memsw', but neither of them can
> > > restore the -22.7%. Following are some logs showing the resulting
> > > alignments:
> > >
> > > t1: memcg=0x7cd1000 memory=0x7cd10d0 memsw=0x7cd1140 kmem=0x7cd11b0 tcpmem=0x7cd1220
> > > t2: memcg=0x7cd0000 memory=0x7cd00d0 memsw=0x7cd0140 kmem=0x7cd01c0 tcpmem=0x7cd0230
> > >
> > > So both of the 'memsw' are aligned, but t2's 'kmem' is aligned while
> > > t1's is not.
> > >
> > > I will check more on the perf data about detailed hotspots.
> >
> > Some more check updates about it:
> >
> > Waiman's patch is effectively removing one 'struct page_counter' between
> > 'memory' and 'memsw'. And the mem_cgroup is:
> >
> > struct mem_cgroup {
> >
> > ...
> >
> > struct page_counter memory; /* Both v1 & v2 */
> >
> > union {
> > struct page_counter swap; /* v2 only */
> > struct page_counter memsw; /* v1 only */
> > };
> >
> > /* Legacy consumer-oriented counters */
> > struct page_counter kmem; /* v1 only */
> > struct page_counter tcpmem; /* v1 only */
> >
> > ...
> > ...
> >
> > MEMCG_PADDING(_pad1_);
> >
> > atomic_t moving_account;
> > struct task_struct *move_lock_task;
> >
> > ...
> > };
> >
> >
> > I did experiments inserting a 'page_counter' between 'memory'
> > and 'MEMCG_PADDING(_pad1_)'; no matter where I put it, the
> > benchmark result recovers from 145K to 185K, which is
> > really confusing, as adding a 'page_counter' right before
> > '_pad1_' doesn't change the cache alignment of any member.
>
> Have you checked the pahole output before and after your modification
> to see whether something stands out?

I cannot find anything abnormal. (I attached pahole logs for 2 kernels:
one whose head commit is Waiman's patch, and one that adds a page_counter
before '_pad1_'.)

> Btw. is this reproducible on different CPU models?

This is a Haswell 4S platform. I've also tried Cascade Lake 2S and 4S,
which show -7.7% and -4.2% regressions respectively, though the perf data
shows a similar trend.

Thanks,
Feng

>
> --
> Michal Hocko
> SUSE Labs

2020-11-25 06:29:24

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Fri, Nov 20, 2020 at 07:44:24PM +0800, Feng Tang wrote:
> On Fri, Nov 13, 2020 at 03:34:36PM +0800, Feng Tang wrote:
> > > I would rather focus on a more effective mem_cgroup layout. It is very
> > > likely that we are just stumbling over two counters here.
> > >
> > > Could you try to add cache alignment of counters after memory and see
> > > which one makes the difference? I do not expect memsw to be the one
> > > because that one is used together with the main counter. But who knows
> > > maybe the way it crosses the cache line has the exact effect. Hard to
> > > tell without other numbers.
> >
> > I added some alignment changes around 'memsw', but neither of them can
> > restore the -22.7%. Following are some logs showing the resulting
> > alignments:
> >
> > t1: memcg=0x7cd1000 memory=0x7cd10d0 memsw=0x7cd1140 kmem=0x7cd11b0 tcpmem=0x7cd1220
> > t2: memcg=0x7cd0000 memory=0x7cd00d0 memsw=0x7cd0140 kmem=0x7cd01c0 tcpmem=0x7cd0230
> >
> > So both of the 'memsw' are aligned, but t2's 'kmem' is aligned while
> > t1's is not.
> >
> > I will check more on the perf data about detailed hotspots.
>
> Some more check updates about it:
>
> Waiman's patch is effectively removing one 'struct page_counter' between
> 'memory' and 'memsw'. And the mem_cgroup is:
>
> struct mem_cgroup {
>
> ...
>
> struct page_counter memory; /* Both v1 & v2 */
>
> union {
> struct page_counter swap; /* v2 only */
> struct page_counter memsw; /* v1 only */
> };
>
> /* Legacy consumer-oriented counters */
> struct page_counter kmem; /* v1 only */
> struct page_counter tcpmem; /* v1 only */
>
> ...
> ...
>
> MEMCG_PADDING(_pad1_);
>
> atomic_t moving_account;
> struct task_struct *move_lock_task;
>
> ...
> };
>
>
> I did experiments inserting a 'page_counter' between 'memory'
> and 'MEMCG_PADDING(_pad1_)'; no matter where I put it, the
> benchmark result recovers from 145K to 185K, which is
> really confusing, as adding a 'page_counter' right before
> '_pad1_' doesn't change the cache alignment of any member.

I think we finally found the trick :). Further debugging shows it
is not related to the alignment inside one cacheline, but to the
pairing of 2 adjacent cachelines (2N and 2N+1, i.e. one 128-byte pair).

For struct mem_cgroup, the members 'vmstats_local' and 'vmstats_percpu'
sit in one cacheline, while 'vmstats[]' sits in the next cacheline.
When 'adjacent cacheline prefetch' is enabled, if these 2 lines sit
in one pair (128 bytes), say 2N and 2N+1, then there seems to be
some kind of false sharing; if they sit in 2 different pairs, say
2N-1 and 2N, then it's fine.
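
To make the pairing concrete, here is a minimal standalone sketch (not part
of any patch in this thread; the member offsets are made up, and the 64-byte
lines grouped into 128-byte prefetch pairs are an assumption matching the
description above):

#include <stdio.h>

#define CACHE_LINE	64UL	/* assumed cache line size */
#define PREFETCH_PAIR	128UL	/* two adjacent lines fetched together */

/* Map a byte offset inside a structure to its cache line and line pair. */
static void report(const char *member, unsigned long offset)
{
	printf("%-16s offset=%4lu  line=%2lu  pair=%2lu\n",
	       member, offset, offset / CACHE_LINE, offset / PREFETCH_PAIR);
}

int main(void)
{
	/* hypothetical offsets, not the real struct mem_cgroup layout */
	report("vmstats_local",  1280);	/* line 20 (2N),   pair 10 */
	report("vmstats_percpu", 1288);	/* line 20 (2N),   pair 10 */
	report("vmstats[]",      1344);	/* line 21 (2N+1), pair 10: shared pair */
	report("vmstats[] alt",  1408);	/* line 22,        pair 11: separate pair */
	return 0;
}

With offsets like these, the first three members share one 128-byte pair
(the problematic case), while the alternative placement lands in a separate
pair.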

And with the following patch to relayout these members, the lost
performance is restored and even improved, while reducing sizeof
'struct mem_cgroup' by 64 bytes:

parent_commit Waiman's_commit +relayout patch

result 187K 145K 200K

Also, if we disable the hw prefetch feature, Waiman's commit and
its parent commit show no performance difference.
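
For reference, the adjacent cache line prefetcher is normally a BIOS knob on
these Xeons, and per Intel's hardware prefetcher control disclosure it is
typically exposed as bit 1 of MSR 0x1a4 (MISC_FEATURE_CONTROL). A minimal
check via the msr module, assuming that MSR layout, might look like:

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t val = 0;
	/* needs root and 'modprobe msr'; reads CPU 0 only */
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0 || pread(fd, &val, sizeof(val), 0x1a4) != sizeof(val)) {
		perror("read MSR 0x1a4");
		return 1;
	}
	close(fd);
	printf("MSR 0x1a4 = 0x%" PRIx64 ", adjacent line prefetch %s\n",
	       val, (val & 0x2) ? "disabled" : "enabled");
	return 0;
}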

Thanks,
Feng

From 2e63af34fa4853b2dd9669867c37a3cf07f7a505 Mon Sep 17 00:00:00 2001
From: Feng Tang <[email protected]>
Date: Wed, 25 Nov 2020 13:22:21 +0800
Subject: [PATCH] mm: memcg: relayout structure mem_cgroup to avoid cache
 interference

0day reported a -22.7% regression for the will-it-scale page_fault2
case [1] on a 4-socket, 144-CPU platform, and bisected it to Waiman's
optimization (commit bd0b230fe1), which saves the space of one
'struct page_counter' in 'struct mem_cgroup'.

Initially we thought it was due to the cache alignment change introduced
by the patch, but further debugging shows that it is because some hot
data members ('vmstats_local', 'vmstats_percpu', 'vmstats') sit in 2
adjacent cachelines (cachelines 2N and 2N+1), and when adjacent cache
line prefetch is enabled, this triggers an "extended level" of cache
false sharing across the 2 adjacent cache lines.

So exchange the 2 member blocks while mostly keeping the original
cache alignment, which restores and even improves the performance,
and saves 64 bytes of space in 'struct mem_cgroup' (from 2880 to 2816
bytes, with 0day's default RHEL-8.3 kernel config).

[1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/

Fixes: bd0b230fe145 ("mm/memcg: unify swap and memsw page counters")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
include/linux/memcontrol.h | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e391e3c..a2d50b0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -282,20 +282,6 @@ struct mem_cgroup {

MEMCG_PADDING(_pad1_);

- /*
- * set > 0 if pages under this cgroup are moving to other cgroup.
- */
- atomic_t moving_account;
- struct task_struct *move_lock_task;
-
- /* Legacy local VM stats and events */
- struct memcg_vmstats_percpu __percpu *vmstats_local;
-
- /* Subtree VM stats and events (batched updates) */
- struct memcg_vmstats_percpu __percpu *vmstats_percpu;
-
- MEMCG_PADDING(_pad2_);
-
atomic_long_t vmstats[MEMCG_NR_STAT];
atomic_long_t vmevents[NR_VM_EVENT_ITEMS];

@@ -317,6 +303,20 @@ struct mem_cgroup {
struct list_head objcg_list; /* list of inherited objcgs */
#endif

+ MEMCG_PADDING(_pad2_);
+
+ /*
+ * set > 0 if pages under this cgroup are moving to other cgroup.
+ */
+ atomic_t moving_account;
+ struct task_struct *move_lock_task;
+
+ /* Legacy local VM stats and events */
+ struct memcg_vmstats_percpu __percpu *vmstats_local;
+
+ /* Subtree VM stats and events (batched updates) */
+ struct memcg_vmstats_percpu __percpu *vmstats_percpu;
+
#ifdef CONFIG_CGROUP_WRITEBACK
struct list_head cgwb_list;
struct wb_domain cgwb_domain;
--
2.7.4

2020-11-26 09:29:08

by Waiman Long

[permalink] [raw]
Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On 11/25/20 1:24 AM, Feng Tang wrote:
> On Fri, Nov 20, 2020 at 07:44:24PM +0800, Feng Tang wrote:
>> On Fri, Nov 13, 2020 at 03:34:36PM +0800, Feng Tang wrote:
>>>> I would rather focus on a more effective mem_cgroup layout. It is very
>>>> likely that we are just stumbling over two counters here.
>>>>
>>>> Could you try to add cache alignment of counters after memory and see
>>>> which one makes the difference? I do not expect memsw to be the one
>>>> because that one is used together with the main counter. But who knows
>>>> maybe the way it crosses the cache line has the exact effect. Hard to
>>>> tell without other numbers.
>>> I added some alignment changes around 'memsw', but neither of them can
>>> restore the -22.7%. Following are some logs showing the resulting
>>> alignments:
>>>
>>> t1: memcg=0x7cd1000 memory=0x7cd10d0 memsw=0x7cd1140 kmem=0x7cd11b0 tcpmem=0x7cd1220
>>> t2: memcg=0x7cd0000 memory=0x7cd00d0 memsw=0x7cd0140 kmem=0x7cd01c0 tcpmem=0x7cd0230
>>>
>>> So both of the 'memsw' are aligned, but t2's 'kmem' is aligned while
>>> t1's is not.
>>>
>>> I will check more on the perf data about detailed hotspots.
>> Some more check updates about it:
>>
>> Waiman's patch is effectively removing one 'struct page_counter' between
>> 'memory' and 'memsw'. And the mem_cgroup is:
>>
>> struct mem_cgroup {
>>
>> ...
>>
>> struct page_counter memory; /* Both v1 & v2 */
>>
>> union {
>> struct page_counter swap; /* v2 only */
>> struct page_counter memsw; /* v1 only */
>> };
>>
>> /* Legacy consumer-oriented counters */
>> struct page_counter kmem; /* v1 only */
>> struct page_counter tcpmem; /* v1 only */
>>
>> ...
>> ...
>>
>> MEMCG_PADDING(_pad1_);
>>
>> atomic_t moving_account;
>> struct task_struct *move_lock_task;
>>
>> ...
>> };
>>
>>
>> I did experiments inserting a 'page_counter' between 'memory'
>> and 'MEMCG_PADDING(_pad1_)'; no matter where I put it, the
>> benchmark result recovers from 145K to 185K, which is
>> really confusing, as adding a 'page_counter' right before
>> '_pad1_' doesn't change the cache alignment of any member.
> I think we finally found the trick :). Further debugging shows it
> is not related to the alignment inside one cacheline, but to the
> pairing of 2 adjacent cachelines (2N and 2N+1, i.e. one 128-byte pair).
>
> For struct mem_cgroup, the members 'vmstats_local' and 'vmstats_percpu'
> sit in one cacheline, while 'vmstats[]' sits in the next cacheline.
> When 'adjacent cacheline prefetch' is enabled, if these 2 lines sit
> in one pair (128 bytes), say 2N and 2N+1, then there seems to be
> some kind of false sharing; if they sit in 2 different pairs, say
> 2N-1 and 2N, then it's fine.
>
> And with the following patch to relayout these members, the lost
> performance is restored and even improved, while reducing sizeof
> 'struct mem_cgroup' by 64 bytes:
>
> parent_commit Waiman's_commit +relayout patch
>
> result 187K 145K 200K
>
> Also, if we disable the hw prefetch feature, Waiman's commit and
> its parent commit show no performance difference.
>
> Thanks,
> Feng

Finally, mystery solved? Well done.

Your patch looks good. It reduces structure size and improves performance.

Acked-by: Waiman Long <[email protected]>

> From 2e63af34fa4853b2dd9669867c37a3cf07f7a505 Mon Sep 17 00:00:00 2001
> From: Feng Tang <[email protected]>
> Date: Wed, 25 Nov 2020 13:22:21 +0800
> Subject: [PATCH] mm: memcg: relayout structure mem_cgroup to avoid cache
>  interference
>
> 0day reported a -22.7% regression for the will-it-scale page_fault2
> case [1] on a 4-socket, 144-CPU platform, and bisected it to Waiman's
> optimization (commit bd0b230fe1), which saves the space of one
> 'struct page_counter' in 'struct mem_cgroup'.
>
> Initially we thought it was due to the cache alignment change introduced
> by the patch, but further debugging shows that it is because some hot
> data members ('vmstats_local', 'vmstats_percpu', 'vmstats') sit in 2
> adjacent cachelines (cachelines 2N and 2N+1), and when adjacent cache
> line prefetch is enabled, this triggers an "extended level" of cache
> false sharing across the 2 adjacent cache lines.
>
> So exchange the 2 member blocks while mostly keeping the original
> cache alignment, which restores and even improves the performance,
> and saves 64 bytes of space in 'struct mem_cgroup' (from 2880 to 2816
> bytes, with 0day's default RHEL-8.3 kernel config).
>
> [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
>
> Fixes: bd0b230fe145 ("mm/memcg: unify swap and memsw page counters")
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> include/linux/memcontrol.h | 28 ++++++++++++++--------------
> 1 file changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e391e3c..a2d50b0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -282,20 +282,6 @@ struct mem_cgroup {
>
> MEMCG_PADDING(_pad1_);
>
> - /*
> - * set > 0 if pages under this cgroup are moving to other cgroup.
> - */
> - atomic_t moving_account;
> - struct task_struct *move_lock_task;
> -
> - /* Legacy local VM stats and events */
> - struct memcg_vmstats_percpu __percpu *vmstats_local;
> -
> - /* Subtree VM stats and events (batched updates) */
> - struct memcg_vmstats_percpu __percpu *vmstats_percpu;
> -
> - MEMCG_PADDING(_pad2_);
> -
> atomic_long_t vmstats[MEMCG_NR_STAT];
> atomic_long_t vmevents[NR_VM_EVENT_ITEMS];
>
> @@ -317,6 +303,20 @@ struct mem_cgroup {
> struct list_head objcg_list; /* list of inherited objcgs */
> #endif
>
> + MEMCG_PADDING(_pad2_);
> +
> + /*
> + * set > 0 if pages under this cgroup are moving to other cgroup.
> + */
> + atomic_t moving_account;
> + struct task_struct *move_lock_task;
> +
> + /* Legacy local VM stats and events */
> + struct memcg_vmstats_percpu __percpu *vmstats_local;
> +
> + /* Subtree VM stats and events (batched updates) */
> + struct memcg_vmstats_percpu __percpu *vmstats_percpu;
> +
> #ifdef CONFIG_CGROUP_WRITEBACK
> struct list_head cgwb_list;
> struct wb_domain cgwb_domain;


2020-11-26 17:43:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Tue, Nov 24, 2020 at 10:24 PM Feng Tang <[email protected]> wrote:
>
> And with the following patch to relayout these members, the regression
> is restored and event better. while reducing 64 bytes of sizeof
> 'struct mem_cgroup'

Thanks, applied.

I do wonder if somebody on the hardware side inside Intel might not
perhaps want to look at that load pattern and at why performance
regressed so much. Maybe teach the prefetcher to avoid adjacent cache
lines that end up showing ping-pong behavior..

Linus

2020-11-30 08:51:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

On Wed 25-11-20 14:24:45, Feng Tang wrote:
[...]
> I think we finally found the trick :). Further debugging shows it
> is not related to the alignment inside one cacheline, but to the
> pairing of 2 adjacent cachelines (2N and 2N+1, i.e. one 128-byte pair).
>
> For struct mem_cgroup, the members 'vmstats_local' and 'vmstats_percpu'
> sit in one cacheline, while 'vmstats[]' sits in the next cacheline.
> When 'adjacent cacheline prefetch' is enabled, if these 2 lines sit
> in one pair (128 bytes), say 2N and 2N+1, then there seems to be
> some kind of false sharing; if they sit in 2 different pairs, say
> 2N-1 and 2N, then it's fine.
>
> And with the following patch to relayout these members, the lost
> performance is restored and even improved, while reducing sizeof
> 'struct mem_cgroup' by 64 bytes:
>
> parent_commit Waiman's_commit +relayout patch
>
> result 187K 145K 200K
>
> Also, if we disable the hw prefetch feature, Waiman's commit and
> its parent commit show no performance difference.
>
> Thanks,
> Feng
>
> From 2e63af34fa4853b2dd9669867c37a3cf07f7a505 Mon Sep 17 00:00:00 2001
> From: Feng Tang <[email protected]>
> Date: Wed, 25 Nov 2020 13:22:21 +0800
> Subject: [PATCH] mm: memcg: relayout structure mem_cgroup to avoid cache
>  interference
>
> 0day reported a -22.7% regression for the will-it-scale page_fault2
> case [1] on a 4-socket, 144-CPU platform, and bisected it to Waiman's
> optimization (commit bd0b230fe1), which saves the space of one
> 'struct page_counter' in 'struct mem_cgroup'.
>
> Initially we thought it was due to the cache alignment change introduced
> by the patch, but further debugging shows that it is because some hot
> data members ('vmstats_local', 'vmstats_percpu', 'vmstats') sit in 2
> adjacent cachelines (cachelines 2N and 2N+1), and when adjacent cache
> line prefetch is enabled, this triggers an "extended level" of cache
> false sharing across the 2 adjacent cache lines.
>
> So exchange the 2 member blocks while mostly keeping the original
> cache alignment, which restores and even improves the performance,
> and saves 64 bytes of space in 'struct mem_cgroup' (from 2880 to 2816
> bytes, with 0day's default RHEL-8.3 kernel config).
>
> [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
>
> Fixes: bd0b230fe145 ("mm/memcg: unify swap and memsw page counters")
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>

Sorry for the late reply. This is indeed surprising! I was really
expecting the page counter to be the culprit. Anyway, this rearrangement
looks ok as well. The moving_account related stuff is still after the
padding, which is good because this rare operation shouldn't really
interfere with the rest of the structure. Btw. you made me look into the
history and I noticed e81bf9793b18 ("mem_cgroup: make sure
moving_account, move_lock_task and stat_cpu in the same cacheline"), so
this is not the first time we are dealing with a regression here.

Linus has already merged the patch but for the record
Acked-by: Michal Hocko <[email protected]>

Thanks a lot for pursuing this!

--
Michal Hocko
SUSE Labs