2021-08-11 03:03:05

by kernel test robot

Subject: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression



Greetings,

FYI, we noticed a -36.4% regression of vm-scalability.throughput due to commit:


commit: 2d146aa3aa842d7f5065802556b4f9a2c6e8ef12 ("mm: memcontrol: switch to rstat")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master


in testcase: vm-scalability
on test machine: 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
with the following parameters:

runtime: 300s
size: 1T
test: lru-shm
cpufreq_governor: performance
ucode: 0x5003006

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ subsystem of the Linux kernel which are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/

In addition to that, the commit also has a significant impact on the following tests:

+------------------+-------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.null.ops_per_sec -45.9% regression |
| test machine | 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
| test parameters | class=device |
| | cpufreq_governor=performance |
| | disk=1HDD |
| | nr_threads=100% |
| | testtime=60s |
| | ucode=0x5003006 |
+------------------+-------------------------------------------------------------------------------------+


If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
bin/lkp run generated-yaml-file

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/1T/lkp-csl-2ap4/lru-shm/vm-scalability/0x5003006

commit:
dc26532aed ("cgroup: rstat: punt root-level optimization to individual controllers")
2d146aa3aa ("mm: memcontrol: switch to rstat")

dc26532aed0ab25c 2d146aa3aa842d7f5065802556b
---------------- ---------------------------
%stddev %change %stddev
\ | \
0.02 ? 5% +90.7% 0.04 ? 9% vm-scalability.free_time
359287 ? 5% -36.5% 228315 ? 4% vm-scalability.median
69325283 ? 5% -36.4% 44114216 ? 5% vm-scalability.throughput
53491 ? 11% +33.6% 71455 ? 23% vm-scalability.time.involuntary_context_switches
7.408e+08 -8.9% 6.746e+08 vm-scalability.time.minor_page_faults
3121 ? 4% +42.5% 4448 ? 4% vm-scalability.time.percent_of_cpu_this_job_got
7257 ? 6% +65.2% 11989 ? 4% vm-scalability.time.system_time
2266 ? 9% -25.1% 1697 ? 7% vm-scalability.time.user_time
69029 -8.1% 63441 vm-scalability.time.voluntary_context_switches
3.318e+09 -8.9% 3.021e+09 vm-scalability.workload
7852412 ? 5% +40.6% 11040505 ? 5% meminfo.Mapped
18180 ? 3% +30.1% 23656 ? 4% meminfo.PageTables
12.52 ? 6% +7.8 20.28 ? 3% mpstat.cpu.all.sys%
3.90 ? 8% -1.0 2.94 ? 7% mpstat.cpu.all.usr%
81.67 -8.2% 75.00 vmstat.cpu.id
32.83 ? 4% +40.1% 46.00 ? 3% vmstat.procs.r
9136 ? 13% -29.8% 6413 ? 2% vmstat.system.cs
1.868e+08 ? 2% -9.3% 1.694e+08 numa-numastat.node1.local_node
1.869e+08 ? 2% -9.3% 1.694e+08 numa-numastat.node1.numa_hit
1.876e+08 -10.8% 1.674e+08 ? 2% numa-numastat.node3.local_node
1.876e+08 -10.8% 1.675e+08 ? 2% numa-numastat.node3.numa_hit
272.83 ? 5% +38.3% 377.33 ? 2% turbostat.Avg_MHz
18.62 ? 4% +6.7 25.30 turbostat.Busy%
38.78 ? 2% -10.1 28.67 ? 39% turbostat.C1E%
147.74 ? 4% +7.6% 159.04 ? 3% turbostat.PkgWatt
3233 ? 33% +166.7% 8622 ? 96% numa-meminfo.node0.Active(anon)
1897908 ? 19% +61.0% 3055051 ? 13% numa-meminfo.node0.Mapped
4875 ? 33% +72.7% 8418 ? 21% numa-meminfo.node0.PageTables
1949132 ? 7% +30.5% 2544570 ? 4% numa-meminfo.node1.Mapped
4033 ? 36% -40.5% 2401 ? 27% numa-meminfo.node2.Active(anon)
69457 ? 12% -19.4% 55973 ? 7% numa-meminfo.node2.KReclaimable
1951496 ? 13% +37.2% 2677364 ? 8% numa-meminfo.node2.Mapped
69457 ? 12% -19.4% 55973 ? 7% numa-meminfo.node2.SReclaimable
1914648 ? 7% +37.2% 2626319 ? 3% numa-meminfo.node3.Mapped
4068 ? 9% +34.8% 5483 ? 16% numa-meminfo.node3.PageTables
12999071 -1.1% 12854405 proc-vmstat.nr_file_pages
12456571 -1.2% 12307726 proc-vmstat.nr_inactive_anon
1961429 ? 5% +40.8% 2762082 ? 5% proc-vmstat.nr_mapped
4538 ? 3% +30.4% 5919 ? 4% proc-vmstat.nr_page_table_pages
12407630 -1.2% 12262964 proc-vmstat.nr_shmem
12456570 -1.2% 12307724 proc-vmstat.nr_zone_inactive_anon
7.443e+08 -8.9% 6.779e+08 proc-vmstat.numa_hit
7.441e+08 -8.9% 6.777e+08 proc-vmstat.numa_local
7.46e+08 -8.9% 6.794e+08 proc-vmstat.pgalloc_normal
7.42e+08 -8.9% 6.758e+08 proc-vmstat.pgfault
7.457e+08 -8.9% 6.793e+08 proc-vmstat.pgfree
283300 ? 2% -7.4% 262251 ? 2% proc-vmstat.pgreuse
808.00 ? 33% +166.8% 2155 ? 96% numa-vmstat.node0.nr_active_anon
473185 ? 18% +62.7% 769993 ? 12% numa-vmstat.node0.nr_mapped
1209 ? 32% +73.8% 2103 ? 21% numa-vmstat.node0.nr_page_table_pages
808.00 ? 33% +166.8% 2155 ? 96% numa-vmstat.node0.nr_zone_active_anon
490316 ? 7% +29.9% 637150 ? 5% numa-vmstat.node1.nr_mapped
96371466 ? 2% -9.6% 87157667 numa-vmstat.node1.numa_hit
96136178 ? 2% -9.6% 86939810 numa-vmstat.node1.numa_local
1008 ? 36% -40.5% 599.83 ? 27% numa-vmstat.node2.nr_active_anon
492855 ? 10% +34.7% 664105 ? 7% numa-vmstat.node2.nr_mapped
17340 ? 12% -19.3% 13999 ? 7% numa-vmstat.node2.nr_slab_reclaimable
1008 ? 36% -40.5% 599.83 ? 27% numa-vmstat.node2.nr_zone_active_anon
479874 ? 7% +36.2% 653534 ? 2% numa-vmstat.node3.nr_mapped
1013 ? 8% +34.1% 1359 ? 17% numa-vmstat.node3.nr_page_table_pages
96808651 -10.8% 86336260 ? 2% numa-vmstat.node3.numa_hit
96571930 -10.8% 86156545 ? 2% numa-vmstat.node3.numa_local
0.03 ? 37% +249.4% 0.10 ? 58% perf-sched.sch_delay.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
0.01 ? 12% -100.0% 0.00 perf-sched.sch_delay.avg.ms.pipe_write.new_sync_write.vfs_write.ksys_write
0.00 ?223% +16950.0% 0.17 ?202% perf-sched.sch_delay.avg.ms.preempt_schedule_common.__cond_resched.unmap_vmas.unmap_region.__do_munmap
0.00 ?142% +318.2% 0.01 ? 68% perf-sched.sch_delay.avg.ms.schedule_timeout.wait_for_completion.__flush_work.lru_add_drain_all
3.15 ?108% +181.3% 8.86 ? 47% perf-sched.sch_delay.max.ms.do_syslog.part.0.kmsg_read.vfs_read
6.17 ?136% -100.0% 0.00 perf-sched.sch_delay.max.ms.pipe_write.new_sync_write.vfs_write.ksys_write
0.00 ?223% +81550.0% 0.82 ?207% perf-sched.sch_delay.max.ms.preempt_schedule_common.__cond_resched.unmap_vmas.unmap_region.__do_munmap
0.00 ?142% +318.2% 0.01 ? 68% perf-sched.sch_delay.max.ms.schedule_timeout.wait_for_completion.__flush_work.lru_add_drain_all
0.07 ? 56% +158.2% 0.17 ? 39% perf-sched.sch_delay.max.ms.sigsuspend.__x64_sys_rt_sigsuspend.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.04 ? 12% +102.2% 0.08 ? 35% perf-sched.total_sch_delay.average.ms
65.06 ? 9% +82.1% 118.46 ? 8% perf-sched.total_wait_and_delay.average.ms
57034 ? 12% -42.6% 32746 ? 10% perf-sched.total_wait_and_delay.count.ms
65.02 ? 9% +82.1% 118.38 ? 8% perf-sched.total_wait_time.average.ms
0.16 ? 41% +876.7% 1.61 ? 58% perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
9.64 ? 4% +59.6% 15.39 ? 30% perf-sched.wait_and_delay.avg.ms.pipe_read.new_sync_read.vfs_read.ksys_read
0.69 ? 21% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.pipe_write.new_sync_write.vfs_write.ksys_write
4563 ? 61% -88.6% 518.67 ? 26% perf-sched.wait_and_delay.count.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
16728 ? 4% -29.1% 11854 ? 29% perf-sched.wait_and_delay.count.pipe_read.new_sync_read.vfs_read.ksys_read
15718 ? 33% -100.0% 0.00 perf-sched.wait_and_delay.count.pipe_write.new_sync_write.vfs_write.ksys_write
1500 +48.6% 2229 perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_select
51.67 ? 9% -23.9% 39.33 ? 15% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
29.42 ? 9% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.pipe_write.new_sync_write.vfs_write.ksys_write
0.16 ? 41% +879.5% 1.58 ? 58% perf-sched.wait_time.avg.ms.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.96 ?141% +141.8% 4.75 ? 54% perf-sched.wait_time.avg.ms.io_schedule.__lock_page.pagecache_get_page.shmem_getpage_gfp
9.64 ? 4% +59.3% 15.35 ? 30% perf-sched.wait_time.avg.ms.pipe_read.new_sync_read.vfs_read.ksys_read
0.68 ? 21% -100.0% 0.00 perf-sched.wait_time.avg.ms.pipe_write.new_sync_write.vfs_write.ksys_write
0.00 ?223% +24122.2% 0.36 ? 22% perf-sched.wait_time.avg.ms.preempt_schedule_common.__cond_resched.unmap_vmas.unmap_region.__do_munmap
0.23 ?166% -82.8% 0.04 ? 30% perf-sched.wait_time.avg.ms.preempt_schedule_common.__cond_resched.zap_pte_range.unmap_page_range.unmap_vmas
2.19 ?143% +286.1% 8.44 ? 50% perf-sched.wait_time.max.ms.io_schedule.__lock_page.pagecache_get_page.shmem_getpage_gfp
29.40 ? 9% -100.0% 0.00 perf-sched.wait_time.max.ms.pipe_write.new_sync_write.vfs_write.ksys_write
0.00 ?223% +60722.2% 0.91 ? 46% perf-sched.wait_time.max.ms.preempt_schedule_common.__cond_resched.unmap_vmas.unmap_region.__do_munmap
328790 ? 6% -32.6% 221487 ? 3% interrupts.CAL:Function_call_interrupts
1655 ? 30% +70.1% 2816 ? 34% interrupts.CPU0.NMI:Non-maskable_interrupts
1655 ? 30% +70.1% 2816 ? 34% interrupts.CPU0.PMI:Performance_monitoring_interrupts
1419 ? 33% +48.8% 2111 ? 16% interrupts.CPU11.NMI:Non-maskable_interrupts
1419 ? 33% +48.8% 2111 ? 16% interrupts.CPU11.PMI:Performance_monitoring_interrupts
1617 ? 54% -43.5% 913.83 ? 16% interrupts.CPU113.CAL:Function_call_interrupts
1198 ? 40% +80.7% 2166 ? 14% interrupts.CPU126.NMI:Non-maskable_interrupts
1198 ? 40% +80.7% 2166 ? 14% interrupts.CPU126.PMI:Performance_monitoring_interrupts
1062 ? 44% +84.9% 1964 ? 21% interrupts.CPU129.NMI:Non-maskable_interrupts
1062 ? 44% +84.9% 1964 ? 21% interrupts.CPU129.PMI:Performance_monitoring_interrupts
1342 ? 33% +47.3% 1977 ? 15% interrupts.CPU138.NMI:Non-maskable_interrupts
1342 ? 33% +47.3% 1977 ? 15% interrupts.CPU138.PMI:Performance_monitoring_interrupts
1712 ? 74% -38.9% 1046 ? 7% interrupts.CPU146.CAL:Function_call_interrupts
1398 ? 21% -25.2% 1045 ? 4% interrupts.CPU162.CAL:Function_call_interrupts
1905 ? 72% -45.6% 1036 ? 8% interrupts.CPU164.CAL:Function_call_interrupts
2796 ?101% -61.6% 1073 ? 9% interrupts.CPU166.CAL:Function_call_interrupts
1246 ? 16% -22.1% 971.67 ? 14% interrupts.CPU169.CAL:Function_call_interrupts
1578 ? 7% +62.4% 2563 ? 28% interrupts.CPU169.NMI:Non-maskable_interrupts
1578 ? 7% +62.4% 2563 ? 28% interrupts.CPU169.PMI:Performance_monitoring_interrupts
1661 ? 50% -43.8% 934.33 ? 15% interrupts.CPU17.CAL:Function_call_interrupts
3393 ? 84% -71.8% 958.17 ? 3% interrupts.CPU18.CAL:Function_call_interrupts
1376 ? 24% +50.1% 2066 ? 16% interrupts.CPU2.NMI:Non-maskable_interrupts
1376 ? 24% +50.1% 2066 ? 16% interrupts.CPU2.PMI:Performance_monitoring_interrupts
1480 ? 25% +93.1% 2858 ? 63% interrupts.CPU38.NMI:Non-maskable_interrupts
1480 ? 25% +93.1% 2858 ? 63% interrupts.CPU38.PMI:Performance_monitoring_interrupts
1427 ? 29% -35.0% 928.17 ? 16% interrupts.CPU4.CAL:Function_call_interrupts
1029 ? 32% +65.7% 1705 ? 26% interrupts.CPU41.NMI:Non-maskable_interrupts
1029 ? 32% +65.7% 1705 ? 26% interrupts.CPU41.PMI:Performance_monitoring_interrupts
1179 ? 38% +68.3% 1985 ? 16% interrupts.CPU47.NMI:Non-maskable_interrupts
1179 ? 38% +68.3% 1985 ? 16% interrupts.CPU47.PMI:Performance_monitoring_interrupts
3836 ? 96% -73.7% 1007 ? 9% interrupts.CPU5.CAL:Function_call_interrupts
1274 ? 33% +62.3% 2068 ? 19% interrupts.CPU5.NMI:Non-maskable_interrupts
1274 ? 33% +62.3% 2068 ? 19% interrupts.CPU5.PMI:Performance_monitoring_interrupts
1457 ? 26% +39.7% 2036 ? 13% interrupts.CPU60.NMI:Non-maskable_interrupts
1457 ? 26% +39.7% 2036 ? 13% interrupts.CPU60.PMI:Performance_monitoring_interrupts
1454 ? 40% -29.9% 1019 ? 5% interrupts.CPU61.CAL:Function_call_interrupts
1543 ? 57% -36.0% 988.67 ? 3% interrupts.CPU65.CAL:Function_call_interrupts
1789 ? 55% -43.2% 1016 ? 5% interrupts.CPU66.CAL:Function_call_interrupts
57.17 ? 8% +165.0% 151.50 ? 74% interrupts.CPU77.RES:Rescheduling_interrupts
63.83 ? 15% +186.4% 182.83 ? 86% interrupts.CPU78.RES:Rescheduling_interrupts
1209 ? 15% +30.0% 1571 ? 6% interrupts.CPU95.CAL:Function_call_interrupts
121.17 ? 53% +271.0% 449.50 ? 20% interrupts.CPU95.RES:Rescheduling_interrupts
1.551e+10 -4.0% 1.489e+10 perf-stat.i.branch-instructions
66444144 -23.1% 51101081 perf-stat.i.cache-misses
9079 ? 13% -29.9% 6364 ? 3% perf-stat.i.context-switches
1.46 ? 2% +22.4% 1.78 perf-stat.i.cpi
1.035e+11 ? 3% +36.5% 1.413e+11 ? 3% perf-stat.i.cpu-cycles
1704 +21.1% 2063 ? 4% perf-stat.i.cycles-between-cache-misses
0.02 ? 6% +0.0 0.04 ? 57% perf-stat.i.dTLB-load-miss-rate%
1.565e+10 -2.9% 1.521e+10 perf-stat.i.dTLB-loads
0.02 +0.0 0.03 ? 19% perf-stat.i.dTLB-store-miss-rate%
4.343e+09 -8.1% 3.993e+09 perf-stat.i.dTLB-stores
50.45 +4.6 55.09 ? 7% perf-stat.i.iTLB-load-miss-rate%
5111149 ? 4% -11.0% 4548466 ? 2% perf-stat.i.iTLB-load-misses
2595279 -14.0% 2231645 ? 16% perf-stat.i.iTLB-loads
5.647e+10 -3.2% 5.466e+10 perf-stat.i.instructions
7962 ? 2% +12.0% 8914 ? 3% perf-stat.i.instructions-per-iTLB-miss
0.71 -11.5% 0.63 ? 3% perf-stat.i.ipc
0.54 ? 3% +36.5% 0.73 ? 3% perf-stat.i.metric.GHz
185.49 -4.0% 178.09 perf-stat.i.metric.M/sec
2346771 -10.4% 2102847 perf-stat.i.minor-faults
5135545 -12.6% 4490509 perf-stat.i.node-load-misses
1167240 ? 4% -8.8% 1064214 ? 4% perf-stat.i.node-loads
65.40 -10.6 54.77 ? 6% perf-stat.i.node-store-miss-rate%
2725640 ? 4% -61.9% 1037367 perf-stat.i.node-store-misses
9038103 -17.4% 7468754 perf-stat.i.node-stores
2346773 -10.4% 2102849 perf-stat.i.page-faults
30.83 ? 2% -3.2 27.66 ? 7% perf-stat.overall.cache-miss-rate%
1.84 ? 5% +41.5% 2.61 ? 3% perf-stat.overall.cpi
1557 ? 4% +78.7% 2783 ? 4% perf-stat.overall.cycles-between-cache-misses
11129 ? 4% +9.1% 12140 perf-stat.overall.instructions-per-iTLB-miss
0.54 ? 5% -29.4% 0.38 ? 3% perf-stat.overall.ipc
23.14 ? 3% -11.1 12.08 perf-stat.overall.node-store-miss-rate%
5346 +8.0% 5773 perf-stat.overall.path-length
1.595e+10 -3.2% 1.543e+10 perf-stat.ps.branch-instructions
68523526 -22.8% 52929588 perf-stat.ps.cache-misses
9075 ? 13% -30.0% 6349 ? 2% perf-stat.ps.context-switches
1.067e+11 ? 4% +38.0% 1.472e+11 ? 3% perf-stat.ps.cpu-cycles
1.608e+10 -2.2% 1.573e+10 perf-stat.ps.dTLB-loads
4.442e+09 -7.7% 4.102e+09 perf-stat.ps.dTLB-stores
5217650 ? 4% -10.8% 4656233 perf-stat.ps.iTLB-load-misses
2569538 ? 2% -14.5% 2196111 ? 15% perf-stat.ps.iTLB-loads
5.796e+10 -2.5% 5.651e+10 perf-stat.ps.instructions
1.72 ? 4% +7.2% 1.85 ? 2% perf-stat.ps.major-faults
2423700 -9.7% 2188843 perf-stat.ps.minor-faults
5257866 ? 2% -12.2% 4615133 perf-stat.ps.node-load-misses
1197096 ? 4% -8.7% 1092716 ? 4% perf-stat.ps.node-loads
2811001 ? 4% -62.0% 1067656 perf-stat.ps.node-store-misses
9336409 -16.7% 7775080 perf-stat.ps.node-stores
2423702 -9.7% 2188844 perf-stat.ps.page-faults
33414 ? 4% -12.1% 29356 ? 18% softirqs.CPU101.SCHED
33523 ? 5% -13.8% 28899 ? 14% softirqs.CPU104.SCHED
33620 ? 4% -10.9% 29966 ? 6% softirqs.CPU105.SCHED
9177 ? 9% -16.2% 7691 ? 8% softirqs.CPU109.RCU
10431 ? 16% -23.8% 7943 ? 8% softirqs.CPU11.RCU
9060 ? 10% -15.9% 7619 ? 7% softirqs.CPU110.RCU
33743 ? 4% -6.2% 31639 ? 4% softirqs.CPU119.SCHED
8818 ? 8% -20.6% 7002 ? 9% softirqs.CPU120.RCU
8794 ? 10% -20.1% 7026 ? 11% softirqs.CPU121.RCU
8534 ? 9% -22.8% 6584 ? 8% softirqs.CPU124.RCU
8799 ? 12% -26.0% 6512 ? 10% softirqs.CPU125.RCU
10109 ? 33% -32.5% 6828 ? 11% softirqs.CPU128.RCU
9380 ? 6% -15.0% 7977 ? 7% softirqs.CPU13.RCU
10633 ? 45% -34.8% 6935 ? 9% softirqs.CPU132.RCU
9186 ? 17% -24.3% 6956 ? 10% softirqs.CPU133.RCU
9000 ? 11% -23.6% 6877 ? 10% softirqs.CPU136.RCU
9538 ? 7% -20.2% 7608 ? 12% softirqs.CPU144.RCU
9564 ? 15% -24.7% 7201 ? 14% softirqs.CPU148.RCU
10287 ? 25% -31.2% 7072 ? 12% softirqs.CPU149.RCU
10068 ? 23% -27.1% 7336 ? 13% softirqs.CPU151.RCU
8909 ? 9% -19.6% 7166 ? 11% softirqs.CPU152.RCU
9001 ? 13% -22.6% 6968 ? 12% softirqs.CPU153.RCU
9012 ? 8% -23.0% 6939 ? 12% softirqs.CPU154.RCU
8955 ? 8% -21.3% 7045 ? 10% softirqs.CPU156.RCU
11374 ? 32% -37.5% 7109 ? 13% softirqs.CPU157.RCU
9551 ? 15% -28.2% 6859 ? 13% softirqs.CPU166.RCU
8810 ? 10% -22.8% 6800 ? 12% softirqs.CPU167.RCU
8594 ? 14% -26.3% 6330 ? 10% softirqs.CPU175.RCU
35252 ? 3% -10.5% 31536 ? 5% softirqs.CPU182.SCHED
35076 ? 2% -11.5% 31042 ? 7% softirqs.CPU184.SCHED
9237 ? 12% -27.0% 6740 ? 10% softirqs.CPU185.RCU
35111 ? 2% -9.6% 31730 ? 5% softirqs.CPU185.SCHED
9350 ? 8% -23.7% 7134 ? 8% softirqs.CPU191.RCU
10047 ? 27% -25.3% 7504 ? 10% softirqs.CPU23.RCU
9074 ? 11% -23.7% 6926 ? 11% softirqs.CPU25.RCU
12146 ? 47% -43.1% 6910 ? 8% softirqs.CPU29.RCU
10375 ? 25% -30.4% 7216 ? 10% softirqs.CPU32.RCU
9530 ? 11% -23.9% 7252 ? 11% softirqs.CPU37.RCU
9372 ? 7% -23.5% 7171 ? 10% softirqs.CPU38.RCU
9663 ? 5% -19.6% 7772 ? 7% softirqs.CPU4.RCU
9336 ? 9% -20.9% 7381 ? 9% softirqs.CPU42.RCU
9243 ? 9% -21.5% 7255 ? 11% softirqs.CPU43.RCU
11490 ? 47% -35.8% 7371 ? 12% softirqs.CPU47.RCU
11603 ? 39% -33.4% 7730 ? 13% softirqs.CPU48.RCU
34132 ? 2% -7.5% 31571 ? 3% softirqs.CPU5.SCHED
11001 ? 26% -29.3% 7781 ? 15% softirqs.CPU50.RCU
9773 ? 11% -25.0% 7326 ? 11% softirqs.CPU51.RCU
9888 ? 7% -25.4% 7373 ? 13% softirqs.CPU52.RCU
9610 ? 9% -24.9% 7220 ? 10% softirqs.CPU55.RCU
10651 ? 29% -31.6% 7283 ? 12% softirqs.CPU60.RCU
9131 ? 8% -20.8% 7229 ? 11% softirqs.CPU63.RCU
9302 ? 5% -18.4% 7592 ? 7% softirqs.CPU7.RCU
33631 ? 5% -7.9% 30973 ? 3% softirqs.CPU7.SCHED
9188 ? 8% -22.1% 7154 ? 6% softirqs.CPU83.RCU
9122 ? 9% -24.5% 6883 ? 6% softirqs.CPU88.RCU
10551 ? 22% -22.3% 8200 ? 7% softirqs.CPU9.RCU
9070 ? 7% -21.5% 7119 ? 10% softirqs.CPU94.RCU
33361 ? 4% -10.0% 30037 ? 3% softirqs.CPU96.SCHED
33776 ? 4% -7.0% 31409 ? 4% softirqs.CPU99.SCHED
1744747 ? 4% -19.7% 1400430 ? 10% softirqs.RCU
33.04 ? 8% -18.9 14.09 ? 5% perf-profile.calltrace.cycles-pp.mem_cgroup_charge.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault.__do_fault
35.94 ? 8% -16.0 19.98 ? 4% perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
12.88 ? 8% -7.1 5.78 ? 4% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.mem_cgroup_charge.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault
4.52 ? 7% -2.2 2.30 ? 7% perf-profile.calltrace.cycles-pp.clear_page_erms.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
2.50 ? 7% -2.0 0.49 ? 45% perf-profile.calltrace.cycles-pp.try_charge.mem_cgroup_charge.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault
3.11 ? 10% -1.2 1.89 ? 6% perf-profile.calltrace.cycles-pp.filemap_map_pages.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
2.05 ? 8% -1.0 1.04 ? 14% perf-profile.calltrace.cycles-pp.shmem_alloc_and_acct_page.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
1.58 ? 8% -0.8 0.79 ? 15% perf-profile.calltrace.cycles-pp.shmem_alloc_page.shmem_alloc_and_acct_page.shmem_getpage_gfp.shmem_fault.__do_fault
1.36 ? 8% -0.7 0.62 ? 45% perf-profile.calltrace.cycles-pp.alloc_pages_vma.shmem_alloc_page.shmem_alloc_and_acct_page.shmem_getpage_gfp.shmem_fault
1.24 ? 8% -0.7 0.55 ? 45% perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.alloc_pages_vma.shmem_alloc_page.shmem_alloc_and_acct_page.shmem_getpage_gfp
1.40 ? 10% -0.6 0.80 ? 13% perf-profile.calltrace.cycles-pp.next_uptodate_page.filemap_map_pages.do_fault.__handle_mm_fault.handle_mm_fault
1.00 ? 16% -0.4 0.61 ? 48% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
0.99 ? 16% -0.4 0.60 ? 48% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
0.08 ?223% +0.7 0.79 ? 20% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault.__do_fault
0.08 ?223% +1.0 1.09 ? 48% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.page_remove_rmap.zap_pte_range.unmap_page_range.unmap_vmas
0.47 ? 45% +1.0 1.51 ? 5% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp.shmem_fault
0.00 +1.4 1.41 ? 5% perf-profile.calltrace.cycles-pp.__mod_memcg_state.__mod_memcg_lruvec_state.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp
0.45 ?103% +1.5 1.91 ? 34% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.45 ?103% +1.5 1.91 ? 34% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
1.58 ? 4% +2.6 4.19 ? 5% perf-profile.calltrace.cycles-pp.finish_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
1.30 ? 4% +2.7 4.05 ? 5% perf-profile.calltrace.cycles-pp.do_set_pte.finish_fault.do_fault.__handle_mm_fault.handle_mm_fault
1.22 ? 4% +2.8 4.00 ? 5% perf-profile.calltrace.cycles-pp.page_add_file_rmap.do_set_pte.finish_fault.do_fault.__handle_mm_fault
0.51 ? 45% +3.0 3.47 ? 6% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.page_add_file_rmap.do_set_pte.finish_fault.do_fault
1.27 ? 4% +3.0 4.31 ? 3% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault.__do_fault
0.29 ?100% +3.2 3.48 ? 4% perf-profile.calltrace.cycles-pp.__count_memcg_events.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
0.00 +3.4 3.35 ? 5% perf-profile.calltrace.cycles-pp.__mod_memcg_state.__mod_memcg_lruvec_state.page_add_file_rmap.do_set_pte.finish_fault
0.67 ? 44% +3.4 4.09 ? 3% perf-profile.calltrace.cycles-pp.__mod_memcg_state.__mod_memcg_lruvec_state.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault
52.25 ? 5% +6.7 58.93 ? 3% perf-profile.calltrace.cycles-pp.__do_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
52.22 ? 5% +6.7 58.91 ? 3% perf-profile.calltrace.cycles-pp.shmem_fault.__do_fault.do_fault.__handle_mm_fault.handle_mm_fault
52.08 ? 5% +6.7 58.83 ? 3% perf-profile.calltrace.cycles-pp.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault.__handle_mm_fault
57.54 ? 5% +7.9 65.44 ? 2% perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
57.23 ? 5% +8.0 65.24 ? 2% perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
2.74 ? 12% +27.3 30.04 ? 8% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp
2.75 ? 12% +27.3 30.05 ? 8% perf-profile.calltrace.cycles-pp.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp.shmem_fault
2.69 ? 12% +27.3 30.01 ? 8% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add
4.26 ? 10% +28.0 32.30 ? 7% perf-profile.calltrace.cycles-pp.lru_cache_add.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
4.13 ? 10% +28.1 32.24 ? 7% perf-profile.calltrace.cycles-pp.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp.shmem_fault.__do_fault
33.13 ? 8% -19.0 14.13 ? 5% perf-profile.children.cycles-pp.mem_cgroup_charge
35.97 ? 8% -16.0 20.00 ? 4% perf-profile.children.cycles-pp.shmem_add_to_page_cache
12.99 ? 8% -7.1 5.87 ? 4% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
4.62 ? 7% -2.2 2.40 ? 6% perf-profile.children.cycles-pp.clear_page_erms
2.53 ? 7% -1.9 0.62 ? 7% perf-profile.children.cycles-pp.try_charge
2.04 ? 7% -1.5 0.52 ? 7% perf-profile.children.cycles-pp.page_counter_try_charge
3.18 ? 10% -1.2 1.96 ? 5% perf-profile.children.cycles-pp.filemap_map_pages
2.05 ? 8% -0.9 1.13 ? 5% perf-profile.children.cycles-pp.shmem_alloc_and_acct_page
2.64 ? 13% -0.9 1.78 ? 20% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
2.28 ? 14% -0.7 1.53 ? 19% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
1.58 ? 8% -0.7 0.85 ? 5% perf-profile.children.cycles-pp.shmem_alloc_page
0.89 ? 5% -0.7 0.16 ? 9% perf-profile.children.cycles-pp.propagate_protected_usage
1.42 ? 7% -0.7 0.76 ? 4% perf-profile.children.cycles-pp.__alloc_pages_nodemask
1.38 ? 8% -0.6 0.76 ? 5% perf-profile.children.cycles-pp.alloc_pages_vma
1.63 ? 13% -0.6 1.04 ? 16% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
1.61 ? 13% -0.6 1.03 ? 16% perf-profile.children.cycles-pp.hrtimer_interrupt
1.42 ? 9% -0.5 0.87 ? 4% perf-profile.children.cycles-pp.next_uptodate_page
1.01 ? 8% -0.5 0.52 ? 6% perf-profile.children.cycles-pp.get_page_from_freelist
0.78 ? 9% -0.4 0.42 ? 7% perf-profile.children.cycles-pp.rmqueue
0.51 ? 3% -0.3 0.19 ? 5% perf-profile.children.cycles-pp.mem_cgroup_charge_statistics
0.93 ? 13% -0.3 0.62 ? 15% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.60 ? 18% -0.3 0.30 ? 20% perf-profile.children.cycles-pp.ktime_get
0.79 ? 9% -0.3 0.50 ? 5% perf-profile.children.cycles-pp.error_entry
0.73 ? 14% -0.3 0.46 ? 13% perf-profile.children.cycles-pp.tick_sched_timer
0.68 ? 14% -0.2 0.43 ? 12% perf-profile.children.cycles-pp.tick_sched_handle
0.67 ? 14% -0.2 0.42 ? 12% perf-profile.children.cycles-pp.update_process_times
0.65 ? 10% -0.2 0.41 ? 4% perf-profile.children.cycles-pp.sync_regs
0.49 ? 14% -0.2 0.28 ? 18% perf-profile.children.cycles-pp.clockevents_program_event
0.25 ? 10% -0.2 0.05 ? 47% perf-profile.children.cycles-pp.lock_page_memcg
0.45 ? 9% -0.2 0.26 ? 7% perf-profile.children.cycles-pp.rmqueue_bulk
0.50 ? 8% -0.2 0.32 ? 7% perf-profile.children.cycles-pp.unlock_page
0.40 ? 12% -0.2 0.23 ? 4% perf-profile.children.cycles-pp.__perf_sw_event
0.42 ? 15% -0.2 0.26 ? 11% perf-profile.children.cycles-pp.scheduler_tick
0.32 ? 10% -0.1 0.18 ? 6% perf-profile.children.cycles-pp._raw_spin_lock
0.44 ? 9% -0.1 0.30 ? 12% perf-profile.children.cycles-pp.__mod_lruvec_state
0.25 ? 8% -0.1 0.12 ? 7% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.36 ? 9% -0.1 0.24 ? 11% perf-profile.children.cycles-pp.__mod_node_page_state
0.26 ? 11% -0.1 0.14 ? 6% perf-profile.children.cycles-pp.___perf_sw_event
0.22 ? 19% -0.1 0.11 ? 10% perf-profile.children.cycles-pp.task_tick_fair
0.18 ? 10% -0.1 0.08 ? 4% perf-profile.children.cycles-pp.shmem_pseudo_vma_init
0.36 ? 9% -0.1 0.26 ? 10% perf-profile.children.cycles-pp.xas_load
0.29 ? 6% -0.1 0.19 ? 4% perf-profile.children.cycles-pp.xas_create_range
0.32 ? 7% -0.1 0.23 ? 8% perf-profile.children.cycles-pp.xas_find
0.27 ? 5% -0.1 0.18 ? 5% perf-profile.children.cycles-pp.xas_create
0.14 ? 7% -0.1 0.06 ? 11% perf-profile.children.cycles-pp.__memcg_kmem_charge_page
0.22 ? 11% -0.1 0.14 ? 5% perf-profile.children.cycles-pp.pagecache_get_page
0.13 ? 13% -0.1 0.07 ? 7% perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.11 ? 7% -0.1 0.04 ? 45% perf-profile.children.cycles-pp.__memcg_kmem_charge
0.15 ? 4% -0.1 0.09 ? 9% perf-profile.children.cycles-pp.pte_alloc_one
0.11 ? 8% -0.1 0.04 ? 45% perf-profile.children.cycles-pp.obj_cgroup_charge
0.11 ? 12% -0.1 0.04 ? 45% perf-profile.children.cycles-pp.cap_vm_enough_memory
0.10 ? 28% -0.1 0.03 ? 70% perf-profile.children.cycles-pp.up_read
0.13 ? 17% -0.1 0.07 ? 55% perf-profile.children.cycles-pp.tick_nohz_irq_exit
0.10 ? 23% -0.1 0.04 ? 76% perf-profile.children.cycles-pp.ktime_get_update_offsets_now
0.08 ? 5% -0.0 0.03 ? 70% perf-profile.children.cycles-pp.__pte_alloc
0.11 ? 14% -0.0 0.06 ? 7% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.10 ? 11% -0.0 0.06 ? 6% perf-profile.children.cycles-pp.find_vma
0.10 ? 13% -0.0 0.06 ? 11% perf-profile.children.cycles-pp.xas_find_conflict
0.17 ? 6% -0.0 0.13 ? 6% perf-profile.children.cycles-pp.xas_alloc
0.09 ? 9% -0.0 0.05 ? 7% perf-profile.children.cycles-pp.vmacache_find
0.07 ? 10% -0.0 0.03 ? 70% perf-profile.children.cycles-pp.__irqentry_text_end
0.14 ? 10% -0.0 0.10 ? 10% perf-profile.children.cycles-pp.__list_add_valid
0.09 ? 11% -0.0 0.06 perf-profile.children.cycles-pp.___might_sleep
0.10 ? 13% -0.0 0.07 ? 12% perf-profile.children.cycles-pp.xas_start
0.08 ? 8% -0.0 0.06 ? 13% perf-profile.children.cycles-pp.__mod_zone_page_state
0.07 ? 11% -0.0 0.05 perf-profile.children.cycles-pp.perf_swevent_get_recursion_context
0.02 ? 99% +0.0 0.07 ? 20% perf-profile.children.cycles-pp.__x64_sys_exit_group
0.02 ? 99% +0.0 0.07 ? 20% perf-profile.children.cycles-pp.do_group_exit
0.02 ? 99% +0.0 0.07 ? 20% perf-profile.children.cycles-pp.do_exit
0.09 ? 19% +0.1 0.18 ? 9% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
0.00 +0.1 0.14 ? 12% perf-profile.children.cycles-pp.cgroup_rstat_updated
0.79 ? 8% +0.6 1.39 ? 9% perf-profile.children.cycles-pp.__mod_lruvec_page_state
1.58 ? 4% +2.6 4.22 ? 4% perf-profile.children.cycles-pp.finish_fault
1.03 ? 4% +2.7 3.69 ? 4% perf-profile.children.cycles-pp.__count_memcg_events
1.32 ? 4% +2.8 4.09 ? 4% perf-profile.children.cycles-pp.do_set_pte
1.23 ? 4% +2.8 4.04 ? 4% perf-profile.children.cycles-pp.page_add_file_rmap
52.26 ? 5% +6.7 58.94 ? 3% perf-profile.children.cycles-pp.__do_fault
52.22 ? 5% +6.7 58.91 ? 3% perf-profile.children.cycles-pp.shmem_fault
52.10 ? 5% +6.8 58.86 ? 3% perf-profile.children.cycles-pp.shmem_getpage_gfp
3.10 ? 4% +7.6 10.71 ? 8% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
57.65 ? 5% +7.9 65.52 ? 2% perf-profile.children.cycles-pp.__handle_mm_fault
57.28 ? 5% +8.0 65.27 ? 2% perf-profile.children.cycles-pp.do_fault
1.90 ? 4% +8.2 10.07 ? 8% perf-profile.children.cycles-pp.__mod_memcg_state
62.00 ? 3% +8.7 70.65 ? 2% perf-profile.children.cycles-pp.asm_exc_page_fault
59.44 ? 5% +10.3 69.79 ? 2% perf-profile.children.cycles-pp.exc_page_fault
59.36 ? 5% +10.4 69.74 ? 2% perf-profile.children.cycles-pp.do_user_addr_fault
58.61 ? 5% +10.7 69.31 ? 2% perf-profile.children.cycles-pp.handle_mm_fault
2.77 ? 12% +27.4 30.19 ? 8% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
2.86 ? 12% +27.4 30.29 ? 8% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
2.77 ? 12% +27.4 30.21 ? 8% perf-profile.children.cycles-pp.lock_page_lruvec_irqsave
4.26 ? 10% +28.1 32.32 ? 7% perf-profile.children.cycles-pp.lru_cache_add
4.15 ? 10% +28.2 32.33 ? 7% perf-profile.children.cycles-pp.__pagevec_lru_add
16.97 ? 10% -9.6 7.42 ? 6% perf-profile.self.cycles-pp.mem_cgroup_charge
12.87 ? 8% -7.0 5.83 ? 4% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
4.57 ? 7% -2.2 2.38 ? 6% perf-profile.self.cycles-pp.clear_page_erms
4.82 ? 8% -2.0 2.82 ? 7% perf-profile.self.cycles-pp.shmem_getpage_gfp
1.16 ? 8% -0.8 0.36 ? 6% perf-profile.self.cycles-pp.page_counter_try_charge
0.88 ? 5% -0.7 0.16 ? 10% perf-profile.self.cycles-pp.propagate_protected_usage
1.21 ? 4% -0.7 0.55 ? 17% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
1.40 ? 9% -0.5 0.86 ? 3% perf-profile.self.cycles-pp.next_uptodate_page
0.57 ? 9% -0.4 0.14 ? 5% perf-profile.self.cycles-pp.try_charge
0.98 ? 10% -0.4 0.59 ? 5% perf-profile.self.cycles-pp.filemap_map_pages
0.55 ? 20% -0.3 0.26 ? 19% perf-profile.self.cycles-pp.ktime_get
0.64 ? 9% -0.2 0.40 ? 5% perf-profile.self.cycles-pp.sync_regs
0.24 ? 10% -0.2 0.05 ? 47% perf-profile.self.cycles-pp.lock_page_memcg
0.48 ? 7% -0.2 0.30 ? 7% perf-profile.self.cycles-pp.unlock_page
0.26 ? 12% -0.2 0.10 ? 7% perf-profile.self.cycles-pp.rmqueue
0.25 ? 8% -0.1 0.12 ? 4% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.31 ? 9% -0.1 0.18 ? 7% perf-profile.self.cycles-pp.rmqueue_bulk
0.35 ? 9% -0.1 0.23 ? 11% perf-profile.self.cycles-pp.__mod_node_page_state
0.43 ? 10% -0.1 0.31 ? 4% perf-profile.self.cycles-pp.__pagevec_lru_add
0.30 ? 8% -0.1 0.18 ? 6% perf-profile.self.cycles-pp.__handle_mm_fault
0.26 ? 10% -0.1 0.15 ? 7% perf-profile.self.cycles-pp._raw_spin_lock
0.17 ? 9% -0.1 0.08 ? 6% perf-profile.self.cycles-pp.shmem_pseudo_vma_init
0.20 ? 12% -0.1 0.11 ? 6% perf-profile.self.cycles-pp.___perf_sw_event
0.23 ? 10% -0.1 0.14 ? 3% perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.16 ? 8% -0.1 0.08 ? 12% perf-profile.self.cycles-pp.get_page_from_freelist
0.20 ? 10% -0.1 0.12 ? 9% perf-profile.self.cycles-pp.shmem_alloc_and_acct_page
0.19 ? 11% -0.1 0.11 ? 3% perf-profile.self.cycles-pp.do_user_addr_fault
0.27 ? 11% -0.1 0.20 ? 10% perf-profile.self.cycles-pp.xas_load
0.17 ? 10% -0.1 0.10 ? 7% perf-profile.self.cycles-pp.handle_mm_fault
0.17 ? 11% -0.1 0.11 ? 4% perf-profile.self.cycles-pp.__alloc_pages_nodemask
0.12 ? 13% -0.1 0.06 ? 8% perf-profile.self.cycles-pp.lru_cache_add
0.09 ? 8% -0.1 0.02 ? 99% perf-profile.self.cycles-pp.page_add_file_rmap
0.09 ? 23% -0.1 0.03 ?102% perf-profile.self.cycles-pp.ktime_get_update_offsets_now
0.12 ? 15% -0.1 0.06 ? 11% perf-profile.self.cycles-pp.asm_exc_page_fault
0.09 ? 9% -0.1 0.04 ? 71% perf-profile.self.cycles-pp.finish_fault
0.10 ? 10% -0.1 0.05 ? 48% perf-profile.self.cycles-pp.update_process_times
0.14 ? 9% -0.1 0.09 ? 12% perf-profile.self.cycles-pp.error_entry
0.09 ? 11% -0.0 0.04 ? 45% perf-profile.self.cycles-pp.xas_create
0.10 ? 16% -0.0 0.06 ? 9% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.12 ? 17% -0.0 0.08 ? 7% perf-profile.self.cycles-pp.shmem_fault
0.07 ? 9% -0.0 0.03 ? 70% perf-profile.self.cycles-pp.__irqentry_text_end
0.09 ? 12% -0.0 0.05 ? 7% perf-profile.self.cycles-pp.vmacache_find
0.08 ? 13% -0.0 0.04 ? 44% perf-profile.self.cycles-pp.do_fault
0.09 ? 11% -0.0 0.06 ? 8% perf-profile.self.cycles-pp.xas_find_conflict
0.09 ? 9% -0.0 0.06 perf-profile.self.cycles-pp.___might_sleep
0.08 ? 11% -0.0 0.06 ? 9% perf-profile.self.cycles-pp.pagecache_get_page
0.13 ? 8% -0.0 0.10 ? 13% perf-profile.self.cycles-pp.xas_find
0.12 ? 9% -0.0 0.09 ? 10% perf-profile.self.cycles-pp.__list_add_valid
0.08 ? 10% -0.0 0.06 ? 13% perf-profile.self.cycles-pp.__mod_zone_page_state
0.00 +0.1 0.14 ? 13% perf-profile.self.cycles-pp.cgroup_rstat_updated
0.78 ? 8% +0.6 1.38 ? 9% perf-profile.self.cycles-pp.__mod_lruvec_page_state
1.03 ? 5% +2.7 3.68 ? 4% perf-profile.self.cycles-pp.__count_memcg_events
1.88 ? 4% +8.1 10.03 ? 8% perf-profile.self.cycles-pp.__mod_memcg_state
2.77 ? 12% +27.4 30.19 ? 8% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath



vm-scalability.time.system_time

13000 +-------------------------------------------------------------------+
| O |
12000 |-+ O O |
| O |
11000 |-+ |
| |
10000 |-+ |
| |
9000 |-+ |
| |
8000 |-+ ...+......... |
| ....... ... ......+.............|
7000 |... +.............+...... |
| |
6000 +-------------------------------------------------------------------+


vm-scalability.time.percent_of_cpu_this_job_got

4800 +--------------------------------------------------------------------+
4600 |-+ O |
| O |
4400 |-+ O O |
4200 |-+ |
| |
4000 |-+ |
3800 |-+ |
3600 |-+ |
| |
3400 |-+ ...+.......... |
3200 |-+ ....... ... ...+.............|
|... +............ ....... |
3000 |-+ +... |
2800 +--------------------------------------------------------------------+


vm-scalability.time.voluntary_context_switches

71000 +-------------------------------------------------------------------+
| ...+.......... |
70000 |-+ ...... ...|
69000 |.......... .......+... |
| ... ......+...... |
68000 |-+ +...... |
67000 |-+ |
| |
66000 |-+ |
65000 |-+ |
| |
64000 |-+ O |
63000 |-+ O |
| O O |
62000 +-------------------------------------------------------------------+


perf-sched.total_wait_time.average.ms

140 +---------------------------------------------------------------------+
| O |
130 |-+ O |
120 |-+ |
| O |
110 |-+ |
100 |-+ O |
| |
90 |-+ |
80 |-+ |
| .......+.......... |
70 |-+ .......+.............+...... ...|
60 |-+ .......+...... |
|...... |
50 +---------------------------------------------------------------------+


perf-sched.total_wait_and_delay.count.ms

70000 +-------------------------------------------------------------------+
|.............+... |
65000 |-+ ... |
60000 |-+ ... |
| .. |
55000 |-+ . ......+.............|
50000 |-+ +.............+...... |
| |
45000 |-+ |
40000 |-+ |
| O |
35000 |-+ O |
30000 |-+ |
| O O |
25000 +-------------------------------------------------------------------+


perf-sched.total_wait_and_delay.average.ms

140 +---------------------------------------------------------------------+
| O |
130 |-+ O |
120 |-+ |
| O |
110 |-+ |
100 |-+ O |
| |
90 |-+ |
80 |-+ |
| .......+.......... |
70 |-+ .......+.............+...... ...|
60 |-+ .......+...... |
|...... |
50 +---------------------------------------------------------------------+




2300 +--------------------------------------------------------------------+
| O O O O |
2200 |-+ |
2100 |-+ |
| |
2000 |-+ |
1900 |-+ |
| |
1800 |-+ |
1700 |-+ |
| |
1600 |-+ |
1500 |.............+.............+............ .......|
| +.............+...... |
1400 +--------------------------------------------------------------------+




12000 +-------------------------------------------------------------------+
| +.. |
10000 |-+ .. . |
| .. .. |
| .. . |
8000 |-+ . . |
| .. .. |
6000 |-+.. . |
|.. .. |
4000 |-+ |
| +............. ......+.............|
| +...... |
2000 |-+ |
| O O O O |
0 +-------------------------------------------------------------------+




0.6 +---------------------------------------------------------------------+
| |
0.5 |-+ O |
| |
| |
0.4 |-+ |
| O |
0.3 |-+ O |
| O |
0.2 |-+ |
| |
| |
0.1 |-+ |
| |
0 +---------------------------------------------------------------------+


vm-scalability.throughput

7.5e+07 +-----------------------------------------------------------------+
| .+...... .... ...|
7e+07 |...... ... .. ...... |
| .... .... +... |
6.5e+07 |-+ .. ... |
| +. |
6e+07 |-+ |
| |
5.5e+07 |-+ |
| |
5e+07 |-+ |
| |
4.5e+07 |-+ O O O |
| O |
4e+07 +-----------------------------------------------------------------+


vm-scalability.free_time

0.05 +-------------------------------------------------------------------+
| |
0.045 |-+ O |
| |
| O |
0.04 |-+ O |
| O |
0.035 |-+ |
| |
0.03 |-+ |
| |
| |
0.025 |-+ ...+......... |
| ....... ... ......+.............|
0.02 +-------------------------------------------------------------------+


vm-scalability.median

400000 +------------------------------------------------------------------+
380000 |-+ ......+.......... |
| ..+...... ... ......|
360000 |......... ..... +...... |
340000 |-+ ... .... |
| +.. |
320000 |-+ |
300000 |-+ |
280000 |-+ |
| |
260000 |-+ |
240000 |-+ |
| O O O |
220000 |-+ O |
200000 +------------------------------------------------------------------+


perf-sched.wait_time.max.ms.pipe_write.new_sync_write.vfs_write.ksys_write

32 +----------------------------------------------------------------------+
| ...+........... |
31 |............. ....... ... |
30 |-+ +... +.. .|
| . .. |
29 |-+ . . |
28 |-+ .. . |
| . .. |
27 |-+ .. . |
26 |-+ . . |
| . . |
25 |-+ .. .. |
24 |-+ . |
| + |
23 +----------------------------------------------------------------------+


perf-sched.sch_delay.avg.ms.pipe_write.new_sync_write.vfs_write.ksys_write

0.009 +------------------------------------------------------------------+
0.0088 |:+ + : |
| : + : |
0.0086 |-+: + : |
0.0084 |-+ : + : |
| : + : |
0.0082 |-+ : + : |
0.008 |-+ : + : |
0.0078 |-+ : + : |
| : + : |
0.0076 |-+ : + : |
0.0074 |-+ : + : |
| : + : |
0.0072 |-+ : + :|
0.007 +------------------------------------------------------------------+


perf-sched.wait_and_delay.count.pipe_write.new_sync_write.vfs_write.ksys_write

28000 +-------------------------------------------------------------------+
|. |
26000 |-. |
24000 |-+.. |
| . |
22000 |-+ . |
| . |
20000 |-+ . |
| . |
18000 |-+ .. |
16000 |-+ . |
| . |
14000 |-+ .......|
| +............+.............+............+...... |
12000 +-------------------------------------------------------------------+


perf-sched.wait_and_delay.max.ms.pipe_write.new_sync_write.vfs_write.ksys_write

32 +----------------------------------------------------------------------+
| ...+........... |
31 |............. ....... ... |
30 |-+ +... +.. .|
| . .. |
29 |-+ . . |
28 |-+ .. . |
| . .. |
27 |-+ .. . |
26 |-+ . . |
| . . |
25 |-+ .. .. |
24 |-+ . |
| + |
23 +----------------------------------------------------------------------+


perf-sched.wait_and_delay.avg.ms.pipe_write.new_sync_write.vfs_write.ksys_write

0.8 +--------------------------------------------------------------------+
| .......+...... ... |
0.75 |-+ +...... +.............|
0.7 |-+ . |
| .. |
0.65 |-+ . |
0.6 |-+ . |
| .. |
0.55 |-+ . |
0.5 |-+ . |
| . |
0.45 |-.. |
0.4 |.+ |
| |
0.35 +--------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample

***************************************************************************************************
lkp-csl-2ap3: 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
config-5.12.0-11208-g2d146aa3aa84 (176.50 kB)
job-script (7.78 kB)
job.yaml (5.08 kB)
reproduce (815.59 kB)

2021-08-11 06:03:05

by Linus Torvalds

Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Tue, Aug 10, 2021 at 4:59 PM kernel test robot <[email protected]> wrote:
>
> FYI, we noticed a -36.4% regression of vm-scalability.throughput due to commit:
> 2d146aa3aa84 ("mm: memcontrol: switch to rstat")

Hmm. I guess some cost is to be expected, but that's a big regression.

I'm not sure what the code ends up doing, and how relevant this test
is, but Johannes - could you please take a look?

I can't make heads nor tails of the profile. The profile kind of points at this:

> 2.77 ± 12% +27.4 30.19 ± 8% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> 2.86 ± 12% +27.4 30.29 ± 8% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
> 2.77 ± 12% +27.4 30.21 ± 8% perf-profile.children.cycles-pp.lock_page_lruvec_irqsave
> 4.26 ± 10% +28.1 32.32 ± 7% perf-profile.children.cycles-pp.lru_cache_add
> 4.15 ± 10% +28.2 32.33 ± 7% perf-profile.children.cycles-pp.__pagevec_lru_add

and that seems to be from the chain __do_fault -> shmem_fault ->
shmem_getpage_gfp -> lru_cache_add -> __pagevec_lru_add ->
lock_page_lruvec_irqsave -> _raw_spin_lock_irqsave ->
native_queued_spin_lock_slowpath.

That shmem_fault codepath being hot may make sense for some VM
scalability test. But it seems to make little sense when I look at the
commit that it bisected to.

We had another report of this commit causing a much more reasonable
small slowdown (-2.8%) back in May.

I'm not sure what's up with this new report. Johannes, does this make
sense to you?

Is it perhaps another "unlucky cache line placement" thing? Or have the
statistics changes just changed the behavior of the test?

Anybody?

Linus

2021-08-11 20:13:53

by Johannes Weiner

Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Tue, Aug 10, 2021 at 07:59:53PM -1000, Linus Torvalds wrote:
> On Tue, Aug 10, 2021 at 4:59 PM kernel test robot <[email protected]> wrote:
> >
> > FYI, we noticed a -36.4% regression of vm-scalability.throughput due to commit:
> > 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
>
> Hmm. I guess some cost is to be expected, but that's a big regression.
>
> I'm not sure what the code ends up doing, and how relevant this test
> is, but Johannes - could you please take a look?
>
> I can't make heads nor tails of the profile. The profile kind of points at this:
>
> > 2.77 ± 12% +27.4 30.19 ± 8% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> > 2.86 ± 12% +27.4 30.29 ± 8% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
> > 2.77 ± 12% +27.4 30.21 ± 8% perf-profile.children.cycles-pp.lock_page_lruvec_irqsave
> > 4.26 ± 10% +28.1 32.32 ± 7% perf-profile.children.cycles-pp.lru_cache_add
> > 4.15 ± 10% +28.2 32.33 ± 7% perf-profile.children.cycles-pp.__pagevec_lru_add
>
> and that seems to be from the chain __do_fault -> shmem_fault ->
> shmem_getpage_gfp -> lru_cache_add -> __pagevec_lru_add ->
> lock_page_lruvec_irqsave -> _raw_spin_lock_irqsave ->
> native_queued_spin_lock_slowpath.
>
> That shmem_fault codepath being hot may make sense for some VM
> scalability test. But it seems to make little sense when I look at the
> commit that it bisected to.
>
> We had another report of this commit causing a much more reasonable
> small slowdown (-2.8%) back in May.
>
> I'm not sure what's up with this new report. Johannes, does this make
> sense to you?
>
> Is it perhaps another "unlucky cache line placement" thing? Or have the
> statistics changes just changed the behavior of the test?

I'm at a loss as well.

The patch only changes how we aggregate the cgroup's memory.stat file;
it doesn't influence reclaim/LRU operations.

The test itself isn't interacting with memory.stat either - IIRC it
doesn't even run inside a dedicated cgroup in this test
environment. The patch should actually reduce accounting overhead here
because we switched from batched percpu flushing during updates to
only flushing when the stats are *read* - which doesn't happen here.

That would leave cachelines. But the cachelines the patch touched are
in struct mem_cgroup, whereas the lock this profile points out is in a
separately allocated per-node structure. The cache footprint on the
percpu data this test is hammering also didn't increase; it actually
decreased a bit, but I'm not sure where this could cause conflicts.

I'll try to reproduce this on a smaller setup. But I have to say, I've
seen a few of these bisection reports now that didn't seem to make any
sense, which is why I've started to take these with a grain of salt.

2021-08-12 03:32:42

by Feng Tang

Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Tue, Aug 10, 2021 at 07:59:53PM -1000, Linus Torvalds wrote:
> On Tue, Aug 10, 2021 at 4:59 PM kernel test robot <[email protected]> wrote:
> >
> > FYI, we noticed a -36.4% regression of vm-scalability.throughput due to commit:
> > 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
>
> Hmm. I guess some cost is to be expected, but that's a big regression.
>
> I'm not sure what the code ends up doing, and how relevant this test
> is, but Johannes - could you please take a look?
>
> I can't make heads nor tails of the profile. The profile kind of points at this:
>
> > 2.77 ± 12% +27.4 30.19 ± 8% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> > 2.86 ± 12% +27.4 30.29 ± 8% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
> > 2.77 ± 12% +27.4 30.21 ± 8% perf-profile.children.cycles-pp.lock_page_lruvec_irqsave
> > 4.26 ± 10% +28.1 32.32 ± 7% perf-profile.children.cycles-pp.lru_cache_add
> > 4.15 ± 10% +28.2 32.33 ± 7% perf-profile.children.cycles-pp.__pagevec_lru_add
>
> and that seems to be from the chain __do_fault -> shmem_fault ->
> shmem_getpage_gfp -> lru_cache_add -> __pagevec_lru_add ->
> lock_page_lruvec_irqsave -> _raw_spin_lock_irqsave ->
> native_queued_spin_lock_slowpath.
>
> That shmem_fault codepath being hot may make sense for some VM
> scalability test. But it seems to make little sense when I look at the
> commit that it bisected to.
>
> We had another report of this commit causing a much more reasonable
> small slowdown (-2.8%) back in May.
>
> I'm not sure what's up with this new report. Johannes, does this make
> sense to you?
>
> Is it perhaps another "unlucky cache line placement" thing? Or have the
> statistics changes just changed the behavior of the test?

Yes, this is probably related to cache lines.

We just used the perf-c2c tool to profile the data for 2d146aa3aa and its
parent commit. There is very little false sharing for the parent commit,
while there is some for 2d146aa3aa; the hottest spot is:

#
# ----- HITM ----- -- Store Refs -- --------- Data address --------- ---------- cycles ---------- Total cpu Shared
# Num RmtHitm LclHitm L1 Hit L1 Miss Offset Node PA cnt Code address rmt hitm lcl hitm load records cnt Symbol Object Source:Line Node
# ..... ....... ....... ....... ....... .................. .... ...... .................. ........ ........ ........ ....... ........ .............................. ................. ..................... ....
#
-------------------------------------------------------------
0 0 2036 0 0 0xffff8881c0642000
-------------------------------------------------------------
0.00% 45.58% 0.00% 0.00% 0x0 0 1 0xffffffff8137071c 0 2877 3221 8969 191 [k] __mod_memcg_state [kernel.kallsyms] memcontrol.c:772 0 1 2 3
0.00% 20.92% 0.00% 0.00% 0x0 0 1 0xffffffff8137091c 0 3027 2841 6626 188 [k] __count_memcg_events [kernel.kallsyms] memcontrol.c:920 0 1 2 3
0.00% 17.88% 0.00% 0.00% 0x10 0 1 0xffffffff8136d7ad 0 3047 3326 3820 187 [k] get_mem_cgroup_from_mm [kernel.kallsyms] percpu-refcount.h:174 0 1 2 3
0.00% 8.94% 0.00% 0.00% 0x10 0 1 0xffffffff81375374 0 3192 3041 2067 187 [k] mem_cgroup_charge [kernel.kallsyms] percpu-refcount.h:174 0 1 2 3

And it seems there is some cache false sharing when accessing the mem_cgroup
member 'struct cgroup_subsys_state'. Judging from the offsets (0x0 and 0x10 here)
and the calling sites, the cache false sharing could happen between:

  cgroup_rstat_updated (read memcg->css.cgroup, offset 0x0)
and
  get_mem_cgroup_from_mm
    css_tryget(&memcg->css) (read/write memcg->css.refcnt, offset 0x10)

(This could be wrong as many of the functions are inlined, and the
exact calling site isn't shown)

And to verify this, we did a test by adding padding between
memcg->css.cgroup and memcg->css.refcnt to push them into 2
different cache lines, and the performance is partly restored:

dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 73371bf27a8a8ea68df2fbf456b
---------------- --------------------------- ---------------------------
65523232 ± 4% -40.8% 38817332 ± 5% -19.6% 52701654 ± 3% vm-scalability.throughput

0.58 ± 2% +3.1 3.63 +2.3 2.86 ± 4% pp.bt.__count_memcg_events.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
1.15 +3.1 4.20 +2.5 3.68 ± 4% pp.bt.page_add_file_rmap.do_set_pte.finish_fault.do_fault.__handle_mm_fault
0.53 ± 2% +3.1 3.62 +2.5 2.99 ± 2% pp.bt.__mod_memcg_lruvec_state.page_add_file_rmap.do_set_pte.finish_fault.do_fault
1.16 +3.3 4.50 +3.2 4.38 ± 3% pp.bt.__mod_memcg_lruvec_state.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault.__do_fault
0.80 +3.5 4.29 +3.2 3.99 ± 3% pp.bt.__mod_memcg_state.__mod_memcg_lruvec_state.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault
0.00 +3.5 3.50 +2.8 2.78 ± 2% pp.bt.__mod_memcg_state.__mod_memcg_lruvec_state.page_add_file_rmap.do_set_pte.finish_fault
52.02 ± 3% +13.3 65.29 ± 2% +4.3 56.34 ± 6% pp.bt.__do_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
51.98 ± 3% +13.3 65.27 ± 2% +4.3 56.31 ± 6% pp.bt.shmem_fault.__do_fault.do_fault.__handle_mm_fault.handle_mm_fault
51.87 ± 3% +13.3 65.21 ± 2% +4.3 56.22 ± 6% pp.bt.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault.__handle_mm_fault
56.75 ± 3% +15.0 71.78 ± 2% +6.3 63.09 ± 5% pp.bt.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
1.97 ± 3% +33.9 35.87 ± 5% +19.8 21.79 ± 23% pp.bt._raw_spin_lock_irqsave.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp
1.98 ± 3% +33.9 35.89 ± 5% +19.8 21.80 ± 23% pp.bt.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp.shmem_fault

We are still checking more, and will update if there is new data.

Btw, the test platform is a 2-socket, 4-node, 96C/192T Cascade Lake AP box,
and if we run the same case on a 2-socket, 2-node, 48C/96T Cascade Lake SP box,
the regression is about -22.3%.

Thanks,
Feng

> Anybody?
>
> Linus

2021-08-16 03:29:50

by Feng Tang

Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Thu, Aug 12, 2021 at 11:19:10AM +0800, Feng Tang wrote:
> On Tue, Aug 10, 2021 at 07:59:53PM -1000, Linus Torvalds wrote:
[SNIP]

> And seems there is some cache false sharing when accessing mem_cgroup
> member: 'struct cgroup_subsys_state', from the offset (0x0 and 0x10 here)
> and the calling sites, the cache false sharing could happen between:
>
> cgroup_rstat_updated (read memcg->css.cgroup, offset 0x0)
> and
> get_mem_cgroup_from_mm
> css_tryget(&memcg->css) (read/write memcg->css.refcnt, offset 0x10)
>
> (This could be wrong as many of the functions are inlined, and the
> exact calling site isn't shown)
>
> And to verify this, we did a test by adding padding between
> memcg->css.cgroup and memcg->css.refcnt to push them into 2
> different cache lines, and the performance are partly restored:
>
> dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 73371bf27a8a8ea68df2fbf456b
> ---------------- --------------------------- ---------------------------
> 65523232 ± 4% -40.8% 38817332 ± 5% -19.6% 52701654 ± 3% vm-scalability.throughput
>
> We are still checking more, and will update if there is new data.

Seems this is the second case to hit 'adjacent cacheline prefetch'; the
first time we saw it, it was also related to mem_cgroup:
https://lore.kernel.org/lkml/[email protected]/

In the previous debug patch, 'css.cgroup' and 'css.refcnt' were separated
into 2 cache lines, but those are still adjacent (2N and 2N+1) cache lines,
so the adjacent-cacheline prefetcher can still pull them in together. With
more padding (128 bytes added in between), the performance is restored, and
is even better (test run 3 times):

dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 2e34d6daf5fbab0fb286dcdb3bc
---------------- --------------------------- ---------------------------
65523232 ± 4% -40.8% 38817332 ± 5% +23.4% 80862243 ± 3% vm-scalability.throughput

The debug patch is:
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -142,6 +142,8 @@ struct cgroup_subsys_state {
 	/* PI: the cgroup subsystem that this css is attached to */
 	struct cgroup_subsys *ss;
 
+	unsigned long pad[16];
+
 	/* reference count - access via css_[try]get() and css_put() */
 	struct percpu_ref refcnt;

Thanks,
Feng

> Btw, the test platform is a 2 sockets, 4 nodes, 96C/192T Cascadelake AP,
> and if run the same case on 2S/2Nodes/48C/96T Cascade Lake SP box, the
> regression is about -22.3%.
>
> Thanks,
> Feng
>
> > Anybody?
> >
> > Linus

2021-08-16 21:44:44

by Johannes Weiner

Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Mon, Aug 16, 2021 at 11:28:55AM +0800, Feng Tang wrote:
> On Thu, Aug 12, 2021 at 11:19:10AM +0800, Feng Tang wrote:
> > On Tue, Aug 10, 2021 at 07:59:53PM -1000, Linus Torvalds wrote:
> [SNIP]
>
> > And seems there is some cache false sharing when accessing mem_cgroup
> > member: 'struct cgroup_subsys_state', from the offset (0x0 and 0x10 here)
> > and the calling sites, the cache false sharing could happen between:
> >
> > cgroup_rstat_updated (read memcg->css.cgroup, offset 0x0)
> > and
> > get_mem_cgroup_from_mm
> > css_tryget(&memcg->css) (read/write memcg->css.refcnt, offset 0x10)
> >
> > (This could be wrong as many of the functions are inlined, and the
> > exact calling site isn't shown)

Thanks for digging more into this.

The offset 0x0 access is new in the page instantiation path with the
bisected patch, so that part makes sense. The new sequence is this:

shmem_add_to_page_cache()
  mem_cgroup_charge()
    get_mem_cgroup_from_mm()
      css_tryget()                  # touches memcg->css.refcnt
  xas_store()
  __mod_lruvec_page_state()
    __mod_lruvec_state()
      __mod_memcg_lruvec_state()
        __mod_memcg_state()
          __this_cpu_add()
          cgroup_rstat_updated()    # touches memcg->css.cgroup

whereas before, __mod_memcg_state() would just do stuff inside memcg.
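
Roughly, the post-commit update path looks like this (paraphrased from
mm/memcontrol.c after the bisected commit, not a verbatim copy):

void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
{
	if (mem_cgroup_disabled())
		return;

	/* the counter update itself stays in memcg's percpu area ... */
	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);

	/* ... but flagging the cgroup for rstat flushing dereferences
	 * memcg->css.cgroup, i.e. offset 0x0 of the css */
	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
}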

However, css.refcnt is a percpu-refcount. We should see a read-only
lookup of the base pointer inside this cacheline, with the write
occurring in percpu memory elsewhere. Even if it were in atomic/shared
mode, which it shouldn't be for the root cgroup, the shared atomic_t
is also located in an auxiliary allocation and shouldn't overlap with
the cgroup pointer in any way.
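
For reference, the percpu fast path of css_tryget() boils down to roughly
this (simplified from percpu-refcount.h of that era; locking details and
atomic-mode handling omitted):

	rcu_read_lock();
	/* read-only access to the css cache line */
	percpu_count = READ_ONCE(css->refcnt.percpu_count_ptr);
	if (!(percpu_count & __PERCPU_REF_ATOMIC_DEAD))
		/* the write lands in separately allocated percpu memory */
		this_cpu_inc(*(unsigned long __percpu *)percpu_count);
	rcu_read_unlock();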

The css itself is embedded inside struct mem_cgroup, which does see
modifications. But the closest of those is 3 cachelines down (struct
page_counter memory), so that doesn't make sense, either.

Does this theory require writes? Because I don't actually see any (hot
ones, anyway) inside struct cgroup_subsys_state for this workload.

> > And to verify this, we did a test by adding padding between
> > memcg->css.cgroup and memcg->css.refcnt to push them into 2
> > different cache lines, and the performance are partly restored:
> >
> > dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 73371bf27a8a8ea68df2fbf456b
> > ---------------- --------------------------- ---------------------------
> > 65523232 ± 4% -40.8% 38817332 ± 5% -19.6% 52701654 ± 3% vm-scalability.throughput
> >
> > We are still checking more, and will update if there is new data.
>
> Seems this is the second case to hit 'adjacent cacheline prefetch",
> the first time we saw it is also related with mem_cgroup
> https://lore.kernel.org/lkml/[email protected]/
>
> In previous debug patch, the 'css.cgroup' and 'css.refcnt' is
> separated to 2 cache lines, which are still adjacent (2N and 2N+1)
> cachelines. And with more padding (add 128 bytes padding in between),
> the performance is restored, and even better (test run 3 times):
>
> dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 2e34d6daf5fbab0fb286dcdb3bc
> ---------------- --------------------------- ---------------------------
> 65523232 ± 4% -40.8% 38817332 ± 5% +23.4% 80862243 ± 3% vm-scalability.throughput
>
> The debug patch is:
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -142,6 +142,8 @@ struct cgroup_subsys_state {
> /* PI: the cgroup subsystem that this css is attached to */
> struct cgroup_subsys *ss;
>
> + unsigned long pad[16];
> +
> /* reference count - access via css_[try]get() and css_put() */
> struct percpu_ref refcnt;

We aren't particularly space-constrained in this structure, so padding
should generally be acceptable here.

2021-08-17 02:46:48

by Feng Tang

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Mon, Aug 16, 2021 at 05:41:57PM -0400, Johannes Weiner wrote:
> On Mon, Aug 16, 2021 at 11:28:55AM +0800, Feng Tang wrote:
> > On Thu, Aug 12, 2021 at 11:19:10AM +0800, Feng Tang wrote:
> > > On Tue, Aug 10, 2021 at 07:59:53PM -1000, Linus Torvalds wrote:
> > [SNIP]
> >
> > > And seems there is some cache false sharing when accessing mem_cgroup
> > > member: 'struct cgroup_subsys_state', from the offset (0x0 and 0x10 here)
> > > and the calling sites, the cache false sharing could happen between:
> > >
> > > cgroup_rstat_updated (read memcg->css.cgroup, offset 0x0)
> > > and
> > > get_mem_cgroup_from_mm
> > > css_tryget(&memcg->css) (read/write memcg->css.refcnt, offset 0x10)
> > >
> > > (This could be wrong as many of the functions are inlined, and the
> > > exact calling site isn't shown)
>
> Thanks for digging more into this.
>
> The offset 0x0 access is new in the page instantiation path with the
> bisected patch, so that part makes sense. The new sequence is this:
>
> shmem_add_to_page_cache()
> mem_cgroup_charge()
> get_mem_cgroup_from_mm()
> css_tryget() # touches memcg->css.refcnt
> xas_store()
> __mod_lruvec_page_state()
> __mod_lruvec_state()
> __mod_memcg_lruvec_state()
> __mod_memcg_state()
> __this_cpu_add()
> cgroup_rstat_updated() # touches memcg->css.cgroup
>
> whereas before, __mod_memcg_state() would just do stuff inside memcg.

Yes, the perf record/report data also shows these two as hotspots; one
takes about 6% of CPU cycles, the other takes 10%.

> However, css.refcnt is a percpu-refcount. We should see a read-only
> lookup of the base pointer inside this cacheline, with the write
> occuring in percpu memory elsewhere. Even if it were in atomic/shared
> mode, which it shouldn't be for the root cgroup, the shared atomic_t
> is also located in an auxiliary allocation and shouldn't overlap with
> the cgroup pointer in any way.
>
> The css itself is embedded inside struct mem_cgroup, which does see
> modifications. But the closest of those is 3 cachelines down (struct
> page_counter memory), so that doesn't make sense, either.
>
> Does this theory require writes? Because I don't actually see any (hot
> ones, anyway) inside struct cgroup_subsys_state for this workload.

You are right. The access to 'css.refcnt' is a read, and false sharing
is a kind of interference between reads and writes. I had presumed it
was a global reference count, and that the try_get was a write operation.

Initially, from the perf-c2c data, the in-cacheline hotspots are only
0x0 and 0x10; if we extend to 2 cachelines, there is one more offset,
0x54 (css.flags), but I still can't figure out which member inside the
128-byte range is written frequently.

/* pahole info for cgroup_subsys_state */
struct cgroup_subsys_state {
struct cgroup * cgroup; /* 0 8 */
struct cgroup_subsys * ss; /* 8 8 */
struct percpu_ref refcnt; /* 16 16 */
struct list_head sibling; /* 32 16 */
struct list_head children; /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct list_head rstat_css_node; /* 64 16 */
int id; /* 80 4 */
unsigned int flags; /* 84 4 */
u64 serial_nr; /* 88 8 */
atomic_t online_cnt; /* 96 4 */

/* XXX 4 bytes hole, try to pack */

struct work_struct destroy_work; /* 104 32 */
/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */

Since the test run implies this is cacheline related, and I'm not very
familiar with the mem_cgroup code, the original perf-c2c log is attached,
which may give more hints.

Thanks,
Feng

> > > And to verify this, we did a test by adding padding between
> > > memcg->css.cgroup and memcg->css.refcnt to push them into 2
> > > different cache lines, and the performance are partly restored:
> > >
> > > dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 73371bf27a8a8ea68df2fbf456b
> > > ---------------- --------------------------- ---------------------------
> > > 65523232 ± 4% -40.8% 38817332 ± 5% -19.6% 52701654 ± 3% vm-scalability.throughput
> > >
> > > We are still checking more, and will update if there is new data.
> >
> > Seems this is the second case to hit 'adjacent cacheline prefetch",
> > the first time we saw it is also related with mem_cgroup
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > In previous debug patch, the 'css.cgroup' and 'css.refcnt' is
> > separated to 2 cache lines, which are still adjacent (2N and 2N+1)
> > cachelines. And with more padding (add 128 bytes padding in between),
> > the performance is restored, and even better (test run 3 times):
> >
> > dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 2e34d6daf5fbab0fb286dcdb3bc
> > ---------------- --------------------------- ---------------------------
> > 65523232 ± 4% -40.8% 38817332 ± 5% +23.4% 80862243 ± 3% vm-scalability.throughput
> >
> > The debug patch is:
> > --- a/include/linux/cgroup-defs.h
> > +++ b/include/linux/cgroup-defs.h
> > @@ -142,6 +142,8 @@ struct cgroup_subsys_state {
> > /* PI: the cgroup subsystem that this css is attached to */
> > struct cgroup_subsys *ss;
> >
> > + unsigned long pad[16];
> > +
> > /* reference count - access via css_[try]get() and css_put() */
> > struct percpu_ref refcnt;
>
> We aren't particularly space-constrained in this structure, so padding
> should generally be acceptable here.


Attachments:
(No filename) (5.75 kB)
perf-c2c-2d146aa3.log (89.71 kB)

2021-08-17 16:49:31

by Michal Koutný

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Tue, Aug 17, 2021 at 10:45:00AM +0800, Feng Tang <[email protected]> wrote:
> Initially from the perf-c2c data, the in-cacheline hotspots are only
> 0x0, and 0x10, and if we extends to 2 cachelines, there is one more
> offset 0x54 (css.flags), but still I can't figure out which member
> inside the 128 bytes range is written frequenty.

Is it certain that the perf-c2c reported offsets are within the cacheline
holding the first bytes of struct cgroup_subsys_state? (Yeah, it looks
that way to me, given what code accesses those and your padding fixing
it. I'm just raising it in case there is anything non-obvious.)

>
> /* pah info for cgroup_subsys_state */
> struct cgroup_subsys_state {
> struct cgroup * cgroup; /* 0 8 */
> struct cgroup_subsys * ss; /* 8 8 */
> struct percpu_ref refcnt; /* 16 16 */
> struct list_head sibling; /* 32 16 */
> struct list_head children; /* 48 16 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> struct list_head rstat_css_node; /* 64 16 */
> int id; /* 80 4 */
> unsigned int flags; /* 84 4 */
> u64 serial_nr; /* 88 8 */
> atomic_t online_cnt; /* 96 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> struct work_struct destroy_work; /* 104 32 */
> /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
>
> Since the test run implies this is cacheline related, and I'm not very
> familiar with the mem_cgroup code, the original perf-c2c log is attached
> which may give more hints.

As noted by Johannes, even in atomic mode, the refcnt would have the
atomic part elsewhere. The other members shouldn't be written frequently
unless there are some intense modifications of the cgroup tree in
parallel.
Does the benchmark create lots of memory cgroups in such a fashion?

Regards,
Michal

2021-08-17 17:13:55

by Shakeel Butt

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Tue, Aug 17, 2021 at 9:47 AM Michal Koutný <[email protected]> wrote:
>
> On Tue, Aug 17, 2021 at 10:45:00AM +0800, Feng Tang <[email protected]> wrote:
> > Initially from the perf-c2c data, the in-cacheline hotspots are only
> > 0x0, and 0x10, and if we extends to 2 cachelines, there is one more
> > offset 0x54 (css.flags), but still I can't figure out which member
> > inside the 128 bytes range is written frequenty.
>
> Is it certain that perf-c2c reported offsets are the cacheline of the
> first bytes of struct cgroup_subsys_state? (Yeah, it looks to me so,
> given what code accesses those and your padding fixing it. I'm just
> raising it in case there was anything non-obvious.)
>
> >
> > /* pah info for cgroup_subsys_state */
> > struct cgroup_subsys_state {
> > struct cgroup * cgroup; /* 0 8 */
> > struct cgroup_subsys * ss; /* 8 8 */
> > struct percpu_ref refcnt; /* 16 16 */
> > struct list_head sibling; /* 32 16 */
> > struct list_head children; /* 48 16 */
> > /* --- cacheline 1 boundary (64 bytes) --- */
> > struct list_head rstat_css_node; /* 64 16 */
> > int id; /* 80 4 */
> > unsigned int flags; /* 84 4 */
> > u64 serial_nr; /* 88 8 */
> > atomic_t online_cnt; /* 96 4 */
> >
> > /* XXX 4 bytes hole, try to pack */
> >
> > struct work_struct destroy_work; /* 104 32 */
> > /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> >
> > Since the test run implies this is cacheline related, and I'm not very
> > familiar with the mem_cgroup code, the original perf-c2c log is attached
> > which may give more hints.
>
> As noted by Johannes, even in atomic mode, the refcnt would have the
> atomic part elsewhere. The other members shouldn't be written frequently
> unless there are some intense modifications of the cgroup tree in
> parallel.
> Does the benchmark create lots of memory cgroups in such a fashion?

From what I know the benchmark is running in the root cgroup and there
is no cgroup manipulation.

2021-08-18 02:32:18

by Feng Tang

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

Hi Michal,

On Tue, Aug 17, 2021 at 06:47:37PM +0200, Michal Koutný wrote:
> On Tue, Aug 17, 2021 at 10:45:00AM +0800, Feng Tang <[email protected]> wrote:
> > Initially from the perf-c2c data, the in-cacheline hotspots are only
> > 0x0, and 0x10, and if we extends to 2 cachelines, there is one more
> > offset 0x54 (css.flags), but still I can't figure out which member
> > inside the 128 bytes range is written frequenty.
>
> Is it certain that perf-c2c reported offsets are the cacheline of the
> first bytes of struct cgroup_subsys_state? (Yeah, it looks to me so,
> given what code accesses those and your padding fixing it. I'm just
> raising it in case there was anything non-obvious.)

Thanks for checking.

Yes, they are. 'struct cgroup_subsys_state' is the first member of
'mem_cgroup', whose address is always cacheline aligned (debug info
shows it's even 2KB or 4KB aligned).

> >
> > /* pah info for cgroup_subsys_state */
> > struct cgroup_subsys_state {
> > struct cgroup * cgroup; /* 0 8 */
> > struct cgroup_subsys * ss; /* 8 8 */
> > struct percpu_ref refcnt; /* 16 16 */
> > struct list_head sibling; /* 32 16 */
> > struct list_head children; /* 48 16 */
> > /* --- cacheline 1 boundary (64 bytes) --- */
> > struct list_head rstat_css_node; /* 64 16 */
> > int id; /* 80 4 */
> > unsigned int flags; /* 84 4 */
> > u64 serial_nr; /* 88 8 */
> > atomic_t online_cnt; /* 96 4 */
> >
> > /* XXX 4 bytes hole, try to pack */
> >
> > struct work_struct destroy_work; /* 104 32 */
> > /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> >
> > Since the test run implies this is cacheline related, and I'm not very
> > familiar with the mem_cgroup code, the original perf-c2c log is attached
> > which may give more hints.
>
> As noted by Johannes, even in atomic mode, the refcnt would have the
> atomic part elsewhere. The other members shouldn't be written frequently
> unless there are some intense modifications of the cgroup tree in
> parallel.
> Does the benchmark create lots of memory cgroups in such a fashion?

As Shakeel also mentioned, this 0day vm-scalability test doesn't involve
any explicit mem_cgroup configuration. And it's running on a simplified
Debian 10 rootfs which has some systemd boot-time cgroup setup.

Thanks,
Feng

> Regards,
> Michal

2021-08-30 14:55:24

by Michal Koutný

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

Hello Feng.

On Wed, Aug 18, 2021 at 10:30:04AM +0800, Feng Tang <[email protected]> wrote:
> As Shakeel also mentioned, this 0day's vm-scalability doesn't involve
> any explicit mem_cgroup configurations.

If it all happens inside root memcg, there should be no accesses to the
0x10 offset since the root memcg is excluded from refcounting. (Unless
the modified cacheline is a μarch artifact. Actually, for the lack of
other ideas, I was thinking about similar cause even for non-root memcgs
since the percpu refcounting is implemented via a segment register.)
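
(For context, the exclusion happens via the CSS_NO_REF flag; roughly,
paraphrasing css_tryget() from include/linux/cgroup.h of that time,
details may differ slightly:)

/* Rough paraphrase: the root css has CSS_NO_REF set, so for the root
 * memcg css_tryget() only reads css->flags and never touches the
 * percpu_ref at offset 0x10. */
static inline bool css_tryget(struct cgroup_subsys_state *css)
{
	if (!(css->flags & CSS_NO_REF))
		return percpu_ref_tryget(&css->refcnt);
	return true;
}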

Is this still relevant? (You refer to it as 0day's vm-scalability
issue.)

By some rough estimates there could be ~10 cgroup_subsys_states per 10 MiB
of workload, so the 128B padding gives a 1e-4 relative overhead (but
presumably less in most cases). I also think it is acceptable (size-wise).

Out of curiosity, have you measured the impact of reshuffling the refcnt
member into the middle of the cgroup_subsys_state (keeping it distant
both from .cgroup and .parent)?

Thanks,
Michal

2021-08-31 06:33:14

by Feng Tang

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

Hi Michal,

On Mon, Aug 30, 2021 at 04:51:04PM +0200, Michal Koutný wrote:
> Hello Feng.
>
> On Wed, Aug 18, 2021 at 10:30:04AM +0800, Feng Tang <[email protected]> wrote:
> > As Shakeel also mentioned, this 0day's vm-scalability doesn't involve
> > any explicit mem_cgroup configurations.
>
> If it all happens inside root memcg, there should be no accesses to the
> 0x10 offset since the root memcg is excluded from refcounting. (Unless
> the modified cacheline is a μarch artifact. Actually, for the lack of
> other ideas, I was thinking about similar cause even for non-root memcgs
> since the percpu refcounting is implemented via a segment register.)

Though I haven't checked the exact memcg that the perf-c2c hot spots
pointed to, I don't think it's the root memcg. From debugging, in the test
run, the OS had created about 50 memcgs before the vm-scalability test run,
mostly for systemd services, and during the test no new memcg is
created.

> Is this still relevant? (You refer to it as 0day's vm-scalability
> issue.)
>
> By some rough estimates there could be ~10 cgroup_subsys_sets per 10 MiB
> of workload, so the 128B padding gives 1e-4 relative overhead (but
> presumably less in most cases). I also think it acceptable (size-wise).
>
> Out of curiosity, have you measured impact of reshuffling the refcnt
> member into the middle of the cgroup_subsys_state (keeping it distant
> both from .cgroup and .parent)?

Yes, I tried many rearrangements of the members of cgroup_subsys_state,
and even of nearby members of memcg, but there were no obvious changes.
What does recover the regression is adding 128 bytes of padding in the css,
no matter whether at the start, the end, or in the middle.


One finding is that this could be related to the HW cache prefetcher.

From this article
https://software.intel.com/content/www/us/en/develop/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors.html

there are four bits controlling different types of prefetcher; on the
testbox (Cascade Lake AP platform), they are all enabled by default.
When we disable the "L2 hardware prefetcher" (bit 0), the performance
for commit 2d146aa3aa8 is almost the same as for its parent commit.

So it seems to be affected by the HW cache prefetcher's policy: the
test's access pattern changes the HW prefetcher behavior, which in
turn affects the performance.
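
For reference only (not something from this thread), a minimal
user-space sketch of inspecting those bits on one CPU; the MSR number
0x1a4 and the bit meanings are taken from the Intel article above, and
it assumes the 'msr' kernel module is loaded and root privileges:

/* Sketch only: dump the per-core prefetcher control bits of CPU 0.
 * MSR 0x1a4 and bit layout per the Intel disclosure article above;
 * a set bit means the corresponding prefetcher is disabled. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0 || pread(fd, &val, sizeof(val), 0x1a4) != sizeof(val)) {
		perror("reading MSR 0x1a4");
		return 1;
	}
	printf("L2 HW prefetcher         : %s\n", (val & 0x1) ? "disabled" : "enabled");
	printf("L2 adjacent line prefetch: %s\n", (val & 0x2) ? "disabled" : "enabled");
	printf("DCU prefetcher           : %s\n", (val & 0x4) ? "disabled" : "enabled");
	printf("DCU IP prefetcher        : %s\n", (val & 0x8) ? "disabled" : "enabled");
	close(fd);
	return 0;
}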

Also, the tests show the regression is platform dependent: it can be
seen on Cascade Lake AP (36%) and SP (20%), but not on an Ice Lake SP
2S platform.

Thanks,
Feng

2021-08-31 09:25:37

by Michal Koutný

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Tue, Aug 31, 2021 at 02:30:36PM +0800, Feng Tang <[email protected]> wrote:
> Yes, I tried many re-arrangement of the members of cgroup_subsys_state,
> and even close members of memcg, but there were no obvious changes.
> What can recover the regresion is adding 128 bytes padding in the css,
> no matter at the start, end or in the middle.

Do you mean the padding added outside the .cgroup--.refcnt members area
also restores the benchmark results? (Or do you refer to paddings that
move .cgroup and .refcnt across a cacheline border?) I'm asking to be
sure we have a correct understanding of what members are contended
(what's the frequent writer).

Thanks,
Michal

2021-09-01 04:54:53

by Feng Tang

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Tue, Aug 31, 2021 at 11:23:04AM +0200, Michal Koutný wrote:
> On Tue, Aug 31, 2021 at 02:30:36PM +0800, Feng Tang <[email protected]> wrote:
> > Yes, I tried many re-arrangement of the members of cgroup_subsys_state,
> > and even close members of memcg, but there were no obvious changes.
> > What can recover the regresion is adding 128 bytes padding in the css,
> > no matter at the start, end or in the middle.
>
> Do you mean the padding added outside the .cgroup--.refcnt members area
> also restores the benchmark results? (Or you refer to paddings that move
> .cgroup and .refcnt across a cacheline border ?) I'm asking to be sure
> we have correct understanding of what members are contended (what's the
> frequent writer).

Yes, in the tests I did, no matter where the 128B padding is added, the
performance can be restored and even improved.

struct cgroup_subsys_state {
<----------------- padding
struct cgroup *cgroup;
struct cgroup_subsys *ss;
<----------------- padding
struct percpu_ref refcnt;
struct list_head sibling;
struct list_head children;
struct list_head rstat_css_node;
int id;
unsigned int flags;
u64 serial_nr;
atomic_t online_cnt;
struct work_struct destroy_work;
struct rcu_work destroy_rwork;
struct cgroup_subsys_state *parent;
<----------------- padding
};

Other things I tried were moving the untouched members around
to separate the several hottest members, but with little effect.

From the perf-tool data, 3 members are frequently accessed
(read, actually): 'cgroup', 'refcnt', 'flags'.

I also used the 'perf mem' command, trying to catch reads/writes to
the css, and haven't found any _write_ operation, nor does reading
the code turn one up.

That led me to go check the "HW cache prefetcher", as in my
last email. All these test results make me think it's a
data-access-pattern-triggered, HW-prefetcher-related performance
change.

Thanks,
Feng


> Thanks,
> Michal

2021-09-01 22:19:50

by Andi Kleen

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

Feng Tang <[email protected]> writes:
>
> Yes, the tests I did is no matter where the 128B padding is added, the
> performance can be restored and even improved.

I wonder if we can find some cold, rarely accessed, data to put into the
padding to not waste it. Perhaps some name strings? Or the destroy
support, which doesn't sound like it's commonly used.

-Andi

2021-09-02 01:55:03

by Feng Tang

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Wed, Sep 01, 2021 at 08:12:24AM -0700, Andi Kleen wrote:
> Feng Tang <[email protected]> writes:
> >
> > Yes, the tests I did is no matter where the 128B padding is added, the
> > performance can be restored and even improved.
>
> I wonder if we can find some cold, rarely accessed, data to put into the
> padding to not waste it. Perhaps some name strings? Or the destroy
> support, which doesn't sound like its commonly used.

Yes, I tried moving 'destroy_work', 'destroy_rwork' and 'parent' to
before the 'refcnt' together with some padding, and it restored the
performance to about a 10~15% regression (debug patch pasted below).

But I'm not sure we should use it before we can fully explain the
regression.

Thanks,
Feng

commit a308d90b0d1973eb75551540a7aa849cabc8b8af
Author: Feng Tang <[email protected]>
Date: Sat Aug 14 16:18:43 2021 +0800

move the member around

Signed-off-by: Feng Tang <[email protected]>

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f9fb7f0..255f668 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -139,10 +139,21 @@ struct cgroup_subsys_state {
/* PI: the cgroup that this css is attached to */
struct cgroup *cgroup;

+ struct cgroup_subsys_state *parent;
+
/* PI: the cgroup subsystem that this css is attached to */
struct cgroup_subsys *ss;

- unsigned long pad[16];
+ /* percpu_ref killing and RCU release */
+ struct work_struct destroy_work;
+ struct rcu_work destroy_rwork;
+
+ unsigned long pad[2]; /* 128 bytes */

/* reference count - access via css_[try]get() and css_put() */
struct percpu_ref refcnt;
@@ -176,6 +187,7 @@ struct cgroup_subsys_state {
*/
atomic_t online_cnt;

+ #if 0
/* percpu_ref killing and RCU release */
struct work_struct destroy_work;
struct rcu_work destroy_rwork;
@@ -185,6 +197,7 @@ struct cgroup_subsys_state {
* fields of the containing structure.
*/
struct cgroup_subsys_state *parent;
+ #endif
};

/*


> -Andi

2021-09-02 02:26:50

by Andi Kleen

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression


On 9/1/2021 6:35 PM, Feng Tang wrote:
> On Wed, Sep 01, 2021 at 08:12:24AM -0700, Andi Kleen wrote:
>> Feng Tang <[email protected]> writes:
>>> Yes, the tests I did is no matter where the 128B padding is added, the
>>> performance can be restored and even improved.
>> I wonder if we can find some cold, rarely accessed, data to put into the
>> padding to not waste it. Perhaps some name strings? Or the destroy
>> support, which doesn't sound like its commonly used.
> Yes, I tried to move 'destroy_work', 'destroy_rwork' and 'parent' over
> before the 'refcnt' together with some padding, it restored the performance
> to about 10~15% regression. (debug patch pasted below)
>
> But I'm not sure if we should use it, before we can fully explain the
> regression.

Narrowing it down to a single prefetcher seems good enough to me. The
behavior of the prefetchers is fairly complicated and hard to predict,
so I doubt you'll ever get a 100% step by step explanation.


-Andi


2021-09-02 03:49:28

by Feng Tang

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Wed, Sep 01, 2021 at 07:23:24PM -0700, Andi Kleen wrote:
>
> On 9/1/2021 6:35 PM, Feng Tang wrote:
> >On Wed, Sep 01, 2021 at 08:12:24AM -0700, Andi Kleen wrote:
> >>Feng Tang <[email protected]> writes:
> >>>Yes, the tests I did is no matter where the 128B padding is added, the
> >>>performance can be restored and even improved.
> >>I wonder if we can find some cold, rarely accessed, data to put into the
> >>padding to not waste it. Perhaps some name strings? Or the destroy
> >>support, which doesn't sound like its commonly used.
> >Yes, I tried to move 'destroy_work', 'destroy_rwork' and 'parent' over
> >before the 'refcnt' together with some padding, it restored the performance
> >to about 10~15% regression. (debug patch pasted below)
> >
> >But I'm not sure if we should use it, before we can fully explain the
> >regression.
>
> Narrowing it down to a single prefetcher seems good enough to me. The
> behavior of the prefetchers is fairly complicated and hard to predict, so I
> doubt you'll ever get a 100% step by step explanation.

Yes, I'm afraid so, given that the policy/algorithm used by the prefetcher
keeps changing from generation to generation.

I will test the patch more with other benchmarks.

Thanks,
Feng

>
> -Andi
>

2021-09-02 23:23:31

by Michal Koutný

[permalink] [raw]
Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

Hi.

On Thu, Sep 02, 2021 at 11:46:28AM +0800, Feng Tang <[email protected]> wrote:
> > Narrowing it down to a single prefetcher seems good enough to me. The
> > behavior of the prefetchers is fairly complicated and hard to predict, so I
> > doubt you'll ever get a 100% step by step explanation.

My layman explanation, with the available information, is that the
prefetcher somehow behaves as if it marked the offending cacheline as
modified (even though it is only read), therefore slowing down the
remote reader.


On Thu, Sep 02, 2021 at 09:35:58AM +0800, Feng Tang <[email protected]> wrote:
> @@ -139,10 +139,21 @@ struct cgroup_subsys_state {
> /* PI: the cgroup that this css is attached to */
> struct cgroup *cgroup;
>
> + struct cgroup_subsys_state *parent;
> +
> /* PI: the cgroup subsystem that this css is attached to */
> struct cgroup_subsys *ss;

Hm, an interesting move; be mindful of commit b8b1a2e5eca6 ("cgroup:
move cgroup_subsys_state parent field for cache locality"). It might be
a regression for systems with the cpuacct root css present. (That is
likely a large share of systems nowadays; that may be the reason why you
don't see a full recovery? For the future, we may at least guard
cpuacct_charge() with a cgroup_subsys_enabled() static branch.)

> [snip]
> Yes, I'm afriad so, given that the policy/algorithm used by perfetcher
> keeps changing from generation to generation.

Exactly. I'm afraid of re-laying out the structure with each new
generation. A robust solution is putting all frequently accessed members
into individual cache lines + separating them with one more cache line? :-/
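
(Purely as an illustration of that idea, not a posted patch, such a
layout could look roughly like the sketch below, at the cost of a much
larger structure; which members count as "hot" here is hypothetical:)

/* Illustrative sketch only: give each hot member its own cacheline and
 * keep the following (2N+1) line cold, so adjacent-line prefetch never
 * pairs two hot fields. */
struct cgroup_subsys_state {
	struct cgroup *cgroup ____cacheline_aligned_in_smp;
	char pad_cgroup[L1_CACHE_BYTES];

	struct percpu_ref refcnt ____cacheline_aligned_in_smp;
	char pad_refcnt[L1_CACHE_BYTES];

	struct cgroup_subsys_state *parent ____cacheline_aligned_in_smp;
	char pad_parent[L1_CACHE_BYTES];

	/* ... remaining, colder members as before ... */
};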


Michal