LinuxLists.cc - [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

2021-08-11 03:03:05

Subject: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

Greeting,

FYI, we noticed a

commit: 2d146aa3aa842d7f506 In addition to that,
+------------------+------- | testcase: change | test machine | test parameters | class=device | | | disk=1HDD | | nr_threads=100% | | testtime=60s | | ucode=0x5003006 +------------------+-------

If you fix the issue, Reported-by: kernel

Details are as below:
---------------------------

To reproduce:

git clone bin/lkp install bin/lkp split-job bin/lkp run
=========================== compiler/cpufreq_governor/k gcc-9/performance/x86_64-
commit:
dc26532aed ("cgroup: 2d146aa3aa ("mm:
dc26532aed0ab25c 2d146aa3aa ---------------- ---------- %stddev %change \ | 0.02 ? 5% +90.7% 359287 ? 5% -36.5% 69325283 ? 5% 53491 ? 11% +33.6% 7.408e+08 3121 ? 4% +42.5% 7257 ? 6% +65.2% 2266 ? 9% -25.1% 69029 -8.1% 3.318e+09 7852412 ? 5% 18180 ? 3% +30.1% 12.52 ? 6% +7.8 3.90 ? 8% -1.0 81.67 -8.2% 32.83 ? 4% +40.1% 9136 ? 13% -29.8% 1.868e+08 ? 2% 1.869e+08 ? 2% 1.876e+08 1.876e+08 272.83 ? 5% +38.3% 18.62 ? 4% +6.7 38.78 ? 2% -10.1 147.74 ? 4% 3233 ? 33% +166.7% 1897908 ? 19% 4875 ? 33% +72.7% 1949132 ? 7% 4033 ? 36% -40.5% 69457 ? 12% -19.4% 1951496 ? 13% 69457 ? 12% -19.4% 1914648 ? 7% 4068 ? 9% +34.8% 12999071 12456571 1961429 ? 5% 4538 ? 3% +30.4% 12407630 12456570 7.443e+08 7.441e+08 7.46e+08 7.42e+08 7.457e+08 283300 ? 2% 808.00 ? 33% +166.8% 473185 ? 18% +62.7% 1209 ? 32% +73.8% 808.00 ? 33% +166.8% 490316 ? 7% +29.9% 96371466 ? 2% 96136178 ? 2% 1008 ? 36% -40.5% 492855 ? 10% +34.7% 17340 ? 12% -19.3% 1008 ? 36% -40.5% 479874 ? 7% +36.2% 1013 ? 8% +34.1% 96808651 96571930 0.03 ? 37% +249.4% 0.01 ? 12% -100.0% 0.00 ?223% +16950.0% 0.00 ?142% +318.2% 3.15 ?108% +181.3% 6.17 ?136% -100.0% 0.00 ?223% +81550.0% 0.00 ?142% +318.2% 0.07 ? 56% +158.2% 0.04 ? 12% +102.2% 65.06 ? 9% +82.1% 57034 ? 12% -42.6% 65.02 ? 9% +82.1% 0.16 ? 41% +876.7% 9.64 ? 4% +59.6% 0.69 ? 21% -100.0% 4563 ? 61% -88.6% 16728 ? 4% -29.1% 15718 ? 33% -100.0% 1500 +48.6% 51.67 ? 9% -23.9% 29.42 ? 9% -100.0% 0.16 ? 41% +879.5% 1.96 ?141% +141.8% 9.64 ? 4% +59.3% 0.68 ? 21% -100.0% 0.00 ?223% +24122.2% 0.23 ?166% -82.8% 2.19 ?143% +286.1% 29.40 ? 9% -100.0% 0.00 ?223% +60722.2% 328790 ? 6% -32.6% 1655 ? 30% +70.1% 1655 ? 30% +70.1% 1419 ? 33% +48.8% 1419 ? 33% +48.8% 1617 ? 54% -43.5% 1198 ? 40% +80.7% 1198 ? 40% +80.7% 1062 ? 44% +84.9% 1062 ? 44% +84.9% 1342 ? 33% +47.3% 1342 ? 33% +47.3% 1712 ? 74% -38.9% 1398 ? 21% -25.2% 1905 ? 72% -45.6% 2796 ?101% -61.6% 1246 ? 16% -22.1% 1578 ? 7% +62.4% 1578 ? 7% +62.4% 1661 ? 50% -43.8% 3393 ? 84% -71.8% 1376 ? 24% +50.1% 1376 ? 24% +50.1% 1480 ? 25% +93.1% 1480 ? 25% +93.1% 1427 ? 29% -35.0% 1029 ? 32% +65.7% 1029 ? 32% +65.7% 1179 ? 38% +68.3% 1179 ? 38% +68.3% 3836 ? 96% -73.7% 1274 ? 33% +62.3% 1274 ? 33% +62.3% 1457 ? 26% +39.7% 1457 ? 26% +39.7% 1454 ? 40% -29.9% 1543 ? 57% -36.0% 1789 ? 55% -43.2% 57.17 ? 8% +165.0% 63.83 ? 15% +186.4% 1209 ? 15% +30.0% 121.17 ? 53% +271.0% 1.551e+10 66444144 9079 ? 13% -29.9% 1.46 ? 2% +22.4% 1.035e+11 ? 3% 1704 +21.1% 0.02 ? 6% +0.0 1.565e+10 0.02 +0.0 4.343e+09 50.45 +4.6 5111149 ? 4% 2595279 5.647e+10 7962 ? 2% +12.0% 0.71 -11.5% 0.54 ? 3% +36.5% 185.49 2346771 5135545 1167240 ? 4% 65.40 -10.6 2725640 ? 4% 9038103 2346773 30.83 ? 2% -3.2 1.84 ? 5% +41.5% 1557 ? 4% +78.7% 11129 ? 4% +9.1% 0.54 ? 5% -29.4% 23.14 ? 3% -11.1 5346 +8.0% 1.595e+10 68523526 9075 ? 13% -30.0% 1.067e+11 ? 4% 1.608e+10 4.442e+09 5217650 ? 4% 2569538 ? 2% 5.796e+10 1.72 ? 4% +7.2% 2423700 5257866 ? 2% 1197096 ? 4% 2811001 ? 4% 9336409 2423702 33414 ? 4% -12.1% 33523 ? 5% -13.8% 33620 ? 4% -10.9% 9177 ? 9% -16.2% 10431 ? 16% -23.8% 9060 ? 10% -15.9% 33743 ? 4% -6.2% 8818 ? 8% -20.6% 8794 ? 10% -20.1% 8534 ? 9% -22.8% 8799 ? 12% -26.0% 10109 ? 33% -32.5% 9380 ? 6% -15.0% 10633 ? 45% -34.8% 9186 ? 17% -24.3% 9000 ? 11% -23.6% 9538 ? 7% -20.2% 9564 ? 15% -24.7% 10287 ? 25% -31.2% 10068 ? 23% -27.1% 8909 ? 9% -19.6% 9001 ? 13% -22.6% 9012 ? 8% -23.0% 8955 ? 8% -21.3% 11374 ? 32% -37.5% 9551 ? 15% -28.2% 8810 ? 10% -22.8% 8594 ? 14% -26.3% 35252 ? 3% -10.5% 35076 ? 2% -11.5% 9237 ? 12% -27.0% 35111 ? 2% -9.6% 9350 ? 8% -23.7% 10047 ? 27% -25.3% 9074 ? 11% -23.7% 12146 ? 47% -43.1% 10375 ? 25% -30.4% 9530 ? 11% -23.9% 9372 ? 7% -23.5% 9663 ? 5% -19.6% 9336 ? 9% -20.9% 9243 ? 9% -21.5% 11490 ? 47% -35.8% 11603 ? 39% -33.4% 34132 ? 2% -7.5% 11001 ? 26% -29.3% 9773 ? 11% -25.0% 9888 ? 7% -25.4% 9610 ? 9% -24.9% 10651 ? 29% -31.6% 9131 ? 8% -20.8% 9302 ? 5% -18.4% 33631 ? 5% -7.9% 9188 ? 8% -22.1% 9122 ? 9% -24.5% 10551 ? 22% -22.3% 9070 ? 7% -21.5% 33361 ? 4% -10.0% 33776 ? 4% -7.0% 1744747 ? 4% 33.04 ? 8% -18.9 35.94 ? 8% -16.0 12.88 ? 8% -7.1 4.52 ? 7% -2.2 2.50 ? 7% -2.0 3.11 ? 10% -1.2 2.05 ? 8% -1.0 1.58 ? 8% -0.8 1.36 ? 8% -0.7 1.24 ? 8% -0.7 1.40 ? 10% -0.6 1.00 ? 16% -0.4 0.99 ? 16% -0.4 0.08 ?223% +0.7 0.08 ?223% +1.0 0.47 ? 45% +1.0 0.00 +1.4 0.45 ?103% +1.5 0.45 ?103% +1.5 1.58 ? 4% +2.6 1.30 ? 4% +2.7 1.22 ? 4% +2.8 0.51 ? 45% +3.0 1.27 ? 4% +3.0 0.29 ?100% +3.2 0.00 +3.4 0.67 ? 44% +3.4 52.25 ? 5% +6.7 52.22 ? 5% +6.7 52.08 ? 5% +6.7 57.54 ? 5% +7.9 57.23 ? 5% +8.0 2.74 ? 12% +27.3 2.75 ? 12% +27.3 2.69 ? 12% +27.3 4.26 ? 10% +28.0 4.13 ? 10% +28.1 33.13 ? 8% -19.0 35.97 ? 8% -16.0 12.99 ? 8% -7.1 4.62 ? 7% -2.2 2.53 ? 7% -1.9 2.04 ? 7% -1.5 3.18 ? 10% -1.2 2.05 ? 8% -0.9 2.64 ? 13% -0.9 2.28 ? 14% -0.7 1.58 ? 8% -0.7 0.89 ? 5% -0.7 1.42 ? 7% -0.7 1.38 ? 8% -0.6 1.63 ? 13% -0.6 1.61 ? 13% -0.6 1.42 ? 9% -0.5 1.01 ? 8% -0.5 0.78 ? 9% -0.4 0.51 ? 3% -0.3 0.93 ? 13% -0.3 0.60 ? 18% -0.3 0.79 ? 9% -0.3 0.73 ? 14% -0.3 0.68 ? 14% -0.2 0.67 ? 14% -0.2 0.65 ? 10% -0.2 0.49 ? 14% -0.2 0.25 ? 10% -0.2 0.45 ? 9% -0.2 0.50 ? 8% -0.2 0.40 ? 12% -0.2 0.42 ? 15% -0.2 0.32 ? 10% -0.1 0.44 ? 9% -0.1 0.25 ? 8% -0.1 0.36 ? 9% -0.1 0.26 ? 11% -0.1 0.22 ? 19% -0.1 0.18 ? 10% -0.1 0.36 ? 9% -0.1 0.29 ? 6% -0.1 0.32 ? 7% -0.1 0.27 ? 5% -0.1 0.14 ? 7% -0.1 0.22 ? 11% -0.1 0.13 ? 13% -0.1 0.11 ? 7% -0.1 0.15 ? 4% -0.1 0.11 ? 8% -0.1 0.11 ? 12% -0.1 0.10 ? 28% -0.1 0.13 ? 17% -0.1 0.10 ? 23% -0.1 0.08 ? 5% -0.0 0.11 ? 14% -0.0 0.10 ? 11% -0.0 0.10 ? 13% -0.0 0.17 ? 6% -0.0 0.09 ? 9% -0.0 0.07 ? 10% -0.0 0.14 ? 10% -0.0 0.09 ? 11% -0.0 0.10 ? 13% -0.0 0.08 ? 8% -0.0 0.07 ? 11% -0.0 0.02 ? 99% +0.0 0.02 ? 99% +0.0 0.02 ? 99% +0.0 0.09 ? 19% +0.1 0.00 +0.1 0.79 ? 8% +0.6 1.58 ? 4% +2.6 1.03 ? 4% +2.7 1.32 ? 4% +2.8 1.23 ? 4% +2.8 52.26 ? 5% +6.7 52.22 ? 5% +6.7 52.10 ? 5% +6.8 3.10 ? 4% +7.6 57.65 ? 5% +7.9 57.28 ? 5% +8.0 1.90 ? 4% +8.2 62.00 ? 3% +8.7 59.44 ? 5% +10.3 59.36 ? 5% +10.4 58.61 ? 5% +10.7 2.77 ? 12% +27.4 2.86 ? 12% +27.4 2.77 ? 12% +27.4 4.26 ? 10% +28.1 4.15 ? 10% +28.2 16.97 ? 10% -9.6 12.87 ? 8% -7.0 4.57 ? 7% -2.2 4.82 ? 8% -2.0 1.16 ? 8% -0.8 0.88 ? 5% -0.7 1.21 ? 4% -0.7 1.40 ? 9% -0.5 0.57 ? 9% -0.4 0.98 ? 10% -0.4 0.55 ? 20% -0.3 0.64 ? 9% -0.2 0.24 ? 10% -0.2 0.48 ? 7% -0.2 0.26 ? 12% -0.2 0.25 ? 8% -0.1 0.31 ? 9% -0.1 0.35 ? 9% -0.1 0.43 ? 10% -0.1 0.30 ? 8% -0.1 0.26 ? 10% -0.1 0.17 ? 9% -0.1 0.20 ? 12% -0.1 0.23 ? 10% -0.1 0.16 ? 8% -0.1 0.20 ? 10% -0.1 0.19 ? 11% -0.1 0.27 ? 11% -0.1 0.17 ? 10% -0.1 0.17 ? 11% -0.1 0.12 ? 13% -0.1 0.09 ? 8% -0.1 0.09 ? 23% -0.1 0.12 ? 15% -0.1 0.09 ? 9% -0.1 0.10 ? 10% -0.1 0.14 ? 9% -0.1 0.09 ? 11% -0.0 0.10 ? 16% -0.0 0.12 ? 17% -0.0 0.07 ? 9% -0.0 0.09 ? 12% -0.0 0.08 ? 13% -0.0 0.09 ? 11% -0.0 0.09 ? 9% -0.0 0.08 ? 11% -0.0 0.13 ? 8% -0.0 0.12 ? 9% -0.0 0.08 ? 10% -0.0 0.00 +0.1 0.78 ? 8% +0.6 1.03 ? 5% +2.7 1.88 ? 4% +8.1 2.77 ? 12% +27.4

13000 +------------------ | O 12000 |-+ | 11000 |-+ | 10000 |-+ | 9000 |-+ | 8000 |-+ ...+......... | ....... 7000 |... | 6000 +------------------ 4800 +------------------- 4600 |-+ O | 4400 |-+ 4200 |-+ | 4000 |-+ 3800 |-+ 3600 |-+ | 3400 |-+ ...+.......... 3200 |-+ ....... |... 3000 |-+ 2800 +------------------- 71000 +------------------ | 70000 |-+ 69000 |.......... | ... 68000 |-+ +...... 67000 |-+ | 66000 |-+ 65000 |-+ | 64000 |-+ 63000 |-+ | O 62000 +------------------ 140 +-------------------- | 130 |-+ 120 |-+ | 110 |-+ 100 |-+ O | 90 |-+ 80 |-+ | 70 |-+ 60 |-+ .......+...... |...... 50 +-------------------- 70000 +------------------ |.............+... 65000 |-+ ... 60000 |-+ | 55000 |-+ 50000 |-+ | 45000 |-+ 40000 |-+ | O 35000 |-+ 30000 |-+ | 25000 +------------------ 140 +-------------------- | 130 |-+ 120 |-+ | 110 |-+ 100 |-+ O | 90 |-+ 80 |-+ | 70 |-+ 60 |-+ .......+...... |...... 50 +-------------------- 2300 +------------------- | O 2200 |-+ 2100 |-+ | 2000 |-+ 1900 |-+ | 1800 |-+ 1700 |-+ | 1600 |-+ 1500 |.............+..... | 1400 +------------------- 12000 +------------------ | +.. 10000 |-+ .. . | .. .. | .. . 8000 |-+ . | .. 6000 |-+.. |.. 4000 |-+ | | 2000 |-+ | O 0 +------------------ 0.6 +-------------------- | 0.5 |-+ | | 0.4 |-+ | O 0.3 |-+ | 0.2 |-+ | | 0.1 |-+ | 0 +-------------------- 7.5e+07 +---------------- | 7e+07 |...... | .... .... 6.5e+07 |-+ .. ... | +. 6e+07 |-+ | 5.5e+07 |-+ | 5e+07 |-+ | 4.5e+07 |-+ | O 4e+07 +---------------- 0.05 +------------------ | 0.045 |-+ O | | 0.04 |-+ | 0.035 |-+ | 0.03 |-+ | | 0.025 |-+ ...+......... | ....... 0.02 +------------------ 400000 +----------------- 380000 |-+ | 360000 |......... 340000 |-+ ... .... | +.. 320000 |-+ 300000 |-+ 280000 |-+ | 260000 |-+ 240000 |-+ | 220000 |-+ O 200000 +----------------- perf-sched.wait_time.max 32 +--------------------- | 31 |............. ....... 30 |-+ +... | 29 |-+ 28 |-+ | 27 |-+ 26 |-+ | 25 |-+ 24 |-+ | 23 +--------------------- perf-sched.sch_delay.a 0.009 +----------------- 0.0088 |:+ | : 0.0086 |-+: 0.0084 |-+ : | : 0.0082 |-+ : 0.008 |-+ : 0.0078 |-+ : | : 0.0076 |-+ : 0.0074 |-+ : | : 0.0072 |-+ : 0.007 +----------------- perf-sched.wait_and_delay 28000 +------------------ |. 26000 |-. 24000 |-+.. | . 22000 |-+ . | . 20000 |-+ . | . 18000 |-+ .. 16000 |-+ . | . 14000 |-+ | 12000 +------------------ perf-sched.wait_and_delay. 32 +--------------------- | 31 |............. ....... 30 |-+ +... | 29 |-+ 28 |-+ | 27 |-+ 26 |-+ | 25 |-+ 24 |-+ | 23 +--------------------- perf-sched.wait_and_delay. 0.8 +------------------- | 0.75 |-+ +...... 0.7 |-+ . | .. 0.65 |-+ . 0.6 |-+ . | .. 0.55 |-+ . 0.5 |-+ . | . 0.45 |-.. 0.4 |.+ | 0.35 +------------------- [*] bisect-good sample
[O] bisect-bad sample

*************************** lkp-csl-2ap3: 192

Disclaimer:
Results have been for informational design or configuration

---
0DAY/LKP+ Test Infrastructure https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
/git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/">https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
the commit also has significant impact on the following tests:
------------------------------------------------------------------------------+
| stress-ng: stress-ng.null.ops_per_sec -45.9% regression |
| 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
|
| cpufreq_governor=performance |
|
|
|
|
------------------------------------------------------------------------------+
kindly add following tag
test robot <[email protected]>
----------------------------------------------------------------------->
href="https://github.com/intel/lkp-tests.git">https://github.com/intel/lkp-tests.git
job.yaml # job file is attached in this email
--compatible job.yaml # generate the yaml file for lkp run
generated-yaml-file
==============================================================
config/rootfs/runtime/size/tbox_group/test/testcase/ucode:
rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/1T/lkp-csl-2ap4/lru-shm/vm-scalability/0x5003006
rstat: punt root-level optimization to individual controllers")
memcontrol: switch to rstat")
842d7f5065802556b
-----------------
%stddev
\
0.04 ? 9% vm-scalability.free_time
228315 ? 4% vm-scalability.median
-36.4% 44114216 ? 5% vm-scalability.throughput
71455 ? 23% vm-scalability.time.involuntary_context_switches
-8.9% 6.746e+08 vm-scalability.time.minor_page_faults
4448 ? 4% vm-scalability.time.percent_of_cpu_this_job_got
11989 ? 4% vm-scalability.time.system_time
1697 ? 7% vm-scalability.time.user_time
63441 vm-scalability.time.voluntary_context_switches
-8.9% 3.021e+09 vm-scalability.workload
+40.6% 11040505 ? 5% meminfo.Mapped
23656 ? 4% meminfo.PageTables
20.28 ? 3% mpstat.cpu.all.sys%
2.94 ? 7% mpstat.cpu.all.usr%
75.00 vmstat.cpu.id
46.00 ? 3% vmstat.procs.r
6413 ? 2% vmstat.system.cs
-9.3% 1.694e+08 numa-numastat.node1.local_node
-9.3% 1.694e+08 numa-numastat.node1.numa_hit
-10.8% 1.674e+08 ? 2% numa-numastat.node3.local_node
-10.8% 1.675e+08 ? 2% numa-numastat.node3.numa_hit
377.33 ? 2% turbostat.Avg_MHz
25.30 turbostat.Busy%
28.67 ? 39% turbostat.C1E%
+7.6% 159.04 ? 3% turbostat.PkgWatt
8622 ? 96% numa-meminfo.node0.Active(anon)
+61.0% 3055051 ? 13% numa-meminfo.node0.Mapped
8418 ? 21% numa-meminfo.node0.PageTables
+30.5% 2544570 ? 4% numa-meminfo.node1.Mapped
2401 ? 27% numa-meminfo.node2.Active(anon)
55973 ? 7% numa-meminfo.node2.KReclaimable
+37.2% 2677364 ? 8% numa-meminfo.node2.Mapped
55973 ? 7% numa-meminfo.node2.SReclaimable
+37.2% 2626319 ? 3% numa-meminfo.node3.Mapped
5483 ? 16% numa-meminfo.node3.PageTables
-1.1% 12854405 proc-vmstat.nr_file_pages
-1.2% 12307726 proc-vmstat.nr_inactive_anon
+40.8% 2762082 ? 5% proc-vmstat.nr_mapped
5919 ? 4% proc-vmstat.nr_page_table_pages
-1.2% 12262964 proc-vmstat.nr_shmem
-1.2% 12307724 proc-vmstat.nr_zone_inactive_anon
-8.9% 6.779e+08 proc-vmstat.numa_hit
-8.9% 6.777e+08 proc-vmstat.numa_local
-8.9% 6.794e+08 proc-vmstat.pgalloc_normal
-8.9% 6.758e+08 proc-vmstat.pgfault
-8.9% 6.793e+08 proc-vmstat.pgfree
-7.4% 262251 ? 2% proc-vmstat.pgreuse
2155 ? 96% numa-vmstat.node0.nr_active_anon
769993 ? 12% numa-vmstat.node0.nr_mapped
2103 ? 21% numa-vmstat.node0.nr_page_table_pages
2155 ? 96% numa-vmstat.node0.nr_zone_active_anon
637150 ? 5% numa-vmstat.node1.nr_mapped
-9.6% 87157667 numa-vmstat.node1.numa_hit
-9.6% 86939810 numa-vmstat.node1.numa_local
599.83 ? 27% numa-vmstat.node2.nr_active_anon
664105 ? 7% numa-vmstat.node2.nr_mapped
13999 ? 7% numa-vmstat.node2.nr_slab_reclaimable
599.83 ? 27% numa-vmstat.node2.nr_zone_active_anon
653534 ? 2% numa-vmstat.node3.nr_mapped
1359 ? 17% numa-vmstat.node3.nr_page_table_pages
-10.8% 86336260 ? 2% numa-vmstat.node3.numa_hit
-10.8% 86156545 ? 2% numa-vmstat.node3.numa_local
0.10 ? 58% perf-sched.sch_delay.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
0.00 perf-sched.sch_delay.avg.ms.pipe_write.new_sync_write.vfs_write.ksys_write
0.17 ?202% perf-sched.sch_delay.avg.ms.preempt_schedule_common.__cond_resched.unmap_vmas.unmap_region.__do_munmap
0.01 ? 68% perf-sched.sch_delay.avg.ms.schedule_timeout.wait_for_completion.__flush_work.lru_add_drain_all
8.86 ? 47% perf-sched.sch_delay.max.ms.do_syslog.part.0.kmsg_read.vfs_read
0.00 perf-sched.sch_delay.max.ms.pipe_write.new_sync_write.vfs_write.ksys_write
0.82 ?207% perf-sched.sch_delay.max.ms.preempt_schedule_common.__cond_resched.unmap_vmas.unmap_region.__do_munmap
0.01 ? 68% perf-sched.sch_delay.max.ms.schedule_timeout.wait_for_completion.__flush_work.lru_add_drain_all
0.17 ? 39% perf-sched.sch_delay.max.ms.sigsuspend.__x64_sys_rt_sigsuspend.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.08 ? 35% perf-sched.total_sch_delay.average.ms
118.46 ? 8% perf-sched.total_wait_and_delay.average.ms
32746 ? 10% perf-sched.total_wait_and_delay.count.ms
118.38 ? 8% perf-sched.total_wait_time.average.ms
1.61 ? 58% perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
15.39 ? 30% perf-sched.wait_and_delay.avg.ms.pipe_read.new_sync_read.vfs_read.ksys_read
0.00 perf-sched.wait_and_delay.avg.ms.pipe_write.new_sync_write.vfs_write.ksys_write
518.67 ? 26% perf-sched.wait_and_delay.count.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
11854 ? 29% perf-sched.wait_and_delay.count.pipe_read.new_sync_read.vfs_read.ksys_read
0.00 perf-sched.wait_and_delay.count.pipe_write.new_sync_write.vfs_write.ksys_write
2229 perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_select
39.33 ? 15% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
0.00 perf-sched.wait_and_delay.max.ms.pipe_write.new_sync_write.vfs_write.ksys_write
1.58 ? 58% perf-sched.wait_time.avg.ms.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
4.75 ? 54% perf-sched.wait_time.avg.ms.io_schedule.__lock_page.pagecache_get_page.shmem_getpage_gfp
15.35 ? 30% perf-sched.wait_time.avg.ms.pipe_read.new_sync_read.vfs_read.ksys_read
0.00 perf-sched.wait_time.avg.ms.pipe_write.new_sync_write.vfs_write.ksys_write
0.36 ? 22% perf-sched.wait_time.avg.ms.preempt_schedule_common.__cond_resched.unmap_vmas.unmap_region.__do_munmap
0.04 ? 30% perf-sched.wait_time.avg.ms.preempt_schedule_common.__cond_resched.zap_pte_range.unmap_page_range.unmap_vmas
8.44 ? 50% perf-sched.wait_time.max.ms.io_schedule.__lock_page.pagecache_get_page.shmem_getpage_gfp
0.00 perf-sched.wait_time.max.ms.pipe_write.new_sync_write.vfs_write.ksys_write
0.91 ? 46% perf-sched.wait_time.max.ms.preempt_schedule_common.__cond_resched.unmap_vmas.unmap_region.__do_munmap
221487 ? 3% interrupts.CAL:Function_call_interrupts
2816 ? 34% interrupts.CPU0.NMI:Non-maskable_interrupts
2816 ? 34% interrupts.CPU0.PMI:Performance_monitoring_interrupts
2111 ? 16% interrupts.CPU11.NMI:Non-maskable_interrupts
2111 ? 16% interrupts.CPU11.PMI:Performance_monitoring_interrupts
913.83 ? 16% interrupts.CPU113.CAL:Function_call_interrupts
2166 ? 14% interrupts.CPU126.NMI:Non-maskable_interrupts
2166 ? 14% interrupts.CPU126.PMI:Performance_monitoring_interrupts
1964 ? 21% interrupts.CPU129.NMI:Non-maskable_interrupts
1964 ? 21% interrupts.CPU129.PMI:Performance_monitoring_interrupts
1977 ? 15% interrupts.CPU138.NMI:Non-maskable_interrupts
1977 ? 15% interrupts.CPU138.PMI:Performance_monitoring_interrupts
1046 ? 7% interrupts.CPU146.CAL:Function_call_interrupts
1045 ? 4% interrupts.CPU162.CAL:Function_call_interrupts
1036 ? 8% interrupts.CPU164.CAL:Function_call_interrupts
1073 ? 9% interrupts.CPU166.CAL:Function_call_interrupts
971.67 ? 14% interrupts.CPU169.CAL:Function_call_interrupts
2563 ? 28% interrupts.CPU169.NMI:Non-maskable_interrupts
2563 ? 28% interrupts.CPU169.PMI:Performance_monitoring_interrupts
934.33 ? 15% interrupts.CPU17.CAL:Function_call_interrupts
958.17 ? 3% interrupts.CPU18.CAL:Function_call_interrupts
2066 ? 16% interrupts.CPU2.NMI:Non-maskable_interrupts
2066 ? 16% interrupts.CPU2.PMI:Performance_monitoring_interrupts
2858 ? 63% interrupts.CPU38.NMI:Non-maskable_interrupts
2858 ? 63% interrupts.CPU38.PMI:Performance_monitoring_interrupts
928.17 ? 16% interrupts.CPU4.CAL:Function_call_interrupts
1705 ? 26% interrupts.CPU41.NMI:Non-maskable_interrupts
1705 ? 26% interrupts.CPU41.PMI:Performance_monitoring_interrupts
1985 ? 16% interrupts.CPU47.NMI:Non-maskable_interrupts
1985 ? 16% interrupts.CPU47.PMI:Performance_monitoring_interrupts
1007 ? 9% interrupts.CPU5.CAL:Function_call_interrupts
2068 ? 19% interrupts.CPU5.NMI:Non-maskable_interrupts
2068 ? 19% interrupts.CPU5.PMI:Performance_monitoring_interrupts
2036 ? 13% interrupts.CPU60.NMI:Non-maskable_interrupts
2036 ? 13% interrupts.CPU60.PMI:Performance_monitoring_interrupts
1019 ? 5% interrupts.CPU61.CAL:Function_call_interrupts
988.67 ? 3% interrupts.CPU65.CAL:Function_call_interrupts
1016 ? 5% interrupts.CPU66.CAL:Function_call_interrupts
151.50 ? 74% interrupts.CPU77.RES:Rescheduling_interrupts
182.83 ? 86% interrupts.CPU78.RES:Rescheduling_interrupts
1571 ? 6% interrupts.CPU95.CAL:Function_call_interrupts
449.50 ? 20% interrupts.CPU95.RES:Rescheduling_interrupts
-4.0% 1.489e+10 perf-stat.i.branch-instructions
-23.1% 51101081 perf-stat.i.cache-misses
6364 ? 3% perf-stat.i.context-switches
1.78 perf-stat.i.cpi
+36.5% 1.413e+11 ? 3% perf-stat.i.cpu-cycles
2063 ? 4% perf-stat.i.cycles-between-cache-misses
0.04 ? 57% perf-stat.i.dTLB-load-miss-rate%
-2.9% 1.521e+10 perf-stat.i.dTLB-loads
0.03 ? 19% perf-stat.i.dTLB-store-miss-rate%
-8.1% 3.993e+09 perf-stat.i.dTLB-stores
55.09 ? 7% perf-stat.i.iTLB-load-miss-rate%
-11.0% 4548466 ? 2% perf-stat.i.iTLB-load-misses
-14.0% 2231645 ? 16% perf-stat.i.iTLB-loads
-3.2% 5.466e+10 perf-stat.i.instructions
8914 ? 3% perf-stat.i.instructions-per-iTLB-miss
0.63 ? 3% perf-stat.i.ipc
0.73 ? 3% perf-stat.i.metric.GHz
-4.0% 178.09 perf-stat.i.metric.M/sec
-10.4% 2102847 perf-stat.i.minor-faults
-12.6% 4490509 perf-stat.i.node-load-misses
-8.8% 1064214 ? 4% perf-stat.i.node-loads
54.77 ? 6% perf-stat.i.node-store-miss-rate%
-61.9% 1037367 perf-stat.i.node-store-misses
-17.4% 7468754 perf-stat.i.node-stores
-10.4% 2102849 perf-stat.i.page-faults
27.66 ? 7% perf-stat.overall.cache-miss-rate%
2.61 ? 3% perf-stat.overall.cpi
2783 ? 4% perf-stat.overall.cycles-between-cache-misses
12140 perf-stat.overall.instructions-per-iTLB-miss
0.38 ? 3% perf-stat.overall.ipc
12.08 perf-stat.overall.node-store-miss-rate%
5773 perf-stat.overall.path-length
-3.2% 1.543e+10 perf-stat.ps.branch-instructions
-22.8% 52929588 perf-stat.ps.cache-misses
6349 ? 2% perf-stat.ps.context-switches
+38.0% 1.472e+11 ? 3% perf-stat.ps.cpu-cycles
-2.2% 1.573e+10 perf-stat.ps.dTLB-loads
-7.7% 4.102e+09 perf-stat.ps.dTLB-stores
-10.8% 4656233 perf-stat.ps.iTLB-load-misses
-14.5% 2196111 ? 15% perf-stat.ps.iTLB-loads
-2.5% 5.651e+10 perf-stat.ps.instructions
1.85 ? 2% perf-stat.ps.major-faults
-9.7% 2188843 perf-stat.ps.minor-faults
-12.2% 4615133 perf-stat.ps.node-load-misses
-8.7% 1092716 ? 4% perf-stat.ps.node-loads
-62.0% 1067656 perf-stat.ps.node-store-misses
-16.7% 7775080 perf-stat.ps.node-stores
-9.7% 2188844 perf-stat.ps.page-faults
29356 ? 18% softirqs.CPU101.SCHED
28899 ? 14% softirqs.CPU104.SCHED
29966 ? 6% softirqs.CPU105.SCHED
7691 ? 8% softirqs.CPU109.RCU
7943 ? 8% softirqs.CPU11.RCU
7619 ? 7% softirqs.CPU110.RCU
31639 ? 4% softirqs.CPU119.SCHED
7002 ? 9% softirqs.CPU120.RCU
7026 ? 11% softirqs.CPU121.RCU
6584 ? 8% softirqs.CPU124.RCU
6512 ? 10% softirqs.CPU125.RCU
6828 ? 11% softirqs.CPU128.RCU
7977 ? 7% softirqs.CPU13.RCU
6935 ? 9% softirqs.CPU132.RCU
6956 ? 10% softirqs.CPU133.RCU
6877 ? 10% softirqs.CPU136.RCU
7608 ? 12% softirqs.CPU144.RCU
7201 ? 14% softirqs.CPU148.RCU
7072 ? 12% softirqs.CPU149.RCU
7336 ? 13% softirqs.CPU151.RCU
7166 ? 11% softirqs.CPU152.RCU
6968 ? 12% softirqs.CPU153.RCU
6939 ? 12% softirqs.CPU154.RCU
7045 ? 10% softirqs.CPU156.RCU
7109 ? 13% softirqs.CPU157.RCU
6859 ? 13% softirqs.CPU166.RCU
6800 ? 12% softirqs.CPU167.RCU
6330 ? 10% softirqs.CPU175.RCU
31536 ? 5% softirqs.CPU182.SCHED
31042 ? 7% softirqs.CPU184.SCHED
6740 ? 10% softirqs.CPU185.RCU
31730 ? 5% softirqs.CPU185.SCHED
7134 ? 8% softirqs.CPU191.RCU
7504 ? 10% softirqs.CPU23.RCU
6926 ? 11% softirqs.CPU25.RCU
6910 ? 8% softirqs.CPU29.RCU
7216 ? 10% softirqs.CPU32.RCU
7252 ? 11% softirqs.CPU37.RCU
7171 ? 10% softirqs.CPU38.RCU
7772 ? 7% softirqs.CPU4.RCU
7381 ? 9% softirqs.CPU42.RCU
7255 ? 11% softirqs.CPU43.RCU
7371 ? 12% softirqs.CPU47.RCU
7730 ? 13% softirqs.CPU48.RCU
31571 ? 3% softirqs.CPU5.SCHED
7781 ? 15% softirqs.CPU50.RCU
7326 ? 11% softirqs.CPU51.RCU
7373 ? 13% softirqs.CPU52.RCU
7220 ? 10% softirqs.CPU55.RCU
7283 ? 12% softirqs.CPU60.RCU
7229 ? 11% softirqs.CPU63.RCU
7592 ? 7% softirqs.CPU7.RCU
30973 ? 3% softirqs.CPU7.SCHED
7154 ? 6% softirqs.CPU83.RCU
6883 ? 6% softirqs.CPU88.RCU
8200 ? 7% softirqs.CPU9.RCU
7119 ? 10% softirqs.CPU94.RCU
30037 ? 3% softirqs.CPU96.SCHED
31409 ? 4% softirqs.CPU99.SCHED
-19.7% 1400430 ? 10% softirqs.RCU
14.09 ? 5% perf-profile.calltrace.cycles-pp.mem_cgroup_charge.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault.__do_fault
19.98 ? 4% perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
5.78 ? 4% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.mem_cgroup_charge.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault
2.30 ? 7% perf-profile.calltrace.cycles-pp.clear_page_erms.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
0.49 ? 45% perf-profile.calltrace.cycles-pp.try_charge.mem_cgroup_charge.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault
1.89 ? 6% perf-profile.calltrace.cycles-pp.filemap_map_pages.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
1.04 ? 14% perf-profile.calltrace.cycles-pp.shmem_alloc_and_acct_page.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
0.79 ? 15% perf-profile.calltrace.cycles-pp.shmem_alloc_page.shmem_alloc_and_acct_page.shmem_getpage_gfp.shmem_fault.__do_fault
0.62 ? 45% perf-profile.calltrace.cycles-pp.alloc_pages_vma.shmem_alloc_page.shmem_alloc_and_acct_page.shmem_getpage_gfp.shmem_fault
0.55 ? 45% perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.alloc_pages_vma.shmem_alloc_page.shmem_alloc_and_acct_page.shmem_getpage_gfp
0.80 ? 13% perf-profile.calltrace.cycles-pp.next_uptodate_page.filemap_map_pages.do_fault.__handle_mm_fault.handle_mm_fault
0.61 ? 48% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
0.60 ? 48% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
0.79 ? 20% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault.__do_fault
1.09 ? 48% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.page_remove_rmap.zap_pte_range.unmap_page_range.unmap_vmas
1.51 ? 5% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp.shmem_fault
1.41 ? 5% perf-profile.calltrace.cycles-pp.__mod_memcg_state.__mod_memcg_lruvec_state.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp
1.91 ? 34% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.91 ? 34% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
4.19 ? 5% perf-profile.calltrace.cycles-pp.finish_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
4.05 ? 5% perf-profile.calltrace.cycles-pp.do_set_pte.finish_fault.do_fault.__handle_mm_fault.handle_mm_fault
4.00 ? 5% perf-profile.calltrace.cycles-pp.page_add_file_rmap.do_set_pte.finish_fault.do_fault.__handle_mm_fault
3.47 ? 6% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.page_add_file_rmap.do_set_pte.finish_fault.do_fault
4.31 ? 3% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault.__do_fault
3.48 ? 4% perf-profile.calltrace.cycles-pp.__count_memcg_events.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
3.35 ? 5% perf-profile.calltrace.cycles-pp.__mod_memcg_state.__mod_memcg_lruvec_state.page_add_file_rmap.do_set_pte.finish_fault
4.09 ? 3% perf-profile.calltrace.cycles-pp.__mod_memcg_state.__mod_memcg_lruvec_state.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault
58.93 ? 3% perf-profile.calltrace.cycles-pp.__do_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
58.91 ? 3% perf-profile.calltrace.cycles-pp.shmem_fault.__do_fault.do_fault.__handle_mm_fault.handle_mm_fault
58.83 ? 3% perf-profile.calltrace.cycles-pp.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault.__handle_mm_fault
65.44 ? 2% perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
65.24 ? 2% perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
30.04 ? 8% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp
30.05 ? 8% perf-profile.calltrace.cycles-pp.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp.shmem_fault
30.01 ? 8% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add
32.30 ? 7% perf-profile.calltrace.cycles-pp.lru_cache_add.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault
32.24 ? 7% perf-profile.calltrace.cycles-pp.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp.shmem_fault.__do_fault
14.13 ? 5% perf-profile.children.cycles-pp.mem_cgroup_charge
20.00 ? 4% perf-profile.children.cycles-pp.shmem_add_to_page_cache
5.87 ? 4% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
2.40 ? 6% perf-profile.children.cycles-pp.clear_page_erms
0.62 ? 7% perf-profile.children.cycles-pp.try_charge
0.52 ? 7% perf-profile.children.cycles-pp.page_counter_try_charge
1.96 ? 5% perf-profile.children.cycles-pp.filemap_map_pages
1.13 ? 5% perf-profile.children.cycles-pp.shmem_alloc_and_acct_page
1.78 ? 20% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
1.53 ? 19% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.85 ? 5% perf-profile.children.cycles-pp.shmem_alloc_page
0.16 ? 9% perf-profile.children.cycles-pp.propagate_protected_usage
0.76 ? 4% perf-profile.children.cycles-pp.__alloc_pages_nodemask
0.76 ? 5% perf-profile.children.cycles-pp.alloc_pages_vma
1.04 ? 16% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
1.03 ? 16% perf-profile.children.cycles-pp.hrtimer_interrupt
0.87 ? 4% perf-profile.children.cycles-pp.next_uptodate_page
0.52 ? 6% perf-profile.children.cycles-pp.get_page_from_freelist
0.42 ? 7% perf-profile.children.cycles-pp.rmqueue
0.19 ? 5% perf-profile.children.cycles-pp.mem_cgroup_charge_statistics
0.62 ? 15% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.30 ? 20% perf-profile.children.cycles-pp.ktime_get
0.50 ? 5% perf-profile.children.cycles-pp.error_entry
0.46 ? 13% perf-profile.children.cycles-pp.tick_sched_timer
0.43 ? 12% perf-profile.children.cycles-pp.tick_sched_handle
0.42 ? 12% perf-profile.children.cycles-pp.update_process_times
0.41 ? 4% perf-profile.children.cycles-pp.sync_regs
0.28 ? 18% perf-profile.children.cycles-pp.clockevents_program_event
0.05 ? 47% perf-profile.children.cycles-pp.lock_page_memcg
0.26 ? 7% perf-profile.children.cycles-pp.rmqueue_bulk
0.32 ? 7% perf-profile.children.cycles-pp.unlock_page
0.23 ? 4% perf-profile.children.cycles-pp.__perf_sw_event
0.26 ? 11% perf-profile.children.cycles-pp.scheduler_tick
0.18 ? 6% perf-profile.children.cycles-pp._raw_spin_lock
0.30 ? 12% perf-profile.children.cycles-pp.__mod_lruvec_state
0.12 ? 7% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.24 ? 11% perf-profile.children.cycles-pp.__mod_node_page_state
0.14 ? 6% perf-profile.children.cycles-pp.___perf_sw_event
0.11 ? 10% perf-profile.children.cycles-pp.task_tick_fair
0.08 ? 4% perf-profile.children.cycles-pp.shmem_pseudo_vma_init
0.26 ? 10% perf-profile.children.cycles-pp.xas_load
0.19 ? 4% perf-profile.children.cycles-pp.xas_create_range
0.23 ? 8% perf-profile.children.cycles-pp.xas_find
0.18 ? 5% perf-profile.children.cycles-pp.xas_create
0.06 ? 11% perf-profile.children.cycles-pp.__memcg_kmem_charge_page
0.14 ? 5% perf-profile.children.cycles-pp.pagecache_get_page
0.07 ? 7% perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.04 ? 45% perf-profile.children.cycles-pp.__memcg_kmem_charge
0.09 ? 9% perf-profile.children.cycles-pp.pte_alloc_one
0.04 ? 45% perf-profile.children.cycles-pp.obj_cgroup_charge
0.04 ? 45% perf-profile.children.cycles-pp.cap_vm_enough_memory
0.03 ? 70% perf-profile.children.cycles-pp.up_read
0.07 ? 55% perf-profile.children.cycles-pp.tick_nohz_irq_exit
0.04 ? 76% perf-profile.children.cycles-pp.ktime_get_update_offsets_now
0.03 ? 70% perf-profile.children.cycles-pp.__pte_alloc
0.06 ? 7% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.06 ? 6% perf-profile.children.cycles-pp.find_vma
0.06 ? 11% perf-profile.children.cycles-pp.xas_find_conflict
0.13 ? 6% perf-profile.children.cycles-pp.xas_alloc
0.05 ? 7% perf-profile.children.cycles-pp.vmacache_find
0.03 ? 70% perf-profile.children.cycles-pp.__irqentry_text_end
0.10 ? 10% perf-profile.children.cycles-pp.__list_add_valid
0.06 perf-profile.children.cycles-pp.___might_sleep
0.07 ? 12% perf-profile.children.cycles-pp.xas_start
0.06 ? 13% perf-profile.children.cycles-pp.__mod_zone_page_state
0.05 perf-profile.children.cycles-pp.perf_swevent_get_recursion_context
0.07 ? 20% perf-profile.children.cycles-pp.__x64_sys_exit_group
0.07 ? 20% perf-profile.children.cycles-pp.do_group_exit
0.07 ? 20% perf-profile.children.cycles-pp.do_exit
0.18 ? 9% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
0.14 ? 12% perf-profile.children.cycles-pp.cgroup_rstat_updated
1.39 ? 9% perf-profile.children.cycles-pp.__mod_lruvec_page_state
4.22 ? 4% perf-profile.children.cycles-pp.finish_fault
3.69 ? 4% perf-profile.children.cycles-pp.__count_memcg_events
4.09 ? 4% perf-profile.children.cycles-pp.do_set_pte
4.04 ? 4% perf-profile.children.cycles-pp.page_add_file_rmap
58.94 ? 3% perf-profile.children.cycles-pp.__do_fault
58.91 ? 3% perf-profile.children.cycles-pp.shmem_fault
58.86 ? 3% perf-profile.children.cycles-pp.shmem_getpage_gfp
10.71 ? 8% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
65.52 ? 2% perf-profile.children.cycles-pp.__handle_mm_fault
65.27 ? 2% perf-profile.children.cycles-pp.do_fault
10.07 ? 8% perf-profile.children.cycles-pp.__mod_memcg_state
70.65 ? 2% perf-profile.children.cycles-pp.asm_exc_page_fault
69.79 ? 2% perf-profile.children.cycles-pp.exc_page_fault
69.74 ? 2% perf-profile.children.cycles-pp.do_user_addr_fault
69.31 ? 2% perf-profile.children.cycles-pp.handle_mm_fault
30.19 ? 8% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
30.29 ? 8% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
30.21 ? 8% perf-profile.children.cycles-pp.lock_page_lruvec_irqsave
32.32 ? 7% perf-profile.children.cycles-pp.lru_cache_add
32.33 ? 7% perf-profile.children.cycles-pp.__pagevec_lru_add
7.42 ? 6% perf-profile.self.cycles-pp.mem_cgroup_charge
5.83 ? 4% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
2.38 ? 6% perf-profile.self.cycles-pp.clear_page_erms
2.82 ? 7% perf-profile.self.cycles-pp.shmem_getpage_gfp
0.36 ? 6% perf-profile.self.cycles-pp.page_counter_try_charge
0.16 ? 10% perf-profile.self.cycles-pp.propagate_protected_usage
0.55 ? 17% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
0.86 ? 3% perf-profile.self.cycles-pp.next_uptodate_page
0.14 ? 5% perf-profile.self.cycles-pp.try_charge
0.59 ? 5% perf-profile.self.cycles-pp.filemap_map_pages
0.26 ? 19% perf-profile.self.cycles-pp.ktime_get
0.40 ? 5% perf-profile.self.cycles-pp.sync_regs
0.05 ? 47% perf-profile.self.cycles-pp.lock_page_memcg
0.30 ? 7% perf-profile.self.cycles-pp.unlock_page
0.10 ? 7% perf-profile.self.cycles-pp.rmqueue
0.12 ? 4% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.18 ? 7% perf-profile.self.cycles-pp.rmqueue_bulk
0.23 ? 11% perf-profile.self.cycles-pp.__mod_node_page_state
0.31 ? 4% perf-profile.self.cycles-pp.__pagevec_lru_add
0.18 ? 6% perf-profile.self.cycles-pp.__handle_mm_fault
0.15 ? 7% perf-profile.self.cycles-pp._raw_spin_lock
0.08 ? 6% perf-profile.self.cycles-pp.shmem_pseudo_vma_init
0.11 ? 6% perf-profile.self.cycles-pp.___perf_sw_event
0.14 ? 3% perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.08 ? 12% perf-profile.self.cycles-pp.get_page_from_freelist
0.12 ? 9% perf-profile.self.cycles-pp.shmem_alloc_and_acct_page
0.11 ? 3% perf-profile.self.cycles-pp.do_user_addr_fault
0.20 ? 10% perf-profile.self.cycles-pp.xas_load
0.10 ? 7% perf-profile.self.cycles-pp.handle_mm_fault
0.11 ? 4% perf-profile.self.cycles-pp.__alloc_pages_nodemask
0.06 ? 8% perf-profile.self.cycles-pp.lru_cache_add
0.02 ? 99% perf-profile.self.cycles-pp.page_add_file_rmap
0.03 ?102% perf-profile.self.cycles-pp.ktime_get_update_offsets_now
0.06 ? 11% perf-profile.self.cycles-pp.asm_exc_page_fault
0.04 ? 71% perf-profile.self.cycles-pp.finish_fault
0.05 ? 48% perf-profile.self.cycles-pp.update_process_times
0.09 ? 12% perf-profile.self.cycles-pp.error_entry
0.04 ? 45% perf-profile.self.cycles-pp.xas_create
0.06 ? 9% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.08 ? 7% perf-profile.self.cycles-pp.shmem_fault
0.03 ? 70% perf-profile.self.cycles-pp.__irqentry_text_end
0.05 ? 7% perf-profile.self.cycles-pp.vmacache_find
0.04 ? 44% perf-profile.self.cycles-pp.do_fault
0.06 ? 8% perf-profile.self.cycles-pp.xas_find_conflict
0.06 perf-profile.self.cycles-pp.___might_sleep
0.06 ? 9% perf-profile.self.cycles-pp.pagecache_get_page
0.10 ? 13% perf-profile.self.cycles-pp.xas_find
0.09 ? 10% perf-profile.self.cycles-pp.__list_add_valid
0.06 ? 13% perf-profile.self.cycles-pp.__mod_zone_page_state
0.14 ? 13% perf-profile.self.cycles-pp.cgroup_rstat_updated
1.38 ? 9% perf-profile.self.cycles-pp.__mod_lruvec_page_state
3.68 ? 4% perf-profile.self.cycles-pp.__count_memcg_events
10.03 ? 8% perf-profile.self.cycles-pp.__mod_memcg_state
30.19 ? 8% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath

vm-scalability.time.system_time

-------------------------------------------------+
|
O O |
O |
|
|
|
|
|
|
|
... ......+.............|
+.............+...... |
|
-------------------------------------------------+

vm-scalability.time.percent_of_cpu_this_job_got

-------------------------------------------------+
|
O |
O O |
|
|
|
|
|
|
|
... ...+.............|
+............ ....... |
+... |
-------------------------------------------------+

vm-scalability.time.voluntary_context_switches

-------------------------------------------------+
...+.......... |
...... ...|
.......+... |
......+...... |
|
|
|
|
|
|
O |
O |
O |
-------------------------------------------------+

perf-sched.total_wait_time.average.ms

-------------------------------------------------+
O |
O |
|
O |
|
|
|
|
|
.......+.......... |
.......+.............+...... ...|
|
|
-------------------------------------------------+

perf-sched.total_wait_and_delay.count.ms

-------------------------------------------------+
|
|
... |
.. |
. ......+.............|
+.............+...... |
|
|
|
|
O |
|
O O |
-------------------------------------------------+

perf-sched.total_wait_and_delay.average.ms

-------------------------------------------------+
O |
O |
|
O |
|
|
|
|
|
.......+.......... |
.......+.............+...... ...|
|
|
-------------------------------------------------+

-------------------------------------------------+
O O O |
|
|
|
|
|
|
|
|
|
|
........+............ .......|
+.............+...... |
-------------------------------------------------+

-------------------------------------------------+
|
|
|
|
. |
.. |
. |
.. |
|
+............. ......+.............|
+...... |
|
O O O |
-------------------------------------------------+

-------------------------------------------------+
|
O |
|
|
|
|
O |
O |
|
|
|
|
|
-------------------------------------------------+

vm-scalability.throughput

-------------------------------------------------+
.+...... .... ...|
... .. ...... |
+... |
|
|
|
|
|
|
|
|
O O O |
|
-------------------------------------------------+

vm-scalability.free_time

-------------------------------------------------+
|
|
|
O |
O |
O |
|
|
|
|
|
|
... ......+.............|
-------------------------------------------------+

vm-scalability.median

-------------------------------------------------+
......+.......... |
..+...... ... ......|
..... +...... |
|
|
|
|
|
|
|
|
O O O |
|
-------------------------------------------------+

.ms.pipe_write.new_sync_write.vfs_write.ksys_write

-------------------------------------------------+
...+........... |
... |
+.. .|
. .. |
. . |
.. . |
. .. |
.. . |
. . |
. . |
.. .. |
. |
+ |
-------------------------------------------------+

vg.ms.pipe_write.new_sync_write.vfs_write.ksys_write

-------------------------------------------------+
+ : |
+ : |
+ : |
+ : |
+ : |
+ : |
+ : |
+ : |
+ : |
+ : |
+ : |
+ : |
+ :|
-------------------------------------------------+

.count.pipe_write.new_sync_write.vfs_write.ksys_write

-------------------------------------------------+
|
|
|
|
|
|
|
|
|
|
|
.......|
+............+.............+............+...... |
-------------------------------------------------+

max.ms.pipe_write.new_sync_write.vfs_write.ksys_write

-------------------------------------------------+
...+........... |
... |
+.. .|
. .. |
. . |
.. . |
. .. |
.. . |
. . |
. . |
.. .. |
. |
+ |
-------------------------------------------------+

avg.ms.pipe_write.new_sync_write.vfs_write.ksys_write

-------------------------------------------------+
.......+...... ... |
+.............|
|
|
|
|
|
|
|
|
|
|
|
-------------------------------------------------+

************************************************************************
threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
estimated based on internal Intel analysis and are provided
purposes only. Any difference in system hardware or software
may affect actual performance.
Open Source Technology Center
org/hyperkitty/list/lkp@lists.01.org">https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Attachments:

(No filename) (70.32 kB)
config-5.12.0-11208-g2d146aa3aa84 (176.50 kB)
job-script (7.78 kB)
job.yaml (5.08 kB)
reproduce (815.59 kB)
Download all attachments

2021-08-11 06:03:05

by Linus Torvalds

[permalink] [raw]

Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Tue, Aug 10, 2021 at 4:59 PM kernel test robot <[email protected]> wrote:
>
> FYI, we noticed a -36.4% regression of vm-scalability.throughput due to commit:
> 2d146aa3aa84 ("mm: memcontrol: switch to rstat")

Hmm. I guess some cost is to be expected, but that's a big regression.

I'm not sure what the code ends up doing, and how relevant this test
is, but Johannes - could you please take a look?

I can't make heads nor tails of the profile. The profile kind of points at this:

> 2.77 ą 12% +27.4 30.19 ą 8% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> 2.86 ą 12% +27.4 30.29 ą 8% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
> 2.77 ą 12% +27.4 30.21 ą 8% perf-profile.children.cycles-pp.lock_page_lruvec_irqsave
> 4.26 ą 10% +28.1 32.32 ą 7% perf-profile.children.cycles-pp.lru_cache_add
> 4.15 ą 10% +28.2 32.33 ą 7% perf-profile.children.cycles-pp.__pagevec_lru_add

and that seems to be from the chain __do_fault -> shmem_fault ->
shmem_getpage_gfp -> lru_cache_add -> __pagevec_lru_add ->
lock_page_lruvec_irqsave -> _raw_spin_lock_irqsave ->
native_queued_spin_lock_slowpath.

That shmem_fault codepath being hot may make sense for sokme VM
scalability test. But it seems to make little sense when I look at the
commit that it bisected to.

We had another report of this commit causing a much more reasonable
small slowdown (-2.8%) back in May.

I'm not sure what's up with this new report. Johannes, does this make
sense to you?

Is it perhaps another "unlucky cache line placement" thing? Or has the
statistics changes just changed the behavior of the test?

Anybody?

Linus

2021-08-11 20:13:53

by Johannes Weiner

[permalink] [raw]

Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

2021-08-12 03:32:42

by Feng Tang

[permalink] [raw]

Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Tue, Aug 10, 2021 at 07:59:53PM -1000, Linus Torvalds wrote:
> On Tue, Aug 10, 2021 at 4:59 PM kernel test robot <[email protected]> wrote:
> >
> > FYI, we noticed a -36.4% regression of vm-scalability.throughput due to commit:
> > 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
>
> Hmm. I guess some cost is to be expected, but that's a big regression.
>
> I'm not sure what the code ends up doing, and how relevant this test
> is, but Johannes - could you please take a look?
>
> I can't make heads nor tails of the profile. The profile kind of points at this:
>
> > 2.77 ą 12% +27.4 30.19 ą 8% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> > 2.86 ą 12% +27.4 30.29 ą 8% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
> > 2.77 ą 12% +27.4 30.21 ą 8% perf-profile.children.cycles-pp.lock_page_lruvec_irqsave
> > 4.26 ą 10% +28.1 32.32 ą 7% perf-profile.children.cycles-pp.lru_cache_add
> > 4.15 ą 10% +28.2 32.33 ą 7% perf-profile.children.cycles-pp.__pagevec_lru_add
>
> and that seems to be from the chain __do_fault -> shmem_fault ->
> shmem_getpage_gfp -> lru_cache_add -> __pagevec_lru_add ->
> lock_page_lruvec_irqsave -> _raw_spin_lock_irqsave ->
> native_queued_spin_lock_slowpath.
>
> That shmem_fault codepath being hot may make sense for sokme VM
> scalability test. But it seems to make little sense when I look at the
> commit that it bisected to.
>
> We had another report of this commit causing a much more reasonable
> small slowdown (-2.8%) back in May.
>
> I'm not sure what's up with this new report. Johannes, does this make
> sense to you?
>
> Is it perhaps another "unlucky cache line placement" thing? Or has the
> statistics changes just changed the behavior of the test?

Yes, this is probably related with cache line.

We just used perf-c2c tool profile the data for 2d146aa3aa and its
parent commit. There is very few false sharing for the parent commit,
while there is some for 2d146aa3aa, one hottest spot is:

#
# ----- HITM ----- -- Store Refs -- --------- Data address --------- ---------- cycles ---------- Total cpu Shared
# Num RmtHitm LclHitm L1 Hit L1 Miss Offset Node PA cnt Code address rmt hitm lcl hitm load records cnt Symbol Object Source:Line Node
# ..... ....... ....... ....... ....... .................. .... ...... .................. ........ ........ ........ ....... ........ .............................. ................. ..................... ....
#
-------------------------------------------------------------
0 0 2036 0 0 0xffff8881c0642000
-------------------------------------------------------------
0.00% 45.58% 0.00% 0.00% 0x0 0 1 0xffffffff8137071c 0 2877 3221 8969 191 [k] __mod_memcg_state [kernel.kallsyms] memcontrol.c:772 0 1 2 3
0.00% 20.92% 0.00% 0.00% 0x0 0 1 0xffffffff8137091c 0 3027 2841 6626 188 [k] __count_memcg_events [kernel.kallsyms] memcontrol.c:920 0 1 2 3
0.00% 17.88% 0.00% 0.00% 0x10 0 1 0xffffffff8136d7ad 0 3047 3326 3820 187 [k] get_mem_cgroup_from_mm [kernel.kallsyms] percpu-refcount.h:174 0 1 2 3
0.00% 8.94% 0.00% 0.00% 0x10 0 1 0xffffffff81375374 0 3192 3041 2067 187 [k] mem_cgroup_charge [kernel.kallsyms] percpu-refcount.h:174 0 1 2 3

And seems there is some cache false sharing when accessing mem_cgroup
member: 'struct cgroup_subsys_state', from the offset (0x0 and 0x10 here)
and the calling sites, the cache false sharing could happen between:

cgroup_rstat_updated (read memcg->css.cgroup, offset 0x0)
and
get_mem_cgroup_from_mm
css_tryget(&memcg->css) (read/write memcg->css.refcnt, offset 0x10)

(This could be wrong as many of the functions are inlined, and the
exact calling site isn't shown)

And to verify this, we did a test by adding padding between
memcg->css.cgroup and memcg->css.refcnt to push them into 2
different cache lines, and the performance are partly restored:

dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 73371bf27a8a8ea68df2fbf456b
---------------- --------------------------- ---------------------------
65523232 ± 4% -40.8% 38817332 ± 5% -19.6% 52701654 ± 3% vm-scalability.throughput

0.58 ± 2% +3.1 3.63 +2.3 2.86 ± 4% pp.bt.__count_memcg_events.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
1.15 +3.1 4.20 +2.5 3.68 ± 4% pp.bt.page_add_file_rmap.do_set_pte.finish_fault.do_fault.__handle_mm_fault
0.53 ± 2% +3.1 3.62 +2.5 2.99 ± 2% pp.bt.__mod_memcg_lruvec_state.page_add_file_rmap.do_set_pte.finish_fault.do_fault
1.16 +3.3 4.50 +3.2 4.38 ± 3% pp.bt.__mod_memcg_lruvec_state.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault.__do_fault
0.80 +3.5 4.29 +3.2 3.99 ± 3% pp.bt.__mod_memcg_state.__mod_memcg_lruvec_state.shmem_add_to_page_cache.shmem_getpage_gfp.shmem_fault
0.00 +3.5 3.50 +2.8 2.78 ± 2% pp.bt.__mod_memcg_state.__mod_memcg_lruvec_state.page_add_file_rmap.do_set_pte.finish_fault
52.02 ± 3% +13.3 65.29 ± 2% +4.3 56.34 ± 6% pp.bt.__do_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
51.98 ± 3% +13.3 65.27 ± 2% +4.3 56.31 ± 6% pp.bt.shmem_fault.__do_fault.do_fault.__handle_mm_fault.handle_mm_fault
51.87 ± 3% +13.3 65.21 ± 2% +4.3 56.22 ± 6% pp.bt.shmem_getpage_gfp.shmem_fault.__do_fault.do_fault.__handle_mm_fault
56.75 ± 3% +15.0 71.78 ± 2% +6.3 63.09 ± 5% pp.bt.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
1.97 ± 3% +33.9 35.87 ± 5% +19.8 21.79 ± 23% pp.bt._raw_spin_lock_irqsave.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp
1.98 ± 3% +33.9 35.89 ± 5% +19.8 21.80 ± 23% pp.bt.lock_page_lruvec_irqsave.__pagevec_lru_add.lru_cache_add.shmem_getpage_gfp.shmem_fault

We are still checking more, and will update if there is new data.

Btw, the test platform is a 2 sockets, 4 nodes, 96C/192T Cascadelake AP,
and if run the same case on 2S/2Nodes/48C/96T Cascade Lake SP box, the
regression is about -22.3%.

Thanks,
Feng

> Anybody?
>
> Linus

2021-08-16 03:29:50

by Feng Tang

[permalink] [raw]

Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Thu, Aug 12, 2021 at 11:19:10AM +0800, Feng Tang wrote:
> On Tue, Aug 10, 2021 at 07:59:53PM -1000, Linus Torvalds wrote:
[SNIP]

> And seems there is some cache false sharing when accessing mem_cgroup
> member: 'struct cgroup_subsys_state', from the offset (0x0 and 0x10 here)
> and the calling sites, the cache false sharing could happen between:
>
> cgroup_rstat_updated (read memcg->css.cgroup, offset 0x0)
> and
> get_mem_cgroup_from_mm
> css_tryget(&memcg->css) (read/write memcg->css.refcnt, offset 0x10)
>
> (This could be wrong as many of the functions are inlined, and the
> exact calling site isn't shown)
>
> And to verify this, we did a test by adding padding between
> memcg->css.cgroup and memcg->css.refcnt to push them into 2
> different cache lines, and the performance are partly restored:
>
> dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 73371bf27a8a8ea68df2fbf456b
> ---------------- --------------------------- ---------------------------
> 65523232 ± 4% -40.8% 38817332 ± 5% -19.6% 52701654 ± 3% vm-scalability.throughput
>
> We are still checking more, and will update if there is new data.

Seems this is the second case to hit 'adjacent cacheline prefetch",
the first time we saw it is also related with mem_cgroup
https://lore.kernel.org/lkml/[email protected]/

In previous debug patch, the 'css.cgroup' and 'css.refcnt' is
separated to 2 cache lines, which are still adjacent (2N and 2N+1)
cachelines. And with more padding (add 128 bytes padding in between),
the performance is restored, and even better (test run 3 times):

dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 2e34d6daf5fbab0fb286dcdb3bc
---------------- --------------------------- ---------------------------
65523232 ± 4% -40.8% 38817332 ± 5% +23.4% 80862243 ± 3% vm-scalability.throughput

The debug patch is:
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -142,6 +142,8 @@ struct cgroup_subsys_state {
/* PI: the cgroup subsystem that this css is attached to */
struct cgroup_subsys *ss;

+ unsigned long pad[16];
+
/* reference count - access via css_[try]get() and css_put() */
struct percpu_ref refcnt;

Thanks,
Feng

> Btw, the test platform is a 2 sockets, 4 nodes, 96C/192T Cascadelake AP,
> and if run the same case on 2S/2Nodes/48C/96T Cascade Lake SP box, the
> regression is about -22.3%.
>
> Thanks,
> Feng
>
> > Anybody?
> >
> > Linus

2021-08-16 21:44:44

by Johannes Weiner

[permalink] [raw]

Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

On Mon, Aug 16, 2021 at 11:28:55AM +0800, Feng Tang wrote:
> On Thu, Aug 12, 2021 at 11:19:10AM +0800, Feng Tang wrote:
> > On Tue, Aug 10, 2021 at 07:59:53PM -1000, Linus Torvalds wrote:
> [SNIP]
>
> > And seems there is some cache false sharing when accessing mem_cgroup
> > member: 'struct cgroup_subsys_state', from the offset (0x0 and 0x10 here)
> > and the calling sites, the cache false sharing could happen between:
> >
> > cgroup_rstat_updated (read memcg->css.cgroup, offset 0x0)
> > and
> > get_mem_cgroup_from_mm
> > css_tryget(&memcg->css) (read/write memcg->css.refcnt, offset 0x10)
> >
> > (This could be wrong as many of the functions are inlined, and the
> > exact calling site isn't shown)

Thanks for digging more into this.

The offset 0x0 access is new in the page instantiation path with the
bisected patch, so that part makes sense. The new sequence is this:

shmem_add_to_page_cache()
mem_cgroup_charge()
get_mem_cgroup_from_mm()
css_tryget() # touches memcg->css.refcnt
xas_store()
__mod_lruvec_page_state()
__mod_lruvec_state()
__mod_memcg_lruvec_state()
__mod_memcg_state()
__this_cpu_add()
cgroup_rstat_updated() # touches memcg->css.cgroup

whereas before, __mod_memcg_state() would just do stuff inside memcg.

However, css.refcnt is a percpu-refcount. We should see a read-only
lookup of the base pointer inside this cacheline, with the write
occuring in percpu memory elsewhere. Even if it were in atomic/shared
mode, which it shouldn't be for the root cgroup, the shared atomic_t
is also located in an auxiliary allocation and shouldn't overlap with
the cgroup pointer in any way.

The css itself is embedded inside struct mem_cgroup, which does see
modifications. But the closest of those is 3 cachelines down (struct
page_counter memory), so that doesn't make sense, either.

Does this theory require writes? Because I don't actually see any (hot
ones, anyway) inside struct cgroup_subsys_state for this workload.

> > And to verify this, we did a test by adding padding between
> > memcg->css.cgroup and memcg->css.refcnt to push them into 2
> > different cache lines, and the performance are partly restored:
> >
> > dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 73371bf27a8a8ea68df2fbf456b
> > ---------------- --------------------------- ---------------------------
> > 65523232 ? 4% -40.8% 38817332 ? 5% -19.6% 52701654 ? 3% vm-scalability.throughput
> >
> > We are still checking more, and will update if there is new data.
>
> Seems this is the second case to hit 'adjacent cacheline prefetch",
> the first time we saw it is also related with mem_cgroup
> https://lore.kernel.org/lkml/[email protected]/
>
> In previous debug patch, the 'css.cgroup' and 'css.refcnt' is
> separated to 2 cache lines, which are still adjacent (2N and 2N+1)
> cachelines. And with more padding (add 128 bytes padding in between),
> the performance is restored, and even better (test run 3 times):
>
> dc26532aed0ab25c 2d146aa3aa842d7f5065802556b 2e34d6daf5fbab0fb286dcdb3bc
> ---------------- --------------------------- ---------------------------
> 65523232 ? 4% -40.8% 38817332 ? 5% +23.4% 80862243 ? 3% vm-scalability.throughput
>
> The debug patch is:
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -142,6 +142,8 @@ struct cgroup_subsys_state {
> /* PI: the cgroup subsystem that this css is attached to */
> struct cgroup_subsys *ss;
>
> + unsigned long pad[16];
> +
> /* reference count - access via css_[try]get() and css_put() */
> struct percpu_ref refcnt;

We aren't particularly space-constrained in this structure, so padding
should generally be acceptable here.

2021-08-17 02:46:48

On Wed, Sep 01, 2021 at 07:23:24PM -0700, Andi Kleen wrote:
>
> On 9/1/2021 6:35 PM, Feng Tang wrote:
> >On Wed, Sep 01, 2021 at 08:12:24AM -0700, Andi Kleen wrote:
> >>Feng Tang <[email protected]> writes:
> >>>Yes, the tests I did is no matter where the 128B padding is added, the
> >>>performance can be restored and even improved.
> >>I wonder if we can find some cold, rarely accessed, data to put into the
> >>padding to not waste it. Perhaps some name strings? Or the destroy
> >>support, which doesn't sound like its commonly used.
> >Yes, I tried to move 'destroy_work', 'destroy_rwork' and 'parent' over
> >before the 'refcnt' together with some padding, it restored the performance
> >to about 10~15% regression. (debug patch pasted below)
> >
> >But I'm not sure if we should use it, before we can fully explain the
> >regression.
>
> Narrowing it down to a single prefetcher seems good enough to me. The
> behavior of the prefetchers is fairly complicated and hard to predict, so I
> doubt you'll ever get a 100% step by step explanation.

Yes, I'm afriad so, given that the policy/algorithm used by perfetcher
keeps changing from generation to generation.

I will test the patch more with other benchmarks.

Thanks,
Feng

>
> -Andi
>

2021-09-02 23:23:31

by Michal Koutný

[permalink] [raw]

Subject: Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression

Hi.

On Thu, Sep 02, 2021 at 11:46:28AM +0800, Feng Tang <[email protected]> wrote:
> > Narrowing it down to a single prefetcher seems good enough to me. The
> > behavior of the prefetchers is fairly complicated and hard to predict, so I
> > doubt you'll ever get a 100% step by step explanation.

My layman explanation with the available information is that the
prefetcher somehow behaves as if it marked the offending cacheline as
modified (even though reading only) therefore slowing down the remote reader.

On Thu, Sep 02, 2021 at 09:35:58AM +0800, Feng Tang <[email protected]> wrote:
> @@ -139,10 +139,21 @@ struct cgroup_subsys_state {
> /* PI: the cgroup that this css is attached to */
> struct cgroup *cgroup;
>
> + struct cgroup_subsys_state *parent;
> +
> /* PI: the cgroup subsystem that this css is attached to */
> struct cgroup_subsys *ss;

Hm, an interesting move; be mindful of commit b8b1a2e5eca6 ("cgroup:
move cgroup_subsys_state parent field for cache locality"). It might be
a regression for systems with cpuacct root css present. (That is likely
a big amount nowadays, that may be the reason why you don't see full
recovery? For future, we may at least guard cpuacct_charge() with
cgroup_subsys_enabled() static branch.)

> [snip]
> Yes, I'm afriad so, given that the policy/algorithm used by perfetcher
> keeps changing from generation to generation.

Exactly. I'm afraid of relayouting the structure with each new
generation. A robust solution is putting all frequently accessed members
into individual cache-lines + separating them with one more cache line? :-/

Michal