2022-04-21 08:51:09

by kernel test robot

Subject: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression


(Please note that we previously reported
"[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
at
https://lore.kernel.org/all/[email protected]/
while the commit was still on a development branch.
We still observe a similar regression now that the commit is on mainline, and we also
observe a 13.2% improvement on another netperf subtest,
so we are reporting again for information.)

Greetings,

FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:


commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: netperf
on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
with the following parameters:

ip: ipv4
runtime: 300s
nr_threads: 1
cluster: cs-localhost
test: UDP_STREAM
cpufreq_governor: performance
ucode: 0xd000331

test-description: Netperf is a benchmark that can be used to measure various aspects of networking performance.
test-url: http://www.netperf.org/netperf/

In addition, the commit also has a significant impact on the following tests:

+------------------+-------------------------------------------------------------------------------------+
| testcase: change | netperf: netperf.Throughput_Mbps 13.2% improvement |
| test machine | 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory |
| test parameters | cluster=cs-localhost |
| | cpufreq_governor=performance |
| | ip=ipv4 |
| | nr_threads=25% |
| | runtime=300s |
| | send_size=10K |
| | test=SCTP_STREAM_MANY |
| | ucode=0xd000331 |
+------------------+-------------------------------------------------------------------------------------+
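
The nr_threads=25% parameter in the table above is relative to the size of the test machine. As a rough cross-check, assuming the percentage is expanded against the 128 hardware threads (an assumption, not stated in this report), the job would run 32 netperf instances. That matches the ratio of netperf.Throughput_total_Mbps to netperf.Throughput_Mbps in the SCTP_STREAM_MANY results further down. The function name below is hypothetical and only serves the illustration.

```python
def instances_from_percent(nr_threads: str, nr_cpus: int) -> int:
    """Expand an nr_threads value such as '25%' or '1' into an instance count.

    Assumption: a percentage is taken relative to the number of CPUs.
    """
    if nr_threads.endswith("%"):
        return max(1, nr_cpus * int(nr_threads[:-1]) // 100)
    return int(nr_threads)


# Values copied from this report (SCTP_STREAM_MANY, parent commit 8b10b465d0).
per_instance_mbps = 14785   # netperf.Throughput_Mbps
total_mbps = 473143         # netperf.Throughput_total_Mbps

instances = instances_from_percent("25%", 128)
ratio = total_mbps / per_instance_mbps

print(instances)     # 32
print(round(ratio))  # 32
```

For the UDP_STREAM case (nr_threads: 1) the same helper returns 1, which is consistent with the per-instance and total throughput being identical in that table.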


If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if you come across any failure that blocks the test,
# please remove the ~/.lkp and /lkp directories to run from a clean state.

=========================================================================================
cluster/compiler/cpufreq_governor/ip/kconfig/nr_threads/rootfs/runtime/tbox_group/test/testcase/ucode:
cs-localhost/gcc-11/performance/ipv4/x86_64-rhel-8.3/1/debian-10.4-x86_64-20200603.cgz/300s/lkp-icl-2sp4/UDP_STREAM/netperf/0xd000331
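
The two lines above form a key/value pair: the first lists the parameter names separated by "/", the second lists the matching values in the same order. A minimal sketch for turning them into a dictionary follows, assuming no value contains a "/" itself (true for both parameter lines in this report). The helper name is hypothetical.

```python
def job_params(keys_line: str, values_line: str) -> dict:
    """Pair '/'-separated parameter names with their '/'-separated values."""
    keys = keys_line.strip().rstrip(":").split("/")
    values = values_line.strip().split("/")
    if len(keys) != len(values):
        raise ValueError("key and value counts differ")
    return dict(zip(keys, values))


# Both lines copied verbatim from this report.
keys_line = ("cluster/compiler/cpufreq_governor/ip/kconfig/nr_threads/"
             "rootfs/runtime/tbox_group/test/testcase/ucode:")
values_line = ("cs-localhost/gcc-11/performance/ipv4/x86_64-rhel-8.3/1/"
               "debian-10.4-x86_64-20200603.cgz/300s/lkp-icl-2sp4/"
               "UDP_STREAM/netperf/0xd000331")

params = job_params(keys_line, values_line)
print(params["test"])        # UDP_STREAM
print(params["nr_threads"])  # 1
print(params["tbox_group"])  # lkp-icl-2sp4
```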

commit:
8b10b465d0 ("mm/page_alloc: free pages in a single pass during bulk free")
f26b3fa046 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")

8b10b465d0e18b00 f26b3fa046116a7dedcaafe3008
---------------- ---------------------------
%stddev %change %stddev
\ | \
120956 ± 2% -18.0% 99177 netperf.Throughput_Mbps
120956 ± 2% -18.0% 99177 netperf.Throughput_total_Mbps
90.83 -2.0% 89.00 netperf.time.percent_of_cpu_this_job_got
69242552 ± 2% -18.0% 56775058 netperf.workload
29460 ± 2% +25.7% 37044 meminfo.Shmem
96933 ±198% +9094.3% 8912386 ± 7% turbostat.POLL
1746 ± 2% +6694.6% 118678 ± 3% vmstat.system.cs
293357 ± 7% -21.2% 231238 ± 17% sched_debug.cfs_rq:/.min_vruntime.max
269394 ± 8% -23.6% 205870 ± 17% sched_debug.cfs_rq:/.spread0.max
239945 ± 64% -99.5% 1108 ± 2% sched_debug.cpu.avg_idle.min
122694 ± 18% +26.2% 154895 ± 6% sched_debug.cpu.avg_idle.stddev
4705 ± 2% +2916.4% 141948 ± 3% sched_debug.cpu.nr_switches.avg
65447 ± 3% +9997.6% 6608655 ± 13% sched_debug.cpu.nr_switches.max
8178 ± 3% +9737.7% 804544 ± 5% sched_debug.cpu.nr_switches.stddev
250093 ± 8% +15.0% 287675 ± 7% perf-stat.i.cache-misses
1674 ± 2% +7043.4% 119598 ± 3% perf-stat.i.context-switches
3127 +1.8% 3183 perf-stat.i.minor-faults
7495 ± 24% +76.9% 13260 ± 6% perf-stat.i.node-loads
3128 +1.8% 3184 perf-stat.i.page-faults
0.05 ± 7% +0.0 0.06 ± 11% perf-stat.overall.cache-miss-rate%
45827 ± 6% -13.7% 39529 ± 10% perf-stat.overall.cycles-between-cache-misses
87.75 ± 3% -7.5 80.29 ± 2% perf-stat.overall.node-load-miss-rate%
18242 ± 5% +22.8% 22395 ± 2% perf-stat.overall.path-length
249180 ± 8% +15.0% 286678 ± 7% perf-stat.ps.cache-misses
1668 ± 2% +7044.3% 119200 ± 3% perf-stat.ps.context-switches
3114 +1.8% 3170 perf-stat.ps.minor-faults
7465 ± 24% +77.0% 13213 ± 6% perf-stat.ps.node-loads
3115 +1.8% 3171 perf-stat.ps.page-faults
2640 ± 3% -3.7% 2541 proc-vmstat.nr_active_anon
71813 +2.8% 73854 proc-vmstat.nr_inactive_anon
9669 +2.7% 9930 proc-vmstat.nr_mapped
7368 ± 2% +25.7% 9262 proc-vmstat.nr_shmem
2640 ± 3% -3.7% 2541 proc-vmstat.nr_zone_active_anon
71813 +2.8% 73854 proc-vmstat.nr_zone_inactive_anon
419.83 ±190% +1461.8% 6556 ± 15% proc-vmstat.numa_hint_faults
380.83 ±212% +1374.0% 5613 ± 3% proc-vmstat.numa_hint_faults_local
1.336e+08 -13.8% 1.152e+08 ± 2% proc-vmstat.numa_hit
1.337e+08 -13.6% 1.156e+08 ± 2% proc-vmstat.numa_local
8502 ± 97% +311.4% 34976 ± 9% proc-vmstat.numa_pte_updates
7931 +1121.7% 96900 ± 4% proc-vmstat.pgactivate
1.33e+08 -14.0% 1.144e+08 proc-vmstat.pgalloc_normal
1060109 +1.2% 1073035 proc-vmstat.pgfault
1.33e+08 -14.0% 1.144e+08 proc-vmstat.pgfree
1.26 ± 19% +0.6 1.81 ± 17% perf-profile.calltrace.cycles-pp.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core
1.18 ± 19% +0.6 1.79 ± 17% perf-profile.calltrace.cycles-pp.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish
0.00 +0.7 0.69 ± 21% perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.sock_def_readable
0.00 +0.7 0.70 ± 21% perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.sock_def_readable.__udp_enqueue_schedule_skb
0.00 +0.7 0.71 ± 16% perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page.skb_release_data.__consume_stateless_skb.udp_recvmsg
0.00 +0.8 0.82 ± 22% perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.sock_def_readable.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb
0.00 +0.8 0.83 ± 23% perf-profile.calltrace.cycles-pp.__schedule.schedule.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp
0.00 +0.8 0.84 ± 22% perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.secondary_startup_64_no_verify
0.00 +0.9 0.86 ± 23% perf-profile.calltrace.cycles-pp.schedule.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
0.00 +0.9 0.87 ± 22% perf-profile.calltrace.cycles-pp.__wake_up_common_lock.sock_def_readable.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb
0.00 +0.9 0.87 ± 22% perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.secondary_startup_64_no_verify
0.00 +0.9 0.88 ± 23% perf-profile.calltrace.cycles-pp.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg.inet_recvmsg
0.00 +1.0 0.97 ± 24% perf-profile.calltrace.cycles-pp.sock_def_readable.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv
0.20 ±142% +1.1 1.33 ± 19% perf-profile.calltrace.cycles-pp.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu
0.00 +1.2 1.19 ± 21% perf-profile.calltrace.cycles-pp.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg.inet_recvmsg.__sys_recvfrom
0.42 ± 71% +1.6 2.06 ± 20% perf-profile.calltrace.cycles-pp.__skb_recv_udp.udp_recvmsg.inet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom
0.41 ± 19% -0.3 0.15 ± 21% perf-profile.children.cycles-pp.udp_rmem_release
0.65 ± 16% -0.2 0.45 ± 14% perf-profile.children.cycles-pp.kfree
0.44 ± 14% -0.2 0.26 ± 15% perf-profile.children.cycles-pp.__slab_free
0.58 ± 8% -0.2 0.42 ± 18% perf-profile.children.cycles-pp.free_pcp_prepare
0.17 ± 13% -0.1 0.07 ± 15% perf-profile.children.cycles-pp.free_unref_page_commit
0.24 ± 5% -0.1 0.18 ± 19% perf-profile.children.cycles-pp.kmem_cache_free
0.21 ± 14% -0.1 0.16 ± 17% perf-profile.children.cycles-pp.send_data
0.10 ± 15% +0.0 0.15 ± 8% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.00 +0.1 0.06 ± 13% perf-profile.children.cycles-pp.finish_wait
0.00 +0.1 0.06 ± 13% perf-profile.children.cycles-pp.__update_load_avg_se
0.00 +0.1 0.06 ± 23% perf-profile.children.cycles-pp.ttwu_do_wakeup
0.00 +0.1 0.06 ± 11% perf-profile.children.cycles-pp.switch_mm_irqs_off
0.02 ±141% +0.1 0.08 ± 10% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.10 ± 4% +0.1 0.16 ± 19% perf-profile.children.cycles-pp.__list_add_valid
0.00 +0.1 0.08 ± 14% perf-profile.children.cycles-pp.flush_smp_call_function_queue
0.00 +0.1 0.08 ± 37% perf-profile.children.cycles-pp.nohz_run_idle_balance
0.01 ±223% +0.1 0.09 ± 21% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
0.21 ± 11% +0.1 0.29 ± 16% perf-profile.children.cycles-pp.skb_set_owner_w
0.08 ± 11% +0.1 0.17 ± 23% perf-profile.children.cycles-pp.memcg_slab_free_hook
0.00 +0.1 0.08 ± 36% perf-profile.children.cycles-pp.tick_nohz_idle_enter
0.00 +0.1 0.08 ± 19% perf-profile.children.cycles-pp.prepare_task_switch
0.00 +0.1 0.10 ± 34% perf-profile.children.cycles-pp.prepare_to_wait_exclusive
0.07 ± 48% +0.1 0.20 ± 20% perf-profile.children.cycles-pp._raw_spin_lock_bh
0.73 ± 6% +0.1 0.86 ± 9% perf-profile.children.cycles-pp._raw_spin_lock
0.00 +0.1 0.14 ± 34% perf-profile.children.cycles-pp.set_next_entity
0.07 ± 18% +0.2 0.22 ± 26% perf-profile.children.cycles-pp.__zone_watermark_ok
0.00 +0.2 0.17 ± 11% perf-profile.children.cycles-pp.enqueue_entity
0.00 +0.2 0.17 ± 29% perf-profile.children.cycles-pp.update_load_avg
0.00 +0.2 0.18 ± 21% perf-profile.children.cycles-pp.__switch_to
0.00 +0.2 0.19 ± 22% perf-profile.children.cycles-pp.sched_ttwu_pending
0.00 +0.2 0.19 ± 21% perf-profile.children.cycles-pp.ttwu_queue_wakelist
0.27 ± 34% +0.2 0.47 ± 15% perf-profile.children.cycles-pp.update_rq_clock
0.00 +0.2 0.19 ± 20% perf-profile.children.cycles-pp.update_curr
0.00 +0.2 0.21 ± 7% perf-profile.children.cycles-pp.enqueue_task_fair
0.00 +0.2 0.22 ± 6% perf-profile.children.cycles-pp.ttwu_do_activate
0.00 +0.2 0.24 ± 25% perf-profile.children.cycles-pp.pick_next_task_fair
0.28 ± 14% +0.2 0.52 ± 8% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.00 +0.3 0.26 ± 17% perf-profile.children.cycles-pp.__sysvec_call_function_single
0.00 +0.3 0.32 ± 17% perf-profile.children.cycles-pp.sysvec_call_function_single
0.39 ± 14% +0.3 0.72 ± 16% perf-profile.children.cycles-pp.free_pcppages_bulk
0.00 +0.3 0.33 ± 26% perf-profile.children.cycles-pp.dequeue_entity
0.00 +0.4 0.37 ± 24% perf-profile.children.cycles-pp.dequeue_task_fair
0.00 +0.4 0.41 ± 25% perf-profile.children.cycles-pp.finish_task_switch
0.00 +0.5 0.50 ± 19% perf-profile.children.cycles-pp.asm_sysvec_call_function_single
1.26 ± 19% +0.6 1.81 ± 17% perf-profile.children.cycles-pp.udp_unicast_rcv_skb
1.19 ± 19% +0.6 1.80 ± 17% perf-profile.children.cycles-pp.udp_queue_rcv_one_skb
0.00 +0.7 0.71 ± 21% perf-profile.children.cycles-pp.autoremove_wake_function
0.00 +0.7 0.71 ± 20% perf-profile.children.cycles-pp.try_to_wake_up
0.00 +0.8 0.83 ± 21% perf-profile.children.cycles-pp.__wake_up_common
0.49 ± 17% +0.8 1.34 ± 19% perf-profile.children.cycles-pp.__udp_enqueue_schedule_skb
0.00 +0.9 0.87 ± 22% perf-profile.children.cycles-pp.schedule_idle
0.00 +0.9 0.88 ± 22% perf-profile.children.cycles-pp.__wake_up_common_lock
0.00 +0.9 0.89 ± 23% perf-profile.children.cycles-pp.schedule_timeout
0.01 ±223% +0.9 0.90 ± 22% perf-profile.children.cycles-pp.schedule
0.03 ±100% +0.9 0.97 ± 24% perf-profile.children.cycles-pp.sock_def_readable
0.00 +1.2 1.19 ± 21% perf-profile.children.cycles-pp.__skb_wait_for_more_packets
0.57 ± 19% +1.5 2.08 ± 20% perf-profile.children.cycles-pp.__skb_recv_udp
0.05 ± 47% +1.7 1.73 ± 22% perf-profile.children.cycles-pp.__schedule
0.24 ± 21% -0.2 0.03 ±100% perf-profile.self.cycles-pp.udp_rmem_release
0.44 ± 14% -0.2 0.26 ± 16% perf-profile.self.cycles-pp.__slab_free
0.58 ± 8% -0.2 0.42 ± 18% perf-profile.self.cycles-pp.free_pcp_prepare
0.29 ± 19% -0.1 0.16 ± 10% perf-profile.self.cycles-pp.kfree
0.28 ± 17% -0.1 0.18 ± 18% perf-profile.self.cycles-pp.udp_recvmsg
0.13 ± 13% -0.1 0.05 ± 45% perf-profile.self.cycles-pp.free_unref_page_commit
0.24 ± 21% -0.1 0.17 ± 17% perf-profile.self.cycles-pp.send_omni_inner
0.08 ± 10% -0.0 0.02 ± 99% perf-profile.self.cycles-pp.kmem_cache_free
0.10 ± 19% -0.0 0.06 ± 52% perf-profile.self.cycles-pp.__dev_queue_xmit
0.12 ± 15% -0.0 0.09 ± 18% perf-profile.self.cycles-pp.__cgroup_bpf_run_filter_skb
0.06 ± 9% +0.1 0.12 ± 18% perf-profile.self.cycles-pp.free_pcppages_bulk
0.00 +0.1 0.06 ± 14% perf-profile.self.cycles-pp.flush_smp_call_function_queue
0.03 ±100% +0.1 0.10 ± 42% perf-profile.self.cycles-pp.sock_def_readable
0.08 ± 8% +0.1 0.15 ± 18% perf-profile.self.cycles-pp.__list_add_valid
0.00 +0.1 0.08 ± 22% perf-profile.self.cycles-pp.update_curr
0.00 +0.1 0.08 ± 20% perf-profile.self.cycles-pp.enqueue_entity
0.01 ±223% +0.1 0.09 ± 20% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
0.21 ± 11% +0.1 0.29 ± 16% perf-profile.self.cycles-pp.skb_set_owner_w
0.08 ± 12% +0.1 0.16 ± 22% perf-profile.self.cycles-pp.memcg_slab_free_hook
0.00 +0.1 0.09 ± 33% perf-profile.self.cycles-pp.set_next_entity
0.00 +0.1 0.12 ± 31% perf-profile.self.cycles-pp.__wake_up_common
0.07 ± 50% +0.1 0.19 ± 21% perf-profile.self.cycles-pp._raw_spin_lock_bh
0.00 +0.1 0.12 ± 16% perf-profile.self.cycles-pp.try_to_wake_up
0.04 ±101% +0.1 0.17 ± 11% perf-profile.self.cycles-pp.update_rq_clock
0.00 +0.2 0.15 ± 17% perf-profile.self.cycles-pp.__skb_wait_for_more_packets
0.00 +0.2 0.15 ± 26% perf-profile.self.cycles-pp.finish_task_switch
0.17 ± 14% +0.2 0.33 ± 24% perf-profile.self.cycles-pp.skb_release_data
0.04 ± 72% +0.2 0.22 ± 26% perf-profile.self.cycles-pp.__zone_watermark_ok
0.00 +0.2 0.18 ± 21% perf-profile.self.cycles-pp.__switch_to
0.26 ± 16% +0.2 0.48 ± 8% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.00 +0.3 0.27 ± 23% perf-profile.self.cycles-pp.__schedule
0.04 ± 71% +0.4 0.45 ± 21% perf-profile.self.cycles-pp.__skb_recv_udp
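
Each row of the comparison table above has the shape: parent-commit value, optional relative stddev, change, tested-commit value, optional relative stddev, metric name. The sketch below parses one such row and recomputes the change column. It rests on an inference from the numbers rather than on anything stated in this report: a change ending in "%" is taken as a relative change, and a change without "%" (used for metrics that are already percentages, such as the perf-profile rows) as an absolute difference. The function names are hypothetical. Because the table prints rounded averages, recomputed values can differ slightly from the printed ones.

```python
import re

NUM = r"\d+(?:\.\d+)?(?:e[+-]?\d+)?"
# Optional stddev field; some archive renderings show the "±" sign as "?".
SD = r"(?:\s*[±?]\s*(\d+)%)?"
ROW = re.compile(
    rf"^\s*({NUM}){SD}\s+([+-]\d+(?:\.\d+)?)(%?)\s+({NUM}){SD}\s+(\S+)\s*$"
)


def parse_row(line):
    """Split one comparison row into its fields, or return None."""
    m = ROW.match(line)
    if m is None:
        return None
    base, _base_sd, change, pct, head, _head_sd, metric = m.groups()
    return {
        "metric": metric,
        "base": float(base),
        "head": float(head),
        "change": float(change),
        "relative": pct == "%",  # True: percent change, False: absolute delta
    }


def recompute_change(row):
    """Recompute the change column from the two printed averages."""
    if row["relative"]:
        return 100.0 * (row["head"] - row["base"]) / row["base"]
    return row["head"] - row["base"]


headline = parse_row("120956 ± 2% -18.0% 99177 netperf.Throughput_Mbps")
print(round(recompute_change(headline), 1))  # -18.0

profile = parse_row(
    "0.39 ± 14% +0.3 0.72 ± 16% perf-profile.children.cycles-pp.free_pcppages_bulk"
)
print(round(recompute_change(profile), 1))   # 0.3
```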


***************************************************************************************************
lkp-icl-2sp4: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
=========================================================================================
cluster/compiler/cpufreq_governor/ip/kconfig/nr_threads/rootfs/runtime/send_size/tbox_group/test/testcase/ucode:
cs-localhost/gcc-11/performance/ipv4/x86_64-rhel-8.3/25%/debian-10.4-x86_64-20200603.cgz/300s/10K/lkp-icl-2sp4/SCTP_STREAM_MANY/netperf/0xd000331

commit:
8b10b465d0 ("mm/page_alloc: free pages in a single pass during bulk free")
f26b3fa046 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")

8b10b465d0e18b00 f26b3fa046116a7dedcaafe3008
---------------- ---------------------------
%stddev %change %stddev
\ | \
14785 ± 2% +13.2% 16740 netperf.Throughput_Mbps
473143 ± 2% +13.2% 535690 netperf.Throughput_total_Mbps
17542 ± 6% +24.5% 21835 ± 2% netperf.time.involuntary_context_switches
1342 ± 3% +19.2% 1600 netperf.time.percent_of_cpu_this_job_got
3935 ± 3% +19.4% 4698 netperf.time.system_time
110.45 ± 3% +15.2% 127.26 netperf.time.user_time
199875 ± 5% -36.6% 126767 ± 9% netperf.time.voluntary_context_switches
1.733e+09 ± 2% +13.2% 1.962e+09 netperf.workload
3.48 ± 3% +0.5 3.94 mpstat.cpu.all.soft%
16.79 ± 3% +3.4 20.23 mpstat.cpu.all.sys%
0.68 ± 3% +0.1 0.78 mpstat.cpu.all.usr%
27.83 ± 5% +22.8% 34.17 ± 3% vmstat.procs.r
3208349 ± 2% +12.1% 3596993 vmstat.system.cs
263494 +3.6% 273026 vmstat.system.in
1.101e+09 ± 6% +16.4% 1.282e+09 ± 3% numa-numastat.node0.local_node
1.1e+09 ± 6% +16.4% 1.28e+09 ± 3% numa-numastat.node0.numa_hit
1.151e+09 ± 4% +10.2% 1.269e+09 ± 2% numa-numastat.node1.local_node
1.149e+09 ± 4% +10.2% 1.265e+09 ± 2% numa-numastat.node1.numa_hit
1.1e+09 ± 6% +16.4% 1.28e+09 ± 3% numa-vmstat.node0.numa_hit
1.101e+09 ± 6% +16.4% 1.282e+09 ± 3% numa-vmstat.node0.numa_local
1.149e+09 ± 4% +10.2% 1.265e+09 ± 2% numa-vmstat.node1.numa_hit
1.151e+09 ± 4% +10.2% 1.269e+09 ± 2% numa-vmstat.node1.numa_local
953763 ± 18% +33.6% 1273973 ± 8% meminfo.Active
953603 ± 18% +33.6% 1273684 ± 8% meminfo.Active(anon)
1450710 ± 13% +23.9% 1797564 ± 6% meminfo.Committed_AS
484102 ± 18% +32.7% 642218 ± 9% meminfo.Mapped
983413 ± 18% +34.8% 1326115 ± 8% meminfo.Shmem
812.50 ± 2% +16.5% 946.17 turbostat.Avg_MHz
24.64 ± 2% +4.1 28.73 turbostat.Busy%
4.704e+08 ± 2% +11.0% 5.219e+08 turbostat.C1
5.57 ± 2% +0.7 6.26 turbostat.C1%
0.37 ± 10% -16.1% 0.31 turbostat.IPC
1004055 ± 2% +11.4% 1118247 turbostat.POLL
0.02 +0.0 0.03 turbostat.POLL%
416.33 ± 4% +7.5% 447.50 turbostat.PkgWatt
238335 ± 18% +33.4% 317828 ± 8% proc-vmstat.nr_active_anon
811128 ± 5% +10.5% 896520 ± 3% proc-vmstat.nr_file_pages
80808 ± 2% +7.4% 86814 ± 3% proc-vmstat.nr_inactive_anon
120937 ± 19% +34.1% 162233 ± 9% proc-vmstat.nr_mapped
1938 ± 2% +4.7% 2029 ± 2% proc-vmstat.nr_page_table_pages
245826 ± 18% +34.6% 330998 ± 8% proc-vmstat.nr_shmem
238335 ± 18% +33.4% 317828 ± 8% proc-vmstat.nr_zone_active_anon
80808 ± 2% +7.4% 86814 ± 3% proc-vmstat.nr_zone_inactive_anon
2.248e+09 ± 2% +13.2% 2.545e+09 proc-vmstat.numa_hit
2.253e+09 ± 2% +13.2% 2.551e+09 proc-vmstat.numa_local
260577 ± 15% +33.6% 348260 ± 11% proc-vmstat.pgactivate
5.944e+09 ± 2% +13.2% 6.73e+09 proc-vmstat.pgalloc_normal
1579108 ± 2% +3.8% 1638994 proc-vmstat.pgfault
5.944e+09 ± 2% +13.2% 6.73e+09 proc-vmstat.pgfree
850785 ± 19% +64.9% 1403095 ± 8% sched_debug.cfs_rq:/.MIN_vruntime.max
110144 ± 17% +78.2% 196314 ± 20% sched_debug.cfs_rq:/.MIN_vruntime.stddev
0.26 ± 15% +25.6% 0.33 ± 12% sched_debug.cfs_rq:/.h_nr_running.avg
36930 ± 6% -19.2% 29847 ± 4% sched_debug.cfs_rq:/.load.max
13805 ± 8% -14.6% 11792 ± 3% sched_debug.cfs_rq:/.load.stddev
850785 ± 19% +64.9% 1403095 ± 8% sched_debug.cfs_rq:/.max_vruntime.max
110144 ± 17% +78.2% 196314 ± 20% sched_debug.cfs_rq:/.max_vruntime.stddev
803157 ± 9% +44.1% 1157345 ± 10% sched_debug.cfs_rq:/.min_vruntime.avg
1328522 ± 10% +31.4% 1746141 ± 9% sched_debug.cfs_rq:/.min_vruntime.max
349319 ± 17% +85.6% 648499 ± 17% sched_debug.cfs_rq:/.min_vruntime.min
209093 ± 8% +17.5% 245777 ± 8% sched_debug.cfs_rq:/.min_vruntime.stddev
279.98 ± 11% +24.1% 347.54 ± 7% sched_debug.cfs_rq:/.runnable_avg.avg
209084 ± 8% +17.5% 245769 ± 8% sched_debug.cfs_rq:/.spread0.stddev
279.78 ± 11% +24.2% 347.36 ± 7% sched_debug.cfs_rq:/.util_avg.avg
183.66 ± 15% +29.2% 237.36 ± 10% sched_debug.cfs_rq:/.util_est_enqueued.avg
1276 ± 11% +19.6% 1526 ± 5% sched_debug.cpu.curr->pid.avg
0.21 ± 10% +18.4% 0.25 ± 4% sched_debug.cpu.nr_running.avg
26.69 -0.8% 26.49 perf-stat.i.MPKI
1.96e+10 ± 2% +13.0% 2.215e+10 perf-stat.i.branch-instructions
1.257e+08 ± 2% +13.2% 1.423e+08 ± 2% perf-stat.i.branch-misses
2.672e+09 ± 2% +12.2% 2.997e+09 perf-stat.i.cache-references
3236739 ± 2% +12.2% 3630735 perf-stat.i.context-switches
1.10 +2.9% 1.13 perf-stat.i.cpi
1.099e+11 ± 2% +16.3% 1.278e+11 perf-stat.i.cpu-cycles
216.09 ± 3% +9.1% 235.65 ± 2% perf-stat.i.cpu-migrations
2.893e+10 ± 2% +13.1% 3.271e+10 perf-stat.i.dTLB-loads
0.01 ± 7% -0.0 0.00 ± 38% perf-stat.i.dTLB-store-miss-rate%
1218982 ± 5% -66.4% 409240 ± 39% perf-stat.i.dTLB-store-misses
1.715e+10 ± 2% +13.1% 1.939e+10 perf-stat.i.dTLB-stores
1.002e+11 ± 2% +13.0% 1.132e+11 perf-stat.i.instructions
0.91 -2.7% 0.89 perf-stat.i.ipc
0.86 ± 2% +16.3% 1.00 perf-stat.i.metric.GHz
533.97 ± 2% +13.0% 603.48 perf-stat.i.metric.M/sec
4845 ± 2% +4.4% 5058 perf-stat.i.minor-faults
106011 ± 17% -43.2% 60257 ± 28% perf-stat.i.node-loads
56.37 ± 10% +12.1 68.44 ± 7% perf-stat.i.node-store-miss-rate%
1300772 ± 13% -31.9% 886088 ± 31% perf-stat.i.node-stores
4846 ± 2% +4.4% 5059 perf-stat.i.page-faults
26.68 -0.8% 26.47 perf-stat.overall.MPKI
1.10 +2.9% 1.13 perf-stat.overall.cpi
0.01 ± 6% -0.0 0.00 ± 39% perf-stat.overall.dTLB-store-miss-rate%
0.91 -2.8% 0.89 perf-stat.overall.ipc
1.953e+10 ± 2% +13.0% 2.207e+10 perf-stat.ps.branch-instructions
1.252e+08 ± 2% +13.2% 1.418e+08 ± 2% perf-stat.ps.branch-misses
2.662e+09 ± 2% +12.2% 2.986e+09 perf-stat.ps.cache-references
3224941 ± 2% +12.2% 3617930 perf-stat.ps.context-switches
1.095e+11 ± 2% +16.3% 1.273e+11 perf-stat.ps.cpu-cycles
215.47 ± 3% +9.1% 235.06 ± 2% perf-stat.ps.cpu-migrations
2.882e+10 ± 2% +13.1% 3.259e+10 perf-stat.ps.dTLB-loads
1214485 ± 5% -66.4% 407878 ± 39% perf-stat.ps.dTLB-store-misses
1.709e+10 ± 2% +13.1% 1.932e+10 perf-stat.ps.dTLB-stores
9.982e+10 ± 2% +13.0% 1.128e+11 perf-stat.ps.instructions
4823 ± 3% +4.4% 5034 perf-stat.ps.minor-faults
105655 ± 17% -43.2% 59979 ± 28% perf-stat.ps.node-loads
1296468 ± 13% -31.9% 882954 ± 31% perf-stat.ps.node-stores
4824 ± 3% +4.4% 5035 perf-stat.ps.page-faults
3.017e+13 ± 2% +13.1% 3.411e+13 perf-stat.total.instructions
22.12 ± 7% -3.1 19.06 ± 4% perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
18.84 ± 9% -3.0 15.83 ± 5% perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.secondary_startup_64_no_verify
21.94 ± 7% -3.0 18.95 ± 4% perf-profile.calltrace.cycles-pp.cpu_startup_entry.secondary_startup_64_no_verify
21.86 ± 7% -3.0 18.88 ± 4% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.secondary_startup_64_no_verify
17.27 ± 8% -2.8 14.47 ± 5% perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.secondary_startup_64_no_verify
17.04 ± 8% -2.7 14.32 ± 5% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
14.40 ± 5% -1.8 12.57 ± 3% perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
14.23 ± 5% -1.8 12.40 ± 3% perf-profile.calltrace.cycles-pp.mwait_idle_with_hints.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
1.80 ± 37% -0.8 1.02 ± 31% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
1.59 ± 37% -0.7 0.89 ± 32% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
0.55 ± 3% -0.3 0.26 ±100% perf-profile.calltrace.cycles-pp.sctp_chunkify._sctp_make_chunk.sctp_make_datafrag_empty.sctp_datamsg_from_user.sctp_sendmsg_to_asoc
0.66 ± 12% +0.2 0.90 ± 12% perf-profile.calltrace.cycles-pp.__free_pages_ok.skb_release_data.kfree_skb_reason.sctp_recvmsg.inet_recvmsg
4.04 ± 3% +0.3 4.38 ± 2% perf-profile.calltrace.cycles-pp.sctp_outq_sack.sctp_cmd_interpreter.sctp_do_sm.sctp_assoc_bh_rcv.sctp_backlog_rcv
3.98 ± 3% +0.3 4.32 ± 3% perf-profile.calltrace.cycles-pp.sctp_make_datafrag_empty.sctp_datamsg_from_user.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg
0.88 ± 7% +0.4 1.24 ± 6% perf-profile.calltrace.cycles-pp.free_unref_page.skb_release_data.consume_skb.sctp_chunk_put.sctp_outq_sack
0.78 ± 6% +0.4 1.13 ± 4% perf-profile.calltrace.cycles-pp.kmem_cache_free.sctp_recvmsg.inet_recvmsg.____sys_recvmsg.___sys_recvmsg
3.34 ± 3% +0.4 3.70 ± 3% perf-profile.calltrace.cycles-pp._sctp_make_chunk.sctp_make_datafrag_empty.sctp_datamsg_from_user.sctp_sendmsg_to_asoc.sctp_sendmsg
2.11 ± 3% +0.4 2.49 ± 3% perf-profile.calltrace.cycles-pp.consume_skb.sctp_chunk_put.sctp_outq_sack.sctp_cmd_interpreter.sctp_do_sm
2.71 ± 3% +0.4 3.09 ± 3% perf-profile.calltrace.cycles-pp.sctp_chunk_put.sctp_outq_sack.sctp_cmd_interpreter.sctp_do_sm.sctp_assoc_bh_rcv
1.36 ± 5% +0.4 1.75 ± 4% perf-profile.calltrace.cycles-pp.skb_release_data.consume_skb.sctp_chunk_put.sctp_outq_sack.sctp_cmd_interpreter
0.47 ± 45% +0.4 0.87 ± 4% perf-profile.calltrace.cycles-pp.__slab_free.kmem_cache_free.sctp_recvmsg.inet_recvmsg.____sys_recvmsg
1.54 ± 6% +0.4 1.94 ± 6% perf-profile.calltrace.cycles-pp.kmalloc_reserve.__alloc_skb.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter
1.52 ± 6% +0.4 1.92 ± 7% perf-profile.calltrace.cycles-pp.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb.sctp_packet_transmit.sctp_outq_flush
1.88 ± 5% +0.4 2.28 ± 5% perf-profile.calltrace.cycles-pp.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb._sctp_make_chunk.sctp_make_datafrag_empty
1.50 ± 6% +0.4 1.90 ± 7% perf-profile.calltrace.cycles-pp.kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb.sctp_packet_transmit
1.82 ± 5% +0.4 2.22 ± 5% perf-profile.calltrace.cycles-pp.kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb._sctp_make_chunk
1.92 ± 5% +0.4 2.33 ± 5% perf-profile.calltrace.cycles-pp.kmalloc_reserve.__alloc_skb._sctp_make_chunk.sctp_make_datafrag_empty.sctp_datamsg_from_user
2.66 ± 3% +0.4 3.06 ± 4% perf-profile.calltrace.cycles-pp.__alloc_skb._sctp_make_chunk.sctp_make_datafrag_empty.sctp_datamsg_from_user.sctp_sendmsg_to_asoc
1.78 ± 6% +0.4 2.20 ± 6% perf-profile.calltrace.cycles-pp.__alloc_skb.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm
1.36 ± 9% +0.4 1.79 ± 9% perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.kmalloc_large_node.__kmalloc_node_track_caller
7.79 ± 2% +0.7 8.46 perf-profile.calltrace.cycles-pp.sctp_packet_pack.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm
7.41 ± 2% +0.7 8.10 perf-profile.calltrace.cycles-pp.memcpy_erms.sctp_packet_pack.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter
0.00 +0.7 0.74 ± 16% perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page.skb_release_data.kfree_skb_reason
0.00 +0.7 0.74 ± 10% perf-profile.calltrace.cycles-pp.free_unref_page_commit.free_unref_page.skb_release_data.consume_skb.sctp_chunk_put
9.46 +0.8 10.22 perf-profile.calltrace.cycles-pp.sctp_do_sm.sctp_primitive_SEND.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg
9.63 +0.8 10.40 perf-profile.calltrace.cycles-pp.sctp_cmd_interpreter.sctp_do_sm.sctp_primitive_SEND.sctp_sendmsg_to_asoc.sctp_sendmsg
13.53 ± 2% +0.8 14.30 ± 2% perf-profile.calltrace.cycles-pp.sctp_do_sm.sctp_assoc_bh_rcv.sctp_backlog_rcv.__release_sock.release_sock
13.82 ± 2% +0.8 14.60 ± 2% perf-profile.calltrace.cycles-pp.sctp_assoc_bh_rcv.sctp_backlog_rcv.__release_sock.release_sock.sctp_sendmsg
13.84 ± 2% +0.8 14.62 ± 2% perf-profile.calltrace.cycles-pp.sctp_cmd_interpreter.sctp_do_sm.sctp_assoc_bh_rcv.sctp_backlog_rcv.__release_sock
10.83 ± 2% +0.8 11.64 perf-profile.calltrace.cycles-pp.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm.sctp_primitive_SEND
12.82 ± 2% +0.9 13.68 ± 2% perf-profile.calltrace.cycles-pp.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm.sctp_primitive_SEND.sctp_sendmsg_to_asoc
14.77 +1.0 15.74 perf-profile.calltrace.cycles-pp.sctp_primitive_SEND.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg.____sys_sendmsg
0.00 +1.0 0.96 ± 14% perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page.skb_release_data.kfree_skb_reason.sctp_recvmsg
2.05 ± 7% +1.0 3.02 ± 9% perf-profile.calltrace.cycles-pp.kfree_skb_reason.sctp_recvmsg.inet_recvmsg.____sys_recvmsg.___sys_recvmsg
1.40 ± 8% +1.0 2.44 ± 10% perf-profile.calltrace.cycles-pp.skb_release_data.kfree_skb_reason.sctp_recvmsg.inet_recvmsg.____sys_recvmsg
0.00 +1.1 1.10 ± 12% perf-profile.calltrace.cycles-pp.free_unref_page.skb_release_data.kfree_skb_reason.sctp_recvmsg.inet_recvmsg
25.15 +1.2 26.32 perf-profile.calltrace.cycles-pp.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg.____sys_sendmsg.___sys_sendmsg
2.08 ± 6% +1.2 3.30 ± 6% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve
2.48 ± 6% +1.3 3.75 ± 6% perf-profile.calltrace.cycles-pp.__alloc_pages.kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb
49.05 ± 2% +1.7 50.72 perf-profile.calltrace.cycles-pp.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendmsg.main
49.47 ± 2% +1.7 51.21 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.sendmsg.main.__libc_start_main
49.17 ± 2% +1.7 50.91 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendmsg.main.__libc_start_main
51.24 ± 2% +1.8 52.98 perf-profile.calltrace.cycles-pp.__libc_start_main
50.11 ± 2% +1.8 51.86 perf-profile.calltrace.cycles-pp.sendmsg.main.__libc_start_main
50.99 ± 2% +1.8 52.74 perf-profile.calltrace.cycles-pp.main.__libc_start_main
45.54 +2.0 47.52 perf-profile.calltrace.cycles-pp.sock_sendmsg.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg.do_syscall_64
45.10 +2.0 47.08 perf-profile.calltrace.cycles-pp.sctp_sendmsg.sock_sendmsg.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg
47.29 +2.0 49.27 perf-profile.calltrace.cycles-pp.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.20 +2.0 51.18 perf-profile.calltrace.cycles-pp.___sys_sendmsg.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendmsg
19.03 ± 9% -3.1 15.95 ± 5% perf-profile.children.cycles-pp.cpuidle_idle_call
22.08 ± 7% -3.1 19.03 ± 4% perf-profile.children.cycles-pp.do_idle
22.12 ± 7% -3.1 19.06 ± 4% perf-profile.children.cycles-pp.secondary_startup_64_no_verify
22.12 ± 7% -3.1 19.06 ± 4% perf-profile.children.cycles-pp.cpu_startup_entry
17.42 ± 8% -2.9 14.56 ± 5% perf-profile.children.cycles-pp.cpuidle_enter
17.37 ± 8% -2.9 14.52 ± 5% perf-profile.children.cycles-pp.cpuidle_enter_state
14.52 ± 5% -1.9 12.64 ± 3% perf-profile.children.cycles-pp.intel_idle
14.43 ± 5% -1.9 12.55 ± 3% perf-profile.children.cycles-pp.mwait_idle_with_hints
2.25 ± 27% -0.8 1.46 ± 21% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
1.88 ± 29% -0.7 1.18 ± 23% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
1.19 ± 33% -0.5 0.73 ± 25% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
1.16 ± 32% -0.4 0.72 ± 24% perf-profile.children.cycles-pp.hrtimer_interrupt
0.76 ± 45% -0.3 0.44 ± 37% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.52 ± 44% -0.2 0.31 ± 34% perf-profile.children.cycles-pp.tick_sched_timer
0.48 ± 47% -0.2 0.28 ± 36% perf-profile.children.cycles-pp.tick_sched_handle
0.44 ± 44% -0.2 0.27 ± 32% perf-profile.children.cycles-pp.update_process_times
0.32 ± 35% -0.1 0.20 ± 28% perf-profile.children.cycles-pp.__irq_exit_rcu
0.24 ± 35% -0.1 0.16 ± 24% perf-profile.children.cycles-pp.scheduler_tick
0.24 ± 12% -0.1 0.18 ± 19% perf-profile.children.cycles-pp.clockevents_program_event
0.15 ± 23% -0.1 0.10 ± 18% perf-profile.children.cycles-pp.rebalance_domains
0.80 ± 2% -0.0 0.76 perf-profile.children.cycles-pp.sctp_chunkify
0.10 ± 33% -0.0 0.06 ± 21% perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
0.40 ± 6% -0.0 0.36 ± 2% perf-profile.children.cycles-pp.native_sched_clock
0.47 ± 5% -0.0 0.43 ± 2% perf-profile.children.cycles-pp.sched_clock_cpu
0.12 ± 13% -0.0 0.08 ± 7% perf-profile.children.cycles-pp.native_irq_return_iret
0.55 -0.0 0.52 ± 2% perf-profile.children.cycles-pp.sctp_chunk_free
0.34 ± 6% -0.0 0.31 ± 2% perf-profile.children.cycles-pp.rcu_idle_exit
0.08 ± 11% -0.0 0.06 ± 11% perf-profile.children.cycles-pp.lapic_next_deadline
0.33 ± 2% +0.0 0.34 perf-profile.children.cycles-pp.loopback_xmit
0.36 +0.0 0.38 perf-profile.children.cycles-pp.xmit_one
0.12 ± 4% +0.0 0.14 ± 3% perf-profile.children.cycles-pp.__build_skb_around
0.89 ± 2% +0.0 0.93 perf-profile.children.cycles-pp.enqueue_task_fair
0.93 ± 2% +0.0 0.96 perf-profile.children.cycles-pp.ttwu_do_activate
0.44 ± 5% +0.0 0.48 ± 2% perf-profile.children.cycles-pp.__mod_node_page_state
0.22 ± 5% +0.1 0.28 ± 6% perf-profile.children.cycles-pp.rmqueue_bulk
0.44 ± 2% +0.1 0.53 perf-profile.children.cycles-pp.__list_add_valid
2.68 ± 2% +0.1 2.77 perf-profile.children.cycles-pp.try_to_wake_up
2.69 ± 2% +0.1 2.79 perf-profile.children.cycles-pp.autoremove_wake_function
0.29 ± 7% +0.1 0.40 ± 6% perf-profile.children.cycles-pp.__free_one_page
2.98 +0.1 3.09 perf-profile.children.cycles-pp.__wake_up_common
3.45 ± 2% +0.1 3.56 perf-profile.children.cycles-pp.sctp_data_ready
3.07 +0.1 3.18 perf-profile.children.cycles-pp.__wake_up_common_lock
3.64 ± 2% +0.1 3.76 perf-profile.children.cycles-pp.sctp_ulpq_tail_event
0.39 ± 3% +0.2 0.56 ± 3% perf-profile.children.cycles-pp.__zone_watermark_ok
0.67 ± 12% +0.2 0.91 ± 12% perf-profile.children.cycles-pp.__free_pages_ok
0.50 ± 11% +0.3 0.79 ± 8% perf-profile.children.cycles-pp.free_unref_page_commit
0.95 ± 5% +0.3 1.27 ± 4% perf-profile.children.cycles-pp.__slab_free
2.47 ± 3% +0.4 2.83 ± 2% perf-profile.children.cycles-pp.kmem_cache_free
4.13 ± 2% +0.4 4.50 ± 2% perf-profile.children.cycles-pp.sctp_outq_sack
4.06 ± 2% +0.4 4.43 ± 3% perf-profile.children.cycles-pp.sctp_make_datafrag_empty
3.67 ± 2% +0.4 4.06 ± 3% perf-profile.children.cycles-pp._sctp_make_chunk
4.37 ± 2% +0.4 4.76 ± 2% perf-profile.children.cycles-pp.sctp_chunk_put
2.80 ± 2% +0.4 3.20 ± 2% perf-profile.children.cycles-pp.consume_skb
1.22 ± 12% +0.5 1.71 ± 9% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
1.67 ± 8% +0.6 2.22 ± 8% perf-profile.children.cycles-pp.rmqueue
8.31 +0.7 9.02 perf-profile.children.cycles-pp.sctp_packet_pack
7.64 +0.7 8.38 perf-profile.children.cycles-pp.memcpy_erms
0.82 +0.8 1.60 ± 8% perf-profile.children.cycles-pp._raw_spin_lock
2.60 ± 5% +0.8 3.42 ± 6% perf-profile.children.cycles-pp.get_page_from_freelist
3.02 ± 5% +0.8 3.86 ± 6% perf-profile.children.cycles-pp.__alloc_pages
0.13 ± 15% +0.8 0.97 ± 14% perf-profile.children.cycles-pp.free_pcppages_bulk
3.56 ± 5% +0.8 4.40 ± 5% perf-profile.children.cycles-pp.__kmalloc_node_track_caller
3.39 ± 5% +0.8 4.24 ± 6% perf-profile.children.cycles-pp.kmalloc_large_node
3.62 ± 4% +0.9 4.48 ± 5% perf-profile.children.cycles-pp.kmalloc_reserve
15.00 +0.9 15.86 perf-profile.children.cycles-pp.sctp_primitive_SEND
4.83 ± 3% +0.9 5.70 ± 4% perf-profile.children.cycles-pp.__alloc_skb
2.08 ± 7% +1.0 3.04 ± 9% perf-profile.children.cycles-pp.kfree_skb_reason
0.56 ? 23% +1.0 1.56 ? 19% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
1.22 ? 6% +1.2 2.38 ? 8% perf-profile.children.cycles-pp.free_unref_page
27.46 +1.2 28.68 perf-profile.children.cycles-pp.sctp_outq_flush
25.20 +1.2 26.44 perf-profile.children.cycles-pp.sctp_sendmsg_to_asoc
24.22 +1.3 25.47 perf-profile.children.cycles-pp.sctp_packet_transmit
2.95 ? 6% +1.4 4.38 ? 7% perf-profile.children.cycles-pp.skb_release_data
32.36 +1.6 33.92 perf-profile.children.cycles-pp.sctp_do_sm
32.11 +1.6 33.68 perf-profile.children.cycles-pp.sctp_cmd_interpreter
51.46 ? 2% +1.7 53.15 perf-profile.children.cycles-pp.main
51.24 ? 2% +1.8 52.98 perf-profile.children.cycles-pp.__libc_start_main
45.47 +2.0 47.44 perf-profile.children.cycles-pp.sctp_sendmsg
45.56 +2.0 47.53 perf-profile.children.cycles-pp.sock_sendmsg
47.32 +2.0 49.30 perf-profile.children.cycles-pp.____sys_sendmsg
49.23 +2.0 51.22 perf-profile.children.cycles-pp.___sys_sendmsg
49.56 +2.0 51.57 perf-profile.children.cycles-pp.__sys_sendmsg
51.10 +2.0 53.14 perf-profile.children.cycles-pp.sendmsg
73.82 ? 2% +3.0 76.86 perf-profile.children.cycles-pp.do_syscall_64
74.28 ? 2% +3.0 77.32 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
14.36 ? 5% -1.9 12.48 ? 3% perf-profile.self.cycles-pp.mwait_idle_with_hints
0.48 ? 21% -0.1 0.34 ? 13% perf-profile.self.cycles-pp.cpuidle_enter_state
0.39 ? 5% -0.0 0.35 ? 2% perf-profile.self.cycles-pp.native_sched_clock
0.12 ? 13% -0.0 0.08 ? 7% perf-profile.self.cycles-pp.native_irq_return_iret
0.08 ? 11% -0.0 0.06 ? 9% perf-profile.self.cycles-pp.lapic_next_deadline
0.38 ? 3% -0.0 0.35 ? 2% perf-profile.self.cycles-pp.sctp_packet_pack
0.50 ? 2% -0.0 0.48 ? 2% perf-profile.self.cycles-pp.__might_sleep
0.23 -0.0 0.22 ? 2% perf-profile.self.cycles-pp.do_idle
0.07 +0.0 0.09 ? 12% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.20 ? 3% +0.0 0.22 ? 4% perf-profile.self.cycles-pp.enqueue_task_fair
0.02 ?141% +0.0 0.06 ? 9% perf-profile.self.cycles-pp.poll_idle
1.13 +0.0 1.17 perf-profile.self.cycles-pp.kmem_cache_free
0.43 ? 6% +0.0 0.47 ? 2% perf-profile.self.cycles-pp.__mod_node_page_state
0.37 ? 2% +0.1 0.45 ? 2% perf-profile.self.cycles-pp.__list_add_valid
0.84 ? 6% +0.1 0.92 ? 2% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.51 ? 2% +0.1 0.61 ? 4% perf-profile.self.cycles-pp.get_page_from_freelist
0.38 ? 2% +0.2 0.54 ? 4% perf-profile.self.cycles-pp.__zone_watermark_ok
0.76 +0.2 0.96 perf-profile.self.cycles-pp._raw_spin_lock
0.79 ? 8% +0.3 1.08 ? 7% perf-profile.self.cycles-pp.rmqueue
0.44 ? 12% +0.3 0.75 ? 9% perf-profile.self.cycles-pp.free_unref_page_commit
0.93 ? 5% +0.3 1.26 ? 4% perf-profile.self.cycles-pp.__slab_free
7.61 +0.7 8.35 perf-profile.self.cycles-pp.memcpy_erms
0.56 ? 23% +1.0 1.55 ? 19% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
(No filename) (46.21 kB)
config-5.17.0-00103-gf26b3fa04611 (163.87 kB)
job-script (8.28 kB)
job.yaml (5.70 kB)
reproduce (341.00 B)

2022-05-01 19:45:15

by Aaron Lu

Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

Hi Mel,

On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
>
> (please be noted we reported
> "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
> on
> https://lore.kernel.org/all/[email protected]/
> while the commit is on branch.
> now we still observe similar regression when it's on mainline, and we also
> observe a 13.2% improvement on another netperf subtest.
> so report again for information)
>
> Greeting,
>
> FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
>
>
> commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>

So what this commit did is: if a CPU is always doing frees (pcp->free_factor > 0)
and the high-order page being freed has order <= PAGE_ALLOC_COSTLY_ORDER,
then do not use the PCP but free the page directly to buddy.
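
For illustration, the condition described above can be modelled in userspace
C roughly as follows. This is only a sketch under stated assumptions: the
helper name wants_free_high() is hypothetical, PAGE_ALLOC_COSTLY_ORDER is
assumed to be 3 as in mainline, and the kernel evaluates its own version of
this test inside the page free path rather than in a helper like this one.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value; mainline defines PAGE_ALLOC_COSTLY_ORDER as 3. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* A "high-order" page has order >= 1, so the costly order must allow that. */
static_assert(PAGE_ALLOC_COSTLY_ORDER >= 1, "costly order must cover order 1");

/*
 * Hypothetical model of the decision described in the text:
 *  - free_factor > 0 means the CPU has recently been freeing without
 *    allocating,
 *  - order > 0 means the page is a high-order page,
 *  - order <= PAGE_ALLOC_COSTLY_ORDER excludes larger (e.g. THP) orders.
 * When all three hold, the page bypasses the PCP list.
 */
bool wants_free_high(int free_factor, unsigned int order)
{
	return free_factor > 0 && order > 0 &&
	       order <= PAGE_ALLOC_COSTLY_ORDER;
}
```

With these assumptions, wants_free_high(1, 2) is true, while an order-0 page
or a CPU with free_factor == 0 keeps using the PCP list.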

The rationale as explained in the commit's changelog is:
"
Netperf running on localhost exhibits this pattern and while it does not
matter for some machines, it does matter for others with smaller caches
where cache misses cause problems due to reduced page reuse. Pages
freed directly to the buddy list may be reused quickly while still cache
hot where as storing on the PCP lists may be cold by the time
free_pcppages_bulk() is called.
"

This regression occurred on a machine that has large caches, so this
optimization brings no value to it but only overhead (the PCP is skipped).
I guess this is the reason why there is a regression.

I have also tested this case on a small machine: a skylake desktop and
this commit shows improvement:
8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%

So this means those directly freed pages get reused by allocator side
and that brings performance improvement for machines with smaller cache.

I wonder if we should still use PCP a little bit under the above said
condition, for the purpose of:
1 reduced overhead in the free path for machines with large cache;
2 still keeps the benefit of reused pages for machines with smaller cache.

For this reason, I tested increasing nr_pcp_high() from returning 0 to
either returning pcp->batch or (pcp->batch << 2):
machine \ nr_pcp_high() returns:  pcp->high        0  pcp->batch  (pcp->batch << 2)
skylake desktop:                      72288    90784       92219              91528
icelake 2sockets:                    120956    99177       98251             116108

Note: nr_pcp_high() returning pcp->high is the behaviour of this commit's
parent; returning 0 is the behaviour of this commit.

The result shows that if we effectively use a PCP high of (pcp->batch << 2)
for the described condition, then this workload's performance on the
small machine is retained while the regression on large machines is
greatly reduced (from -18% to -4%).
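
The percentages quoted here follow directly from the table above. A minimal
sketch of the arithmetic, using hypothetical helper names pct_change() and
report() that are not part of any tool used in this thread:

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of `value` against the baseline `base`, in percent. */
double pct_change(double base, double value)
{
	assert(base != 0.0);
	return (value - base) / base * 100.0;
}

/*
 * Print one table row as changes relative to the pcp->high column,
 * i.e. relative to the behaviour of the commit's parent.
 */
void report(const char *machine, double base, double ret_zero,
	    double ret_batch, double ret_batch4)
{
	printf("%-18s 0: %+.1f%%  batch: %+.1f%%  batch<<2: %+.1f%%\n",
	       machine,
	       pct_change(base, ret_zero),
	       pct_change(base, ret_batch),
	       pct_change(base, ret_batch4));
}
```

Calling report("icelake 2sockets", 120956, 99177, 98251, 116108) works out to
roughly -18.0%, -18.8% and -4.0%, and report("skylake desktop", 72288, 90784,
92219, 91528) to roughly +25.6%, +27.6% and +26.6%.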

> in testcase: netperf
> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> with following parameters:
>
> ip: ipv4
> runtime: 300s
> nr_threads: 1
> cluster: cs-localhost
> test: UDP_STREAM
> cpufreq_governor: performance
> ucode: 0xd000331
>
> test-description: Netperf is a benchmark that can be use to measure various aspect of networking performance.
> test-url: http://www.netperf.org/netperf/
>
> In addition to that, the commit also has significant impact on the following tests:
>

> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | netperf: netperf.Throughput_Mbps 13.2% improvement |
> | test machine | 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory |
> | test parameters | cluster=cs-localhost |
> | | cpufreq_governor=performance |
> | | ip=ipv4 |
> | | nr_threads=25% |
> | | runtime=300s |
> | | send_size=10K |
> | | test=SCTP_STREAM_MANY |
> | | ucode=0xd000331 |
> +------------------+-------------------------------------------------------------------------------------+
>

And when nr_pcp_high() returns (pcp->batch << 2), the improvement will
drop from 13.2% to 5.7%, not great but still an improvement...

The said change looks like this:
(relevant comment will have to be adjusted)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 505d59f7d4fa..130a02af8321 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3332,18 +3332,19 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		       bool free_high)
 {
 	int high = READ_ONCE(pcp->high);
+	int batch = READ_ONCE(pcp->batch);

-	if (unlikely(!high || free_high))
+	if (unlikely(!high))
 		return 0;

-	if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
-		return high;
-
 	/*
 	 * If reclaim is active, limit the number of pages that can be
 	 * stored on pcp lists
 	 */
-	return min(READ_ONCE(pcp->batch) << 2, high);
+	if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags) || free_high)
+		return min(batch << 2, high);
+
+	return high;
 }

 static void free_unref_page_commit(struct page *page, int migratetype,

Does this look sane? If so, I can prepare a formal patch with proper
comment and changelog, thanks.
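
To make the difference between the two behaviours easier to compare, here is
a userspace model of nr_pcp_high() before and after the diff above. It is a
sketch, not kernel code: struct pcp_model, min_int() and the two function
names are hypothetical, READ_ONCE() is dropped, and the ZONE_RECLAIM_ACTIVE
bit is passed in as a plain bool.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two per_cpu_pages fields that are read. */
struct pcp_model {
	int high;	/* models pcp->high */
	int batch;	/* models pcp->batch */
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Behaviour of f26b3fa046: free_high forces the limit to 0. */
int nr_pcp_high_current(const struct pcp_model *pcp, bool reclaim_active,
			bool free_high)
{
	if (!pcp->high || free_high)
		return 0;
	if (!reclaim_active)
		return pcp->high;
	return min_int(pcp->batch << 2, pcp->high);
}

/* Behaviour with the diff applied: free_high is capped like reclaim. */
int nr_pcp_high_proposed(const struct pcp_model *pcp, bool reclaim_active,
			 bool free_high)
{
	if (!pcp->high)
		return 0;
	if (reclaim_active || free_high)
		return min_int(pcp->batch << 2, pcp->high);
	return pcp->high;
}
```

For example, with the made-up values high = 1000 and batch = 63 and free_high
set, the current code returns 0 while the proposed code returns 252, so up to
four batches of pages may stay on the PCP list before a bulk free.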

2022-05-02 23:19:30

by Mel Gorman

Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

On Fri, Apr 29, 2022 at 07:29:19PM +0800, Aaron Lu wrote:
> Hi Mel,
>
> On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> >
> > (please be noted we reported
> > "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
> > on
> > https://lore.kernel.org/all/[email protected]/
> > while the commit is on branch.
> > now we still observe similar regression when it's on mainline, and we also
> > observe a 13.2% improvement on another netperf subtest.
> > so report again for information)
> >
> > Greeting,
> >
> > FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
> >
> >
> > commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
>
> So what this commit did is: if a CPU is always doing free(pcp->free_factor > 0)
> and if the being freed high-order page's order is <= PAGE_ALLOC_COSTLY_ORDER,
> then do not use PCP but directly free the page directly to buddy.
>

Yes.

> This regression occurred on a machine that has large caches so this
> optimization brings no value to it but only overhead(skipped PCP), I
> guess this is the reason why there is a regression.
>
> I have also tested this case on a small machine: a skylake desktop and
> this commit shows improvement:
> 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
>
> So this means those directly freed pages get reused by allocator side
> and that brings performance improvement for machines with smaller cache.
>
> I wonder if we should still use PCP a little bit under the above said
> condition, for the purpose of:
> 1 reduced overhead in the free path for machines with large cache;
> 2 still keeps the benefit of reused pages for machines with smaller cache.
>

Ideally yes although the exact timing is going to depend on the cache
size so even if it's right for one machine, it's not necessarily right
for another.

Going through the buddy, pages get reused quickly and remain cache
hot. Going through PCP contends less on zone->lock, but pages get reused
too late on microbenchmarks dealing with small amounts of data. As the
threshold couldn't be predicted, I went with "free to buddy" immediately.

> > in testcase: netperf
> > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> > with following parameters:
> >
> > ip: ipv4
> > runtime: 300s
> > nr_threads: 1
> > cluster: cs-localhost
> > test: UDP_STREAM
> > cpufreq_governor: performance
> > ucode: 0xd000331
> >
> > test-description: Netperf is a benchmark that can be use to measure various aspect of networking performance.
> > test-url: http://www.netperf.org/netperf/
> >
> > In addition to that, the commit also has significant impact on the following tests:
> >
>
> > +------------------+-------------------------------------------------------------------------------------+
> > | testcase: change | netperf: netperf.Throughput_Mbps 13.2% improvement |
> > | test machine | 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory |
> > | test parameters | cluster=cs-localhost |
> > | | cpufreq_governor=performance |
> > | | ip=ipv4 |
> > | | nr_threads=25% |
> > | | runtime=300s |
> > | | send_size=10K |
> > | | test=SCTP_STREAM_MANY |
> > | | ucode=0xd000331 |
> > +------------------+-------------------------------------------------------------------------------------+
> >
>
> And when nr_pcp_high() returns (pcp->batch << 2), the improvement will
> drop from 13.2% to 5.7%, not great but still an improvement...
>
> The said change looks like this:
> (relevant comment will have to be adjusted)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 505d59f7d4fa..130a02af8321 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3332,18 +3332,19 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
> bool free_high)
> {
> int high = READ_ONCE(pcp->high);
> + int batch = READ_ONCE(pcp->batch);
>
> - if (unlikely(!high || free_high))
> + if (unlikely(!high))
> return 0;
>
> - if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
> - return high;
> -
> /*
> * If reclaim is active, limit the number of pages that can be
> * stored on pcp lists
> */
> - return min(READ_ONCE(pcp->batch) << 2, high);
> + if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags) || free_high)
> + return min(batch << 2, high);
> +
> + return high;
> }
>
> static void free_unref_page_commit(struct page *page, int migratetype,
>
> Does this look sane? If so, I can prepare a formal patch with proper
> comment and changelog, thanks.

I think it looks reasonably sane. The corner case is that if
((high - (batch >> 2)) > cachesize), the pages will not get recycled
quickly enough. On the plus side, always freeing to buddy may contend on the
zone lock again, and freeing in batches reduces that risk.

Given that zone lock contention is reduced regardless of cache size, it
seems like a reasonable tradeoff.

--
Mel Gorman
SUSE Labs