2022-04-21 08:51:09

by kernel test robot

Subject: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression


(Please note: we previously reported
"[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
at
https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
while the commit was still on a branch.
We still observe a similar regression now that the commit is on mainline, and
we also observe a 13.2% improvement on another netperf subtest,
so we are reporting again for information.)

Greetings,

FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:


commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: netperf
on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
with following parameters:

ip: ipv4
runtime: 300s
nr_threads: 1
cluster: cs-localhost
test: UDP_STREAM
cpufreq_governor: performance
ucode: 0xd000331

test-description: Netperf is a benchmark that can be used to measure various aspects of networking performance.
test-url: http://www.netperf.org/netperf/

In addition to that, the commit also has significant impact on the following tests:

+------------------+-------------------------------------------------------------------------------------+
| testcase: change | netperf: netperf.Throughput_Mbps 13.2% improvement |
| test machine | 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory |
| test parameters | cluster=cs-localhost |
| | cpufreq_governor=performance |
| | ip=ipv4 |
| | nr_threads=25% |
| | runtime=300s |
| | send_size=10K |
| | test=SCTP_STREAM_MANY |
| | ucode=0xd000331 |
+------------------+-------------------------------------------------------------------------------------+


If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if you come across any failure that blocks the test,
# please remove the ~/.lkp and /lkp directories to run from a clean state.

=========================================================================================
cluster/compiler/cpufreq_governor/ip/kconfig/nr_threads/rootfs/runtime/tbox_group/test/testcase/ucode:
cs-localhost/gcc-11/performance/ipv4/x86_64-rhel-8.3/1/debian-10.4-x86_64-20200603.cgz/300s/lkp-icl-2sp4/UDP_STREAM/netperf/0xd000331

commit:
8b10b465d0 ("mm/page_alloc: free pages in a single pass during bulk free")
f26b3fa046 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")

8b10b465d0e18b00 f26b3fa046116a7dedcaafe3008
---------------- ---------------------------
%stddev %change %stddev
\ | \
120956 ? 2% -18.0% 99177 netperf.Throughput_Mbps
120956 ? 2% -18.0% 99177 netperf.Throughput_total_Mbps
90.83 -2.0% 89.00 netperf.time.percent_of_cpu_this_job_got
69242552 ? 2% -18.0% 56775058 netperf.workload
29460 ? 2% +25.7% 37044 meminfo.Shmem
96933 ?198% +9094.3% 8912386 ? 7% turbostat.POLL
1746 ? 2% +6694.6% 118678 ? 3% vmstat.system.cs
293357 ? 7% -21.2% 231238 ? 17% sched_debug.cfs_rq:/.min_vruntime.max
269394 ? 8% -23.6% 205870 ? 17% sched_debug.cfs_rq:/.spread0.max
239945 ? 64% -99.5% 1108 ? 2% sched_debug.cpu.avg_idle.min
122694 ? 18% +26.2% 154895 ? 6% sched_debug.cpu.avg_idle.stddev
4705 ? 2% +2916.4% 141948 ? 3% sched_debug.cpu.nr_switches.avg
65447 ? 3% +9997.6% 6608655 ? 13% sched_debug.cpu.nr_switches.max
8178 ? 3% +9737.7% 804544 ? 5% sched_debug.cpu.nr_switches.stddev
250093 ? 8% +15.0% 287675 ? 7% perf-stat.i.cache-misses
1674 ? 2% +7043.4% 119598 ? 3% perf-stat.i.context-switches
3127 +1.8% 3183 perf-stat.i.minor-faults
7495 ? 24% +76.9% 13260 ? 6% perf-stat.i.node-loads
3128 +1.8% 3184 perf-stat.i.page-faults
0.05 ? 7% +0.0 0.06 ? 11% perf-stat.overall.cache-miss-rate%
45827 ? 6% -13.7% 39529 ? 10% perf-stat.overall.cycles-between-cache-misses
87.75 ? 3% -7.5 80.29 ? 2% perf-stat.overall.node-load-miss-rate%
18242 ? 5% +22.8% 22395 ? 2% perf-stat.overall.path-length
249180 ? 8% +15.0% 286678 ? 7% perf-stat.ps.cache-misses
1668 ? 2% +7044.3% 119200 ? 3% perf-stat.ps.context-switches
3114 +1.8% 3170 perf-stat.ps.minor-faults
7465 ? 24% +77.0% 13213 ? 6% perf-stat.ps.node-loads
3115 +1.8% 3171 perf-stat.ps.page-faults
2640 ? 3% -3.7% 2541 proc-vmstat.nr_active_anon
71813 +2.8% 73854 proc-vmstat.nr_inactive_anon
9669 +2.7% 9930 proc-vmstat.nr_mapped
7368 ? 2% +25.7% 9262 proc-vmstat.nr_shmem
2640 ? 3% -3.7% 2541 proc-vmstat.nr_zone_active_anon
71813 +2.8% 73854 proc-vmstat.nr_zone_inactive_anon
419.83 ?190% +1461.8% 6556 ? 15% proc-vmstat.numa_hint_faults
380.83 ?212% +1374.0% 5613 ? 3% proc-vmstat.numa_hint_faults_local
1.336e+08 -13.8% 1.152e+08 ? 2% proc-vmstat.numa_hit
1.337e+08 -13.6% 1.156e+08 ? 2% proc-vmstat.numa_local
8502 ? 97% +311.4% 34976 ? 9% proc-vmstat.numa_pte_updates
7931 +1121.7% 96900 ? 4% proc-vmstat.pgactivate
1.33e+08 -14.0% 1.144e+08 proc-vmstat.pgalloc_normal
1060109 +1.2% 1073035 proc-vmstat.pgfault
1.33e+08 -14.0% 1.144e+08 proc-vmstat.pgfree
1.26 ? 19% +0.6 1.81 ? 17% perf-profile.calltrace.cycles-pp.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core
1.18 ? 19% +0.6 1.79 ? 17% perf-profile.calltrace.cycles-pp.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish
0.00 +0.7 0.69 ? 21% perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.sock_def_readable
0.00 +0.7 0.70 ? 21% perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.sock_def_readable.__udp_enqueue_schedule_skb
0.00 +0.7 0.71 ? 16% perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page.skb_release_data.__consume_stateless_skb.udp_recvmsg
0.00 +0.8 0.82 ? 22% perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.sock_def_readable.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb
0.00 +0.8 0.83 ? 23% perf-profile.calltrace.cycles-pp.__schedule.schedule.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp
0.00 +0.8 0.84 ? 22% perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.secondary_startup_64_no_verify
0.00 +0.9 0.86 ? 23% perf-profile.calltrace.cycles-pp.schedule.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg
0.00 +0.9 0.87 ? 22% perf-profile.calltrace.cycles-pp.__wake_up_common_lock.sock_def_readable.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb
0.00 +0.9 0.87 ? 22% perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.secondary_startup_64_no_verify
0.00 +0.9 0.88 ? 23% perf-profile.calltrace.cycles-pp.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg.inet_recvmsg
0.00 +1.0 0.97 ? 24% perf-profile.calltrace.cycles-pp.sock_def_readable.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv
0.20 ?142% +1.1 1.33 ? 19% perf-profile.calltrace.cycles-pp.__udp_enqueue_schedule_skb.udp_queue_rcv_one_skb.udp_unicast_rcv_skb.__udp4_lib_rcv.ip_protocol_deliver_rcu
0.00 +1.2 1.19 ? 21% perf-profile.calltrace.cycles-pp.__skb_wait_for_more_packets.__skb_recv_udp.udp_recvmsg.inet_recvmsg.__sys_recvfrom
0.42 ? 71% +1.6 2.06 ? 20% perf-profile.calltrace.cycles-pp.__skb_recv_udp.udp_recvmsg.inet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom
0.41 ? 19% -0.3 0.15 ? 21% perf-profile.children.cycles-pp.udp_rmem_release
0.65 ? 16% -0.2 0.45 ? 14% perf-profile.children.cycles-pp.kfree
0.44 ? 14% -0.2 0.26 ? 15% perf-profile.children.cycles-pp.__slab_free
0.58 ? 8% -0.2 0.42 ? 18% perf-profile.children.cycles-pp.free_pcp_prepare
0.17 ? 13% -0.1 0.07 ? 15% perf-profile.children.cycles-pp.free_unref_page_commit
0.24 ? 5% -0.1 0.18 ? 19% perf-profile.children.cycles-pp.kmem_cache_free
0.21 ? 14% -0.1 0.16 ? 17% perf-profile.children.cycles-pp.send_data
0.10 ? 15% +0.0 0.15 ? 8% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.00 +0.1 0.06 ? 13% perf-profile.children.cycles-pp.finish_wait
0.00 +0.1 0.06 ? 13% perf-profile.children.cycles-pp.__update_load_avg_se
0.00 +0.1 0.06 ? 23% perf-profile.children.cycles-pp.ttwu_do_wakeup
0.00 +0.1 0.06 ? 11% perf-profile.children.cycles-pp.switch_mm_irqs_off
0.02 ?141% +0.1 0.08 ? 10% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.10 ? 4% +0.1 0.16 ? 19% perf-profile.children.cycles-pp.__list_add_valid
0.00 +0.1 0.08 ? 14% perf-profile.children.cycles-pp.flush_smp_call_function_queue
0.00 +0.1 0.08 ? 37% perf-profile.children.cycles-pp.nohz_run_idle_balance
0.01 ?223% +0.1 0.09 ? 21% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
0.21 ? 11% +0.1 0.29 ? 16% perf-profile.children.cycles-pp.skb_set_owner_w
0.08 ? 11% +0.1 0.17 ? 23% perf-profile.children.cycles-pp.memcg_slab_free_hook
0.00 +0.1 0.08 ? 36% perf-profile.children.cycles-pp.tick_nohz_idle_enter
0.00 +0.1 0.08 ? 19% perf-profile.children.cycles-pp.prepare_task_switch
0.00 +0.1 0.10 ? 34% perf-profile.children.cycles-pp.prepare_to_wait_exclusive
0.07 ? 48% +0.1 0.20 ? 20% perf-profile.children.cycles-pp._raw_spin_lock_bh
0.73 ? 6% +0.1 0.86 ? 9% perf-profile.children.cycles-pp._raw_spin_lock
0.00 +0.1 0.14 ? 34% perf-profile.children.cycles-pp.set_next_entity
0.07 ? 18% +0.2 0.22 ? 26% perf-profile.children.cycles-pp.__zone_watermark_ok
0.00 +0.2 0.17 ? 11% perf-profile.children.cycles-pp.enqueue_entity
0.00 +0.2 0.17 ? 29% perf-profile.children.cycles-pp.update_load_avg
0.00 +0.2 0.18 ? 21% perf-profile.children.cycles-pp.__switch_to
0.00 +0.2 0.19 ? 22% perf-profile.children.cycles-pp.sched_ttwu_pending
0.00 +0.2 0.19 ? 21% perf-profile.children.cycles-pp.ttwu_queue_wakelist
0.27 ? 34% +0.2 0.47 ? 15% perf-profile.children.cycles-pp.update_rq_clock
0.00 +0.2 0.19 ? 20% perf-profile.children.cycles-pp.update_curr
0.00 +0.2 0.21 ? 7% perf-profile.children.cycles-pp.enqueue_task_fair
0.00 +0.2 0.22 ? 6% perf-profile.children.cycles-pp.ttwu_do_activate
0.00 +0.2 0.24 ? 25% perf-profile.children.cycles-pp.pick_next_task_fair
0.28 ? 14% +0.2 0.52 ? 8% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.00 +0.3 0.26 ? 17% perf-profile.children.cycles-pp.__sysvec_call_function_single
0.00 +0.3 0.32 ? 17% perf-profile.children.cycles-pp.sysvec_call_function_single
0.39 ? 14% +0.3 0.72 ? 16% perf-profile.children.cycles-pp.free_pcppages_bulk
0.00 +0.3 0.33 ? 26% perf-profile.children.cycles-pp.dequeue_entity
0.00 +0.4 0.37 ? 24% perf-profile.children.cycles-pp.dequeue_task_fair
0.00 +0.4 0.41 ? 25% perf-profile.children.cycles-pp.finish_task_switch
0.00 +0.5 0.50 ? 19% perf-profile.children.cycles-pp.asm_sysvec_call_function_single
1.26 ? 19% +0.6 1.81 ? 17% perf-profile.children.cycles-pp.udp_unicast_rcv_skb
1.19 ? 19% +0.6 1.80 ? 17% perf-profile.children.cycles-pp.udp_queue_rcv_one_skb
0.00 +0.7 0.71 ? 21% perf-profile.children.cycles-pp.autoremove_wake_function
0.00 +0.7 0.71 ? 20% perf-profile.children.cycles-pp.try_to_wake_up
0.00 +0.8 0.83 ? 21% perf-profile.children.cycles-pp.__wake_up_common
0.49 ? 17% +0.8 1.34 ? 19% perf-profile.children.cycles-pp.__udp_enqueue_schedule_skb
0.00 +0.9 0.87 ? 22% perf-profile.children.cycles-pp.schedule_idle
0.00 +0.9 0.88 ? 22% perf-profile.children.cycles-pp.__wake_up_common_lock
0.00 +0.9 0.89 ? 23% perf-profile.children.cycles-pp.schedule_timeout
0.01 ?223% +0.9 0.90 ? 22% perf-profile.children.cycles-pp.schedule
0.03 ?100% +0.9 0.97 ? 24% perf-profile.children.cycles-pp.sock_def_readable
0.00 +1.2 1.19 ? 21% perf-profile.children.cycles-pp.__skb_wait_for_more_packets
0.57 ? 19% +1.5 2.08 ? 20% perf-profile.children.cycles-pp.__skb_recv_udp
0.05 ? 47% +1.7 1.73 ? 22% perf-profile.children.cycles-pp.__schedule
0.24 ? 21% -0.2 0.03 ?100% perf-profile.self.cycles-pp.udp_rmem_release
0.44 ? 14% -0.2 0.26 ? 16% perf-profile.self.cycles-pp.__slab_free
0.58 ? 8% -0.2 0.42 ? 18% perf-profile.self.cycles-pp.free_pcp_prepare
0.29 ? 19% -0.1 0.16 ? 10% perf-profile.self.cycles-pp.kfree
0.28 ? 17% -0.1 0.18 ? 18% perf-profile.self.cycles-pp.udp_recvmsg
0.13 ? 13% -0.1 0.05 ? 45% perf-profile.self.cycles-pp.free_unref_page_commit
0.24 ? 21% -0.1 0.17 ? 17% perf-profile.self.cycles-pp.send_omni_inner
0.08 ? 10% -0.0 0.02 ? 99% perf-profile.self.cycles-pp.kmem_cache_free
0.10 ? 19% -0.0 0.06 ? 52% perf-profile.self.cycles-pp.__dev_queue_xmit
0.12 ? 15% -0.0 0.09 ? 18% perf-profile.self.cycles-pp.__cgroup_bpf_run_filter_skb
0.06 ? 9% +0.1 0.12 ? 18% perf-profile.self.cycles-pp.free_pcppages_bulk
0.00 +0.1 0.06 ? 14% perf-profile.self.cycles-pp.flush_smp_call_function_queue
0.03 ?100% +0.1 0.10 ? 42% perf-profile.self.cycles-pp.sock_def_readable
0.08 ? 8% +0.1 0.15 ? 18% perf-profile.self.cycles-pp.__list_add_valid
0.00 +0.1 0.08 ? 22% perf-profile.self.cycles-pp.update_curr
0.00 +0.1 0.08 ? 20% perf-profile.self.cycles-pp.enqueue_entity
0.01 ?223% +0.1 0.09 ? 20% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
0.21 ? 11% +0.1 0.29 ? 16% perf-profile.self.cycles-pp.skb_set_owner_w
0.08 ? 12% +0.1 0.16 ? 22% perf-profile.self.cycles-pp.memcg_slab_free_hook
0.00 +0.1 0.09 ? 33% perf-profile.self.cycles-pp.set_next_entity
0.00 +0.1 0.12 ? 31% perf-profile.self.cycles-pp.__wake_up_common
0.07 ? 50% +0.1 0.19 ? 21% perf-profile.self.cycles-pp._raw_spin_lock_bh
0.00 +0.1 0.12 ? 16% perf-profile.self.cycles-pp.try_to_wake_up
0.04 ?101% +0.1 0.17 ? 11% perf-profile.self.cycles-pp.update_rq_clock
0.00 +0.2 0.15 ? 17% perf-profile.self.cycles-pp.__skb_wait_for_more_packets
0.00 +0.2 0.15 ? 26% perf-profile.self.cycles-pp.finish_task_switch
0.17 ? 14% +0.2 0.33 ? 24% perf-profile.self.cycles-pp.skb_release_data
0.04 ? 72% +0.2 0.22 ? 26% perf-profile.self.cycles-pp.__zone_watermark_ok
0.00 +0.2 0.18 ? 21% perf-profile.self.cycles-pp.__switch_to
0.26 ? 16% +0.2 0.48 ? 8% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.00 +0.3 0.27 ? 23% perf-profile.self.cycles-pp.__schedule
0.04 ? 71% +0.4 0.45 ? 21% perf-profile.self.cycles-pp.__skb_recv_udp


***************************************************************************************************
lkp-icl-2sp4: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
=========================================================================================
cluster/compiler/cpufreq_governor/ip/kconfig/nr_threads/rootfs/runtime/send_size/tbox_group/test/testcase/ucode:
cs-localhost/gcc-11/performance/ipv4/x86_64-rhel-8.3/25%/debian-10.4-x86_64-20200603.cgz/300s/10K/lkp-icl-2sp4/SCTP_STREAM_MANY/netperf/0xd000331

commit:
8b10b465d0 ("mm/page_alloc: free pages in a single pass during bulk free")
f26b3fa046 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")

8b10b465d0e18b00 f26b3fa046116a7dedcaafe3008
---------------- ---------------------------
%stddev %change %stddev
\ | \
14785 ? 2% +13.2% 16740 netperf.Throughput_Mbps
473143 ? 2% +13.2% 535690 netperf.Throughput_total_Mbps
17542 ? 6% +24.5% 21835 ? 2% netperf.time.involuntary_context_switches
1342 ? 3% +19.2% 1600 netperf.time.percent_of_cpu_this_job_got
3935 ? 3% +19.4% 4698 netperf.time.system_time
110.45 ? 3% +15.2% 127.26 netperf.time.user_time
199875 ? 5% -36.6% 126767 ? 9% netperf.time.voluntary_context_switches
1.733e+09 ? 2% +13.2% 1.962e+09 netperf.workload
3.48 ? 3% +0.5 3.94 mpstat.cpu.all.soft%
16.79 ? 3% +3.4 20.23 mpstat.cpu.all.sys%
0.68 ? 3% +0.1 0.78 mpstat.cpu.all.usr%
27.83 ? 5% +22.8% 34.17 ? 3% vmstat.procs.r
3208349 ? 2% +12.1% 3596993 vmstat.system.cs
263494 +3.6% 273026 vmstat.system.in
1.101e+09 ? 6% +16.4% 1.282e+09 ? 3% numa-numastat.node0.local_node
1.1e+09 ? 6% +16.4% 1.28e+09 ? 3% numa-numastat.node0.numa_hit
1.151e+09 ? 4% +10.2% 1.269e+09 ? 2% numa-numastat.node1.local_node
1.149e+09 ? 4% +10.2% 1.265e+09 ? 2% numa-numastat.node1.numa_hit
1.1e+09 ? 6% +16.4% 1.28e+09 ? 3% numa-vmstat.node0.numa_hit
1.101e+09 ? 6% +16.4% 1.282e+09 ? 3% numa-vmstat.node0.numa_local
1.149e+09 ? 4% +10.2% 1.265e+09 ? 2% numa-vmstat.node1.numa_hit
1.151e+09 ? 4% +10.2% 1.269e+09 ? 2% numa-vmstat.node1.numa_local
953763 ? 18% +33.6% 1273973 ? 8% meminfo.Active
953603 ? 18% +33.6% 1273684 ? 8% meminfo.Active(anon)
1450710 ? 13% +23.9% 1797564 ? 6% meminfo.Committed_AS
484102 ? 18% +32.7% 642218 ? 9% meminfo.Mapped
983413 ? 18% +34.8% 1326115 ? 8% meminfo.Shmem
812.50 ? 2% +16.5% 946.17 turbostat.Avg_MHz
24.64 ? 2% +4.1 28.73 turbostat.Busy%
4.704e+08 ? 2% +11.0% 5.219e+08 turbostat.C1
5.57 ? 2% +0.7 6.26 turbostat.C1%
0.37 ? 10% -16.1% 0.31 turbostat.IPC
1004055 ? 2% +11.4% 1118247 turbostat.POLL
0.02 +0.0 0.03 turbostat.POLL%
416.33 ? 4% +7.5% 447.50 turbostat.PkgWatt
238335 ? 18% +33.4% 317828 ? 8% proc-vmstat.nr_active_anon
811128 ? 5% +10.5% 896520 ? 3% proc-vmstat.nr_file_pages
80808 ? 2% +7.4% 86814 ? 3% proc-vmstat.nr_inactive_anon
120937 ? 19% +34.1% 162233 ? 9% proc-vmstat.nr_mapped
1938 ? 2% +4.7% 2029 ? 2% proc-vmstat.nr_page_table_pages
245826 ? 18% +34.6% 330998 ? 8% proc-vmstat.nr_shmem
238335 ? 18% +33.4% 317828 ? 8% proc-vmstat.nr_zone_active_anon
80808 ? 2% +7.4% 86814 ? 3% proc-vmstat.nr_zone_inactive_anon
2.248e+09 ? 2% +13.2% 2.545e+09 proc-vmstat.numa_hit
2.253e+09 ? 2% +13.2% 2.551e+09 proc-vmstat.numa_local
260577 ? 15% +33.6% 348260 ? 11% proc-vmstat.pgactivate
5.944e+09 ? 2% +13.2% 6.73e+09 proc-vmstat.pgalloc_normal
1579108 ? 2% +3.8% 1638994 proc-vmstat.pgfault
5.944e+09 ? 2% +13.2% 6.73e+09 proc-vmstat.pgfree
850785 ? 19% +64.9% 1403095 ? 8% sched_debug.cfs_rq:/.MIN_vruntime.max
110144 ? 17% +78.2% 196314 ? 20% sched_debug.cfs_rq:/.MIN_vruntime.stddev
0.26 ? 15% +25.6% 0.33 ? 12% sched_debug.cfs_rq:/.h_nr_running.avg
36930 ? 6% -19.2% 29847 ? 4% sched_debug.cfs_rq:/.load.max
13805 ? 8% -14.6% 11792 ? 3% sched_debug.cfs_rq:/.load.stddev
850785 ? 19% +64.9% 1403095 ? 8% sched_debug.cfs_rq:/.max_vruntime.max
110144 ? 17% +78.2% 196314 ? 20% sched_debug.cfs_rq:/.max_vruntime.stddev
803157 ? 9% +44.1% 1157345 ? 10% sched_debug.cfs_rq:/.min_vruntime.avg
1328522 ? 10% +31.4% 1746141 ? 9% sched_debug.cfs_rq:/.min_vruntime.max
349319 ? 17% +85.6% 648499 ? 17% sched_debug.cfs_rq:/.min_vruntime.min
209093 ? 8% +17.5% 245777 ? 8% sched_debug.cfs_rq:/.min_vruntime.stddev
279.98 ? 11% +24.1% 347.54 ? 7% sched_debug.cfs_rq:/.runnable_avg.avg
209084 ? 8% +17.5% 245769 ? 8% sched_debug.cfs_rq:/.spread0.stddev
279.78 ? 11% +24.2% 347.36 ? 7% sched_debug.cfs_rq:/.util_avg.avg
183.66 ? 15% +29.2% 237.36 ? 10% sched_debug.cfs_rq:/.util_est_enqueued.avg
1276 ? 11% +19.6% 1526 ? 5% sched_debug.cpu.curr->pid.avg
0.21 ? 10% +18.4% 0.25 ? 4% sched_debug.cpu.nr_running.avg
26.69 -0.8% 26.49 perf-stat.i.MPKI
1.96e+10 ? 2% +13.0% 2.215e+10 perf-stat.i.branch-instructions
1.257e+08 ? 2% +13.2% 1.423e+08 ? 2% perf-stat.i.branch-misses
2.672e+09 ? 2% +12.2% 2.997e+09 perf-stat.i.cache-references
3236739 ? 2% +12.2% 3630735 perf-stat.i.context-switches
1.10 +2.9% 1.13 perf-stat.i.cpi
1.099e+11 ? 2% +16.3% 1.278e+11 perf-stat.i.cpu-cycles
216.09 ? 3% +9.1% 235.65 ? 2% perf-stat.i.cpu-migrations
2.893e+10 ? 2% +13.1% 3.271e+10 perf-stat.i.dTLB-loads
0.01 ? 7% -0.0 0.00 ? 38% perf-stat.i.dTLB-store-miss-rate%
1218982 ? 5% -66.4% 409240 ? 39% perf-stat.i.dTLB-store-misses
1.715e+10 ? 2% +13.1% 1.939e+10 perf-stat.i.dTLB-stores
1.002e+11 ? 2% +13.0% 1.132e+11 perf-stat.i.instructions
0.91 -2.7% 0.89 perf-stat.i.ipc
0.86 ? 2% +16.3% 1.00 perf-stat.i.metric.GHz
533.97 ? 2% +13.0% 603.48 perf-stat.i.metric.M/sec
4845 ? 2% +4.4% 5058 perf-stat.i.minor-faults
106011 ? 17% -43.2% 60257 ? 28% perf-stat.i.node-loads
56.37 ? 10% +12.1 68.44 ? 7% perf-stat.i.node-store-miss-rate%
1300772 ? 13% -31.9% 886088 ? 31% perf-stat.i.node-stores
4846 ? 2% +4.4% 5059 perf-stat.i.page-faults
26.68 -0.8% 26.47 perf-stat.overall.MPKI
1.10 +2.9% 1.13 perf-stat.overall.cpi
0.01 ? 6% -0.0 0.00 ? 39% perf-stat.overall.dTLB-store-miss-rate%
0.91 -2.8% 0.89 perf-stat.overall.ipc
1.953e+10 ? 2% +13.0% 2.207e+10 perf-stat.ps.branch-instructions
1.252e+08 ? 2% +13.2% 1.418e+08 ? 2% perf-stat.ps.branch-misses
2.662e+09 ? 2% +12.2% 2.986e+09 perf-stat.ps.cache-references
3224941 ? 2% +12.2% 3617930 perf-stat.ps.context-switches
1.095e+11 ? 2% +16.3% 1.273e+11 perf-stat.ps.cpu-cycles
215.47 ? 3% +9.1% 235.06 ? 2% perf-stat.ps.cpu-migrations
2.882e+10 ? 2% +13.1% 3.259e+10 perf-stat.ps.dTLB-loads
1214485 ? 5% -66.4% 407878 ? 39% perf-stat.ps.dTLB-store-misses
1.709e+10 ? 2% +13.1% 1.932e+10 perf-stat.ps.dTLB-stores
9.982e+10 ? 2% +13.0% 1.128e+11 perf-stat.ps.instructions
4823 ? 3% +4.4% 5034 perf-stat.ps.minor-faults
105655 ? 17% -43.2% 59979 ? 28% perf-stat.ps.node-loads
1296468 ? 13% -31.9% 882954 ? 31% perf-stat.ps.node-stores
4824 ? 3% +4.4% 5035 perf-stat.ps.page-faults
3.017e+13 ? 2% +13.1% 3.411e+13 perf-stat.total.instructions
22.12 ? 7% -3.1 19.06 ? 4% perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
18.84 ? 9% -3.0 15.83 ? 5% perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.secondary_startup_64_no_verify
21.94 ? 7% -3.0 18.95 ? 4% perf-profile.calltrace.cycles-pp.cpu_startup_entry.secondary_startup_64_no_verify
21.86 ? 7% -3.0 18.88 ? 4% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.secondary_startup_64_no_verify
17.27 ? 8% -2.8 14.47 ? 5% perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.secondary_startup_64_no_verify
17.04 ? 8% -2.7 14.32 ? 5% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
14.40 ? 5% -1.8 12.57 ? 3% perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
14.23 ? 5% -1.8 12.40 ? 3% perf-profile.calltrace.cycles-pp.mwait_idle_with_hints.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
1.80 ? 37% -0.8 1.02 ? 31% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
1.59 ? 37% -0.7 0.89 ? 32% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
0.55 ? 3% -0.3 0.26 ?100% perf-profile.calltrace.cycles-pp.sctp_chunkify._sctp_make_chunk.sctp_make_datafrag_empty.sctp_datamsg_from_user.sctp_sendmsg_to_asoc
0.66 ? 12% +0.2 0.90 ? 12% perf-profile.calltrace.cycles-pp.__free_pages_ok.skb_release_data.kfree_skb_reason.sctp_recvmsg.inet_recvmsg
4.04 ? 3% +0.3 4.38 ? 2% perf-profile.calltrace.cycles-pp.sctp_outq_sack.sctp_cmd_interpreter.sctp_do_sm.sctp_assoc_bh_rcv.sctp_backlog_rcv
3.98 ? 3% +0.3 4.32 ? 3% perf-profile.calltrace.cycles-pp.sctp_make_datafrag_empty.sctp_datamsg_from_user.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg
0.88 ? 7% +0.4 1.24 ? 6% perf-profile.calltrace.cycles-pp.free_unref_page.skb_release_data.consume_skb.sctp_chunk_put.sctp_outq_sack
0.78 ? 6% +0.4 1.13 ? 4% perf-profile.calltrace.cycles-pp.kmem_cache_free.sctp_recvmsg.inet_recvmsg.____sys_recvmsg.___sys_recvmsg
3.34 ? 3% +0.4 3.70 ? 3% perf-profile.calltrace.cycles-pp._sctp_make_chunk.sctp_make_datafrag_empty.sctp_datamsg_from_user.sctp_sendmsg_to_asoc.sctp_sendmsg
2.11 ? 3% +0.4 2.49 ? 3% perf-profile.calltrace.cycles-pp.consume_skb.sctp_chunk_put.sctp_outq_sack.sctp_cmd_interpreter.sctp_do_sm
2.71 ? 3% +0.4 3.09 ? 3% perf-profile.calltrace.cycles-pp.sctp_chunk_put.sctp_outq_sack.sctp_cmd_interpreter.sctp_do_sm.sctp_assoc_bh_rcv
1.36 ? 5% +0.4 1.75 ? 4% perf-profile.calltrace.cycles-pp.skb_release_data.consume_skb.sctp_chunk_put.sctp_outq_sack.sctp_cmd_interpreter
0.47 ? 45% +0.4 0.87 ? 4% perf-profile.calltrace.cycles-pp.__slab_free.kmem_cache_free.sctp_recvmsg.inet_recvmsg.____sys_recvmsg
1.54 ? 6% +0.4 1.94 ? 6% perf-profile.calltrace.cycles-pp.kmalloc_reserve.__alloc_skb.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter
1.52 ? 6% +0.4 1.92 ? 7% perf-profile.calltrace.cycles-pp.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb.sctp_packet_transmit.sctp_outq_flush
1.88 ? 5% +0.4 2.28 ? 5% perf-profile.calltrace.cycles-pp.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb._sctp_make_chunk.sctp_make_datafrag_empty
1.50 ? 6% +0.4 1.90 ? 7% perf-profile.calltrace.cycles-pp.kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb.sctp_packet_transmit
1.82 ? 5% +0.4 2.22 ? 5% perf-profile.calltrace.cycles-pp.kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb._sctp_make_chunk
1.92 ? 5% +0.4 2.33 ? 5% perf-profile.calltrace.cycles-pp.kmalloc_reserve.__alloc_skb._sctp_make_chunk.sctp_make_datafrag_empty.sctp_datamsg_from_user
2.66 ? 3% +0.4 3.06 ? 4% perf-profile.calltrace.cycles-pp.__alloc_skb._sctp_make_chunk.sctp_make_datafrag_empty.sctp_datamsg_from_user.sctp_sendmsg_to_asoc
1.78 ? 6% +0.4 2.20 ? 6% perf-profile.calltrace.cycles-pp.__alloc_skb.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm
1.36 ? 9% +0.4 1.79 ? 9% perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.kmalloc_large_node.__kmalloc_node_track_caller
7.79 ? 2% +0.7 8.46 perf-profile.calltrace.cycles-pp.sctp_packet_pack.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm
7.41 ? 2% +0.7 8.10 perf-profile.calltrace.cycles-pp.memcpy_erms.sctp_packet_pack.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter
0.00 +0.7 0.74 ? 16% perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page.skb_release_data.kfree_skb_reason
0.00 +0.7 0.74 ? 10% perf-profile.calltrace.cycles-pp.free_unref_page_commit.free_unref_page.skb_release_data.consume_skb.sctp_chunk_put
9.46 +0.8 10.22 perf-profile.calltrace.cycles-pp.sctp_do_sm.sctp_primitive_SEND.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg
9.63 +0.8 10.40 perf-profile.calltrace.cycles-pp.sctp_cmd_interpreter.sctp_do_sm.sctp_primitive_SEND.sctp_sendmsg_to_asoc.sctp_sendmsg
13.53 ? 2% +0.8 14.30 ? 2% perf-profile.calltrace.cycles-pp.sctp_do_sm.sctp_assoc_bh_rcv.sctp_backlog_rcv.__release_sock.release_sock
13.82 ? 2% +0.8 14.60 ? 2% perf-profile.calltrace.cycles-pp.sctp_assoc_bh_rcv.sctp_backlog_rcv.__release_sock.release_sock.sctp_sendmsg
13.84 ? 2% +0.8 14.62 ? 2% perf-profile.calltrace.cycles-pp.sctp_cmd_interpreter.sctp_do_sm.sctp_assoc_bh_rcv.sctp_backlog_rcv.__release_sock
10.83 ? 2% +0.8 11.64 perf-profile.calltrace.cycles-pp.sctp_packet_transmit.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm.sctp_primitive_SEND
12.82 ? 2% +0.9 13.68 ? 2% perf-profile.calltrace.cycles-pp.sctp_outq_flush.sctp_cmd_interpreter.sctp_do_sm.sctp_primitive_SEND.sctp_sendmsg_to_asoc
14.77 +1.0 15.74 perf-profile.calltrace.cycles-pp.sctp_primitive_SEND.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg.____sys_sendmsg
0.00 +1.0 0.96 ? 14% perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page.skb_release_data.kfree_skb_reason.sctp_recvmsg
2.05 ? 7% +1.0 3.02 ? 9% perf-profile.calltrace.cycles-pp.kfree_skb_reason.sctp_recvmsg.inet_recvmsg.____sys_recvmsg.___sys_recvmsg
1.40 ? 8% +1.0 2.44 ? 10% perf-profile.calltrace.cycles-pp.skb_release_data.kfree_skb_reason.sctp_recvmsg.inet_recvmsg.____sys_recvmsg
0.00 +1.1 1.10 ? 12% perf-profile.calltrace.cycles-pp.free_unref_page.skb_release_data.kfree_skb_reason.sctp_recvmsg.inet_recvmsg
25.15 +1.2 26.32 perf-profile.calltrace.cycles-pp.sctp_sendmsg_to_asoc.sctp_sendmsg.sock_sendmsg.____sys_sendmsg.___sys_sendmsg
2.08 ? 6% +1.2 3.30 ? 6% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve
2.48 ? 6% +1.3 3.75 ? 6% perf-profile.calltrace.cycles-pp.__alloc_pages.kmalloc_large_node.__kmalloc_node_track_caller.kmalloc_reserve.__alloc_skb
49.05 ? 2% +1.7 50.72 perf-profile.calltrace.cycles-pp.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendmsg.main
49.47 ? 2% +1.7 51.21 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.sendmsg.main.__libc_start_main
49.17 ? 2% +1.7 50.91 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendmsg.main.__libc_start_main
51.24 ? 2% +1.8 52.98 perf-profile.calltrace.cycles-pp.__libc_start_main
50.11 ? 2% +1.8 51.86 perf-profile.calltrace.cycles-pp.sendmsg.main.__libc_start_main
50.99 ? 2% +1.8 52.74 perf-profile.calltrace.cycles-pp.main.__libc_start_main
45.54 +2.0 47.52 perf-profile.calltrace.cycles-pp.sock_sendmsg.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg.do_syscall_64
45.10 +2.0 47.08 perf-profile.calltrace.cycles-pp.sctp_sendmsg.sock_sendmsg.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg
47.29 +2.0 49.27 perf-profile.calltrace.cycles-pp.____sys_sendmsg.___sys_sendmsg.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.20 +2.0 51.18 perf-profile.calltrace.cycles-pp.___sys_sendmsg.__sys_sendmsg.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendmsg
19.03 ? 9% -3.1 15.95 ? 5% perf-profile.children.cycles-pp.cpuidle_idle_call
22.08 ? 7% -3.1 19.03 ? 4% perf-profile.children.cycles-pp.do_idle
22.12 ? 7% -3.1 19.06 ? 4% perf-profile.children.cycles-pp.secondary_startup_64_no_verify
22.12 ? 7% -3.1 19.06 ? 4% perf-profile.children.cycles-pp.cpu_startup_entry
17.42 ? 8% -2.9 14.56 ? 5% perf-profile.children.cycles-pp.cpuidle_enter
17.37 ? 8% -2.9 14.52 ? 5% perf-profile.children.cycles-pp.cpuidle_enter_state
14.52 ? 5% -1.9 12.64 ? 3% perf-profile.children.cycles-pp.intel_idle
14.43 ? 5% -1.9 12.55 ? 3% perf-profile.children.cycles-pp.mwait_idle_with_hints
2.25 ? 27% -0.8 1.46 ? 21% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
1.88 ? 29% -0.7 1.18 ? 23% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
1.19 ? 33% -0.5 0.73 ? 25% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
1.16 ? 32% -0.4 0.72 ? 24% perf-profile.children.cycles-pp.hrtimer_interrupt
0.76 ? 45% -0.3 0.44 ? 37% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.52 ? 44% -0.2 0.31 ? 34% perf-profile.children.cycles-pp.tick_sched_timer
0.48 ? 47% -0.2 0.28 ? 36% perf-profile.children.cycles-pp.tick_sched_handle
0.44 ? 44% -0.2 0.27 ? 32% perf-profile.children.cycles-pp.update_process_times
0.32 ? 35% -0.1 0.20 ? 28% perf-profile.children.cycles-pp.__irq_exit_rcu
0.24 ? 35% -0.1 0.16 ? 24% perf-profile.children.cycles-pp.scheduler_tick
0.24 ? 12% -0.1 0.18 ? 19% perf-profile.children.cycles-pp.clockevents_program_event
0.15 ? 23% -0.1 0.10 ? 18% perf-profile.children.cycles-pp.rebalance_domains
0.80 ? 2% -0.0 0.76 perf-profile.children.cycles-pp.sctp_chunkify
0.10 ? 33% -0.0 0.06 ? 21% perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
0.40 ? 6% -0.0 0.36 ? 2% perf-profile.children.cycles-pp.native_sched_clock
0.47 ? 5% -0.0 0.43 ? 2% perf-profile.children.cycles-pp.sched_clock_cpu
0.12 ? 13% -0.0 0.08 ? 7% perf-profile.children.cycles-pp.native_irq_return_iret
0.55 -0.0 0.52 ? 2% perf-profile.children.cycles-pp.sctp_chunk_free
0.34 ? 6% -0.0 0.31 ? 2% perf-profile.children.cycles-pp.rcu_idle_exit
0.08 ? 11% -0.0 0.06 ? 11% perf-profile.children.cycles-pp.lapic_next_deadline
0.33 ? 2% +0.0 0.34 perf-profile.children.cycles-pp.loopback_xmit
0.36 +0.0 0.38 perf-profile.children.cycles-pp.xmit_one
0.12 ? 4% +0.0 0.14 ? 3% perf-profile.children.cycles-pp.__build_skb_around
0.89 ? 2% +0.0 0.93 perf-profile.children.cycles-pp.enqueue_task_fair
0.93 ? 2% +0.0 0.96 perf-profile.children.cycles-pp.ttwu_do_activate
0.44 ? 5% +0.0 0.48 ? 2% perf-profile.children.cycles-pp.__mod_node_page_state
0.22 ? 5% +0.1 0.28 ? 6% perf-profile.children.cycles-pp.rmqueue_bulk
0.44 ? 2% +0.1 0.53 perf-profile.children.cycles-pp.__list_add_valid
2.68 ? 2% +0.1 2.77 perf-profile.children.cycles-pp.try_to_wake_up
2.69 ? 2% +0.1 2.79 perf-profile.children.cycles-pp.autoremove_wake_function
0.29 ? 7% +0.1 0.40 ? 6% perf-profile.children.cycles-pp.__free_one_page
2.98 +0.1 3.09 perf-profile.children.cycles-pp.__wake_up_common
3.45 ? 2% +0.1 3.56 perf-profile.children.cycles-pp.sctp_data_ready
3.07 +0.1 3.18 perf-profile.children.cycles-pp.__wake_up_common_lock
3.64 ? 2% +0.1 3.76 perf-profile.children.cycles-pp.sctp_ulpq_tail_event
0.39 ? 3% +0.2 0.56 ? 3% perf-profile.children.cycles-pp.__zone_watermark_ok
0.67 ? 12% +0.2 0.91 ? 12% perf-profile.children.cycles-pp.__free_pages_ok
0.50 ? 11% +0.3 0.79 ? 8% perf-profile.children.cycles-pp.free_unref_page_commit
0.95 ? 5% +0.3 1.27 ? 4% perf-profile.children.cycles-pp.__slab_free
2.47 ? 3% +0.4 2.83 ? 2% perf-profile.children.cycles-pp.kmem_cache_free
4.13 ? 2% +0.4 4.50 ? 2% perf-profile.children.cycles-pp.sctp_outq_sack
4.06 ? 2% +0.4 4.43 ? 3% perf-profile.children.cycles-pp.sctp_make_datafrag_empty
3.67 ? 2% +0.4 4.06 ? 3% perf-profile.children.cycles-pp._sctp_make_chunk
4.37 ? 2% +0.4 4.76 ? 2% perf-profile.children.cycles-pp.sctp_chunk_put
2.80 ? 2% +0.4 3.20 ? 2% perf-profile.children.cycles-pp.consume_skb
1.22 ? 12% +0.5 1.71 ? 9% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
1.67 ? 8% +0.6 2.22 ? 8% perf-profile.children.cycles-pp.rmqueue
8.31 +0.7 9.02 perf-profile.children.cycles-pp.sctp_packet_pack
7.64 +0.7 8.38 perf-profile.children.cycles-pp.memcpy_erms
0.82 +0.8 1.60 ? 8% perf-profile.children.cycles-pp._raw_spin_lock
2.60 ? 5% +0.8 3.42 ? 6% perf-profile.children.cycles-pp.get_page_from_freelist
3.02 ? 5% +0.8 3.86 ? 6% perf-profile.children.cycles-pp.__alloc_pages
0.13 ? 15% +0.8 0.97 ? 14% perf-profile.children.cycles-pp.free_pcppages_bulk
3.56 ? 5% +0.8 4.40 ? 5% perf-profile.children.cycles-pp.__kmalloc_node_track_caller
3.39 ? 5% +0.8 4.24 ? 6% perf-profile.children.cycles-pp.kmalloc_large_node
3.62 ? 4% +0.9 4.48 ? 5% perf-profile.children.cycles-pp.kmalloc_reserve
15.00 +0.9 15.86 perf-profile.children.cycles-pp.sctp_primitive_SEND
4.83 ? 3% +0.9 5.70 ? 4% perf-profile.children.cycles-pp.__alloc_skb
2.08 ? 7% +1.0 3.04 ? 9% perf-profile.children.cycles-pp.kfree_skb_reason
0.56 ? 23% +1.0 1.56 ? 19% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
1.22 ? 6% +1.2 2.38 ? 8% perf-profile.children.cycles-pp.free_unref_page
27.46 +1.2 28.68 perf-profile.children.cycles-pp.sctp_outq_flush
25.20 +1.2 26.44 perf-profile.children.cycles-pp.sctp_sendmsg_to_asoc
24.22 +1.3 25.47 perf-profile.children.cycles-pp.sctp_packet_transmit
2.95 ? 6% +1.4 4.38 ? 7% perf-profile.children.cycles-pp.skb_release_data
32.36 +1.6 33.92 perf-profile.children.cycles-pp.sctp_do_sm
32.11 +1.6 33.68 perf-profile.children.cycles-pp.sctp_cmd_interpreter
51.46 ? 2% +1.7 53.15 perf-profile.children.cycles-pp.main
51.24 ? 2% +1.8 52.98 perf-profile.children.cycles-pp.__libc_start_main
45.47 +2.0 47.44 perf-profile.children.cycles-pp.sctp_sendmsg
45.56 +2.0 47.53 perf-profile.children.cycles-pp.sock_sendmsg
47.32 +2.0 49.30 perf-profile.children.cycles-pp.____sys_sendmsg
49.23 +2.0 51.22 perf-profile.children.cycles-pp.___sys_sendmsg
49.56 +2.0 51.57 perf-profile.children.cycles-pp.__sys_sendmsg
51.10 +2.0 53.14 perf-profile.children.cycles-pp.sendmsg
73.82 ? 2% +3.0 76.86 perf-profile.children.cycles-pp.do_syscall_64
74.28 ? 2% +3.0 77.32 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
14.36 ? 5% -1.9 12.48 ? 3% perf-profile.self.cycles-pp.mwait_idle_with_hints
0.48 ? 21% -0.1 0.34 ? 13% perf-profile.self.cycles-pp.cpuidle_enter_state
0.39 ? 5% -0.0 0.35 ? 2% perf-profile.self.cycles-pp.native_sched_clock
0.12 ? 13% -0.0 0.08 ? 7% perf-profile.self.cycles-pp.native_irq_return_iret
0.08 ? 11% -0.0 0.06 ? 9% perf-profile.self.cycles-pp.lapic_next_deadline
0.38 ? 3% -0.0 0.35 ? 2% perf-profile.self.cycles-pp.sctp_packet_pack
0.50 ? 2% -0.0 0.48 ? 2% perf-profile.self.cycles-pp.__might_sleep
0.23 -0.0 0.22 ? 2% perf-profile.self.cycles-pp.do_idle
0.07 +0.0 0.09 ? 12% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.20 ? 3% +0.0 0.22 ? 4% perf-profile.self.cycles-pp.enqueue_task_fair
0.02 ?141% +0.0 0.06 ? 9% perf-profile.self.cycles-pp.poll_idle
1.13 +0.0 1.17 perf-profile.self.cycles-pp.kmem_cache_free
0.43 ? 6% +0.0 0.47 ? 2% perf-profile.self.cycles-pp.__mod_node_page_state
0.37 ? 2% +0.1 0.45 ? 2% perf-profile.self.cycles-pp.__list_add_valid
0.84 ? 6% +0.1 0.92 ? 2% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.51 ? 2% +0.1 0.61 ? 4% perf-profile.self.cycles-pp.get_page_from_freelist
0.38 ? 2% +0.2 0.54 ? 4% perf-profile.self.cycles-pp.__zone_watermark_ok
0.76 +0.2 0.96 perf-profile.self.cycles-pp._raw_spin_lock
0.79 ? 8% +0.3 1.08 ? 7% perf-profile.self.cycles-pp.rmqueue
0.44 ? 12% +0.3 0.75 ? 9% perf-profile.self.cycles-pp.free_unref_page_commit
0.93 ? 5% +0.3 1.26 ? 4% perf-profile.self.cycles-pp.__slab_free
7.61 +0.7 8.35 perf-profile.self.cycles-pp.memcpy_erms
0.56 ? 23% +1.0 1.55 ? 19% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
(No filename) (46.21 kB)
config-5.17.0-00103-gf26b3fa04611 (163.87 kB)
job-script (8.28 kB)
job.yaml (5.70 kB)
reproduce (341.00 B)

2022-05-01 19:45:15

by Aaron Lu

Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

Hi Mel,

On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
>
> (please be noted we reported
> "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
> on
> https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
> while the commit is on branch.
> now we still observe similar regression when it's on mainline, and we also
> observe a 13.2% improvement on another netperf subtest.
> so report again for information)
>
> Greeting,
>
> FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
>
>
> commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>

So what this commit does is: if a CPU has mostly been doing frees (pcp->free_factor > 0)
and the high-order page being freed has an order <= PAGE_ALLOC_COSTLY_ORDER,
then do not use the PCP list but free the page directly to buddy.
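
In code terms, the decision looks roughly like this (a simplified sketch of
the free path with details elided; not the exact mainline source):

	free_high = pcp->free_factor &&			/* CPU has mostly been freeing */
		    order &&				/* page is high-order ...      */
		    order <= PAGE_ALLOC_COSTLY_ORDER;	/* ... but a cheap one         */

	high = nr_pcp_high(pcp, zone, free_high);	/* returns 0 when free_high    */
	if (pcp->count >= high) {
		/* with high == 0, the whole PCP list is drained to buddy */
		free_pcppages_bulk(...);
	}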

The rationale as explained in the commit's changelog is:
"
Netperf running on localhost exhibits this pattern and while it does not
matter for some machines, it does matter for others with smaller caches
where cache misses cause problems due to reduced page reuse. Pages
freed directly to the buddy list may be reused quickly while still cache
hot where as storing on the PCP lists may be cold by the time
free_pcppages_bulk() is called.
"

This regression occurred on a machine that has large caches, so this
optimization brings it no value, only overhead (the PCP list is skipped); I
guess this is the reason why there is a regression.

I have also tested this case on a small machine: a Skylake desktop, and
this commit shows an improvement:
8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%

So this means those directly freed pages get reused on the allocation side,
and that brings a performance improvement for machines with smaller caches.

I wonder if we should still use the PCP list a little bit under the condition
described above, in order to:
1) reduce overhead in the free path for machines with large caches;
2) still keep the benefit of page reuse for machines with smaller caches.

For this reason, I tested changing nr_pcp_high() from returning 0 to
returning either pcp->batch or (pcp->batch << 2):
machine\nr_pcp_high() ret: pcp->high 0 pcp->batch (pcp->batch << 2)
skylake desktop: 72288 90784 92219 91528
icelake 2sockets: 120956 99177 98251 116108

Note: nr_pcp_high() returning pcp->high is the behaviour of this commit's
parent; returning 0 is the behaviour of this commit.

The result shows that if we effectively use a PCP high of (pcp->batch << 2)
for the described condition, then this workload's performance on the
small machine is preserved while the regression on the large machine is
greatly reduced (from -18% to -4%).
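
For a sense of scale (assuming the common default of pcp->batch = 63 for a
large zone; the actual value depends on zone size):

	pcp->batch << 2 = 63 * 4 = 252 pages ~= 1MB of order-0 pages per CPU

so instead of draining the PCP list completely, roughly a megabyte of
recently freed pages per CPU stays around for quick reuse.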

> in testcase: netperf
> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> with following parameters:
>
> ip: ipv4
> runtime: 300s
> nr_threads: 1
> cluster: cs-localhost
> test: UDP_STREAM
> cpufreq_governor: performance
> ucode: 0xd000331
>
> test-description: Netperf is a benchmark that can be use to measure various aspect of networking performance.
> test-url: http://www.netperf.org/netperf/
>
> In addition to that, the commit also has significant impact on the following tests:
>

> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | netperf: netperf.Throughput_Mbps 13.2% improvement |
> | test machine | 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory |
> | test parameters | cluster=cs-localhost |
> | | cpufreq_governor=performance |
> | | ip=ipv4 |
> | | nr_threads=25% |
> | | runtime=300s |
> | | send_size=10K |
> | | test=SCTP_STREAM_MANY |
> | | ucode=0xd000331 |
> +------------------+-------------------------------------------------------------------------------------+
>

And when nr_pcp_high() returns (pcp->batch << 2), the improvement will
drop from 13.2% to 5.7%, not great but still an improvement...

The said change looks like this:
(relevant comment will have to be adjusted)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 505d59f7d4fa..130a02af8321 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3332,18 +3332,19 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
bool free_high)
{
int high = READ_ONCE(pcp->high);
+ int batch = READ_ONCE(pcp->batch);

- if (unlikely(!high || free_high))
+ if (unlikely(!high))
return 0;

- if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
- return high;
-
/*
* If reclaim is active, limit the number of pages that can be
* stored on pcp lists
*/
- return min(READ_ONCE(pcp->batch) << 2, high);
+ if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags) || free_high)
+ return min(batch << 2, high);
+
+ return high;
}

static void free_unref_page_commit(struct page *page, int migratetype,

Does this look sane? If so, I can prepare a formal patch with proper
comment and changelog, thanks.

2022-05-02 23:19:30

by Mel Gorman

Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

On Fri, Apr 29, 2022 at 07:29:19PM +0800, Aaron Lu wrote:
> Hi Mel,
>
> On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> >
> > (please be noted we reported
> > "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
> > on
> > https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
> > while the commit is on branch.
> > now we still observe similar regression when it's on mainline, and we also
> > observe a 13.2% improvement on another netperf subtest.
> > so report again for information)
> >
> > Greeting,
> >
> > FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
> >
> >
> > commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
>
> So what this commit did is: if a CPU is always doing free(pcp->free_factor > 0)
> and if the being freed high-order page's order is <= PAGE_ALLOC_COSTLY_ORDER,
> then do not use PCP but directly free the page directly to buddy.
>

Yes.

> This regression occurred on a machine that has large caches so this
> optimization brings no value to it but only overhead(skipped PCP), I
> guess this is the reason why there is a regression.
>
> I have also tested this case on a small machine: a skylake desktop and
> this commit shows improvement:
> 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
>
> So this means those directly freed pages get reused by allocator side
> and that brings performance improvement for machines with smaller cache.
>
> I wonder if we should still use PCP a little bit under the above said
> condition, for the purpose of:
> 1 reduced overhead in the free path for machines with large cache;
> 2 still keeps the benefit of reused pages for machines with smaller cache.
>

Ideally yes although the exact timing is going to depend on the cache
size so even if it's right for one machine, it's not necessarily right
for another.

Going through the buddy, pages get reused quickly and remain cache
hot. Going through PCP contends less on zone->lock, but pages get reused
too late in microbenchmarks dealing with small amounts of data. As the
threshold couldn't be predicted, I went with "free to buddy" immediately.

> > in testcase: netperf
> > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> > with following parameters:
> >
> > ip: ipv4
> > runtime: 300s
> > nr_threads: 1
> > cluster: cs-localhost
> > test: UDP_STREAM
> > cpufreq_governor: performance
> > ucode: 0xd000331
> >
> > test-description: Netperf is a benchmark that can be use to measure various aspect of networking performance.
> > test-url: http://www.netperf.org/netperf/
> >
> > In addition to that, the commit also has significant impact on the following tests:
> >
>
> > +------------------+-------------------------------------------------------------------------------------+
> > | testcase: change | netperf: netperf.Throughput_Mbps 13.2% improvement |
> > | test machine | 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory |
> > | test parameters | cluster=cs-localhost |
> > | | cpufreq_governor=performance |
> > | | ip=ipv4 |
> > | | nr_threads=25% |
> > | | runtime=300s |
> > | | send_size=10K |
> > | | test=SCTP_STREAM_MANY |
> > | | ucode=0xd000331 |
> > +------------------+-------------------------------------------------------------------------------------+
> >
>
> And when nr_pcp_high() returns (pcp->batch << 2), the improvement will
> drop from 13.2% to 5.7%, not great but still an improvement...
>
> The said change looks like this:
> (relevant comment will have to be adjusted)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 505d59f7d4fa..130a02af8321 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3332,18 +3332,19 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
> bool free_high)
> {
> int high = READ_ONCE(pcp->high);
> + int batch = READ_ONCE(pcp->batch);
>
> - if (unlikely(!high || free_high))
> + if (unlikely(!high))
> return 0;
>
> - if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
> - return high;
> -
> /*
> * If reclaim is active, limit the number of pages that can be
> * stored on pcp lists
> */
> - return min(READ_ONCE(pcp->batch) << 2, high);
> + if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags) || free_high)
> + return min(batch << 2, high);
> +
> + return high;
> }
>
> static void free_unref_page_commit(struct page *page, int migratetype,
>
> Does this look sane? If so, I can prepare a formal patch with proper
> comment and changelog, thanks.

I think it looks reasonably sane. The corner case is that if
((high - (batch >> 2)) > cachesize), the pages will not get recycled
quickly enough. On the plus side, always freeing to buddy may contend on the
zone lock again, and freeing in batches reduces that risk.

Given that zone lock contention is reduced regardless of cache size, it
seems like a reasonable tradeoff.

--
Mel Gorman
SUSE Labs

2022-05-07 09:30:57

by Aaron Lu

Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

On Fri, May 06, 2022 at 04:40:45PM +0800, [email protected] wrote:
> On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
> > Hi Mel,
> >
> > On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> > >
> > > (please be noted we reported
> > > "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
> > > on
> > > https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
> > > while the commit is on branch.
> > > now we still observe similar regression when it's on mainline, and we also
> > > observe a 13.2% improvement on another netperf subtest.
> > > so report again for information)
> > >
> > > Greeting,
> > >
> > > FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
> > >
> > >
> > > commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> >
> > So what this commit did is: if a CPU is always doing free(pcp->free_factor > 0)
>
> IMHO, this means the consumer and producer are running on different
> CPUs.
>

Right.

> > and if the being freed high-order page's order is <= PAGE_ALLOC_COSTLY_ORDER,
> > then do not use PCP but directly free the page directly to buddy.
> >
> > The rationale as explained in the commit's changelog is:
> > "
> > Netperf running on localhost exhibits this pattern and while it does not
> > matter for some machines, it does matter for others with smaller caches
> > where cache misses cause problems due to reduced page reuse. Pages
> > freed directly to the buddy list may be reused quickly while still cache
> > hot where as storing on the PCP lists may be cold by the time
> > free_pcppages_bulk() is called.
> > "
> >
> > This regression occurred on a machine that has large caches so this
> > optimization brings no value to it but only overhead(skipped PCP), I
> > guess this is the reason why there is a regression.
>
> Per my understanding, not only the cache size is larger, but also the L2
> cache (1MB) is per-core on this machine. So if the consumer and
> producer are running on different cores, the cache-hot page may cause
> more core-to-core cache transfer. This may hurt performance too.
>

The client side allocates the skb (page) and the server side recvfrom()s it.
recvfrom() copies the page data into the server's own buffer and then releases
the page associated with the skb. The client does all the allocation and the
server does all the freeing; page reuse happens on the client side.
So I think core-to-core cache transfer due to page reuse can occur when the
client task migrates.
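
To illustrate (a minimal userspace sketch of the localhost UDP_STREAM
pattern, not netperf's actual code):

	#include <sys/socket.h>

	static void udp_stream_once(int snd_fd, int rcv_fd,
				    const struct sockaddr *dst, socklen_t dst_len)
	{
		char out[16 * 1024] = { 0 };
		char in[16 * 1024];

		/* producer (client): the kernel allocates the skb pages here */
		sendto(snd_fd, out, sizeof(out), 0, dst, dst_len);

		/* consumer (server), typically on another CPU: copies the data
		 * into its own buffer, after which the kernel frees the skb pages */
		recvfrom(rcv_fd, in, sizeof(in), 0, NULL, NULL);
	}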

I have modified the job so that the client and server are each bound to a
specific CPU on different cores of the same node, and testing it on the
same Icelake 2-socket server, the result is

kernel throughput
8b10b465d0e1 125168
f26b3fa04611 102039 -18%

It's also an 18% drop. I think this means core-to-core transfer is not a factor?
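
(For reference, roughly equivalent pinning can also be done outside the lkp
job, e.g. with hypothetical CPU numbers:

	taskset -c 4 netserver
	taskset -c 8 netperf -H 127.0.0.1 -t UDP_STREAM -l 300
)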

> > I have also tested this case on a small machine: a skylake desktop and
> > this commit shows improvement:
> > 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> > f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
> >
> > So this means those directly freed pages get reused by allocator side
> > and that brings performance improvement for machines with smaller cache.
>
> Per my understanding, the L2 cache on this desktop machine is shared
> among cores.
>

The CPU in question is an i7-6700, and according to this Wikipedia page,
L2 is per-core:
https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Mainstream_desktop_processors

> > I wonder if we should still use PCP a little bit under the above said
> > condition, for the purpose of:
> > 1 reduced overhead in the free path for machines with large cache;
> > 2 still keeps the benefit of reused pages for machines with smaller cache.
> >
> > For this reason, I tested increasing nr_pcp_high() from returning 0 to
> > either returning pcp->batch or (pcp->batch << 2):
> > machine\nr_pcp_high() ret: pcp->high 0 pcp->batch (pcp->batch << 2)
> > skylake desktop: 72288 90784 92219 91528
> > icelake 2sockets: 120956 99177 98251 116108
> >
> > note nr_pcp_high() returns pcp->high is the behaviour of this commit's
> > parent, returns 0 is the behaviour of this commit.
> >
> > The result shows, if we effectively use a PCP high as (pcp->batch << 2)
> > for the described condition, then this workload's performance on
> > small machine can remain while the regression on large machines can be
> > greately reduced(from -18% to -4%).
> >
>
> Can we use cache size and topology information directly?

It can be complicated by the fact that the system can have multiple
producers (CPUs that are doing the freeing) running at the same time, and
getting the perfect number can be a difficult job.

2022-05-07 14:27:49

by Mel Gorman

Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

On Thu, May 05, 2022 at 04:27:04PM +0800, Aaron Lu wrote:
> On Fri, Apr 29, 2022 at 02:39:18PM +0100, Mel Gorman wrote:
> > On Fri, Apr 29, 2022 at 07:29:19PM +0800, Aaron Lu wrote:
>
> ... ...
>
> > > The said change looks like this:
> > > (relevant comment will have to be adjusted)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 505d59f7d4fa..130a02af8321 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3332,18 +3332,19 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
> > > bool free_high)
> > > {
> > > int high = READ_ONCE(pcp->high);
> > > + int batch = READ_ONCE(pcp->batch);
> > >
> > > - if (unlikely(!high || free_high))
> > > + if (unlikely(!high))
> > > return 0;
> > >
> > > - if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
> > > - return high;
> > > -
> > > /*
> > > * If reclaim is active, limit the number of pages that can be
> > > * stored on pcp lists
> > > */
> > > - return min(READ_ONCE(pcp->batch) << 2, high);
> > > + if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags) || free_high)
> > > + return min(batch << 2, high);
> > > +
> > > + return high;
> > > }
> > >
> > > static void free_unref_page_commit(struct page *page, int migratetype,
> > >
> > > Does this look sane? If so, I can prepare a formal patch with proper
> > > comment and changelog, thanks.
> >
> > I think it looks reasonable sane. The corner case is that if
> > ((high - (batch >> 2)) > cachesize) that the pages will not get recycled
>
> When free_high is true, the above diff changed the return value of
> nr_pcp_high() from 0 to min(batch << 2, pcp->high) so the corner case is
> when (min(batch << 2, pcp->high) > cachesize)?
>

Yes. It's not perfect due to cache aliasing, so the actual point where it
matters will be variable. Whatever the value is, there is a value where the
corner case applies: pages do not get recycled quickly enough and
are no longer cache-hot.

--
Mel Gorman
SUSE Labs

2022-05-09 02:12:45

by Huang, Ying

Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
> Hi Mel,
>
> On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> >
> > (please be noted we reported
> > "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
> > on
> > https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
> > while the commit is on branch.
> > now we still observe similar regression when it's on mainline, and we also
> > observe a 13.2% improvement on another netperf subtest.
> > so report again for information)
> >
> > Greeting,
> >
> > FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
> >
> >
> > commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
>
> So what this commit did is: if a CPU is always doing free(pcp->free_factor > 0)

IMHO, this means the consumer and producer are running on different
CPUs.

> and if the being freed high-order page's order is <= PAGE_ALLOC_COSTLY_ORDER,
> then do not use PCP but directly free the page directly to buddy.
>
> The rationale as explained in the commit's changelog is:
> "
> Netperf running on localhost exhibits this pattern and while it does not
> matter for some machines, it does matter for others with smaller caches
> where cache misses cause problems due to reduced page reuse. Pages
> freed directly to the buddy list may be reused quickly while still cache
> hot where as storing on the PCP lists may be cold by the time
> free_pcppages_bulk() is called.
> "
>
> This regression occurred on a machine that has large caches, so this
> optimization brings it no value, only overhead (the PCP is skipped); I
> guess this is the reason why there is a regression.
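
For reference, the behaviour described above corresponds roughly to the
following paraphrased sketch of the free path (not the literal mainline
code; pcp_should_flush() is a made-up helper, used here only to illustrate
the decision):

	/*
	 * Sketch only: "heavy freeing" detection and its effect.  With the
	 * bisected commit, nr_pcp_high() returns 0 whenever free_high is
	 * true, so the comparison below always triggers and the freed pages
	 * bypass the PCP list and go straight to the buddy allocator.
	 */
	static bool pcp_should_flush(struct per_cpu_pages *pcp,
				     struct zone *zone, unsigned int order)
	{
		bool free_high = pcp->free_factor && order &&
				 order <= PAGE_ALLOC_COSTLY_ORDER;

		return pcp->count >= nr_pcp_high(pcp, zone, free_high);
	}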

Per my understanding, not only is the cache size larger, but the L2 cache
(1MB) is also per-core on this machine. So if the consumer and producer are
running on different cores, the cache-hot page may cause more core-to-core
cache transfer. This may hurt performance too.

> I have also tested this case on a small machine: a skylake desktop, and
> this commit shows an improvement:
> 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
>
> So this means those directly freed pages get reused by the allocator side
> and that brings a performance improvement for machines with a smaller cache.

Per my understanding, the L2 cache on this desktop machine is shared
among cores.

> I wonder if we should still use the PCP a little bit under the condition
> described above, for the purpose of:
> 1) reducing overhead in the free path for machines with large caches;
> 2) still keeping the benefit of reused pages for machines with smaller caches.
>
> For this reason, I tested changing nr_pcp_high() from returning 0 to
> returning either pcp->batch or (pcp->batch << 2):
>
> machine \ nr_pcp_high() returns:  pcp->high      0   pcp->batch   (pcp->batch << 2)
> skylake desktop:                     72288  90784        92219              91528
> icelake 2sockets:                   120956  99177        98251             116108
>
> note: returning pcp->high is the behaviour of this commit's parent, and
> returning 0 is the behaviour of this commit.
>
> The result shows that if we effectively use a PCP high of (pcp->batch << 2)
> for the described condition, then this workload's performance on the small
> machine is retained while the regression on the large machine is greatly
> reduced (from -18% to -4%).
>

Can we use cache size and topology information directly?
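
If we went that way, one possible source of that information is the generic
cacheinfo API; the sketch below is only an illustration of reading the
reported cache sizes (llc_size_bytes() is a made-up helper, and whether this
is available early enough, or a good fit for the page allocator at all, is
an open question):

	#include <linux/cacheinfo.h>

	/* Hypothetical helper: the largest cache size reported for a CPU. */
	static unsigned int llc_size_bytes(unsigned int cpu)
	{
		struct cpu_cacheinfo *cci = get_cpu_cacheinfo(cpu);
		unsigned int i, size = 0;

		for (i = 0; i < cci->num_leaves; i++)
			if (cci->info_list[i].size > size)
				size = cci->info_list[i].size;

		return size;
	}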

>
Best Regards,
Huang, Ying

[snip]



2022-05-09 02:27:22

by Aaron Lu

[permalink] [raw]
Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

On Fri, Apr 29, 2022 at 02:39:18PM +0100, Mel Gorman wrote:
> On Fri, Apr 29, 2022 at 07:29:19PM +0800, Aaron Lu wrote:

... ...

> > The said change looks like this:
> > (relevant comment will have to be adjusted)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 505d59f7d4fa..130a02af8321 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3332,18 +3332,19 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
> > bool free_high)
> > {
> > int high = READ_ONCE(pcp->high);
> > + int batch = READ_ONCE(pcp->batch);
> >
> > - if (unlikely(!high || free_high))
> > + if (unlikely(!high))
> > return 0;
> >
> > - if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
> > - return high;
> > -
> > /*
> > * If reclaim is active, limit the number of pages that can be
> > * stored on pcp lists
> > */
> > - return min(READ_ONCE(pcp->batch) << 2, high);
> > + if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags) || free_high)
> > + return min(batch << 2, high);
> > +
> > + return high;
> > }
> >
> > static void free_unref_page_commit(struct page *page, int migratetype,
> >
> > Does this look sane? If so, I can prepare a formal patch with proper
> > comment and changelog, thanks.
>
> I think it looks reasonably sane. The corner case is that if
> ((high - (batch >> 2)) > cachesize) that the pages will not get recycled

When free_high is true, the above diff changed the return value of
nr_pcp_high() from 0 to min(batch << 2, pcp->high) so the corner case is
when (min(batch << 2, pcp->high) > cachesize)?

> quickly enough. On the plus side always freeing to buddy may contend on the
> zone lock again and freeing in batches reduces that risk.
>
> Given that zone lock contention is reduced regardless of cache size, it
> seems like a reasonable tradeoff.

Glad to hear this.

2022-05-09 02:58:14

by Aaron Lu

[permalink] [raw]
Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

On Thu, May 05, 2022 at 12:09:47PM +0100, Mel Gorman wrote:
> On Thu, May 05, 2022 at 04:27:04PM +0800, Aaron Lu wrote:
> > On Fri, Apr 29, 2022 at 02:39:18PM +0100, Mel Gorman wrote:
> > > On Fri, Apr 29, 2022 at 07:29:19PM +0800, Aaron Lu wrote:
> >
> > ... ...
> >
> > > > The said change looks like this:
> > > > (relevant comment will have to be adjusted)
> > > >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 505d59f7d4fa..130a02af8321 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -3332,18 +3332,19 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
> > > > bool free_high)
> > > > {
> > > > int high = READ_ONCE(pcp->high);
> > > > + int batch = READ_ONCE(pcp->batch);
> > > >
> > > > - if (unlikely(!high || free_high))
> > > > + if (unlikely(!high))
> > > > return 0;
> > > >
> > > > - if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
> > > > - return high;
> > > > -
> > > > /*
> > > > * If reclaim is active, limit the number of pages that can be
> > > > * stored on pcp lists
> > > > */
> > > > - return min(READ_ONCE(pcp->batch) << 2, high);
> > > > + if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags) || free_high)
> > > > + return min(batch << 2, high);
> > > > +
> > > > + return high;
> > > > }
> > > >
> > > > static void free_unref_page_commit(struct page *page, int migratetype,
> > > >
> > > > Does this look sane? If so, I can prepare a formal patch with proper
> > > > comment and changelog, thanks.
> > >
> > > I think it looks reasonably sane. The corner case is that if
> > > ((high - (batch >> 2)) > cachesize) that the pages will not get recycled
> >
> > When free_high is true, the above diff changed the return value of
> > nr_pcp_high() from 0 to min(batch << 2, pcp->high) so the corner case is
> > when (min(batch << 2, pcp->high) > cachesize)?
> >
>
> Yes. It's not perfect due to cache aliasing, so the actual point where it
> matters will be variable. Whatever the value is, there is a value at which
> the corner case applies: pages do not get recycled quickly enough and are
> no longer cache-hot.

Understood. And as you said, it's a tradeoff.

2022-05-09 03:13:32

by Huang, Ying

[permalink] [raw]
Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

On Sat, 2022-05-07 at 11:27 +0800, Aaron Lu wrote:
> On Sat, May 07, 2022 at 08:54:35AM +0800, [email protected] wrote:
> > On Fri, 2022-05-06 at 20:17 +0800, Aaron Lu wrote:
> > > On Fri, May 06, 2022 at 04:40:45PM +0800, [email protected] wrote:
> > > > On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
> > > > > Hi Mel,
> > > > >
> > > > > On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> > > > > >
> > > > > > [snip]
> > > > > >
> > > > >
> > > > > So what this commit did is: if a CPU is always doing frees (pcp->free_factor > 0)
> > > >
> > > > IMHO, this means the consumer and producer are running on different
> > > > CPUs.
> > > >
> > >
> > > Right.
> > >
> > > > > and if the order of the high-order page being freed is <= PAGE_ALLOC_COSTLY_ORDER,
> > > > > then do not use the PCP but free the page directly to the buddy allocator.
> > > > >
> > > > > The rationale as explained in the commit's changelog is:
> > > > > "
> > > > > Netperf running on localhost exhibits this pattern and while it does not
> > > > > matter for some machines, it does matter for others with smaller caches
> > > > > where cache misses cause problems due to reduced page reuse. Pages
> > > > > freed directly to the buddy list may be reused quickly while still cache
> > > > > hot where as storing on the PCP lists may be cold by the time
> > > > > free_pcppages_bulk() is called.
> > > > > "
> > > > >
> > > > > This regression occurred on a machine that has large caches, so this
> > > > > optimization brings it no value, only overhead (the PCP is skipped); I
> > > > > guess this is the reason why there is a regression.
> > > >
> > > > Per my understanding, not only is the cache size larger, but the L2
> > > > cache (1MB) is also per-core on this machine. So if the consumer and
> > > > producer are running on different cores, the cache-hot page may cause
> > > > more core-to-core cache transfer. This may hurt performance too.
> > > >
> > >
> > > Client side allocates skb(page) and server side recvfrom() it.
> > > recvfrom() copies the page data to server's own buffer and then releases
> > > the page associated with the skb. Client does all the allocation and
> > > server does all the free, page reuse happens at client side.
> > > So I think core-2-core cache transfer due to page reuse can occur when
> > > client task migrates.
> >
> > The core-to-core cache transfer can be cross-socket or cross-L2 within
> > one socket. I mean the latter one.
> >
> > > I have modified the job to have the client and server bound to a
> > > specific CPU of different cores on the same node, and testing it on the
> > > same Icelake 2 sockets server, the result is
> > >
> > >   kernel throughput
> > > 8b10b465d0e1 125168
> > > f26b3fa04611 102039 -18%
> > >
> > > It's also an 18% drop. I think this means c2c is not a factor?
> >
> > Can you test with client and server bound to 2 hardware threads
> > (hyperthread) of one core? The two hardware threads of one core will
> > share the L2 cache.
> >
>
> 8b10b465d0e1: 89702
> f26b3fa04611: 95823 +6.8%
>
> When binding the client and server to the 2 threads of the same core, the
> bisected commit is now an improvement on this 2-socket Icelake server.

Good. I guess cache-hot works now.

> > > > > I have also tested this case on a small machine: a skylake desktop and
> > > > > this commit shows improvement:
> > > > > 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> > > > > f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
> > > > >
> > > > > So this means those directly freed pages get reused by allocator side
> > > > > and that brings performance improvement for machines with smaller cache.
> > > >
> > > > Per my understanding, the L2 cache on this desktop machine is shared
> > > > among cores.
> > > >
> > >
> > > The said CPU is i7-6700 and according to this wikipedia page,
> > > L2 is per core:
> > > https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Mainstream_desktop_processors
> >
> > Sorry, my memory was wrong. Skylake and later servers have a much
> > larger private L2 cache (1MB vs 256KB on client parts), which may
> > increase the possibility of core-to-core transfers.
> >
>
> I'm trying to understand where the core-2-core cache transfer is.
>
> When the server needs to do the copy in recvfrom(), there is a core-2-core
> cache transfer from the client CPU to the server CPU. But this is the same
> whether the page gets reused or not, i.e. the bisected commit and its parent
> do not differ in this step.

Yes.

> Then when the page gets reused on
> the client side, there is no core-2-core cache transfer, as the server
> side didn't write to the page's data.

The "reused" pages were read by the server side, so their cache lines
are in "shared" state, some inter-core traffic is needed to shoot down
these cache lines before the client side writes them. This will incur
some overhead.
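
As a rough, illustrative sense of scale: with 64-byte cache lines, a 4KB
page spans 64 lines, so if the server has read the whole page, reusing it on
the client side can require up to 64 invalidation (request-for-ownership)
transactions before the client's writes complete.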

Best Regards,
Huang, Ying

> So page reuse or not, it
> shouldn't cause any difference regarding core-2-core cache transfer.
> Is this correct?
>
> > > > > I wonder if we should still use PCP a little bit under the above said
> > > > > condition, for the purpose of:
> > > > > 1 reduced overhead in the free path for machines with large cache;
> > > > > 2 still keeps the benefit of reused pages for machines with smaller cache.
> > > > >
> > > > > For this reason, I tested increasing nr_pcp_high() from returning 0 to
> > > > > either returning pcp->batch or (pcp->batch << 2):
> > > > > machine\nr_pcp_high() ret: pcp->high 0 pcp->batch (pcp->batch << 2)
> > > > > skylake desktop: 72288 90784 92219 91528
> > > > > icelake 2sockets: 120956 99177 98251 116108
> > > > >
> > > > > note nr_pcp_high() returns pcp->high is the behaviour of this commit's
> > > > > parent, returns 0 is the behaviour of this commit.
> > > > >
> > > > > The result shows, if we effectively use a PCP high as (pcp->batch << 2)
> > > > > for the described condition, then this workload's performance on
> > > > > small machine can remain while the regression on large machines can be
> > > > > greatly reduced (from -18% to -4%).
> > > > >
> > > >
> > > > Can we use cache size and topology information directly?
> > >
> > > It can be complicated by the fact that the system can have multiple
> > > producers (CPUs that are doing frees) running at the same time, and
> > > getting the perfect number can be a difficult job.
> >
> > We can discuss this after verifying whether it's related to core-to-core
> > transfers.
> >
> > Best Regards,
> > Huang, Ying
> >
> >



2022-05-09 06:04:56

by Huang, Ying

[permalink] [raw]
Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

On Fri, 2022-05-06 at 20:17 +0800, Aaron Lu wrote:
> On Fri, May 06, 2022 at 04:40:45PM +0800, [email protected] wrote:
> > On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
> > > Hi Mel,
> > >
> > > On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> > > >
> > > > [snip]
> > > >
> > >
> > > So what this commit did is: if a CPU is always doing frees (pcp->free_factor > 0)
> >
> > IMHO, this means the consumer and producer are running on different
> > CPUs.
> >
>
> Right.
>
> > > and if the order of the high-order page being freed is <= PAGE_ALLOC_COSTLY_ORDER,
> > > then do not use the PCP but free the page directly to the buddy allocator.
> > >
> > > The rationale as explained in the commit's changelog is:
> > > "
> > > Netperf running on localhost exhibits this pattern and while it does not
> > > matter for some machines, it does matter for others with smaller caches
> > > where cache misses cause problems due to reduced page reuse. Pages
> > > freed directly to the buddy list may be reused quickly while still cache
> > > hot where as storing on the PCP lists may be cold by the time
> > > free_pcppages_bulk() is called.
> > > "
> > >
> > > This regression occurred on a machine that has large caches, so this
> > > optimization brings it no value, only overhead (the PCP is skipped); I
> > > guess this is the reason why there is a regression.
> >
> > Per my understanding, not only is the cache size larger, but the L2
> > cache (1MB) is also per-core on this machine. So if the consumer and
> > producer are running on different cores, the cache-hot page may cause
> > more core-to-core cache transfer. This may hurt performance too.
> >
>
> Client side allocates skb(page) and server side recvfrom() it.
> recvfrom() copies the page data to server's own buffer and then releases
> the page associated with the skb. Client does all the allocation and
> server does all the free, page reuse happens at client side.
> So I think core-2-core cache transfer due to page reuse can occur when
> client task migrates.

The core-to-core cache transfer can be cross-socket or cross-L2 within
one socket. I mean the latter one.

> I have modified the job to have the client and server bound to a
> specific CPU of different cores on the same node, and testing it on the
> same Icelake 2 sockets server, the result is
>
>   kernel throughput
> 8b10b465d0e1 125168
> f26b3fa04611 102039 -18%
>
> It's also an 18% drop. I think this means c2c is not a factor?

Can you test with client and server bound to 2 hardware threads
(hyperthread) of one core? The two hardware threads of one core will
share the L2 cache.

> > > I have also tested this case on a small machine: a skylake desktop and
> > > this commit shows improvement:
> > > 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> > > f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
> > >
> > > So this means those directly freed pages get reused by allocator side
> > > and that brings performance improvement for machines with smaller cache.
> >
> > Per my understanding, the L2 cache on this desktop machine is shared
> > among cores.
> >
>
> The said CPU is i7-6700 and according to this wikipedia page,
> L2 is per core:
> https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Mainstream_desktop_processors

Sorry, my memory was wrong. Skylake and later servers have a much
larger private L2 cache (1MB vs 256KB on client parts), which may
increase the possibility of core-to-core transfers.

> > > I wonder if we should still use PCP a little bit under the above said
> > > condition, for the purpose of:
> > > 1 reduced overhead in the free path for machines with large cache;
> > > 2 still keeps the benefit of reused pages for machines with smaller cache.
> > >
> > > For this reason, I tested increasing nr_pcp_high() from returning 0 to
> > > either returning pcp->batch or (pcp->batch << 2):
> > > machine\nr_pcp_high() ret: pcp->high 0 pcp->batch (pcp->batch << 2)
> > > skylake desktop: 72288 90784 92219 91528
> > > icelake 2sockets: 120956 99177 98251 116108
> > >
> > > note nr_pcp_high() returns pcp->high is the behaviour of this commit's
> > > parent, returns 0 is the behaviour of this commit.
> > >
> > > The result shows, if we effectively use a PCP high as (pcp->batch << 2)
> > > for the described condition, then this workload's performance on
> > > small machine can remain while the regression on large machines can be
> > > greatly reduced (from -18% to -4%).
> > >
> >
> > Can we use cache size and topology information directly?
>
> It can be complicated by the fact that the system can have multiple
> producers (CPUs that are doing frees) running at the same time, and
> getting the perfect number can be a difficult job.

We can discuss this after verifying whether it's related to core-to-core
transfers.

Best Regards,
Huang, Ying



2022-05-09 07:49:51

by Aaron Lu

[permalink] [raw]
Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression

On Sat, May 07, 2022 at 08:54:35AM +0800, [email protected] wrote:
> On Fri, 2022-05-06 at 20:17 +0800, Aaron Lu wrote:
> > On Fri, May 06, 2022 at 04:40:45PM +0800, [email protected] wrote:
> > > On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
> > > > Hi Mel,
> > > >
> > > > On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> > > > >
> > > > > [snip]
> > > > >
> > > >
> > > > So what this commit did is: if a CPU is always doing frees (pcp->free_factor > 0)
> > >
> > > IMHO, this means the consumer and producer are running on different
> > > CPUs.
> > >
> >
> > Right.
> >
> > > > and if the order of the high-order page being freed is <= PAGE_ALLOC_COSTLY_ORDER,
> > > > then do not use the PCP but free the page directly to the buddy allocator.
> > > >
> > > > The rationale as explained in the commit's changelog is:
> > > > "
> > > > Netperf running on localhost exhibits this pattern and while it does not
> > > > matter for some machines, it does matter for others with smaller caches
> > > > where cache misses cause problems due to reduced page reuse. Pages
> > > > freed directly to the buddy list may be reused quickly while still cache
> > > > hot where as storing on the PCP lists may be cold by the time
> > > > free_pcppages_bulk() is called.
> > > > "
> > > >
> > > > This regression occurred on a machine that has large caches, so this
> > > > optimization brings it no value, only overhead (the PCP is skipped); I
> > > > guess this is the reason why there is a regression.
> > >
> > > Per my understanding, not only is the cache size larger, but the L2
> > > cache (1MB) is also per-core on this machine. So if the consumer and
> > > producer are running on different cores, the cache-hot page may cause
> > > more core-to-core cache transfer. This may hurt performance too.
> > >
> >
> > Client side allocates skb(page) and server side recvfrom() it.
> > recvfrom() copies the page data to server's own buffer and then releases
> > the page associated with the skb. Client does all the allocation and
> > server does all the free, page reuse happens at client side.
> > So I think core-2-core cache transfer due to page reuse can occur when
> > client task migrates.
>
> The core-to-core cache transfer can be cross-socket or cross-L2 within
> one socket. I mean the latter one.
>
> > I have modified the job to have the client and server bound to a
> > specific CPU of different cores on the same node, and testing it on the
> > same Icelake 2 sockets server, the result is
> >
> >   kernel throughput
> > 8b10b465d0e1 125168
> > f26b3fa04611 102039 -18%
> >
> > It's also an 18% drop. I think this means c2c is not a factor?
>
> Can you test with client and server bound to 2 hardware threads
> (hyperthread) of one core? The two hardware threads of one core will
> share the L2 cache.
>

8b10b465d0e1: 89702
f26b3fa04611: 95823 +6.8%

When binding the client and server to the 2 threads of the same core, the
bisected commit is now an improvement on this 2-socket Icelake server.

> > > > I have also tested this case on a small machine: a skylake desktop and
> > > > this commit shows improvement:
> > > > 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> > > > f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
> > > >
> > > > So this means those directly freed pages get reused by allocator side
> > > > and that brings performance improvement for machines with smaller cache.
> > >
> > > Per my understanding, the L2 cache on this desktop machine is shared
> > > among cores.
> > >
> >
> > The said CPU is i7-6700 and according to this wikipedia page,
> > L2 is per core:
> > https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Mainstream_desktop_processors
>
> Sorry, my memory was wrong. Skylake and later servers have a much
> larger private L2 cache (1MB vs 256KB on client parts), which may
> increase the possibility of core-to-core transfers.
>

I'm trying to understand where the core-2-core cache transfer is.

When the server needs to do the copy in recvfrom(), there is a core-2-core
cache transfer from the client CPU to the server CPU. But this is the same
whether the page gets reused or not, i.e. the bisected commit and its parent
do not differ in this step. Then when the page gets reused on the client
side, there is no core-2-core cache transfer, as the server side didn't
write to the page's data. So page reuse or not, it shouldn't cause any
difference regarding core-2-core cache transfer. Is this correct?

> > > > I wonder if we should still use PCP a little bit under the above said
> > > > condition, for the purpose of:
> > > > 1 reduced overhead in the free path for machines with large cache;
> > > > 2 still keeps the benefit of reused pages for machines with smaller cache.
> > > >
> > > > For this reason, I tested increasing nr_pcp_high() from returning 0 to
> > > > either returning pcp->batch or (pcp->batch << 2):
> > > > machine\nr_pcp_high() ret: pcp->high 0 pcp->batch (pcp->batch << 2)
> > > > skylake desktop: 72288 90784 92219 91528
> > > > icelake 2sockets: 120956 99177 98251 116108
> > > >
> > > > note nr_pcp_high() returns pcp->high is the behaviour of this commit's
> > > > parent, returns 0 is the behaviour of this commit.
> > > >
> > > > The result shows, if we effectively use a PCP high as (pcp->batch << 2)
> > > > for the described condition, then this workload's performance on
> > > > small machine can remain while the regression on large machines can be
> > > > greatly reduced (from -18% to -4%).
> > > >
> > >
> > > Can we use cache size and topology information directly?
> >
> > It can be complicated by the fact that the system can have multiple
> > producers (CPUs that are doing frees) running at the same time, and
> > getting the perfect number can be a difficult job.
>
> We can discuss this after verifying whether it's related to core-to-core
> transfers.
>
> Best Regards,
> Huang, Ying
>
>