FYI, we noticed a -5.5% regression of netpipe.bigger_5M_Mbps.avg due to commit:
commit 3613b3dbd1ade9a6a626dae1f608c57638eb5e8a ("tcp: prepare skbs for better sack shifting")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
in testcase: netpipe
on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 6G
with the following parameters:
test: tcp
cluster: cs-localhost
NetPIPE is a protocol-independent performance tool that encapsulates the best of ttcp and netperf and visually represents network performance under a variety of conditions.
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml
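For a quick manual cross-check outside the lkp harness (assuming the NPtcp binary built from the NetPIPE sources; option names may differ between NetPIPE versions), a loopback run matching the cs-localhost/tcp configuration looks roughly like:
./NPtcp &               # receiver, listening on the default port
./NPtcp -h 127.0.0.1    # transmitter, over loopback as in cs-localhost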
=========================================================================================
cluster/compiler/kconfig/rootfs/tbox_group/test/testcase:
cs-localhost/gcc-6/x86_64-rhel-7.2/debian-x86_64-2016-08-31.cgz/vm-lkp-a08/tcp/netpipe
commit:
e812bd905a (" wireless-drivers-next patches for 4.9")
3613b3dbd1 ("tcp: prepare skbs for better sack shifting")
e812bd905a5cf00f 3613b3dbd1ade9a6a626dae1f6
---------------- --------------------------
fail:runs %reproduction fail:runs
| | |
%stddev %change %stddev
\ | \
2427 ± 0% -5.5% 2294 ± 0% netpipe.bigger_5M_Mbps.avg
1764 ± 1% +5.0% 1853 ± 0% netpipe.time.elapsed_time
1764 ± 1% +5.0% 1853 ± 0% netpipe.time.elapsed_time.max
84.75 ± 0% -2.1% 83.00 ± 0% netpipe.time.percent_of_cpu_this_job_got
1497 ± 1% +3.2% 1546 ± 0% netpipe.time.system_time
1378335 ± 1% +28.6% 1772455 ± 1% netpipe.time.voluntary_context_switches
14140 ± 3% +6.6% 15079 ± 2% interrupts.CAL:Function_call_interrupts
393.25 ± 2% -29.1% 279.00 ± 2% slabinfo.kmalloc-2048.active_objs
412.00 ± 1% -32.3% 279.00 ± 2% slabinfo.kmalloc-2048.num_objs
4069 ± 0% +17.5% 4779 ± 0% vmstat.system.cs
2961 ± 0% +5.3% 3118 ± 0% vmstat.system.in
229439 ± 94% +6926.4% 16121408 ±172% latency_stats.avg.max
141971 ±160% +11239.9% 16099418 ±172% latency_stats.avg.nfs_wait_on_request.nfs_writepage_setup.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
90183 ±153% -55.4% 40178 ± 84% latency_stats.avg.proc_cgroup_show.proc_single_show.seq_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
839.25 ± 21% +1114.0% 10188 ±125% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_execute.rpc_run_task.nfs4_call_sync_sequence.[nfsv4]._nfs4_proc_lookup.[nfsv4].nfs4_proc_lookup_common.[nfsv4].nfs4_proc_lookup.[nfsv4].nfs_lookup.lookup_slow.walk_component.path_lookupat
414419 ± 86% +7661.1% 32163484 ±172% latency_stats.max.max
257867 ±166% +12362.7% 32137192 ±172% latency_stats.max.nfs_wait_on_request.nfs_writepage_setup.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
81137 ±155% -94.7% 4265 ± 45% latency_stats.max.pipe_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
90183 ±153% -55.4% 40178 ± 84% latency_stats.max.proc_cgroup_show.proc_single_show.seq_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
839.25 ± 21% +1114.0% 10188 ±125% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_execute.rpc_run_task.nfs4_call_sync_sequence.[nfsv4]._nfs4_proc_lookup.[nfsv4].nfs4_proc_lookup_common.[nfsv4].nfs4_proc_lookup.[nfsv4].nfs_lookup.lookup_slow.walk_component.path_lookupat
6232 ± 87% -100.0% 0.00 ± -1% latency_stats.sum.call_rwsem_down_read_failed.__do_page_fault.trace_do_page_fault.do_async_page_fault.async_page_fault.skb_copy_datagram_iter.tcp_rcv_established.tcp_v4_do_rcv.tcp_prequeue_process.tcp_recvmsg.inet_recvmsg.sock_recvmsg
1107 ± 62% +964.9% 11788 ± 55% latency_stats.sum.call_rwsem_down_read_failed.proc_pid_cmdline_read.__vfs_read.vfs_read.SyS_read.do_syscall_64.return_from_SYSCALL_64
276091 ±166% +11542.5% 32143926 ±172% latency_stats.sum.nfs_wait_on_request.nfs_writepage_setup.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
97392 ±128% -82.9% 16660 ± 29% latency_stats.sum.pipe_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
90183 ±153% -55.4% 40178 ± 84% latency_stats.sum.proc_cgroup_show.proc_single_show.seq_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
839.25 ± 21% +1114.0% 10188 ±125% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_execute.rpc_run_task.nfs4_call_sync_sequence.[nfsv4]._nfs4_proc_lookup.[nfsv4].nfs4_proc_lookup_common.[nfsv4].nfs4_proc_lookup.[nfsv4].nfs_lookup.lookup_slow.walk_component.path_lookupat
3561 ± 11% +12.9% 4020 ± 6% sched_debug.cpu.curr->pid.avg
9783 ± 6% +11.8% 10942 ± 3% sched_debug.cpu.curr->pid.max
4222 ± 7% +11.3% 4701 ± 3% sched_debug.cpu.curr->pid.stddev
3988122 ± 13% +24.4% 4962088 ± 4% sched_debug.cpu.max_idle_balance_cost.max
1330791 ± 13% +30.8% 1740923 ± 4% sched_debug.cpu.max_idle_balance_cost.stddev
576379 ± 1% +9.7% 632485 ± 3% sched_debug.cpu.nr_load_updates.max
987546 ± 1% +20.0% 1184637 ± 1% sched_debug.cpu.nr_switches.avg
1074939 ± 2% +23.9% 1332366 ± 3% sched_debug.cpu.nr_switches.max
67934 ± 31% +110.9% 143278 ± 5% sched_debug.cpu.nr_switches.stddev
-15.31 ±-32% -41.6% -8.93 ±-57% sched_debug.cpu.nr_uninterruptible.min
990916 ± 1% +20.2% 1190867 ± 1% sched_debug.cpu.sched_count.avg
1079929 ± 2% +24.2% 1341786 ± 3% sched_debug.cpu.sched_count.max
69632 ± 31% +111.3% 147114 ± 5% sched_debug.cpu.sched_count.stddev
386683 ± 1% +25.2% 483992 ± 2% sched_debug.cpu.sched_goidle.avg
424674 ± 1% +29.4% 549482 ± 1% sched_debug.cpu.sched_goidle.max
33344 ± 23% +107.5% 69183 ± 9% sched_debug.cpu.sched_goidle.stddev
513012 ± 1% +19.1% 610887 ± 1% sched_debug.cpu.ttwu_count.avg
570351 ± 5% +24.2% 708599 ± 5% sched_debug.cpu.ttwu_count.max
40059 ± 42% +112.3% 85064 ± 14% sched_debug.cpu.ttwu_count.stddev
Thanks,
Xiaolong
On Wed, Oct 19, 2016 at 3:54 AM, kernel test robot
<[email protected]> wrote:
>
> FYI, we noticed a -5.5% regression of netpipe.bigger_5M_Mbps.avg due to commit:
>
> commit 3613b3dbd1ade9a6a626dae1f608c57638eb5e8a ("tcp: prepare skbs for better sack shifting")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>
> [...]
>
> Thanks,
> Xiaolong
Thanks for the report.
I could not reproduce your results, assuming it is netperf over loopback.
The problem with synthetic tests like this is that a small change can cascade into other effects.
In your stats I can see a high number of context switches that a normal netperf run does not exhibit.
Also, maybe the problem is that you are running in a VM; some NUMA pinning or something similar might be at play...
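For example (just a generic sketch on my side, not something I verified on your exact setup), keeping the whole guest on one NUMA node and pinning its vCPU threads usually removes a lot of that noise:
numactl --cpunodebind=0 --membind=0 \
    qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 6G ...   # same guest as in the report, bound to node 0
# or pin individual vCPU threads after boot (thread ids from "info cpus" in the qemu monitor)
taskset -cp 0 <vcpu0 thread id>
taskset -cp 1 <vcpu1 thread id>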
Here is a small hack to enable/disable my patch at run time, without having to reboot and thus with less variability.
With it, an even netdev_max_backlog value keeps the patched behavior (linear payload area only in the first skb of the write queue), while an odd value restores the pre-patch behavior of allocating a small linear payload area in every skb; the parity of the sysctl is only abused as a convenient switch.
edumazet@manihi:/ssd/git/net-next$ git diff net/ipv4/tcp.c
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3251fe71f39f..55afe7b16016 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1035,7 +1035,7 @@ EXPORT_SYMBOL(tcp_sendpage);
  */
 static int linear_payload_sz(bool first_skb)
 {
-	if (first_skb)
+	if (first_skb || (netdev_max_backlog & 1))
 		return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
 	return 0;
 }
Then:
lpk51:~# echo 1000 >/proc/sys/net/core/netdev_max_backlog
lpk51:~# perf stat -a -C0,1 ./netperf -H 127.0.0.1 -Cc -T0,1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 () port 0 AF_INET : cpu bind
port 0 protocol 6
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00     45854.74   5.90     5.90     0.337   0.337
 Performance counter stats for 'system wide':

      20005.989547 task-clock (msec)         #    2.000 CPUs utilized
         1,568,042 context-switches          #    0.078 M/sec
                 2 cpu-migrations            #    0.000 K/sec
               121 page-faults               #    0.006 K/sec
    69,260,921,315 cycles                    #    3.462 GHz                      [83.33%]
    49,197,585,858 stalled-cycles-frontend   #   71.03% frontend cycles idle     [83.34%]
    30,069,741,080 stalled-cycles-backend    #   43.42% backend  cycles idle     [66.67%]
    32,341,104,986 instructions              #    0.47  insns per cycle
                                             #    1.52  stalled cycles per insn  [83.34%]
     5,887,445,456 branches                  #  294.284 M/sec                    [83.34%]
        65,533,193 branch-misses             #    1.11% of all branches          [83.33%]

      10.003057093 seconds time elapsed
lpk51:~# echo 1001 >/proc/sys/net/core/netdev_max_backlog
lpk51:~# perf stat -a -C0,1 ./netperf -H 127.0.0.1 -Cc -T0,1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 () port 0 AF_INET : cpu bind
port 0 protocol 6
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00     45267.63   5.94     5.94     0.344   0.344
 Performance counter stats for 'system wide':

      20006.713062 task-clock (msec)         #    2.000 CPUs utilized
         1,466,531 context-switches          #    0.073 M/sec
                 2 cpu-migrations            #    0.000 K/sec
               120 page-faults               #    0.006 K/sec
    69,743,911,198 cycles                    #    3.486 GHz                      [83.33%]
    49,509,413,005 stalled-cycles-frontend   #   70.99% frontend cycles idle     [83.34%]
    28,186,805,983 stalled-cycles-backend    #   40.41% backend  cycles idle     [66.67%]
    31,662,971,822 instructions              #    0.45  insns per cycle
                                             #    1.56  stalled cycles per insn  [83.34%]
     5,751,313,010 branches                  #  287.469 M/sec                    [83.34%]
        69,213,717 branch-misses             #    1.20% of all branches          [83.33%]

      10.002924757 seconds time elapsed