Subject: Re: Throughput regression with `tcp: refine TSO autosizing`
From: Eric Dumazet
To: Michal Kazior
Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe@dev.mellanox.co.il
Date: Fri, 06 Feb 2015 05:40:01 -0800

On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:
> The above brings back the previous behaviour, i.e. I can get 600mbps TCP
> on 5 flows again. A single flow is still (as it was before TSO
> autosizing) limited to roughly ~280mbps.
>
> I never really bothered before to understand why I need to push a few
> flows through ath10k to max it out, i.e. if I run a single UDP flow I
> get ~300mbps, while with e.g. 5 flows I get 670mbps easily.

For a single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
enough: UDP has no callback from TX completion to feed the following
frames (there is no write queue as in TCP).

# cat /proc/sys/net/core/wmem_default
212992
# ethtool -C eth1 tx-usecs 1024 tx-frames 120
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1450   10.00      697705      0     809.27
212992           10.00      673412            781.09

# echo 800000 >/proc/sys/net/core/wmem_default
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

800000    1450   10.00     7329221      0    8501.84
212992           10.00     7284051           8449.44

> I guess it was the tx completion latency all along.
>
> I just put an extra debug into ath10k to see the latency between
> submission and completion. Here's a log
> (http://www.filedropper.com/complete-log) of a 2s run of UDP iperf
> trying to push 1gbps but managing only 300mbps.
>
> I've made sure not to hold any locks nor introduce delays internal to
> ath10k. Frames get completed in 2-4 ms on average during load.

tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA of the
TX completion delay. But this would require yet another expensive call to
ktime_get() if HZ < 1000.
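A minimal userspace sketch of such an EWMA (the stub struct, the field name
tx_completion_delay_ms and the 7/8 + 1/8 weighting are assumptions for
illustration, not the patch referred to below; real code would live in
net/ipv4/tcp_output.c and take its timestamps from the skb/socket):

#include <stdio.h>
#include <stdint.h>

struct sock_stub {
	uint32_t tx_completion_delay_ms;	/* hypothetical per-socket EWMA field */
};

/* delay_ms = now - time the skb was handed to the driver, i.e. the
 * TX completion latency of this particular frame.
 */
static void tx_completion_delay_update(struct sock_stub *sk, uint32_t delay_ms)
{
	if (!sk->tx_completion_delay_ms)
		sk->tx_completion_delay_ms = delay_ms;	/* first sample */
	else
		/* 7/8 old + 1/8 new, the weighting TCP already uses for SRTT */
		sk->tx_completion_delay_ms =
			(7 * sk->tx_completion_delay_ms + delay_ms) / 8;
}

int main(void)
{
	struct sock_stub sk = { 0 };
	uint32_t samples[] = { 2, 4, 3, 4, 2 };	/* ms, like the ath10k log above */
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		tx_completion_delay_update(&sk, samples[i]);
		printf("sample %u ms -> ewma %u ms\n", samples[i],
		       sk.tx_completion_delay_ms);
	}
	return 0;
}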
Then tcp_write_xmit() could use it to adjust

	limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);

to

	amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate;
	limit = max(2 * skb->truesize, amount / 1000);

I'll cook a patch.

Thanks.
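For a sense of scale (illustration only, not the actual patch; the 3 ms
delay and the ~300mbps pacing rate are assumed example values taken from
the numbers reported in this thread), plugging them into the two formulas
above:

#include <stdio.h>
#include <stdint.h>

static uint64_t max64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

int main(void)
{
	uint64_t truesize = 4096;		/* rough skb->truesize of one frame */
	uint64_t pacing_rate = 37500000;	/* bytes/sec, ~300mbps */
	uint64_t delay_ms = 3;			/* EWMA of TX completion delay */

	/* current code: pacing_rate >> 9 is roughly 2 ms worth of bytes */
	uint64_t old_limit = max64(2 * truesize, pacing_rate >> 9);

	/* proposed: (2 + delay_ms) milliseconds worth of bytes */
	uint64_t amount = (2 + delay_ms) * pacing_rate;
	uint64_t new_limit = max64(2 * truesize, amount / 1000);

	printf("old limit: %llu bytes (~2 ms at the pacing rate)\n",
	       (unsigned long long)old_limit);
	printf("new limit: %llu bytes (~%llu ms at the pacing rate)\n",
	       (unsigned long long)new_limit,
	       (unsigned long long)(2 + delay_ms));
	return 0;
}

With a 2-4 ms completion delay the limit grows from roughly 2 ms worth of
data to 4-6 ms worth, i.e. enough extra in-flight data to ride out the
driver's TX completion latency instead of stalling on it.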