Date: Fri, 6 Feb 2015 10:42:32 +0100
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`
From: Michal Kazior
To: Eric Dumazet
Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe@dev.mellanox.co.il

On 5 February 2015 at 18:10, Eric Dumazet wrote:
> On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:
>
>> Not at all. This basically removes backpressure.
>>
>> A single UDP socket can now blast packets regardless of SO_SNDBUF
>> limits.
>>
>> This basically remove years of work trying to fix bufferbloat.
>>
>> I still do not understand why increasing tcp_limit_output_bytes is not
>> working for you.
>
> Oh well, tcp_limit_output_bytes might be ok.
>
> In fact, the problem comes from GSO assumption. Maybe Herbert was right,
> when he suggested TCP would be simpler if we enforced GSO...
>
> When GSO is used, the thing works because 2*skb->truesize is roughly 2
> ms worth of traffic.
>
> Because you do not use GSO, and tx completions are slow, we need this :
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 65caf8b95e17..ac01b4cd0035 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  			break;
>  
>  		/* TCP Small Queues :
> -		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> +		 * Control number of packets in qdisc/devices to two packets /
> +		 * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
>  		 * This allows for :
>  		 *  - better RTT estimation and ACK scheduling
>  		 *  - faster recovery
> @@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  		 * of queued bytes to ensure line rate.
>  		 * One example is wifi aggregation (802.11 AMPDU)
>  		 */
> -		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> +		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>  		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>  
>  		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>

The above brings back previous behaviour, i.e. I can get 600mbps TCP on
5 flows again. Single flow is still (as it was before TSO autosizing)
limited to roughly 280mbps.

I never really bothered before to understand why I need to push a few
flows through ath10k to max it out, i.e. if I run a single UDP flow I
get ~300mbps while with, e.g.
5 I get 670mbps easily. I guess it was the tx completion latency all
along.

I just added extra debug output to ath10k to see the latency between
submission and completion. Here's a log
(http://www.filedropper.com/complete-log) of a 2s run of UDP iperf
trying to push 1gbps but managing only 300mbps.

I've made sure not to hold any locks or introduce delays internal to
ath10k. Frames complete in 2-4ms on average under load.

When I tried using different ath10k hw&fw I got between 1-2ms of
latency for tx completions, yielding ~430mbps while the max should be
around 670mbps.


Michał
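
For reference, the arithmetic behind the quoted patch can be sketched in plain userspace C. This is only an illustration, not kernel code: tsq_limit() and its parameters are hypothetical names, the truesize and pacing-rate values used with it are assumed rather than measured, and 131072 is assumed as the tcp_limit_output_bytes setting. Since sk_pacing_rate is in bytes per second, shifting right by 10 gives roughly 1 ms worth of bytes and shifting by 9 roughly 2 ms.

```c
#include <stdint.h>

/* Assumed value of sysctl_tcp_limit_output_bytes; check the running kernel. */
#define TCP_LIMIT_OUTPUT_BYTES_ASSUMED 131072u

/*
 * Hypothetical helper mirroring the limit computed in tcp_write_xmit().
 *
 * pacing_rate is in bytes per second, so (pacing_rate >> shift) is the
 * number of bytes paced out in 1/2^shift seconds:
 *   shift = 10  ->  ~1 ms worth of traffic (before the patch)
 *   shift = 9   ->  ~2 ms worth of traffic (after the patch)
 */
static inline uint32_t tsq_limit(uint32_t truesize, uint32_t pacing_rate,
                                 unsigned int shift,
                                 uint32_t limit_output_bytes)
{
    uint32_t two_skbs = 2u * truesize;        /* 2 * skb->truesize */
    uint32_t paced = pacing_rate >> shift;    /* time-based share  */
    uint32_t limit = two_skbs > paced ? two_skbs : paced;

    /* Clamp to the sysctl, as min_t(u32, ...) does in the kernel. */
    return limit < limit_output_bytes ? limit : limit_output_bytes;
}
```

With an assumed pacing rate of 37500000 bytes/s (300 Mbit/s) and an assumed non-GSO truesize of 2304 bytes, tsq_limit(2304, 37500000, 10, TCP_LIMIT_OUTPUT_BYTES_ASSUMED) returns 36621, while a shift of 9 returns 73242, i.e. twice as many bytes may sit in the qdisc/device before the socket is throttled.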