Date: Thu, 14 Jun 2018 13:18:45 +0300 (EEST)
From: Ilpo Järvinen
To: Michal Kubecek
Cc: Netdev, Eric Dumazet, Yuchung Cheng, LKML
Subject: Re: [RFC PATCH RESEND] tcp: avoid F-RTO if SACK and timestamps are disabled
In-Reply-To: <20180613165716.4fy7ufk7jnk3r67r@unicorn.suse.cz>
References: <20180613164802.99B89A09E2@unicorn.suse.cz> <20180613165543.0F92DA09E2@unicorn.suse.cz> <20180613165716.4fy7ufk7jnk3r67r@unicorn.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 13 Jun 2018, Michal Kubecek wrote:

> On Wed, Jun 13, 2018 at 06:55:43PM +0200, Michal Kubecek wrote:
> > When F-RTO algorithm (RFC 5682) is used on connection without both SACK and
> > timestamps (either because of (mis)configuration or because the other
> > endpoint does not advertise them), a specific loss pattern can make RTO grow
> > exponentially until the sender is only able to send one packet per two
> > minutes (TCP_RTO_MAX).
> >
> > One way to reproduce is to
> >
> >   - make sure the connection uses neither SACK nor timestamps
> >   - let tp->reorder grow enough so that lost packets are retransmitted
> >     after RTO (rather than when high_seq - snd_una > reorder * MSS)
> >   - let the data flow stabilize
> >   - drop multiple sender packets in "every second" pattern
> >   - either there is no new data to send or acks received in response to new
> >     data are also window updates (i.e. not dupacks by definition)
> >
> > In this scenario, the sender keeps cycling between retransmitting first
> > lost packet (step 1 of RFC 5682), sending new data by (2b) and timing out
> > again. In this loop, the sender only gets
> >
> >   (a) acks for retransmitted segments (possibly together with old ones)
> >   (b) window updates
> >
> > Without timestamps, neither can be used for RTT estimator and without SACK,
> > we have no newly sacked segments to estimate RTT either. Therefore each
> > timeout doubles RTO, and without usable RTT samples there is nothing
> > to counter the exponential growth.
> >
> > While disabling both SACK and timestamps doesn't make any sense, the
> > resulting behaviour is so pathological that it deserves an improvement.
> > (Also, both can be disabled on the other side.) Avoid F-RTO algorithm in
> > case both SACK and timestamps are disabled so that the sender falls back to
> > traditional slow start retransmission.
> >
> > Signed-off-by: Michal Kubecek
>
> I was able to illustrate the issue using a packetdrill script. It cheats
> a bit by setting net.ipv4.tcp_reordering to 30 so that we can get to
> the issue more quickly. In this case, we don't have more data to send
> but it's not essential; the issue can be reproduced even with sending of
> new data in F-RTO, it would only make everything more complicated.
>
> I was able to run the same script on kernels 4.17-rc6, 4.12 (SLE15) and
> 4.4 (SLE12-SP2). Kernel 3.12 required minor modifications but not in the
> important part (the slow start is a bit slower there).
>
> ---------------------------------------------------------------------------
> --tolerance_usecs=10000
>
> // flush cached TCP metrics
> 0.000 `ip tcp_metrics flush all`
> +0.000 `sysctl -q net.ipv4.tcp_reordering=20`
>
>
> // establish a connection
> +0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0.000 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
> +0.000 bind(3, ..., ...) = 0
> +0.000 listen(3, 1) = 0
>
> +0.100 < S 0:0(0) win 40000
> +0.000 > S. 0:0(0) ack 1
> +0.100 < . 1:1(0) ack 1 win 40000
> +0.000 accept(3, ..., ...) = 4
>
> // Send 10 data segments.
> +0.100 write(4, ..., 30000) = 30000
> // For some reason (unknown yet), GSO packets are only 2000 bytes long
> +0.000 > . 1:2001(2000) ack 1
> +0.000 > . 2001:4001(2000) ack 1
> +0.000 > . 4001:6001(2000) ack 1
> +0.000 > . 6001:8001(2000) ack 1
> +0.000 > . 8001:10001(2000) ack 1
> +0.100 < . 1:1(0) ack 2001 win 38000
> +0.000 > . 10001:12001(2000) ack 1
> +0.000 > . 12001:14001(2000) ack 1
> +0.001 < . 1:1(0) ack 4001 win 36000
> +0.000 > . 14001:16001(2000) ack 1
> +0.000 > . 16001:18001(2000) ack 1
> +0.001 < . 1:1(0) ack 6001 win 34000
> +0.000 > . 18001:20001(2000) ack 1
> +0.000 > . 20001:22001(2000) ack 1
> +0.001 < . 1:1(0) ack 8001 win 32000
> +0.000 > . 22001:24001(2000) ack 1
> +0.000 > . 24001:26001(2000) ack 1
> +0.001 < . 1:1(0) ack 10001 win 30000
> +0.000 > . 26001:28001(2000) ack 1
> +0.000 > P. 28001:30001(2000) ack 1
>
> // loss of 12001:13001, 14001:15001, ..., 28001:29001
> +0.100 < . 1:1(0) ack 12001 win 30000   // original ack
> +0.000 < . 1:1(0) ack 12001 win 30000   // 13001:14001
> +0.000 < . 1:1(0) ack 12001 win 30000   // 15001:16001
> +0.000 < . 1:1(0) ack 12001 win 30000   // 17001:18001
> +0.000 < . 1:1(0) ack 12001 win 30000   // 19001:20001
> +0.000 < . 1:1(0) ack 12001 win 30000   // 21001:22001
> +0.000 < . 1:1(0) ack 12001 win 30000   // 23001:24001
> +0.000 < . 1:1(0) ack 12001 win 30000   // 25001:26001
> +0.000 < . 1:1(0) ack 12001 win 30000   // 27001:28001
> +0.000 < . 1:1(0) ack 12001 win 30000   // 29001:30001
>
> // RTO 300ms
> +0.270~+0.330 > . 12001:13001(1000) ack 1

Let's analyze this case:

ca_state = CA_Loss

> +0.100 < . 1:1(0) ack 14001 win 38000

snd_una advances => icsk_retransmits = 0

...The lack of new data segments here seems very relevant to me, and it
hides from you what is really happening under the hood...

> // RTO 600ms
> +0.540~+0.660 > . 14001:15001(1000) ack 1

The above should already evaluate to false for FRTO in this case:
	(new_recovery || icsk->icsk_retransmits) &&

...But it doesn't. If there were new data segments, they would show you
that we're running a bogus FRTO undo here (with a burst of new data
segments before the second RTO). The bogus undo on that ACK causes
ca_state to switch away from CA_Loss, and FRTO can then reoccur even
though it was not intended.

Please try with this patch:

https://patchwork.ozlabs.org/patch/883654/

...Since you're dealing with non-SACK flows here, you might want to
consider the other fixes in that same series too, as they all fix bad
brokenness. I should do an updated version of that series, but I've been
waiting for the TCP testsuite to be published...

--
 i.