Date: Wed, 13 Jun 2018 18:57:16 +0200
From: Michal Kubecek
To: netdev@vger.kernel.org
Cc: Eric Dumazet, Yuchung Cheng, Ilpo Jarvinen, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH RESEND] tcp: avoid F-RTO if SACK and timestamps are disabled
Message-ID: <20180613165716.4fy7ufk7jnk3r67r@unicorn.suse.cz>
References: <20180613164802.99B89A09E2@unicorn.suse.cz> <20180613165543.0F92DA09E2@unicorn.suse.cz>
In-Reply-To: <20180613165543.0F92DA09E2@unicorn.suse.cz>

On Wed, Jun 13, 2018 at 06:55:43PM +0200, Michal Kubecek wrote:
> When F-RTO algorithm (RFC 5682) is used
> on connection without both SACK and
> timestamps (either because of (mis)configuration or because the other
> endpoint does not advertise them), specific pattern loss can make RTO grow
> exponentially until the sender is only able to send one packet per two
> minutes (TCP_RTO_MAX).
>
> One way to reproduce is to
>
>   - make sure the connection uses neither SACK nor timestamps
>   - let tp->reorder grow enough so that lost packets are retransmitted
>     after RTO (rather than when high_seq - snd_una > reorder * MSS)
>   - let the data flow stabilize
>   - drop multiple sender packets in "every second" pattern
>   - either there is no new data to send or acks received in response to new
>     data are also window updates (i.e. not dupacks by definition)
>
> In this scenario, the sender keeps cycling between retransmitting first
> lost packet (step 1 of RFC 5682), sending new data by (2b) and timing out
> again. In this loop, the sender only gets
>
>   (a) acks for retransmitted segments (possibly together with old ones)
>   (b) window updates
>
> Without timestamps, neither can be used for RTT estimator and without SACK,
> we have no newly sacked segments to estimate RTT either. Therefore each
> timeout doubles RTO and without usable RTT samples so that there is nothing
> to counter the exponential growth.
>
> While disabling both SACK and timestamps doesn't make any sense, the
> resulting behaviour is so pathological that it deserves an improvement.
> (Also, both can be disabled on the other side.) Avoid F-RTO algorithm in
> case both SACK and timestamps are disabled so that the sender falls back to
> traditional slow start retransmission.
>
> Signed-off-by: Michal Kubecek

I was able to illustrate the issue using a packetdrill script. It cheats a
bit by setting net.ipv4.tcp_reordering to 20 so that we can get to the
issue more quickly.
In this case, we don't have more data to send, but that is not essential;
the issue can be reproduced even when new data is sent in F-RTO, it would
only make the script more complicated.

I was able to run the same script on kernels 4.17-rc6, 4.12 (SLE15) and 4.4
(SLE12-SP2). Kernel 3.12 required minor modifications, but not in the
important part (the slow start is a bit slower there).

---------------------------------------------------------------------------
--tolerance_usecs=10000

// flush cached TCP metrics
0.000 `ip tcp_metrics flush all`
+0.000 `sysctl -q net.ipv4.tcp_reordering=20`

// establish a connection
+0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.000 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
+0.000 bind(3, ..., ...) = 0
+0.000 listen(3, 1) = 0

+0.100 < S 0:0(0) win 40000
+0.000 > S. 0:0(0) ack 1
+0.100 < . 1:1(0) ack 1 win 40000
+0.000 accept(3, ..., ...) = 4

// Send 10 data segments.
+0.100 write(4, ..., 30000) = 30000

// For some reason (unknown yet), GSO packets are only 2000 bytes long
+0.000 > . 1:2001(2000) ack 1
+0.000 > . 2001:4001(2000) ack 1
+0.000 > . 4001:6001(2000) ack 1
+0.000 > . 6001:8001(2000) ack 1
+0.000 > . 8001:10001(2000) ack 1
+0.100 < . 1:1(0) ack 2001 win 38000
+0.000 > . 10001:12001(2000) ack 1
+0.000 > . 12001:14001(2000) ack 1
+0.001 < . 1:1(0) ack 4001 win 36000
+0.000 > . 14001:16001(2000) ack 1
+0.000 > . 16001:18001(2000) ack 1
+0.001 < . 1:1(0) ack 6001 win 34000
+0.000 > . 18001:20001(2000) ack 1
+0.000 > . 20001:22001(2000) ack 1
+0.001 < . 1:1(0) ack 8001 win 32000
+0.000 > . 22001:24001(2000) ack 1
+0.000 > . 24001:26001(2000) ack 1
+0.001 < . 1:1(0) ack 10001 win 30000
+0.000 > . 26001:28001(2000) ack 1
+0.000 > P. 28001:30001(2000) ack 1

// loss of 12001:13001, 14001:15001, ..., 28001:29001
+0.100 < . 1:1(0) ack 12001 win 30000 // original ack
+0.000 < . 1:1(0) ack 12001 win 30000 // 13001:14001
+0.000 < . 1:1(0) ack 12001 win 30000 // 15001:16001
+0.000 < . 1:1(0) ack 12001 win 30000 // 17001:18001
+0.000 < . 1:1(0) ack 12001 win 30000 // 19001:20001
+0.000 < . 1:1(0) ack 12001 win 30000 // 21001:22001
+0.000 < . 1:1(0) ack 12001 win 30000 // 23001:24001
+0.000 < . 1:1(0) ack 12001 win 30000 // 25001:26001
+0.000 < . 1:1(0) ack 12001 win 30000 // 27001:28001
+0.000 < . 1:1(0) ack 12001 win 30000 // 29001:30001

// RTO 300ms
+0.270~+0.330 > . 12001:13001(1000) ack 1
+0.100 < . 1:1(0) ack 14001 win 38000
// RTO 600ms
+0.540~+0.660 > . 14001:15001(1000) ack 1
+0.100 < . 1:1(0) ack 16001 win 38000
// RTO 1200ms
+1.050~+1.350 > . 16001:17001(1000) ack 1
+0.100 < . 1:1(0) ack 18001 win 38000
// RTO 2400ms
+2.100~+2.700 > . 18001:19001(1000) ack 1
+0.100 < . 1:1(0) ack 20001 win 38000
// RTO 4800ms
+4.200~+5.400 > . 20001:21001(1000) ack 1
+0.100 < . 1:1(0) ack 22001 win 38000
// RTO 9600ms
+8.400~+10.800 > . 22001:23001(1000) ack 1
+0.100 < . 1:1(0) ack 24001 win 38000
// RTO 19200ms
+16.800~+21.600 > . 24001:25001(1000) ack 1

+1.000 `sysctl -q net.ipv4.tcp_reordering=3`
---------------------------------------------------------------------------

And this is what happens on a current snapshot of the master branch, either
with net.ipv4.tcp_frto=0 or with the RFC patch:

---------------------------------------------------------------------------
--tolerance_usecs=10000

// flush cached TCP metrics
0.000 `ip tcp_metrics flush all`
+0.000 `sysctl -q net.ipv4.tcp_reordering=20`

// establish a connection
+0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.000 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0
+0.000 bind(3, ..., ...) = 0
+0.000 listen(3, 1) = 0

+0.100 < S 0:0(0) win 40000
+0.000 > S. 0:0(0) ack 1
+0.100 < . 1:1(0) ack 1 win 40000
+0.000 accept(3, ..., ...) = 4

// Send 10 data segments.
+0.100 write(4, ..., 30000) = 30000

// For some reason (unknown yet), GSO packets are only 2000 bytes long
+0.000 > . 1:2001(2000) ack 1
+0.000 > . 2001:4001(2000) ack 1
+0.000 > . 4001:6001(2000) ack 1
+0.000 > . 6001:8001(2000) ack 1
+0.000 > . 8001:10001(2000) ack 1
+0.100 < . 1:1(0) ack 2001 win 38000
+0.000 > . 10001:12001(2000) ack 1
+0.000 > . 12001:14001(2000) ack 1
+0.001 < . 1:1(0) ack 4001 win 36000
+0.000 > . 14001:16001(2000) ack 1
+0.000 > . 16001:18001(2000) ack 1
+0.001 < . 1:1(0) ack 6001 win 34000
+0.000 > . 18001:20001(2000) ack 1
+0.000 > . 20001:22001(2000) ack 1
+0.001 < . 1:1(0) ack 8001 win 32000
+0.000 > . 22001:24001(2000) ack 1
+0.000 > . 24001:26001(2000) ack 1
+0.001 < . 1:1(0) ack 10001 win 30000
+0.000 > . 26001:28001(2000) ack 1
+0.000 > P. 28001:30001(2000) ack 1

// loss of 12001:13001, 14001:15001, ..., 28001:29001
+0.100 < . 1:1(0) ack 12001 win 30000 // original ack
+0.000 < . 1:1(0) ack 12001 win 30000 // 13001:14001
+0.000 < . 1:1(0) ack 12001 win 30000 // 15001:16001
+0.000 < . 1:1(0) ack 12001 win 30000 // 17001:18001
+0.000 < . 1:1(0) ack 12001 win 30000 // 19001:20001
+0.000 < . 1:1(0) ack 12001 win 30000 // 21001:22001
+0.000 < . 1:1(0) ack 12001 win 30000 // 23001:24001
+0.000 < . 1:1(0) ack 12001 win 30000 // 25001:26001
+0.000 < . 1:1(0) ack 12001 win 30000 // 27001:28001
+0.000 < . 1:1(0) ack 12001 win 30000 // 29001:30001

// RTO 300ms
+0.270~+0.330 > . 12001:13001(1000) ack 1
+0.100 < . 1:1(0) ack 14001 win 38000
+0.000 > . 14001:16001(2000) ack 1
+0.000 > . 16001:17001(1000) ack 1
+0.100 < . 1:1(0) ack 16001 win 38000
+0.000 > . 17001:18001(1000) ack 1
+0.000 > . 18001:20001(2000) ack 1
+0.000 > . 20001:21001(1000) ack 1
+0.100 < . 1:1(0) ack 18001 win 38000
+0.001 < . 1:1(0) ack 20001 win 36000
+0.001 < . 1:1(0) ack 21001 win 35000
+0.000 > . 21001:22001(1000) ack 1
+0.000 > . 22001:24001(2000) ack 1
+0.000 > . 24001:25001(1000) ack 1
+0.000 > . 25001:26001(1000) ack 1
+0.000 > . 26001:28001(2000) ack 1
+0.000 > . 28001:29001(1000) ack 1
+0.000 > P. 29001:30001(1000) ack 1
+0.100 < . 1:1(0) ack 22001 win 38000
+0.001 < . 1:1(0) ack 24001 win 36000
+0.001 < . 1:1(0) ack 26001 win 34000
+0.001 < . 1:1(0) ack 28001 win 32000
+0.001 < . 1:1(0) ack 30001 win 30000

+1.000 `sysctl -q net.ipv4.tcp_reordering=3`
---------------------------------------------------------------------------

Michal Kubecek
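
For reference, the RTO values annotated in the first script (300ms, 600ms,
..., 19200ms) are what plain exponential backoff gives when no RTT sample
ever arrives to reset the timer. Below is a minimal Python sketch of that
arithmetic, not kernel code. It assumes an initial RTO of 300 ms as in the
script and a cap of 120 s for TCP_RTO_MAX (the "two minutes" from the quoted
description); rto_backoff() is a hypothetical helper introduced only for
illustration.

```python
# Illustration only, not kernel code: RTO backoff when every timeout doubles
# the RTO and no usable RTT sample ever arrives to bring it back down.

# Assumption: cap of two minutes, as stated for TCP_RTO_MAX in the quoted text.
TCP_RTO_MAX_MS = 120_000


def rto_backoff(initial_rto_ms, timeouts, rto_max_ms=TCP_RTO_MAX_MS):
    """Return the RTO (in ms) in effect for each of `timeouts` consecutive timeouts."""
    rtos = []
    rto = initial_rto_ms
    for _ in range(timeouts):
        rtos.append(rto)
        # Each timeout doubles the RTO, up to the cap.
        rto = min(rto * 2, rto_max_ms)
    return rtos


if __name__ == "__main__":
    for n, rto in enumerate(rto_backoff(300, 10), start=1):
        print(f"timeout {n}: RTO {rto} ms")
```

Under these assumptions the first seven values match the comments in the
script, and the tenth timeout already hits the two minute cap.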