From: Neal Cardwell
Date: Wed, 21 Apr 2021 08:47:50 -0400
Subject: Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing
To: Leonard Crestez
Cc: Willem de Bruijn, Ilya Lesokhin, "David S. Miller", Eric Dumazet,
    Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, Wei Wang,
    Soheil Hassas Yeganeh, Roopa Prabhu, netdev@vger.kernel.org,
    linux-kernel@vger.kernel.org, Matt Mathis, Yuchung Cheng

On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez wrote:
>
> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
> in order to accumulate enough data" but linux almost never does that.
>
> Linux waits for probe_size + (1 + retries) * mss_cache to be available
> in the send buffer and if that condition is not met it will send anyway
> using the current MSS. The feature can be made to work by sending very
> large chunks of data from userspace (for example 128k) but for small writes
> on fast links probes almost never happen.
>
> This patch tries to implement the "MAY" by adding an extra flag
> "wait_data" to icsk_mtup which is set to 1 if a probe is possible but
> insufficient data is available.
> Then data is held back in tcp_write_xmit until a probe is sent, probing
> conditions are no longer met, or 500ms pass.
>
> Signed-off-by: Leonard Crestez
>
> ---
>  Documentation/networking/ip-sysctl.rst |  4 ++
>  include/net/inet_connection_sock.h     |  7 +++-
>  include/net/netns/ipv4.h               |  1 +
>  include/net/tcp.h                      |  2 +
>  net/ipv4/sysctl_net_ipv4.c             |  7 ++++
>  net/ipv4/tcp_ipv4.c                    |  1 +
>  net/ipv4/tcp_output.c                  | 54 ++++++++++++++++++++++++--
>  7 files changed, 71 insertions(+), 5 deletions(-)
>
> My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
>
> This patch makes the test pass quite reliably with
> ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
> before it only worked with much higher IPERF_LEN=256k
>
> In my loopback tests I also observed another issue when tcp_retries
> increases because of SACKReorder. This makes the original problem worse
> (since the retries amount factors in buffer requirement) and seems to be
> unrelated issue. Maybe when loss happens due to MTU shrinkage the sender
> sack logic is confused somehow?
>
> I know it's towards the end of the cycle but this is mostly just intended for
> discussion.

Thanks for raising the question of how to trigger PMTU probes more often!

AFAICT this approach would cause unacceptable performance impacts by
often injecting unnecessary 500ms delays when there is no need to do
so.

If the goal is to increase the frequency of PMTU probes, which seems
like a valid goal, I would suggest that we rethink the Linux heuristic
for triggering PMTU probes in the light of the fact that the loss
detection mechanism is now RACK-TLP, which provides quick recovery in
a much wider variety of scenarios.

After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:

   In addition, the timely loss detection algorithms in most protocols
   have pre-conditions that SHOULD be satisfied before sending a probe.

And we know that the "timely loss detection algorithms" have advanced
since this RFC was written in 2007.

You mention:
> Linux waits for probe_size + (1 + retries) * mss_cache to be available

The code in question seems to be:

  size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;

How about just changing this to:

  size_needed = probe_size + tp->mss_cache;

The rationale would be that if that amount of data is available, then
the sender can send one probe and one following current-mss-size
packet. If the path MTU has not increased to allow the probe of size
probe_size to pass through the network, then the following
current-mss-size packet will likely pass through the network, generate
a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
RACK reorder timer fires.

A secondary rationale for this heuristic would be: if the flow never
accumulates roughly two packets worth of data, then does the flow
really need a bigger packet size?

IMHO, just reducing the size_needed seems far preferable to needlessly
injecting 500ms delays.

best,
neal
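
For illustration only, the two size_needed thresholds discussed above
can be compared in a small standalone sketch. This is not kernel code:
the helper names (size_needed_current, size_needed_proposed, can_probe)
are hypothetical, and the single "queued bytes >= size_needed" test is
a simplifying assumption that ignores the other probing preconditions
(congestion window, receive window, and so on).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Threshold as quoted in the thread:
 *   size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache
 * Plain integers stand in for the tcp_sock fields (assumption). */
static uint32_t size_needed_current(uint32_t probe_size,
                                    uint32_t reordering,
                                    uint32_t mss_cache)
{
    assert(mss_cache > 0);
    return probe_size + (reordering + 1) * mss_cache;
}

/* Threshold suggested in the reply:
 *   size_needed = probe_size + tp->mss_cache
 * i.e. one probe plus one following current-MSS packet. */
static uint32_t size_needed_proposed(uint32_t probe_size,
                                     uint32_t mss_cache)
{
    assert(mss_cache > 0);
    return probe_size + mss_cache;
}

/* Simplified stand-in for the "enough data queued" check (assumption:
 * the real code has further conditions that are not modelled here). */
static bool can_probe(uint32_t queued_bytes, uint32_t size_needed)
{
    return queued_bytes >= size_needed;
}
```

With made-up numbers (probe_size 3000, mss_cache 1448, reordering 3),
the quoted formula needs 8792 queued bytes while the suggested one
needs 4448, so an 8192-byte write would satisfy only the latter.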