Received: by 2002:a05:6a10:206:0:0:0:0 with SMTP id 6csp25307pxj; Wed, 26 May 2021 15:05:56 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwkF5Y3V7vD/9lrvwKPJEyE3bOmhkk/GwMwOrkETxgGrwHVYkJ7kT9BnsZ5BLEawQiQ29YS X-Received: by 2002:a05:6638:a2c:: with SMTP id 12mr316662jao.99.1622066756156; Wed, 26 May 2021 15:05:56 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1622066756; cv=none; d=google.com; s=arc-20160816; b=OrvgjLHsJG8oQCP/QrXTjf2M/Z8jHXseptSp9AY5SYgS2yUIGi5pBnSfe4NGsGLTcj 9YJ5f+VjkojW/PafHA8adAwIN4ILgsjkE0I218mfOJDrKs7zZ8xDRtSHfczID7x1+Pk5 enPV+NhwzFhx0EZoUc/RFCg+CTbP/sS9BA471hOcfmuixDA+FbTy3eJmJ49ggCJVkFo5 DHK9G6aXDC7F1sqK+CbmSYT4U02UOPvsjxNWGCU5cmYeXPt6PLstWJbNBECZjDNKk8cK GBGMwk46eZIXWLxhTyAzStbHQMqla6lpRnFZwCBRNbfQ5cfPKEHHNvuUb80V+WS5uklZ XfTQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=JFMiQ1oT/nQBQJWxoBwnG19CNT17KDuyVJC9NnQkfeE=; b=d5VvQnMnp+cdQ2oN94Q5/nBBwhGIpgYhKeWbRse19T+7K0IllHCWB8B50dDRUMvfBG WcBfU9sOGenBE+R2aai1t96hTSuInuger8d7Sl4hjxiuNZdKMzgzFHKtb/SlcDeObB/q SY7TtOLmTkWIZGZbKK08+NiKeq9JgCBaQT4aNVJaCO7ggNQKQhUDR/gxAnqb6/FWoMZV /jC4p2FAhC8yWI/kH67ojWi5OrG//jDCh8ePfxoTkXzV/Ndx8jB067F32usoLUbLSta2 PJOvGW7PVsKCJHHPonIFHR9K/HZZ/KTfp0X9S9+n9Ibl4WwSeNLL4vnisXJynV87g0P0 7n5g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=soGJnD66; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id j10si265042ilk.31.2021.05.26.15.05.42; Wed, 26 May 2021 15:05:56 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=soGJnD66; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234860AbhEZOgx (ORCPT + 99 others); Wed, 26 May 2021 10:36:53 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34296 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234776AbhEZOgp (ORCPT ); Wed, 26 May 2021 10:36:45 -0400 Received: from mail-yb1-xb2a.google.com (mail-yb1-xb2a.google.com [IPv6:2607:f8b0:4864:20::b2a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2258CC061756 for ; Wed, 26 May 2021 07:35:13 -0700 (PDT) Received: by mail-yb1-xb2a.google.com with SMTP id o18so2320802ybc.8 for ; Wed, 26 May 2021 07:35:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=JFMiQ1oT/nQBQJWxoBwnG19CNT17KDuyVJC9NnQkfeE=; b=soGJnD66WxbUEW5De3S4mUIIu2isKw3F6DMPY+JrtIUdSOhVv2nEV3r+Q23nQwTDIZ IJs29pCzl0ap3ZE1MHQ0SJjYX/8E6KpYjRtVyEXKKLC+znmr1+6R/joRIyFw9udNYd62 No591Otf1NcbHlycSAtbyNwYACzsaNfq13pEVLgClxBaPVbE85Eho4/JQAMHYk/nRGKo QCzoZNU7pUFwnQswm0SzNbBt0chqPwRQcKgmvdHmtEiDOlEaDTKADcGwXFWbJFiARTtq tMaUt/LEJQVsQISCyQ8r/q5/uRe+ZxXPisjYip4B5Wt003o1DIXUcgVrVzUnoV8w9YM0 gq1g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=JFMiQ1oT/nQBQJWxoBwnG19CNT17KDuyVJC9NnQkfeE=; b=DA7DbIKq6Sj5aE6Ca3K7TybiTT1oIGsqpZ0zFXB4T0RailVY0iB+/upfZFSXEHlUQh TUbh8JklZZdD5l6xlBYw1GfwLUHFpZ2xt1UCjqJAnd2nJh3TTSUXz4s7+LDjDDrWOUCu 2kdBqAcIy3twU3suBbOIpJUhygQyU2tu00DrAzl1Nr0qM/cA5ROBgn7qy3l7Xt+HlmQW rEdRU1k/7o7ABK4lY5uxflPAQo3ZGZqPQdLpozgES1Hk8O3/0lrCF3Bn10PNT+cq93aK eMwTFmbTafGEktxoyTtYSAq0h8QSCQFpkdWLOGYU2zfuetjCfVSd9yLE5RhXL1J/QdjW Qfiw== X-Gm-Message-State: AOAM533xnh1AZXjVTDur6wgXxFjteHdyg8uhDVUrQZdDfs2BuSWPxHRS CISWQWOrDCmti+7gurY7gJPoZWkufSyybo+mmvPhXA== X-Received: by 2002:a25:5ec2:: with SMTP id s185mr49604119ybb.303.1622039711946; Wed, 26 May 2021 07:35:11 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Eric Dumazet Date: Wed, 26 May 2021 16:35:00 +0200 Message-ID: Subject: Re: [RFCv2 3/3] tcp: Wait for sufficient data in tcp_mtu_probe To: Leonard Crestez Cc: Neal Cardwell , Matt Mathis , "David S. Miller" , Willem de Bruijn , Jakub Kicinski , Hideaki YOSHIFUJI , David Ahern , John Heffner , Leonard Crestez , Soheil Hassas Yeganeh , Roopa Prabhu , netdev , LKML Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, May 26, 2021 at 12:38 PM Leonard Crestez wrote: > > According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes > in order to accumulate enough data" but linux almost never does that. > > Implement this by returning 0 from tcp_mtu_probe if not enough data is > queued locally but some packets are still in flight. This makes mtu > probing more likely to happen for applications that do small writes. > > Only doing this if packets are in flight should ensure that writing will > be attempted again later. This is similar to how tcp_mtu_probe already > returns zero if the probe doesn't fit inside the receiver window or the > congestion window. > > Control this with a sysctl because this implies a latency tradeoff but > only up to one RTT. > > Signed-off-by: Leonard Crestez > --- > Documentation/networking/ip-sysctl.rst | 5 +++++ > include/net/netns/ipv4.h | 1 + > net/ipv4/sysctl_net_ipv4.c | 7 +++++++ > net/ipv4/tcp_ipv4.c | 1 + > net/ipv4/tcp_output.c | 18 ++++++++++++++---- > 5 files changed, 28 insertions(+), 4 deletions(-) > > diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst > index 7ab52a105a5d..967b7fac35b1 100644 > --- a/Documentation/networking/ip-sysctl.rst > +++ b/Documentation/networking/ip-sysctl.rst > @@ -349,10 +349,15 @@ tcp_mtu_probe_floor - INTEGER > If MTU probing is enabled this caps the minimum MSS used for search_low > for the connection. > > Default : 48 > > +tcp_mtu_probe_waitdata - BOOLEAN > + Wait for enough data for an mtu probe to accumulate on the sender. > + > + Default: 1 > + > tcp_mtu_probe_rack - BOOLEAN > Try to use shorter probes if RACK is also enabled > > Default: 1 > > diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h > index b4ff12f25a7f..366e7b325778 100644 > --- a/include/net/netns/ipv4.h > +++ b/include/net/netns/ipv4.h > @@ -112,10 +112,11 @@ struct netns_ipv4 { > #ifdef CONFIG_NET_L3_MASTER_DEV > u8 sysctl_tcp_l3mdev_accept; > #endif > u8 sysctl_tcp_mtu_probing; > int sysctl_tcp_mtu_probe_floor; > + int sysctl_tcp_mtu_probe_waitdata; If this is a boolean, you should use u8, and place this field to avoid adding a hole. > int sysctl_tcp_mtu_probe_rack; > int sysctl_tcp_base_mss; > int sysctl_tcp_min_snd_mss; > int sysctl_tcp_probe_threshold; > u32 sysctl_tcp_probe_interval; > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > index 275c91fb9cf8..53868b812958 100644 > --- a/net/ipv4/sysctl_net_ipv4.c > +++ b/net/ipv4/sysctl_net_ipv4.c > @@ -847,10 +847,17 @@ static struct ctl_table ipv4_net_table[] = { > .mode = 0644, > .proc_handler = proc_dointvec_minmax, > .extra1 = &tcp_min_snd_mss_min, > .extra2 = &tcp_min_snd_mss_max, > }, > + { > + .procname = "tcp_mtu_probe_waitdata", > + .data = &init_net.ipv4.sysctl_tcp_mtu_probe_waitdata, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dointvec, If this is a boolean, please use proc_dou8vec_minmax, and SYSCTL_ZERO/SYSCTL_ONE > + }, > { > .procname = "tcp_mtu_probe_rack", > .data = &init_net.ipv4.sysctl_tcp_mtu_probe_rack, > .maxlen = sizeof(int), > .mode = 0644, > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index ed8af4a7325b..940df2ae4636 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -2892,10 +2892,11 @@ static int __net_init tcp_sk_init(struct net *net) > net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS; > net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS; > net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD; > net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL; > net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS; > + net->ipv4.sysctl_tcp_mtu_probe_waitdata = 1; > net->ipv4.sysctl_tcp_mtu_probe_rack = 1; > > net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME; > net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES; > net->ipv4.sysctl_tcp_keepalive_intvl = TCP_KEEPALIVE_INTVL; > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 362f97cfb09e..268e1bac001f 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -2394,14 +2394,10 @@ static int tcp_mtu_probe(struct sock *sk) > */ > tcp_mtu_check_reprobe(sk); > return -1; > } > > - /* Have enough data in the send queue to probe? */ > - if (tp->write_seq - tp->snd_nxt < size_needed) > - return -1; > - > /* Can probe fit inside congestion window? */ > if (packets_needed > tp->snd_cwnd) > return -1; > > /* Can probe fit inside receiver window? If not then skip probing. > @@ -2411,10 +2407,24 @@ static int tcp_mtu_probe(struct sock *sk) > * clear below. > */ > if (tp->snd_wnd < size_needed) > return -1; > > + /* Have enough data in the send queue to probe? */ > + if (tp->write_seq - tp->snd_nxt < size_needed) { > + /* If packets are already in flight it's safe to wait for more data to > + * accumulate on the sender because writing will be triggered as ACKs > + * arrive. > + * If no packets are in flight returning zero can stall. > + */ > + if (net->ipv4.sysctl_tcp_mtu_probe_waitdata && I have serious doubts about RPC traffic. Adding one RTT latency is going to make this unlikely to be used. > + tcp_packets_in_flight(tp)) > + return 0; > + else > + return -1; > + } > + > /* Do we need for more acks to clear the receive window? */ > if (after(tp->snd_nxt + size_needed, tcp_wnd_end(tp))) > return 0; > > /* Do we need the congestion window to clear? */ > -- > 2.25.1 >