From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Eric Dumazet, Doug Porter, Soheil Hassas Yeganeh, Neal Cardwell, "David S. Miller", Sasha Levin
Subject: [PATCH 5.4 58/84] tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
Date: Wed, 4 May 2022 18:44:39 +0200
Message-Id: <20220504152931.892966382@linuxfoundation.org>
In-Reply-To: <20220504152927.744120418@linuxfoundation.org>
References: <20220504152927.744120418@linuxfoundation.org>

From: Eric Dumazet

[ Upstream commit 4bfe744ff1644fbc0a991a2677dc874475dd6776 ]

I had this bug sitting for too long in my pile, it is time to fix it.

Thanks to Doug Porter for reminding me of it!

We had various attempts in the past, including commit
0cbe6a8f089e ("tcp: remove SOCK_QUEUE_SHRUNK"),
but the issue is that TCP stack currently only generates
EPOLLOUT from input path, when tp->snd_una has advanced
and skb(s) cleaned from rtx queue.

If a flow has a big RTT, and/or receives SACKs, it is possible
that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
and no more data can be sent until tp->snd_una finally advances.

What is needed is to also check if POLLOUT needs to be generated
whenever tp->snd_nxt is advanced, from output path.

This bug triggers more often after an idle period, as
we do not receive ACK for at least one RTT. tcp_notsent_lowat
could be a fraction of what CWND and pacing rate would allow to
send during this RTT.

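The stall described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not kernel code: `struct flow_model`, `notsent_bytes()`, `should_wake_writer()` and `model_send()` are hypothetical names, and the wake-up test is reduced to comparing the unsent backlog (write_seq - snd_nxt) against a low-water mark. It only aims to show that sending data can flip the wake-up condition while snd_una stays put, which is why the check also needs to run from the output path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the sequence fields the commit
 * message refers to. This is NOT the layout of struct tcp_sock. */
struct flow_model {
	uint32_t snd_una;       /* oldest unacknowledged sequence number */
	uint32_t snd_nxt;       /* next sequence number to be sent */
	uint32_t write_seq;     /* end of the data queued by the application */
	uint32_t notsent_lowat; /* models tcp_notsent_lowat, in bytes */
};

/* Bytes queued by the application but not yet sent.
 * Unsigned subtraction stays correct across sequence wraparound. */
static uint32_t notsent_bytes(const struct flow_model *f)
{
	return f->write_seq - f->snd_nxt;
}

/* Assumption: the writer should be woken (EPOLLOUT) once the unsent
 * backlog drops below the low-water mark. */
static bool should_wake_writer(const struct flow_model *f)
{
	return notsent_bytes(f) < f->notsent_lowat;
}

/* Model of the output path: sending advances snd_nxt only; snd_una
 * does not move until an ACK arrives. */
static void model_send(struct flow_model *f, uint32_t len)
{
	uint32_t avail = notsent_bytes(f);

	f->snd_nxt += len < avail ? len : avail;
	/* The backlog can only shrink when data is sent. */
	assert(notsent_bytes(f) <= avail);
}
```

In this model, a flow with 600000 unsent bytes and a 500000-byte low-water mark does not qualify for a wake-up until `model_send()` runs, and it qualifies afterwards even though `snd_una` is unchanged. If the condition is only re-evaluated when an ACK advances `snd_una`, the writer sits idle for up to one RTT.
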
In a followup patch, I will remove the bogus call
to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
from tcp_check_space(). Fact that we have decided to generate
an EPOLLOUT does not mean the application has immediately
refilled the transmit queue. This optimistic call
might have been the reason the bug seemed not too serious.

Tested:

200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]

$ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
$ cat bench_rr.sh
SUM=0
for i in {1..10}
do
 V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
 echo $V
 SUM=$(($SUM + $V))
done
echo SUM=$SUM

Before patch:
$ bench_rr.sh
130000000
80000000
140000000
140000000
140000000
140000000
130000000
40000000
90000000
110000000
SUM=1140000000

After patch:
$ bench_rr.sh
430000000
590000000
530000000
450000000
450000000
350000000
450000000
490000000
480000000
460000000
SUM=4680000000  # This is 410 % of the value before patch.

Fixes: c9bee3b7fdec ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet
Reported-by: Doug Porter
Cc: Soheil Hassas Yeganeh
Cc: Neal Cardwell
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
---
 include/net/tcp.h     |  1 +
 net/ipv4/tcp_input.c  | 12 +++++++++++-
 net/ipv4/tcp_output.c |  1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b686a21a8593..9237362e5606 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -603,6 +603,7 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
 void tcp_reset(struct sock *sk);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 void tcp_fin(struct sock *sk);
+void tcp_check_space(struct sock *sk);
 
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c0fcfa296468..f84047aec63c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5230,7 +5230,17 @@ static void tcp_new_space(struct sock *sk)
 	sk->sk_write_space(sk);
 }
 
-static void tcp_check_space(struct sock *sk)
+/* Caller made space either from:
+ * 1) Freeing skbs in rtx queues (after tp->snd_una has advanced)
+ * 2) Sent skbs from output queue (and thus advancing tp->snd_nxt)
+ *
+ * We might be able to generate EPOLLOUT to the application if:
+ * 1) Space consumed in output/rtx queues is below sk->sk_sndbuf/2
+ * 2) notsent amount (tp->write_seq - tp->snd_nxt) became
+ *    small enough that tcp_stream_memory_free() decides it
+ *    is time to generate EPOLLOUT.
+ */
+void tcp_check_space(struct sock *sk)
 {
 	if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
 		sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 139e962d1aef..67493ec6318a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -81,6 +81,7 @@ static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb)
 
 	NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPORIGDATASENT,
 		      tcp_skb_pcount(skb));
+	tcp_check_space(sk);
 }
 
 /* SND.NXT, if window was not shrunk or the amount of shrunk was less than one
-- 
2.35.1
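
The benchmark totals quoted in the commit message can be cross-checked with a few lines of C. This is only a sketch: the per-run LOCAL_BYTES_SENT values are copied from the message, while `sum_runs()` and `percent_of()` are hypothetical helper names introduced for illustration. The percentage is computed with integer division, on the assumption that the "410 %" in the message is a truncated figure.

```c
#include <stddef.h>
#include <stdint.h>

/* LOCAL_BYTES_SENT per netperf run, copied from the commit message. */
static const uint64_t before_patch[] = {
	130000000, 80000000, 140000000, 140000000, 140000000,
	140000000, 130000000, 40000000, 90000000, 110000000,
};

static const uint64_t after_patch[] = {
	430000000, 590000000, 530000000, 450000000, 450000000,
	350000000, 450000000, 490000000, 480000000, 460000000,
};

/* Sum n per-run byte counts. */
static uint64_t sum_runs(const uint64_t *runs, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += runs[i];
	return sum;
}

/* "after" expressed as a whole-number percentage of "before"
 * (truncated, not rounded). */
static uint64_t percent_of(uint64_t after, uint64_t before)
{
	return after * 100 / before;
}
```

Calling `percent_of(sum_runs(after_patch, 10), sum_runs(before_patch, 10))` reproduces the ratio between the two SUM= lines.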