References: <20230517124201.441634-1-imagedong@tencent.com> <20230517124201.441634-4-imagedong@tencent.com>
From: Menglong Dong
Date: Thu, 18 May 2023 22:11:51 +0800
Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly
To: Neal Cardwell
Cc: Eric Dumazet, kuba@kernel.org, davem@davemloft.net, pabeni@redhat.com, dsahern@kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Menglong Dong, Yuchung Cheng
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, May 18, 2023 at 9:40 PM Neal Cardwell wrote:
>
> On Wed, May 17, 2023 at 10:35 PM Menglong Dong wrote:
> >
> > On Wed, May 17, 2023 at 10:47 PM Eric Dumazet wrote:
> > >
> > > On Wed, May 17, 2023 at 2:42 PM wrote:
> > > >
> > > > From: Menglong Dong
> > > >
> > > > Window shrink is not allowed and also not handled for now, but it's
> > > > needed in some cases.
> > > >
> > > > In the original logic, the 0 probe is triggered only when there is no
> > > > data in the retrans queue and the receive window can't hold the data
> > > > of the first packet in the send queue.
> > > >
> > > > Now, let's change it and trigger the 0 probe in such cases:
> > > >
> > > > - if the retrans queue has data and the first packet in it is not within
> > > >   the receive window
> > > > - no data in the retrans queue and the first packet in the send queue is
> > > >   beyond the end of the receive window
> > >
> > > Sorry, I do not understand.
> > >
> > > Please provide packetdrill tests for new behavior like that.
> > >
> >
> > Yes. The problem can be reproduced easily.
> >
> > 1. choose a server machine, decrease its tcp_mem with:
> >     echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> > 2. call listen() and accept() on a port, such as 8888. We call
> >    accept() in a loop without calling recv(), so that the data stays
> >    in the receive queue.
> > 3. choose a client machine, and create 100 TCP connections
> >    to port 8888 of the server. Then, every connection sends
> >    about 1M of data.
> > 4. we can see that some of the connections enter the 0-probe
> >    state, but some of them keep retransmitting again and again, as
> >    the server hits tcp_mem[2] and the skb is dropped before
> >    the recv_buf is full and the connection enters the 0-probe state.
> >    Finally, some of these connections will time out and break.
> >
> > With this series, all the 100 connections will enter 0-probe
> > status and connection break won't happen. And the data
> > trans will recover if we increase tcp_mem or call 'recv()'
> > on the sockets in the server.
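
For scale, the tcp_mem setting in step 1 can be compared with the load in
step 3. tcp_mem is measured in pages; the sketch below assumes 4 KiB pages
and reads "1M" as 1 MiB per connection (both are assumptions, not stated in
the thread), and it ignores per-skb overhead, so the numbers are only rough.

```python
# Rough arithmetic for the reproduction steps; not a measurement.
PAGE_SIZE = 4096                  # assumption: 4 KiB pages
TCP_MEM = (1024, 1500, 2048)      # min, pressure, max from step 1, in pages
CONNECTIONS = 100                 # from step 3
BYTES_PER_CONNECTION = 1 << 20    # assumption: "1M" means 1 MiB

# tcp_mem[2] is the hard limit shared by all TCP sockets on the host.
hard_limit_bytes = TCP_MEM[2] * PAGE_SIZE
# Total payload the clients try to push into the server.
offered_bytes = CONNECTIONS * BYTES_PER_CONNECTION
# What each connection would get if the limit were split evenly.
fair_share_bytes = hard_limit_bytes // CONNECTIONS

print(f"tcp_mem[2] hard limit: {hard_limit_bytes / 2**20:.1f} MiB")
print(f"data offered by clients: {offered_bytes / 2**20:.1f} MiB")
print(f"even share per connection: {fair_share_bytes / 1024:.1f} KiB")
```

Under these assumptions the clients offer far more than the global limit
allows, which is consistent with step 4, where skbs are dropped before the
per-socket receive buffer fills.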
> >
> > > Also, such fundamental change would need IETF discussion first.
> > > We do not want linux to cause network collapses just because billions
> > > of devices send more zero probes.
> >
> > I think it may be a good idea to make the connection enter
> > 0-probe, rather than drop the skb silently. What 0-probe
> > means is to wait for space to become available when the buffer of
> > the receive queue is full. And maybe we can also use 0-probe
> > when the "buffer" of "TCP protocol" (which means tcp_mem)
> > is full?
> >
> > Am I right?
> >
> > Thanks!
> > Menglong Dong
>
> Thanks for describing the scenario in more detail. (Some kind of
> packetdrill script or other program to reproduce this issue would be
> nice, too, as Eric noted.)
>
> You mention in step (4.) above that some of the connections keep
> retransmitting again and again. Are those connections receiving any
> ACKs in response to their retransmissions? Perhaps they are receiving
> dupacks?

Actually, these packets are dropped without any reply, not even dupacks.
The skb is dropped directly when tcp_try_rmem_schedule() fails in
tcp_data_queue(). That's reasonable: replying with an ACK would be
useless, since it would only make the sender fast-retransmit the packet,
and because we are out of memory now, retransmission can't solve the
problem.

> If so, then perhaps we could solve this problem without
> depending on a violation of the TCP spec (which says the receive
> window should not be retracted) in the following way: when a data
> sender suffers a retransmission timeout, and retransmits the first
> unacknowledged segment, and receives a dupack for SND.UNA instead of
> an ACK covering the RTO-retransmitted segment, then the data sender
> should estimate that the receiver doesn't have enough memory to buffer
> the retransmitted packet. In that case, the data sender should enter
> the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
> call tcp_probe_timer().
>
> Basically we could try to enhance the sender-side logic to try to
> distinguish between two kinds of problems:
>
> (a) Repeated data packet loss caused by congestion, routing problems,
> or connectivity problems. In this case, the data sender uses
> ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
> retries sysctl_tcp_retries2 times before timing out the connection.
>
> (b) A receiver that is repeatedly sending dupacks but not ACKing
> retransmitted data because it doesn't have any memory. In this case,
> the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
> off but keeps retrying as long as the data sender receives ACKs.
>

I'm not sure this is an ideal method, as it may not be rigorous to
conclude from dupacks alone that the receiver is OOM. A single packet
loss can also cause multiple dupacks.

Thanks!
Menglong Dong

> AFAICT that would be another way to reach the happy state you mention:
> "all the 100 connections will enter 0-probe status and connection
> break won't happen", and we could reach that state without violating
> the TCP protocol spec and without requiring changes on the receiver
> side (so that this fix could help in scenarios where the
> memory-constrained receiver is an older stack without special new
> behavior).
>
> Eric, Yuchung, Menglong: do you think something like that would work?
>
> neal
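
The (a)/(b) distinction proposed above can be sketched as a toy state
model. This is only an illustration in Python, not kernel code: the
Sender class, on_rto_result() and the Timer enum are hypothetical names,
the retry limit is a plain counter standing in for sysctl_tcp_retries2,
and sequence-number wraparound is ignored.

```python
from dataclasses import dataclass
from enum import Enum


class Timer(Enum):
    RETRANS = "ICSK_TIME_RETRANS"  # case (a), tcp_retransmit_timer()
    PROBE0 = "ICSK_TIME_PROBE0"    # case (b), tcp_probe_timer()


@dataclass
class Sender:
    snd_una: int                   # first unacknowledged sequence number
    retries2: int = 15             # stand-in for sysctl_tcp_retries2
    timer: Timer = Timer.RETRANS
    retrans_count: int = 0
    alive: bool = True

    def on_rto_result(self, ack_seq):
        """Record what came back after one RTO retransmission.

        ack_seq is None when nothing was received, otherwise the
        acknowledgment number carried by the returned ACK.
        """
        if ack_seq is None:
            # Case (a): silence is treated as loss; retry a bounded
            # number of times, then give up on the connection.
            self.timer = Timer.RETRANS
            self.retrans_count += 1
            if self.retrans_count > self.retries2:
                self.alive = False
        elif ack_seq == self.snd_una:
            # Case (b): a dupack for SND.UNA; guess that the receiver
            # is out of memory and keep probing while ACKs arrive.
            self.timer = Timer.PROBE0
            self.retrans_count = 0
        else:
            # The retransmitted segment was acknowledged: progress.
            self.snd_una = ack_seq
            self.timer = Timer.RETRANS
            self.retrans_count = 0
        return self.timer


if __name__ == "__main__":
    silent = Sender(snd_una=1000, retries2=2)
    for _ in range(3):
        silent.on_rto_result(None)
    print(silent.timer.value, silent.alive)
```

In this model a receiver that drops the skb without replying, as described
for tcp_data_queue() above, never produces a dupack, so the sender stays in
case (a) and eventually times out. Case (b) is reached only if the receiver
answers the retransmission with a dupack.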