Received: by 2002:a05:6a10:2726:0:0:0:0 with SMTP id ib38csp1874524pxb; Wed, 30 Mar 2022 11:41:04 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzHUblX3VC/ckqq2aSbZGE0gSTG8iG9C81HWukiS3tWgs2M1jKqs68RRNV4i78bwP90Pqu3 X-Received: by 2002:a17:907:8a1e:b0:6e0:fec7:5ba1 with SMTP id sc30-20020a1709078a1e00b006e0fec75ba1mr1026406ejc.237.1648665663618; Wed, 30 Mar 2022 11:41:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1648665663; cv=none; d=google.com; s=arc-20160816; b=W+H5ah4NJcr7b9w33RUr+BCifWEnfjr0rwgEsSAftv4gOK+QXY0eMHEIwIQBajYPBc zuCvD2wZtgw9j+MmG27opjzvhsm9sScmaj7OldwPRwLAk/mt9J0Fa7YPxeK3OoHa4ueN vs3CjMByWn4iHCgoczhe8uOVXc+z94LbQSn06I6EhVLE4iE8yWCOn77/icwIxYZk51n2 Q0beck4+062YT7GEcTa+Rz2kmoZk4t8pGoou1Pid142RQ7N4t2MjYb6bCvQ32/DeUk+s 81ooZC5Exx9cpK9sZOT8f+VNnNe0V3qxMfegSh9J4yTXcFtzUrm6eD0PoFMBAoyBqYZX NiGA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=vW+hbyk4J/zcF4AgSfPOyw6F0EA04UzRtTKR2+Wu8Pc=; b=1K3EB4Q3kH1yd2RxHGLW7fqJtfl/3XLxzTxLqRa1897hsP0xomOf9wKzhsTs/0OYYT JDzEnvyHm87YcfMx+t/hFpJh5ODu42i7j6nFKiuP/sxGSZT2m6RE6DuW0pHxRARMtQ46 sAYYXCfSBQMHBFbXB0MoJwObqfb98BQgOtWZv163FejDWFMkLHt0DomvELi27Y6Etyh+ Mbqu+vamtYHnsWVaRSJmt4p99Ax38xDc0D8E33XHpK8SPC8fSkPCePenScomRpbLRazr aUlUKxXjRZYvMxe3o3lWwloXk4P7nsS8TBvp5ieQvssnBFAeIYC7yek23/3OqLy/IJON Py/A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=ULB1pYYD; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id d20-20020a1709063ed400b006df76385e63si21225423ejj.771.2022.03.30.11.40.37; Wed, 30 Mar 2022 11:41:03 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=ULB1pYYD; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231689AbiC3Cmc (ORCPT + 99 others); Tue, 29 Mar 2022 22:42:32 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53066 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229946AbiC3Cma (ORCPT ); Tue, 29 Mar 2022 22:42:30 -0400 Received: from mail-yb1-xb34.google.com (mail-yb1-xb34.google.com [IPv6:2607:f8b0:4864:20::b34]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4F04EF212A for ; Tue, 29 Mar 2022 19:40:45 -0700 (PDT) Received: by mail-yb1-xb34.google.com with SMTP id e81so5665668ybf.13 for ; Tue, 29 Mar 2022 19:40:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=vW+hbyk4J/zcF4AgSfPOyw6F0EA04UzRtTKR2+Wu8Pc=; b=ULB1pYYD4leOqjCkyU+/DAlIHxfJJVDibwWb229gXdIbV+47IHRb9USAKV4J/kPRif jQJHtG89hfFWPyghXNsiEqTLxdFNDSK4pYmZ2BWTQjJkczTaeoibEJZNkW+hqIfAY875 3zcmjq5n2qOZ/MU9IHey576EeCi6YONvBN2xwIhBNdq2EFI5RjiJXpwrNxvjTKVf2+vY 5IJmYj0XX4naGeHBegIF1x7hvzdEiLBHZqr9f2J5T4rV69soTrmP2y79+mFQnyLlGOPG 9/th3DO62OyVbcWp5T7bgxyHp7XnIf6ZqCXlILHdGeDgFxlB2mkrHzcnG1mgea9qzs8a OznQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=vW+hbyk4J/zcF4AgSfPOyw6F0EA04UzRtTKR2+Wu8Pc=; b=idRHyWgGId6g3/TGK4AadbouQxd0i0DQ8kP4vZqW+WUyQdkptPXUmfDwe6HjT9YEIL 7YU7bElC1NModP5qZyrp3+hfy99Fhn4FAnQBJV5CTzl6j1yT58SJWfK00qu9047kDIJD k78Xk5xWEd1tKj5sHCAQE2I5C7Wd1n7oF6mIkVNuTQMA+eVtHKXjIpvznduEGmvDLeGV JKVOfTd3AYVUHO6QpdNrYORBLnmkM4EPcM1iPlpCro+giQ8c6dkj6d9JJxRjRRhI76OT Qcct9dUtHx7sKPfp3mRoyPCRSnTGlnWtU0eb/Xb4K94X2qTs0s7k2dK0Msp00Q3+eYaG jIIA== X-Gm-Message-State: AOAM531/Ixjm5ivqHd4BhTm8DW0ui8tzXg8ZG31tdGZvYvrdTtuhzxYo 2QzyyUv/0gkBgzViAtIZDq9kneAzrigbKtwM0JCcM0sxGS/saQ== X-Received: by 2002:a05:6902:72a:b0:634:6843:499c with SMTP id l10-20020a056902072a00b006346843499cmr32583948ybt.36.1648608033765; Tue, 29 Mar 2022 19:40:33 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Eric Dumazet Date: Tue, 29 Mar 2022 19:40:22 -0700 Message-ID: Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections To: Neal Cardwell Cc: Jaco , LKML , Netdev , Yuchung Cheng Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 29, 2022 at 7:01 PM Neal Cardwell wrote: > > On Tue, Mar 29, 2022 at 9:03 PM Jaco wrote: > > > > Dear All, > > > > I'm seeing very strange TCP behaviour. Disabled TCP Segmentation Offlo= ad to > > try and pinpoint this more closely. > > > > It seems the kernel is ignoring ACKs coming from the remote side in som= e cases. > > In this case, on one of four hosts, and seemingly between this one host= and > > Google ... (We've have two emails to google stuck on another host due t= o same > > issue, but several hundred others passed out today on that same host). = I also > > killed selective ACKs as a test as these are known to sometimes cause i= ssues > > for firewalls and "tcp accelerators" (or used to at the very least). > > > > SMTP connection between ourselves and Google ... I'm going to be select= ive in > > copying from tcpdump (full coversation up to the point where I killed i= t > > because it plainly got stuck in a loop is attached). > > > > Connection setup: > > > > 00:56:17.055481 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [S], seq 956633779, win 62580, options [mss 8940,no= p,nop,TS val 3687705482 ecr 0,nop,wscale 7,tfo cookie f025dd84b6122510,nop= ,nop], length 0 > > > > 00:56:17.217747 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [S.], seq 726465675, ack 956633780, win 65535, opti= ons [mss 1440,nop,nop,TS val 3477429218 ecr 3687705482,nop,wscale 8], lengt= h 0 > > > > 00:56:17.218628 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [P.], seq 726465676:726465760, ack 956633780, win 2= 56, options [nop,nop,TS val 3477429220 ecr 3687705482], length 84: SMTP: 22= 0 mx.google.com ESMTP e16-20020a05600c4e5000b0038c77be9b2dsi226281wmq.72 - = gsmtp > > > > 00:56:17.218663 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], ack 726465760, win 489, options [nop,nop,TS va= l 3687705645 ecr 3477429220], length 0 > > > > This is pretty normal, we advertise an MSS of 8940 and the return is 14= 40, thus > > we shouldn't send segments larger than that, and they "can't". I need = to > > determine if this is some form of offloading or they really are sending= >1500 > > byte frames (which I know won't pass our firewalls without fragmentatio= n so > > probably some form of NIC offloading - which if it was active on older = 5.8 > > kernels did not cause problems): > > > > 00:56:17.709905 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [P.], seq 726465979:726468395, ack 956634111, win 2= 61, options [nop,nop,TS val 3477429710 ecr 3687705973], length 2416: SMTP > > > > 00:56:17.709906 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [P.], seq 726468395:726470811, ack 956634111, win 2= 61, options [nop,nop,TS val 3477429710 ecr 3687705973], length 2416: SMTP > > > > These are the only two frames I can find that supposedly exceeds the MS= S values > > (although, they don't exceed our value). > > > > Then everything goes pretty normal for a bit. The last data we receive= from > > the remote side before stuff goes wrong: > > > > 00:56:18.088725 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [P.], seq 726471823:726471919, ack 956634348, win 2= 61, options [nop,nop,TS val 3477430089 ecr 3687706330], length 96: SMTP > > > > We ACK immediately along with the next segment: > > > > 00:56:18.088969 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq 956634348:956635776, ack 726471919, win 44= 6, options [nop,nop,TS val 3687706515 ecr 3477430089], length 1428: SMTP > > > > Hereafter there is a flurry of data that we transmit, all nicely acknow= ledged, > > no retransmits that I can pick up (eyeballs). > > > > Before a long sequence of TX data we get this ACK: > > > > 00:56:18.576247 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [.], ack 956700036, win 774, options [nop,nop,TS va= l 3477430577 ecr 3687706840], length 0 > > > > We then continue to RX a sequence of: > > > > 00:56:18.576300 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq 956745732:956747160, ack 726471919, win 44= 6, options [nop,nop,TS val 3687707002 ecr 3477430577], length 1428: SMTP > > > > up to: > > > > 00:56:18.577031 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [P.], seq 956778576:956780004, ack 726471919, win 4= 46, options [nop,nop,TS val 3687707003 ecr 3477430577], length 1428: SMTP > > > > Before we hit our first retransmit: > > > > 00:56:18.960078 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq 956700036:956701464, ack 726471919, win 44= 6, options [nop,nop,TS val 3687707386 ecr 3477430577], length 1428: SMTP > > > > Since 956700036 is the last ACKed data, this seems correct, not sure wh= at timer > > this is based on though, the ACK for the just prior data came in ~384ms= prior > > (could be based on normal time to ACK, I don't know, this is about doub= le the > > usual round-trip-time currently). > > > > And then we receive this ACK (we can see this time the kernel waited fo= r ACK of > > this single segment): > > > > 00:56:19.126678 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [.], ack 956701464, win 785, options [nop,nop,TS va= l 3477431127 ecr 3687707386], length 0 > > > > Then we do something (in my opinion) strange by jumping back to the tai= l of the previous burst: > > > > 00:56:19.126735 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq 956780004:956781432, ack 726471919, win 44= 6, options [nop,nop,TS val 3687707553 ecr 3477431127], length 1428: SMTP > > > > 00:56:19.126751 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq 956781432:956782860, ack 726471919, win 44= 6, options [nop,nop,TS val 3687707553 ecr 3477431127], length 1428: SMTP > > > > We then jump back and retransmit again from the just received ACK: > > > > 00:56:19.510078 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq 956701464:956702892, ack 726471919, win 44= 6, options [nop,nop,TS val 3687707936 ecr 3477431127], length 1428: SMTP > > > > We then continue from there on as I'd expect (slow restart), this goes = pretty > > normal up to: > > > > 00:56:19.997088 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [.], ack 956708604, win 841, options [nop,nop,TS va= l 3477431998 ecr 3687708261], length 0 > > > > 00:56:19.997148 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq 956708604:956710032, ack 726471919, win 44= 6, options [nop,nop,TS val 3687708423 ecr 3477431998], length 1428: SMTP > > > > 00:56:20.262683 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS va= l 3477432263 ecr 3687708423], length 0 > > > > Up to here is fine, now things gets bizarre, we just jump to a differen= t > > sequence number, which has already been ACKed: > > > > 00:56:20.380076 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq *956707176*:956708604, ack 726471919, win = 446, options [nop,nop,TS val 3687708806 ecr 3477431998], length 1428: SMTP > > > > 00:56:20.542356 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS va= l 3477432543 ecr 3687708423], length 0 > > > > And remote side re-ACKs the 956710032 value, which frankly indicates we= need to > > realize that the data we are transmitting has already been received, an= d we can > > continue on to transmit the segments following up on sequence number 95= 6710032, > > instead we choose to get stuck in this sequence: > > > > 00:56:21.180080 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq 956707176:956708604, ack 726471919, win 44= 6, options [nop,nop,TS val 3687709606 ecr 3477431998], length 1428: SMTP > > > > 00:56:21.342347 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS va= l 3477433343 ecr 3687708423], length 0 > > > > 00:56:22.780101 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq 956707176:956708604, ack 726471919, win 44= 6, options [nop,nop,TS val 3687711206 ecr 3477431998], length 1428: SMTP > > > > 00:56:22.942346 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff= :feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS va= l 3477434943 ecr 3687708423], length 0 > > > > And here the connection dies. It eventually times out, and we retry to= the > > next host, resulting in the same problem. > > > > I am aware that Google is having congestion issues in the JHB area in S= A > > currently, and there are probably packet delays and losses somewhere al= ong the > > line between us, but this really should not stall as dead as it is here= . > > > > Looking at only the incoming ACK values, I can see they are strictly > > increasing, so we've never received an ACK > 956710032, but this is sti= ll > > greater than the value we are retransmitting. > > It could be that ACK packets have a wrong checksum, after some point is reached (some bug in a firewall/middlebox) "tcpdump -v" will tell you something about checksum errors. And/or "nstat -az | grep TcpInCsumError" Also, packets could be dropped in a layer like netfilter. Make sure you do not have a rule rate limiting flows, or something like tha= t. > > The first time we transmitted the frame at sequence number 956707176 wa= s part > > of the longest sequence of TX frames without a returning ACK, part of t= his > > sequence: > > > > ... > > > > 00:56:18.414299 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq 956705748:956707176, ack 726471919, win 44= 6, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP > > > > 00:56:18.414302 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [P.], seq 956707176:956708604, ack 726471919, win 4= 46, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP > > > > 00:56:18.414316 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450= :400c:c07::1b.25: Flags [.], seq 956708604:956710032, ack 726471919, win 44= 6, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP > > > > ... > > > > Google here is ACKing not only the frame we are continuously retransmit= ting, > > but also the frame directly after ... so why would the kernel not move = on to > > retransmitting starting from sequence number 956710032 (which is larger= than > > the start sequence number of the frame we are retransmitting)? > > > > Kind Regards, > > Jaco > > Thanks for the report! I have CC-ed the netdev list, since it is > probably a better forum for this discussion. > > Can you please attach (or link to) a tcpdump raw .pcap file (produced > with the -w flag)? There are a number of tools that will make this > easier to visualize and analyze if we can see the raw .pcap file. You > may want to anonymize the trace and/or capture just headers, etc (for > example, the -s flag can control how much of each packet tcpdump > grabs). > > Can you please share the exact kernel version of the client machine? > > Also, can you please summarize/clarify whether you think the client, > server, or both are misbehaving? > > Thanks! > neal