Received: by 2002:a05:6a10:2726:0:0:0:0 with SMTP id ib38csp3487389pxb; Mon, 4 Apr 2022 18:33:56 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzk7819d4cRcEBJ+bziA4sofJmpl+ctuu52qDC17MsPzPUuu0zTFFXK5Z5M+qMqdsfVsyAw X-Received: by 2002:a05:6a00:328f:b0:4fa:95b5:e1ff with SMTP id ck15-20020a056a00328f00b004fa95b5e1ffmr959495pfb.60.1649122436026; Mon, 04 Apr 2022 18:33:56 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1649122436; cv=none; d=google.com; s=arc-20160816; b=iGNfAa/JWhJoR5jNRbSJIUgj41/ojnCLZdLaM0K+i9gW4YqNkvvUD60+DMR9/bBLIt kOoV2Mozg1kIpD64FXl5qNlOXReHD1Qr6TlQJu4WfONAlh5uI5VbfayQEbNY9zbK/wNc sEiTnjIJSrtdbna/pcz4mw3fHmNUNluDoSV5s0KPKkihccOHMJyNWF96uMgtSmVZ3cGM ynT3NJCxzfUZ/mQMymTRr394v5efDE4ulEUXhSvK7KxUvTKd2dk/WSgVH4cH9MRGcS9R 1plcHSw+DVVkGxPj6OXatYRqLSIBnys18mDHCPR8ssuwv8H3PCGlnfvyMiSQBoreTtoZ dmBQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=mebg0QeJQFAke9hx7Dqws17vyFhHRdF0ruAhkxiWa1M=; b=sOgdHikcu8Dvik0sc22zrv2vhBIkzEBBdoSAErK5eiAlk5DsBFtZw3pBzXv+JUyhA+ AIfqr1KGWDEGQP1ZU/zd2Ieu+HPrppUSA23INsOBkesT/1Ddlsfo5+wlIKvuVojVSRwk Ok38phUEb/dcEcbyAQ7FtFNY/bKNB9oa2IBnKEHDHa6iF/MS7g+kn5kF9qwL/kZT6lc9 lXefkRJbQEmyRl1T/7NEGs814ZTPyYNdF2iAdN5Vaz2esBNYwcHYgsx6arrs0D8qmaDt A69CZYRJC2Mh79tS05YQvPBoEsjqplTU3bEmHE+zGl3ICZ5/FiE/jb8vg1ht2mkUNTNG h2tw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=sYxiOPCS; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [2620:137:e000::1:18]) by mx.google.com with ESMTPS id o12-20020a170902d4cc00b00153b2d16515si12255417plg.285.2022.04.04.18.33.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 04 Apr 2022 18:33:56 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) client-ip=2620:137:e000::1:18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=sYxiOPCS; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id ADAEF7EA29; Mon, 4 Apr 2022 17:26:18 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1351161AbiDAPlP (ORCPT + 99 others); Fri, 1 Apr 2022 11:41:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54038 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1351892AbiDAPDA (ORCPT ); Fri, 1 Apr 2022 11:03:00 -0400 Received: from mail-qk1-x72a.google.com (mail-qk1-x72a.google.com [IPv6:2607:f8b0:4864:20::72a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9C8EE24F29A for ; Fri, 1 Apr 2022 07:50:23 -0700 (PDT) Received: by mail-qk1-x72a.google.com with SMTP id 85so2253124qkm.9 for ; Fri, 01 Apr 2022 07:50:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=mebg0QeJQFAke9hx7Dqws17vyFhHRdF0ruAhkxiWa1M=; b=sYxiOPCSK6W2fCmfUF0r1cRZ5v5Z97dlJOO7EeYYiZ5wU001BIl5HTd0lZAgcIJGn2 stul4gE6f6qnAo4Ufdrh6ezz9KC4pvSRbmeNxoC9AndIOC0yjRJgGISkDV4xe/PxPKwV 5WOHNtezVuzSEbok8BUiREOu1IIRTqRQb61Aj3bYNICDt90DFkjPjGN+WqayhYcrSBZn HEwXEM76FKKSdgsiDpkQafiA5OfLQDQsVuOAAuqFJC4O3Zp14pNISEFucWG9nbIf9cqI FOExxw0Pl4Z7dIet1gRst74mF2uroq2H1q9e8xeFcxInQjbVXpEtazC4vgNyWSVEREPT fmjA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=mebg0QeJQFAke9hx7Dqws17vyFhHRdF0ruAhkxiWa1M=; b=yuEN1vj5KwxWMdwxmHy0A9dNiwx1n6jmG4IHgfXhOyKl6peOP8ENV/oxwakFdo7EPF 6h5F9VMdtDDFHNmDwheD7z6MMGeV3F9bHt/Ydz/WkwhQMMVnlm7GJRbKJ8K13z5bZdqA LMJokAEPvcOD4cdi73zs97FH8ot//iyxInR7EWpQKRsq0u0z55H+zra3SWIjXnj/nsGp a5jeuxLWUN5KtP7o9ukNI2W8JKAbOnNDoqlmWT8s0SqvqZurqwIxo4FwyPHxIXR8pknY UvfkePACRktzRuBJlkvrqjEhjuQoUOQexr1/roMERl07SyEYYYmHNrqC4vscT/abjpRL E0Yg== X-Gm-Message-State: AOAM532xb59H5oFDLmh+HZESnmDelkRsLZyM8Xd7dnT4dlsXZFrgYFSl 7+2/wDekQ63N6JSULuTeOnvjCK3Bo2fPiFKeQ6eRQQ== X-Received: by 2002:a05:620a:1424:b0:67d:2bc6:856b with SMTP id k4-20020a05620a142400b0067d2bc6856bmr7025217qkj.434.1648824622224; Fri, 01 Apr 2022 07:50:22 -0700 (PDT) MIME-Version: 1.0 References: <10c1e561-8f01-784f-c4f4-a7c551de0644@uls.co.za> <5f1bbeb2-efe4-0b10-bc76-37eff30ea905@uls.co.za> In-Reply-To: <5f1bbeb2-efe4-0b10-bc76-37eff30ea905@uls.co.za> From: Neal Cardwell Date: Fri, 1 Apr 2022 10:50:05 -0400 Message-ID: Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections To: Jaco Kroon Cc: Eric Dumazet , LKML , Netdev , Yuchung Cheng Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.5 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE, USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Mar 31, 2022 at 7:06 PM Jaco Kroon wrote: > > Hi Neal, > > This sniff was grabbed ON THE CLIENT HOST. There is no middlebox or > anything between the sniffer and the client. Only the firewall on the > host itself, where we've already establish the traffic is NOT DISCARDED > (at least not in filter/INPUT). Yes, understood. Please excuse my general use of the term "firewalls/middleboxes" even where in some contexts it's clear the "middleboxes" aspect of that term could not apply. :-) > Setup on our end: > > 2 x routers, usually each with a direct peering with Google (which is > being ignored at the moment so instead traffic is incoming via IPT over DD). > > Connected via switch to > > 2 x firewalls, of which ONE is active (they have different networks > behind them, and could be active / standby for different networks behind > them - avoiding active-active because conntrackd is causing more trouble > than it's worth), Linux hosts, using netfilter, has been operating for > years, no recent kernel upgrades. > > 4 x hosts in mail cluster, one of which you're looking at here. > > On 2022/03/31 17:41, Neal Cardwell wrote: > > On Wed, Mar 30, 2022 at 9:04 AM Jaco Kroon wrote: > > ... > >> When you state sane/normal, do you mean there is fault with the other > >> frames that could not be explained by packet loss in one or both of the > >> directions? > > Yes. > > > > (1) If you look at the attached trace time/sequence plots (from > > tcptrace and xplot.org) there are several behaviors that do not look > > like normal congestive packet loss: > OK. I'm not 100% sure how these plots of yours work, but let's see if I > can follow your logic here - they mostly make sense. A legend would > probably help. As I understand the white dots are original transmits, > green is what has been ACKED. R is retransmits ... what's the S? "S" is "SACKed", or selectively acknowledged. The SACK blocks below the green ACK lines are DSACK blocks, for "Duplicate SACKs", indicating the receiver has already received that sequence range. > What's the yellow line (I'm guessing receive window as advertised by the > server)? Yes, the yellow line is the right edge of the receive window of the server. > > (a) Literally *all* original transmissions (white segments in the > > plot) of packets after client sequence 66263 appear lost (are not > > ACKed). Congestion generally does not behave like that. But broken > > firewalls/middleboxes do. > > (See netdev-2022-03-29-tcp-disregarded-acks-zoomed-out.png ) > > Agreed. So could it be that something in the transit path towards > Google is actually dropping all of that? It could be. Or it could be a firewall/middlebox. > As stated - I highly doubt this is on our network unless newer kernel > (on mail cluster) is doing stuff which is causing older netfilter to > drop perhaps? But this doesn't explain why newer kernel retransmits > data for which it received an ACK. Yes, I agree that the biggest problem to focus on is the TCP code in the kernel retransmitting data for which the NIC is receiving ACKs. > > > > (b) When the client is retransmitting packets, only packets at > > exactly snd_una are ACKed. The packets beyond that point are always > > un-ACKed. Again sounds like a broken firewall/middlebox. > > (See netdev-2022-03-29-tcp-disregarded-acks-zoomed-in.png ) > No middlebox between packet sniffer and client ... client here is linux > 5.17.1. Brings me back to the only thing that could be dropping the > traffic is netfilter on the host, or the kernel doesn't like something > about the ACK, or kernel is doing something else wrong as a result of > TFO. I'm not sure which option I like less. Unfortunately I also use > netfilter for redirecting traffic into haproxy here so can't exactly > just switch off netfilter. Given the most problematic aspect of the trace, where the client-side TCP connection is repeatedly retransmitting packets for which ACKs are arriving at the NIC (and captured by tcpdump), it seems some software in your kernel is dropping packets between the network device and the TCP layer. Given that you mention "the only thing that could be dropping the traffic is netfilter on the host", it seems like the netfilter rules or software are buggy. A guess would be that the netfilter code is getting into a bad state due to the TFO behavior where there is a data packet arriving from the server immediately after the SYN/ACK and just before the client sends its first ACK: 00:00:00.000000 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590 > 2a00:1450:4013:c16::1a.25: S 3451342529:3451342529(0) win 62580 00:00:00.164295 IP6 2a00:1450:4013:c16::1a.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: S. 2699962254:2699962254(0) ack 3451342530 win 65535 # this one is perhaps confusing netfilter?: 00:00:00.001641 IP6 2a00:1450:4013:c16::1a.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: P. 1:89(88) ack 1 win 256 00:00:00.000035 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590 > 2a00:1450:4013:c16::1a.25: . 1:1(0) ack 89 win 489 00:00:00.000042 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590 > 2a00:1450:4013:c16::1a.25: P. 1:24(23) ack 89 win 489 Re "so can't exactly just switch off netfilter", are there any other counters or logs you can somehow check for netfilter drops? > > > > (c) After the client receives the server's "ack 73403", the client > > ignores/drops all other incoming packets that show up in the trace. > > Agreed. However, if I read your graph correctly, it gets an ACK for > frame X at ~3.8s into the connection, then for X+2 at 4s, but it keeps > retransmitting X+2, not X+1? At t=4s, as I discussed below there are two ACKs that arrive back-to-back, where the client TCP apparently processes the first but not the second. That's why it keeps retransmitting the packet beyond the first ACk but not beyond the second ACK. > > > > > As Eric notes, this doesn't look like a PAWS issue. And it > > doesn't look like a checksum or sequence/ACK validation issue. The > > client starts ignoring ACKs between two ACKs that have correct > > checksums, valid ACK numbers, and valid (identical) sequence numbers > > and TS val and ecr values (here showing absolute sequence/ACK > > numbers): > I'm not familiar with PAWS here. Assuming that the green line is ACKs, > then at around 4s we get an ACK that basically ACKs two frames in one > (which is fine from my understanding of TCP), and then the second of > these frames keeps getting retransmitted going forward, so it's almost > like the kernel ACKs the *first* of these two frames but not the second. Again, there are two ACKs, where the client TCP apparently processes the first but not the second, as discussed here: > > > > (i) The client processes this ACK and uses it to advance snd_una: > > 17:46:49.889911 IP6 (flowlabel 0x97427, hlim 61, next-header TCP > > (6) payload length: 32) 2a00:1450:4013:c16::1a.25 > > > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . cksum 0x7005 (correct) > > 2699968514:2699968514(0) ack 3451415932 win 830 > 1206546583 ecr 331191428> > > > > > (ii) The client ignores this ACK and all later ACKs: > > 17:46:49.889912 IP6 (flowlabel 0x97427, hlim 61, next-header TCP > > (6) payload length: 32) 2a00:1450:4013:c16::1a.25 > > > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . cksum 0x6a66 (correct) > > 2699968514:2699968514(0) ack 3451417360 win 841 > 1206546583 ecr 331191428> > > Here are those same two ACKs again, shown with absolute time and relative sequence numbers, to make them easier to parse: (i) The client processes this ACK and uses it to advance snd_una: 17:46:49.889911 IP6 2a00:1450:4013:c16::1a.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . 6260:6260(0) ack 73403 win 830 (ii) The client ignores this ACK and all later ACKs: 17:46:49.889912 IP6 2a00:1450:4013:c16::1a.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . 6260:6260(0) ack 74831 win 841 neal