2022-03-30 09:42:50

by Neal Cardwell

[permalink] [raw]
Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Tue, Mar 29, 2022 at 9:03 PM Jaco <[email protected]> wrote:
>
> Dear All,
>
> I'm seeing very strange TCP behaviour. Disabled TCP Segmentation Offload to
> try and pinpoint this more closely.
>
> It seems the kernel is ignoring ACKs coming from the remote side in some cases.
> In this case, on one of four hosts, and seemingly between this one host and
> Google ... (We've have two emails to google stuck on another host due to same
> issue, but several hundred others passed out today on that same host). I also
> killed selective ACKs as a test as these are known to sometimes cause issues
> for firewalls and "tcp accelerators" (or used to at the very least).
>
> SMTP connection between ourselves and Google ... I'm going to be selective in
> copying from tcpdump (full coversation up to the point where I killed it
> because it plainly got stuck in a loop is attached).
>
> Connection setup:
>
> 00:56:17.055481 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [S], seq 956633779, win 62580, options [mss 8940,nop,nop,TS val 3687705482 ecr 0,nop,wscale 7,tfo cookie f025dd84b6122510,nop,nop], length 0
>
> 00:56:17.217747 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [S.], seq 726465675, ack 956633780, win 65535, options [mss 1440,nop,nop,TS val 3477429218 ecr 3687705482,nop,wscale 8], length 0
>
> 00:56:17.218628 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465676:726465760, ack 956633780, win 256, options [nop,nop,TS val 3477429220 ecr 3687705482], length 84: SMTP: 220 mx.google.com ESMTP e16-20020a05600c4e5000b0038c77be9b2dsi226281wmq.72 - gsmtp
>
> 00:56:17.218663 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], ack 726465760, win 489, options [nop,nop,TS val 3687705645 ecr 3477429220], length 0
>
> This is pretty normal, we advertise an MSS of 8940 and the return is 1440, thus
> we shouldn't send segments larger than that, and they "can't". I need to
> determine if this is some form of offloading or they really are sending >1500
> byte frames (which I know won't pass our firewalls without fragmentation so
> probably some form of NIC offloading - which if it was active on older 5.8
> kernels did not cause problems):
>
> 00:56:17.709905 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465979:726468395, ack 956634111, win 261, options [nop,nop,TS val 3477429710 ecr 3687705973], length 2416: SMTP
>
> 00:56:17.709906 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726468395:726470811, ack 956634111, win 261, options [nop,nop,TS val 3477429710 ecr 3687705973], length 2416: SMTP
>
> These are the only two frames I can find that supposedly exceeds the MSS values
> (although, they don't exceed our value).
>
> Then everything goes pretty normal for a bit. The last data we receive from
> the remote side before stuff goes wrong:
>
> 00:56:18.088725 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726471823:726471919, ack 956634348, win 261, options [nop,nop,TS val 3477430089 ecr 3687706330], length 96: SMTP
>
> We ACK immediately along with the next segment:
>
> 00:56:18.088969 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956634348:956635776, ack 726471919, win 446, options [nop,nop,TS val 3687706515 ecr 3477430089], length 1428: SMTP
>
> Hereafter there is a flurry of data that we transmit, all nicely acknowledged,
> no retransmits that I can pick up (eyeballs).
>
> Before a long sequence of TX data we get this ACK:
>
> 00:56:18.576247 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956700036, win 774, options [nop,nop,TS val 3477430577 ecr 3687706840], length 0
>
> We then continue to RX a sequence of:
>
> 00:56:18.576300 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956745732:956747160, ack 726471919, win 446, options [nop,nop,TS val 3687707002 ecr 3477430577], length 1428: SMTP
>
> up to:
>
> 00:56:18.577031 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [P.], seq 956778576:956780004, ack 726471919, win 446, options [nop,nop,TS val 3687707003 ecr 3477430577], length 1428: SMTP
>
> Before we hit our first retransmit:
>
> 00:56:18.960078 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956700036:956701464, ack 726471919, win 446, options [nop,nop,TS val 3687707386 ecr 3477430577], length 1428: SMTP
>
> Since 956700036 is the last ACKed data, this seems correct, not sure what timer
> this is based on though, the ACK for the just prior data came in ~384ms prior
> (could be based on normal time to ACK, I don't know, this is about double the
> usual round-trip-time currently).
>
> And then we receive this ACK (we can see this time the kernel waited for ACK of
> this single segment):
>
> 00:56:19.126678 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956701464, win 785, options [nop,nop,TS val 3477431127 ecr 3687707386], length 0
>
> Then we do something (in my opinion) strange by jumping back to the tail of the previous burst:
>
> 00:56:19.126735 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956780004:956781432, ack 726471919, win 446, options [nop,nop,TS val 3687707553 ecr 3477431127], length 1428: SMTP
>
> 00:56:19.126751 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956781432:956782860, ack 726471919, win 446, options [nop,nop,TS val 3687707553 ecr 3477431127], length 1428: SMTP
>
> We then jump back and retransmit again from the just received ACK:
>
> 00:56:19.510078 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956701464:956702892, ack 726471919, win 446, options [nop,nop,TS val 3687707936 ecr 3477431127], length 1428: SMTP
>
> We then continue from there on as I'd expect (slow restart), this goes pretty
> normal up to:
>
> 00:56:19.997088 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956708604, win 841, options [nop,nop,TS val 3477431998 ecr 3687708261], length 0
>
> 00:56:19.997148 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956708604:956710032, ack 726471919, win 446, options [nop,nop,TS val 3687708423 ecr 3477431998], length 1428: SMTP
>
> 00:56:20.262683 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477432263 ecr 3687708423], length 0
>
> Up to here is fine, now things gets bizarre, we just jump to a different
> sequence number, which has already been ACKed:
>
> 00:56:20.380076 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq *956707176*:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687708806 ecr 3477431998], length 1428: SMTP
>
> 00:56:20.542356 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477432543 ecr 3687708423], length 0
>
> And remote side re-ACKs the 956710032 value, which frankly indicates we need to
> realize that the data we are transmitting has already been received, and we can
> continue on to transmit the segments following up on sequence number 956710032,
> instead we choose to get stuck in this sequence:
>
> 00:56:21.180080 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956707176:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687709606 ecr 3477431998], length 1428: SMTP
>
> 00:56:21.342347 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477433343 ecr 3687708423], length 0
>
> 00:56:22.780101 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956707176:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687711206 ecr 3477431998], length 1428: SMTP
>
> 00:56:22.942346 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477434943 ecr 3687708423], length 0
>
> And here the connection dies. It eventually times out, and we retry to the
> next host, resulting in the same problem.
>
> I am aware that Google is having congestion issues in the JHB area in SA
> currently, and there are probably packet delays and losses somewhere along the
> line between us, but this really should not stall as dead as it is here.
>
> Looking at only the incoming ACK values, I can see they are strictly
> increasing, so we've never received an ACK > 956710032, but this is still
> greater than the value we are retransmitting.
>
> The first time we transmitted the frame at sequence number 956707176 was part
> of the longest sequence of TX frames without a returning ACK, part of this
> sequence:
>
> ...
>
> 00:56:18.414299 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956705748:956707176, ack 726471919, win 446, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP
>
> 00:56:18.414302 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [P.], seq 956707176:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP
>
> 00:56:18.414316 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956708604:956710032, ack 726471919, win 446, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP
>
> ...
>
> Google here is ACKing not only the frame we are continuously retransmitting,
> but also the frame directly after ... so why would the kernel not move on to
> retransmitting starting from sequence number 956710032 (which is larger than
> the start sequence number of the frame we are retransmitting)?
>
> Kind Regards,
> Jaco

Thanks for the report! I have CC-ed the netdev list, since it is
probably a better forum for this discussion.

Can you please attach (or link to) a tcpdump raw .pcap file (produced
with the -w flag)? There are a number of tools that will make this
easier to visualize and analyze if we can see the raw .pcap file. You
may want to anonymize the trace and/or capture just headers, etc (for
example, the -s flag can control how much of each packet tcpdump
grabs).

Can you please share the exact kernel version of the client machine?

Also, can you please summarize/clarify whether you think the client,
server, or both are misbehaving?

Thanks!
neal


Attachments:
iewc_google.txt (38.07 kB)

2022-03-30 18:41:04

by Eric Dumazet

[permalink] [raw]
Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Tue, Mar 29, 2022 at 7:01 PM Neal Cardwell <[email protected]> wrote:
>
> On Tue, Mar 29, 2022 at 9:03 PM Jaco <[email protected]> wrote:
> >
> > Dear All,
> >
> > I'm seeing very strange TCP behaviour. Disabled TCP Segmentation Offload to
> > try and pinpoint this more closely.
> >
> > It seems the kernel is ignoring ACKs coming from the remote side in some cases.
> > In this case, on one of four hosts, and seemingly between this one host and
> > Google ... (We've have two emails to google stuck on another host due to same
> > issue, but several hundred others passed out today on that same host). I also
> > killed selective ACKs as a test as these are known to sometimes cause issues
> > for firewalls and "tcp accelerators" (or used to at the very least).
> >
> > SMTP connection between ourselves and Google ... I'm going to be selective in
> > copying from tcpdump (full coversation up to the point where I killed it
> > because it plainly got stuck in a loop is attached).
> >
> > Connection setup:
> >
> > 00:56:17.055481 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [S], seq 956633779, win 62580, options [mss 8940,nop,nop,TS val 3687705482 ecr 0,nop,wscale 7,tfo cookie f025dd84b6122510,nop,nop], length 0
> >
> > 00:56:17.217747 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [S.], seq 726465675, ack 956633780, win 65535, options [mss 1440,nop,nop,TS val 3477429218 ecr 3687705482,nop,wscale 8], length 0
> >
> > 00:56:17.218628 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465676:726465760, ack 956633780, win 256, options [nop,nop,TS val 3477429220 ecr 3687705482], length 84: SMTP: 220 mx.google.com ESMTP e16-20020a05600c4e5000b0038c77be9b2dsi226281wmq.72 - gsmtp
> >
> > 00:56:17.218663 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], ack 726465760, win 489, options [nop,nop,TS val 3687705645 ecr 3477429220], length 0
> >
> > This is pretty normal, we advertise an MSS of 8940 and the return is 1440, thus
> > we shouldn't send segments larger than that, and they "can't". I need to
> > determine if this is some form of offloading or they really are sending >1500
> > byte frames (which I know won't pass our firewalls without fragmentation so
> > probably some form of NIC offloading - which if it was active on older 5.8
> > kernels did not cause problems):
> >
> > 00:56:17.709905 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465979:726468395, ack 956634111, win 261, options [nop,nop,TS val 3477429710 ecr 3687705973], length 2416: SMTP
> >
> > 00:56:17.709906 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726468395:726470811, ack 956634111, win 261, options [nop,nop,TS val 3477429710 ecr 3687705973], length 2416: SMTP
> >
> > These are the only two frames I can find that supposedly exceeds the MSS values
> > (although, they don't exceed our value).
> >
> > Then everything goes pretty normal for a bit. The last data we receive from
> > the remote side before stuff goes wrong:
> >
> > 00:56:18.088725 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726471823:726471919, ack 956634348, win 261, options [nop,nop,TS val 3477430089 ecr 3687706330], length 96: SMTP
> >
> > We ACK immediately along with the next segment:
> >
> > 00:56:18.088969 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956634348:956635776, ack 726471919, win 446, options [nop,nop,TS val 3687706515 ecr 3477430089], length 1428: SMTP
> >
> > Hereafter there is a flurry of data that we transmit, all nicely acknowledged,
> > no retransmits that I can pick up (eyeballs).
> >
> > Before a long sequence of TX data we get this ACK:
> >
> > 00:56:18.576247 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956700036, win 774, options [nop,nop,TS val 3477430577 ecr 3687706840], length 0
> >
> > We then continue to RX a sequence of:
> >
> > 00:56:18.576300 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956745732:956747160, ack 726471919, win 446, options [nop,nop,TS val 3687707002 ecr 3477430577], length 1428: SMTP
> >
> > up to:
> >
> > 00:56:18.577031 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [P.], seq 956778576:956780004, ack 726471919, win 446, options [nop,nop,TS val 3687707003 ecr 3477430577], length 1428: SMTP
> >
> > Before we hit our first retransmit:
> >
> > 00:56:18.960078 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956700036:956701464, ack 726471919, win 446, options [nop,nop,TS val 3687707386 ecr 3477430577], length 1428: SMTP
> >
> > Since 956700036 is the last ACKed data, this seems correct, not sure what timer
> > this is based on though, the ACK for the just prior data came in ~384ms prior
> > (could be based on normal time to ACK, I don't know, this is about double the
> > usual round-trip-time currently).
> >
> > And then we receive this ACK (we can see this time the kernel waited for ACK of
> > this single segment):
> >
> > 00:56:19.126678 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956701464, win 785, options [nop,nop,TS val 3477431127 ecr 3687707386], length 0
> >
> > Then we do something (in my opinion) strange by jumping back to the tail of the previous burst:
> >
> > 00:56:19.126735 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956780004:956781432, ack 726471919, win 446, options [nop,nop,TS val 3687707553 ecr 3477431127], length 1428: SMTP
> >
> > 00:56:19.126751 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956781432:956782860, ack 726471919, win 446, options [nop,nop,TS val 3687707553 ecr 3477431127], length 1428: SMTP
> >
> > We then jump back and retransmit again from the just received ACK:
> >
> > 00:56:19.510078 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956701464:956702892, ack 726471919, win 446, options [nop,nop,TS val 3687707936 ecr 3477431127], length 1428: SMTP
> >
> > We then continue from there on as I'd expect (slow restart), this goes pretty
> > normal up to:
> >
> > 00:56:19.997088 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956708604, win 841, options [nop,nop,TS val 3477431998 ecr 3687708261], length 0
> >
> > 00:56:19.997148 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956708604:956710032, ack 726471919, win 446, options [nop,nop,TS val 3687708423 ecr 3477431998], length 1428: SMTP
> >
> > 00:56:20.262683 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477432263 ecr 3687708423], length 0
> >
> > Up to here is fine, now things gets bizarre, we just jump to a different
> > sequence number, which has already been ACKed:
> >
> > 00:56:20.380076 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq *956707176*:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687708806 ecr 3477431998], length 1428: SMTP
> >
> > 00:56:20.542356 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477432543 ecr 3687708423], length 0
> >
> > And remote side re-ACKs the 956710032 value, which frankly indicates we need to
> > realize that the data we are transmitting has already been received, and we can
> > continue on to transmit the segments following up on sequence number 956710032,
> > instead we choose to get stuck in this sequence:
> >
> > 00:56:21.180080 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956707176:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687709606 ecr 3477431998], length 1428: SMTP
> >
> > 00:56:21.342347 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477433343 ecr 3687708423], length 0
> >
> > 00:56:22.780101 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956707176:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687711206 ecr 3477431998], length 1428: SMTP
> >
> > 00:56:22.942346 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477434943 ecr 3687708423], length 0
> >
> > And here the connection dies. It eventually times out, and we retry to the
> > next host, resulting in the same problem.
> >
> > I am aware that Google is having congestion issues in the JHB area in SA
> > currently, and there are probably packet delays and losses somewhere along the
> > line between us, but this really should not stall as dead as it is here.
> >
> > Looking at only the incoming ACK values, I can see they are strictly
> > increasing, so we've never received an ACK > 956710032, but this is still
> > greater than the value we are retransmitting.
> >

It could be that ACK packets have a wrong checksum, after some point
is reached (some bug in a firewall/middlebox)

"tcpdump -v" will tell you something about checksum errors.
And/or "nstat -az | grep TcpInCsumError"

Also, packets could be dropped in a layer like netfilter.
Make sure you do not have a rule rate limiting flows, or something like that.

> > The first time we transmitted the frame at sequence number 956707176 was part
> > of the longest sequence of TX frames without a returning ACK, part of this
> > sequence:
> >
> > ...
> >
> > 00:56:18.414299 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956705748:956707176, ack 726471919, win 446, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP
> >
> > 00:56:18.414302 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [P.], seq 956707176:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP
> >
> > 00:56:18.414316 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956708604:956710032, ack 726471919, win 446, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP
> >
> > ...
> >
> > Google here is ACKing not only the frame we are continuously retransmitting,
> > but also the frame directly after ... so why would the kernel not move on to
> > retransmitting starting from sequence number 956710032 (which is larger than
> > the start sequence number of the frame we are retransmitting)?
> >
> > Kind Regards,
> > Jaco
>
> Thanks for the report! I have CC-ed the netdev list, since it is
> probably a better forum for this discussion.
>
> Can you please attach (or link to) a tcpdump raw .pcap file (produced
> with the -w flag)? There are a number of tools that will make this
> easier to visualize and analyze if we can see the raw .pcap file. You
> may want to anonymize the trace and/or capture just headers, etc (for
> example, the -s flag can control how much of each packet tcpdump
> grabs).
>
> Can you please share the exact kernel version of the client machine?
>
> Also, can you please summarize/clarify whether you think the client,
> server, or both are misbehaving?
>
> Thanks!
> neal

2022-03-30 23:20:56

by Jaco Kroon

[permalink] [raw]
Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Hi Neal,

> Thanks for the report! I have CC-ed the netdev list, since it is
> probably a better forum for this discussion.
Awesome thank you.
>
> Can you please attach (or link to) a tcpdump raw .pcap file (produced
> with the -w flag)? There are a number of tools that will make this
> easier to visualize and analyze if we can see the raw .pcap file. You
> may want to anonymize the trace and/or capture just headers, etc (for
> example, the -s flag can control how much of each packet tcpdump
> grabs).

Attached.

The traffic itself should be mostly encrypted but stripped with -s100
anyway.  At this point SACK was still on.

I don't know how, or why, but this relates to TFO.  After sending report
on a hunch (based on comparing the exim logs of a successful delivery
compared to a non-successful) and the only difference was that the
non-working was stating:

TFO mode sendto, no data: EINPROGRESS

and then specifically:

TCP_FASTOPEN tcpi_unacked 2

The working connections never had the latter line in the output.

The moment I set sysctl -w net.ipv4.tcp_fastopen=0 (default is 1) I've
managed to flood out about 1200 emails to google in a matter of no more
than 15 minutes.

In the kernel sources:  git log v5.8..v5.17 net/

And searching for TFO only gives so many possible commits that broke
this, just looking at changelogs I'm not sure if any of them are
relevant.  I'm guessing the issue possibly relates to congestion
control, as such this is probably the most relevant:

commit be5d1b61a2ad28c7e57fe8bfa277373e8ecffcdc
Author: Nguyen Dinh Phi <[email protected]>
Date:   Tue Jul 6 07:19:12 2021 +0800

    tcp: fix tcp_init_transfer() to not reset icsk_ca_initialized

Just looking at the diff it removes a icsk->icsk_ca_initialized = 0; -
the only other place this gets set to 0 is in tcp_disconnect() ... and
to 1 in tcp_init_congestion_control() - so I think we might have an
uninitialized variable here ... then again tcp_init_socket mentions
explicitly that sk_alloc set lots of stuff to 0 - still bugs me that the
original commit (8919a9b31eb4) felt the need to set an explicit 0 in
tcp_init_transfer().

>
> Can you please share the exact kernel version of the client machine?
Our side (client) is 5.17.1 (side that initiates TCP/IP connection), I
obviously can't comment for the Google side (server).
> Also, can you please summarize/clarify whether you think the client,
> server, or both are misbehaving?

client is re-transmitting frames for which it has already received an
ACK from the server.  In pcap from frames 105 onwards one can start
seeing retransmits, then first "spurious retransmission" as wireshark
labels it from frames 122 onwards.

Kind Regards,
Jaco


Attachments:
iewc_google2.pcap (19.36 kB)