2022-03-30 11:50:33

by Eric Dumazet

[permalink] [raw]
Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Tue, Mar 29, 2022 at 7:58 PM Jaco Kroon <[email protected]> wrote:
>
> Hi Neal,
>
> > Thanks for the report! I have CC-ed the netdev list, since it is
> > probably a better forum for this discussion.
> Awesome thank you.
> >
> > Can you please attach (or link to) a tcpdump raw .pcap file (produced
> > with the -w flag)? There are a number of tools that will make this
> > easier to visualize and analyze if we can see the raw .pcap file. You
> > may want to anonymize the trace and/or capture just headers, etc (for
> > example, the -s flag can control how much of each packet tcpdump
> > grabs).
>
> Attached.
>
> The traffic itself should be mostly encrypted but stripped with -s100
> anyway. At this point SACK was still on.
>
> I don't know how, or why, but this relates to TFO. After sending report
> on a hunch (based on comparing the exim logs of a successful delivery
> compared to a non-successful) and the only difference was that the
> non-working was stating:
>
> TFO mode sendto, no data: EINPROGRESS
>
> and then specifically:
>
> TCP_FASTOPEN tcpi_unacked 2
>
> The working connections never had the latter line in the output.
>
> The moment I set sysctl -w net.ipv4.tcp_fastopen=0 (default is 1) I've
> managed to flood out about 1200 emails to google in a matter of no more
> than 15 minutes.
>
> In the kernel sources: git log v5.8..v5.17 net/
>
> And searching for TFO only gives so many possible commits that broke
> this, just looking at changelogs I'm not sure if any of them are
> relevant. I'm guessing the issue possibly relates to congestion
> control, as such this is probably the most relevant:
>
> commit be5d1b61a2ad28c7e57fe8bfa277373e8ecffcdc
> Author: Nguyen Dinh Phi <[email protected]>
> Date: Tue Jul 6 07:19:12 2021 +0800
>
> tcp: fix tcp_init_transfer() to not reset icsk_ca_initialized
>
> Just looking at the diff it removes a icsk->icsk_ca_initialized = 0; -
> the only other place this gets set to 0 is in tcp_disconnect() ... and
> to 1 in tcp_init_congestion_control() - so I think we might have an
> uninitialized variable here ... then again tcp_init_socket mentions
> explicitly that sk_alloc set lots of stuff to 0 - still bugs me that the
> original commit (8919a9b31eb4) felt the need to set an explicit 0 in
> tcp_init_transfer().

I do not think this commit is related to the issue you have.

I guess you could try a revert ?

Then, if you think old linux versions were ok, start a bisection ?

Thank you.

(I do not see why a successful TFO would lead to a freeze after ~70 KB
of data has been sent)

>
> >
> > Can you please share the exact kernel version of the client machine?
> Our side (client) is 5.17.1 (side that initiates TCP/IP connection), I
> obviously can't comment for the Google side (server).
> > Also, can you please summarize/clarify whether you think the client,
> > server, or both are misbehaving?
>
> client is re-transmitting frames for which it has already received an
> ACK from the server. In pcap from frames 105 onwards one can start
> seeing retransmits, then first "spurious retransmission" as wireshark
> labels it from frames 122 onwards.
>
> Kind Regards,
> Jaco


2022-03-30 12:38:41

by Jaco Kroon

[permalink] [raw]
Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Hi Eric,

On 2022/03/30 05:48, Eric Dumazet wrote:
> On Tue, Mar 29, 2022 at 7:58 PM Jaco Kroon <[email protected]> wrote:
>
> I do not think this commit is related to the issue you have.
>
> I guess you could try a revert ?
>
> Then, if you think old linux versions were ok, start a bisection ?
That'll be interesting, will see if I can reproduce on a non-production
host.
>
> Thank you.
>
> (I do not see why a successful TFO would lead to a freeze after ~70 KB
> of data has been sent)

I do actually agree with this in that it makes no sense, but disabling
TFO definitely resolved the issue for us.

Kind Regards,
Jaco

2022-03-31 02:37:36

by Neal Cardwell

[permalink] [raw]
Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Wed, Mar 30, 2022 at 2:22 AM Jaco Kroon <[email protected]> wrote:
>
> Hi Eric,
>
> On 2022/03/30 05:48, Eric Dumazet wrote:
> > On Tue, Mar 29, 2022 at 7:58 PM Jaco Kroon <[email protected]> wrote:
> >
> > I do not think this commit is related to the issue you have.
> >
> > I guess you could try a revert ?
> >
> > Then, if you think old linux versions were ok, start a bisection ?
> That'll be interesting, will see if I can reproduce on a non-production
> host.
> >
> > Thank you.
> >
> > (I do not see why a successful TFO would lead to a freeze after ~70 KB
> > of data has been sent)
>
> I do actually agree with this in that it makes no sense, but disabling
> TFO definitely resolved the issue for us.
>
> Kind Regards,
> Jaco

Thanks for the pcap trace! That's a pretty strange trace. I agree with
Eric's theory that this looks like one or more bugs in a firewall,
middlebox, or netfilter rule. From the trace it looks like the buggy
component is sometimes dropping packets and sometimes corrupting them
so that the client's TCP stack ignores them.

Interestingly, in that trace the client SYN has a TFO option and
cookie, but no data in the SYN.

The last packet that looks sane/normal is the ACK from the SMTP server
that looks like:

00:00:00.000010 IP6 2a00:1450:4013:c16::1a.25 >
2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . 6260:6260(0) ack 66263 win
774 <nop,nop,TS val 1206544341 ecr 331189186>

That's the first ACK that crosses past 2^16. Maybe that is a
coincidence, or maybe not. Perhaps the buggy firewall/middlebox/etc is
confused by the TFO option, corrupts its state, and thereafter behaves
incorrectly past the first 64 KBytes of data from the client.

In addition to checking for checksum failures, mentioned by Eric, you
could look for PAWS failures, something like:

nstat -az | egrep -i 'TcpInCsumError|PAWS'

best,
neal

2022-03-31 04:57:54

by Jaco Kroon

[permalink] [raw]
Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Hi,

On 2022/03/30 15:56, Neal Cardwell wrote:
> On Wed, Mar 30, 2022 at 2:22 AM Jaco Kroon <[email protected]> wrote:
>> Hi Eric,
>>
>> On 2022/03/30 05:48, Eric Dumazet wrote:
>>> On Tue, Mar 29, 2022 at 7:58 PM Jaco Kroon <[email protected]> wrote:
>>>
>>> I do not think this commit is related to the issue you have.
>>>
>>> I guess you could try a revert ?
>>>
>>> Then, if you think old linux versions were ok, start a bisection ?
>> That'll be interesting, will see if I can reproduce on a non-production
>> host.
>>> Thank you.
>>>
>>> (I do not see why a successful TFO would lead to a freeze after ~70 KB
>>> of data has been sent)
>> I do actually agree with this in that it makes no sense, but disabling
>> TFO definitely resolved the issue for us.
>>
>> Kind Regards,
>> Jaco
> Thanks for the pcap trace! That's a pretty strange trace. I agree with
> Eric's theory that this looks like one or more bugs in a firewall,
> middlebox, or netfilter rule. From the trace it looks like the buggy
> component is sometimes dropping packets and sometimes corrupting them
> so that the client's TCP stack ignores them.
The capture was taken on the client.  So the only firewall there is
iptables, and I redirected all -j DROP statements to a L_DROP chain
which did a -j LOG prior to -j DROP - didn't pick up any drops here.
>
> Interestingly, in that trace the client SYN has a TFO option and
> cookie, but no data in the SYN.

So this allows the SMTP server which in the conversation speaks first to
identify itself to respond with data in the SYN (not sure that was
actually happening but if I recall I did see it send data prior to
receiving the final ACK on the handshake.

>
> The last packet that looks sane/normal is the ACK from the SMTP server
> that looks like:
>
> 00:00:00.000010 IP6 2a00:1450:4013:c16::1a.25 >
> 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . 6260:6260(0) ack 66263 win
> 774 <nop,nop,TS val 1206544341 ecr 331189186>
>
> That's the first ACK that crosses past 2^16. Maybe that is a
> coincidence, or maybe not. Perhaps the buggy firewall/middlebox/etc is

I believe it should be because we literally had this on every single
connection going out to Google's SMTP ... probably 1/100 connections
managed to deliver an email over the connection.  Then again ... 64KB
isn't that much ...

When you state sane/normal, do you mean there is fault with the other
frames that could not be explained by packet loss in one or both of the
directions?

> confused by the TFO option, corrupts its state, and thereafter behaves
> incorrectly past the first 64 KBytes of data from the client.

Only firewalls we've got are netfilter based, and these packets all
passed through the dedicated firewalls at least by the time they reach
here.  No middleboxes on our end, and if this was Google's side there
would be crazy noise be heard, not just me.  I think the trigger is
packet loss between us (as indicated we know they have link congestion
issues in JHB area, it took us the better part of two weeks to get the
first line tech on their side to just query the internal teams and
probably another week to get the response acknowledging this -
mybroadband.co.za has an article about other local ISPs also complaining).

>
> In addition to checking for checksum failures, mentioned by Eric, you
> could look for PAWS failures, something like:
>
> nstat -az | egrep -i 'TcpInCsumError|PAWS'

TcpInCsumErrors                 0                  0.0
TcpExtPAWSActive                0                  0.0
TcpExtPAWSEstab                 90092              0.0
TcpExtTCPACKSkippedPAWS         81317              0.0

Not sure what these mean, but i should probably investigate, the latter
two are definitely incrementing.

Appreciate the feedback and for looking at the traces.

Kind Regards,
Jaco