LinuxLists.cc - linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Wed, Mar 30, 2022 at 9:04 AM Jaco Kroon <[email protected]> wrote:
>
> Hi,
>
> On 2022/03/30 15:56, Neal Cardwell wrote:
> > On Wed, Mar 30, 2022 at 2:22 AM Jaco Kroon <[email protected]> wrote:
> >> Hi Eric,
> >>
> >> On 2022/03/30 05:48, Eric Dumazet wrote:
> >>> On Tue, Mar 29, 2022 at 7:58 PM Jaco Kroon <[email protected]> wrote:
> >>>
> >>> I do not think this commit is related to the issue you have.
> >>>
> >>> I guess you could try a revert ?
> >>>
> >>> Then, if you think old linux versions were ok, start a bisection ?
> >> That'll be interesting, will see if I can reproduce on a non-production
> >> host.
> >>> Thank you.
> >>>
> >>> (I do not see why a successful TFO would lead to a freeze after ~70 KB
> >>> of data has been sent)
> >> I do actually agree with this in that it makes no sense, but disabling
> >> TFO definitely resolved the issue for us.
> >>
> >> Kind Regards,
> >> Jaco
> > Thanks for the pcap trace! That's a pretty strange trace. I agree with
> > Eric's theory that this looks like one or more bugs in a firewall,
> > middlebox, or netfilter rule. From the trace it looks like the buggy
> > component is sometimes dropping packets and sometimes corrupting them
> > so that the client's TCP stack ignores them.
> The capture was taken on the client. So the only firewall there is
> iptables, and I redirected all -j DROP statements to a L_DROP chain
> which did a -j LOG prior to -j DROP - didn't pick up any drops here.
> >
> > Interestingly, in that trace the client SYN has a TFO option and
> > cookie, but no data in the SYN.
>
> So this allows the SMTP server which in the conversation speaks first to
> identify itself to respond with data in the SYN (not sure that was
> actually happening but if I recall I did see it send data prior to
> receiving the final ACK on the handshake.
>
> >
> > The last packet that looks sane/normal is the ACK from the SMTP server
> > that looks like:
> >
> > 00:00:00.000010 IP6 2a00:1450:4013:c16::1a.25 >
> > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . 6260:6260(0) ack 66263 win
> > 774 <nop,nop,TS val 1206544341 ecr 331189186>
> >
> > That's the first ACK that crosses past 2^16. Maybe that is a
> > coincidence, or maybe not. Perhaps the buggy firewall/middlebox/etc is
>
> I believe it should be because we literally had this on every single
> connection going out to Google's SMTP ... probably 1/100 connections
> managed to deliver an email over the connection. Then again ... 64KB
> isn't that much ...
>
> When you state sane/normal, do you mean there is fault with the other
> frames that could not be explained by packet loss in one or both of the
> directions?
>
> > confused by the TFO option, corrupts its state, and thereafter behaves
> > incorrectly past the first 64 KBytes of data from the client.
>
> Only firewalls we've got are netfilter based, and these packets all
> passed through the dedicated firewalls at least by the time they reach
> here. No middleboxes on our end, and if this was Google's side there
> would be crazy noise be heard, not just me. I think the trigger is
> packet loss between us (as indicated we know they have link congestion
> issues in JHB area, it took us the better part of two weeks to get the
> first line tech on their side to just query the internal teams and
> probably another week to get the response acknowledging this -
> mybroadband.co.za has an article about other local ISPs also complaining).
>
> >
> > In addition to checking for checksum failures, mentioned by Eric, you
> > could look for PAWS failures, something like:
> >
> > nstat -az | egrep -i 'TcpInCsumError|PAWS'
>
> TcpInCsumErrors 0 0.0
> TcpExtPAWSActive 0 0.0
> TcpExtPAWSEstab 90092 0.0
> TcpExtTCPACKSkippedPAWS 81317 0.0
>
> Not sure what these mean, but i should probably investigate, the latter
> two are definitely incrementing.
>
> Appreciate the feedback and for looking at the traces.
>

Your pcap does not show any obvious PAWS issues.

If the host is lightly loaded you could try while the connection is
attempted/frozen

perf record -a -g -e skb:kfree_skb sleep 30
perf script (or perf report)

2022-03-31 17:00:29

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Wed, Mar 30, 2022 at 9:04 AM Jaco Kroon <[email protected]> wrote:
...
> When you state sane/normal, do you mean there is fault with the other
> frames that could not be explained by packet loss in one or both of the
> directions?

Yes.

(1) If you look at the attached trace time/sequence plots (from
tcptrace and xplot.org) there are several behaviors that do not look
like normal congestive packet loss:

(a) Literally *all* original transmissions (white segments in the
plot) of packets after client sequence 66263 appear lost (are not
ACKed). Congestion generally does not behave like that. But broken
firewalls/middleboxes do.
(See netdev-2022-03-29-tcp-disregarded-acks-zoomed-out.png )

(b) When the client is retransmitting packets, only packets at
exactly snd_una are ACKed. The packets beyond that point are always
un-ACKed. Again sounds like a broken firewall/middlebox.
(See netdev-2022-03-29-tcp-disregarded-acks-zoomed-in.png )

(c) After the client receives the server's "ack 73403", the client
ignores/drops all other incoming packets that show up in the trace.

As Eric notes, this doesn't look like a PAWS issue. And it
doesn't look like a checksum or sequence/ACK validation issue. The
client starts ignoring ACKs between two ACKs that have correct
checksums, valid ACK numbers, and valid (identical) sequence numbers
and TS val and ecr values (here showing absolute sequence/ACK
numbers):

(i) The client processes this ACK and uses it to advance snd_una:
17:46:49.889911 IP6 (flowlabel 0x97427, hlim 61, next-header TCP
(6) payload length: 32) 2a00:1450:4013:c16::1a.25 >
2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . cksum 0x7005 (correct)
2699968514:2699968514(0) ack 3451415932 win 830 <nop,nop,TS val
1206546583 ecr 331191428>

(ii) The client ignores this ACK and all later ACKs:
17:46:49.889912 IP6 (flowlabel 0x97427, hlim 61, next-header TCP
(6) payload length: 32) 2a00:1450:4013:c16::1a.25 >
2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . cksum 0x6a66 (correct)
2699968514:2699968514(0) ack 3451417360 win 841 <nop,nop,TS val
1206546583 ecr 331191428>

neal

Attachments:

netdev-2022-03-29-tcp-disregarded-acks-zoomed-out.png (128.14 kB)
netdev-2022-03-29-tcp-disregarded-acks-zoomed-in.png (125.10 kB)
Download all attachments

2022-04-01 06:17:41

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Thu, Mar 31, 2022 at 5:33 PM Jaco Kroon <[email protected]> wrote:

> I'll deploy same on a dev host we've got in the coming week and start a
> bisect process.

Thanks, this will definitely help.

2022-04-01 07:16:48

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Hi Neal,

This sniff was grabbed ON THE CLIENT HOST. There is no middlebox or
anything between the sniffer and the client. Only the firewall on the
host itself, where we've already establish the traffic is NOT DISCARDED
(at least not in filter/INPUT).

Setup on our end:

2 x routers, usually each with a direct peering with Google (which is
being ignored at the moment so instead traffic is incoming via IPT over DD).

Connected via switch to

2 x firewalls, of which ONE is active (they have different networks
behind them, and could be active / standby for different networks behind
them - avoiding active-active because conntrackd is causing more trouble
than it's worth), Linux hosts, using netfilter, has been operating for
years, no recent kernel upgrades.

4 x hosts in mail cluster, one of which you're looking at here.

On 2022/03/31 17:41, Neal Cardwell wrote:
> On Wed, Mar 30, 2022 at 9:04 AM Jaco Kroon <[email protected]> wrote:
> ...
>> When you state sane/normal, do you mean there is fault with the other
>> frames that could not be explained by packet loss in one or both of the
>> directions?
> Yes.
>
> (1) If you look at the attached trace time/sequence plots (from
> tcptrace and xplot.org) there are several behaviors that do not look
> like normal congestive packet loss:
OK. I'm not 100% sure how these plots of yours work, but let's see if I
can follow your logic here - they mostly make sense. A legend would
probably help. As I understand the white dots are original transmits,
green is what has been ACKED. R is retransmits ... what's the S?
What's the yellow line (I'm guessing receive window as advertised by the
server)?
>
> (a) Literally *all* original transmissions (white segments in the
> plot) of packets after client sequence 66263 appear lost (are not
> ACKed). Congestion generally does not behave like that. But broken
> firewalls/middleboxes do.
> (See netdev-2022-03-29-tcp-disregarded-acks-zoomed-out.png )

Agreed. So could it be that something in the transit path towards
Google is actually dropping all of that?

As stated - I highly doubt this is on our network unless newer kernel
(on mail cluster) is doing stuff which is causing older netfilter to
drop perhaps? But this doesn't explain why newer kernel retransmits
data for which it received an ACK.

>
> (b) When the client is retransmitting packets, only packets at
> exactly snd_una are ACKed. The packets beyond that point are always
> un-ACKed. Again sounds like a broken firewall/middlebox.
> (See netdev-2022-03-29-tcp-disregarded-acks-zoomed-in.png )
No middlebox between packet sniffer and client ... client here is linux
5.17.1. Brings me back to the only thing that could be dropping the
traffic is netfilter on the host, or the kernel doesn't like something
about the ACK, or kernel is doing something else wrong as a result of
TFO. I'm not sure which option I like less. Unfortunately I also use
netfilter for redirecting traffic into haproxy here so can't exactly
just switch off netfilter.
>
> (c) After the client receives the server's "ack 73403", the client
> ignores/drops all other incoming packets that show up in the trace.

Agreed. However, if I read your graph correctly, it gets an ACK for
frame X at ~3.8s into the connection, then for X+2 at 4s, but it keeps
retransmitting X+2, not X+1?

>
> As Eric notes, this doesn't look like a PAWS issue. And it
> doesn't look like a checksum or sequence/ACK validation issue. The
> client starts ignoring ACKs between two ACKs that have correct
> checksums, valid ACK numbers, and valid (identical) sequence numbers
> and TS val and ecr values (here showing absolute sequence/ACK
> numbers):
I'm not familiar with PAWS here. Assuming that the green line is ACKs,
then at around 4s we get an ACK that basically ACKs two frames in one
(which is fine from my understanding of TCP), and then the second of
these frames keeps getting retransmitted going forward, so it's almost
like the kernel ACKs the *first* of these two frames but not the second.
>
> (i) The client processes this ACK and uses it to advance snd_una:
> 17:46:49.889911 IP6 (flowlabel 0x97427, hlim 61, next-header TCP
> (6) payload length: 32) 2a00:1450:4013:c16::1a.25 >
> 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . cksum 0x7005 (correct)
> 2699968514:2699968514(0) ack 3451415932 win 830 <nop,nop,TS val
> 1206546583 ecr 331191428>

>
> (ii) The client ignores this ACK and all later ACKs:
> 17:46:49.889912 IP6 (flowlabel 0x97427, hlim 61, next-header TCP
> (6) payload length: 32) 2a00:1450:4013:c16::1a.25 >
> 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . cksum 0x6a66 (correct)
> 2699968514:2699968514(0) ack 3451417360 win 841 <nop,nop,TS val
> 1206546583 ecr 331191428>
>
> neal

2022-04-01 10:48:27

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Hi,

On 2022/04/01 02:10, Eric Dumazet wrote:
> On Thu, Mar 31, 2022 at 4:06 PM Jaco Kroon <[email protected]> wrote:
>> Hi Neal,
>>
>> This sniff was grabbed ON THE CLIENT HOST. There is no middlebox or
>> anything between the sniffer and the client. Only the firewall on the
>> host itself, where we've already establish the traffic is NOT DISCARDED
>> (at least not in filter/INPUT).
>>
>> Setup on our end:
>>
>> 2 x routers, usually each with a direct peering with Google (which is
>> being ignored at the moment so instead traffic is incoming via IPT over DD).
>>
>> Connected via switch to
>>
>> 2 x firewalls, of which ONE is active (they have different networks
>> behind them, and could be active / standby for different networks behind
>> them - avoiding active-active because conntrackd is causing more trouble
>> than it's worth), Linux hosts, using netfilter, has been operating for
>> years, no recent kernel upgrades.
> Next step would be to attempt removing _all_ firewalls, especially not
> common setups like yours.
That I'm afraid is not going to happen here. I can't imagine what we're
doing is that uncommon. On the host basically for INPUT drop invalid,
ACCEPT related established, accept specific ports, drop everything
else. Other than the redirects in NAT there really isn't anything "funny".
>
> conntrack had a bug preventing TFO deployment for a while, because
> many boxes kept buggy kernel versions for years.

We don't use conntrackd, we tried many years back, but eventually we
just ended up using ucarp with /32s on the interfaces and whatever
subnet is required for the floating IP itself, combined with OSPF to
sort out the routing, that way we get to avoid asymmetric routing and
the need for conntrackd. The core firewalls basically on FORWARD does
some directing based on ingress and/or egress interface to determine
ruleset to apply, again INVALID and RELATED,ESTABLISHED rules at the head.

>
> 356d7d88e088687b6578ca64601b0a2c9d145296 netfilter: nf_conntrack: fix
> tcp_in_window for Fast Open

This is from Aug 9, 2013 ... our firewall's kernel isn't that old :).
Again, the traffic was sniffed on the client side of that firewall, and
the only firewall between the sniffer and the processing part of the
kernel is the local netfilter.

I'll deploy same on a dev host we've got in the coming week and start a
bisect process.

Kind Regards,
Jaco

2022-04-01 13:47:05

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Thu, Mar 31, 2022 at 4:06 PM Jaco Kroon <[email protected]> wrote:
>
> Hi Neal,
>
> This sniff was grabbed ON THE CLIENT HOST. There is no middlebox or
> anything between the sniffer and the client. Only the firewall on the
> host itself, where we've already establish the traffic is NOT DISCARDED
> (at least not in filter/INPUT).
>
> Setup on our end:
>
> 2 x routers, usually each with a direct peering with Google (which is
> being ignored at the moment so instead traffic is incoming via IPT over DD).
>
> Connected via switch to
>
> 2 x firewalls, of which ONE is active (they have different networks
> behind them, and could be active / standby for different networks behind
> them - avoiding active-active because conntrackd is causing more trouble
> than it's worth), Linux hosts, using netfilter, has been operating for
> years, no recent kernel upgrades.

Next step would be to attempt removing _all_ firewalls, especially not
common setups like yours.

conntrack had a bug preventing TFO deployment for a while, because
many boxes kept buggy kernel versions for years.

356d7d88e088687b6578ca64601b0a2c9d145296 netfilter: nf_conntrack: fix
tcp_in_window for Fast Open

2022-04-01 15:03:37

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Thu, Mar 31, 2022 at 5:41 PM Eric Dumazet <[email protected]> wrote:
>
> On Thu, Mar 31, 2022 at 5:33 PM Jaco Kroon <[email protected]> wrote:
>
> > I'll deploy same on a dev host we've got in the coming week and start a
> > bisect process.
>
> Thanks, this will definitely help.

One thing I noticed in your pcap is a good amount of drops, as if
Hystart was not able to stop slow-start before the drops are
happening.

TFO with one less RTT at connection establishment could be the trigger.

If you are still using cubic, please try to revert.

commit 4e1fddc98d2585ddd4792b5e44433dcee7ece001
Author: Eric Dumazet <[email protected]>
Date: Tue Nov 23 12:25:35 2021 -0800

tcp_cubic: fix spurious Hystart ACK train detections for
not-cwnd-limited flows

2022-04-01 15:03:42

by Florian Westphal

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Eric Dumazet <[email protected]> wrote:
> Next step would be to attempt removing _all_ firewalls, especially not
> common setups like yours.
>
> conntrack had a bug preventing TFO deployment for a while, because
> many boxes kept buggy kernel versions for years.
>
> 356d7d88e088687b6578ca64601b0a2c9d145296 netfilter: nf_conntrack: fix
> tcp_in_window for Fast Open

Jaco could also try with
net.netfilter.nf_conntrack_tcp_be_liberal=1

and, if that helps, with liberal=0 and
sysctl net.netfilter.nf_conntrack_log_invalid=6

(check dmesg/syslog/nflog).

2022-04-01 15:28:43

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Hi,

On 2022/04/01 02:15, Florian Westphal wrote:

Incidently, I always find your initials to be interesting considering
(as far as I know) you work on netfilter firewall.

> Eric Dumazet <[email protected]> wrote:
>> Next step would be to attempt removing _all_ firewalls, especially not
>> common setups like yours.
>>
>> conntrack had a bug preventing TFO deployment for a while, because
>> many boxes kept buggy kernel versions for years.
>>
>> 356d7d88e088687b6578ca64601b0a2c9d145296 netfilter: nf_conntrack: fix
>> tcp_in_window for Fast Open
> Jaco could also try with
> net.netfilter.nf_conntrack_tcp_be_liberal=1
>
> and, if that helps, with liberal=0 and
> sysctl net.netfilter.nf_conntrack_log_invalid=6
>
> (check dmesg/syslog/nflog).

Our core firewalls already had nf_conntrack_tcp_be_liberal for other
reasons (asymmetric routing combined with conntrackd left-over if I
recall), so maybe that's why it got through there ... don't exactly want
to just flip that setting though, is there a way to log if it would have
dropped anything, without actually dropping it (yet)?

Will do this first, first need to confirm that I can reproduce in a dev
environment.

Kind Regards,
Jaco

2022-04-01 20:54:28

by Florian Westphal

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Jaco Kroon <[email protected]> wrote:
> > Eric Dumazet <[email protected]> wrote:
> >> Next step would be to attempt removing _all_ firewalls, especially not
> >> common setups like yours.
> >>
> >> conntrack had a bug preventing TFO deployment for a while, because
> >> many boxes kept buggy kernel versions for years.
> >>
> >> 356d7d88e088687b6578ca64601b0a2c9d145296 netfilter: nf_conntrack: fix
> >> tcp_in_window for Fast Open
> > Jaco could also try with
> > net.netfilter.nf_conntrack_tcp_be_liberal=1
> >
> > and, if that helps, with liberal=0 and
> > sysctl net.netfilter.nf_conntrack_log_invalid=6
> >
> > (check dmesg/syslog/nflog).
>
> Our core firewalls already had nf_conntrack_tcp_be_liberal for other
> reasons (asymmetric routing combined with conntrackd left-over if I
> recall), so maybe that's why it got through there ... don't exactly want
> to just flip that setting though, is there a way to log if it would have
> dropped anything, without actually dropping it (yet)?

This means conntrack doesn't tag packets as invalid EVEN if it would
consider sequence/ack out-of-window (e.g. due to a bug).

I have a hard time seeing how tcp liberal-mode conntrack would be to
blame here.

Only thing you could also check is if
net.netfilter.nf_conntrack_checksum=0 helps (but i doubt it).

2022-04-01 23:43:43

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Tue, Mar 29, 2022 at 9:03 PM Jaco <[email protected]> wrote:
...
> Connection setup:
>
> 00:56:17.055481 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [S], seq 956633779, win 62580, options [mss 8940,nop,nop,TS val 3687705482 ecr 0,nop,wscale 7,tfo cookie f025dd84b6122510,nop,nop], length 0
>
> 00:56:17.217747 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [S.], seq 726465675, ack 956633780, win 65535, options [mss 1440,nop,nop,TS val 3477429218 ecr 3687705482,nop,wscale 8], length 0
>
> 00:56:17.218628 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465676:726465760, ack 956633780, win 256, options [nop,nop,TS val 3477429220 ecr 3687705482], length 84: SMTP: 220 mx.google.com ESMTP e16-20020a05600c4e5000b0038c77be9b2dsi226281wmq.72 - gsmtp
>
> 00:56:17.218663 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], ack 726465760, win 489, options [nop,nop,TS val 3687705645 ecr 3477429220], length 0
>
> This is pretty normal, we advertise an MSS of 8940 and the return is 1440, thus
> we shouldn't send segments larger than that, and they "can't". I need to
> determine if this is some form of offloading or they really are sending >1500
> byte frames (which I know won't pass our firewalls without fragmentation so
> probably some form of NIC offloading - which if it was active on older 5.8
> kernels did not cause problems):

Jaco, was there some previous kernel version on these client machines
where this problem did not show up? Perhaps the v5.8 version you
mention here? Can you please share the exact version number?

If so, a hypothesis would be:

(1) There is a bug in netfilter's handling of TFO connections where
the server sends a data packet after a TFO SYNACK, before the client
ACKs anything (as we see in this trace).

This bug is perhaps similar in character to the bug fixed by Yuchung's
2013 commit that Eric mentioned:

356d7d88e088687b6578ca64601b0a2c9d145296
netfilter: nf_conntrack: fix tcp_in_window for Fast Open

(2) With kernel v5.8, TFO blackhole detection detected that in your
workload there were TFO connections that died due to apparent
blackholing (like what's shown in the trace), and dynamically disabled
TFO on your machines. This allowed mail traffic to flow, because the
netfilter bug was no longer tickled. This worked around the netfilter
bug.

(3) You upgraded your client-side machine from v5.8 to v5.17, which
has the following commit from v5.14, which disables TFO blackhole
logic by default:
213ad73d0607 tcp: disable TFO blackhole logic by default

(4) Due to (3), the blackhole detection logic was no longer operative,
and when the netfilter bug blackholed the connection, TFO stayed
enabled. This caused mail traffic to Google to stall.

This hypothesis would explain why:
o disabling TFO fixes this problem
o you are seeing this with a newer kernel (and apparently not with a
kernel before v5.14?)

With this hypothesis, we need several pieces to trigger this:

(a) client side software that tries TFO to a server that supports TFO
(like the exim mail transfer agent you are using, connecting to
Google)

(b) a client-side Linux kernel running buggy netfilter code (you are
running netfilter)

(c) a client-side Linux kernel with TFO support but no blackhole
detection logic active (e.g. v5.14 or later, like your v5.17.1)

That's probably a rare combination, so would explain why we have not
had this report before.

Jaco, to provide some evidence for this hypothesis, can you please
re-enable fastopen but also enable the TFO blackhole detection that
was disabled in v5.14 (213ad73d0607), with something like:

sysctl -w net.ipv4.tcp_fastopen=1
sysctl -w tcp_fastopen_blackhole_timeout=3600

And then after a few hours, check to see if this blackholing behavior
has been detected:
nstat -az | grep -i blackhole
And see if TFO FastOpenActive attempts have been cut to a super-low rate:
nstat -az | grep -i fastopenactive

thanks,
neal

2022-04-02 11:42:54

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Fri, Apr 1, 2022 at 4:36 AM Jaco Kroon <[email protected]> wrote:
>
> Hi Eric,
>
> On 2022/04/01 02:54, Eric Dumazet wrote:
> > On Thu, Mar 31, 2022 at 5:41 PM Eric Dumazet <[email protected]> wrote:
> >> On Thu, Mar 31, 2022 at 5:33 PM Jaco Kroon <[email protected]> wrote:
> >>
> >>> I'll deploy same on a dev host we've got in the coming week and start a
> >>> bisect process.
> >> Thanks, this will definitely help.
> > One thing I noticed in your pcap is a good amount of drops, as if
> > Hystart was not able to stop slow-start before the drops are
> > happening.
> >
> > TFO with one less RTT at connection establishment could be the trigger.
> >
> > If you are still using cubic, please try to revert.
> Sorry, I understand TCP itself a bit, but I've given up trying to
> understand the various schedulers a long time ago and am just using the
> defaults that the kernel provides. How do I check what I'm using, and
> how can I change that? What is recommended at this stage?

How to check: cat /proc/sys/net/ipv4/tcp_congestion_control"

This is of course orthogonal to the buf we are tracking here,
but given your long RTT, I would recommend using fq packet scheduler and bbr.

tc qd replace dev eth0 root fq # or use mq+fq if your NIC is multi
queue and you need a good amount of throughput

insmod tcp_bbr # (after enabling CONFIG_TCP_CONG_BBR=m)
echo bbr >/proc/sys/net/ipv4/tcp_congestion_control

> >
> >
> > commit 4e1fddc98d2585ddd4792b5e44433dcee7ece001
> > Author: Eric Dumazet <[email protected]>
> > Date: Tue Nov 23 12:25:35 2021 -0800
> >
> > tcp_cubic: fix spurious Hystart ACK train detections for
> > not-cwnd-limited flows
> Ok, instead of starting with bisect, if I can reproduce in dev I'll use
> this one first.

Thanks ! (again this won't fix the bug, this is really a shoot in the dark)

2022-04-02 19:58:35

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Fri, Apr 1, 2022 at 11:39 AM Neal Cardwell <[email protected]> wrote:
...
> Jaco, to provide some evidence for this hypothesis, can you please
> re-enable fastopen but also enable the TFO blackhole detection that
> was disabled in v5.14 (213ad73d0607), with something like:
>
> sysctl -w net.ipv4.tcp_fastopen=1
> sysctl -w tcp_fastopen_blackhole_timeout=3600

I would also suggest using Florian's suggestion to log invalid
packets, so perhaps we can get a clue as to why netfilter thinks these
packets are invalid:

sysctl net.netfilter.nf_conntrack_log_invalid=6

> And then after a few hours, check to see if this blackholing behavior
> has been detected:
> nstat -az | grep -i blackhole
> And see if TFO FastOpenActive attempts have been cut to a super-low rate:
> nstat -az | grep -i fastopenactive

Then I would correspondingly echo Florian's suggestion to check
dmesg/syslog/nflog to learn more about the drops.

neal

2022-04-03 12:47:20

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Hi Florian,

On 2022/04/02 16:14, Florian Westphal wrote:
> Jaco Kroon <[email protected]> wrote:
>> Including sysctl net.netfilter.nf_conntrack_log_invalid=6- which
>> generates lots of logs, something specific I should be looking for? I
>> suspect these relate:
>>
>> [Sat Apr 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
>> (over the window of the receiver) IN= OUT=bond0
>> SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
>> DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
>> FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689938314 ACK=4200412020
>> WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
>> GID=12
> I thought this had "liberal mode" enabled for tcp conntrack?
> The above implies its off.

We have liberal on the core firewalls, not on the endpoints ... yes, we
do double firewall :).

So the firewalls into the subnets has liberal mode (which really was an
oversight when axing conntrackd), but the servers themselves do not.

Kind Regards,
Jaco

2022-04-03 13:43:16

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Sat, Apr 2, 2022 at 12:32 PM Eric Dumazet <[email protected]> wrote:
>
> On Sat, Apr 2, 2022 at 9:29 AM Neal Cardwell <[email protected]> wrote:
> >
> > FWIW those log entries indicate netfilter on the mail client machine
> > dropping consecutive outbound skbs with 2*MSS of payload. So that
> > explains the large consecutive losses of client data packets to the
> > e-mail server. That seems to confirm my earlier hunch that those drops
> > of consecutive client data packets "do not look like normal congestive
> > packet loss".
>
>
> This also explains why we have all these tiny 2-MSS packets in the pcap.
>
> Under normal conditions, autocorking should kick in, allowing TCP to
> build bigger TSO packets.

I have not looked at the conntrack code before today, but AFAICT this
is the buggy section of nf_conntrack_proto_tcp.c:

} else if (((state->state == TCP_CONNTRACK_SYN_SENT
&& dir == IP_CT_DIR_ORIGINAL)
|| (state->state == TCP_CONNTRACK_SYN_RECV
&& dir == IP_CT_DIR_REPLY))
&& after(end, sender->td_end)) {
/*
* RFC 793: "if a TCP is reinitialized ... then it need
* not wait at all; it must only be sure to use sequence
* numbers larger than those recently used."
*/
sender->td_end =
sender->td_maxend = end;
sender->td_maxwin = (win == 0 ? 1 : win);

tcp_options(skb, dataoff, tcph, sender);

Note that the tcp_options() function implicitly assumes it is being
called on a SYN, because it sets state->td_scale to 0 and only sets
state->td_scale to something non-zero if it sees a wscale option. So
if we ever call that on an skb that's not a SYN, we will forget that
the connection is using the wscale option.

But at this point in the code it is calling tcp_options() without
first checking that this is a SYN.

For this TFO scenario like the one in the trace, where the server
sends its first data packet after the SYNACK packet and before the
client's first ACK, presumably the conntrack state machine is
(correctly) SYN_RECV, and then (incorrectly) executes this code,
including the call to tcp_options(), on this first data packet, which
has no SYN bit, and no wscale option. Thus tcp_options() zeroes out
the server's sending state td_scale and does not set it to a non-zero
value. So now conntrack thinks the server is not using the wscale
option. So when conntrack interprets future receive windows from the
server, it does not scale them (with: win <<= sender->td_scale;), so
in this scenario the estimated right edge of the server's receive
window (td_maxend) is never advanced past the roughly 64KB value
offered in the SYN. Thus when the client sends data packets beyond
64KBytes, conntrack declares them invalid and drops them, due to
failing the condition Eric noted above:

before(seq, sender->td_maxend + 1),

This explains my previous observation that the client's original data
packet transmissions are always dropped after the first 64KBytes.

Someone more familiar with conntrack may have a good idea about how to
best fix this?

neal

2022-04-04 06:40:55

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Sat, Apr 2, 2022 at 1:42 AM Jaco Kroon <[email protected]> wrote:
>
> Hi Neal,
>
> On 2022/04/01 17:39, Neal Cardwell wrote:
> > On Tue, Mar 29, 2022 at 9:03 PM Jaco <[email protected]> wrote:
> > ...
> >> Connection setup:
> >>
> >> 00:56:17.055481 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [S], seq 956633779, win 62580, options [mss 8940,nop,nop,TS val 3687705482 ecr 0,nop,wscale 7,tfo cookie f025dd84b6122510,nop,nop], length 0
> >>
> >> 00:56:17.217747 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [S.], seq 726465675, ack 956633780, win 65535, options [mss 1440,nop,nop,TS val 3477429218 ecr 3687705482,nop,wscale 8], length 0
> >>
> >> 00:56:17.218628 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465676:726465760, ack 956633780, win 256, options [nop,nop,TS val 3477429220 ecr 3687705482], length 84: SMTP: 220 mx.google.com ESMTP e16-20020a05600c4e5000b0038c77be9b2dsi226281wmq.72 - gsmtp
> >>
> >> 00:56:17.218663 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], ack 726465760, win 489, options [nop,nop,TS val 3687705645 ecr 3477429220], length 0
> >>
> >> This is pretty normal, we advertise an MSS of 8940 and the return is 1440, thus
> >> we shouldn't send segments larger than that, and they "can't". I need to
> >> determine if this is some form of offloading or they really are sending >1500
> >> byte frames (which I know won't pass our firewalls without fragmentation so
> >> probably some form of NIC offloading - which if it was active on older 5.8
> >> kernels did not cause problems):
> > Jaco, was there some previous kernel version on these client machines
> > where this problem did not show up? Perhaps the v5.8 version you
> > mention here? Can you please share the exact version number?
> 5.8.14
> >
> > If so, a hypothesis would be:
> >
> > (1) There is a bug in netfilter's handling of TFO connections where
> > the server sends a data packet after a TFO SYNACK, before the client
> > ACKs anything (as we see in this trace).
> >
> > This bug is perhaps similar in character to the bug fixed by Yuchung's
> > 2013 commit that Eric mentioned:
> >
> > 356d7d88e088687b6578ca64601b0a2c9d145296
> > netfilter: nf_conntrack: fix tcp_in_window for Fast Open
> >
> > (2) With kernel v5.8, TFO blackhole detection detected that in your
> > workload there were TFO connections that died due to apparent
> > blackholing (like what's shown in the trace), and dynamically disabled
> > TFO on your machines. This allowed mail traffic to flow, because the
> > netfilter bug was no longer tickled. This worked around the netfilter
> > bug.
> >
> > (3) You upgraded your client-side machine from v5.8 to v5.17, which
> > has the following commit from v5.14, which disables TFO blackhole
> > logic by default:
> > 213ad73d0607 tcp: disable TFO blackhole logic by default
> >
> > (4) Due to (3), the blackhole detection logic was no longer operative,
> > and when the netfilter bug blackholed the connection, TFO stayed
> > enabled. This caused mail traffic to Google to stall.
> >
> > This hypothesis would explain why:
> > o disabling TFO fixes this problem
> > o you are seeing this with a newer kernel (and apparently not with a
> > kernel before v5.14?)
> Agreed.
> >
> > With this hypothesis, we need several pieces to trigger this:
> >
> > (a) client side software that tries TFO to a server that supports TFO
> > (like the exim mail transfer agent you are using, connecting to
> > Google)
> >
> > (b) a client-side Linux kernel running buggy netfilter code (you are
> > running netfilter)
> >
> > (c) a client-side Linux kernel with TFO support but no blackhole
> > detection logic active (e.g. v5.14 or later, like your v5.17.1)
> >
> > That's probably a rare combination, so would explain why we have not
> > had this report before.
> >
> > Jaco, to provide some evidence for this hypothesis, can you please
> > re-enable fastopen but also enable the TFO blackhole detection that
> > was disabled in v5.14 (213ad73d0607), with something like:
> >
> > sysctl -w net.ipv4.tcp_fastopen=1
> > sysctl -w tcp_fastopen_blackhole_timeout=3600
>
> Done.
>
> Including sysctl net.netfilter.nf_conntrack_log_invalid=6- which
> generates lots of logs, something specific I should be looking for? I
> suspect these relate:
>
> [Sat Apr 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
> (over the window of the receiver) IN= OUT=bond0
> SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
> DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
> FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689938314 ACK=4200412020
> WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
> GID=12
> [Sat Apr 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
> (over the window of the receiver) IN= OUT=bond0
> SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
> DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
> FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689941170 ACK=4200412020
> WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
> GID=12
>
> (There are many more of those, and the remote side is Google in this case)
>

Great. This confirms our suspicions.

Please try the following patch that landed in 5.18-rc

f2dd495a8d589371289981d5ed33e6873df94ecc netfilter: nf_conntrack_tcp:
preserve liberal flag in tcp options

CC netfilter folks.

Condition triggering the bug :
before(seq, sender->td_maxend + 1),

I took a look at the code, and it is not clear if td_maxend is
properly setup (or if td_scale is cleared at some point while it
should not)

Alternatively, if conntracking does not know if the connection is
using wscale (or what is the scale), the "before(seq,
sender->td_maxend + 1),"
should not be evaluated/used.

Also, I do not see where td_maxend is extended in tcp_init_sender()

Probably wrong patch, just to point to the code I do not understand yet.

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c
b/net/netfilter/nf_conntrack_proto_tcp.c
index 8ec55cd72572e0cca076631e2cc1c11f0c2b86f6..950082785d61b7a2768559c7500d3aee3aaea7c2
100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -456,9 +456,10 @@ static void tcp_init_sender(struct ip_ct_tcp_state *sender,
/* SYN-ACK in reply to a SYN
* or SYN from reply direction in simultaneous open.
*/
- sender->td_end =
- sender->td_maxend = end;
- sender->td_maxwin = (win == 0 ? 1 : win);
+ sender->td_end = end;
+ sender->td_maxwin = max(win, 1U);
+ /* WIN in SYN & SYNACK is not scaled */
+ sender->td_maxend = end + sender->td_maxwin;

tcp_options(skb, dataoff, tcph, sender);
/* RFC 1323:

2022-04-04 15:44:16

by Florian Westphal

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Jaco Kroon <[email protected]> wrote:
> Including sysctl net.netfilter.nf_conntrack_log_invalid=6- which
> generates lots of logs, something specific I should be looking for?? I
> suspect these relate:
>
> [Sat Apr? 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
> (over the window of the receiver) IN= OUT=bond0
> SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
> DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
> FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689938314 ACK=4200412020
> WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
> GID=12

I thought this had "liberal mode" enabled for tcp conntrack?
The above implies its off.

2022-04-04 21:29:32

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

)

On Sat, Apr 2, 2022 at 4:42 AM Jaco Kroon <[email protected]> wrote:
>
> Hi Neal,
>
> On 2022/04/01 17:39, Neal Cardwell wrote:
> > On Tue, Mar 29, 2022 at 9:03 PM Jaco <[email protected]> wrote:
> > ...
> >> Connection setup:
> >>
> >> 00:56:17.055481 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [S], seq 956633779, win 62580, options [mss 8940,nop,nop,TS val 3687705482 ecr 0,nop,wscale 7,tfo cookie f025dd84b6122510,nop,nop], length 0
> >>
> >> 00:56:17.217747 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [S.], seq 726465675, ack 956633780, win 65535, options [mss 1440,nop,nop,TS val 3477429218 ecr 3687705482,nop,wscale 8], length 0
> >>
> >> 00:56:17.218628 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465676:726465760, ack 956633780, win 256, options [nop,nop,TS val 3477429220 ecr 3687705482], length 84: SMTP: 220 mx.google.com ESMTP e16-20020a05600c4e5000b0038c77be9b2dsi226281wmq.72 - gsmtp
> >>
> >> 00:56:17.218663 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], ack 726465760, win 489, options [nop,nop,TS val 3687705645 ecr 3477429220], length 0
> >>
> >> This is pretty normal, we advertise an MSS of 8940 and the return is 1440, thus
> >> we shouldn't send segments larger than that, and they "can't". I need to
> >> determine if this is some form of offloading or they really are sending >1500
> >> byte frames (which I know won't pass our firewalls without fragmentation so
> >> probably some form of NIC offloading - which if it was active on older 5.8
> >> kernels did not cause problems):
> > Jaco, was there some previous kernel version on these client machines
> > where this problem did not show up? Perhaps the v5.8 version you
> > mention here? Can you please share the exact version number?
> 5.8.14

Thanks for the client kernel version! (5.8.14)

> > If so, a hypothesis would be:
> >
> > (1) There is a bug in netfilter's handling of TFO connections where
> > the server sends a data packet after a TFO SYNACK, before the client
> > ACKs anything (as we see in this trace).
> >
> > This bug is perhaps similar in character to the bug fixed by Yuchung's
> > 2013 commit that Eric mentioned:
> >
> > 356d7d88e088687b6578ca64601b0a2c9d145296
> > netfilter: nf_conntrack: fix tcp_in_window for Fast Open
> >
> > (2) With kernel v5.8, TFO blackhole detection detected that in your
> > workload there were TFO connections that died due to apparent
> > blackholing (like what's shown in the trace), and dynamically disabled
> > TFO on your machines. This allowed mail traffic to flow, because the
> > netfilter bug was no longer tickled. This worked around the netfilter
> > bug.
> >
> > (3) You upgraded your client-side machine from v5.8 to v5.17, which
> > has the following commit from v5.14, which disables TFO blackhole
> > logic by default:
> > 213ad73d0607 tcp: disable TFO blackhole logic by default
> >
> > (4) Due to (3), the blackhole detection logic was no longer operative,
> > and when the netfilter bug blackholed the connection, TFO stayed
> > enabled. This caused mail traffic to Google to stall.
> >
> > This hypothesis would explain why:
> > o disabling TFO fixes this problem
> > o you are seeing this with a newer kernel (and apparently not with a
> > kernel before v5.14?)
> Agreed.
> >
> > With this hypothesis, we need several pieces to trigger this:
> >
> > (a) client side software that tries TFO to a server that supports TFO
> > (like the exim mail transfer agent you are using, connecting to
> > Google)
> >
> > (b) a client-side Linux kernel running buggy netfilter code (you are
> > running netfilter)
> >
> > (c) a client-side Linux kernel with TFO support but no blackhole
> > detection logic active (e.g. v5.14 or later, like your v5.17.1)
> >
> > That's probably a rare combination, so would explain why we have not
> > had this report before.
> >
> > Jaco, to provide some evidence for this hypothesis, can you please
> > re-enable fastopen but also enable the TFO blackhole detection that
> > was disabled in v5.14 (213ad73d0607), with something like:
> >
> > sysctl -w net.ipv4.tcp_fastopen=1
> > sysctl -w tcp_fastopen_blackhole_timeout=3600
>
> Done.

Thanks for running that experiment and reporting your data! That was
super-informative. So it seems like we have a working high-level
theory about what's going on and where, and we just need to pinpoint
the buggy lines in the netfilter conntrack code running on the mail
client machines.

> Including sysctl net.netfilter.nf_conntrack_log_invalid=6- which
> generates lots of logs, something specific I should be looking for? I
> suspect these relate:
>
> [Sat Apr 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
> (over the window of the receiver) IN= OUT=bond0
> SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
> DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
> FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689938314 ACK=4200412020
> WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
> GID=12
> [Sat Apr 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
> (over the window of the receiver) IN= OUT=bond0
> SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
> DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
> FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689941170 ACK=4200412020
> WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
> GID=12
>
> (There are many more of those, and the remote side is Google in this case)

FWIW those log entries indicate netfilter on the mail client machine
dropping consecutive outbound skbs with 2*MSS of payload. So that
explains the large consecutive losses of client data packets to the
e-mail server. That seems to confirm my earlier hunch that those drops
of consecutive client data packets "do not look like normal congestive
packet loss".

neal

2022-04-04 21:51:16

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Hi Neal,

On 2022/04/01 17:39, Neal Cardwell wrote:
> On Tue, Mar 29, 2022 at 9:03 PM Jaco <[email protected]> wrote:
> ...
>> Connection setup:
>>
>> 00:56:17.055481 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [S], seq 956633779, win 62580, options [mss 8940,nop,nop,TS val 3687705482 ecr 0,nop,wscale 7,tfo cookie f025dd84b6122510,nop,nop], length 0
>>
>> 00:56:17.217747 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [S.], seq 726465675, ack 956633780, win 65535, options [mss 1440,nop,nop,TS val 3477429218 ecr 3687705482,nop,wscale 8], length 0
>>
>> 00:56:17.218628 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465676:726465760, ack 956633780, win 256, options [nop,nop,TS val 3477429220 ecr 3687705482], length 84: SMTP: 220 mx.google.com ESMTP e16-20020a05600c4e5000b0038c77be9b2dsi226281wmq.72 - gsmtp
>>
>> 00:56:17.218663 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], ack 726465760, win 489, options [nop,nop,TS val 3687705645 ecr 3477429220], length 0
>>
>> This is pretty normal, we advertise an MSS of 8940 and the return is 1440, thus
>> we shouldn't send segments larger than that, and they "can't". I need to
>> determine if this is some form of offloading or they really are sending >1500
>> byte frames (which I know won't pass our firewalls without fragmentation so
>> probably some form of NIC offloading - which if it was active on older 5.8
>> kernels did not cause problems):
> Jaco, was there some previous kernel version on these client machines
> where this problem did not show up? Perhaps the v5.8 version you
> mention here? Can you please share the exact version number?
5.8.14
>
> If so, a hypothesis would be:
>
> (1) There is a bug in netfilter's handling of TFO connections where
> the server sends a data packet after a TFO SYNACK, before the client
> ACKs anything (as we see in this trace).
>
> This bug is perhaps similar in character to the bug fixed by Yuchung's
> 2013 commit that Eric mentioned:
>
> 356d7d88e088687b6578ca64601b0a2c9d145296
> netfilter: nf_conntrack: fix tcp_in_window for Fast Open
>
> (2) With kernel v5.8, TFO blackhole detection detected that in your
> workload there were TFO connections that died due to apparent
> blackholing (like what's shown in the trace), and dynamically disabled
> TFO on your machines. This allowed mail traffic to flow, because the
> netfilter bug was no longer tickled. This worked around the netfilter
> bug.
>
> (3) You upgraded your client-side machine from v5.8 to v5.17, which
> has the following commit from v5.14, which disables TFO blackhole
> logic by default:
> 213ad73d0607 tcp: disable TFO blackhole logic by default
>
> (4) Due to (3), the blackhole detection logic was no longer operative,
> and when the netfilter bug blackholed the connection, TFO stayed
> enabled. This caused mail traffic to Google to stall.
>
> This hypothesis would explain why:
> o disabling TFO fixes this problem
> o you are seeing this with a newer kernel (and apparently not with a
> kernel before v5.14?)
Agreed.
>
> With this hypothesis, we need several pieces to trigger this:
>
> (a) client side software that tries TFO to a server that supports TFO
> (like the exim mail transfer agent you are using, connecting to
> Google)
>
> (b) a client-side Linux kernel running buggy netfilter code (you are
> running netfilter)
>
> (c) a client-side Linux kernel with TFO support but no blackhole
> detection logic active (e.g. v5.14 or later, like your v5.17.1)
>
> That's probably a rare combination, so would explain why we have not
> had this report before.
>
> Jaco, to provide some evidence for this hypothesis, can you please
> re-enable fastopen but also enable the TFO blackhole detection that
> was disabled in v5.14 (213ad73d0607), with something like:
>
> sysctl -w net.ipv4.tcp_fastopen=1
> sysctl -w tcp_fastopen_blackhole_timeout=3600

Done.

Including sysctl net.netfilter.nf_conntrack_log_invalid=6- which
generates lots of logs, something specific I should be looking for? I
suspect these relate:

[Sat Apr 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
(over the window of the receiver) IN= OUT=bond0
SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689938314 ACK=4200412020
WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
GID=12
[Sat Apr 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
(over the window of the receiver) IN= OUT=bond0
SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689941170 ACK=4200412020
WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
GID=12

(There are many more of those, and the remote side is Google in this case)

>
> And then after a few hours, check to see if this blackholing behavior
> has been detected:
> nstat -az | grep -i blackhole
> And see if TFO FastOpenActive attempts have been cut to a super-low rate:
> nstat -az | grep -i fastopenactive

uriel [06:10:03] ~ # nstat -az | grep -i fastopen
TcpExtTCPFastOpenActive         0                  0.0
TcpExtTCPFastOpenActiveFail     3739               0.0
TcpExtTCPFastOpenPassive        0                  0.0
TcpExtTCPFastOpenPassiveFail    0                  0.0
TcpExtTCPFastOpenListenOverflow 0                  0.0
TcpExtTCPFastOpenCookieReqd     3378               0.0
TcpExtTCPFastOpenBlackhole      0                  0.0
TcpExtTCPFastOpenPassiveAltKey 0                  0.0

uriel [09:54:54] ~ # nstat -az | grep -i fastopen
TcpExtTCPFastOpenActive         0                  0.0
TcpExtTCPFastOpenActiveFail     3742               0.0
TcpExtTCPFastOpenPassive        0                  0.0
TcpExtTCPFastOpenPassiveFail    0                  0.0
TcpExtTCPFastOpenListenOverflow 0                  0.0
TcpExtTCPFastOpenCookieReqd     3391               0.0
TcpExtTCPFastOpenBlackhole      3                  0.0
TcpExtTCPFastOpenPassiveAltKey 0                  0.0

I'm fairly certain that strongly supports your theory. So I *suspect*
the next test would be something like:

Disable the blackhole again, let the queue build up a few minutes until
we have something from google. Shut down exim so we can isolate SMTP
traffic. tcpdump again, capturing the traffic, and correlate the FW
logs with the connection?

Kind Regards,
Jaco

2022-04-04 23:05:43

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Hi,

On 2022/04/02 15:20, Eric Dumazet wrote:

> Great. This confirms our suspicions.
>
> Please try the following patch that landed in 5.18-rc
>
> f2dd495a8d589371289981d5ed33e6873df94ecc netfilter: nf_conntrack_tcp:
> preserve liberal flag in tcp options

Will track this down and deploy in the next day or two. Thank you, Neal
and Florian for all the assistance!

As an aside, would really like to engage with someone that can assist on
the known congestion w.r.t. Google services in JHB, so if you're willing
- or can get me in contact with the right people, please do contact me
direct off-list (we've alleviated the issue by upgrading out IPT but
would like to understand what is going on, can provide ticket references).

Kind Regards,
Jaco

>
> CC netfilter folks.
>
> Condition triggering the bug :
> before(seq, sender->td_maxend + 1),
>
> I took a look at the code, and it is not clear if td_maxend is
> properly setup (or if td_scale is cleared at some point while it
> should not)
>
> Alternatively, if conntracking does not know if the connection is
> using wscale (or what is the scale), the "before(seq,
> sender->td_maxend + 1),"
> should not be evaluated/used.
>
> Also, I do not see where td_maxend is extended in tcp_init_sender()
>
> Probably wrong patch, just to point to the code I do not understand yet.
>
> diff --git a/net/netfilter/nf_conntrack_proto_tcp.c
> b/net/netfilter/nf_conntrack_proto_tcp.c
> index 8ec55cd72572e0cca076631e2cc1c11f0c2b86f6..950082785d61b7a2768559c7500d3aee3aaea7c2
> 100644
> --- a/net/netfilter/nf_conntrack_proto_tcp.c
> +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> @@ -456,9 +456,10 @@ static void tcp_init_sender(struct ip_ct_tcp_state *sender,
> /* SYN-ACK in reply to a SYN
> * or SYN from reply direction in simultaneous open.
> */
> - sender->td_end =
> - sender->td_maxend = end;
> - sender->td_maxwin = (win == 0 ? 1 : win);
> + sender->td_end = end;
> + sender->td_maxwin = max(win, 1U);
> + /* WIN in SYN & SYNACK is not scaled */
> + sender->td_maxend = end + sender->td_maxwin;
>
> tcp_options(skb, dataoff, tcph, sender);
> /* RFC 1323:

2022-04-05 01:12:56

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Hi Eric,

On 2022/04/01 02:54, Eric Dumazet wrote:
> On Thu, Mar 31, 2022 at 5:41 PM Eric Dumazet <[email protected]> wrote:
>> On Thu, Mar 31, 2022 at 5:33 PM Jaco Kroon <[email protected]> wrote:
>>
>>> I'll deploy same on a dev host we've got in the coming week and start a
>>> bisect process.
>> Thanks, this will definitely help.
> One thing I noticed in your pcap is a good amount of drops, as if
> Hystart was not able to stop slow-start before the drops are
> happening.
>
> TFO with one less RTT at connection establishment could be the trigger.
>
> If you are still using cubic, please try to revert.
Sorry, I understand TCP itself a bit, but I've given up trying to
understand the various schedulers a long time ago and am just using the
defaults that the kernel provides. How do I check what I'm using, and
how can I change that? What is recommended at this stage?
>
>
> commit 4e1fddc98d2585ddd4792b5e44433dcee7ece001
> Author: Eric Dumazet <[email protected]>
> Date: Tue Nov 23 12:25:35 2021 -0800
>
> tcp_cubic: fix spurious Hystart ACK train detections for
> not-cwnd-limited flows
Ok, instead of starting with bisect, if I can reproduce in dev I'll use
this one first.

Kind Regards,
Jaco

2022-04-05 01:22:16

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Sat, Apr 2, 2022 at 9:29 AM Neal Cardwell <[email protected]> wrote:
>
> FWIW those log entries indicate netfilter on the mail client machine
> dropping consecutive outbound skbs with 2*MSS of payload. So that
> explains the large consecutive losses of client data packets to the
> e-mail server. That seems to confirm my earlier hunch that those drops
> of consecutive client data packets "do not look like normal congestive
> packet loss".

This also explains why we have all these tiny 2-MSS packets in the pcap.

Under normal conditions, autocorking should kick in, allowing TCP to
build bigger TSO packets.

2022-04-05 01:33:56

[permalink] [raw]

Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Thu, Mar 31, 2022 at 7:06 PM Jaco Kroon <[email protected]> wrote:
>
> Hi Neal,
>
> This sniff was grabbed ON THE CLIENT HOST. There is no middlebox or
> anything between the sniffer and the client. Only the firewall on the
> host itself, where we've already establish the traffic is NOT DISCARDED
> (at least not in filter/INPUT).

Yes, understood. Please excuse my general use of the term
"firewalls/middleboxes" even where in some contexts it's clear the
"middleboxes" aspect of that term could not apply. :-)

> Setup on our end:
>
> 2 x routers, usually each with a direct peering with Google (which is
> being ignored at the moment so instead traffic is incoming via IPT over DD).
>
> Connected via switch to
>
> 2 x firewalls, of which ONE is active (they have different networks
> behind them, and could be active / standby for different networks behind
> them - avoiding active-active because conntrackd is causing more trouble
> than it's worth), Linux hosts, using netfilter, has been operating for
> years, no recent kernel upgrades.
>
> 4 x hosts in mail cluster, one of which you're looking at here.
>
> On 2022/03/31 17:41, Neal Cardwell wrote:
> > On Wed, Mar 30, 2022 at 9:04 AM Jaco Kroon <[email protected]> wrote:
> > ...
> >> When you state sane/normal, do you mean there is fault with the other
> >> frames that could not be explained by packet loss in one or both of the
> >> directions?
> > Yes.
> >
> > (1) If you look at the attached trace time/sequence plots (from
> > tcptrace and xplot.org) there are several behaviors that do not look
> > like normal congestive packet loss:
> OK. I'm not 100% sure how these plots of yours work, but let's see if I
> can follow your logic here - they mostly make sense. A legend would
> probably help. As I understand the white dots are original transmits,
> green is what has been ACKED. R is retransmits ... what's the S?

"S" is "SACKed", or selectively acknowledged. The SACK blocks below
the green ACK lines are DSACK blocks, for "Duplicate SACKs",
indicating the receiver has already received that sequence range.

> What's the yellow line (I'm guessing receive window as advertised by the
> server)?

Yes, the yellow line is the right edge of the receive window of the server.

> > (a) Literally *all* original transmissions (white segments in the
> > plot) of packets after client sequence 66263 appear lost (are not
> > ACKed). Congestion generally does not behave like that. But broken
> > firewalls/middleboxes do.
> > (See netdev-2022-03-29-tcp-disregarded-acks-zoomed-out.png )
>
> Agreed. So could it be that something in the transit path towards
> Google is actually dropping all of that?

It could be. Or it could be a firewall/middlebox.

> As stated - I highly doubt this is on our network unless newer kernel
> (on mail cluster) is doing stuff which is causing older netfilter to
> drop perhaps? But this doesn't explain why newer kernel retransmits
> data for which it received an ACK.

Yes, I agree that the biggest problem to focus on is the TCP code in
the kernel retransmitting data for which the NIC is receiving ACKs.

> >
> > (b) When the client is retransmitting packets, only packets at
> > exactly snd_una are ACKed. The packets beyond that point are always
> > un-ACKed. Again sounds like a broken firewall/middlebox.
> > (See netdev-2022-03-29-tcp-disregarded-acks-zoomed-in.png )
> No middlebox between packet sniffer and client ... client here is linux
> 5.17.1. Brings me back to the only thing that could be dropping the
> traffic is netfilter on the host, or the kernel doesn't like something
> about the ACK, or kernel is doing something else wrong as a result of
> TFO. I'm not sure which option I like less. Unfortunately I also use
> netfilter for redirecting traffic into haproxy here so can't exactly
> just switch off netfilter.

Given the most problematic aspect of the trace, where the client-side
TCP connection is repeatedly retransmitting packets for which ACKs are
arriving at the NIC (and captured by tcpdump), it seems some software
in your kernel is dropping packets between the network device and the
TCP layer. Given that you mention "the only thing that could be
dropping the traffic is netfilter on the host", it seems like the
netfilter rules or software are buggy.

A guess would be that the netfilter code is getting into a bad state
due to the TFO behavior where there is a data packet arriving from the
server immediately after the SYN/ACK and just before the client sends
its first ACK:

00:00:00.000000 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590 >
2a00:1450:4013:c16::1a.25: S 3451342529:3451342529(0) win 62580 <mss
8940,sackOK,TS val 331187616 ecr 0,nop,wscale 7,Unknown Option
3472da7bfe84[|tcp]>

00:00:00.164295 IP6 2a00:1450:4013:c16::1a.25 >
2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: S. 2699962254:2699962254(0)
ack 3451342530 win 65535 <mss 1440,sackOK,TS val 1206542770 ecr
331187616,nop,wscale 8>

# this one is perhaps confusing netfilter?:
00:00:00.001641 IP6 2a00:1450:4013:c16::1a.25 >
2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: P. 1:89(88) ack 1 win 256
<nop,nop,TS val 1206542772 ecr 331187616>

00:00:00.000035 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590 >
2a00:1450:4013:c16::1a.25: . 1:1(0) ack 89 win 489 <nop,nop,TS val
331187782 ecr 1206542772>

00:00:00.000042 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590 >
2a00:1450:4013:c16::1a.25: P. 1:24(23) ack 89 win 489 <nop,nop,TS val
331187782 ecr 1206542772>

Re "so can't exactly just switch off netfilter", are there any other
counters or logs you can somehow check for netfilter drops?

> >
> > (c) After the client receives the server's "ack 73403", the client
> > ignores/drops all other incoming packets that show up in the trace.
>
> Agreed. However, if I read your graph correctly, it gets an ACK for
> frame X at ~3.8s into the connection, then for X+2 at 4s, but it keeps
> retransmitting X+2, not X+1?

At t=4s, as I discussed below there are two ACKs that arrive
back-to-back, where the client TCP apparently processes the first but
not the second. That's why it keeps retransmitting the packet beyond
the first ACk but not beyond the second ACK.

>
> >
> > As Eric notes, this doesn't look like a PAWS issue. And it
> > doesn't look like a checksum or sequence/ACK validation issue. The
> > client starts ignoring ACKs between two ACKs that have correct
> > checksums, valid ACK numbers, and valid (identical) sequence numbers
> > and TS val and ecr values (here showing absolute sequence/ACK
> > numbers):
> I'm not familiar with PAWS here. Assuming that the green line is ACKs,
> then at around 4s we get an ACK that basically ACKs two frames in one
> (which is fine from my understanding of TCP), and then the second of
> these frames keeps getting retransmitted going forward, so it's almost
> like the kernel ACKs the *first* of these two frames but not the second.

Again, there are two ACKs, where the client TCP apparently processes
the first but not the second, as discussed here:

> >
> > (i) The client processes this ACK and uses it to advance snd_una:
> > 17:46:49.889911 IP6 (flowlabel 0x97427, hlim 61, next-header TCP
> > (6) payload length: 32) 2a00:1450:4013:c16::1a.25 >
> > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . cksum 0x7005 (correct)
> > 2699968514:2699968514(0) ack 3451415932 win 830 <nop,nop,TS val
> > 1206546583 ecr 331191428>
>
> >
> > (ii) The client ignores this ACK and all later ACKs:
> > 17:46:49.889912 IP6 (flowlabel 0x97427, hlim 61, next-header TCP
> > (6) payload length: 32) 2a00:1450:4013:c16::1a.25 >
> > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . cksum 0x6a66 (correct)
> > 2699968514:2699968514(0) ack 3451417360 win 841 <nop,nop,TS val
> > 1206546583 ecr 331191428>
> >

Here are those same two ACKs again, shown with absolute time and
relative sequence numbers, to make them easier to parse:

(i) The client processes this ACK and uses it to advance snd_una:
17:46:49.889911 IP6 2a00:1450:4013:c16::1a.25 >
2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . 6260:6260(0) ack 73403 win
830 <nop,nop,TS val 1206546583 ecr 331191428>

(ii) The client ignores this ACK and all later ACKs:
17:46:49.889912 IP6 2a00:1450:4013:c16::1a.25 >
2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . 6260:6260(0) ack 74831 win
841 <nop,nop,TS val 1206546583 ecr 331191428>

neal

2022-04-05 02:41:09