2020-01-31 12:26:51

by SeongJae Park

[permalink] [raw]
Subject: [PATCH 0/3] Fix reconnection latency caused by FIN/ACK handling race

From: SeongJae Park <[email protected]>

When closing a connection, the two acks that required to change closing
socket's status to FIN_WAIT_2 and then TIME_WAIT could be processed in
reverse order. This is possible in RSS disabled environments such as a
connection inside a host.

For example, expected state transitions and required packets for the
disconnection will be similar to below flow.

00 (Process A) (Process B)
01 ESTABLISHED ESTABLISHED
02 close()
03 FIN_WAIT_1
04 ---FIN-->
05 CLOSE_WAIT
06 <--ACK---
07 FIN_WAIT_2
08 <--FIN/ACK---
09 TIME_WAIT
10 ---ACK-->
11 LAST_ACK
12 CLOSED CLOSED

The acks in lines 6 and 8 are the acks. If the line 8 packet is
processed before the line 6 packet, it will be just ignored as it is not
a expected packet, and the later process of the line 6 packet will
change the status of Process A to FIN_WAIT_2, but as it has already
handled line 8 packet, it will not go to TIME_WAIT and thus will not
send the line 10 packet to Process B. Thus, Process B will left in
CLOSE_WAIT status, as below.

00 (Process A) (Process B)
01 ESTABLISHED ESTABLISHED
02 close()
03 FIN_WAIT_1
04 ---FIN-->
05 CLOSE_WAIT
06 (<--ACK---)
07 (<--FIN/ACK---)
08 (fired in right order)
09 <--FIN/ACK---
10 <--ACK---
11 (processed in reverse order)
12 FIN_WAIT_2

Later, if the Process B sends SYN to Process A for reconnection using
the same port, Process A will responds with an ACK for the last flow,
which has no increased sequence number. Thus, Process A will send RST,
wait for TIMEOUT_INIT (one second in default), and then try
reconnection. If reconnections are frequent, the one second latency
spikes can be a big problem. Below is a tcpdump results of the problem:

14.436259 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq 2560603644
14.436266 IP 127.0.0.1.4242 > 127.0.0.1.45150: Flags [.], ack 5, win 512
14.436271 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [R], seq 2541101298
/* ONE SECOND DELAY */
15.464613 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq 2560603644

Patchset Organization
---------------------

The first patch fix a trivial nit. The second one fix the problem by
adjusting the resend delay of the SYN in the case. Finally, the third
patch adds a user space test to reproduce this problem.

The patches are based on the v5.5. You can also clone the complete git
tree:

$ git clone git://github.com/sjp38/linux -b patches/finack_lat/v1

The web is also available:
https://github.com/sjp38/linux/tree/patches/finack_lat/v1

SeongJae Park (3):
net/ipv4/inet_timewait_sock: Fix inconsistent comments
tcp: Reduce SYN resend delay if a suspicous ACK is received
selftests: net: Add FIN_ACK processing order related latency spike
test

net/ipv4/inet_timewait_sock.c | 1 +
net/ipv4/tcp_input.c | 6 +-
tools/testing/selftests/net/.gitignore | 2 +
tools/testing/selftests/net/Makefile | 2 +
tools/testing/selftests/net/fin_ack_lat.sh | 42 ++++++++++
.../selftests/net/fin_ack_lat_accept.c | 49 +++++++++++
.../selftests/net/fin_ack_lat_connect.c | 81 +++++++++++++++++++
7 files changed, 182 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/net/fin_ack_lat.sh
create mode 100644 tools/testing/selftests/net/fin_ack_lat_accept.c
create mode 100644 tools/testing/selftests/net/fin_ack_lat_connect.c

--
2.17.1


2020-01-31 12:27:13

by SeongJae Park

[permalink] [raw]
Subject: [PATCH 3/3] selftests: net: Add FIN_ACK processing order related latency spike test

From: SeongJae Park <[email protected]>

This commit adds a test for FIN_ACK process races related reconnection
latency spike issues. The issue has described and solved by the
previous commit ("tcp: Reduce SYN resend delay if a suspicous ACK is
received").

Signed-off-by: SeongJae Park <[email protected]>
---
tools/testing/selftests/net/.gitignore | 2 +
tools/testing/selftests/net/Makefile | 2 +
tools/testing/selftests/net/fin_ack_lat.sh | 42 ++++++++++
.../selftests/net/fin_ack_lat_accept.c | 49 +++++++++++
.../selftests/net/fin_ack_lat_connect.c | 81 +++++++++++++++++++
5 files changed, 176 insertions(+)
create mode 100755 tools/testing/selftests/net/fin_ack_lat.sh
create mode 100644 tools/testing/selftests/net/fin_ack_lat_accept.c
create mode 100644 tools/testing/selftests/net/fin_ack_lat_connect.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 8aefd81fbc86..1bcf7b5498dd 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -22,3 +22,5 @@ ipv6_flowlabel_mgr
so_txtime
tcp_fastopen_backup_key
nettest
+fin_ack_lat_accept
+fin_ack_lat_connect
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index a8e04d665b69..e4938c26ce3f 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -11,6 +11,7 @@ TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
TEST_PROGS += udpgro_bench.sh udpgro.sh test_vxlan_under_vrf.sh reuseport_addr_any.sh
TEST_PROGS += test_vxlan_fdb_changelink.sh so_txtime.sh ipv6_flowlabel.sh
TEST_PROGS += tcp_fastopen_backup_key.sh fcnal-test.sh l2tp.sh traceroute.sh
+TEST_PROGS += fin_ack_lat.sh
TEST_PROGS_EXTENDED := in_netns.sh
TEST_GEN_FILES = socket nettest
TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
@@ -18,6 +19,7 @@ TEST_GEN_FILES += tcp_mmap tcp_inq psock_snd txring_overwrite
TEST_GEN_FILES += udpgso udpgso_bench_tx udpgso_bench_rx ip_defrag
TEST_GEN_FILES += so_txtime ipv6_flowlabel ipv6_flowlabel_mgr
TEST_GEN_FILES += tcp_fastopen_backup_key
+TEST_GEN_FILES += fin_ack_lat_accept fin_ack_lat_connect
TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls

diff --git a/tools/testing/selftests/net/fin_ack_lat.sh b/tools/testing/selftests/net/fin_ack_lat.sh
new file mode 100755
index 000000000000..0a398c837b7a
--- /dev/null
+++ b/tools/testing/selftests/net/fin_ack_lat.sh
@@ -0,0 +1,42 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test latency spikes caused by FIN/ACK handling race.
+
+set +x
+set -e
+
+tmpfile=$(mktemp /tmp/fin_ack_latency.XXXX.log)
+
+kill_accept() {
+ kill $ACCEPT_PID
+}
+
+cleanup() {
+ kill_accept
+ rm -f $tmpfile
+}
+
+trap cleanup EXIT
+
+do_test() {
+ RUNTIME=$1
+
+ ./fin_ack_lat_accept &
+ ACCEPT_PID=$!
+ sleep 1
+
+ ./fin_ack_lat_connect | tee $tmpfile &
+ sleep $RUNTIME
+ NR_SPIKES=$(wc -l $tmpfile | awk '{print $1}')
+ rm $tmpfile
+ if [ $NR_SPIKES -gt 0 ]
+ then
+ echo "FAIL: $NR_SPIKES spikes detected"
+ return 1
+ fi
+ return 0
+}
+
+do_test "30"
+echo "test done"
diff --git a/tools/testing/selftests/net/fin_ack_lat_accept.c b/tools/testing/selftests/net/fin_ack_lat_accept.c
new file mode 100644
index 000000000000..a0f0210f12b4
--- /dev/null
+++ b/tools/testing/selftests/net/fin_ack_lat_accept.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <error.h>
+#include <netinet/in.h>
+#include <stdio.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+int main(int argc, char const *argv[])
+{
+ int sock, new_sock;
+ int opt = 1;
+ struct sockaddr_in address;
+ int addrlen = sizeof(address);
+ int buffer;
+ int rc;
+
+ sock = socket(AF_INET, SOCK_STREAM, 0);
+ if (!sock)
+ error(-1, -1, "socket");
+
+ rc = setsockopt(sock, SOL_SOCKET, SO_REUSEADDR | SO_REUSEPORT,
+ &opt, sizeof(opt));
+ if (rc == -1)
+ error(-1, -1, "setsockopt");
+
+ address.sin_family = AF_INET;
+ address.sin_addr.s_addr = INADDR_ANY;
+ address.sin_port = htons(4242);
+
+ rc = bind(sock, (struct sockaddr *)&address, sizeof(address));
+ if (rc < 0)
+ error(-1, -1, "bind");
+
+ rc = listen(sock, 3);
+ if (rc < 0)
+ error(-1, -1, "listen");
+
+ while (1) {
+ new_sock = accept(sock, (struct sockaddr *)&address,
+ (socklen_t *)&addrlen);
+ if (new_sock < 0)
+ error(-1, -1, "accept");
+
+ rc = read(new_sock, &buffer, sizeof(buffer));
+ close(new_sock);
+ }
+ return 0;
+}
diff --git a/tools/testing/selftests/net/fin_ack_lat_connect.c b/tools/testing/selftests/net/fin_ack_lat_connect.c
new file mode 100644
index 000000000000..abfdd79f2e17
--- /dev/null
+++ b/tools/testing/selftests/net/fin_ack_lat_connect.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <arpa/inet.h>
+#include <error.h>
+#include <netinet/tcp.h>
+#include <stdio.h>
+#include <sys/socket.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+static unsigned long timediff(struct timeval s, struct timeval e)
+{
+ if (s.tv_sec > e.tv_sec)
+ return 0;
+ return (e.tv_sec - s.tv_sec) * 1000000 + e.tv_usec - s.tv_usec;
+}
+
+int main(int argc, char const *argv[])
+{
+ int sock = 0;
+ struct sockaddr_in addr, laddr;
+ socklen_t len = sizeof(laddr);
+ struct linger sl;
+ int flag = 1;
+ int buffer;
+ int rc;
+ struct timeval start, end;
+ unsigned long lat, sum_lat = 0, nr_lat = 0;
+
+ while (1) {
+ gettimeofday(&start, NULL);
+
+ sock = socket(AF_INET, SOCK_STREAM, 0);
+ if (sock < 0)
+ error(-1, -1, "socket creation");
+
+ sl.l_onoff = 1;
+ sl.l_linger = 0;
+ if (setsockopt(sock, SOL_SOCKET, SO_LINGER, &sl, sizeof(sl)))
+ error(-1, -1, "setsockopt(linger)");
+
+ if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
+ &flag, sizeof(flag)))
+ error(-1, -1, "setsockopt(nodelay)");
+
+ addr.sin_family = AF_INET;
+ addr.sin_port = htons(4242);
+
+ rc = inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
+ if (rc <= 0)
+ error(-1, -1, "inet_pton");
+
+ rc = connect(sock, (struct sockaddr *)&addr, sizeof(addr));
+ if (rc < 0)
+ error(-1, -1, "connect");
+
+ send(sock, &buffer, sizeof(buffer), 0);
+
+ rc = read(sock, &buffer, sizeof(buffer));
+
+ gettimeofday(&end, NULL);
+ lat = timediff(start, end);
+ sum_lat += lat;
+ nr_lat++;
+ if (lat > 100000) {
+ rc = getsockname(sock, (struct sockaddr *)&laddr, &len);
+ if (rc == -1)
+ error(-1, -1, "getsockname");
+ printf("port: %d, lat: %lu, avg: %lu, nr: %lu\n",
+ ntohs(laddr.sin_port), lat,
+ sum_lat / nr_lat, nr_lat);
+ }
+
+ if (nr_lat % 1000 == 0)
+ fflush(stdout);
+
+
+ close(sock);
+ }
+ return 0;
+}
--
2.17.1

2020-01-31 12:27:41

by SeongJae Park

[permalink] [raw]
Subject: [PATCH 1/3] net/ipv4/inet_timewait_sock: Fix inconsistent comments

From: SeongJae Park <[email protected]>

Commit ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait
hashdance") mistakenly erased a comment for the second step of
`inet_twsk_hashdance()`. This commit restores it for better
readability.

Signed-off-by: SeongJae Park <[email protected]>
---
net/ipv4/inet_timewait_sock.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index c411c87ae865..fbfcd63cc170 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -120,6 +120,7 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,

spin_lock(lock);

+ /* Step 2: Hash TW into tcp ehash chain. */
inet_twsk_add_node_rcu(tw, &ehead->chain);

/* Step 3: Remove SK from hash chain */
--
2.17.1

2020-01-31 14:01:43

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 0/3] Fix reconnection latency caused by FIN/ACK handling race

From: [email protected]
> Sent: 31 January 2020 12:24
...
> The acks in lines 6 and 8 are the acks. If the line 8 packet is
> processed before the line 6 packet, it will be just ignored as it is not
> a expected packet, and the later process of the line 6 packet will
> change the status of Process A to FIN_WAIT_2, but as it has already
> handled line 8 packet, it will not go to TIME_WAIT and thus will not
> send the line 10 packet to Process B. Thus, Process B will left in
> CLOSE_WAIT status, as below.
>
> 00 (Process A) (Process B)
> 01 ESTABLISHED ESTABLISHED
> 02 close()
> 03 FIN_WAIT_1
> 04 ---FIN-->
> 05 CLOSE_WAIT
> 06 (<--ACK---)
> 07 (<--FIN/ACK---)
> 08 (fired in right order)
> 09 <--FIN/ACK---
> 10 <--ACK---
> 11 (processed in reverse order)
> 12 FIN_WAIT_2

Why doesn't A treat the FIN/ACK (09) as valid (as if
the ACK had got lost) and then ignore the ACK (10) because
it refers to a closed socket?

I presume that B sends two ACKs (06 and 07) because it can
sit in an intermediate state and the first ACK stops the FIN
being resent?

I've implemented lots of protocols in my time, but not TCP.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-01-31 14:56:10

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 1/3] net/ipv4/inet_timewait_sock: Fix inconsistent comments

On Fri, Jan 31, 2020 at 4:24 AM <[email protected]> wrote:
>
> From: SeongJae Park <[email protected]>
>
> Commit ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait
> hashdance") mistakenly erased a comment for the second step of
> `inet_twsk_hashdance()`. This commit restores it for better
> readability.
>
> Signed-off-by: SeongJae Park <[email protected]>
> ---
> net/ipv4/inet_timewait_sock.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index c411c87ae865..fbfcd63cc170 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -120,6 +120,7 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
>
> spin_lock(lock);
>
> + /* Step 2: Hash TW into tcp ehash chain. */

This comment adds no value, please do not bring it back.

net-next is closed, now is not the time for cosmetic changes.

Also take a look at Documentation/networking/netdev-FAQ.rst

2020-01-31 14:58:32

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 3/3] selftests: net: Add FIN_ACK processing order related latency spike test

On Fri, Jan 31, 2020 at 4:25 AM <[email protected]> wrote:
>
> From: SeongJae Park <[email protected]>
>
> This commit adds a test for FIN_ACK process races related reconnection
> latency spike issues. The issue has described and solved by the
> previous commit ("tcp: Reduce SYN resend delay if a suspicous ACK is
> received").
>

I do not know for other tests, but using a hard coded port (4242) is
going to be flakky, since the port might be already used.

Please make sure to run tests on a separate namespace.

2020-01-31 15:07:46

by SeongJae Park

[permalink] [raw]
Subject: Re: RE: [PATCH 0/3] Fix reconnection latency caused by FIN/ACK handling race

On Fri, 31 Jan 2020 14:00:27 +0000 David Laight <[email protected]> wrote:

> From: [email protected]
> > Sent: 31 January 2020 12:24
> ...
> > The acks in lines 6 and 8 are the acks. If the line 8 packet is
> > processed before the line 6 packet, it will be just ignored as it is not
> > a expected packet, and the later process of the line 6 packet will
> > change the status of Process A to FIN_WAIT_2, but as it has already
> > handled line 8 packet, it will not go to TIME_WAIT and thus will not
> > send the line 10 packet to Process B. Thus, Process B will left in
> > CLOSE_WAIT status, as below.
> >
> > 00 (Process A) (Process B)
> > 01 ESTABLISHED ESTABLISHED
> > 02 close()
> > 03 FIN_WAIT_1
> > 04 ---FIN-->
> > 05 CLOSE_WAIT
> > 06 (<--ACK---)
> > 07 (<--FIN/ACK---)
> > 08 (fired in right order)
> > 09 <--FIN/ACK---
> > 10 <--ACK---
> > 11 (processed in reverse order)
> > 12 FIN_WAIT_2
>
> Why doesn't A treat the FIN/ACK (09) as valid (as if
> the ACK had got lost) and then ignore the ACK (10) because
> it refers to a closed socket?

Because the TCP protocol (RFC 793) doesn't have such speculation. TCP is
stateful protocol. Thus, packets arrived in unexpected state are not required
to be respected, AFAIU.

>
> I presume that B sends two ACKs (06 and 07) because it can
> sit in an intermediate state and the first ACK stops the FIN
> being resent?

I think there is no such presume in the protocol, either.

>
> I've implemented lots of protocols in my time, but not TCP.

If you find anything I'm misunderstanding, please don't hesitate to yell at me.
Hope the previous discussion[1] regarding this issue to be helpful.


Thanks,
SeongJae Park

[1] https://lore.kernel.org/bpf/[email protected]/

>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2020-01-31 15:11:04

by SeongJae Park

[permalink] [raw]
Subject: Re: Re: [PATCH 1/3] net/ipv4/inet_timewait_sock: Fix inconsistent comments

On Fri, 31 Jan 2020 06:54:53 -0800 Eric Dumazet <[email protected]> wrote:

> On Fri, Jan 31, 2020 at 4:24 AM <[email protected]> wrote:
> >
> > From: SeongJae Park <[email protected]>
> >
> > Commit ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait
> > hashdance") mistakenly erased a comment for the second step of
> > `inet_twsk_hashdance()`. This commit restores it for better
> > readability.
> >
> > Signed-off-by: SeongJae Park <[email protected]>
> > ---
> > net/ipv4/inet_timewait_sock.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index c411c87ae865..fbfcd63cc170 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -120,6 +120,7 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> >
> > spin_lock(lock);
> >
> > + /* Step 2: Hash TW into tcp ehash chain. */
>
> This comment adds no value, please do not bring it back.
>
> net-next is closed, now is not the time for cosmetic changes.
>
> Also take a look at Documentation/networking/netdev-FAQ.rst

Thank you for this kind reference. Will drop this in next spin.


Thanks,
SeongJae Park

>

2020-01-31 15:14:46

by SeongJae Park

[permalink] [raw]
Subject: Re: Re: [PATCH 3/3] selftests: net: Add FIN_ACK processing order related latency spike test

On Fri, 31 Jan 2020 06:56:13 -0800 Eric Dumazet <[email protected]> wrote:

> On Fri, Jan 31, 2020 at 4:25 AM <[email protected]> wrote:
> >
> > From: SeongJae Park <[email protected]>
> >
> > This commit adds a test for FIN_ACK process races related reconnection
> > latency spike issues. The issue has described and solved by the
> > previous commit ("tcp: Reduce SYN resend delay if a suspicous ACK is
> > received").
> >
>
> I do not know for other tests, but using a hard coded port (4242) is
> going to be flakky, since the port might be already used.
>
> Please make sure to run tests on a separate namespace.

Agreed, will do so in next spin.


Thanks,
SeongJae Park