2023-06-06 07:02:53

by Duan,Muquan

Subject: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

If the FIN from the passive closer and the ACK for the active closer's
FIN are processed on different CPUs concurrently, a tw hashdance race
may occur. On the loopback interface, the transmit function queues an
skb to the current CPU's softnet input queue by default. Suppose the
active closer runs on CPU 0 and the passive closer runs on CPU 1. If
the ACK for the active closer's FIN is sent with no delay, it will be
processed and the tw hashdance will be done on CPU 0; the passive
closer's FIN will be sent in another segment and processed on CPU 1,
where it may fail to find the tw sock in the ehash table due to the tw
hashdance on CPU 0, and then get a RESET.
If the application reconnects immediately with the same source port, it
will get reset because the tw sock's tw_substate is still TCP_FIN_WAIT2.

The dmesg output used to trace down this issue:

.333516] tcp_send_fin: sk 0000000092105ad2 cookie 9 cpu 3
.333524] rcv_state_process:FIN_WAIT2 sk 0000000092105ad2 cookie 9 cpu 3
.333534] tcp_close: tcp_time_wait: sk 0000000092105ad2 cookie 9 cpu 3
.333538] hashdance: tw 00000000690fdb7a added to ehash cookie 9 cpu 3
.333541] hashdance: sk 0000000092105ad2 removed cookie 9 cpu 3
.333544] __inet_lookup_established: Failed the refcount check:
!refcount_inc_not_zero 00000000690fdb7a ref 0 cookie 9 cpu 0
.333549] hashdance: tw 00000000690fdb7a before add ref 0 cookie 9 cpu 3
.333552] rcv_state: RST for FIN listen 000000003c50afa6 cookie 0 cpu 0
.333574] tcp_send_fin: sk 0000000066757bf8 ref 2 cookie 0 cpu 0
.333611] timewait_state: TCP_TW_RST tw 00000000690fdb7a cookie 9 cpu 0
.333626] tcp_connect: sk 0000000066757bf8 cpu 0 cookie 0
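
The lookup-failure line above comes from debug instrumentation added for
this investigation; a minimal sketch of that instrumentation follows (the
'cookie' field and the exact message format are from our debug build,
not upstream code):

	/* Hypothetical debug print in __inet_lookup_established(), at the
	 * point where the refcount validation fails. 'cookie' is a debug
	 * field from our instrumented build, not an upstream member.
	 */
	if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt))) {
		pr_info("__inet_lookup_established: Failed the refcount check: !refcount_inc_not_zero %p ref %u cookie %d cpu %d\n",
			sk, refcount_read(&sk->sk_refcnt), cookie,
			smp_processor_id());
		goto out;
	}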

Here is the call trace map:

CPU 0                                        CPU 1
--------                                     --------
tcp_close()
  tcp_send_fin()
    loopback_xmit()
      netif_rx()
        tcp_v4_rcv()
          tcp_ack_snd_check()
            loopback_xmit()
              netif_rx()                     tcp_close()
                ...                            tcp_send_fin()
                                                 loopback_xmit()
                                                   netif_rx()
                                                     tcp_v4_rcv()
                                                       ...
tcp_time_wait()
inet_twsk_hashdance() {
  ...
                                             <-__inet_lookup_established()
                                               (bad case (1), find sk, may
                                                fail tw_refcnt check)
  inet_twsk_add_node_tail_rcu(tw, ...)
                                             <-__inet_lookup_established()
                                               (bad case (1), find sk, may
                                                fail tw_refcnt check)

  __sk_nulls_del_node_init_rcu(sk)
                                             <-__inet_lookup_established()
                                               (bad case (2), find tw, may
                                                fail tw_refcnt check)
  refcount_set(&tw->tw_refcnt, 3)
                                             <-__inet_lookup_established()
                                               (good case, find tw,
                                                tw_refcnt is not 0)
  ...
}

This issue occurs with small probability in our application running on
the loopback interface; the client gets a connection refused error when
it reconnects. To reproduce it on kernels 4.14, 5.10 and 6.4-rc1:
modify tcp_ack_snd_check() to disable delayed ACKs entirely; let the
client connect to the server, have the server send a message and then
close the connection; repeat this process forever; and let the client
bind the same source port every time. It can then be reproduced in
about 20 minutes.
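
A minimal sketch of the reproducer's client loop follows (the server
address and port, the client source port, and error handling are
assumptions for illustration; the real reproducer also pins client and
server to different CPUs and disables delayed ACKs as described above):

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <stdio.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int main(void)
	{
		for (;;) {
			struct sockaddr_in src = { .sin_family = AF_INET,
						   .sin_port = htons(40000) };
			struct sockaddr_in dst = { .sin_family = AF_INET,
						   .sin_port = htons(8000) };
			char buf[64];
			int one = 1, fd;

			src.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
			dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

			fd = socket(AF_INET, SOCK_STREAM, 0);
			/* bind the same source port on every iteration */
			setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one,
				   sizeof(one));
			bind(fd, (struct sockaddr *)&src, sizeof(src));
			if (connect(fd, (struct sockaddr *)&dst,
				    sizeof(dst)) < 0)
				perror("connect"); /* ECONNREFUSED when the race hits */
			else
				read(fd, buf, sizeof(buf)); /* server sends, then closes */
			close(fd); /* our close sends the passive closer's FIN */
		}
		return 0;
	}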

Brief description of the scenario:

1. The server runs on CPU 0 and the client runs on CPU 1. The server
closes the connection actively and sends a FIN to the client. The
loopback driver enqueues the FIN segment to the backlog queue of CPU 0
via loopback_xmit()->netif_rx(); one of the conditions for a
non-delayed ACK is met in __tcp_ack_snd_check(), and the ACK is sent
immediately.

2. On the loopback interface, the ACK is received and processed on CPU
0. The 'dance' from the original sock to the tw sock is performed: the
tw sock is inserted into the ehash table, then the original sock is
removed.

3. On CPU 1, the client closes the connection; a FIN segment is sent
and processed on CPU 1. When it looks up the sock in the ehash table
(with no lock), the tw hashdance race may occur: the lookup fails to
find the tw sock and gets a listener sock instead in the following 3
cases:

(1) The original sock is found, but it has been destroyed and its
sk_refcnt has become 0 by the time it is validated.
(2) The tw sock is found, but its tw_refcnt has not yet been set to 3;
it is still 0, so the validation of sk_refcnt fails.
(3) For versions without Eric and Jason's commit
3f4ca5fafc08881d7a57daa20449d171f2887043, the tw sock is added to
the head of the list. It will be missed if the list is traversed
before the tw sock is added; and if the original sock is removed
before it is found, no established sock will be found at all.

The listener sock will then reset the FIN segment, which has the ACK
bit set.
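
For context, the validation step in __inet_lookup_established() where
bad cases (1) and (2) fail looks roughly like this on recent kernels (a
simplified excerpt; restart and error paths are trimmed):

	begin:
		sk_nulls_for_each_rcu(sk, node, &head->chain) {
			if (sk->sk_hash != hash)
				continue;
			if (likely(inet_match(net, sk, acookie, ports, dif, sdif))) {
				/* bad cases (1) and (2) fail here: the refcnt
				 * is already (or still) 0, so the sock is
				 * rejected and the lookup falls through.
				 */
				if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
					goto out;
				/* re-validate after taking the reference; if
				 * the sock was recycled, drop it and restart.
				 */
				if (unlikely(!inet_match(net, sk, acookie,
							 ports, dif, sdif))) {
					sock_gen_put(sk);
					goto begin;
				}
				goto found;
			}
		}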

4. If the client reconnects immediately and is assigned the same source
port as the previous connection, the tw sock, whose tw_substate is
still TCP_FIN_WAIT2, will reset the client's SYN and destroy itself in
inet_twsk_deschedule_put(). The application gets a connection refused
error.

5. If the client reconnects again, it will succeed.

Introduce the following 2 modifications to solve the above 3 bad cases:

For bad case (2):
Set tw_refcnt to 3 before adding the tw sock into the list.

For bad case (1):
In function tcp_v4_rcv(), if __inet_lookup_skb() returns a listener
sock and this segment has FIN bit set, then retry the lookup process
one time. This fix can cover bad case (3) for the versions without
Eric and Jason's fix.

There may be another bad case: the original sock is found and passes
validation, but during further processing of the passive closer's FIN
on CPU 1, the sock is destroyed on CPU 0; the FIN segment is then
dropped and retransmitted. This case does not hurt the application as
much as resetting a reconnection, and it is less likely than the other
bad cases; it has not occurred on our product or in the experimental
environment, so it is not considered in this patch.

Could you please check whether this fix is OK, or suggest a better
approach? Looking forward to your comments!

Signed-off-by: Duan Muquan <[email protected]>
---
 net/ipv4/inet_timewait_sock.c | 15 +++++++--------
 net/ipv4/tcp_ipv4.c           | 13 +++++++++++++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 40052414c7c7..ed1f255c9aa8 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -144,14 +144,6 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 
 	spin_lock(lock);
 
-	inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
-
-	/* Step 3: Remove SK from hash chain */
-	if (__sk_nulls_del_node_init_rcu(sk))
-		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
-
-	spin_unlock(lock);
-
 	/* tw_refcnt is set to 3 because we have :
 	 * - one reference for bhash chain.
 	 * - one reference for ehash chain.
@@ -162,6 +154,13 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 	 * so we are not allowed to use tw anymore.
 	 */
 	refcount_set(&tw->tw_refcnt, 3);
+	inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
+
+	/* Step 3: Remove SK from hash chain */
+	if (__sk_nulls_del_node_init_rcu(sk))
+		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+
+	spin_unlock(lock);
 }
 EXPORT_SYMBOL_GPL(inet_twsk_hashdance);

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 06d2573685ca..3e3cef202f76 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2018,6 +2018,19 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
 			       skb, __tcp_hdrlen(th), th->source,
 			       th->dest, sdif, &refcounted);
+
+	/* If tw "dance" is performed on another CPU, the lookup process may find
+	 * no tw sock for the passive closer's FIN segment, but a listener sock,
+	 * which will reset the FIN segment. If the application reconnects immediately
+	 * with the same source port, it will get reset because the tw sock's
+	 * tw_substate is still TCP_FIN_WAIT2. Retry the lookup to get the tw sock.
+	 */
+	if (unlikely(th->fin && sk && sk->sk_state == TCP_LISTEN)) {
+		sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
+				       skb, __tcp_hdrlen(th), th->source,
+				       th->dest, sdif, &refcounted);
+	}
+
 	if (!sk)
 		goto no_tcp_socket;

--
2.32.0



2023-06-06 07:42:46

by Eric Dumazet

Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

On Tue, Jun 6, 2023 at 8:43 AM Duan Muquan <[email protected]> wrote:
>
> [...]
>
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 40052414c7c7..ed1f255c9aa8 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -144,14 +144,6 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
>
>  	spin_lock(lock);
>
> -	inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> -
> -	/* Step 3: Remove SK from hash chain */
> -	if (__sk_nulls_del_node_init_rcu(sk))
> -		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> -
> -	spin_unlock(lock);
> -
>  	/* tw_refcnt is set to 3 because we have :
>  	 * - one reference for bhash chain.
>  	 * - one reference for ehash chain.
> @@ -162,6 +154,13 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
>  	 * so we are not allowed to use tw anymore.
>  	 */
>  	refcount_set(&tw->tw_refcnt, 3);
> +	inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> +
> +	/* Step 3: Remove SK from hash chain */
> +	if (__sk_nulls_del_node_init_rcu(sk))
> +		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> +
> +	spin_unlock(lock);
>  }
>  EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 06d2573685ca..3e3cef202f76 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2018,6 +2018,19 @@ int tcp_v4_rcv(struct sk_buff *skb)
>  	sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
>  			       skb, __tcp_hdrlen(th), th->source,
>  			       th->dest, sdif, &refcounted);
> +
> +	/* If tw "dance" is performed on another CPU, the lookup process may find
> +	 * no tw sock for the passive closer's FIN segment, but a listener sock,
> +	 * which will reset the FIN segment. If the application reconnects immediately
> +	 * with the same source port, it will get reset because the tw sock's
> +	 * tw_substate is still TCP_FIN_WAIT2. Retry the lookup to get the tw sock.
> +	 */
> +	if (unlikely(th->fin && sk && sk->sk_state == TCP_LISTEN)) {
> +		sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
> +				       skb, __tcp_hdrlen(th), th->source,
> +				       th->dest, sdif, &refcounted);
> +	}
> +

I do not think this fixes anything; there is no barrier between the
first and second lookup.
This might reduce the race a little bit, but at the expense of extra
code in the fast path.

If you want to fix this properly, I think we need to revisit the
handling of non-established sockets, to hold the bucket spinlock over
the whole thing.

This will be slightly more complex than this...
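
For illustration only, the rough shape would be something like the
following (a sketch; the two helpers named here do not exist and are
made up for the example):

	/* Sketch: hold the ehash bucket lock across the lookup and the
	 * hand-off of the segment, so a concurrent hashdance cannot run
	 * in between. lookup_bucket_locked() and tcp_rcv_bucket_locked()
	 * are hypothetical helpers.
	 */
	spinlock_t *lock = inet_ehash_lockp(hashinfo, hash);
	struct sock *sk;

	spin_lock(lock);
	sk = lookup_bucket_locked(hashinfo, hash, skb);
	if (sk)
		ret = tcp_rcv_bucket_locked(sk, skb);
	spin_unlock(lock);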

2023-06-07 14:29:06

by Eric Dumazet

Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

On Wed, Jun 7, 2023 at 1:59 PM Duan,Muquan <[email protected]> wrote:
>
> Hi, Eric,
>
> Thanks for your comments!
>
> About the second lookup, I am sorry that I did not give enough explanation about it. Here are some details:
>
> 1. The second lookup can find the tw sock and avoid the connection refused error in userland applications:
>
> If the original sock is found, but it has been destroyed by the time its refcnt is validated, sk_refcnt has become 0 after being decreased by tcp_time_wait()->tcp_done()->inet_csk_destroy_sock()->sock_put(). The validation of the refcnt fails and the lookup process gets a listener sock.
>
> When this case occurs, the hashdance has definitely finished, because tcp_done() is executed after inet_twsk_hashdance(). So if we look up the ehash table again, the tw sock will be found.
>
> With this fix, logically we can solve the connection reset issue completely when no established sock is found due to the hashdance race. In my reproducing environment, the connection refused error occurs about every 6 hours with only the fix for bad case (2). With both fixes, I have tested many times; the longest test ran for 10 days and the error did not occur again.
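>
> For reference, a simplified view of the tail of tcp_time_wait() that this ordering argument relies on (an excerpt with details trimmed, not the complete function):
>
>	void tcp_time_wait(struct sock *sk, int state, int timeo)
>	{
>		...
>		/* tw is linked into the ehash and sk unlinked here */
>		inet_twsk_hashdance(tw, sk, &tcp_hashinfo);
>		...
>		tcp_update_metrics(sk);
>		tcp_done(sk);	/* runs only after the hashdance finished */
>	}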
>
>
>
> 2. About the performance impact:
>
> A similar scenario is that __inet_lookup_established() does the inet_match() check a second time; if it fails, it looks up the list again. That is the extra effort taken to reduce the race impact without using a reader lock. inet_match() failure occurs with about the same probability as refcnt validation failure in my test environment.
>
> The second lookup is only done when a FIN segment gets a listener sock.
>
> About the performance impact:
>
> 1) Most of the time this condition is not met, and the added code introduces at most 3 comparisons for each segment.
>
> The second inet_match() in __inet_lookup_established() does at least 3 comparisons for each segment.
>
> 2) When this condition is met, the probability is very small. The impact is similar to a second try due to inet_match() failure. Since the tw sock can definitely be found in the second try, I think this cost is worth it to avoid the connection refused error in userland applications.
>
>
>
> My understanding is that the current philosophy is to avoid the reader lock by tolerating minor defects that occur with small probability. For example, if the FIN from the passive closer is dropped because the found sock has been destroyed, a retransmission can be tolerated; it only makes the connection termination slower. But I think the bottom line is that it must not affect the functionality of userland applications. If an application fails to connect due to the hashdance race, that can't be tolerated. In fact, people from the product department push hard on the connection refused error.
>
>
> About bad case (2):
>
> The tw sock is found, but its tw_refcnt has not been set to 3; it is still 0, so the validation of sk_refcnt will fail.
>
> I do not know the reason for setting tw_refcnt after adding the tw sock into the list; could anyone help point out the reason? It adds an extra race, because the newly added tw sock may be found and checked on another CPU concurrently, before tw_refcnt is set to 3.
>
> By setting tw_refcnt to 3 before adding it into the list, this case is solved at almost no cost. In my reproducing environment, it occurs more frequently than bad case (1): it appears about every 20 minutes, while bad case (1) appears about every 6 hours.
>
>
>
> About the bucket spinlock: the original established sock and the tw sock are stored in the ehash table, and I am concerned about performance when there are lots of short TCP connections; the reader lock may affect the performance of connection creation and termination. Could you share some details of your idea? Thanks in advance.
>
>

Again, you can write a lot of stuff; the fact is that your patch does
not solve the issue.

You could add 10 lookups and still miss some cases, because they are
all RCU lookups with no barriers.

In order to solve the issue of packets for the same 4-tuple being
processed by many CPUs, the only way to solve the races is to add
mutual exclusion.

Note that we already have to lock the bucket spinlock every time we
transition a request socket to a full socket, a socket to timewait, or
do any insert/delete.

We need to expand the scope of this lock, and clean up things that we
added in the past, because we tried too hard to 'detect races'.


2023-06-07 15:46:41

by Duan,Muquan

Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

Hi, Eric,

Thanks for your reply!
One module of our CDN product suffered from the connection refused error caused by the hashdance race. I am trying to solve this issue, which hurts userland applications, not all the possible cases. Besides this reset case, the worst case I can think of is the loss of the passive closer's FIN, which causes a retransmission; that kind of case does not cause an error in applications, and its probability is very small, so I think we do not need to introduce a reader's lock for such cases.

I can't agree more that we tried too hard to 'detect races', but I am also concerned about performance if a reader's lock is introduced. Do we have any test results on performance with the reader's lock? I will also do some tests on this; if the impact can be tolerated, we'd better introduce the lock.

Anyway, I think the tw sock's tw_refcnt should be set before the tw is added into the list; this modification also puts the setting of the refcnt under the spin_lock's protection. What is your opinion on this modification?

145         spin_lock(lock);
146
147         /* tw_refcnt is set to 3 because we have :
148          * - one reference for bhash chain.
149          * - one reference for ehash chain.
150          * - one reference for timer.
151          * We can use atomic_set() because prior spin_lock()/spin_unlock()
152          * committed into memory all tw fields.
153          * Also note that after this point, we lost our implicit reference
154          * so we are not allowed to use tw anymore.
155          */
156         refcount_set(&tw->tw_refcnt, 3); <-----------------------------------------------
157         inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
158
159         /* Step 3: Remove SK from hash chain */
160         if (__sk_nulls_del_node_init_rcu(sk))
161                 sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
162
163         spin_unlock(lock);



BTW, I have had trouble sending emails and some of my messages may not have been delivered; thanks a lot for Jason's and Simon's comments!

Best Regards!
Duan




> On Jun 7, 2023, at 9:32 PM, Eric Dumazet <[email protected]> wrote:
>
> [...]

2023-06-07 16:22:01

by Eric Dumazet

Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

On Wed, Jun 7, 2023 at 5:18 PM Duan,Muquan <[email protected]> wrote:
>
> Hi, Eric,
>
> Thanks for your reply!
> [...]
>
> Anyway, I think the tw sock's tw_refcnt should be set before the tw is added into the list; this modification also puts the setting of the refcnt under the spin_lock's protection. What is your opinion on this modification?

My opinion is that we should set the refcnt to non-zero only when the
tw object is ready to be seen by other CPUs.

Moving this earlier can trigger subtle bugs with RCU lookups, which
might see a non-zero refcnt on an object still containing garbage.

You missed the important comment :

> 153          * Also note that after this point, we lost our implicit reference
> 154          * so we are not allowed to use tw anymore.

We are not _allowed_ to access tw anymore after setting its refcnt to 3.

Again, this patch does not fix the fundamental issue; it just moves
things around and hopes for the best.
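
To make the failure mode concrete, the interleaving I worry about looks
like this (a sketch only; it relies on the sockets coming from a
SLAB_TYPESAFE_BY_RCU cache, where freed memory can be reused while an
RCU reader still holds a pointer to it):

	/* Sketch of the subtle bug: an RCU lookup may hold a stale pointer
	 * to memory that is being reused for a brand-new tw object. The
	 * zero refcount is the only thing that rejects such a pointer.
	 *
	 *   CPU A (initializing a reused tw)   CPU B (lookup, stale pointer)
	 *   --------------------------------   -----------------------------
	 *   tw = inet_twsk_alloc(...);
	 *   refcount_set(&tw->tw_refcnt, 3);   refcount_inc_not_zero(): ok!
	 *   tw->tw_xxx = ...;  (fields still   reads fields that are still
	 *                       being written) garbage for this lookup
	 *
	 * Keeping the refcount at zero until the object is fully committed
	 * makes CPU B's refcount_inc_not_zero() fail instead.
	 */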

>
> 145         spin_lock(lock);
> 146
> 147         /* tw_refcnt is set to 3 because we have :
> 148          * - one reference for bhash chain.
> 149          * - one reference for ehash chain.
> 150          * - one reference for timer.
> 151          * We can use atomic_set() because prior spin_lock()/spin_unlock()
> 152          * committed into memory all tw fields.
> 153          * Also note that after this point, we lost our implicit reference
> 154          * so we are not allowed to use tw anymore.
> 155          */
> 156         refcount_set(&tw->tw_refcnt, 3); <-----------------------------------------------
> 157         inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> 158
> 159         /* Step 3: Remove SK from hash chain */
> 160         if (__sk_nulls_del_node_init_rcu(sk))
> 161                 sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> 162
> 163         spin_unlock(lock);
>
>
>
> BTW, I met trouble on sending emails, and some emails may not delivered, thanks a lot for Jason and Simon’s comments!
>
> Best Regards!
> Duan
>
>
>
>
> > 2023年6月7日 下午9:32,Eric Dumazet <[email protected]> 写道:
> >
> > On Wed, Jun 7, 2023 at 1:59 PM Duan,Muquan <[email protected]> wrote:
> >>
> >> Hi, Eric,
> >>
> >> Thanks for your comments!
> >>
> >> About the second lookup, I am sorry that I did not give enough explanations about it. Here are some details:
> >>
> >> 1. The second lookup can find the tw sock and avoid the connection refuse error on userland applications:
> >>
> >> If the original sock is found, but when validating its refcnt, it has been destroyed and sk_refcnt has become 0 after decreased by tcp_time_wait()->tcp_done()->inet_csk_destory_sock()->sock_put().The validation for refcnt fails and the lookup process gets a listener sock.
> >>
> >> When this case occurs, the hashdance has definitely finished,because tcp_done() is executed after inet_twsk_hashdance(). Then if look up the ehash table again, hashdance has already finished, tw sock will be found.
> >>
> >> With this fix, logically we can solve the connection reset issue completely when no established sock is found due to hashdance race.In my reproducing environment, the connection refuse error will occur about every 6 hours with only the fix of bad case (2). But with both of the 2 fixes, I tested it many times, the longest test continues for 10 days, it does not occur again,
> >>
> >>
> >>
> >> 2. About the performance impact:
> >>
> >> A similar scenario is that __inet_lookup_established() will do inet_match() check for the second time, if fails it will look up the list again. It is the extra effort to reduce the race impact without using reader lock. inet_match() failure occurs with about the same probability with refcnt validation failure in my test environment.
> >>
> >> The second lookup will only be done in the condition that FIN segment gets a listener sock.
> >>
> >> About the performance impact:
> >>
> >> 1) Most of the time, this condition will not met, the added codes introduces at most 3 comparisons for each segment.
> >>
> >> The second inet_match() in __inet_lookup_established() does least 3 comparisons for each segmet.
> >>
> >>
> >> 2) When this condition is met, the probability is very small. The impact is similar to the second try due to inet_match() failure. Since tw sock can definitely be found in the second try, I think this cost is worthy to avoid connection reused error on userland applications.
> >>
> >>
> >>
> >> My understanding is, current philosophy is avoiding the reader lock by tolerating the minor defect which occurs in a small probability.For example, if the FIN from passive closer is dropped due to the found sock is destroyed, a retransmission can be tolerated, it only makes the connection termination slower. But I think the bottom line is that it does not affect the userland applications’ functionality. If application fails to connect due to the hashdance race, it can’t be tolerated. In fact, guys from product department push hard on the connection refuse error.
> >>
> >>
> >> About bad case (2):
> >>
> >> tw sock is found, but its tw_refcnt has not been set to 3, it is still 0, validating for sk_refcnt will fail.
> >>
> >> I do not know the reason why setting tw_refcnt after adding it into list, could anyone help point out the reason? It adds extra race because the new added tw sock may be found and checked in other CPU concurrently before ƒsetting tw_refcnt to 3.
> >>
> >> By setting tw_refcnt to 3 before adding it into list, this case will be solved, and almost no cost. In my reproducing environment, it occurs more frequently than bad case (1), it appears about every 20 minutes, bad case (1) appears about every 6 hours.
> >>
> >>
> >>
> >> About the bucket spinlock, the original established sock and tw sock are stored in the ehash table, I concern about the performance when there are lots of short TCP connections, the reader lock may affect the performance of connection creation and termination. Could you share some details of your idea? Thanks in advance.
> >>
> >>
> >
> > Again, you can write a lot of stuff, the fact is that your patch does
> > not solve the issue.
> >
> > You could add 10 lookups, and still miss some cases, because they are
> > all RCU lookups with no barriers.
> >
> > In order to solve the issue of packets for the same 4-tuple being
> > processed by many cpus, the only way to solve races is to add mutual
> > exclusion.
> >
> > Note that we already have to lock the bucket spinlock every time we
> > transition a request socket to socket, a socket to timewait, or any
> > insert/delete.
> >
> > We need to expand the scope of this lock, and cleanup things that we
> > added in the past, because we tried too hard to 'detect races'
> >
> >>
> >>
> >>
> >> Best Regards!
> >>
> >> Duan
> >>
> >>
> >> 2023年6月6日 下午3:07,Eric Dumazet <[email protected]> 写道:
> >>
> >> On Tue, Jun 6, 2023 at 8:43 AM Duan Muquan <[email protected]> wrote:
> >>
> >>
> >> If the FIN from passive closer and the ACK for active closer's FIN are
> >> processed on different CPUs concurrently, tw hashdance race may occur.
> >> On loopback interface, transmit function queues a skb to current CPU's
> >> softnet's input queue by default. Suppose active closer runs on CPU 0,
> >> and passive closer runs on CPU 1. If the ACK for the active closer's
> >> FIN is sent with no delay, it will be processed and tw hashdance will
> >> be done on CPU 0; The passive closer's FIN will be sent in another
> >> segment and processed on CPU 1, it may fail to find tw sock in the
> >> ehash table due to tw hashdance on CPU 0, then get a RESET.
> >> If application reconnects immediately with the same source port, it
> >> will get reset because tw sock's tw_substate is still TCP_FIN_WAIT2.
> >>
> >> The dmesg to trace down this issue:
> >>
> >> .333516] tcp_send_fin: sk 0000000092105ad2 cookie 9 cpu 3
> >> .333524] rcv_state_process:FIN_WAIT2 sk 0000000092105ad2 cookie 9 cpu 3
> >> .333534] tcp_close: tcp_time_wait: sk 0000000092105ad2 cookie 9 cpu 3
> >> .333538] hashdance: tw 00000000690fdb7a added to ehash cookie 9 cpu 3
> >> .333541] hashdance: sk 0000000092105ad2 removed cookie 9 cpu 3
> >> .333544] __inet_lookup_established: Failed the refcount check:
> >> !refcount_inc_not_zero 00000000690fdb7a ref 0 cookie 9 cpu 0
> >> .333549] hashdance: tw 00000000690fdb7a before add ref 0 cookie 9 cpu 3
> >> .333552] rcv_state: RST for FIN listen 000000003c50afa6 cookie 0 cpu 0
> >> .333574] tcp_send_fin: sk 0000000066757bf8 ref 2 cookie 0 cpu 0
> >> .333611] timewait_state: TCP_TW_RST tw 00000000690fdb7a cookie 9 cpu 0
> >> .333626] tcp_connect: sk 0000000066757bf8 cpu 0 cookie 0
> >>
> >> Here is the call trace map:
> >>
> >> CPU 0 CPU 1
> >>
> >> -------- --------
> >> tcp_close()
> >> tcp_send_fin()
> >> loopback_xmit()
> >> netif_rx()
> >> tcp_v4_rcv()
> >> tcp_ack_snd_check()
> >> loopback_xmit
> >> netif_rx() tcp_close()
> >> ... tcp_send_fin()
> >> loopback_xmit()
> >> netif_rx()
> >> tcp_v4_rcv()
> >> ...
> >> tcp_time_wait()
> >> inet_twsk_hashdance() {
> >> ...
> >> <-__inet_lookup_established()
> >> (bad case (1), find sk, may fail tw_refcnt check)
> >> inet_twsk_add_node_tail_rcu(tw, ...)
> >> <-__inet_lookup_established()
> >> (bad case (1), find sk, may fail tw_refcnt check)
> >>
> >> __sk_nulls_del_node_init_rcu(sk)
> >> <-__inet_lookup_established()
> >> (bad case (2), find tw, may fail tw_refcnt check)
> >> refcount_set(&tw->tw_refcnt, 3)
> >> <-__inet_lookup_established()
> >> (good case, find tw, tw_refcnt is not 0)
> >> ...
> >> }
> >>
> >> This issue occurs with a small probability on our application working
> >> on loopback interface, client gets a connection refused error when it
> >> reconnects. In reproducing environments on kernel 4.14,5.10 and
> >> 6.4-rc1, modify tcp_ack_snd_check() to disable delay ack all the
> >> time; Let client connect server and server sends a message to client
> >> then close the connection; Repeat this process forever; Let the client
> >> bind the same source port every time, it can be reproduced in about 20
> >> minutes.
> >>
> >> Brief of the scenario:
> >>
> >> 1. Server runs on CPU 0 and Client runs on CPU 1. Server closes
> >> connection actively and sends a FIN to client. The lookback's driver
> >> enqueues the FIN segment to backlog queue of CPU 0 via
> >> loopback_xmit()->netif_rx(), one of the conditions for non-delay ack
> >> meets in __tcp_ack_snd_check(), and the ACK is sent immediately.
> >>
> >> 2. On loopback interface, the ACK is received and processed on CPU 0,
> >> the 'dance' from original sock to tw sock will perfrom, tw sock will
> >> be inserted to ehash table, then the original sock will be removed.
> >>
> >> 3. On CPU 1, client closes the connection, a FIN segment is sent and
> >> processed on CPU 1. When it is looking up sock in ehash table (with no
> >> lock), tw hashdance race may occur, it fails to find the tw sock and
> >> get a listener sock in the flowing 3 cases:
> >>
> >> (1) The original sock is found, but it has been destroyed and sk_refcnt
> >> has become 0 when validating it.
> >> (2) The tw sock is found, but its tw_refcnt has not been set to 3; it
> >> is still 0, so the validation of sk_refcnt will fail.
> >> (3) For versions without Eric and Jason's commit (3f4ca5fafc08881d7a5
> >> 7daa20449d171f2887043), the tw sock is added to the head of the list.
> >> It will be missed if the list is traversed before the tw sock is
> >> added; and if the original sock is removed before it is found, no
> >> established sock will be found.
> >>
> >> The listener sock will reset the FIN segment, which has the ack bit set.
> >>
> >> 4. If the client reconnects immediately and is assigned the same
> >> source port as previous connection, the tw sock with tw_substate
> >> TCP_FIN_WAIT2 will reset client's SYN and destroy itself in
> >> inet_twsk_deschedule_put(). Application gets a connection refused
> >> error.
> >>
> >> 5. If client reconnects again, it will succeed.
> >>
> >> Introduce the following 2 modifications to solve the above 3 bad cases:
> >>
> >> For bad case (2):
> >> Set tw_refcnt to 3 before adding it into the list.
> >>
> >> For bad case (1):
> >> In function tcp_v4_rcv(), if __inet_lookup_skb() returns a listener
> >> sock and this segment has the FIN bit set, then retry the lookup
> >> once. This fix can also cover bad case (3) for the versions without
> >> Eric and Jason's fix.
> >>
> >> There may be another bad case: the original sock is found and passes
> >> validation, but during further processing of the passive closer's FIN
> >> on CPU 1, the sock has been destroyed on CPU 0; then the FIN segment
> >> will be dropped and retransmitted. This case does not hurt the
> >> application as much as resetting the reconnection, and it is less
> >> likely than the other bad cases; it has not occurred on our product or
> >> in the experimental environment, so it is not considered in this patch.
> >>
> >> Could you please check whether this fix is OK, or offer any suggestions?
> >> Looking forward to your comments!
> >>
> >> Signed-off-by: Duan Muquan <[email protected]>
> >> ---
> >> net/ipv4/inet_timewait_sock.c | 15 +++++++--------
> >> net/ipv4/tcp_ipv4.c | 13 +++++++++++++
> >> 2 files changed, 20 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> >> index 40052414c7c7..ed1f255c9aa8 100644
> >> --- a/net/ipv4/inet_timewait_sock.c
> >> +++ b/net/ipv4/inet_timewait_sock.c
> >> @@ -144,14 +144,6 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> >>
> >> spin_lock(lock);
> >>
> >> - inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> >> -
> >> - /* Step 3: Remove SK from hash chain */
> >> - if (__sk_nulls_del_node_init_rcu(sk))
> >> - sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> >> -
> >> - spin_unlock(lock);
> >> -
> >> /* tw_refcnt is set to 3 because we have :
> >> * - one reference for bhash chain.
> >> * - one reference for ehash chain.
> >> @@ -162,6 +154,13 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> >> * so we are not allowed to use tw anymore.
> >> */
> >> refcount_set(&tw->tw_refcnt, 3);
> >> + inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> >> +
> >> + /* Step 3: Remove SK from hash chain */
> >> + if (__sk_nulls_del_node_init_rcu(sk))
> >> + sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> >> +
> >> + spin_unlock(lock);
> >> }
> >> EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
> >>
> >> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> >> index 06d2573685ca..3e3cef202f76 100644
> >> --- a/net/ipv4/tcp_ipv4.c
> >> +++ b/net/ipv4/tcp_ipv4.c
> >> @@ -2018,6 +2018,19 @@ int tcp_v4_rcv(struct sk_buff *skb)
> >> sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
> >> skb, __tcp_hdrlen(th), th->source,
> >> th->dest, sdif, &refcounted);
> >> +
> >> + /* If the tw "dance" is performed on another CPU, the lookup process may find
> >> + * no tw sock for the passive closer's FIN segment, but a listener sock,
> >> + * which will reset the FIN segment. If the application reconnects immediately
> >> + * with the same source port, it will get reset because the tw sock's
> >> + * tw_substate is still TCP_FIN_WAIT2. Retry the lookup to get the tw sock.
> >> + */
> >> + if (unlikely(th->fin && sk && sk->sk_state == TCP_LISTEN)) {
> >> + sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
> >> + skb, __tcp_hdrlen(th), th->source,
> >> + th->dest, sdif, &refcounted);
> >> + }
> >> +
> >>
> >>
> >> I do not think this fixes anything; there is no barrier between the
> >> first and second lookups.
> >> This might reduce the race a little bit, but at the expense of extra
> >> code in the fast path.
> >>
> >> If you want to fix this properly, I think we need to revisit the
> >> handling of non-established sockets,
> >> to hold the bucket spinlock over the whole thing.
> >>
> >> This will be slightly more complex than this...
> >>
> >>
>

2023-06-08 04:21:34

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

On Thu, Jun 8, 2023 at 5:59 AM Duan,Muquan <[email protected]> wrote:
>
> Hi, Eric,
>
> Thanks a lot for your explanation!
>
> Even if we add a reader lock, if the refcnt is set outside spin_lock()/spin_unlock(), during the interval between spin_unlock() and refcount_set(), other cpus will see the tw sock with refcnt 0, and validation for refcnt will fail.
>
> A suggestion: before the tw sock is added into the ehash table, it is already used by the tw timer and the bhash chain, so we can first set the refcnt to 2 before adding the tw sock to the ehash table, or add the refcnt one by one for timer, bhash and ehash. This can avoid the refcnt validation failure on other cpus.
>
> This can reduce the frequency of the connection reset issue from every 20 min to every 180 min for our product. We may wait quite a long time before the best solution is ready; if this obvious defect is fixed now, userland applications can benefit from it.
>
> Looking forward to your opinions!

Again, my opinion is that we need a proper fix, not work arounds.

I will work on this a bit later.

In the meantime you can apply locally your patch if you feel this is
what you want.
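
The ordering problem Duan describes is easy to see in isolation: the
lookup-side validation is the refcount_inc_not_zero() pattern, so any
window in which the tw sock is visible in the chain with tw_refcnt == 0
makes the lookup skip it. A minimal userspace sketch (C11 atomics; the
toy_* names are illustrative stand-ins, not the kernel's types):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_sock { atomic_int refcnt; };

/* Same contract as the kernel's refcount_inc_not_zero(): take a
 * reference only if the count is not already zero.
 */
static bool toy_refcount_inc_not_zero(atomic_int *r)
{
        int old = atomic_load(r);

        while (old != 0)
                if (atomic_compare_exchange_weak(r, &old, old + 1))
                        return true;
        return false;
}

int main(void)
{
        /* the tw sock as published before refcount_set(&tw->tw_refcnt, 3) */
        struct toy_sock tw = { .refcnt = 0 };

        if (!toy_refcount_inc_not_zero(&tw.refcnt))
                printf("tw_refcnt == 0: validation fails, lookup misses the tw sock\n");

        atomic_store(&tw.refcnt, 3);    /* after refcount_set() runs */
        if (toy_refcount_inc_not_zero(&tw.refcnt))
                printf("tw_refcnt == 3: validation succeeds\n");
        return 0;
}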

2023-06-08 06:14:20

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

From: Eric Dumazet <[email protected]>
Date: Wed, 7 Jun 2023 15:32:57 +0200
> On Wed, Jun 7, 2023 at 1:59 PM Duan,Muquan <[email protected]> wrote:
> >
> > Hi, Eric,
> >
> > Thanks for your comments!
> >
> > About the second lookup, I am sorry that I did not give enough explanations about it. Here are some details:
> >
> > 1. The second lookup can find the tw sock and avoid the connection refuse error on userland applications:
> >
> > If the original sock is found, but when validating its refcnt, it has been destroyed and sk_refcnt has become 0 after being decreased by tcp_time_wait()->tcp_done()->inet_csk_destroy_sock()->sock_put(). The validation for refcnt fails and the lookup process gets a listener sock.
> >
> > When this case occurs, the hashdance has definitely finished, because tcp_done() is executed after inet_twsk_hashdance(). Then if the ehash table is looked up again, the tw sock will be found.
> >
> > With this fix, logically we can solve the connection reset issue completely when no established sock is found due to the hashdance race. In my reproducing environment, the connection refused error occurs about every 6 hours with only the fix for bad case (2). But with both fixes, I tested it many times; the longest test ran for 10 days and the error did not occur again.
> >
> >
> >
> > 2. About the performance impact:
> >
> > A similar scenario is that __inet_lookup_established() will do the inet_match() check a second time; if it fails, it will look up the list again. It is the extra effort to reduce the race impact without using a reader lock. inet_match() failure occurs with about the same probability as refcnt validation failure in my test environment.
> >
> > The second lookup will only be done in the condition that FIN segment gets a listener sock.
> >
> > About the performance impact:
> >
> > 1) Most of the time, this condition will not be met; the added code introduces at most 3 comparisons for each segment.
> >
> > The second inet_match() in __inet_lookup_established() does at least 3 comparisons for each segment.
> >
> >
> > 2) When this condition is met (which happens with very small probability), the impact is similar to the second try due to inet_match() failure. Since the tw sock can definitely be found in the second try, I think this cost is worthwhile to avoid the connection refused error in userland applications.
> >
> >
> >
> > My understanding is that the current philosophy is to avoid the reader lock by tolerating a minor defect which occurs with small probability. For example, if the FIN from the passive closer is dropped because the found sock was destroyed, a retransmission can be tolerated; it only makes the connection termination slower. But I think the bottom line is that it must not affect the userland applications' functionality. If an application fails to connect due to the hashdance race, that can't be tolerated. In fact, guys from the product department push hard on the connection refused error.
> >
> >
> > About bad case (2):
> >
> > tw sock is found, but its tw_refcnt has not been set to 3, it is still 0, validating for sk_refcnt will fail.
> >
> > I do not know the reason for setting tw_refcnt after adding it into the list; could anyone help point out the reason? It adds an extra race, because the newly added tw sock may be found and checked on another CPU concurrently before tw_refcnt is set to 3.
> >
> > By setting tw_refcnt to 3 before adding it into the list, this case will be solved, at almost no cost. In my reproducing environment, it occurs more frequently than bad case (1): it appears about every 20 minutes, while bad case (1) appears about every 6 hours.
> >
> >
> >
> > About the bucket spinlock: the original established sock and the tw sock are stored in the ehash table, and I am concerned about the performance when there are lots of short TCP connections; the reader lock may affect the performance of connection creation and termination. Could you share some details of your idea? Thanks in advance.
> >
> >
>
> Again, you can write a lot of stuff, the fact is that your patch does
> not solve the issue.
>
> You could add 10 lookups, and still miss some cases, because they are
> all RCU lookups with no barriers.
>
> In order to solve the issue of packets for the same 4-tuple being
> processed by many cpus, the only way to solve races is to add mutual
> exclusion.
>
> Note that we already have to lock the bucket spinlock every time we
> transition a request socket to socket, a socket to timewait, or any
> insert/delete.
>
> We need to expand the scope of this lock, and cleanup things that we
> added in the past, because we tried too hard to 'detect races'

How about this ? This is still a workaround though, retry sounds
better than expanding the scope of the lock given the race is rare.

---8<---
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index e7391bf310a7..b034be2f37c8 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -484,14 +484,24 @@ struct sock *__inet_lookup_established(struct net *net,
unsigned int hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
unsigned int slot = hash & hashinfo->ehash_mask;
struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
+ bool should_retry = true;

begin:
sk_nulls_for_each_rcu(sk, node, &head->chain) {
if (sk->sk_hash != hash)
continue;
if (likely(inet_match(net, sk, acookie, ports, dif, sdif))) {
- if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
+ if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt))) {
+ if (sk->sk_state == TCP_TIME_WAIT)
+ goto begin;
+
+ if (sk->sk_state == TCP_CLOSE && should_retry) {
+ should_retry = false;
+ goto begin;
+ }
+
goto out;
+ }
if (unlikely(!inet_match(net, sk, acookie,
ports, dif, sdif))) {
sock_gen_put(sk);
---8<---
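
To make the retry policy above concrete: a tw sock seen with refcnt 0 is
mid-hashdance and will become valid, so it is retried without limit,
while a TCP_CLOSE sock is being destroyed and will not come back, so it
gets a single extra pass. A single-threaded userspace model (toy names;
other_cpu_progress() stands in for the concurrent hashdance finishing
its refcount_set() on another CPU):

#include <stdbool.h>
#include <stdio.h>

enum toy_state { TOY_TIME_WAIT, TOY_CLOSE };

struct toy_entry {
        enum toy_state state;
        int refcnt;
};

/* Simulates CPU 0 completing refcount_set(&tw->tw_refcnt, 3). */
static void other_cpu_progress(struct toy_entry *e)
{
        if (e->state == TOY_TIME_WAIT && e->refcnt == 0)
                e->refcnt = 3;
}

static struct toy_entry *toy_lookup(struct toy_entry *e)
{
        bool should_retry = true;

begin:
        if (e->refcnt == 0) {                   /* refcount_inc_not_zero() failed */
                if (e->state == TOY_TIME_WAIT) {
                        other_cpu_progress(e);  /* hashdance is in flight */
                        goto begin;
                }
                if (e->state == TOY_CLOSE && should_retry) {
                        should_retry = false;   /* one extra pass only */
                        goto begin;
                }
                return NULL;
        }
        e->refcnt++;                            /* reference taken */
        return e;
}

int main(void)
{
        struct toy_entry tw = { .state = TOY_TIME_WAIT, .refcnt = 0 };

        if (toy_lookup(&tw))
                printf("retry found the tw sock once tw_refcnt was set\n");
        return 0;
}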


>
> >
> >
> >
> > Best Regards!
> >
> > Duan
> >
> >
> > On Jun 6, 2023, at 3:07 PM, Eric Dumazet <[email protected]> wrote:
> >
> > On Tue, Jun 6, 2023 at 8:43 AM Duan Muquan <[email protected]> wrote:
> >
> >
> > If the FIN from passive closer and the ACK for active closer's FIN are
> > processed on different CPUs concurrently, tw hashdance race may occur.
> > On loopback interface, transmit function queues a skb to current CPU's
> > softnet's input queue by default. Suppose active closer runs on CPU 0,
> > and passive closer runs on CPU 1. If the ACK for the active closer's
> > FIN is sent with no delay, it will be processed and tw hashdance will
> > be done on CPU 0; The passive closer's FIN will be sent in another
> > segment and processed on CPU 1, it may fail to find tw sock in the
> > ehash table due to tw hashdance on CPU 0, then get a RESET.
> > If application reconnects immediately with the same source port, it
> > will get reset because tw sock's tw_substate is still TCP_FIN_WAIT2.
> >
> > The dmesg to trace down this issue:
> >
> > .333516] tcp_send_fin: sk 0000000092105ad2 cookie 9 cpu 3
> > .333524] rcv_state_process:FIN_WAIT2 sk 0000000092105ad2 cookie 9 cpu 3
> > .333534] tcp_close: tcp_time_wait: sk 0000000092105ad2 cookie 9 cpu 3
> > .333538] hashdance: tw 00000000690fdb7a added to ehash cookie 9 cpu 3
> > .333541] hashdance: sk 0000000092105ad2 removed cookie 9 cpu 3
> > .333544] __inet_lookup_established: Failed the refcount check:
> > !refcount_inc_not_zero 00000000690fdb7a ref 0 cookie 9 cpu 0
> > .333549] hashdance: tw 00000000690fdb7a before add ref 0 cookie 9 cpu 3
> > .333552] rcv_state: RST for FIN listen 000000003c50afa6 cookie 0 cpu 0
> > .333574] tcp_send_fin: sk 0000000066757bf8 ref 2 cookie 0 cpu 0
> > .333611] timewait_state: TCP_TW_RST tw 00000000690fdb7a cookie 9 cpu 0
> > .333626] tcp_connect: sk 0000000066757bf8 cpu 0 cookie 0
> >
> > Here is the call trace map:
> >
> > CPU 0 CPU 1
> >
> > -------- --------
> > tcp_close()
> > tcp_send_fin()
> > loopback_xmit()
> > netif_rx()
> > tcp_v4_rcv()
> > tcp_ack_snd_check()
> > loopback_xmit
> > netif_rx() tcp_close()
> > ... tcp_send_fin()
> > loopback_xmit()
> > netif_rx()
> > tcp_v4_rcv()
> > ...
> > tcp_time_wait()
> > inet_twsk_hashdance() {
> > ...
> > <-__inet_lookup_established()
> > (bad case (1), find sk, may fail tw_refcnt check)
> > inet_twsk_add_node_tail_rcu(tw, ...)
> > <-__inet_lookup_established()
> > (bad case (1), find sk, may fail tw_refcnt check)
> >
> > __sk_nulls_del_node_init_rcu(sk)
> > <-__inet_lookup_established()
> > (bad case (2), find tw, may fail tw_refcnt check)
> > refcount_set(&tw->tw_refcnt, 3)
> > <-__inet_lookup_established()
> > (good case, find tw, tw_refcnt is not 0)
> > ...
> > }
> >
> > This issue occurs with a small probability in our application working
> > over the loopback interface; the client gets a connection refused error
> > when it reconnects. In reproducing environments on kernels 4.14, 5.10
> > and 6.4-rc1: modify tcp_ack_snd_check() to disable delayed ack all the
> > time; let the client connect to the server, have the server send a
> > message to the client and then close the connection; repeat this
> > process forever; and let the client bind the same source port every
> > time. The issue can be reproduced in about 20 minutes.
> >
> > Brief description of the scenario:
> >
> > 1. Server runs on CPU 0 and client runs on CPU 1. The server closes the
> > connection actively and sends a FIN to the client. The loopback driver
> > enqueues the FIN segment to the backlog queue of CPU 0 via
> > loopback_xmit()->netif_rx(), one of the conditions for a non-delayed ack
> > is met in __tcp_ack_snd_check(), and the ACK is sent immediately.
> >
> > 2. On the loopback interface, the ACK is received and processed on CPU 0;
> > the 'dance' from the original sock to the tw sock is performed, the tw
> > sock is inserted into the ehash table, then the original sock is removed.
> >
> > 3. On CPU 1, the client closes the connection; a FIN segment is sent and
> > processed on CPU 1. When it looks up the sock in the ehash table (with no
> > lock), the tw hashdance race may occur; it may fail to find the tw sock
> > and get a listener sock in the following 3 cases:
> >
> > (1) The original sock is found, but it has been destroyed and sk_refcnt
> > has become 0 when validating it.
> > (2) The tw sock is found, but its tw_refcnt has not been set to 3; it
> > is still 0, so the validation of sk_refcnt will fail.
> > (3) For versions without Eric and Jason's commit (3f4ca5fafc08881d7a5
> > 7daa20449d171f2887043), the tw sock is added to the head of the list.
> > It will be missed if the list is traversed before the tw sock is
> > added; and if the original sock is removed before it is found, no
> > established sock will be found.
> >
> > The listener sock will reset the FIN segment, which has the ack bit set.
> >
> > 4. If the client reconnects immediately and is assigned the same
> > source port as previous connection, the tw sock with tw_substate
> > TCP_FIN_WAIT2 will reset client's SYN and destroy itself in
> > inet_twsk_deschedule_put(). Application gets a connection refused
> > error.
> >
> > 5. If client reconnects again, it will succeed.
> >
> > Introduce the following 2 modifications to solve the above 3 bad cases:
> >
> > For bad case (2):
> > Set tw_refcnt to 3 before adding it into the list.
> >
> > For bad case (1):
> > In function tcp_v4_rcv(), if __inet_lookup_skb() returns a listener
> > sock and this segment has the FIN bit set, then retry the lookup
> > once. This fix can also cover bad case (3) for the versions without
> > Eric and Jason's fix.
> >
> > There may be another bad case: the original sock is found and passes
> > validation, but during further processing of the passive closer's FIN
> > on CPU 1, the sock has been destroyed on CPU 0; then the FIN segment
> > will be dropped and retransmitted. This case does not hurt the
> > application as much as resetting the reconnection, and it is less
> > likely than the other bad cases; it has not occurred on our product or
> > in the experimental environment, so it is not considered in this patch.
> >
> > Could you please check whether this fix is OK, or offer any suggestions?
> > Looking forward to your comments!
> >
> > Signed-off-by: Duan Muquan <[email protected]>
> > ---
> > net/ipv4/inet_timewait_sock.c | 15 +++++++--------
> > net/ipv4/tcp_ipv4.c | 13 +++++++++++++
> > 2 files changed, 20 insertions(+), 8 deletions(-)
> >
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index 40052414c7c7..ed1f255c9aa8 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -144,14 +144,6 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> >
> > spin_lock(lock);
> >
> > - inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> > -
> > - /* Step 3: Remove SK from hash chain */
> > - if (__sk_nulls_del_node_init_rcu(sk))
> > - sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> > -
> > - spin_unlock(lock);
> > -
> > /* tw_refcnt is set to 3 because we have :
> > * - one reference for bhash chain.
> > * - one reference for ehash chain.
> > @@ -162,6 +154,13 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> > * so we are not allowed to use tw anymore.
> > */
> > refcount_set(&tw->tw_refcnt, 3);
> > + inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> > +
> > + /* Step 3: Remove SK from hash chain */
> > + if (__sk_nulls_del_node_init_rcu(sk))
> > + sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> > +
> > + spin_unlock(lock);
> > }
> > EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 06d2573685ca..3e3cef202f76 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -2018,6 +2018,19 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
> > skb, __tcp_hdrlen(th), th->source,
> > th->dest, sdif, &refcounted);
> > +
> > + /* If the tw "dance" is performed on another CPU, the lookup process may find
> > + * no tw sock for the passive closer's FIN segment, but a listener sock,
> > + * which will reset the FIN segment. If the application reconnects immediately
> > + * with the same source port, it will get reset because the tw sock's
> > + * tw_substate is still TCP_FIN_WAIT2. Retry the lookup to get the tw sock.
> > + */
> > + if (unlikely(th->fin && sk && sk->sk_state == TCP_LISTEN)) {
> > + sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
> > + skb, __tcp_hdrlen(th), th->source,
> > + th->dest, sdif, &refcounted);
> > + }
> > +
> >
> >
> > I do not think this fixes anything; there is no barrier between the
> > first and second lookups.
> > This might reduce the race a little bit, but at the expense of extra
> > code in the fast path.
> >
> > If you want to fix this properly, I think we need to revisit the
> > handling of non-established sockets,
> > to hold the bucket spinlock over the whole thing.
> >
> > This will be slightly more complex than this...

2023-06-08 06:56:29

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

On Thu, Jun 8, 2023 at 7:48 AM Kuniyuki Iwashima <[email protected]> wrote:
>
> From: Eric Dumazet <[email protected]>
> Date: Wed, 7 Jun 2023 15:32:57 +0200
> > On Wed, Jun 7, 2023 at 1:59 PM Duan,Muquan <[email protected]> wrote:
> > >
> > > Hi, Eric,
> > >
> > > Thanks for your comments!
> > >
> > > About the second lookup, I am sorry that I did not give enough explanations about it. Here are some details:
> > >
> > > 1. The second lookup can find the tw sock and avoid the connection refuse error on userland applications:
> > >
> > > If the original sock is found, but when validating its refcnt, it has been destroyed and sk_refcnt has become 0 after being decreased by tcp_time_wait()->tcp_done()->inet_csk_destroy_sock()->sock_put(). The validation for refcnt fails and the lookup process gets a listener sock.
> > >
> > > When this case occurs, the hashdance has definitely finished, because tcp_done() is executed after inet_twsk_hashdance(). Then if the ehash table is looked up again, the tw sock will be found.
> > >
> > > With this fix, logically we can solve the connection reset issue completely when no established sock is found due to the hashdance race. In my reproducing environment, the connection refused error occurs about every 6 hours with only the fix for bad case (2). But with both fixes, I tested it many times; the longest test ran for 10 days and the error did not occur again.
> > >
> > >
> > >
> > > 2. About the performance impact:
> > >
> > > A similar scenario is that __inet_lookup_established() will do the inet_match() check a second time; if it fails, it will look up the list again. It is the extra effort to reduce the race impact without using a reader lock. inet_match() failure occurs with about the same probability as refcnt validation failure in my test environment.
> > >
> > > The second lookup will only be done in the condition that FIN segment gets a listener sock.
> > >
> > > About the performance impact:
> > >
> > > 1) Most of the time, this condition will not be met; the added code introduces at most 3 comparisons for each segment.
> > >
> > > The second inet_match() in __inet_lookup_established() does at least 3 comparisons for each segment.
> > >
> > >
> > > 2) When this condition is met (which happens with very small probability), the impact is similar to the second try due to inet_match() failure. Since the tw sock can definitely be found in the second try, I think this cost is worthwhile to avoid the connection refused error in userland applications.
> > >
> > >
> > >
> > > My understanding is that the current philosophy is to avoid the reader lock by tolerating a minor defect which occurs with small probability. For example, if the FIN from the passive closer is dropped because the found sock was destroyed, a retransmission can be tolerated; it only makes the connection termination slower. But I think the bottom line is that it must not affect the userland applications' functionality. If an application fails to connect due to the hashdance race, that can't be tolerated. In fact, guys from the product department push hard on the connection refused error.
> > >
> > >
> > > About bad case (2):
> > >
> > > tw sock is found, but its tw_refcnt has not been set to 3, it is still 0, validating for sk_refcnt will fail.
> > >
> > > I do not know the reason for setting tw_refcnt after adding it into the list; could anyone help point out the reason? It adds an extra race, because the newly added tw sock may be found and checked on another CPU concurrently before tw_refcnt is set to 3.
> > >
> > > By setting tw_refcnt to 3 before adding it into the list, this case will be solved, at almost no cost. In my reproducing environment, it occurs more frequently than bad case (1): it appears about every 20 minutes, while bad case (1) appears about every 6 hours.
> > >
> > >
> > >
> > > About the bucket spinlock: the original established sock and the tw sock are stored in the ehash table, and I am concerned about the performance when there are lots of short TCP connections; the reader lock may affect the performance of connection creation and termination. Could you share some details of your idea? Thanks in advance.
> > >
> > >
> >
> > Again, you can write a lot of stuff, the fact is that your patch does
> > not solve the issue.
> >
> > You could add 10 lookups, and still miss some cases, because they are
> > all RCU lookups with no barriers.
> >
> > In order to solve the issue of packets for the same 4-tuple being
> > processed by many cpus, the only way to solve races is to add mutual
> > exclusion.
> >
> > Note that we already have to lock the bucket spinlock every time we
> > transition a request socket to socket, a socket to timewait, or any
> > insert/delete.
> >
> > We need to expand the scope of this lock, and cleanup things that we
> > added in the past, because we tried too hard to 'detect races'
>
> How about this ? This is still a workaround though, retry sounds
> better than expanding the scope of the lock given the race is rare.

The chance of two cpus having to hold the same spinlock is rather small.

Algo is the following:

Attempt a lockless/RCU lookup.

1) Socket is found, we are good to go. Fast path is still fast.

2) Socket is not found in ehash
- We lock the bucket spinlock.
- We retry the lookup
- If socket found, continue with it (release the spinlock when
appropriate, after all write manipulations in the bucket are done)
- If socket still not found, we lookup a listener.
We insert a TCP_NEW_SYN_RECV ....
Again, we release the spinlock when appropriate, after all
write manipulations in the bucket are done)

No more races, and the fast path is the same.
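
A compilable sketch of that flow (userspace C; a pthread spinlock stands
in for the ehash bucket lock, and the helper's shape is an assumption,
not kernel code; pthread_spin_init() and the caller's unlock are omitted
for brevity):

#include <pthread.h>
#include <stddef.h>

struct sock_node { int hash; struct sock_node *next; };

struct bucket {
        pthread_spinlock_t lock;
        struct sock_node *chain;
};

static struct sock_node *scan_chain(struct bucket *b, int hash)
{
        struct sock_node *sk;

        for (sk = b->chain; sk; sk = sk->next)
                if (sk->hash == hash)
                        return sk;
        return NULL;
}

/* Fast path: nothing is locked. Slow path: the bucket lock is taken
 * and left held (*locked == 1) whether or not a socket turns up, so
 * the caller can finish the listener lookup and insert the
 * TCP_NEW_SYN_RECV entry before releasing it -- no hashdance can slip
 * into the bucket in between.
 */
static struct sock_node *lookup(struct bucket *b, int hash, int *locked)
{
        struct sock_node *sk;

        *locked = 0;
        sk = scan_chain(b, hash);       /* 1) lockless/RCU fast path */
        if (sk)
                return sk;

        pthread_spin_lock(&b->lock);    /* 2) miss: lock the bucket */
        *locked = 1;                    /* caller unlocks after all
                                         * bucket writes are done */
        return scan_chain(b, hash);     /* retry; NULL -> listener path */
}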




>
> ---8<---
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index e7391bf310a7..b034be2f37c8 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -484,14 +484,24 @@ struct sock *__inet_lookup_established(struct net *net,
> unsigned int hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
> unsigned int slot = hash & hashinfo->ehash_mask;
> struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
> + bool should_retry = true;
>
> begin:
> sk_nulls_for_each_rcu(sk, node, &head->chain) {
> if (sk->sk_hash != hash)
> continue;
> if (likely(inet_match(net, sk, acookie, ports, dif, sdif))) {
> - if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> + if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt))) {

Because of SLAB_TYPESAFE_BY_RCU, we cannot really do this kind of stuff.

Really this RCU lookup should be a best effort, not something
potentially looping X times.

We only need a proper fallback to spinlock protected lookup.

> + if (sk->sk_state == TCP_TIME_WAIT)
> + goto begin;
> +
> + if (sk->sk_state == TCP_CLOSE && should_retry) {
> + should_retry = false;
> + goto begin;
> + }
> +
> goto out;
> + }
> if (unlikely(!inet_match(net, sk, acookie,
> ports, dif, sdif))) {
> sock_gen_put(sk);
> ---8<---
>
>
> >

2023-06-08 12:06:50

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

On Thu, Jun 8, 2023 at 1:24 PM Duan,Muquan <[email protected]> wrote:
>
> Besides trying to find the right tw sock, another idea is that if a FIN segment finds a listener sock, we just discard the segment, because this is obviously a bad case, and the peer will retransmit it. Or, for a FIN segment, we only look up the established hash table, and if nothing is found, we discard it.
>

Sure, please give the RFC number and section number that discusses
this point, and then we might consider this.

Just another reminder about TW : timewait sockets are "best effort".

Their allocation can fail, and /proc/sys/net/ipv4/tcp_max_tw_buckets
can limit their number (even down to 0).

Applications must be able to recover gracefully if a 4-tuple is reused too fast.

>
> On Jun 8, 2023, at 12:13 PM, Eric Dumazet <[email protected]> wrote:
>
> On Thu, Jun 8, 2023 at 5:59 AM Duan,Muquan <[email protected]> wrote:
>
>
> Hi, Eric,
>
> Thanks a lot for your explanation!
>
> Even if we add a reader lock, if the refcnt is set outside spin_lock()/spin_unlock(), during the interval between spin_unlock() and refcount_set(), other cpus will see the tw sock with refcnt 0, and validation for refcnt will fail.
>
> A suggestion: before the tw sock is added into the ehash table, it is already used by the tw timer and the bhash chain, so we can first set the refcnt to 2 before adding the tw sock to the ehash table, or add the refcnt one by one for timer, bhash and ehash. This can avoid the refcnt validation failure on other cpus.
>
> This can reduce the frequency of the connection reset issue from every 20 min to every 180 min for our product. We may wait quite a long time before the best solution is ready; if this obvious defect is fixed now, userland applications can benefit from it.
>
> Looking forward to your opinions!
>
>
> Again, my opinion is that we need a proper fix, not work arounds.
>
> I will work on this a bit later.
>
> In the meantime you can apply locally your patch if you feel this is
> what you want.
>
>

2023-06-15 12:50:17

by Duan,Muquan

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

Hi, Eric,

Could you please help check whether the following method makes sense, and
whether it has any side effects? Thanks!

If the tw hashdance is done in the real TCP_TIME_WAIT state (substate
TCP_TIME_WAIT) instead of in substate TCP_FIN_WAIT2, the connection
refused issue will not occur. The race between the lookup for the
new SYN segment and the tw hashdance can have the following
results:
1) the tw sock is found; the SYN segment will be accepted via TCP_TW_SYN.
2) neither the tw sock nor the original sock is found; a listener sock
is found, and the SYN segment will be accepted by it.
3) the original sock is found; the SYN segment can be discarded in
further processing after the sock is destroyed, in which case the peer
will retransmit the SYN segment. This is a very rare case; there seems
to be no need to add a lock for it.

I tested this modification in my reproducing environment: the
connection reset issue did not occur, and no performance impact was
observed. The side effect I can currently think of is that the
original sock will live a little longer and hold some resources longer.
The new patch is a temporary version which has a sysctl switch for
comparing the 2 methods; some modifications for statistics of the
states are not included yet.
I checked the implementation in FreeBSD 13.1, it does the dance in
state TCP_TIMEWAIT.
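
Schematically, the proposal moves the hashdance behind a substate check
(an illustrative fragment only; the temporary patch mentioned above was
not posted in this thread, and the toy_* names are stand-ins):

/* Defer the sock -> tw swap in the ehash until the connection is
 * really in TIME_WAIT.
 */
enum toy_substate { TOY_FIN_WAIT2, TOY_TIME_WAIT };

static void toy_enter_tw(enum toy_substate substate,
                         void (*hashdance)(void))
{
        if (substate == TOY_TIME_WAIT)
                hashdance();    /* the tw sock replaces the original sock */
        /* In TOY_FIN_WAIT2 the original sock stays hashed and handles
         * the peer's FIN itself; the cost is that it lives a little
         * longer, as noted above.
         */
}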



Regards!
Duan

> On Jun 8, 2023, at 7:54 PM, Eric Dumazet <[email protected]> wrote:
>
> On Thu, Jun 8, 2023 at 1:24 PM Duan,Muquan <[email protected]> wrote:
>>
>> Besides trying to find the right tw sock, another idea is that if a FIN segment finds a listener sock, we just discard the segment, because this is obviously a bad case, and the peer will retransmit it. Or, for a FIN segment, we only look up the established hash table, and if nothing is found, we discard it.
>>
>
> Sure, please give the RFC number and section number that discusses
> this point, and then we might consider this.
>
> Just another reminder about TW : timewait sockets are "best effort".
>
> Their allocation can fail, and /proc/sys/net/ipv4/tcp_max_tw_buckets
> can limit their number (even down to 0).
>
> Applications must be able to recover gracefully if a 4-tuple is reused too fast.
>
>>
>> On Jun 8, 2023, at 12:13 PM, Eric Dumazet <[email protected]> wrote:
>>
>> On Thu, Jun 8, 2023 at 5:59 AM Duan,Muquan <[email protected]> wrote:
>>
>>
>> Hi, Eric,
>>
>> Thanks a lot for your explanation!
>>
>> Even if we add a reader lock, if the refcnt is set outside spin_lock()/spin_unlock(), during the interval between spin_unlock() and refcount_set(), other cpus will see the tw sock with refcnt 0, and validation for refcnt will fail.
>>
>> A suggestion: before the tw sock is added into the ehash table, it is already used by the tw timer and the bhash chain, so we can first set the refcnt to 2 before adding the tw sock to the ehash table, or add the refcnt one by one for timer, bhash and ehash. This can avoid the refcnt validation failure on other cpus.
>>
>> This can reduce the frequency of the connection reset issue from every 20 min to every 180 min for our product. We may wait quite a long time before the best solution is ready; if this obvious defect is fixed now, userland applications can benefit from it.
>>
>> Looking forward to your opinions!
>>
>>
>> Again, my opinion is that we need a proper fix, not work arounds.
>>
>> I will work on this a bit later.
>>
>> In the meantime you can apply locally your patch if you feel this is
>> what you want.
>>
>>

2023-06-15 15:48:12

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

On Thu, Jun 15, 2023 at 2:14 PM Duan,Muquan <[email protected]> wrote:
>
> Hi, Eric,
>
> Could you please help check whether the following method makes sense, and
> whether it has any side effects? Thanks!
>
> If the tw hashdance is done in the real TCP_TIME_WAIT state (substate
> TCP_TIME_WAIT) instead of in substate TCP_FIN_WAIT2, the connection
> refused issue will not occur. The race between the lookup for the
> new SYN segment and the tw hashdance can have the following
> results:
> 1) the tw sock is found; the SYN segment will be accepted via TCP_TW_SYN.
> 2) neither the tw sock nor the original sock is found; a listener sock
> is found, and the SYN segment will be accepted by it.
> 3) the original sock is found; the SYN segment can be discarded in
> further processing after the sock is destroyed, in which case the peer
> will retransmit the SYN segment. This is a very rare case; there seems
> to be no need to add a lock for it.
>

You speak of SYN packet, while the original patch was all about FIN.

> I tested this modification in my reproducing environment,
> the connection reset issue did not occur and no performance impact
> observed.The side effect I currently can think out is that the original
> sock will live a little longer and hold some resources longer.
> The new patch is a temporary version which has a sysctl
> switch for comparing the 2 methods, and some modifications
> for statistics of the states not included.

So you are suggesting adding another hack/workaround for SYN packets,
on top of the other one ? What will follow next ?

Again, a correct fix needs to expand the scope of the ehash bucket lock.

I will handle this when time permits.

2023-06-19 17:36:29

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

From: Eric Dumazet <[email protected]>
Date: Thu, 8 Jun 2023 08:35:20 +0200
> On Thu, Jun 8, 2023 at 7:48 AM Kuniyuki Iwashima <[email protected]> wrote:
> >
> > From: Eric Dumazet <[email protected]>
> > Date: Wed, 7 Jun 2023 15:32:57 +0200
> > > On Wed, Jun 7, 2023 at 1:59 PM Duan,Muquan <[email protected]> wrote:
> > > >
> > > > Hi, Eric,
> > > >
> > > > Thanks for your comments!
> > > >
> > > > About the second lookup, I am sorry that I did not give enough explanations about it. Here are some details:
> > > >
> > > > 1. The second lookup can find the tw sock and avoid the connection refuse error on userland applications:
> > > >
> > > > If the original sock is found, but when validating its refcnt, it has been destroyed and sk_refcnt has become 0 after being decreased by tcp_time_wait()->tcp_done()->inet_csk_destroy_sock()->sock_put(). The validation for refcnt fails and the lookup process gets a listener sock.
> > > >
> > > > When this case occurs, the hashdance has definitely finished, because tcp_done() is executed after inet_twsk_hashdance(). Then if the ehash table is looked up again, the tw sock will be found.
> > > >
> > > > With this fix, logically we can solve the connection reset issue completely when no established sock is found due to the hashdance race. In my reproducing environment, the connection refused error occurs about every 6 hours with only the fix for bad case (2). But with both fixes, I tested it many times; the longest test ran for 10 days and the error did not occur again.
> > > >
> > > >
> > > >
> > > > 2. About the performance impact:
> > > >
> > > > A similar scenario is that __inet_lookup_established() will do the inet_match() check a second time; if it fails, it will look up the list again. It is the extra effort to reduce the race impact without using a reader lock. inet_match() failure occurs with about the same probability as refcnt validation failure in my test environment.
> > > >
> > > > The second lookup will only be done in the condition that FIN segment gets a listener sock.
> > > >
> > > > About the performance impact:
> > > >
> > > > 1) Most of the time, this condition will not be met; the added code introduces at most 3 comparisons for each segment.
> > > >
> > > > The second inet_match() in __inet_lookup_established() does at least 3 comparisons for each segment.
> > > >
> > > >
> > > > 2) When this condition is met (which happens with very small probability), the impact is similar to the second try due to inet_match() failure. Since the tw sock can definitely be found in the second try, I think this cost is worthwhile to avoid the connection refused error in userland applications.
> > > >
> > > >
> > > >
> > > > My understanding is that the current philosophy is to avoid the reader lock by tolerating a minor defect which occurs with small probability. For example, if the FIN from the passive closer is dropped because the found sock was destroyed, a retransmission can be tolerated; it only makes the connection termination slower. But I think the bottom line is that it must not affect the userland applications' functionality. If an application fails to connect due to the hashdance race, that can't be tolerated. In fact, guys from the product department push hard on the connection refused error.
> > > >
> > > >
> > > > About bad case (2):
> > > >
> > > > tw sock is found, but its tw_refcnt has not been set to 3, it is still 0, validating for sk_refcnt will fail.
> > > >
> > > > I do not know the reason for setting tw_refcnt after adding it into the list; could anyone help point out the reason? It adds an extra race, because the newly added tw sock may be found and checked on another CPU concurrently before tw_refcnt is set to 3.
> > > >
> > > > By setting tw_refcnt to 3 before adding it into the list, this case will be solved, at almost no cost. In my reproducing environment, it occurs more frequently than bad case (1): it appears about every 20 minutes, while bad case (1) appears about every 6 hours.
> > > >
> > > >
> > > >
> > > > About the bucket spinlock: the original established sock and the tw sock are stored in the ehash table, and I am concerned about the performance when there are lots of short TCP connections; the reader lock may affect the performance of connection creation and termination. Could you share some details of your idea? Thanks in advance.
> > > >
> > > >
> > >
> > > Again, you can write a lot of stuff, the fact is that your patch does
> > > not solve the issue.
> > >
> > > You could add 10 lookups, and still miss some cases, because they are
> > > all RCU lookups with no barriers.
> > >
> > > In order to solve the issue of packets for the same 4-tuple being
> > > processed by many cpus, the only way to solve races is to add mutual
> > > exclusion.
> > >
> > > Note that we already have to lock the bucket spinlock every time we
> > > transition a request socket to socket, a socket to timewait, or any
> > > insert/delete.
> > >
> > > We need to expand the scope of this lock, and cleanup things that we
> > > added in the past, because we tried too hard to 'detect races'
> >
> > How about this ? This is still a workaround though, retry sounds
> > better than expanding the scope of the lock given the race is rare.
>
> The chance of two cpus having to hold the same spinlock is rather small.
>
> Algo is the following:
>
> Attempt a lockless/RCU lookup.
>
> 1) Socket is found, we are good to go. Fast path is still fast.
>
> 2) Socket is not found in ehash
> - We lock the bucket spinlock.
> - We retry the lookup
> - If socket found, continue with it (release the spinlock when
> appropriate, after all write manipulations in the bucket are done)
> - If socket still not found, we lookup a listener.
> We insert a TCP_NEW_SYN_RECV ....
> Again, we release the spinlock when appropriate, after all
> write manipulations in the bucket are done)
>
> No more races, and the fast path is the same.

I was looking into the issue this weekend. Is this what you were
thinking? I'm wondering if you were also thinking of other races, like
the found_dup_sk/own_req things, e.g. acquiring the ehash lock when we
start to process a reqsk?

Duan, could you test the diff below ?

If this resolves the FIN issue, we can also revert 3f4ca5fafc08 ("tcp:
avoid the lookup process failing to get sk in ehash table").

---8<---
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 56f1286583d3..bb8e49a6e80f 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -48,6 +48,11 @@ struct sock *__inet6_lookup_established(struct net *net,
const u16 hnum, const int dif,
const int sdif);

+struct sock *__inet6_lookup_established_lock(struct net *net, struct inet_hashinfo *hashinfo,
+ const struct in6_addr *saddr, const __be16 sport,
+ const struct in6_addr *daddr, const u16 hnum,
+ const int dif, const int sdif);
+
struct sock *inet6_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
@@ -70,9 +75,15 @@ static inline struct sock *__inet6_lookup(struct net *net,
struct sock *sk = __inet6_lookup_established(net, hashinfo, saddr,
sport, daddr, hnum,
dif, sdif);
- *refcounted = true;
- if (sk)
+
+ if (!sk)
+ sk = __inet6_lookup_established_lock(net, hashinfo, saddr, sport,
+ daddr, hnum, dif, sdif);
+ if (sk) {
+ *refcounted = true;
return sk;
+ }
+
*refcounted = false;
return inet6_lookup_listener(net, hashinfo, skb, doff, saddr, sport,
daddr, hnum, dif, sdif);
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 99bd823e97f6..ad97fec63d7a 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -379,6 +379,12 @@ struct sock *__inet_lookup_established(struct net *net,
const __be32 daddr, const u16 hnum,
const int dif, const int sdif);

+struct sock *__inet_lookup_established_lock(struct net *net,
+ struct inet_hashinfo *hashinfo,
+ const __be32 saddr, const __be16 sport,
+ const __be32 daddr, const u16 hnum,
+ const int dif, const int sdif);
+
static inline struct sock *
inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
const __be32 saddr, const __be16 sport,
@@ -402,9 +408,14 @@ static inline struct sock *__inet_lookup(struct net *net,

sk = __inet_lookup_established(net, hashinfo, saddr, sport,
daddr, hnum, dif, sdif);
- *refcounted = true;
- if (sk)
+ if (!sk)
+ sk = __inet_lookup_established_lock(net, hashinfo, saddr, sport,
+ daddr, hnum, dif, sdif);
+ if (sk) {
+ *refcounted = true;
return sk;
+ }
+
*refcounted = false;
return __inet_lookup_listener(net, hashinfo, skb, doff, saddr,
sport, daddr, hnum, dif, sdif);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index e7391bf310a7..1eeadaf1c9f9 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -514,6 +514,41 @@ struct sock *__inet_lookup_established(struct net *net,
}
EXPORT_SYMBOL_GPL(__inet_lookup_established);

+struct sock *__inet_lookup_established_lock(struct net *net, struct inet_hashinfo *hashinfo,
+ const __be32 saddr, const __be16 sport,
+ const __be32 daddr, const u16 hnum,
+ const int dif, const int sdif)
+{
+ const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
+ INET_ADDR_COOKIE(acookie, saddr, daddr);
+ const struct hlist_nulls_node *node;
+ struct inet_ehash_bucket *head;
+ unsigned int hash;
+ spinlock_t *lock;
+ struct sock *sk;
+
+ hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
+ head = inet_ehash_bucket(hashinfo, hash);
+ lock = inet_ehash_lockp(hashinfo, hash);
+
+ spin_lock(lock);
+ sk_nulls_for_each(sk, node, &head->chain) {
+ if (sk->sk_hash != hash)
+ continue;
+
+ if (unlikely(!inet_match(net, sk, acookie, ports, dif, sdif)))
+ continue;
+
+ sock_hold(sk);
+ spin_unlock(lock);
+ return sk;
+ }
+ spin_unlock(lock);
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(__inet_lookup_established_lock);
+
/* called with local bh disabled */
static int __inet_check_established(struct inet_timewait_death_row *death_row,
struct sock *sk, __u16 lport,
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index b64b49012655..1b2c971859c0 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -89,6 +89,40 @@ struct sock *__inet6_lookup_established(struct net *net,
}
EXPORT_SYMBOL(__inet6_lookup_established);

+struct sock *__inet6_lookup_established_lock(struct net *net, struct inet_hashinfo *hashinfo,
+ const struct in6_addr *saddr, const __be16 sport,
+ const struct in6_addr *daddr, const u16 hnum,
+ const int dif, const int sdif)
+{
+ const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
+ const struct hlist_nulls_node *node;
+ struct inet_ehash_bucket *head;
+ unsigned int hash;
+ spinlock_t *lock;
+ struct sock *sk;
+
+ hash = inet6_ehashfn(net, daddr, hnum, saddr, sport);
+ head = inet_ehash_bucket(hashinfo, hash);
+ lock = inet_ehash_lockp(hashinfo, hash);
+
+ spin_lock(lock);
+ sk_nulls_for_each(sk, node, &head->chain) {
+ if (sk->sk_hash != hash)
+ continue;
+
+ if (unlikely(!inet6_match(net, sk, saddr, daddr, ports, dif, sdif)))
+ continue;
+
+ sock_hold(sk);
+ spin_unlock(lock);
+ return sk;
+ }
+ spin_unlock(lock);
+
+ return NULL;
+}
+EXPORT_SYMBOL(__inet6_lookup_established_lock);
+
static inline int compute_score(struct sock *sk, struct net *net,
const unsigned short hnum,
const struct in6_addr *daddr,
---8<---

2023-06-19 18:09:13

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

On Mon, Jun 19, 2023 at 7:03 PM Kuniyuki Iwashima <[email protected]> wrote:
>
> From: Eric Dumazet <[email protected]>
> Date: Thu, 8 Jun 2023 08:35:20 +0200
> > On Thu, Jun 8, 2023 at 7:48 AM Kuniyuki Iwashima <[email protected]> wrote:
> > >
> > > From: Eric Dumazet <[email protected]>
> > > Date: Wed, 7 Jun 2023 15:32:57 +0200
> > > > On Wed, Jun 7, 2023 at 1:59 PM Duan,Muquan <[email protected]> wrote:
> > > > >
> > > > > Hi, Eric,
> > > > >
> > > > > Thanks for your comments!
> > > > >
> > > > > About the second lookup, I am sorry that I did not give enough explanations about it. Here are some details:
> > > > >
> > > > > 1. The second lookup can find the tw sock and avoid the connection refuse error on userland applications:
> > > > >
> > > > > If the original sock is found, but when validating its refcnt, it has been destroyed and sk_refcnt has become 0 after being decreased by tcp_time_wait()->tcp_done()->inet_csk_destroy_sock()->sock_put(). The validation for refcnt fails and the lookup process gets a listener sock.
> > > > >
> > > > > When this case occurs, the hashdance has definitely finished, because tcp_done() is executed after inet_twsk_hashdance(). Then if the ehash table is looked up again, the tw sock will be found.
> > > > >
> > > > > With this fix, logically we can solve the connection reset issue completely when no established sock is found due to the hashdance race. In my reproducing environment, the connection refused error occurs about every 6 hours with only the fix for bad case (2). But with both fixes, I tested it many times; the longest test ran for 10 days and the error did not occur again.
> > > > >
> > > > >
> > > > >
> > > > > 2. About the performance impact:
> > > > >
> > > > > A similar scenario is that __inet_lookup_established() will do the inet_match() check a second time; if it fails, it will look up the list again. It is the extra effort to reduce the race impact without using a reader lock. inet_match() failure occurs with about the same probability as refcnt validation failure in my test environment.
> > > > >
> > > > > The second lookup will only be done in the condition that FIN segment gets a listener sock.
> > > > >
> > > > > About the performance impact:
> > > > >
> > > > > 1) Most of the time, this condition will not be met; the added code introduces at most 3 comparisons for each segment.
> > > > >
> > > > > The second inet_match() in __inet_lookup_established() does at least 3 comparisons for each segment.
> > > > >
> > > > >
> > > > > 2) When this condition is met (which happens with very small probability), the impact is similar to the second try due to inet_match() failure. Since the tw sock can definitely be found in the second try, I think this cost is worthwhile to avoid the connection refused error in userland applications.
> > > > >
> > > > >
> > > > >
> > > > > My understanding is that the current philosophy is to avoid the reader lock by tolerating a minor defect which occurs with small probability. For example, if the FIN from the passive closer is dropped because the found sock was destroyed, a retransmission can be tolerated; it only makes the connection termination slower. But I think the bottom line is that it must not affect the userland applications' functionality. If an application fails to connect due to the hashdance race, that can't be tolerated. In fact, guys from the product department push hard on the connection refused error.
> > > > >
> > > > >
> > > > > About bad case (2):
> > > > >
> > > > > tw sock is found, but its tw_refcnt has not been set to 3, it is still 0, validating for sk_refcnt will fail.
> > > > >
> > > > > I do not know the reason for setting tw_refcnt after adding it into the list; could anyone help point out the reason? It adds an extra race, because the newly added tw sock may be found and checked on another CPU concurrently before tw_refcnt is set to 3.
> > > > >
> > > > > By setting tw_refcnt to 3 before adding it into the list, this case will be solved, at almost no cost. In my reproducing environment, it occurs more frequently than bad case (1): it appears about every 20 minutes, while bad case (1) appears about every 6 hours.
> > > > >
> > > > >
> > > > >
> > > > > About the bucket spinlock: the original established sock and the tw sock are stored in the ehash table, and I am concerned about the performance when there are lots of short TCP connections; the reader lock may affect the performance of connection creation and termination. Could you share some details of your idea? Thanks in advance.
> > > > >
> > > > >
> > > >
> > > > Again, you can write a lot of stuff, the fact is that your patch does
> > > > not solve the issue.
> > > >
> > > > You could add 10 lookups, and still miss some cases, because they are
> > > > all RCU lookups with no barriers.
> > > >
> > > > In order to solve the issue of packets for the same 4-tuple being
> > > > processed by many cpus, the only way to solve races is to add mutual
> > > > exclusion.
> > > >
> > > > Note that we already have to lock the bucket spinlock every time we
> > > > transition a request socket to socket, a socket to timewait, or any
> > > > insert/delete.
> > > >
> > > > We need to expand the scope of this lock, and cleanup things that we
> > > > added in the past, because we tried too hard to 'detect races'
> > >
> > > How about this ? This is still a workaround though, retry sounds
> > > better than expanding the scope of the lock given the race is rare.
> >
> > The chance of two cpus having to hold the same spinlock is rather small.
> >
> > Algo is the following:
> >
> > Attempt a lockless/RCU lookup.
> >
> > 1) Socket is found, we are good to go. Fast path is still fast.
> >
> > 2) Socket is not found in ehash
> > - We lock the bucket spinlock.
> > - We retry the lookup
> > - If socket found, continue with it (release the spinlock when
> > appropriate, after all write manipulations in the bucket are done)
> > - If socket still not found, we lookup a listener.
> > We insert a TCP_NEW_SYN_RECV ....
> > Again, we release the spinlock when appropriate, after all
> > write manipulations in the bucket are done)
> >
> > No more races, and the fast path is the same.
>
> I was looking into the issue this weekend. Is this what you were
> thinking? I'm wondering if you were also thinking of other races, like
> the found_dup_sk/own_req things, e.g. acquiring the ehash lock when we
> start to process a reqsk?

No.

Idea is to hold the bucket lock and keep it until all write operations
in the bucket have been done.



>
> Duan, could you test the diff below ?
>
> If this resolves the FIN issue, we can also revert 3f4ca5fafc08 ("tcp:
> avoid the lookup process failing to get sk in ehash table").
>
> ---8<---
> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> index 56f1286583d3..bb8e49a6e80f 100644
> --- a/include/net/inet6_hashtables.h
> +++ b/include/net/inet6_hashtables.h
> @@ -48,6 +48,11 @@ struct sock *__inet6_lookup_established(struct net *net,
> const u16 hnum, const int dif,
> const int sdif);
>
> +struct sock *__inet6_lookup_established_lock(struct net *net, struct inet_hashinfo *hashinfo,
> + const struct in6_addr *saddr, const __be16 sport,
> + const struct in6_addr *daddr, const u16 hnum,
> + const int dif, const int sdif);
> +
> struct sock *inet6_lookup_listener(struct net *net,
> struct inet_hashinfo *hashinfo,
> struct sk_buff *skb, int doff,
> @@ -70,9 +75,15 @@ static inline struct sock *__inet6_lookup(struct net *net,
> struct sock *sk = __inet6_lookup_established(net, hashinfo, saddr,
> sport, daddr, hnum,
> dif, sdif);
> - *refcounted = true;
> - if (sk)
> +
> + if (!sk)
> + sk = __inet6_lookup_established_lock(net, hashinfo, saddr, sport,
> + daddr, hnum, dif, sdif);
> + if (sk) {
> + *refcounted = true;
> return sk;
> + }
> +
> *refcounted = false;
> return inet6_lookup_listener(net, hashinfo, skb, doff, saddr, sport,
> daddr, hnum, dif, sdif);
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 99bd823e97f6..ad97fec63d7a 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -379,6 +379,12 @@ struct sock *__inet_lookup_established(struct net *net,
> const __be32 daddr, const u16 hnum,
> const int dif, const int sdif);
>
> +struct sock *__inet_lookup_established_lock(struct net *net,
> + struct inet_hashinfo *hashinfo,
> + const __be32 saddr, const __be16 sport,
> + const __be32 daddr, const u16 hnum,
> + const int dif, const int sdif);
> +
> static inline struct sock *
> inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
> const __be32 saddr, const __be16 sport,
> @@ -402,9 +408,14 @@ static inline struct sock *__inet_lookup(struct net *net,
>
> sk = __inet_lookup_established(net, hashinfo, saddr, sport,
> daddr, hnum, dif, sdif);
> - *refcounted = true;
> - if (sk)
> + if (!sk)
> + sk = __inet_lookup_established_lock(net, hashinfo, saddr, sport,
> + daddr, hnum, dif, sdif);
> + if (sk) {
> + *refcounted = true;
> return sk;
> + }
> +
> *refcounted = false;
> return __inet_lookup_listener(net, hashinfo, skb, doff, saddr,
> sport, daddr, hnum, dif, sdif);
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index e7391bf310a7..1eeadaf1c9f9 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -514,6 +514,41 @@ struct sock *__inet_lookup_established(struct net *net,
> }
> EXPORT_SYMBOL_GPL(__inet_lookup_established);
>
> +struct sock *__inet_lookup_established_lock(struct net *net, struct inet_hashinfo *hashinfo,
> + const __be32 saddr, const __be16 sport,
> + const __be32 daddr, const u16 hnum,
> + const int dif, const int sdif)
> +{
> + const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
> + INET_ADDR_COOKIE(acookie, saddr, daddr);
> + const struct hlist_nulls_node *node;
> + struct inet_ehash_bucket *head;
> + unsigned int hash;
> + spinlock_t *lock;
> + struct sock *sk;
> +
> + hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
> + head = inet_ehash_bucket(hashinfo, hash);
> + lock = inet_ehash_lockp(hashinfo, hash);
> +
> + spin_lock(lock);
> + sk_nulls_for_each(sk, node, &head->chain) {
> + if (sk->sk_hash != hash)
> + continue;
> +
> + if (unlikely(!inet_match(net, sk, acookie, ports, dif, sdif)))
> + continue;
> +
> + sock_hold(sk);
> + spin_unlock(lock);
> + return sk;
> + }
> + spin_unlock(lock);

Here we need to keep the lock held, and release it later.

> +
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(__inet_lookup_established_lock);
> +
> /* called with local bh disabled */
> static int __inet_check_established(struct inet_timewait_death_row *death_row,
> struct sock *sk, __u16 lport,
> diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> index b64b49012655..1b2c971859c0 100644
> --- a/net/ipv6/inet6_hashtables.c
> +++ b/net/ipv6/inet6_hashtables.c
> @@ -89,6 +89,40 @@ struct sock *__inet6_lookup_established(struct net *net,
> }
> EXPORT_SYMBOL(__inet6_lookup_established);
>
> +struct sock *__inet6_lookup_established_lock(struct net *net, struct inet_hashinfo *hashinfo,
> + const struct in6_addr *saddr, const __be16 sport,
> + const struct in6_addr *daddr, const u16 hnum,
> + const int dif, const int sdif)
> +{
> + const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
> + const struct hlist_nulls_node *node;
> + struct inet_ehash_bucket *head;
> + unsigned int hash;
> + spinlock_t *lock;
> + struct sock *sk;
> +
> + hash = inet6_ehashfn(net, daddr, hnum, saddr, sport);
> + head = inet_ehash_bucket(hashinfo, hash);
> + lock = inet_ehash_lockp(hashinfo, hash);
> +
> + spin_lock(lock);
> + sk_nulls_for_each(sk, node, &head->chain) {
> + if (sk->sk_hash != hash)
> + continue;
> +
> + if (unlikely(!inet6_match(net, sk, saddr, daddr, ports, dif, sdif)))
> + continue;
> +
> + sock_hold(sk);
> + spin_unlock(lock);
> + return sk;
> + }
> + spin_unlock(lock);

/* Same here. */

> +
> + return NULL;
> +}
> +EXPORT_SYMBOL(__inet6_lookup_established_lock);
> +
> static inline int compute_score(struct sock *sk, struct net *net,
> const unsigned short hnum,
> const struct in6_addr *daddr,
> ---8<---
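
To make the two comments above concrete, here is a minimal sketch (not
the posted patch) of the IPv4 helper returning with the bucket lock
still held, so the caller releases it only after all write operations
in the bucket are done. The extra 'lockp' out-parameter is hypothetical;
the rest mirrors the diff above.

struct sock *__inet_lookup_established_lock(struct net *net,
                                            struct inet_hashinfo *hashinfo,
                                            const __be32 saddr, const __be16 sport,
                                            const __be32 daddr, const u16 hnum,
                                            const int dif, const int sdif,
                                            spinlock_t **lockp)
{
        const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
        INET_ADDR_COOKIE(acookie, saddr, daddr);
        const struct hlist_nulls_node *node;
        struct inet_ehash_bucket *head;
        unsigned int hash;
        struct sock *sk;

        hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
        head = inet_ehash_bucket(hashinfo, hash);
        *lockp = inet_ehash_lockp(hashinfo, hash);

        /* The lock is taken here and intentionally NOT released below:
         * the caller must spin_unlock(*lockp) once it has finished any
         * insert/delete in this bucket (e.g. the tw hashdance, or
         * inserting a TCP_NEW_SYN_RECV).
         */
        spin_lock(*lockp);
        sk_nulls_for_each(sk, node, &head->chain) {
                if (sk->sk_hash != hash)
                        continue;
                if (unlikely(!inet_match(net, sk, acookie, ports, dif, sdif)))
                        continue;
                sock_hold(sk);
                return sk;      /* found: returned with the lock held */
        }
        return NULL;            /* not found: lock still held for the caller */
}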

2023-06-19 18:45:04

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

From: Eric Dumazet <[email protected]>
Date: Mon, 19 Jun 2023 19:39:20 +0200
> On Mon, Jun 19, 2023 at 7:03 PM Kuniyuki Iwashima <[email protected]> wrote:
> >
> > From: Eric Dumazet <[email protected]>
> > Date: Thu, 8 Jun 2023 08:35:20 +0200
> > > On Thu, Jun 8, 2023 at 7:48 AM Kuniyuki Iwashima <[email protected]> wrote:
> > > >
> > > > From: Eric Dumazet <[email protected]>
> > > > Date: Wed, 7 Jun 2023 15:32:57 +0200
> > > > > On Wed, Jun 7, 2023 at 1:59 PM Duan,Muquan <[email protected]> wrote:
> > > > > >
> > > > > > Hi, Eric,
> > > > > >
> > > > > > Thanks for your comments!
> > > > > >
> > > > > > About the second lookup, I am sorry that I did not give enough explanation of it. Here are some details:
> > > > > >
> > > > > > 1. The second lookup can find the tw sock and avoid the connection refused error in userland applications:
> > > > > >
> > > > > > If the original sock is found, but by the time its refcnt is validated it has already been destroyed and sk_refcnt has dropped to 0 via tcp_time_wait()->tcp_done()->inet_csk_destroy_sock()->sock_put(), then the refcnt validation fails and the lookup process falls through to a listener sock.
> > > > > >
> > > > > > When this case occurs, the hashdance has definitely finished, because tcp_done() is executed after inet_twsk_hashdance(). So if we look up the ehash table again, the tw sock will be found.
> > > > > >
> > > > > > With this fix, logically we can solve the connection reset issue completely when no established sock is found due to the hashdance race. In my reproducing environment, the connection refused error occurs about every 6 hours with only the fix for bad case (2). With both fixes applied I tested many times; the longest run lasted 10 days and the error did not occur again.
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2. About the performance impact:
> > > > > >
> > > > > > A similar scenario is that __inet_lookup_established() does the inet_match() check a second time; if it fails, it walks the list again. That is the extra effort taken to reduce the race impact without using a reader lock. In my test environment, inet_match() failure occurs with about the same probability as refcnt validation failure.
> > > > > >
> > > > > > The second lookup will only be done when a FIN segment ends up with a listener sock.
> > > > > >
> > > > > > 1) Most of the time this condition is not met; the added code introduces at most 3 comparisons for each segment.
> > > > > >
> > > > > > (For comparison, the second inet_match() in __inet_lookup_established() already does at least 3 comparisons for each segment.)
> > > > > >
> > > > > >
> > > > > > 2) When this condition is met, which happens with very small probability, the impact is similar to the retry done on inet_match() failure. Since the tw sock can definitely be found on the second try, I think this cost is worth paying to avoid the connection refused error in userland applications.
> > > > > >
> > > > > >
> > > > > >
> > > > > > My understanding is that the current philosophy is to avoid the reader lock by tolerating minor defects that occur with small probability. For example, if the FIN from the passive closer is dropped because the found sock has been destroyed, a retransmission can be tolerated; it only makes the connection termination slower. But I think the bottom line is that this must not affect the functionality of userland applications. If an application fails to connect due to the hashdance race, that cannot be tolerated. In fact, people from the product department are pushing hard on the connection refused error.
> > > > > >
> > > > > >
> > > > > > About bad case (2):
> > > > > >
> > > > > > The tw sock is found, but its tw_refcnt has not been set to 3 yet; it is still 0, so the sk_refcnt validation fails.
> > > > > >
> > > > > > I do not know the reason for setting tw_refcnt only after adding the tw sock to the list; could anyone help point it out? It adds an extra race, because the newly added tw sock may be found and checked on another CPU concurrently before tw_refcnt is set to 3.
> > > > > >
> > > > > > Setting tw_refcnt to 3 before adding the tw sock to the list solves this case at almost no cost. In my reproducing environment it occurs more frequently than bad case (1): it appears about every 20 minutes, while bad case (1) appears about every 6 hours.
> > > > > >
> > > > > >
> > > > > >
> > > > > > About the bucket spinlock: the original established sock and the tw sock are both stored in the ehash table, and I am concerned about performance when there are lots of short TCP connections; the reader lock may affect the performance of connection creation and termination. Could you share some details of your idea? Thanks in advance.
> > > > > >
> > > > > >
> > > > >
> > > > > Again, you can write a lot of stuff, the fact is that your patch does
> > > > > not solve the issue.
> > > > >
> > > > > You could add 10 lookups, and still miss some cases, because they are
> > > > > all RCU lookups with no barriers.
> > > > >
> > > > > In order to solve the issue of packets for the same 4-tuple being
> > > > > processed by many cpus, the only way to solve races is to add mutual
> > > > > exclusion.
> > > > >
> > > > > Note that we already have to lock the bucket spinlock every time we
> > > > > transition a request socket to socket, a socket to timewait, or any
> > > > > insert/delete.
> > > > >
> > > > > We need to expand the scope of this lock, and clean up things that we
> > > > > added in the past, because we tried too hard to 'detect races'.
> > > >
> > > > How about this? It is still a workaround, though; a retry sounds
> > > > better than expanding the scope of the lock, given the race is rare.
> > >
> > > The chance of two cpus having to hold the same spinlock is rather small.
> > >
> > > Algo is the following:
> > >
> > > Attempt a lockless/RCU lookup.
> > >
> > > 1) Socket is found, we are good to go. Fast path is still fast.
> > >
> > > 2) Socket is not found in ehash
> > > - We lock the bucket spinlock.
> > > - We retry the lookup
> > > - If socket found, continue with it (release the spinlock when
> > > appropriate, after all write manipulations in the bucket are done)
> > > - If the socket is still not found, we look up a listener.
> > > We insert a TCP_NEW_SYN_RECV ....
> > > Again, we release the spinlock when appropriate, after all
> > > write manipulations in the bucket are done.
> > >
> > > No more races, and the fast path is the same.
> >
> > I was looking into the issue this weekend. Is this what you were
> > thinking? I'm wondering if you were also thinking about other races,
> > like the found_dup_sk/own_req things, e.g. acquiring the ehash lock
> > when we start to process a reqsk?
>
> No.
>
> Idea is to hold the bucket lock and keep it until all write operations
> in the bucket have been done.

Thanks for the clarification! Now I understand why it takes time.

Actually I started down that path:

1. convert bool *refcounted to an enum (noref, refcnt, locked)
2. add a 2nd ehash lookup holding the lock, and mark it locked
3. release the lock when necessary

but I switched to this version because there are many callers of
__inet_lookup().
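
For reference, a rough sketch of that abandoned direction; the enum name
and values are hypothetical, and the signature mirrors the IPv4
__inet_lookup() from the diff, with bool *refcounted replaced:

enum inet_lookup_ref {
        INET_LOOKUP_NOREF,      /* listener: no refcount taken */
        INET_LOOKUP_REFCNT,     /* established: sock_hold() was taken */
        INET_LOOKUP_LOCKED,     /* found under the ehash bucket lock;
                                 * the caller must release that lock */
};

static inline struct sock *__inet_lookup(struct net *net,
                                         struct inet_hashinfo *hashinfo,
                                         struct sk_buff *skb, int doff,
                                         const __be32 saddr, const __be16 sport,
                                         const __be32 daddr, const __be16 dport,
                                         const int dif, const int sdif,
                                         enum inet_lookup_ref *ref)
{
        u16 hnum = ntohs(dport);
        struct sock *sk;

        /* 1st lookup: lockless RCU fast path, unchanged. */
        sk = __inet_lookup_established(net, hashinfo, saddr, sport,
                                       daddr, hnum, dif, sdif);
        if (sk) {
                *ref = INET_LOOKUP_REFCNT;
                return sk;
        }

        /* 2nd lookup under the bucket lock; the lock stays held so the
         * caller serializes with the hashdance and other bucket writers.
         */
        sk = __inet_lookup_established_lock(net, hashinfo, saddr, sport,
                                            daddr, hnum, dif, sdif);
        if (sk) {
                *ref = INET_LOOKUP_LOCKED;
                return sk;
        }

        *ref = INET_LOOKUP_NOREF;
        return __inet_lookup_listener(net, hashinfo, skb, doff, saddr,
                                      sport, daddr, hnum, dif, sdif);
}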


>
>
>
> >
> > Duan, could you test the diff below ?
> >
> > If this resolves the FIN issue, we can also revert 3f4ca5fafc08 ("tcp:
> > avoid the lookup process failing to get sk in ehash table").
> >
> > ---8<---
> > [diff snipped; identical to the diff quoted in full earlier in the thread]
> > ---8<---

2023-06-20 04:24:22

by Duan,Muquan

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

Hi, Eric,

Thanks for your comments!

Why not speak of the FIN:
In the current implementation the hashdance can be done in state FIN_WAIT2, so it may race with the ehash lookup of the passive closer's FIN. My new patch 3 defers the tw hashdance until the passive closer's FIN is received (real TIME_WAIT), so this race no longer exists and the 'connection refused' issue will not occur; that is why I did not speak of the FIN again with the new patch.

Why speak of the SYN:
Theoretically a new SYN may race with the hashdance with or without my patch, resulting in a retransmission of it. But it is almost impossible to get the new SYN before the hashdance; there is one RTT between the passive closer's FIN and the new SYN. The probability is too small and may never occur in practice. I never met this issue in my reproducing environment.
For the connection reuse issue I met, I think no tricks are needed with patch 3.

About introducing the lock:
If we really need to introduce the lock, I think that besides protecting the list in the ehash bucket, we also need to protect the sock while consuming the packet. After we find the sock and start consuming the packet, we can hit sock-level races across CPUs; for example, the passive closer's FIN arrives too fast and finds the original sock before the hashdance begins, then the FIN may be dropped in further processing if the sock is destroyed on another CPU after the hashdance.

I took a look at FreeBSD: it uses a hash table lock and a per-sock lock. It also needs some tricks to retry in some cases, for example when the sock is dropped by another thread while waiting for the per-sock lock during the lookup:
        /*
         * While waiting for inp lock during the lookup, another thread
         * can have dropped the inpcb, in which case we need to loop back
         * and try to find a new inpcb to deliver to.
         */
        if (inp->inp_flags & INP_DROPPED) {
                INP_WUNLOCK(inp);
                inp = NULL;
                goto findpcb;
        }


I worry that locking the ehash bucket and the sock will introduce performance and scaling issues, especially when there are a large number of socks or many short connections. I can take some time out to deal with this issue in the following few weeks; I will try to introduce the lock and write some test programs to evaluate the performance hit.
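
For illustration only, here is a minimal sketch of the kind of test
client meant above, following the reproduction steps already described
in this thread: bind the same source port every time, connect over
loopback, read the server's message, then close and repeat. The
addresses and ports are made up.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        struct sockaddr_in src = { .sin_family = AF_INET,
                                   .sin_port = htons(40000) }; /* fixed source port */
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port = htons(8080) };  /* made-up server port */
        char buf[64];
        int one = 1;

        inet_pton(AF_INET, "127.0.0.1", &src.sin_addr);
        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

        for (;;) {
                int fd = socket(AF_INET, SOCK_STREAM, 0);

                if (fd < 0)
                        return 1;
                /* Allow rebinding the same source port immediately. */
                setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
                if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
                    connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
                        perror("connect");      /* ECONNREFUSED shows up here */
                        close(fd);
                        return 1;
                }
                while (read(fd, buf, sizeof(buf)) > 0)
                        ;       /* drain until the server closes first */
                close(fd);
        }
}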

Regards!
Duanmuquan

2023-06-20 08:57:20

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

On Tue, Jun 20, 2023 at 5:30 AM Duan,Muquan <[email protected]> wrote:
>
> Hi, Eric,
>
> Thanks for your comments!
>
> Why not speak of the FIN:
> In the current implementation the hashdance can be done in state FIN_WAIT2, so it may race with the ehash lookup of the passive closer's FIN. My new patch 3 defers the tw hashdance until the passive closer's FIN is received (real TIME_WAIT), so this race no longer exists and the 'connection refused' issue will not occur; that is why I did not speak of the FIN again with the new patch.
>
> [...] for example, the passive closer's FIN arrives too fast and finds
> the original sock before the hashdance begins, then the FIN may be
> dropped in further processing if the sock is destroyed on another CPU
> after the hashdance.
>
> I took a look at FreeBSD: it uses a hash table lock and a per-sock lock. It also needs some tricks to retry in some cases, for example when the sock is dropped by another thread while waiting for the per-sock lock during the lookup:
>
>         /*
>          * While waiting for inp lock during the lookup, another thread
>          * can have dropped the inpcb, in which case we need to loop back
>          * and try to find a new inpcb to deliver to.
>          */
>         if (inp->inp_flags & INP_DROPPED) {
>                 INP_WUNLOCK(inp);
>                 inp = NULL;
>                 goto findpcb;
>         }
>

This is the last time you bring FreeBSD code here.

We do not copy FreeBSD code for obvious reasons.
I never looked at FreeBSD code and never will.

Stop this, please, or I will ignore your future emails.

2023-06-20 10:44:56

by Duan,Muquan

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

Hi, Eric,

Thanks for your reply!


>> Why not speak of the FIN:
>> In the current implementation the hashdance can be done in state FIN_WAIT2, so it may race with the ehash lookup of the passive closer's FIN. My new patch 3 defers the tw hashdance until the passive closer's FIN is received (real TIME_WAIT), so this race no longer exists and the 'connection refused' issue will not occur; that is why I did not speak of the FIN again with the new patch.
>>
>> [...] for example, the passive closer's FIN arrives too fast and finds
>> the original sock before the hashdance begins, then the FIN may be
>> dropped in further processing if the sock is destroyed on another CPU
>> after the hashdance.

With my patch 3, the passive closer's FIN will find the original sock, because the hashdance is not done before the FIN is received.
After the hashdance, the tw sock's state is already set to TIME_WAIT, so it can accept a new SYN with the same 4-tuple.
If the original sock is destroyed on another CPU, or the FIN is dropped after the hashdance, it will not affect the tw sock.
I am not sure whether I have understood your point correctly?


>>
>> I took a look at FreeBSD: it uses a hash table lock and a per-sock lock. It also needs some tricks to retry in some cases, for example when the sock is dropped by another thread while waiting for the per-sock lock during the lookup:
>>
>>         /*
>>          * While waiting for inp lock during the lookup, another thread
>>          * can have dropped the inpcb, in which case we need to loop back
>>          * and try to find a new inpcb to deliver to.
>>          */
>>         if (inp->inp_flags & INP_DROPPED) {
>>                 INP_WUNLOCK(inp);
>>                 inp = NULL;
>>                 goto findpcb;
>>         }
>>
>
> This is the last time you bring FreeBSD code here.
>
> We do not copy FreeBSD code for obvious reasons.
> I never looked at FreeBSD code and never will.
>
> Stop this, please, or I will ignore your future emails.


I am very sorry, I will not do this again.

Regards!
Duanmuquan
