From: Eric Dumazet
Date: Tue, 6 Jun 2023 09:07:30 +0200
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.
To: Duan Muquan
Cc: davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org, linux-kernel@vger.kernel.org

On Tue, Jun 6, 2023 at 8:43 AM Duan Muquan wrote:
>
> If the FIN from the passive closer and the ACK for the active closer's
> FIN are processed concurrently on different CPUs, a tw hashdance race
> may occur. On the loopback interface, the transmit function queues an
> skb to the current CPU's softnet input queue by default. Suppose the
> active closer runs on CPU 0 and the passive closer runs on CPU 1. If
> the ACK for the active closer's FIN is sent without delay, it is
> processed and the tw hashdance is done on CPU 0; the passive closer's
> FIN is sent in another segment and processed on CPU 1, where it may
> fail to find the tw sock in the ehash table because of the tw hashdance
> on CPU 0, and then gets a RESET. If the application reconnects
> immediately with the same source port, it gets reset because the tw
> sock's tw_substate is still TCP_FIN_WAIT2.
>
> The dmesg used to trace down this issue:
>
> .333516] tcp_send_fin: sk 0000000092105ad2 cookie 9 cpu 3
> .333524] rcv_state_process:FIN_WAIT2 sk 0000000092105ad2 cookie 9 cpu 3
> .333534] tcp_close: tcp_time_wait: sk 0000000092105ad2 cookie 9 cpu 3
> .333538] hashdance: tw 00000000690fdb7a added to ehash cookie 9 cpu 3
> .333541] hashdance: sk 0000000092105ad2 removed cookie 9 cpu 3
> .333544] __inet_lookup_established: Failed the refcount check:
>          !refcount_inc_not_zero 00000000690fdb7a ref 0 cookie 9 cpu 0
> .333549] hashdance: tw 00000000690fdb7a before add ref 0 cookie 9 cpu 3
> .333552] rcv_state: RST for FIN listen 000000003c50afa6 cookie 0 cpu 0
> .333574] tcp_send_fin: sk 0000000066757bf8 ref 2 cookie 0 cpu 0
> .333611] timewait_state: TCP_TW_RST tw 00000000690fdb7a cookie 9 cpu 0
> .333626] tcp_connect: sk 0000000066757bf8 cpu 0 cookie 0
>
> Here is the call trace map:
>
>   CPU 0                                   CPU 1
>   --------                                --------
>   tcp_close()
>     tcp_send_fin()
>       loopback_xmit()
>         netif_rx()
>           tcp_v4_rcv()
>             tcp_ack_snd_check()
>               loopback_xmit()
>                 netif_rx()              tcp_close()
>   ...                                     tcp_send_fin()
>                                             loopback_xmit()
>                                               netif_rx()
>                                                 tcp_v4_rcv()
>                                                 ...
>   tcp_time_wait()
>   inet_twsk_hashdance() {
>     ...
>                                         <-__inet_lookup_established()
>                                           (bad case (1): find sk, may
>                                            fail tw_refcnt check)
>     inet_twsk_add_node_tail_rcu(tw, ...)
>                                         <-__inet_lookup_established()
>                                           (bad case (1): find sk, may
>                                            fail tw_refcnt check)
>     __sk_nulls_del_node_init_rcu(sk)
>                                         <-__inet_lookup_established()
>                                           (bad case (2): find tw, may
>                                            fail tw_refcnt check)
>     refcount_set(&tw->tw_refcnt, 3)
>                                         <-__inet_lookup_established()
>                                           (good case: find tw,
>                                            tw_refcnt is not 0)
>     ...
>   }
>
> This issue occurs with small probability in our application running on
> the loopback interface; the client gets a connection refused error when
> it reconnects.
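
A note for readers following the trace map: the lockless ehash lookup
validates every candidate it finds with refcount_inc_not_zero(), which
is exactly the check that fails in the bad cases labelled above. Here
is a minimal userspace sketch of that validation pattern (structure and
names simplified from __inet_lookup_established; an illustration, not
the kernel code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
        struct node *next;
        atomic_int refcnt;      /* 0 while being installed or dying */
        int key;
};

/* Take a reference only if the object is still live (refcnt > 0). */
static bool refcount_inc_not_zero(atomic_int *r)
{
        int v = atomic_load_explicit(r, memory_order_relaxed);

        while (v != 0)
                if (atomic_compare_exchange_weak_explicit(r, &v, v + 1,
                                memory_order_acquire, memory_order_relaxed))
                        return true;
        return false;
}

static struct node *lookup(struct node *head, int key)
{
        for (struct node *n = head; n; n = n->next) {
                if (n->key != key)
                        continue;
                /* Bad case (1): n matches but was destroyed, refcnt is
                 * already 0.  Bad case (2): n is the tw sock, published
                 * with refcnt still 0.  Both fail here, the walk moves
                 * on, and the caller falls back to a listener sock.   */
                if (!refcount_inc_not_zero(&n->refcnt))
                        continue;
                if (n->key != key) {    /* re-check: slot may be reused */
                        atomic_fetch_sub_explicit(&n->refcnt, 1,
                                                  memory_order_release);
                        continue;
                }
                return n;               /* good case */
        }
        return NULL;
}
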
> In reproducing environments on kernels 4.14, 5.10 and 6.4-rc1, modify
> tcp_ack_snd_check() to disable delayed ACKs all the time. Let the
> client connect to the server; the server sends a message to the client
> and then closes the connection; repeat this process forever, and let
> the client bind the same source port every time. The issue can be
> reproduced in about 20 minutes.
>
> Brief of the scenario:
>
> 1. Server runs on CPU 0 and client runs on CPU 1. Server closes the
>    connection actively and sends a FIN to the client. The loopback
>    driver enqueues the FIN segment to the backlog queue of CPU 0 via
>    loopback_xmit()->netif_rx(), one of the conditions for a
>    non-delayed ACK is met in __tcp_ack_snd_check(), and the ACK is
>    sent immediately.
>
> 2. On the loopback interface, the ACK is received and processed on
>    CPU 0. The 'dance' from the original sock to the tw sock is
>    performed: the tw sock is inserted into the ehash table, then the
>    original sock is removed.
>
> 3. On CPU 1, the client closes the connection; a FIN segment is sent
>    and processed on CPU 1. When it looks up the sock in the ehash
>    table (with no lock), the tw hashdance race may occur: the lookup
>    fails to find the tw sock and gets a listener sock in the following
>    3 cases:
>
>    (1) The original sock is found, but it has been destroyed and its
>        sk_refcnt has become 0 when validating it.
>    (2) The tw sock is found, but its tw_refcnt has not yet been set to
>        3; it is still 0, so validating sk_refcnt fails.
>    (3) For versions without Eric and Jason's commit
>        3f4ca5fafc08881d7a57daa20449d171f2887043, the tw sock is added
>        to the head of the list. It is missed if the list was traversed
>        before the tw sock was added, and if the original sock is
>        removed before it is found, no established sock is found at
>        all.
>
>    The listener sock resets the FIN segment, which has the ACK bit
>    set.
>
> 4. If the client reconnects immediately and is assigned the same
>    source port as the previous connection, the tw sock, whose
>    tw_substate is still TCP_FIN_WAIT2, resets the client's SYN and
>    destroys itself in inet_twsk_deschedule_put(). The application gets
>    a connection refused error.
>
> 5. If the client reconnects again, it succeeds.
>
> Introduce the following 2 modifications to solve the above 3 bad
> cases:
>
> For bad case (2): set tw_refcnt to 3 before adding the tw sock to the
> list.
>
> For bad case (1): in tcp_v4_rcv(), if __inet_lookup_skb() returns a
> listener sock and this segment has the FIN bit set, retry the lookup
> once. This fix also covers bad case (3) for the versions without Eric
> and Jason's fix.
>
> There may be another bad case: the original sock is found and passes
> validation, but while the passive closer's FIN is further processed on
> CPU 1, the sock is destroyed on CPU 0, so the FIN segment is dropped
> and retransmitted. This case does not hurt the application as much as
> resetting the reconnection, and it is less likely than the other bad
> cases; it has not occurred on our product or in the experimental
> environment, so it is not considered in this patch.
>
> Could you please check whether this fix is OK, or suggest any
> alternatives? Looking forward to your comments!
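
For reference, a client loop along the lines of the reproduction steps
above might look like this (a sketch under the stated setup; the port
numbers and the assumption that the server sends one message and then
closes are placeholders, not taken from the patch):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define SERVER_PORT 5555        /* placeholder */
#define CLIENT_PORT 6666        /* same source port every iteration */

int main(void)
{
        char buf[128];

        for (;;) {
                struct sockaddr_in cli = { 0 }, srv = { 0 };
                int one = 1;
                int fd = socket(AF_INET, SOCK_STREAM, 0);

                if (fd < 0)
                        return 1;

                /* Reuse the local address so the fixed source port can
                 * be bound again immediately after close().          */
                setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

                cli.sin_family = AF_INET;
                cli.sin_port = htons(CLIENT_PORT);
                cli.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
                if (bind(fd, (struct sockaddr *)&cli, sizeof(cli)) < 0) {
                        perror("bind");
                        close(fd);
                        continue;
                }

                srv.sin_family = AF_INET;
                srv.sin_port = htons(SERVER_PORT);
                srv.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
                if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0)
                        perror("connect"); /* ECONNREFUSED when the race hits */
                else
                        while (read(fd, buf, sizeof(buf)) > 0)
                                ;          /* drain until the server's FIN */
                close(fd);
        }
}
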
>
> Signed-off-by: Duan Muquan
> ---
>  net/ipv4/inet_timewait_sock.c | 15 +++++++--------
>  net/ipv4/tcp_ipv4.c           | 13 +++++++++++++
>  2 files changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 40052414c7c7..ed1f255c9aa8 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -144,14 +144,6 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
>
>         spin_lock(lock);
>
> -       inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> -
> -       /* Step 3: Remove SK from hash chain */
> -       if (__sk_nulls_del_node_init_rcu(sk))
> -               sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> -
> -       spin_unlock(lock);
> -
>         /* tw_refcnt is set to 3 because we have :
>          * - one reference for bhash chain.
>          * - one reference for ehash chain.
> @@ -162,6 +154,13 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
>          * so we are not allowed to use tw anymore.
>          */
>         refcount_set(&tw->tw_refcnt, 3);
> +       inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> +
> +       /* Step 3: Remove SK from hash chain */
> +       if (__sk_nulls_del_node_init_rcu(sk))
> +               sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> +
> +       spin_unlock(lock);
>  }
>  EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 06d2573685ca..3e3cef202f76 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2018,6 +2018,19 @@ int tcp_v4_rcv(struct sk_buff *skb)
>         sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
>                                skb, __tcp_hdrlen(th), th->source,
>                                th->dest, sdif, &refcounted);
> +
> +       /* If the tw "dance" is performed on another CPU, the lookup may
> +        * find no tw sock for the passive closer's FIN segment, but a
> +        * listener sock, which will reset the FIN segment. If the
> +        * application reconnects immediately with the same source port,
> +        * it will get reset because the tw sock's tw_substate is still
> +        * TCP_FIN_WAIT2. Retry the lookup to try to get the tw sock.
> +        */
> +       if (unlikely(th->fin && sk && sk->sk_state == TCP_LISTEN)) {
> +               sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
> +                                      skb, __tcp_hdrlen(th), th->source,
> +                                      th->dest, sdif, &refcounted);
> +       }
> +

I do not think this fixes anything; there is no barrier between the
first and second lookup. This might reduce the race a little bit, but
at the expense of extra code in the fast path.

If you want to fix this properly, I think we need to revisit the
handling of non-established sockets, and hold the bucket spinlock over
the whole thing. This will be slightly more complex than this...
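
Separately from the double-lookup hunk that Eric objects to, the first
hunk addresses a classic publish-before-initialize window, and that
part can be demonstrated in userspace. Here is a sketch with plain C11
atomics (an editorial analogy, not kernel code; the writer mimics the
pre-patch hashdance order):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static struct node {
        atomic_int refcnt;
} tw;

static _Atomic(struct node *) chain;    /* stands in for the ehash chain */

static void *writer(void *arg)
{
        /* Pre-patch order: publish first, set the refcount afterwards.
         * The patch closes this window by setting the refcount before
         * inet_twsk_add_node_tail_rcu() publishes the tw sock.        */
        atomic_store(&chain, &tw);
        atomic_store(&tw.refcnt, 3);
        return arg;
}

static void *reader(void *arg)
{
        struct node *n = atomic_load(&chain);

        /* A lookup can observe the node while refcnt is still 0; this
         * is bad case (2): "find tw, may fail tw_refcnt check".       */
        if (n && atomic_load(&n->refcnt) == 0)
                fprintf(stderr, "found node with refcnt 0 (bad case 2)\n");
        return arg;
}

int main(void)
{
        for (int i = 0; i < 100000; i++) {
                pthread_t w, r;

                atomic_store(&chain, NULL);
                atomic_store(&tw.refcnt, 0);
                pthread_create(&w, NULL, writer, NULL);
                pthread_create(&r, NULL, reader, NULL);
                pthread_join(w, NULL);
                pthread_join(r, NULL);
        }
        return 0;
}

With the stores swapped to the patched order, no interleaving lets the
reader see a published node with a zero refcount, which is why that
hunk stands on its own regardless of how the lookup side is fixed.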