From: Kuniyuki Iwashima
To: "David S . Miller", Jakub Kicinski, Eric Dumazet, Alexei Starovoitov,
    Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau
CC: Benjamin Herrenschmidt, Kuniyuki Iwashima
Subject: [PATCH v4 bpf-next 07/11] tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
Date: Tue, 27 Apr 2021 12:46:19 +0900
Message-ID: <20210427034623.46528-8-kuniyu@amazon.co.jp>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20210427034623.46528-1-kuniyu@amazon.co.jp>
References: <20210427034623.46528-1-kuniyu@amazon.co.jp>
X-Mailing-List: linux-kernel@vger.kernel.org

This patch also changes the code to call reuseport_migrate_sock() and
reqsk_clone(), but unlike the other cases, we do not call reqsk_clone()
right after reuseport_migrate_sock().

Currently, in the receive path for TCP_NEW_SYN_RECV sockets, the listener
holds three kinds of refcnt:

  (A) for the listener itself
  (B) carried by request_sock
  (C) sock_hold() in tcp_v[46]_rcv()

While processing the req, (A) may disappear via close(listener). Also, (B)
can disappear via accept(listener) once we put the req into the accept
queue. So, we have to hold another refcnt (C) for the listener to prevent
use-after-free.

For socket migration, we call reuseport_migrate_sock() to select a listener
with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
Thus we have to take another refcnt (B) for the newly cloned request_sock.

In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
try to put the new req into the accept queue. By migrating the req after
winning the "own_req" race, we can avoid the following worst-case scenario:

  CPU 1 looks up req1
  CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
  CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
  ...

Signed-off-by: Kuniyuki Iwashima
---
 net/ipv4/inet_connection_sock.c | 30 +++++++++++++++++++++++++++++-
 net/ipv4/tcp_ipv4.c             | 20 ++++++++++++++------
 net/ipv6/tcp_ipv6.c             | 14 +++++++++++---
 3 files changed, 54 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index dc984d1f352e..2f1e5897137b 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -1072,10 +1072,38 @@ struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
 	if (own_req) {
 		inet_csk_reqsk_queue_drop(sk, req);
 		reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req);
-		if (inet_csk_reqsk_queue_add(sk, req, child))
+
+		if (sk != req->rsk_listener) {
+			/* another listening sk has been selected,
+			 * migrate the req to it.
+			 */
+			struct request_sock *nreq;
+
+			/* hold a refcnt for the nreq->rsk_listener
+			 * which is assigned in reqsk_clone()
+			 */
+			sock_hold(sk);
+			nreq = reqsk_clone(req, sk);
+			if (!nreq) {
+				inet_child_forget(sk, req, child);
+				goto child_put;
+			}
+
+			refcount_set(&nreq->rsk_refcnt, 1);
+			if (inet_csk_reqsk_queue_add(sk, nreq, child)) {
+				reqsk_migrate_reset(req);
+				reqsk_put(req);
+				return child;
+			}
+
+			reqsk_migrate_reset(nreq);
+			__reqsk_free(nreq);
+		} else if (inet_csk_reqsk_queue_add(sk, req, child)) {
 			return child;
+		}
 	}
 	/* Too bad, another child took ownership of the request, undo. */
+child_put:
 	bh_unlock_sock(child);
 	sock_put(child);
 	return NULL;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 312184cead57..214495d02143 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2000,13 +2000,21 @@ int tcp_v4_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_migrate_sock(sk, req_to_sk(req), skb);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			sk = nsk;
+			/* reuseport_migrate_sock() has already held one sk_refcnt
+			 * before returning.
+			 */
+		} else {
+			/* We own a reference on the listener, increase it again
+			 * as we might lose it too soon.
+			 */
+			sock_hold(sk);
 		}
-		/* We own a reference on the listener, increase it again
-		 * as we might lose it too soon.
-		 */
-		sock_hold(sk);
 		refcounted = true;
 		nsk = NULL;
 		if (!tcp_filter(sk, skb)) {
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5f47c0b6e3de..aea8e75d3fed 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1663,10 +1663,18 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_migrate_sock(sk, req_to_sk(req), skb);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			sk = nsk;
+			/* reuseport_migrate_sock() has already held one sk_refcnt
+			 * before returning.
+			 */
+		} else {
+			sock_hold(sk);
 		}
-		sock_hold(sk);
 		refcounted = true;
 		nsk = NULL;
 		if (!tcp_filter(sk, skb)) {
-- 
2.30.2
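
Illustration only, not part of the patch: the (A)/(B)/(C) refcount
accounting described in the commit message can be traced with a toy
userspace model. This is a rough sketch under simplifying assumptions.
Every toy_* name is hypothetical, the structs are not the kernel's
struct sock or struct request_sock, plain ints replace refcount_t, and
dropping the old req is collapsed into a single step instead of going
through reqsk_migrate_reset() and reqsk_put().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for a listener and a request; not kernel types. */
struct toy_sock {
	int refcnt;
	bool listening;
};

struct toy_req {
	int refcnt;
	struct toy_sock *listener;	/* the req carries refcnt (B) on this */
};

void toy_sock_hold(struct toy_sock *sk)
{
	sk->refcnt++;
}

void toy_sock_put(struct toy_sock *sk)
{
	assert(sk->refcnt > 0);
	sk->refcnt--;
}

/* Models the role of reuseport_migrate_sock(): if a live listener is
 * available, take refcnt (C) on it before returning it.
 */
struct toy_sock *toy_migrate_sock(struct toy_sock *candidate)
{
	if (!candidate || !candidate->listening)
		return NULL;
	toy_sock_hold(candidate);		/* (C) */
	return candidate;
}

/* Models the migration branch of inet_csk_complete_hashdance(): take (B)
 * for the cloned req, then release the old req and its (B) on the old
 * listener. Simplified: the clone never fails here.
 */
void toy_hashdance_migrate(struct toy_sock *sk, struct toy_req *req,
			   struct toy_req *nreq)
{
	if (sk == req->listener)
		return;				/* nothing to migrate */

	toy_sock_hold(sk);			/* (B) for nreq->listener */
	nreq->listener = sk;
	nreq->refcnt = 1;

	toy_sock_put(req->listener);		/* old (B) goes away */
	req->listener = NULL;
	req->refcnt--;
}

/* Walk one migration; returns the new listener's final refcnt and stores
 * the old listener's final refcnt in *old_refcnt.
 */
int toy_demo(int *old_refcnt)
{
	struct toy_sock old_l = { .refcnt = 1, .listening = false }; /* (B) only */
	struct toy_sock new_l = { .refcnt = 1, .listening = true };  /* (A) */
	struct toy_req req = { .refcnt = 1, .listener = &old_l };
	struct toy_req nreq = { .refcnt = 0, .listener = NULL };
	struct toy_sock *nsk;

	nsk = toy_migrate_sock(&new_l);		/* new_l: (A) + (C) */
	assert(nsk);
	toy_hashdance_migrate(nsk, &req, &nreq);	/* new_l: (A) + (B) + (C) */
	toy_sock_put(nsk);			/* receive path drops (C) */

	*old_refcnt = old_l.refcnt;
	return new_l.refcnt;			/* (A) + (B) remain */
}
```

In this model the new listener ends up holding exactly (A) and (B) once
the receive path drops (C), which is why the patch takes a separate
sock_hold() for the cloned req rather than reusing the reference
returned by reuseport_migrate_sock().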