From: Kuniyuki Iwashima
Subject: Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.
Date: Thu, 29 Apr 2021 00:52:03 +0900
Message-ID: <20210428155203.39974-1-kuniyu@amazon.co.jp>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Jason Baron
Date: Wed, 28 Apr 2021 10:44:12 -0400
> On 4/28/21 4:13 AM, Kuniyuki Iwashima wrote:
> > From: Jason Baron
> > Date: Tue, 27 Apr 2021 12:38:58 -0400
> >> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> >>> The SO_REUSEPORT option allows sockets to listen on the same port and to
> >>> accept connections evenly. However, there is a defect in the current
> >>> implementation [1]. When a SYN packet is received, the connection is tied
> >>> to a listening socket. Accordingly, when the listener is closed, in-flight
> >>> requests during the three-way handshake and child sockets in the accept
> >>> queue are dropped even if other listeners on the same port could accept
> >>> such connections.
> >>>
> >>> This situation can happen when various server management tools restart
> >>> server (such as nginx) processes.
> >>> For instance, when we change nginx
> >>> configurations and restart it, it spins up new workers that respect the new
> >>> configuration and closes all listeners on the old workers, resulting in the
> >>> in-flight ACK of 3WHS is responded by RST.
> >>
> >> Hi Kuniyuki,
> >>
> >> I had implemented a different approach to this that I wanted to get your
> >> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> >> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> >> that you want to replace with a 'new' webserver, you would need a separate
> >> process to receive the listen fd and then have that process send the fd to
> >> the new webserver, if they are not running con-currently. So instead what
> >> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> >>
> >> 1) bind unix socket with path '/sockets'
> >> 2) sendmsg() the listen fd via the unix socket
> >> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> >> 3) exit/close the old webserver and the listen socket
> >> 4) start the new webserver
> >> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> >> 6) recvmsg() the listen fd
> >>
> >> So the idea is that we set a timeout on the unix socket. If the new process
> >> does not start and bind to the unix socket, it simply closes, thus releasing
> >> the listen socket. However, if it does bind it can now call recvmsg() and
> >> use the listen fd as normal. It can then simply continue to use the old listen
> >> fds and/or create new ones and drain the old ones.
> >>
> >> Thus, the old and new webservers do not have to run concurrently. This doesn't
> >> involve any changes to the tcp layer and can be used to pass any type of fd.
> >> not sure if it's actually useful for anything else though.
> >>
> >> I'm not sure if this solves your use-case or not but I thought I'd share it.
> >> One can also inherit the fds like in systemd's socket activation model, but
> >> that again requires another process to hold open the listen fd.
> >
> > Thank you for sharing code.
> >
> > It seems bit more crash-tolerant than normal fd passing, but it can still
> > suffer if the process dies before passing fds. With this patch set, we can
> > migrate children sockets even if the process dies.
> >
>
> I don't think crashing should be much of an issue. The old server can setup the
> unix socket patch '/sockets' when it starts up and queue the listen sockets
> there from the start. When it dies it will close all its fds, and the new
> server can pick anything up any fds that are in the '/sockets' queue.
>
>
> > Also, as Martin said, fd passing tends to make application complicated.
> >
>
> It may be but perhaps its more flexible? It gives the new server the
> chance to re-use the existing listen fds, close, drain and/or start new
> ones. It also addresses the non-REUSEPORT case where you can't bind right
> away.

If the flexibility were really worth the complexity, we would not mind the
complexity. But SO_REUSEPORT can give us the flexibility we want.

With socket migration, there is no need to reuse the listener (fd passing)
or to drain children (incoming connections are automatically migrated if
another listener is already bind()ed), and of course another listener can
close itself and its migrated children.

If two different approaches resolve the same issue and one does not need
complexity in userspace, we select the simpler one.
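
To make the userspace difference concrete, here is a minimal Python sketch
of the two approaches as seen from the application. It is only an
illustration, not code from the patch set or from the 'delayed close'
proposal: the helper names (reuseport_listener, pass_listener,
receive_listener) are hypothetical, a socketpair stands in for the unix
socket bound at '/sockets', and the kernel-side migration and the proposed
timeout are not modelled. It assumes Linux and Python 3.9+ (for
socket.send_fds/recv_fds).

```python
import socket


def reuseport_listener(port: int = 0) -> socket.socket:
    """SO_REUSEPORT approach: every server just opens its own listener."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() on every socket sharing the port.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(16)
    return s


def pass_listener(listener: socket.socket, channel: socket.socket) -> None:
    """fd-passing approach, sender side: SCM_RIGHTS over a unix socket."""
    # One byte of payload is sent along with the ancillary fd.
    socket.send_fds(channel, [b"L"], [listener.fileno()])


def receive_listener(channel: socket.socket) -> socket.socket:
    """fd-passing approach, receiver side: rebuild a socket from the fd."""
    _data, fds, _flags, _addr = socket.recv_fds(channel, 1, 1)
    return socket.socket(fileno=fds[0])


if __name__ == "__main__":
    # SO_REUSEPORT: old and new listeners coexist on the same port.
    old = reuseport_listener()
    port = old.getsockname()[1]
    new = reuseport_listener(port)
    print("both listening on port", port)

    # fd passing: the new side needs a channel and a receive step.
    tx, rx = socket.socketpair()  # stand-in for the '/sockets' unix socket
    pass_listener(old, tx)
    inherited = receive_listener(rx)
    print("inherited listener on port", inherited.getsockname()[1])
```

With SO_REUSEPORT the new server only calls bind() and listen(); with fd
passing, both sides need an agreed channel and an explicit hand-off, which
is the userspace complexity discussed above.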