Date: Tue, 28 May 2024 17:45:16 +0800
From: Jingbo Xu
To: Christian Brauner
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, winters.zc@antgroup.com
Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20240524064030.4944-1-jefflexu@linux.alibaba.com>
 <20240528-jucken-inkonsequent-60b0a15d7ede@brauner>
In-Reply-To: <20240528-jucken-inkonsequent-60b0a15d7ede@brauner>

Hi, Christian,

Thanks for the review.

On 5/28/24 4:38 PM, Christian Brauner wrote:
> On Fri, May 24, 2024 at 02:40:28PM +0800, Jingbo Xu wrote:
>> Background
>> ==========
>> The fd of '/dev/fuse' serves as a message transmission channel between
>> FUSE filesystem (kernel space) and fuse server (user space). Once the
>> fd gets closed (intentionally or unintentionally), the FUSE filesystem
>> gets aborted, and any attempt of filesystem access gets -ECONNABORTED
>> error until the FUSE filesystem finally umounted.
>>
>> It is one of the requisites in production environment to provide
>> uninterruptible filesystem service. The most straightforward way, and
>> maybe the most widely used way, is that make another dedicated user
>> daemon (similar to systemd fdstore) keep the device fd open.
>> When the fuse daemon recovers from a crash, it can retrieve the device
>> fd from the fdstore daemon through socket takeover (Unix domain socket)
>> method [1] or pidfd_getfd() syscall [2]. In this way, as long as the
>> fdstore daemon doesn't exit, the FUSE filesystem won't get aborted once
>> the fuse daemon crashes, though the filesystem service may hang there
>> for a while when the fuse daemon gets restarted and has not been
>> completely recovered yet.
>>
>> This picture indeed works and has been deployed in our internal
>> production environment until the following issues are encountered:
>>
>> 1. The fdstore daemon may be killed by mistake, in which case the FUSE
>> filesystem gets aborted and irrecoverable.
>
> That's only a problem if you use the fdstore of the per-user instance.
> The main fdstore is part of PID 1 and you can't kill that. So really,
> systemd needs to hand the fds from the per-user instance to the main
> fdstore.

Systemd has indeed implemented its own fdstore mechanism in user space.

However, more and more fuse daemons now run inside containers, and a
container generally has no systemd inside it.

>
>> 2. In scenarios of containerized deployment, the fuse daemon is deployed
>> in a container POD, and a dedicated fdstore daemon needs to be deployed
>> for each fuse daemon. The fdstore daemon could consume a amount of
>> resources (e.g. memory footprint), which is not conducive to the dense
>> container deployment.
>>
>> 3. Each fuse daemon implementation needs to implement its own fdstore
>> daemon. If we implement the fuse recovery mechanism on the kernel side,
>> all fuse daemon implementations could reuse this mechanism.
>
> You can just the global fdstore. That is a design limitation not an
> inherent limitation.

What I initially meant is that each fuse daemon implementation (e.g.
s3fs, ossfs, and other vendors) needs to build its own, but similar,
mechanism for daemon failover.
In container scenarios there is no common fdstore component comparable
to the systemd fdstore.

I admit that implementing a kernel-side fdstore is controversial, so
this RFC only implements a failover mechanism for the fuse server. But I
also understand Miklos's concern, since all we really need to support
daemon failover is something like an fdstore that keeps the device fd
alive.

-- 
Thanks,
Jingbo
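
For readers less familiar with the user-space fdstore approach discussed
in this thread, the socket-takeover step might look roughly like the
sketch below. It is only an illustration under stated assumptions and is
not code from the RFC patches: a pipe stands in for the '/dev/fuse' fd,
both "daemons" live in one process, and the helper names send_fd(),
recv_fd() and demo() are hypothetical. It relies on socket.send_fds() and
socket.recv_fds() (Python 3.9+, Unix only), which pass descriptors as
SCM_RIGHTS ancillary data.

```python
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    """Pass one open fd to the peer as SCM_RIGHTS ancillary data."""
    socket.send_fds(sock, [b"fd"], [fd])


def recv_fd(sock: socket.socket) -> int:
    """Receive one fd from the peer; the kernel installs a new descriptor."""
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]


def demo() -> bytes:
    # Assumption: the read end of a pipe stands in for the '/dev/fuse' fd.
    dev_fd, kernel_side = os.pipe()
    # Unix domain socket pair linking the fuse daemon and the fdstore daemon.
    daemon_sock, store_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # 1. The fuse daemon hands a reference to the fdstore daemon.
        send_fd(daemon_sock, dev_fd)
        stored_fd = recv_fd(store_sock)

        # 2. The fuse daemon "crashes": its descriptor is closed, but the open
        #    file description survives because stored_fd still refers to it.
        os.close(dev_fd)
        os.write(kernel_side, b"ping")  # a request queued while no daemon runs

        # 3. The restarted daemon retrieves the fd and resumes reading.
        send_fd(store_sock, stored_fd)
        recovered_fd = recv_fd(daemon_sock)
        data = os.read(recovered_fd, 4)
        os.close(recovered_fd)
        os.close(stored_fd)
        return data
    finally:
        os.close(kernel_side)
        daemon_sock.close()
        store_sock.close()


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is step 2: as long as some process still holds a
descriptor for the same open file description, closing the daemon's copy
does not tear the channel down. That is the property the fdstore daemon
provides, and what is lost if the fdstore daemon itself is killed.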