From: "NeilBrown"
To: "Josef Bacik"
Cc: "Christoph Hellwig", "J. Bruce Fields", "Chuck Lever", "Chris Mason",
    "David Sterba", linux-nfs@vger.kernel.org, "Wang Yugui",
    "Ulli Horlacher", linux-btrfs@vger.kernel.org
Subject: Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
Date: Fri, 16 Jul 2021 08:37:07 +1000
Message-id: <162638862766.13764.8566962032225976326@noble.neil.brown.name>
References: <20210613115313.BC59.409509F4@e16-tech.com>,
    <20210310074620.GA2158@tik.uni-stuttgart.de>,
    <162632387205.13764.6196748476850020429@noble.neil.brown.name>,
    <28bb883d-8d14-f11a-b37f-d8e71118f87f@toxicpanda.com>
X-Mailing-List: linux-nfs@vger.kernel.org

On Fri, 16 Jul 2021, Josef Bacik wrote:
> On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> > On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> >> Because there's no alternative.  We need a way to tell userspace they've
> >> wandered into a different inode namespace.
> >> There's no argument that what
> >> we're doing is ugly, but there's never been a clear "do X instead".  Just a
> >> lot of whinging that btrfs is broken.  This makes userspace happy and is
> >> simple and straightforward.  I'm open to alternatives, but there have been 0
> >> workable alternatives proposed in the last decade of complaining about it.
> >
> > Make sure we cross a vfsmount when crossing the "st_dev" domain so
> > that it is properly reported.  Suggested many times and ignored all
> > the time because it requires a bit of work.
> >
>
> You keep telling me this but forgetting that I did all this work when you
> originally suggested it.  The problem I ran into was the automount stuff
> requires that we have a completely different superblock for every vfsmount.
> This is fine for things like nfs or samba where the automount literally points
> to a completely different mount, but doesn't work for btrfs where it's on the
> same file system.  If you have 1000 subvolumes and run sync() you're going to
> write the superblock 1000 times for the same file system.  You are going to
> reclaim inodes on the same file system 1000 times.  You are going to reclaim
> dcache on the same filesystem 1000 times.  You are also going to pin 1000
> dentries/inodes into memory whenever you wander into these things because the
> super is going to hold them open.
>
> This is not a workable solution.  It's not a matter of simply tying into
> existing infrastructure, we'd have to completely rework how the VFS deals with
> this stuff in order to be reasonable.  And when I brought this up to Al he told
> me I was insane and we absolutely had to have a different SB for every vfsmount,
> which means we can't use vfsmount for this, which means we don't have any other
> options.  Thanks,

When I was first looking at this, I thought that separate vfsmnts and
auto-mounting was the way to go "just like NFS".
NFS still shares a lot between the multiple superblocks - certainly it
shares the same connection to the server.

But I dropped the idea when Bruce pointed out that nfsd is not set up to
export auto-mounted filesystems.  It needs to be able to find a
filesystem given a UUID (extracted from a filehandle), and it does this
by walking through the mount table to find one that matches.  So unless
all btrfs subvols were mounted all the time (which I wouldn't propose),
it would need major work to fix.

NFSv4 describes the fsid as having a "major" and "minor" component.
We've never treated these as having an important meaning - just extra
bits to encode uniqueness in.  Maybe we should have used "major" for the
vfsmnt, and kept "minor" for the subvol...

The idea for a single vfsmnt exposing multiple inode-name-spaces does
appeal to me.  The "st_dev" is just part of the name, and already a
fairly blurry part.  Thanks to bind mounts, multiple mounts can have the
same st_dev.  I see no intrinsic reason that a single mount should not
have multiple fsids, provided that a coherent picture is provided to
userspace which doesn't contain too many surprises.

NeilBrown
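
For anyone following along from userspace: the "st_dev is just part of
the name" point can be illustrated with a short Python sketch. It is
not taken from the patch or from any kernel code. It only uses
os.lstat() to treat (st_dev, st_ino) as a file's identity and to report
directories whose st_dev differs from their parent's, which is roughly
how tools decide they have "wandered into a different inode namespace".
The function names (inode_key, crosses_namespace, find_boundaries) are
made up for this example.

```python
import os
import tempfile


def inode_key(path):
    """Return (st_dev, st_ino), the pair userspace treats as a file's identity."""
    st = os.lstat(path)
    return (st.st_dev, st.st_ino)


def crosses_namespace(parent_dir, child):
    """Return True if child reports a different st_dev than parent_dir."""
    return os.lstat(parent_dir).st_dev != os.lstat(child).st_dev


def find_boundaries(root):
    """Walk root and list directories whose st_dev differs from their parent's.

    os.walk() does not follow symlinks by default, and lstat() is used so a
    symlink is judged by where the link itself lives, not by its target.
    """
    boundaries = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            child = os.path.join(dirpath, name)
            if crosses_namespace(dirpath, child):
                boundaries.append(child)
    return boundaries


if __name__ == "__main__":
    # Demo on a scratch directory: a directory made with plain mkdir is
    # expected to stay in the same st_dev as its parent.
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        print(find_boundaries(root))
```

Pointing find_boundaries() at a btrfs mount that contains subvolumes
would be expected to list the subvolume roots, since btrfs reports a
distinct st_dev for each subvolume even though no mount point is
crossed. Because of bind mounts, the reverse does not hold: a mount
point is not guaranteed to show up as an st_dev change.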