Date: Fri, 30 Jul 2021 14:15:01 -0400
From: Zygo Blaxell
To: Qu Wenruo
Cc: NeilBrown, Neal Gompa, Wang Yugui, Christoph Hellwig, Josef Bacik,
Bruce Fields" , Chuck Lever , Chris Mason , David Sterba , Alexander Viro , linux-fsdevel , linux-nfs@vger.kernel.org, Btrfs BTRFS Subject: Re: [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly Message-ID: <20210730181501.GN10170@hungrycats.org> References: <162751265073.21659.11050133384025400064@noble.neil.brown.name> <20210729023751.GL10170@hungrycats.org> <162752976632.21659.9573422052804077340@noble.neil.brown.name> <20210729232017.GE10106@hungrycats.org> <162761259105.21659.4838403432058511846@noble.neil.brown.name> <341403c0-a7a7-f6c8-5ef6-2d966b1907a8@gmx.com> <162762468711.21659.161298577376336564@noble.neil.brown.name> <162762802395.21659.5310176078177217626@noble.neil.brown.name> <21939589-bd90-116d-7351-b84ba58446b3@gmx.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <21939589-bd90-116d-7351-b84ba58446b3@gmx.com> User-Agent: Mutt/1.10.1 (2018-07-13) Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org On Fri, Jul 30, 2021 at 03:09:12PM +0800, Qu Wenruo wrote: > > > On 2021/7/30 下午2:53, NeilBrown wrote: > > On Fri, 30 Jul 2021, Qu Wenruo wrote: > > > > > > > > You mean like "du -x"?? Yes. You would lose the misleading illusion > > > > that there are multiple filesystems. That is one user-expectation that > > > > would need to be addressed before people opt-in > > > > > > OK, forgot it's an opt-in feature, then it's less an impact. > > > > The hope would have to be that everyone would eventually opt-in once all > > issues were understood. > > > > > > > > Really not familiar with NFS/VFS, thus some ideas from me may sounds > > > super crazy. > > > > > > Is it possible that, for nfsd to detect such "subvolume" concept by its > > > own, like checking st_dev and the fsid returned from statfs(). > > > > > > Then if nfsd find some boundary which has different st_dev, but the same > > > fsid as its parent, then it knows it's a "subvolume"-like concept. > > > > > > Then do some local inode number mapping inside nfsd? > > > Like use the highest 20 bits for different subvolumes, while the > > > remaining 44 bits for real inode numbers. > > > > > > Of-course, this is still a workaround... > > > > Yes, it would certainly be possible to add some hacks to nfsd to fix the > > immediate problem, and we could probably even created some well-defined > > interfaces into btrfs to extract the required information so that it > > wasn't too hackish. > > > > Maybe that is what we will have to do. But I'd rather not hack NFSD > > while there is any chance that a more complete solution will be found. > > > > I'm not quite ready to give up on the idea of squeezing all btrfs inodes > > into a 64bit number space. 24bits of subvol and 40 bits of inode? > > Make the split a mkfs or mount option? > > Btrfs used to have a subvolume number limit in the past, for different > reasons. > > In that case, subvolume number is limited to 48 bits, which is still too > large to avoid conflicts. > > For inode number there is really no limit except the 256 ~ (U64)-256 limit. > > Considering all these numbers are almost U64, conflicts would be > unavoidable AFAIK. > > > Maybe hand out inode numbers to subvols in 2^32 chunks so each subvol > > (which has ever been accessed) has a mapping from the top 32 bits of the > > objectid to the top 32 bits of the inode number. > > > > We don't need something that is theoretically perfect (that's not > > possible anyway as we don't have 64bits of device numbers). 
> > We just need something that is practical and scales adequately.  If
> > you have petabytes of storage, it is reasonable to spend a gigabyte
> > of memory on a lookup table(?).
>
> Can such squishing-all-inodes-into-one-namespace work be done in a more
> generic way?  E.g., let each fs with a "subvolume"-like feature provide
> an interface to do that.

If you know the highest subvol ID number, you can pack two integers into
one larger integer by reversing the bits of the subvol number and ORing
them with the inode number, i.e. 0x0080000000000300 is subvol 256 inode
768.  The subvol IDs grow left to right while the inode numbers grow
right to left.  You can have billions of inodes in a few subvols, or
billions of subvols with a few inodes each, and neither will collide
with the other until there are billions of both.

If the filesystem tracks the number of bits in the highest subvol ID and
the highest inode number, then the inode numbers can be decoded, and
collisions can be detected.  E.g. if the maximum subvol ID on the
filesystem is below 131072, it fits in 17 bits, so we know bits 63-47
are the subvol ID and bits 46-0 are the inode.  When subvol 131072 is
created, the number of subvol bits increases to 18, but if every inode
fits in fewer than 46 bits, we know that every existing inode has a 0 in
the 18th subvol ID bit of the inode number, so there is no ambiguity.

If you don't know the maximum subvol ID, you can guess based on the
position of the large run of zero bits in the middle of the integer--not
reliable, but good enough for a guess if you were looking at 'ls -li'
output (and wrote the inode numbers in hex).

In the pathological case (the maximum subvol ID and maximum inode number
require more than 64 total bits) we return ENOSPC.

This can all be done when btrfs fills in an inode struct.  There's no
need to change the on-disk format, other than to track the highest inode
and subvol numbers.  btrfs can compute the maxima in reasonable but
non-zero time by searching trees on mount, so an incompatible disk
format change would only be needed to avoid making mount slower.
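To make the bit arithmetic concrete, here is a rough userspace sketch of
the scheme (not btrfs code; the function names are invented for
illustration, and a real implementation would check the filesystem-wide
maxima rather than each individual pair):

/*
 * Subvol ID bits grow downward from bit 63 (bit-reversed); inode bits
 * grow upward from bit 0.
 */
#include <stdint.h>
#include <stdio.h>

/* Map bit i to bit 63-i, so subvol 256 (bit 8) lands on bit 55. */
static uint64_t reverse_bits64(uint64_t v)
{
	uint64_t r = 0;

	for (int i = 0; i < 64; i++) {
		r = (r << 1) | (v & 1);
		v >>= 1;
	}
	return r;
}

/*
 * Pack subvol and inode into one number.  Returns -1 in the
 * pathological case where the two need more than 64 bits in total
 * (the ENOSPC case described above).
 */
static int pack_ino(uint64_t subvol, uint64_t ino, uint64_t *out)
{
	int subvol_bits = 64 - __builtin_clzll(subvol | 1);
	int ino_bits = 64 - __builtin_clzll(ino | 1);

	if (subvol_bits + ino_bits > 64)
		return -1;
	*out = reverse_bits64(subvol) | ino;
	return 0;
}

/*
 * Decode, assuming the filesystem-wide maximum subvol ID fits in
 * sv_bits (0 < sv_bits < 64), e.g. sv_bits = 17 while the highest
 * subvol ID is below 131072.
 */
static void unpack_ino(uint64_t packed, int sv_bits,
		       uint64_t *subvol, uint64_t *ino)
{
	*subvol = reverse_bits64(packed) & ((1ULL << sv_bits) - 1);
	*ino = packed & (~0ULL >> sv_bits);
}

int main(void)
{
	uint64_t packed, subvol, ino;

	if (pack_ino(256, 768, &packed) == 0)
		printf("packed: 0x%016llx\n", (unsigned long long)packed);
	unpack_ino(packed, 17, &subvol, &ino);
	printf("subvol %llu inode %llu\n",
	       (unsigned long long)subvol, (unsigned long long)ino);
	return 0;
}

Built as an ordinary userspace program, this prints 0x0080000000000300
and then recovers subvol 256, inode 768, matching the example above.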
> Despite that I still hope to have a way to distinguish the "subvolume"
> boundary.

Packing the bits into a single uint64 doesn't help with this--it does
the opposite.  Subvol boundaries become harder to see without deliberate
checking (i.e. not the traditional parent.st_dev != child.st_dev test).
Judging from previous btrfs-related complaints, some users do want
"stealth" subvols whose boundaries are not accidentally visible, so the
new behavior could be a feature for someone.

> If completely inside btrfs, it's pretty simple to locate a subvolume
> boundary.
> All subvolumes have the same inode number, 256.
>
> Maybe we could reserve some special "squished" inode number to indicate
> a boundary inside a filesystem.
>
> E.g. reserve (u64)-1 as a special indicator for subvolume boundaries.
> As most filesystems reserve super-high inode numbers anyway.
>
> > If we can make inode numbers unique, we can possibly leave the st_dev
> > changing at subvols so that "du -x" works as currently expected.
> >
> > One thought I had was to use a strong hash to combine the subvol
> > object id and the inode object id into a 64-bit number.  What is the
> > chance of a collision in practice :-)
>
> But with just 64 bits, conflicts will happen anyway...

The collision rate might be low enough that we could just skip over the
colliding numbers, but we'd have to have some kind of in-memory
collision map to avoid slowing down inode creation (currently the next
inode number is more or less "++last_inode_number", and looking up
inodes to see if they exist first would slow down new file creation a
lot).
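For what it's worth, a toy sketch of such a collision map (everything
here is invented for illustration--the table size, locking, and
persistence would all need real design, and the mixer merely stands in
for the strong hash suggested above):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAP_BITS 20			/* 1M-entry example table, ~8MB */
#define MAP_SIZE (1ULL << MAP_BITS)

static uint64_t seen[MAP_SIZE];		/* 0 = empty slot */

/* Insert a hashed ino; returns false if it was already present. */
static bool map_insert(uint64_t hashed_ino)
{
	uint64_t slot = hashed_ino & (MAP_SIZE - 1);

	while (seen[slot] != 0) {
		if (seen[slot] == hashed_ino)
			return false;	/* collision: caller skips this one */
		slot = (slot + 1) & (MAP_SIZE - 1);
	}
	seen[slot] = hashed_ino;
	return true;
}

/* Stand-in for the strong hash combining subvol and inode object ids. */
static uint64_t hash_ino(uint64_t subvol, uint64_t ino)
{
	uint64_t x = subvol * 0x9e3779b97f4a7c15ULL ^ ino;

	x ^= x >> 33;
	x *= 0xff51afd7ed558ccdULL;
	x ^= x >> 33;
	return x ? x : 1;	/* keep 0 reserved as "empty" */
}

/*
 * Allocate the next inode number in a subvol: stays at roughly
 * "++last_inode_number" cost, probing the in-memory map instead of
 * searching the tree, and skipping numbers whose hash is taken.
 */
static uint64_t alloc_ino(uint64_t subvol, uint64_t *last_ino)
{
	while (!map_insert(hash_ino(subvol, *last_ino + 1)))
		(*last_ino)++;	/* rare: hashed number already in use */
	return ++(*last_ino);
}

int main(void)
{
	uint64_t last_ino = 256;	/* roughly btrfs's first free objectid */

	for (int i = 0; i < 4; i++)
		printf("allocated inode %llu\n",
		       (unsigned long long)alloc_ino(5, &last_ino));
	return 0;
}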