From: "NeilBrown"
To: "J. Bruce Fields"
Cc: "Christoph Hellwig", "Chuck Lever", linux-nfs@vger.kernel.org,
    "Josef Bacik", linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
In-reply-to: <20210901152251.GA6533@fieldses.org>
References: <162995209561.7591.4202079352301963089@noble.neil.brown.name>,
    <162995778427.7591.11743795294299207756@noble.neil.brown.name>,
    <163010550851.7591.9342822614202739406@noble.neil.brown.name>,
    <163038594541.7591.11109978693705593957@noble.neil.brown.name>,
    <20210901152251.GA6533@fieldses.org>
Date: Thu, 02 Sep 2021 14:14:17 +1000
Message-id: <163055605714.24419.381470460827658370@noble.neil.brown.name>
X-Mailing-List: linux-nfs@vger.kernel.org

On Thu, 02 Sep 2021, J. Bruce Fields wrote:
> I looked back through a couple threads to try to understand why we
> couldn't do that (on new filesystems, with a mkfs option to choose new
> or old behavior) and still don't understand.
> But the threads are long.
>
> There are objections to a new mount option (which seem obviously wrong;
> this should be a persistent feature of the on-disk filesystem).

I hadn't thought much (if at all) about a persistent filesystem feature
flag. I'll try that now.

There are two features of interest. One is completely unique inode
numbers, the other is reporting different st_dev for different
subvolumes. I think these need to be kept separate, though the second
would depend on the first. They would be similar to my "inumbits" and
"numdevs" mount options, though with less flexibility. I think that
they would need strong semantics to be acceptable - "mostly unique"
isn't really acceptable once we are changing the on-disk data.

The "unique inode numbers" bit (UIN) would require that file object-ids
fit in some number of bits (maybe 40) and that subvolume numbers fit in
the remaining bits (24) and would then combine them together for the
inode number. This could obviously be set at mkfs time. Could it be
set on an unmounted filesystem?

The "single-dev" flag (SD) could be toggled any time that UIN was set,
and mkfs would default it on if UIN was selected.

If UIN was in effect, then creating a subvol beyond the permitted max
would have to fail. 24 bits is small enough that we would probably
want a warning of impending doom - maybe at 23 bits? The current
48 bits doesn't need that.

Similarly, creating an inode beyond 40 bits would have to fail. This is
probably more problematic and so might need more warnings. Do we want a
warning each time any subvol crosses some limit? If not, we would need a
flag for each warning. What should a sysadmin do when they see the
warning?

Is 40 bits an unacceptable limit on the total number of inodes in a
subvol, or is it only a problem because of btrfs' practice of never
reusing object-ids? Backup-and-restore would compact object-ids, but
would be a big cost.
Off-line reindexing would be cheaper (does anyone else remember using
"renum" programs with BASIC??). Online lazy re-indexing might be
possible if the inode number was maintained separately from the
object-id and an atomic "switch which inode number to use" could be done
at mount time.

Setting UIN on an existing filesystem would require checking that only
24 bits are used for subvolumes (easy) and that only 40 were used for
objects in any individual subvolume (presumably that would require
checking all subvolumes, which might take a little while, but shouldn't
take more than a few minutes). Doing this would break any indexes that
might be created over files, and would probably upset any active NFS
mounts, and would likely have other problems. So it would need to be a
well-documented step with clear rewards.

An alternative to renumbering would be to maintain file-ids and
subvolume-ids which are separate from the object-id. Apparently reusing
subvolume object-ids is not possible and reusing file object-ids is
quite costly. If the file-id were separate from the object-id, these
problems would vanish. This would require extra space in the inode
(there are several reserved u64s, so that isn't a problem) and space in
each directory entry (might be more of a problem). It would also
require some way to keep track of used (or unused) id numbers. This
avoids the cost of renumbering, by spreading it out over every creation.
I suspect the average inode-creation overhead could be kept quite low,
but not quite zero.

I believe that some code *knows* that the root of any btrfs subvolume
has inode number 256. systemd seems to use this. I have no idea what
else might depend on inode numbers in some way.

I suspect that if we tried to roll out a change like this, either
almost no-one would use it (if it wasn't the default), or things would
start breaking (if it was).
I'm not against breaking things, but we need to be sure there is a
solution for fixing them, and I'm certainly not up to doing that myself.

So yes - I think that using a mkfs option would open up other avenues
for a solution. There would still be a lot of work to find something
that continues to meet everyone's needs.

The advantage of an nfsd-focused solution is that we can have working
code today with minimal down-sides. I'm certainly not prepared to go
digging through btrfs code to determine how to implement a btrfs-only
solution without strong buy-in from btrfs maintainers.

NeilBrown
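
For illustration only, the UIN idea floated above (40 bits of file
object-id plus 24 bits of subvolume number combined into one 64-bit inode
number, hard failure past either limit, and a warning at 23 bits of
subvolume number) could be sketched as below. This is Python rather than
btrfs code. All names are hypothetical, and placing the subvolume number
in the high bits is an assumption; the mail only says the two values
would be combined.

```python
# Hypothetical sketch of the "unique inode number" (UIN) packing.
# Bit widths follow the mail (40 + 24); the layout is an assumption.
OBJECT_BITS = 40        # bits reserved for the per-subvolume object-id
SUBVOL_BITS = 24        # bits reserved for the subvolume number
SUBVOL_WARN_BITS = 23   # warn once subvolume numbers need this many bits

OBJECT_MASK = (1 << OBJECT_BITS) - 1


def make_ino(subvol_id: int, object_id: int) -> int:
    """Combine a subvolume number and an object-id into one 64-bit value.

    Raises OverflowError when either value does not fit, mirroring the
    "creation would have to fail" requirement.
    """
    if subvol_id < 0 or subvol_id >> SUBVOL_BITS:
        raise OverflowError("subvolume number does not fit in 24 bits")
    if object_id < 0 or object_id >> OBJECT_BITS:
        raise OverflowError("object-id does not fit in 40 bits")
    # Assumed layout: subvolume number in the high bits.
    return (subvol_id << OBJECT_BITS) | object_id


def split_ino(ino: int) -> tuple[int, int]:
    """Recover (subvol_id, object_id) from a packed inode number."""
    return ino >> OBJECT_BITS, ino & OBJECT_MASK


def subvol_needs_warning(subvol_id: int) -> bool:
    """True once a subvolume number reaches the 23-bit warning threshold."""
    return subvol_id >= (1 << SUBVOL_WARN_BITS)
```

With this layout, any two files in different subvolumes get different
inode numbers as long as both checks pass, which is the "strong
semantics" requirement; the cost is that both limits become hard errors.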