by NeilBrown

Subject: Re: [PATCH 05/11] VFS: new function: mount_is_internal()

On Wed, 28 Jul 2021, Al Viro wrote:
> On Wed, Jul 28, 2021 at 08:37:45AM +1000, NeilBrown wrote:
> > This patch introduces the concept of an "internal" mount which is a
> > mount where a filesystem has created the mount itself.
> >
> > Both the mounted-on-dentry and the mount's root dentry must refer to the
> > same superblock (they may be the same dentry), and the mounted-on dentry
> > must be an automount.
>
> And what happens if you mount --move it?
>
>
If you move the mount, then the mounted-on dentry would no longer be an
automount (.... I assume???...) so it would no longer satisfy
mount_is_internal().

I think that is reasonable. Whoever moved the mount has now taken over
responsibility for it - it no longer is controlled by the filesystem.
The moving will have removed the mount from the list of auto-expire
mounts, and the mount-trap will now be exposed and can be mounted-on
again.

It would be just like unmounting the automount, and bind-mounting the
same dentry elsewhere.

NeilBrown

2021-07-28 04:58:55

Hi NeilBrown,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on nfsd/nfsd-next]
[also build test WARNING on kdave/for-next hch-configfs/for-next linus/master v5.14-rc3 next-20210727]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/NeilBrown/expose-btrfs-subvols-in-mount-table-correctly/20210728-064502
base: git://linux-nfs.org/~bfields/linux.git nfsd-next
config: i386-randconfig-s002-20210728 (attached as .config)
compiler: gcc-10 (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.3-341-g8af24329-dirty
# https://github.com/0day-ci/linux/commit/58749022685aea90dfddfb9f8b2fcdc74dee6ec0
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review NeilBrown/expose-btrfs-subvols-in-mount-table-correctly/20210728-064502
git checkout 58749022685aea90dfddfb9f8b2fcdc74dee6ec0
# save the attached .config to linux build tree
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=i386 SHELL=/bin/bash fs/btrfs/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

sparse warnings: (new ones prefixed by >>)
>> fs/btrfs/inode.c:5868:5: sparse: sparse: symbol 'btrfs_mountpoint_expiry_timeout' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

2021-07-28 09:40:45

On Tue, Aug 24, 2021 at 09:22:05AM +1000, NeilBrown wrote:
> On Mon, 23 Aug 2021, Zygo Blaxell wrote:
> ...
> >
> > Subvol IDs are not reusable. They are embedded in shared object ownership
> > metadata, and persist for some time after subvols are deleted.
> ...
> >
> > The cost of _tracking_ free object IDs is trivial compared to the cost
> > of _reusing_ an object ID on btrfs.
>
> One possible approach to these two objections is to decouple inode
> numbers from object ids.

This would be reasonable for subvol IDs (I thought of it earlier in this
thread, but didn't mention it because I wasn't going to be the first to
open that worm can ;).

There aren't very many subvol IDs and they're not used as frequently
as inodes, so a lookup table to remap them to smaller numbers to save
st_ino bit-space wouldn't be unreasonably expensive. If we stop right
here and use the [some_zeros:reversed_subvol:inode] bit-packing scheme
you proposed for NFS, that seems like a reasonable plan. It would have
48 bits of usable inode number space (enough for ~440000 file creates
per second for 20 years) with up to 65535 snapshots, and 48 bits is the
same number of bits that ZFS has in its inodes.
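
To make the arithmetic above concrete, here is a minimal sketch of a
16:48 split of a 64-bit st_ino. It is a simplification: the remapped
subvol index is placed in the high bits as-is, and the bit reversal from
the [some_zeros:reversed_subvol:inode] proposal is left out. All names
(pack_st_ino, unpack_st_ino, creates_per_second) are made up for
illustration and are not btrfs code.

```python
INODE_BITS = 48                       # usable inode number space
SUBVOL_BITS = 64 - INODE_BITS         # 16 bits left for the remapped subvol index
INODE_MASK = (1 << INODE_BITS) - 1
MAX_SUBVOL_INDEX = (1 << SUBVOL_BITS) - 1   # 65535

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def pack_st_ino(subvol_index, ino):
    """Combine a small remapped subvol index and an inode number into 64 bits."""
    if not 0 <= subvol_index <= MAX_SUBVOL_INDEX:
        raise ValueError("subvol index does not fit in %d bits" % SUBVOL_BITS)
    if not 0 <= ino <= INODE_MASK:
        raise ValueError("inode number does not fit in %d bits" % INODE_BITS)
    return (subvol_index << INODE_BITS) | ino


def unpack_st_ino(st_ino):
    """Split a packed value back into (subvol_index, ino)."""
    return st_ino >> INODE_BITS, st_ino & INODE_MASK


def creates_per_second(years=20):
    """Sustained create rate that exhausts the inode bits in `years` years."""
    return (1 << INODE_BITS) / (years * SECONDS_PER_YEAR)
```

creates_per_second(20) is how the "file creates per second for 20 years"
figure can be checked; the subvol-ID-to-index lookup table itself is not
modelled here.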

Once that subvol ID mapping tree exists, it could also map subvol inode
numbers to globally unique numbers. Each tree item would contain a map of
[subvol_inode1..subvol_inode2] that maps the inode numbers in the subvol
into the global inode number space at [global_inode1..global_inode2].
When a snapshot is created, the snapshot gets a copy of all the origin
subvol's inode ranges, but with newly allocated base offsets. If the
original subvol needs new inodes, it gets a new chunk from the global
inode allocator. If the snapshot subvol needs new inodes, it gets a
different new chunk from the global allocator. The minimum chunk might
be a million inodes or so to avoid having to allocate new chunks all the
time, but not so high as to leave the chunk-allocation code completely
untested (or testers could just set the minchunk to 1000 inodes).
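
The range-mapping idea can be sketched in a few lines. This is an
in-memory toy under stated assumptions, not a proposed on-disk format:
Range, GlobalAllocator, Subvol and MIN_CHUNK are hypothetical names, the
global allocator is a simple bump counter, and subvol-local inode
numbers are assumed to start at 0 and be handed out sequentially.

```python
from dataclasses import dataclass

MIN_CHUNK = 1000  # hypothetical minimum chunk size (the small "tester" setting)


@dataclass
class Range:
    subvol_start: int  # first subvol-local inode number covered by this range
    global_start: int  # global inode number that subvol_start maps to
    length: int        # number of inode numbers in the range


class GlobalAllocator:
    """Hands out non-overlapping chunks of the global inode number space."""

    def __init__(self):
        self.next_free = 0

    def alloc(self, length):
        start = self.next_free
        self.next_free += length
        return start


class Subvol:
    def __init__(self, allocator, ranges=None, next_local=0):
        self.allocator = allocator
        self.ranges = list(ranges or [])
        self.next_local = next_local  # next unused subvol-local inode number

    def _capacity(self):
        # Ranges are contiguous from local inode 0, so the sum is the upper bound.
        return sum(r.length for r in self.ranges)

    def create_inode(self):
        """Return a new subvol-local inode number, growing the map if needed."""
        if self.next_local >= self._capacity():
            base = self.allocator.alloc(MIN_CHUNK)
            self.ranges.append(Range(self._capacity(), base, MIN_CHUNK))
        local = self.next_local
        self.next_local += 1
        return local

    def to_global(self, local):
        """Translate a subvol-local inode number to its global number."""
        for r in self.ranges:
            if r.subvol_start <= local < r.subvol_start + r.length:
                return r.global_start + (local - r.subvol_start)
        raise KeyError(local)

    def snapshot(self):
        """Copy every range, but with newly allocated global base offsets."""
        copied = [Range(r.subvol_start, self.allocator.alloc(r.length), r.length)
                  for r in self.ranges]
        return Subvol(self.allocator, copied, self.next_local)
```

After a snapshot, the same local inode number translates to different
global numbers in the origin and in the snapshot, and each side draws
its next chunk independently from the global allocator.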

The question I have (and why I didn't propose this earlier) is whether
this scheme is any real improvement over dividing the subvol:inode space
by bit packing. If you have one subvol that has 3 billion existing inodes
in it, every snapshot of that subvol is going to burn up roughly 2^-32 of
the available globally unique inode numbers. If we burn 3 billion inodes
instead of 4 billion per subvol, it only gets about 33% more lifespan for the
filesystem, and the allocation of unique inode spaces and tracking inode
space usage will add cost to every single file creation and snapshot
operation. If your oldest/biggest subvol only has a million inodes in
it, all of the above is pure cost: you can create billions of snapshots,
never repeat any object IDs, and never worry about running out.
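
The numbers in that trade-off are easy to reproduce. The sketch below
assumes a 64-bit global inode number space and that each snapshot
consumes one global number per inode in the origin subvol; the function
names are illustrative only.

```python
import math

GLOBAL_SPACE = 1 << 64  # assumed size of the globally unique inode number space


def fraction_burned(inodes_per_snapshot):
    """Share of the global space consumed by one snapshot."""
    return inodes_per_snapshot / GLOBAL_SPACE


def snapshots_until_exhausted(inodes_per_snapshot):
    """How many snapshots fit before the global space runs out."""
    return GLOBAL_SPACE // inodes_per_snapshot


def lifespan_gain(cheaper, costlier):
    """Relative lifespan when each snapshot burns `cheaper` instead of `costlier`."""
    return snapshots_until_exhausted(cheaper) / snapshots_until_exhausted(costlier)


def log2_fraction(inodes_per_snapshot):
    """The burned fraction expressed as a power of two."""
    return math.log2(fraction_burned(inodes_per_snapshot))
```

log2_fraction(3_000_000_000) gives the exponent for the 3-billion-inode
case, and lifespan_gain(3_000_000_000, 4_000_000_000) compares the two
per-snapshot costs discussed above.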

I'd want to see cost/benefit simulations of:

    this plan,

    the simpler but less efficient bit-packing plan,

    'cp -a --reflink' to a new subvol and start over every 20 years
    when inodes run out,

    and online garbage-collection/renumbering schemes that allow
    users to schedule the inode renumbering costs in overnight
    batches instead of on every inode create.

> The inode number becomes just another piece of metadata stored in the
> inode.
> struct btrfs_inode_item has four spare u64s, so we could use one of
> those.
> struct btrfs_dir_item would need to store the inode number too. What
> is location.offset used for? Would a diritem ever point to a non-zero
> offset? Could the 'offset' be used to store the inode number?

Offset is used to identify subvol roots at the moment, but so far that
means only values 0 and UINT64_MAX are used. It seems possible to treat
all other values as inode numbers. Don't quote me on that--I'm not an
expert on this structure.

> This could even be added to existing filesystems I think. It might not
> be easy to re-use inode numbers smaller than the largest at the time the
> extension was added, but newly added inode numbers could be reused after
> they were deleted.

We'd need a structure to track reusable inode numbers and it would have to
be kept up to date to work, so this feature would necessarily come with an
incompat bit. Whether you borrow bits from existing structures or make
extended new structures doesn't matter at that point, though obviously
for something as common as inodes it would be bad to make them bigger.
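
As a rough illustration of what such a tracking structure has to do,
here is an in-memory sketch. It assumes, as suggested in the quoted
text, that numbers at or below the largest inode number in use when the
extension was enabled are never reused. InodeNumberAllocator and its
fields are hypothetical; a real implementation would need a persistent
on-disk structure, which is why an incompat bit would be required.

```python
import heapq


class InodeNumberAllocator:
    def __init__(self, high_water):
        # high_water: largest inode number in use when the (hypothetical)
        # extension was enabled; numbers up to it are never recycled.
        self.legacy_limit = high_water
        self.next_new = high_water + 1
        self.free = []  # min-heap of released numbers that may be reused

    def alloc(self):
        """Prefer the smallest reusable number, else take a fresh one."""
        if self.free:
            return heapq.heappop(self.free)
        ino = self.next_new
        self.next_new += 1
        return ino

    def release(self, ino):
        """Record a deleted inode's number as reusable if it is eligible."""
        if ino > self.legacy_limit:
            heapq.heappush(self.free, ino)
```

Every create and every delete touches this structure, which is the "kept
up to date" cost mentioned above.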

Some of the btrfs userspace API uses inode numbers, but unless I missed
something, it could all be converted to use object numbers directly
instead.

> Just a thought...
>
> NeilBrown