2021-08-09 03:59:11

by NeilBrown

[permalink] [raw]
Subject: [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.

I continue to search for a way forward for btrfs so that its behaviour
with respect to device numbers and subvols is somewhat coherent.

This series implements some of the ideas in my "A Third perspective"[1],
though with changes is various details.

I introduce two new mount options, which default to
no-change-in-behaviour.

-o inumbits= causes inode numbers to be more unique across a whole btrfs
filesystem, and is many cases completely unique. Mounting
with "-i inumbits=56" will resolve the NFS issues that
started me tilting at this particular windmill.

-o numdevs= can reduce the number of distinct devices reported by
stat(), either to 2 or to 1.
Both ease problems for sites that exhaust their supply of
device numbers.
'2' allows "du -x" to continue to work, but is otherwise
rather strange.
'1' breaks the use of "du -x" and similar to examine a
single subvol which might have subvol descendants, but
provides generally sane behaviour
"-o numdevs=1" also forces inumbits to have a useful value.

I introduce a "tree id" which can be discovered using statx(). Two
files with the same dev and ino might still be different if the tree-ids
are different. Connected files with the same tree-id may be usefully
considered to be related.

I also change various /proc files (only when numdevs=1 is used) to
provide extra information so they are useful with btrfs despite subvols.
/proc/maps /proc/smaps /proc/locks /proc/X/fdinfo/Y are affected.
The inode number becomes "XX:YY" where XX is the subvol number (tree id)
and YY is the inode number.

An alternate might be to report a number which might use up to 128 bits.
Which is less likely to seriously break code?

Note that code which ignores badly formatted lines is safe, because it
will never currently find a match for a btrfs file in these files
anyway. The device number they report is never returned in st_dev for
stat() on any file.

The audit subsystem and one or two other places report dev/ino and so
need enhanced, but I haven't tried to address those.

Various trace points also report dev/ino. I haven't tried thinking
about those either.

Thanks for your upcoming replies!

NeilBrown

---

NeilBrown (4):
btrfs: include subvol identifier in inode number if -o inumbits=...
btrfs: add numdevs= mount option.
VFS/btrfs: add STATX_TREE_ID
Add "tree" number to "inode" number in various /proc files.


fs/btrfs/ctree.h | 17 +++++++++++++++--
fs/btrfs/disk-io.c | 24 +++++++++++++++++++++---
fs/btrfs/inode.c | 39 ++++++++++++++++++++++++++++++++++++++-
fs/btrfs/ioctl.c | 6 ++++--
fs/btrfs/super.c | 31 +++++++++++++++++++++++++++++++
fs/inode.c | 1 +
fs/locks.c | 12 +++++++++---
fs/notify/fdinfo.c | 19 ++++++++++++++-----
fs/proc/nommu.c | 11 ++++++++---
fs/proc/task_mmu.c | 17 ++++++++++++-----
fs/proc/task_nommu.c | 11 ++++++++---
fs/stat.c | 2 ++
include/linux/fs.h | 3 ++-
include/linux/stat.h | 13 +++++++++++++
include/uapi/linux/stat.h | 3 ++-
samples/vfs/test-statx.c | 4 +++-
16 files changed, 183 insertions(+), 30 deletions(-)

--
Signature


2021-08-09 04:00:09

by NeilBrown

[permalink] [raw]
Subject: [PATCH 2/4] btrfs: add numdevs= mount option.

btrfs currently allocates multiple anonymous bdev numbers to hide the
fact that inode numbers are not unique across "subvolumes".
Each subvol gets a different device number.

As described in a previous patch, this is incomplete, doesn't scale, and
should be deprecated. This patch is another step to deprecation.

With mount option "-o numdevs=many", which is the default, the current
behaviour is preserved.

With mount option "-o numdevs=1", the st_dev reported by stat() is
exactly the number that appears in /proc/$PID/mountinfo (i.e.
sb->s_dev). This will prevent "du -x", "find -xdev" and similar tools
from keeping within a subvol, but is otherwise quite functional.

If numdevs=1 and inumbits=0, then there will often be inode number
reuse, so that combination is forbidden and the default fo inumbits
changes to BITS_PER_LONG*7/8. With larger inumbits (close to
BITS_PER_LONG), inode number reuse is still possible, but only with
large or old filesystems.

With mount option "-o numdevs=2", precisely two anon device numbers are
allocated. Each subvol gets the number that its parent isn't using.
When subvols are moved, the device number reported will change if needed
to differentiate from its parent.
If a subvol with dependent subvols is moved and the device numbers need
to change, the numbers in dependent subvols that are currently in cache
will NOT change. Fixing this is a stretch goal.

Using numdevs=2 removes any problems with exhausting the number of
available anon devs, and preserves the functionality of "du -x" and
similar. It may be a useful option for sites that experience exhaustion
problems.

numdevs=1 is, at this stage, most useful for exploring the consequences
of fully deprecating the use of multiple device numbers. It may also be
useful for site that find they have no dependency on multiple device
numbers.

Signed-off-by: NeilBrown <[email protected]>
---
fs/btrfs/ctree.h | 17 +++++++++++++++--
fs/btrfs/disk-io.c | 24 +++++++++++++++++++++---
fs/btrfs/inode.c | 29 ++++++++++++++++++++++++++++-
fs/btrfs/ioctl.c | 6 ++++--
fs/btrfs/super.c | 30 ++++++++++++++++++++++++++++++
5 files changed, 98 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0ef557db3a8b..2caedb8c8c6d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -988,6 +988,14 @@ struct btrfs_fs_info {
u32 stripesize;

unsigned short inumbits;
+ /* num_devs can be:
+ * 1 - all files in all trees use sb->s_dev
+ * 2 - file trees alternate between using sb->s_dev and
+ * secondary_anon_dev.
+ * 3 (BTTSF_MANY_DEVS) - Each subtree uses a unique ->anon_dev
+ */
+ unsigned short num_devs;
+ dev_t secondary_anon_dev;

/* Block groups and devices containing active swapfiles. */
spinlock_t swapfile_pins_lock;
@@ -1035,6 +1043,8 @@ struct btrfs_fs_info {
#endif
};

+#define BTRFS_MANY_DEVS (3)
+
static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
{
return sb->s_fs_info;
@@ -1176,10 +1186,13 @@ struct btrfs_root {
*/
struct radix_tree_root delayed_nodes_tree;
/*
- * right now this just gets used so that a root has its own devid
- * for stat. It may be used for more later
+ * If fs_info->num_devs == 3 (BTRFS_MANY_DEVS) anon_dev holds a device
+ * number to be reported by ->getattr().
+ * If fs_info->num_devs == 2, anon_dev is 0 and use_secondary_dev
+ * is true when this root uses the secondary, not primary, dev.
*/
dev_t anon_dev;
+ bool use_secondary_dev;

spinlock_t root_item_lock;
refcount_t refs;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7f3bfa042d66..5127e2689756 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1516,7 +1516,8 @@ static int btrfs_init_fs_root(struct btrfs_root *root, dev_t anon_dev)
* userspace, the id pool is limited to 1M
*/
if (is_fstree(root->root_key.objectid) &&
- btrfs_root_refs(&root->root_item) > 0) {
+ btrfs_root_refs(&root->root_item) > 0 &&
+ root->fs_info->num_devs == BTRFS_MANY_DEVS) {
if (!anon_dev) {
ret = get_anon_bdev(&root->anon_dev);
if (ret)
@@ -3332,8 +3333,12 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
* "-o inumbits" can over-ride this default.
* BITS_PER_LONG * 7 / 8 is a good value to use
*/
- if (fs_info->inumbits > BITS_PER_LONG)
- fs_info->inumbits = 0;
+ if (fs_info->inumbits > BITS_PER_LONG) {
+ if (fs_info->num_devs == 1)
+ fs_info->inumbits = BITS_PER_LONG * 7 / 8;
+ else
+ fs_info->inumbits = 0;
+ }

features = btrfs_super_incompat_flags(disk_super) &
~BTRFS_FEATURE_INCOMPAT_SUPP;
@@ -3379,6 +3384,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
fs_info->csums_per_leaf = BTRFS_MAX_ITEM_SIZE(fs_info) / fs_info->csum_size;
fs_info->stripesize = stripesize;

+ if (fs_info->num_devs == 0)
+ /* set default value */
+ fs_info->num_devs = BTRFS_MANY_DEVS;
+
+ if (fs_info->num_devs == 2) {
+ err = get_anon_bdev(&fs_info->secondary_anon_dev);
+ if (err)
+ goto fail_alloc;
+ }
/*
* mixed block groups end up with duplicate but slightly offset
* extent buffers for the same range. It leads to corruptions
@@ -4446,6 +4460,10 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)

btrfs_mapping_tree_free(&fs_info->mapping_tree);
btrfs_close_devices(fs_info->fs_devices);
+
+ if (fs_info->secondary_anon_dev)
+ free_anon_bdev(fs_info->secondary_anon_dev);
+ fs_info->secondary_anon_dev = 0;
}

int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 860cb5045123..30fa64cbe6dc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5966,6 +5966,8 @@ struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry)
iput(inode);
inode = ERR_PTR(ret);
}
+ if (fs_info->num_devs == 2)
+ sub_root->use_secondary_dev = !root->use_secondary_dev;
}

return inode;
@@ -9204,7 +9206,15 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
STATX_ATTR_NODUMP);

generic_fillattr(&init_user_ns, inode, stat);
- stat->dev = BTRFS_I(inode)->root->anon_dev;
+ /* If we don't set stat->dev here, sb->s_dev will be used */
+ switch (btrfs_sb(inode->i_sb)->num_devs) {
+ case 2:
+ if (BTRFS_I(inode)->root->use_secondary_dev)
+ stat->dev = btrfs_sb(inode->i_sb)->secondary_anon_dev;
+ break;
+ case BTRFS_MANY_DEVS:
+ stat->dev = BTRFS_I(inode)->root->anon_dev;
+ }

spin_lock(&BTRFS_I(inode)->lock);
delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
@@ -9390,6 +9400,15 @@ static int btrfs_rename_exchange(struct inode *old_dir,
if (new_inode->i_nlink == 1)
BTRFS_I(new_inode)->dir_index = new_idx;

+ if (fs_info->num_devs == 2 &&
+ root->use_secondary_dev != dest->use_secondary_dev) {
+ BTRFS_I(old_inode)->root->use_secondary_dev =
+ !dest->use_secondary_dev;
+ BTRFS_I(new_inode)->root->use_secondary_dev =
+ !root->use_secondary_dev;
+ // FIXME any subvols beneeath 'old_inode' or 'new_inode'
+ // that are in cache are now wrong.
+ }
if (root_log_pinned) {
btrfs_log_new_name(trans, BTRFS_I(old_inode), BTRFS_I(old_dir),
new_dentry->d_parent);
@@ -9656,6 +9675,14 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
goto out_fail;
}

+ if (fs_info->num_devs == 2 &&
+ root->use_secondary_dev != dest->use_secondary_dev) {
+ BTRFS_I(old_inode)->root->use_secondary_dev =
+ !dest->use_secondary_dev;
+ // FIXME any subvols beneeath 'old_inode' that are
+ // in cache are now wrong.
+ }
+
if (old_inode->i_nlink == 1)
BTRFS_I(old_inode)->dir_index = index;

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e008a9ceb827..a246f91b4df4 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -522,7 +522,8 @@ static noinline int create_subvol(struct inode *dir,
if (ret)
goto fail_free;

- ret = get_anon_bdev(&anon_dev);
+ if (fs_info->num_devs == BTRFS_MANY_DEVS)
+ ret = get_anon_bdev(&anon_dev);
if (ret < 0)
goto fail_free;

@@ -729,7 +730,8 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
if (!pending_snapshot)
return -ENOMEM;

- ret = get_anon_bdev(&pending_snapshot->anon_dev);
+ if (fs_info->num_devs == BTRFS_MANY_DEVS)
+ ret = get_anon_bdev(&pending_snapshot->anon_dev);
if (ret < 0)
goto free_pending;
pending_snapshot->root_item = kzalloc(sizeof(struct btrfs_root_item),
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 5f3350e2f7ec..b1aecb834234 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -361,6 +361,7 @@ enum {
Opt_discard_mode,
Opt_inumbits,
Opt_norecovery,
+ Opt_numdevs,
Opt_ratio,
Opt_rescan_uuid_tree,
Opt_skip_balance,
@@ -431,6 +432,7 @@ static const match_table_t tokens = {
{Opt_inumbits, "inumbits=%u"},
{Opt_nodiscard, "nodiscard"},
{Opt_norecovery, "norecovery"},
+ {Opt_numdevs, "numdevs=%s"},
{Opt_ratio, "metadata_ratio=%u"},
{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
{Opt_skip_balance, "skip_balance"},
@@ -849,8 +851,35 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
ret = -EINVAL;
goto out;
}
+ if (intarg == 0 && info->num_devs == 1) {
+ btrfs_err(info,
+ "inumbits=0 not permitted when numdevs=1");
+ ret = -EINVAL;
+ goto out;
+ }
info->inumbits = intarg;
break;
+ case Opt_numdevs:
+ if (info->num_devs) {
+ ; /* silently ignore attempts to change this */
+ } else if (strcmp(args[0].from, "many") == 0) {
+ info->num_devs = BTRFS_MANY_DEVS;
+ } else if (strcmp(args[0].from, "1") == 0) {
+ if (info->inumbits == 0) {
+ btrfs_err(info,
+"numdevs=1 not permitted with inumbits=0");
+ ret = -EINVAL;
+ }
+ info->num_devs = 1;
+ } else if (strcmp(args[0].from, "2") == 0) {
+ info->num_devs = 2;
+ } else {
+ btrfs_err(info,
+ "numdevs must be \"1\", \"2\", or \"many\".");
+ ret = -EINVAL;
+ goto out;
+ }
+ break;
case Opt_ratio:
ret = match_int(&args[0], &intarg);
if (ret)
@@ -1559,6 +1588,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
if (btrfs_test_opt(info, REF_VERIFY))
seq_puts(seq, ",ref_verify");
seq_printf(seq, ",inumbits=%u", info->inumbits);
+ seq_printf(seq, ",numdevs=%u", info->num_devs);
seq_printf(seq, ",subvolid=%llu",
BTRFS_I(d_inode(dentry))->root->root_key.objectid);
subvol_name = btrfs_get_subvol_name_from_objectid(info,


2021-08-09 04:00:20

by NeilBrown

[permalink] [raw]
Subject: [PATCH 3/4] VFS/btrfs: add STATX_TREE_ID

A new 64bit field is added to the data that can be returned by statx() -
the "tree id".
The tree id serves two needs.
1/ it extends the available inode number space. While a filesystem
SHOULD ensure the inode number is unique across the filesystem,
this is sometimes impractical. In such situations, 'tree id'
may be used to guarantee uniqueness. It can identify a separate
allocation domain.
A particular case when separate allocation domains is useful
is when a directory tree can be effectively "reflink"ed.
Updating all inode numbers in such a tree is prohibitively expensive.
2/ it can identify a collection of objects that provide a coherent
"tree" in some locally-defined sense.

This patch uses STATX_TREE_ID to export the subvol id for btrfs.

samples/vfs/test_statx.c is extended to report the treeid.

Also: a new superblock field is added: s_tree_id_bits. This can store
the number of significant bits in the reported treeid. It is
currently unused, but could be used by overlayfs to determine how
to add a filesystem number to the treeid to differentiate files in
different underlying filesystems.

Signed-off-by: NeilBrown <[email protected]>
---
fs/btrfs/inode.c | 4 ++++
fs/btrfs/super.c | 1 +
fs/stat.c | 1 +
include/linux/fs.h | 2 +-
include/linux/stat.h | 13 +++++++++++++
include/uapi/linux/stat.h | 3 ++-
samples/vfs/test-statx.c | 4 +++-
7 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 30fa64cbe6dc..c878726d090c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9215,6 +9215,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
case BTRFS_MANY_DEVS:
stat->dev = BTRFS_I(inode)->root->anon_dev;
}
+ if (request_mask & STATX_TREE_ID) {
+ stat->tree_id = BTRFS_I(inode)->root->root_key.objectid;
+ stat->result_mask |= STATX_TREE_ID;
+ }

spin_lock(&BTRFS_I(inode)->lock);
delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b1aecb834234..e6d166150660 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1410,6 +1410,7 @@ static int btrfs_fill_super(struct super_block *sb,
#endif
sb->s_flags |= SB_I_VERSION;
sb->s_iflags |= SB_I_CGROUPWB;
+ sb->s_tree_id_bits = 48;

err = super_setup_bdi(sb);
if (err) {
diff --git a/fs/stat.c b/fs/stat.c
index 1fa38bdec1a6..2dd5d3d67793 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -580,6 +580,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
tmp.stx_dev_major = MAJOR(stat->dev);
tmp.stx_dev_minor = MINOR(stat->dev);
tmp.stx_mnt_id = stat->mnt_id;
+ tmp.stx_tree_id = stat->tree_id;

return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 640574294216..a777c1b1706a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1482,7 +1482,7 @@ struct super_block {

unsigned int s_max_links;
fmode_t s_mode;
-
+ short s_tree_id_bits;
/*
* The next field is for VFS *only*. No filesystems have any business
* even looking at it. You had been warned.
diff --git a/include/linux/stat.h b/include/linux/stat.h
index fff27e603814..08ee409786b3 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -46,6 +46,19 @@ struct kstat {
struct timespec64 btime; /* File creation time */
u64 blocks;
u64 mnt_id;
+ /* Treeid can be used to extend the inode number space. Two inodes
+ * with different 'tree_id' are different, even if 'ino' is the same
+ * (though fs should make ino different as often as possible).
+ * When tree_id is requested and STATX_TREE_ID is set in result_mask,
+ * 'ino' MUST be unique across the filesystem. Specifically, two
+ * open files that report the same dev, ino, and tree_id MUST be
+ * the same.
+ * If a directory and an object in that directory have the same dev
+ * and tree_id, they can be assumed to be in a meaningful tree, though
+ * the meaning is subject to local interpretation. The set of inodes
+ * with a common tree_id is not required to be contiguous.
+ */
+ u64 tree_id;
};

#endif
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 1500a0f58041..725cf3f8e873 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -124,7 +124,7 @@ struct statx {
__u32 stx_dev_minor;
/* 0x90 */
__u64 stx_mnt_id;
- __u64 __spare2;
+ __u64 stx_tree_id;
/* 0xa0 */
__u64 __spare3[12]; /* Spare space for future expansion */
/* 0x100 */
@@ -152,6 +152,7 @@ struct statx {
#define STATX_BASIC_STATS 0x000007ffU /* The stuff in the normal stat struct */
#define STATX_BTIME 0x00000800U /* Want/got stx_btime */
#define STATX_MNT_ID 0x00001000U /* Got stx_mnt_id */
+#define STATX_TREE_ID 0x00002000U /* Want/got stx_treeid and clean stX_ino */

#define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */

diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c
index 49c7a46cee07..c1141764fa2e 100644
--- a/samples/vfs/test-statx.c
+++ b/samples/vfs/test-statx.c
@@ -118,6 +118,8 @@ static void dump_statx(struct statx *stx)
break;
}
}
+ if (stx->stx_mask & STATX_TREE_ID)
+ printf(" Tree: %-12llu", (unsigned long long) stx->stx_tree_id);
printf("\n");

if (stx->stx_mask & STATX_MODE)
@@ -218,7 +220,7 @@ int main(int argc, char **argv)
struct statx stx;
int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;

- unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
+ unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_TREE_ID;

for (argv++; *argv; argv++) {
if (strcmp(*argv, "-F") == 0) {


2021-08-09 04:00:39

by NeilBrown

[permalink] [raw]
Subject: [PATCH 4/4] Add "tree" number to "inode" number in various /proc files.

Various /proc files reporting locks, inotify status, or memory mappings
currently report device number and inode node.

These are already "broken" for btrfs as the device number is not one
that is reported by "stat()" (though a program could find a way to map a
file to an entry in /proc/self/mountinfo, and get the device number that
way).

This patch changes all the inode number is those files to "tree:inode"
when the treeid is non-zero. This it only affect btrfs (at this stage),
and then only when mounted with "-o numdevs=1", as in other cases there
is no value in changing the proc files.

As none of these call ->getattr() to get ino or dev, I have added i_tree
to struct inode so they can get it directly from there. This isn't
ideal, but is consistent with current code.

Programs that looks for dev:ino based in information from stat(), and
which don't crash on "badly" formatted entries will continue to work as
well as they ever did.

Programs which crash when an entry looks wrong should be fixed anyway.

Programs which correlate a file with /proc/self/mountinfo to find the
"real" device number .... would make me sad.

Signed-off-by: NeilBrown <[email protected]>
---
fs/btrfs/inode.c | 6 ++++++
fs/inode.c | 1 +
fs/locks.c | 12 +++++++++---
fs/notify/fdinfo.c | 19 ++++++++++++++-----
fs/proc/nommu.c | 11 ++++++++---
fs/proc/task_mmu.c | 17 ++++++++++++-----
fs/proc/task_nommu.c | 11 ++++++++---
fs/stat.c | 1 +
include/linux/fs.h | 1 +
9 files changed, 60 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c878726d090c..98ba5f32a2b8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5787,6 +5787,8 @@ static int btrfs_init_locked_inode(struct inode *inode, void *p)
* 'root', and should be nearly unique across the filesystem.
*/
inode->i_ino ^= args->root->inum_overlay;
+ if (args->root && args->root->fs_info->num_devs == 1)
+ inode->i_tree = args->root->root_key.objectid;
BTRFS_I(inode)->location.objectid = args->ino;
BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
BTRFS_I(inode)->location.offset = 0;
@@ -5876,6 +5878,8 @@ static struct inode *new_simple_dir(struct super_block *s,
set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);

inode->i_ino = BTRFS_EMPTY_SUBVOL_DIR_OBJECTID;
+ if (root->fs_info->num_devs == 1)
+ inode->i_tree = root->root_key.objectid;
/*
* We only need lookup, the rest is read-only and there's no inode
* associated with the dentry
@@ -6425,6 +6429,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
inode->i_ino = objectid;
if (objectid != root->inum_overlay)
inode->i_ino ^= root->inum_overlay;
+ if (root->fs_info->num_devs == 1)
+ inode->i_tree = root->root_key.objectid;

if (dir && name) {
trace_btrfs_inode_request(dir);
diff --git a/fs/inode.c b/fs/inode.c
index c93500d84264..7f62ac35de02 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -142,6 +142,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
inode->i_op = &empty_iops;
inode->i_fop = &no_open_fops;
inode->i_ino = 0;
+ inode->i_tree = 0;
inode->__i_nlink = 1;
inode->i_opflags = 0;
if (sb->s_xattr)
diff --git a/fs/locks.c b/fs/locks.c
index 74b2a1dfe8d8..21b28c019052 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2893,9 +2893,15 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
}
if (inode) {
/* userspace relies on this representation of dev_t */
- seq_printf(f, "%d %02x:%02x:%lu ", fl_pid,
- MAJOR(inode->i_sb->s_dev),
- MINOR(inode->i_sb->s_dev), inode->i_ino);
+ if (inode->i_tree)
+ seq_printf(f, "%d %02x:%02x:%lu:%lu ", fl_pid,
+ MAJOR(inode->i_sb->s_dev),
+ MINOR(inode->i_sb->s_dev),
+ inode->i_tree, inode->i_ino);
+ else
+ seq_printf(f, "%d %02x:%02x:%lu ", fl_pid,
+ MAJOR(inode->i_sb->s_dev),
+ MINOR(inode->i_sb->s_dev), inode->i_ino);
} else {
seq_printf(f, "%d <none>:0 ", fl_pid);
}
diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
index 57f0d5d9f934..4e8a363d171b 100644
--- a/fs/notify/fdinfo.c
+++ b/fs/notify/fdinfo.c
@@ -90,9 +90,13 @@ static void inotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark)
* used only internally to the kernel.
*/
u32 mask = mark->mask & IN_ALL_EVENTS;
- seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:%x ",
- inode_mark->wd, inode->i_ino, inode->i_sb->s_dev,
- mask, mark->ignored_mask);
+ seq_printf(m, "inotify wd:%x ", inode_mark->wd);
+ if (inode->i_tree)
+ seq_printf(m, "ino:%lx:%lx ", inode->i_tree, inode->i_ino);
+ else
+ seq_printf(m, "ino:%lx ", inode->i_ino);
+ seq_printf(m, "sdev:%x mask:%x ignored_mask:%x ",
+ inode->i_sb->s_dev, mask, mark->ignored_mask);
show_mark_fhandle(m, inode);
seq_putc(m, '\n');
iput(inode);
@@ -120,8 +124,13 @@ static void fanotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark)
inode = igrab(fsnotify_conn_inode(mark->connector));
if (!inode)
return;
- seq_printf(m, "fanotify ino:%lx sdev:%x mflags:%x mask:%x ignored_mask:%x ",
- inode->i_ino, inode->i_sb->s_dev,
+ if (inode->i_tree)
+ seq_printf(m, "fanotify ino:%lx:%lx", inode->i_tree,
+ inode->i_ino);
+ else
+ seq_printf(m, "fanotify ino:%lx", inode->i_ino);
+ seq_printf(m, " sdev:%x mflags:%x mask:%x ignored_mask:%x ",
+ inode->i_sb->s_dev,
mflags, mark->mask, mark->ignored_mask);
show_mark_fhandle(m, inode);
seq_putc(m, '\n');
diff --git a/fs/proc/nommu.c b/fs/proc/nommu.c
index 13452b32e2bd..371caf60d4a4 100644
--- a/fs/proc/nommu.c
+++ b/fs/proc/nommu.c
@@ -31,7 +31,7 @@
*/
static int nommu_region_show(struct seq_file *m, struct vm_region *region)
{
- unsigned long ino = 0;
+ unsigned long ino = 0, tree = 0;
struct file *file;
dev_t dev = 0;
int flags;
@@ -43,11 +43,12 @@ static int nommu_region_show(struct seq_file *m, struct vm_region *region)
struct inode *inode = file_inode(region->vm_file);
dev = inode->i_sb->s_dev;
ino = inode->i_ino;
+ tree = inode->i_tree;
}

seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
seq_printf(m,
- "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
+ "%08lx-%08lx %c%c%c%c %08llx %02x:%02x ",
region->vm_start,
region->vm_end,
flags & VM_READ ? 'r' : '-',
@@ -55,7 +56,11 @@ static int nommu_region_show(struct seq_file *m, struct vm_region *region)
flags & VM_EXEC ? 'x' : '-',
flags & VM_MAYSHARE ? flags & VM_SHARED ? 'S' : 's' : 'p',
((loff_t)region->vm_pgoff) << PAGE_SHIFT,
- MAJOR(dev), MINOR(dev), ino);
+ MAJOR(dev), MINOR(dev));
+ if (tree)
+ seq_printf(m, "%lu:%lu ", tree, ino);
+ else
+ seq_printf(m, "%lu ", ino);

if (file) {
seq_pad(m, ' ');
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..9e6439d7939b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -250,7 +250,8 @@ static int is_stack(struct vm_area_struct *vma)
static void show_vma_header_prefix(struct seq_file *m,
unsigned long start, unsigned long end,
vm_flags_t flags, unsigned long long pgoff,
- dev_t dev, unsigned long ino)
+ dev_t dev, unsigned long ino,
+ unsigned long tree)
{
seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
seq_put_hex_ll(m, NULL, start, 8);
@@ -263,7 +264,12 @@ static void show_vma_header_prefix(struct seq_file *m,
seq_put_hex_ll(m, " ", pgoff, 8);
seq_put_hex_ll(m, " ", MAJOR(dev), 2);
seq_put_hex_ll(m, ":", MINOR(dev), 2);
- seq_put_decimal_ull(m, " ", ino);
+ if (tree) {
+ seq_put_decimal_ull(m, " ", tree);
+ seq_put_decimal_ull(m, ":", ino);
+ } else {
+ seq_put_decimal_ull(m, " ", ino);
+ }
seq_putc(m, ' ');
}

@@ -273,7 +279,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
struct mm_struct *mm = vma->vm_mm;
struct file *file = vma->vm_file;
vm_flags_t flags = vma->vm_flags;
- unsigned long ino = 0;
+ unsigned long ino = 0, tree = 0;
unsigned long long pgoff = 0;
unsigned long start, end;
dev_t dev = 0;
@@ -283,12 +289,13 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
struct inode *inode = file_inode(vma->vm_file);
dev = inode->i_sb->s_dev;
ino = inode->i_ino;
+ tree = inode->i_tree;
pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
}

start = vma->vm_start;
end = vma->vm_end;
- show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);
+ show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino, tree);

/*
* Print the dentry name for named mappings, and a
@@ -934,7 +941,7 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
}

show_vma_header_prefix(m, priv->mm->mmap->vm_start,
- last_vma_end, 0, 0, 0, 0);
+ last_vma_end, 0, 0, 0, 0, 0);
seq_pad(m, ' ');
seq_puts(m, "[rollup]\n");

diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index a6d21fc0033c..c33d7aad3927 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -145,7 +145,7 @@ static int is_stack(struct vm_area_struct *vma)
static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
{
struct mm_struct *mm = vma->vm_mm;
- unsigned long ino = 0;
+ unsigned long ino = 0, tree = 0;
struct file *file;
dev_t dev = 0;
int flags;
@@ -158,12 +158,13 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
struct inode *inode = file_inode(vma->vm_file);
dev = inode->i_sb->s_dev;
ino = inode->i_ino;
+ tree = inode->i_tree;
pgoff = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
}

seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
seq_printf(m,
- "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
+ "%08lx-%08lx %c%c%c%c %08llx %02x:%02x ",
vma->vm_start,
vma->vm_end,
flags & VM_READ ? 'r' : '-',
@@ -171,7 +172,11 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
flags & VM_EXEC ? 'x' : '-',
flags & VM_MAYSHARE ? flags & VM_SHARED ? 'S' : 's' : 'p',
pgoff,
- MAJOR(dev), MINOR(dev), ino);
+ MAJOR(dev), MINOR(dev));
+ if (tree)
+ seq_printf(m, "%lu:%lu ", ino, tree);
+ else
+ seq_printf(m, "%lu ", ino);

if (file) {
seq_pad(m, ' ');
diff --git a/fs/stat.c b/fs/stat.c
index 2dd5d3d67793..4aa402858f64 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -45,6 +45,7 @@ void generic_fillattr(struct user_namespace *mnt_userns, struct inode *inode,
{
stat->dev = inode->i_sb->s_dev;
stat->ino = inode->i_ino;
+ stat->tree_id = inode->i_tree;
stat->mode = inode->i_mode;
stat->nlink = inode->i_nlink;
stat->uid = i_uid_into_mnt(mnt_userns, inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a777c1b1706a..86dc586c408b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -630,6 +630,7 @@ struct inode {

/* Stat data, not accessed from path walking */
unsigned long i_ino;
+ unsigned long i_tree;
/*
* Filesystems may only read i_nlink directly. They shall use the
* following functions for modification:


2021-08-09 04:01:34

by NeilBrown

[permalink] [raw]
Subject: [PATCH 1/4] btrfs: include subvol identifier in inode number if -o inumbits=...

A btrfs filesystem uses 112 bits to identify an object: 48 for the
object-id of the subvolume (really a 'subtree') and 64 for the object
within that tree.

It only exposes the 64bits in i_ino (and st_ino for stat()) and attempts
to hide the non-uniqueness by reporting a different st_dev in stat() for
each different subvol.

This is incomplete and doesn't scale.

It is incomplete because there are places other than 'stat()' where the
device number is visible to user-space, including /proc/$PID/mountinfo,
/proc/$PID/maps and /proc/locks. These report the device-number for the
whole filesystem together with i_ino, and so do not match the
st_dev+st_ino reported by stat().

It is also incomplete because nfsd doesn't notice the st_dev value,
depending only of the existence of mount points to discover filesystem
changes.

It doesn't scale because there are a limited number of anon device
numbers that can be allocated, which is approximately 20 bits, much less
than the 48 bits worth of subvols which btrfs supports.

I believe that we *must* extend the user-space API to properly support
btrfs - trying to fit within it will always cause pain, at least in the
extremes. I believe that the use of varying device numbers is not a
viable long-term solution and must be deprecated.

This patch is a first step towards deprecating the use of device numbers
to add uniqueness. It changes the inode numbers reported so they are
unique across all subvols for modest sized filesystems. It does this
by creating an 'overlay' from the subvol number and xor-ing that into
the file's object id. This results in reported inode numbers being
completely unique within a subvol, and mostly unique between subvols.

The overlay is *not* xor-ed in when it exactly matches the objectid, as
that would produce zero.

A few placed in the code assume that ->i_ino is the objectid. Those few
are changed to call btrfs_ino(), and ino now subtracts the overlay.

The "overlay" is created by byte-swapping the subvol identifier, then
optionally shifting down a few bits so there is unused space at the
top-end. When the maximum objectid in use requires many fewer than 64
bits, and the maximum subvol id in use does not use all of the
remaining bits, complete uniqueness can be provided. For larger
fileystems, complete uniqueness cannot be guaranteed.

The size of the shift can be set using the "inumbits" mount option. A
value of 64 suppresses any shift and maximum uniqueness is provided. A
value of 0 (the default) disables the overlay functionality.

A generally good value on 64bit systems which might use overlayfs with
btrdfs is 56, as this provides broad uniqueness, while leaving some bits
for overlayfs to differentiate between merged filesystems.

The only material improvement from this patch alone will come to
applications and tools which do not pay attention to st_dev (as that is
unchanged). In particular, nfsd will, when "-o inumbits=56", report
(mostly) unique inode numbers to the NFS client, and some problems
caused by "find" and related tools detecting that the root of a subvol
has the name inode number as the root of its parent - both of which
appear to NFS to be in the same filesystem.

Subsequent patches will build on this base to allow the use of multiple
devices to be controlled, and then to allow complete uniqueness through
interface extensions.

ISSUE: In btrfs, inode numbers below the highest-currently-allocated are
never reused. This allows the highest inode number to be arbitrarily
higher than the number of inodes. This means that an "old"
filesystem can trigger a risk of non-uniqueness just as a large
filesystem can.

ISSUE: I don't understand the role of BTRFS_EMPTY_SUBVOL_DIR_OBJECTID
so I don't know if I have to do anything when that value is assigned
to i_ino.

Signed-off-by: NeilBrown <[email protected]>
---
fs/btrfs/btrfs_inode.h | 22 ++++++++++++++--------
fs/btrfs/ctree.h | 4 ++++
fs/btrfs/disk-io.c | 15 +++++++++++++++
fs/btrfs/inode.c | 17 ++++++++++++++---
fs/btrfs/ioctl.c | 2 +-
fs/btrfs/super.c | 24 +++++++++++++++++++++++-
6 files changed, 71 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index c652e19ad74e..18e1b071bb69 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -246,13 +246,6 @@ static inline unsigned long btrfs_inode_hash(u64 objectid,
return (unsigned long)h;
}

-static inline void btrfs_insert_inode_hash(struct inode *inode)
-{
- unsigned long h = btrfs_inode_hash(inode->i_ino, BTRFS_I(inode)->root);
-
- __insert_inode_hash(inode, h);
-}
-
static inline u64 btrfs_ino(const struct btrfs_inode *inode)
{
u64 ino = inode->location.objectid;
@@ -261,11 +254,24 @@ static inline u64 btrfs_ino(const struct btrfs_inode *inode)
* !ino: btree_inode
* type == BTRFS_ROOT_ITEM_KEY: subvol dir
*/
- if (!ino || inode->location.type == BTRFS_ROOT_ITEM_KEY)
+ if (!ino || inode->location.type == BTRFS_ROOT_ITEM_KEY) {
+ /* vfs_inode.i_ino has inum_overlay merged in, when
+ * that wouldn't produce zero. We need to remove it here.
+ */
ino = inode->vfs_inode.i_ino;
+ if (ino != inode->root->inum_overlay)
+ ino ^= inode->root->inum_overlay;
+ }
return ino;
}

+static inline void btrfs_insert_inode_hash(struct inode *inode)
+{
+ unsigned long h = btrfs_inode_hash(btrfs_ino(BTRFS_I(inode)), BTRFS_I(inode)->root);
+
+ __insert_inode_hash(inode, h);
+}
+
static inline void btrfs_i_size_write(struct btrfs_inode *inode, u64 size)
{
i_size_write(&inode->vfs_inode, size);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e5e53e592d4f..0ef557db3a8b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -987,6 +987,8 @@ struct btrfs_fs_info {
u32 csums_per_leaf;
u32 stripesize;

+ unsigned short inumbits;
+
/* Block groups and devices containing active swapfiles. */
spinlock_t swapfile_pins_lock;
struct rb_root swapfile_pins;
@@ -1145,6 +1147,8 @@ struct btrfs_root {

u64 last_trans;

+ u64 inum_overlay;
+
u32 type;

u64 free_objectid;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a59ab7b9aea0..7f3bfa042d66 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1202,6 +1202,12 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
memset(&root->defrag_progress, 0, sizeof(root->defrag_progress));
root->root_key.objectid = objectid;
root->anon_dev = 0;
+ if (fs_info->inumbits &&
+ objectid != BTRFS_FS_TREE_OBJECTID &&
+ is_fstree(objectid))
+ root->inum_overlay = swab64(objectid) >> (64 - fs_info->inumbits);
+ else
+ root->inum_overlay = 0;

spin_lock_init(&root->root_item_lock);
btrfs_qgroup_init_swapped_blocks(&root->swapped_blocks);
@@ -3314,12 +3320,21 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
*/
fs_info->compress_type = BTRFS_COMPRESS_ZLIB;

+ fs_info->inumbits = BITS_PER_LONG + 1; /* impossible value */
+
ret = btrfs_parse_options(fs_info, options, sb->s_flags);
if (ret) {
err = ret;
goto fail_alloc;
}

+ /* By default, use inumbits=0 to avoid behaviour change.
+ * "-o inumbits" can over-ride this default.
+ * BITS_PER_LONG * 7 / 8 is a good value to use
+ */
+ if (fs_info->inumbits > BITS_PER_LONG)
+ fs_info->inumbits = 0;
+
features = btrfs_super_incompat_flags(disk_super) &
~BTRFS_FEATURE_INCOMPAT_SUPP;
if (features) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 01099bf602fb..860cb5045123 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5782,6 +5782,11 @@ static int btrfs_init_locked_inode(struct inode *inode, void *p)
struct btrfs_iget_args *args = p;

inode->i_ino = args->ino;
+ if (args->root && args->ino != args->root->inum_overlay)
+ /* This inode number will still be unique within this
+ * 'root', and should be nearly unique across the filesystem.
+ */
+ inode->i_ino ^= args->root->inum_overlay;
BTRFS_I(inode)->location.objectid = args->ino;
BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
BTRFS_I(inode)->location.offset = 0;
@@ -6092,6 +6097,7 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)

while (1) {
struct dir_entry *entry;
+ u64 inum;

leaf = path->nodes[0];
slot = path->slots[0];
@@ -6136,7 +6142,10 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)),
&entry->type);
btrfs_dir_item_key_to_cpu(leaf, di, &location);
- put_unaligned(location.objectid, &entry->ino);
+ inum = location.objectid;
+ if (inum != root->inum_overlay)
+ inum ^= root->inum_overlay;
+ put_unaligned(inum, &entry->ino);
put_unaligned(found_key.offset, &entry->offset);
entries++;
addr += sizeof(struct dir_entry) + name_len;
@@ -6333,7 +6342,7 @@ static int btrfs_insert_inode_locked(struct inode *inode)
args.root = BTRFS_I(inode)->root;

return insert_inode_locked4(inode,
- btrfs_inode_hash(inode->i_ino, BTRFS_I(inode)->root),
+ btrfs_inode_hash(btrfs_ino(BTRFS_I(inode)), BTRFS_I(inode)->root),
btrfs_find_actor, &args);
}

@@ -6412,6 +6421,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
* number if we fail afterwards in this function.
*/
inode->i_ino = objectid;
+ if (objectid != root->inum_overlay)
+ inode->i_ino ^= root->inum_overlay;

if (dir && name) {
trace_btrfs_inode_request(dir);
@@ -9515,7 +9526,7 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,


/* check for collisions, even if the name isn't there */
- ret = btrfs_check_dir_item_collision(dest, new_dir->i_ino,
+ ret = btrfs_check_dir_item_collision(dest, btrfs_ino(BTRFS_I(new_dir)),
new_dentry->d_name.name,
new_dentry->d_name.len);

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba98e08a029..e008a9ceb827 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -907,7 +907,7 @@ static noinline int btrfs_mksubvol(const struct path *parent,
* check for them now when we can safely fail
*/
error = btrfs_check_dir_item_collision(BTRFS_I(dir)->root,
- dir->i_ino, name,
+ btrfs_ino(BTRFS_I(dir)), name,
namelen);
if (error)
goto out_dput;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d07b18b2b250..5f3350e2f7ec 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -359,6 +359,7 @@ enum {
Opt_defrag, Opt_nodefrag,
Opt_discard, Opt_nodiscard,
Opt_discard_mode,
+ Opt_inumbits,
Opt_norecovery,
Opt_ratio,
Opt_rescan_uuid_tree,
@@ -427,6 +428,7 @@ static const match_table_t tokens = {
{Opt_nodefrag, "noautodefrag"},
{Opt_discard, "discard"},
{Opt_discard_mode, "discard=%s"},
+ {Opt_inumbits, "inumbits=%u"},
{Opt_nodiscard, "nodiscard"},
{Opt_norecovery, "norecovery"},
{Opt_ratio, "metadata_ratio=%u"},
@@ -830,6 +832,25 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
btrfs_clear_and_info(info, FLUSHONCOMMIT,
"turning off flush-on-commit");
break;
+ case Opt_inumbits:
+ if (info->inumbits <= BITS_PER_LONG)
+ /* silently ignore subsequent change
+ * e.g. on remount
+ */
+ break;
+ ret = match_int(&args[0], &intarg);
+ if (ret)
+ goto out;
+ if (intarg > BITS_PER_LONG ||
+ (intarg && intarg < BITS_PER_LONG / 2)) {
+ btrfs_err(info,
+ "inumbits must be 0 or in range [%d..%d]",
+ BITS_PER_LONG/2, BITS_PER_LONG);
+ ret = -EINVAL;
+ goto out;
+ }
+ info->inumbits = intarg;
+ break;
case Opt_ratio:
ret = match_int(&args[0], &intarg);
if (ret)
@@ -1537,6 +1558,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
#endif
if (btrfs_test_opt(info, REF_VERIFY))
seq_puts(seq, ",ref_verify");
+ seq_printf(seq, ",inumbits=%u", info->inumbits);
seq_printf(seq, ",subvolid=%llu",
BTRFS_I(d_inode(dentry))->root->root_key.objectid);
subvol_name = btrfs_get_subvol_name_from_objectid(info,
@@ -1570,7 +1592,7 @@ static int btrfs_set_super(struct super_block *s, void *data)
*/
static inline int is_subvolume_inode(struct inode *inode)
{
- if (inode && inode->i_ino == BTRFS_FIRST_FREE_OBJECTID)
+ if (inode && btrfs_ino(BTRFS_I(inode)) == BTRFS_FIRST_FREE_OBJECTID)
return 1;
return 0;
}


2021-08-09 07:53:08

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 2/4] btrfs: add numdevs= mount option.

Hi NeilBrown,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on ext3/fsnotify linus/master v5.14-rc5 next-20210806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/NeilBrown/Attempt-to-make-progress-with-btrfs-dev-number-strangeness/20210809-120046
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-randconfig-c001-20210809 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project c5c3cdb9c92895a63993cee70d2dd776ff9519c3)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/c5bae87ed5b72b9fd999fa935f477483da001f63
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review NeilBrown/Attempt-to-make-progress-with-btrfs-dev-number-strangeness/20210809-120046
git checkout c5bae87ed5b72b9fd999fa935f477483da001f63
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> fs/btrfs/ioctl.c:740:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
if (fs_info->num_devs == BTRFS_MANY_DEVS)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/ioctl.c:742:6: note: uninitialized use occurs here
if (ret < 0)
^~~
fs/btrfs/ioctl.c:740:2: note: remove the 'if' if its condition is always true
if (fs_info->num_devs == BTRFS_MANY_DEVS)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/ioctl.c:725:9: note: initialize the variable 'ret' to silence this warning
int ret;
^
= 0
1 warning generated.


vim +740 fs/btrfs/ioctl.c

716
717 static int create_snapshot(struct btrfs_root *root, struct inode *dir,
718 struct dentry *dentry, bool readonly,
719 struct btrfs_qgroup_inherit *inherit)
720 {
721 struct btrfs_fs_info *fs_info = btrfs_sb(dir->i_sb);
722 struct inode *inode;
723 struct btrfs_pending_snapshot *pending_snapshot;
724 struct btrfs_trans_handle *trans;
725 int ret;
726
727 if (!test_bit(BTRFS_ROOT_SHAREABLE, &root->state))
728 return -EINVAL;
729
730 if (atomic_read(&root->nr_swapfiles)) {
731 btrfs_warn(fs_info,
732 "cannot snapshot subvolume with active swapfile");
733 return -ETXTBSY;
734 }
735
736 pending_snapshot = kzalloc(sizeof(*pending_snapshot), GFP_KERNEL);
737 if (!pending_snapshot)
738 return -ENOMEM;
739
> 740 if (fs_info->num_devs == BTRFS_MANY_DEVS)
741 ret = get_anon_bdev(&pending_snapshot->anon_dev);
742 if (ret < 0)
743 goto free_pending;
744 pending_snapshot->root_item = kzalloc(sizeof(struct btrfs_root_item),
745 GFP_KERNEL);
746 pending_snapshot->path = btrfs_alloc_path();
747 if (!pending_snapshot->root_item || !pending_snapshot->path) {
748 ret = -ENOMEM;
749 goto free_pending;
750 }
751
752 btrfs_init_block_rsv(&pending_snapshot->block_rsv,
753 BTRFS_BLOCK_RSV_TEMP);
754 /*
755 * 1 - parent dir inode
756 * 2 - dir entries
757 * 1 - root item
758 * 2 - root ref/backref
759 * 1 - root of snapshot
760 * 1 - UUID item
761 */
762 ret = btrfs_subvolume_reserve_metadata(BTRFS_I(dir)->root,
763 &pending_snapshot->block_rsv, 8,
764 false);
765 if (ret)
766 goto free_pending;
767
768 pending_snapshot->dentry = dentry;
769 pending_snapshot->root = root;
770 pending_snapshot->readonly = readonly;
771 pending_snapshot->dir = dir;
772 pending_snapshot->inherit = inherit;
773
774 trans = btrfs_start_transaction(root, 0);
775 if (IS_ERR(trans)) {
776 ret = PTR_ERR(trans);
777 goto fail;
778 }
779
780 spin_lock(&fs_info->trans_lock);
781 list_add(&pending_snapshot->list,
782 &trans->transaction->pending_snapshots);
783 spin_unlock(&fs_info->trans_lock);
784
785 ret = btrfs_commit_transaction(trans);
786 if (ret)
787 goto fail;
788
789 ret = pending_snapshot->error;
790 if (ret)
791 goto fail;
792
793 ret = btrfs_orphan_cleanup(pending_snapshot->snap);
794 if (ret)
795 goto fail;
796
797 inode = btrfs_lookup_dentry(d_inode(dentry->d_parent), dentry);
798 if (IS_ERR(inode)) {
799 ret = PTR_ERR(inode);
800 goto fail;
801 }
802
803 d_instantiate(dentry, inode);
804 ret = 0;
805 pending_snapshot->anon_dev = 0;
806 fail:
807 /* Prevent double freeing of anon_dev */
808 if (ret && pending_snapshot->snap)
809 pending_snapshot->snap->anon_dev = 0;
810 btrfs_put_root(pending_snapshot->snap);
811 btrfs_subvolume_release_metadata(root, &pending_snapshot->block_rsv);
812 free_pending:
813 if (pending_snapshot->anon_dev)
814 free_anon_bdev(pending_snapshot->anon_dev);
815 kfree(pending_snapshot->root_item);
816 btrfs_free_path(pending_snapshot->path);
817 kfree(pending_snapshot);
818
819 return ret;
820 }
821

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (5.86 kB)
.config.gz (32.10 kB)
Download all attachments

2021-08-10 20:51:46

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.

On 8/8/21 11:55 PM, NeilBrown wrote:
> I continue to search for a way forward for btrfs so that its behaviour
> with respect to device numbers and subvols is somewhat coherent.
>
> This series implements some of the ideas in my "A Third perspective"[1],
> though with changes is various details.
>
> I introduce two new mount options, which default to
> no-change-in-behaviour.
>
> -o inumbits= causes inode numbers to be more unique across a whole btrfs
> filesystem, and is many cases completely unique. Mounting
> with "-i inumbits=56" will resolve the NFS issues that
> started me tilting at this particular windmill.
>
> -o numdevs= can reduce the number of distinct devices reported by
> stat(), either to 2 or to 1.
> Both ease problems for sites that exhaust their supply of
> device numbers.
> '2' allows "du -x" to continue to work, but is otherwise
> rather strange.
> '1' breaks the use of "du -x" and similar to examine a
> single subvol which might have subvol descendants, but
> provides generally sane behaviour
> "-o numdevs=1" also forces inumbits to have a useful value.
>
> I introduce a "tree id" which can be discovered using statx(). Two
> files with the same dev and ino might still be different if the tree-ids
> are different. Connected files with the same tree-id may be usefully
> considered to be related.
>
> I also change various /proc files (only when numdevs=1 is used) to
> provide extra information so they are useful with btrfs despite subvols.
> /proc/maps /proc/smaps /proc/locks /proc/X/fdinfo/Y are affected.
> The inode number becomes "XX:YY" where XX is the subvol number (tree id)
> and YY is the inode number.
>
> An alternate might be to report a number which might use up to 128 bits.
> Which is less likely to seriously break code?
>
> Note that code which ignores badly formatted lines is safe, because it
> will never currently find a match for a btrfs file in these files
> anyway. The device number they report is never returned in st_dev for
> stat() on any file.
>
> The audit subsystem and one or two other places report dev/ino and so
> need enhanced, but I haven't tried to address those.
>
> Various trace points also report dev/ino. I haven't tried thinking
> about those either.

I think this is a step in the right direction, but I want to figure out a way to
accomplish this without magical mount points that users must be aware of.

I think the stat() st_dev ship as sailed, we're stuck with that. However
Christoph does have a valid point where it breaks the various info spit out by
/proc. You've done a good job with the treeid here, but it still makes it
impossible for somebody to map the st_dev back to the correct mount.

I think we aren't going to solve that problem, at least not with stat(). I
think with statx() spitting out treeid we have given userspace a way to
differentiate subvolumes, and so we should fix statx() to spit out the the super
block device, that way new userspace things can do their appropriate lookup if
they so choose.

This leaves the problem of nfsd. Can you just integrate this new treeid into
nfsd, and use that to either change the ino within nfsd itself, or do something
similar to what your first patchset did and generate a fsid based on the treeid?

Mount options are messy, and are just going to lead to distro's turning them on
without understanding what's going on and then we have to support them forever.
I want to get this fixed in a way that we all hate the least with as little
opportunity for confused users to make bad decisions. Thanks,

Josef

2021-08-11 22:14:09

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.

On Wed, 11 Aug 2021, Josef Bacik wrote:
>
> I think this is a step in the right direction, but I want to figure out a way to
> accomplish this without magical mount points that users must be aware of.

magic mount *options* ???

>
> I think the stat() st_dev ship as sailed, we're stuck with that. However
> Christoph does have a valid point where it breaks the various info spit out by
> /proc. You've done a good job with the treeid here, but it still makes it
> impossible for somebody to map the st_dev back to the correct mount.

The ship might have sailed, but it is not water tight. And as the world
it round, it can still come back to bite us from behind.
Anything can be transitioned away from, whether it is devfs or 32-bit
time or giving different device numbers to different file-trees.

The linkage between device number and and filesystem is quite strong.
We could modified all of /proc and /sys/ and audit and whatever else to
report the fake device number, but we cannot get the fake device number
into the mount table (without making the mount table unmanageablely
large).
And if subtrees aren't in the mount-table for the NFS server, I don't
think they should be in the mount-table of the NFS client. So we cannot
export them to NFS.

I understand your dislike for mount options. An alternative with
different costs and benefits would be to introduce a new filesystem type
- btrfs2 or maybe betrfs. This would provide numdevs=1 semantics and do
whatever we decided was best with inode numbers. How much would you
hate that?

>
> I think we aren't going to solve that problem, at least not with stat(). I
> think with statx() spitting out treeid we have given userspace a way to
> differentiate subvolumes, and so we should fix statx() to spit out the the super
> block device, that way new userspace things can do their appropriate lookup if
> they so choose.

I don't think we should normalize having multiple devnums per filesystem
by encoding it in statx(). It *would* make sense to add a btrfs ioctl
which reports the real device number of a file. Tools that really need
to work with btrfs could use that, but it would always be obvious that
it was an exception.

>
> This leaves the problem of nfsd. Can you just integrate this new treeid into
> nfsd, and use that to either change the ino within nfsd itself, or do something
> similar to what your first patchset did and generate a fsid based on the treeid?

I would only want nfsd to change the inode number. I no longer think it
is acceptable for nfsd to report different device number (as I mention
above).
I would want the new inode number to be explicitly provided by the
filesystem. Whether that is a new export_operation or a new field in
'struct kstat' doesn't really bother me. I'd *prefer* it to be st_ino,
but I can live without that.

On the topic of inode numbers.... I've recently learned that btrfs
never reuses inode (objectid) numbers (except possibly after an
unmount). Equally it doesn't re-use subvol numbers. How much does this
contribute to the 64 bits not being enough for subtree+inode?

It would be nice if we could be comfortable limiting the objectid number
to 40 bits and the root.objectid (filetree) number to 24 bits, and
combine them into a 64bit inode number.

If we added a inode number reuse scheme that was suitably performant,
would that make this possible? That would remove the need for a treeid,
and allow us to use project-id to identify subtrees.

>
> Mount options are messy, and are just going to lead to distro's turning them on
> without understanding what's going on and then we have to support them forever.
> I want to get this fixed in a way that we all hate the least with as little
> opportunity for confused users to make bad decisions. Thanks,

Hence my question: how much do you hate creating a new filesystem type
to fix the problems?

Thanks,
NeilBrown

2021-08-12 13:56:15

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.

On 8/11/21 6:13 PM, NeilBrown wrote:
> On Wed, 11 Aug 2021, Josef Bacik wrote:
>>
>> I think this is a step in the right direction, but I want to figure out a way to
>> accomplish this without magical mount points that users must be aware of.
>
> magic mount *options* ???
>
>>
>> I think the stat() st_dev ship as sailed, we're stuck with that. However
>> Christoph does have a valid point where it breaks the various info spit out by
>> /proc. You've done a good job with the treeid here, but it still makes it
>> impossible for somebody to map the st_dev back to the correct mount.
>
> The ship might have sailed, but it is not water tight. And as the world
> it round, it can still come back to bite us from behind.
> Anything can be transitioned away from, whether it is devfs or 32-bit
> time or giving different device numbers to different file-trees.
>
> The linkage between device number and and filesystem is quite strong.
> We could modified all of /proc and /sys/ and audit and whatever else to
> report the fake device number, but we cannot get the fake device number
> into the mount table (without making the mount table unmanageablely
> large).
> And if subtrees aren't in the mount-table for the NFS server, I don't
> think they should be in the mount-table of the NFS client. So we cannot
> export them to NFS.
>
> I understand your dislike for mount options. An alternative with
> different costs and benefits would be to introduce a new filesystem type
> - btrfs2 or maybe betrfs. This would provide numdevs=1 semantics and do
> whatever we decided was best with inode numbers. How much would you
> hate that?
>

A lot more ;).

>>
>> I think we aren't going to solve that problem, at least not with stat(). I
>> think with statx() spitting out treeid we have given userspace a way to
>> differentiate subvolumes, and so we should fix statx() to spit out the the super
>> block device, that way new userspace things can do their appropriate lookup if
>> they so choose.
>
> I don't think we should normalize having multiple devnums per filesystem
> by encoding it in statx(). It *would* make sense to add a btrfs ioctl
> which reports the real device number of a file. Tools that really need
> to work with btrfs could use that, but it would always be obvious that
> it was an exception.

That's not what I'm saying. I'm saying that stat() continues to behave the way
it currently does, for legacy users.

And then for statx() it returns the correct devnum like any other file system,
with the augmentation of the treeid so that future userspace programs can use
the treeid to decide if they want to wander into a subvolume.

This way moving forward we have a way to map back to a mount point because
statx() will return the actual devnum for the mountpoint, and then we can use
the treeid to be smart about when we wander into a subvolume.

And if we're going to add a treeid, I would actually like to add a parent_treeid
as well so we could tell if we're a snapshot or just a normal subvolume.

>
>>
>> This leaves the problem of nfsd. Can you just integrate this new treeid into
>> nfsd, and use that to either change the ino within nfsd itself, or do something
>> similar to what your first patchset did and generate a fsid based on the treeid?
>
> I would only want nfsd to change the inode number. I no longer think it
> is acceptable for nfsd to report different device number (as I mention
> above).
> I would want the new inode number to be explicitly provided by the
> filesystem. Whether that is a new export_operation or a new field in
> 'struct kstat' doesn't really bother me. I'd *prefer* it to be st_ino,
> but I can live without that.
>

Right, I'm not saying nfsd has to propagate our dev_t thing, I'm saying that you
could accomplish the same behavior without the mount options. We add either a
new SB_I_HAS_TREEID or FS_HAS_TREEID, depending on if you prefer to tag the sb
or the fs_type, and then NFS does the inode number magic transformation
automatically and we are good to go.

> On the topic of inode numbers.... I've recently learned that btrfs
> never reuses inode (objectid) numbers (except possibly after an
> unmount). Equally it doesn't re-use subvol numbers. How much does this
> contribute to the 64 bits not being enough for subtree+inode?
>
> It would be nice if we could be comfortable limiting the objectid number
> to 40 bits and the root.objectid (filetree) number to 24 bits, and
> combine them into a 64bit inode number.
>
> If we added a inode number reuse scheme that was suitably performant,
> would that make this possible? That would remove the need for a treeid,
> and allow us to use project-id to identify subtrees.
>

We had a resuse scheme, we deprecated and deleted it. I don't want to
arbitrarily limit objectid's to work around this issue.

>>
>> Mount options are messy, and are just going to lead to distro's turning them on
>> without understanding what's going on and then we have to support them forever.
>> I want to get this fixed in a way that we all hate the least with as little
>> opportunity for confused users to make bad decisions. Thanks,
>
> Hence my question: how much do you hate creating a new filesystem type
> to fix the problems?
>

I'm still not convinced we can't solve this without adding new options or
fstypes. I think flags to indicate that we're special and to use a treeid that
we stuff into the inode would be a reasonable solution. That being said I'm a
little sleep deprived so I could be missing why my plan is a bad one, so I'm
willing to be convinced that mount options are the solution to this, but I want
to make sure we're damned certain that's the best way forward. Thanks,

Josef

2021-08-12 15:18:10

by Hugo Mills

[permalink] [raw]
Subject: Re: [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.

On Thu, Aug 12, 2021 at 09:54:54AM -0400, Josef Bacik wrote:
> On 8/11/21 6:13 PM, NeilBrown wrote:
> > On Wed, 11 Aug 2021, Josef Bacik wrote:
> > >
> > > I think this is a step in the right direction, but I want to figure out a way to
> > > accomplish this without magical mount points that users must be aware of.
> >
> > magic mount *options* ???
> >
> > >
> > > I think the stat() st_dev ship as sailed, we're stuck with that. However
> > > Christoph does have a valid point where it breaks the various info spit out by
> > > /proc. You've done a good job with the treeid here, but it still makes it
> > > impossible for somebody to map the st_dev back to the correct mount.
> >
> > The ship might have sailed, but it is not water tight. And as the world
> > it round, it can still come back to bite us from behind.
> > Anything can be transitioned away from, whether it is devfs or 32-bit
> > time or giving different device numbers to different file-trees.
> >
> > The linkage between device number and and filesystem is quite strong.
> > We could modified all of /proc and /sys/ and audit and whatever else to
> > report the fake device number, but we cannot get the fake device number
> > into the mount table (without making the mount table unmanageablely
> > large).
> > And if subtrees aren't in the mount-table for the NFS server, I don't
> > think they should be in the mount-table of the NFS client. So we cannot
> > export them to NFS.
> >
> > I understand your dislike for mount options. An alternative with
> > different costs and benefits would be to introduce a new filesystem type
> > - btrfs2 or maybe betrfs. This would provide numdevs=1 semantics and do
> > whatever we decided was best with inode numbers. How much would you
> > hate that?
> >
>
> A lot more ;).
>
> > >
> > > I think we aren't going to solve that problem, at least not with stat(). I
> > > think with statx() spitting out treeid we have given userspace a way to
> > > differentiate subvolumes, and so we should fix statx() to spit out the the super
> > > block device, that way new userspace things can do their appropriate lookup if
> > > they so choose.
> >
> > I don't think we should normalize having multiple devnums per filesystem
> > by encoding it in statx(). It *would* make sense to add a btrfs ioctl
> > which reports the real device number of a file. Tools that really need
> > to work with btrfs could use that, but it would always be obvious that
> > it was an exception.
>
> That's not what I'm saying. I'm saying that stat() continues to behave the
> way it currently does, for legacy users.
>
> And then for statx() it returns the correct devnum like any other file
> system, with the augmentation of the treeid so that future userspace
> programs can use the treeid to decide if they want to wander into a
> subvolume.
>
> This way moving forward we have a way to map back to a mount point because
> statx() will return the actual devnum for the mountpoint, and then we can
> use the treeid to be smart about when we wander into a subvolume.
>
> And if we're going to add a treeid, I would actually like to add a
> parent_treeid as well so we could tell if we're a snapshot or just a normal
> subvolume.

Can I make a request to call it something other than a
"parent". There's at least three different usages of "parent" for
three different concepts related to subvolumes in btrfs(*), and it'd
be nice to avoid the inevitable confusion.

(*) 1. "subvolume containing this one",
2. "subvolume that was snapshotted to make this one", and,
3. at least informally, "subvolume that was sent/received to make this one"

Hugo.

[snip to end]

--
Hugo Mills | Reading Mein Kampf won't make you a Nazi. Reading
hugo@... carfax.org.uk | Das Kapital won't make you a communist. But most
http://carfax.org.uk/ | trolls started out with a copy of Lord of the Rings.
PGP: E2AB1DE4 |

2021-08-12 23:40:12

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH/RFC 0/4] Attempt to make progress with btrfs dev number strangeness.

On Thu, 12 Aug 2021, Josef Bacik wrote:
> On 8/11/21 6:13 PM, NeilBrown wrote:
> > On Wed, 11 Aug 2021, Josef Bacik wrote:
> >>
> >> I think this is a step in the right direction, but I want to figure out a way to
> >> accomplish this without magical mount points that users must be aware of.
> >
> > magic mount *options* ???
> >
> >>
> >> I think the stat() st_dev ship as sailed, we're stuck with that. However
> >> Christoph does have a valid point where it breaks the various info spit out by
> >> /proc. You've done a good job with the treeid here, but it still makes it
> >> impossible for somebody to map the st_dev back to the correct mount.
> >
> > The ship might have sailed, but it is not water tight. And as the world
> > it round, it can still come back to bite us from behind.
> > Anything can be transitioned away from, whether it is devfs or 32-bit
> > time or giving different device numbers to different file-trees.
> >
> > The linkage between device number and and filesystem is quite strong.
> > We could modified all of /proc and /sys/ and audit and whatever else to
> > report the fake device number, but we cannot get the fake device number
> > into the mount table (without making the mount table unmanageablely
> > large).
> > And if subtrees aren't in the mount-table for the NFS server, I don't
> > think they should be in the mount-table of the NFS client. So we cannot
> > export them to NFS.
> >
> > I understand your dislike for mount options. An alternative with
> > different costs and benefits would be to introduce a new filesystem type
> > - btrfs2 or maybe betrfs. This would provide numdevs=1 semantics and do
> > whatever we decided was best with inode numbers. How much would you
> > hate that?
> >
>
> A lot more ;).
>
> >>
> >> I think we aren't going to solve that problem, at least not with stat(). I
> >> think with statx() spitting out treeid we have given userspace a way to
> >> differentiate subvolumes, and so we should fix statx() to spit out the the super
> >> block device, that way new userspace things can do their appropriate lookup if
> >> they so choose.
> >
> > I don't think we should normalize having multiple devnums per filesystem
> > by encoding it in statx(). It *would* make sense to add a btrfs ioctl
> > which reports the real device number of a file. Tools that really need
> > to work with btrfs could use that, but it would always be obvious that
> > it was an exception.
>
> That's not what I'm saying. I'm saying that stat() continues to behave the way
> it currently does, for legacy users.
>
> And then for statx() it returns the correct devnum like any other file system,
> with the augmentation of the treeid so that future userspace programs can use
> the treeid to decide if they want to wander into a subvolume.

Yes, that is what I thought you were saying. It implies that the
possibility of a file having two different device numbers becomes
normalised in the API - one returned by stat(), the other by statx()
(presumably in a new field - the FS cannot tell what libc call the
application made). I don't like that.

>
> This way moving forward we have a way to map back to a mount point because
> statx() will return the actual devnum for the mountpoint, and then we can use
> the treeid to be smart about when we wander into a subvolume.

We already have a way to map back to a mountpoint. statx reports a
mnt_id with result flag STATX_MNT_ID. This is the number at the start
of the line in mountinfo. Hmmm, this isn't in the manpage. It has been
in the kernel since Linux 5.8. I'll send a patch for the manpage.

So we could pursue a path where the device-id no longer defines the
filesystem (or mount), but instead it defines some arbitrary grouping of
objects within a filesystem. So instead of my proposed
dev-id / subtree-id / inode-number
we would have
dev-id-in-mountinfo / mnt_id / dev-id-in-stat / inode-number

In some ways this would be a smoother path forward - no change to statx,
no new concepts, just formalizing some de-facto concepts.
In other ways it might be rougher - we would need to convince the
community to use the stat() dev-id in all those proc files etc.

I think having the two meanings for a device-id would cause confusion for
quite some years..... but then any change will probably cause confusion.

>
> And if we're going to add a treeid, I would actually like to add a parent_treeid
> as well so we could tell if we're a snapshot or just a normal subvolume.

Is this a well-defined concept? Isn't "snapshot" just one possible
use-case for the btrfs functionality of creating a reflink to a subtree?
What happens to the "parent_treeid" reference when that "parent" gets
deleted?

I understand the desire to track this sort of connection, but I wonder
if the filesystem is really the right place to track it. Maybe having
the tools track it would be better.

>
> >
> >>
> >> This leaves the problem of nfsd. Can you just integrate this new treeid into
> >> nfsd, and use that to either change the ino within nfsd itself, or do something
> >> similar to what your first patchset did and generate a fsid based on the treeid?
> >
> > I would only want nfsd to change the inode number. I no longer think it
> > is acceptable for nfsd to report different device number (as I mention
> > above).
> > I would want the new inode number to be explicitly provided by the
> > filesystem. Whether that is a new export_operation or a new field in
> > 'struct kstat' doesn't really bother me. I'd *prefer* it to be st_ino,
> > but I can live without that.
> >
>
> Right, I'm not saying nfsd has to propagate our dev_t thing, I'm saying that you
> could accomplish the same behavior without the mount options. We add either a
> new SB_I_HAS_TREEID or FS_HAS_TREEID, depending on if you prefer to tag the sb
> or the fs_type, and then NFS does the inode number magic transformation
> automatically and we are good to go.

I really don't want nfsd to do the magic transformations. I want the
filesystem to do those if they need to be done. I could cope with nfsd
xor-ing some provided number with i_ino, but I wouldn't like nfsd to
have the responsibility of doing the swab64().

>
> > On the topic of inode numbers.... I've recently learned that btrfs
> > never reuses inode (objectid) numbers (except possibly after an
> > unmount). Equally it doesn't re-use subvol numbers. How much does this
> > contribute to the 64 bits not being enough for subtree+inode?
> >
> > It would be nice if we could be comfortable limiting the objectid number
> > to 40 bits and the root.objectid (filetree) number to 24 bits, and
> > combine them into a 64bit inode number.
> >
> > If we added a inode number reuse scheme that was suitably performant,
> > would that make this possible? That would remove the need for a treeid,
> > and allow us to use project-id to identify subtrees.
> >
>
> We had a resuse scheme, we deprecated and deleted it. I don't want to
> arbitrarily limit objectid's to work around this issue.

These are computers we are working with. There are always arbitrary
limits.
The syscall interface places an arbitrary limit of 64bits on the
identity of any object in a filesystem. btrfs clearly doesn't like that
arbitrary limit, and plays games with device number to increase it to a
new arbitrary limit of 84 bits (sort-of).

I'm fully open to the possibility that last year's arbitrary limits are
no longer comfortable and that we might need to push the boundaries.
But I'd rather the justification was a bit stronger than "we cannot be
bothered reusing old inode numbers".

Are you at all aware of any site coming anywhere vaguely close to one trillion
concurrent inodes - maybe even 16 billion?
Or anything close to 16 million concurrent subvolumes?

>
> >>
> >> Mount options are messy, and are just going to lead to distro's turning them on
> >> without understanding what's going on and then we have to support them forever.
> >> I want to get this fixed in a way that we all hate the least with as little
> >> opportunity for confused users to make bad decisions. Thanks,
> >
> > Hence my question: how much do you hate creating a new filesystem type
> > to fix the problems?
> >
>
> I'm still not convinced we can't solve this without adding new options or
> fstypes. I think flags to indicate that we're special and to use a treeid that
> we stuff into the inode would be a reasonable solution. That being said I'm a
> little sleep deprived so I could be missing why my plan is a bad one, so I'm
> willing to be convinced that mount options are the solution to this, but I want
> to make sure we're damned certain that's the best way forward. Thanks,

I don't think "best way forward" is the appropriate goal - impossible to
assess.

What we need is a chosen way forward. Someone - and ultimately that
someone needs to be the BTRFS maintainer team - needs to decide what
breakage they are willing to bear the cost of, and what breakage is
unacceptable to them, and to choose a way to move forward. I cannot
make that decision for you because I'm just an interested bystander. Al
Viro and Linus cannot either, though they are in a position to veto some
decisions.

The current choice appears to be "ignore the problem and hope it goes
away", though I appreciate that appearances can be deceiving.

You appear very keen to preserve as much of the status quo as possible.
Given that, I think you really need to push to get all the procfs files
changed to use the same device number as stat - so push the patch which
SUSE has that add inode_get_dev().

https://github.com/SUSE/kernel-source/blob/master/patches.suse/vfs-add-super_operations-get_inode_dev

(though the change to show_mountinfo() in that patch would need careful consideration).

If that lands, you have a clear way forward, and we can find some
solution for NFSd (and other network filesystems), and for user-space to
use mnt_id.
If you cannot overcome the pushback, then you know you will have to
find another path - make a 64bit inode number unique, or add more bits
to the effective inode number. Or something.

NeilBrown