2021-06-13 04:11:32

by Wang Yugui

[permalink] [raw]
Subject: any idea about auto export multiple btrfs snapshots?

Hi,

Any idea about auto export multiple btrfs snapshots?

One related patch is yet not merged to nfs-utils 2.5.3.
From: "NeilBrown" <[email protected]>
Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.

In this patch, an UUID is auto generated when a tmpfs have no UUID.

for btrfs, multiple subvolume snapshot have the same filesystem UUID.
Could we generate an UUID for btrfs subvol with 'filesystem UUID' + 'subvol ID'?

Best Regards
Wang Yugui ([email protected])
2021/06/13



2021-06-14 22:51:18

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Sun, 13 Jun 2021, Wang Yugui wrote:
> Hi,
>
> Any idea about auto export multiple btrfs snapshots?
>
> One related patch is yet not merged to nfs-utils 2.5.3.
> From: "NeilBrown" <[email protected]>
> Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
>
> In this patch, an UUID is auto generated when a tmpfs have no UUID.
>
> for btrfs, multiple subvolume snapshot have the same filesystem UUID.
> Could we generate an UUID for btrfs subvol with 'filesystem UUID' + 'subvol ID'?

You really need to ask this question of btrfs developers. 'mountd'
already has a special-case exception for btrfs, to prefer the uuid
provided by statfs64() rather than the uuid extracted from the block
device. It would be quite easy to add another exception.
But it would only be reasonable to do that if the btrfs team told us how
that wanted us to generate a UUID for a given mount point, and promised
that would always provide a unique stable result.

This is completely separate from the tmpfs patch you identified.

NeilBrown


>
> Best Regards
> Wang Yugui ([email protected])
> 2021/06/13
>
>
>

2021-06-15 15:13:51

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi, NeilBrown

> On Sun, 13 Jun 2021, Wang Yugui wrote:
> > Hi,
> >
> > Any idea about auto export multiple btrfs snapshots?
> >
> > One related patch is yet not merged to nfs-utils 2.5.3.
> > From: "NeilBrown" <[email protected]>
> > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> >
> > In this patch, an UUID is auto generated when a tmpfs have no UUID.
> >
> > for btrfs, multiple subvolume snapshot have the same filesystem UUID.
> > Could we generate an UUID for btrfs subvol with 'filesystem UUID' + 'subvol ID'?
>
> You really need to ask this question of btrfs developers. 'mountd'
> already has a special-case exception for btrfs, to prefer the uuid
> provided by statfs64() rather than the uuid extracted from the block
> device. It would be quite easy to add another exception.
> But it would only be reasonable to do that if the btrfs team told us how
> that wanted us to generate a UUID for a given mount point, and promised
> that would always provide a unique stable result.
> This is completely separate from the tmpfs patch you identified.

Thanks a lot for the replay.

Now btrfs statfs64() return 8 byte unique/stable result.

It is based on two parts.
1) 16 byte blkid of file system. this is uniq/stable between btrfs filesystems.
2) 8 byte of btrfs sub volume objectid. this is uniq/stable inside a
btrfs filesystem.

the code of linux/fs/btrfs
static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)

/* We treat it as constant endianness (it doesn't matter _which_)
because we want the fsid to come out the same whether mounted
on a big-endian or little-endian host */
buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
/* Mask in the root object ID too, to disambiguate subvols */
buf->f_fsid.val[0] ^=
BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
buf->f_fsid.val[1] ^=
BTRFS_I(d_inode(dentry))->root->root_key.objectid;


for nfs, we need a 16 byte UUID now.

The best way I though:
16 byte blkid , math add 8 byte btrfs sub volume objectid.
but there is yet no a simple/easy way to get the raw value of 'btrfs sub
volume objectid'.

A simple but good enough way:
1) first 8 byte copy from blkid
2) second 8 byte copy from btrfs_statfs()
the uniq/stable of multiple subvolume inside a btrfs filesystem is kept.

Best Regards
Wang Yugui ([email protected])
2021/06/15

2021-06-15 15:44:35

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi, NeilBrown

> > On Sun, 13 Jun 2021, Wang Yugui wrote:
> > > Hi,
> > >
> > > Any idea about auto export multiple btrfs snapshots?
> > >
> > > One related patch is yet not merged to nfs-utils 2.5.3.
> > > From: "NeilBrown" <[email protected]>
> > > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> > >
> > > In this patch, an UUID is auto generated when a tmpfs have no UUID.
> > >
> > > for btrfs, multiple subvolume snapshot have the same filesystem UUID.
> > > Could we generate an UUID for btrfs subvol with 'filesystem UUID' + 'subvol ID'?
> >
> > You really need to ask this question of btrfs developers. 'mountd'
> > already has a special-case exception for btrfs, to prefer the uuid
> > provided by statfs64() rather than the uuid extracted from the block
> > device. It would be quite easy to add another exception.
> > But it would only be reasonable to do that if the btrfs team told us how
> > that wanted us to generate a UUID for a given mount point, and promised
> > that would always provide a unique stable result.
> > This is completely separate from the tmpfs patch you identified.
>
> Thanks a lot for the replay.
>
> Now btrfs statfs64() return 8 byte unique/stable result.
>
> It is based on two parts.
> 1) 16 byte blkid of file system. this is uniq/stable between btrfs filesystems.
> 2) 8 byte of btrfs sub volume objectid. this is uniq/stable inside a
> btrfs filesystem.
>
> the code of linux/fs/btrfs
> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>
> /* We treat it as constant endianness (it doesn't matter _which_)
> because we want the fsid to come out the same whether mounted
> on a big-endian or little-endian host */
> buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> /* Mask in the root object ID too, to disambiguate subvols */
> buf->f_fsid.val[0] ^=
> BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> buf->f_fsid.val[1] ^=
> BTRFS_I(d_inode(dentry))->root->root_key.objectid;
>
>
> for nfs, we need a 16 byte UUID now.
>
> The best way I though:
> 16 byte blkid , math add 8 byte btrfs sub volume objectid.
> but there is yet no a simple/easy way to get the raw value of 'btrfs sub
> volume objectid'.
>
> A simple but good enough way:
> 1) first 8 byte copy from blkid
> 2) second 8 byte copy from btrfs_statfs()
> the uniq/stable of multiple subvolume inside a btrfs filesystem is kept.

By the way, the random 16 byte UUID still have very little chance to
conflict.

Could we keep the first 4 byte of the UUID of nfs/tmpfs alwasy ZERO? ,
the first 4 byte zero will limit the conflict inside nfs/tmpfs, and it is
easy to diag.

Here we use the first same 8 byte for UUID of btrfs and nfs/btrfs,
so it is easy to diag too.

Best Regards
Wang Yugui ([email protected])
2021/06/15


2021-06-16 05:47:36

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

> Hi, NeilBrown
>
> > On Sun, 13 Jun 2021, Wang Yugui wrote:
> > > Hi,
> > >
> > > Any idea about auto export multiple btrfs snapshots?
> > >
> > > One related patch is yet not merged to nfs-utils 2.5.3.
> > > From: "NeilBrown" <[email protected]>
> > > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> > >
> > > In this patch, an UUID is auto generated when a tmpfs have no UUID.
> > >
> > > for btrfs, multiple subvolume snapshot have the same filesystem UUID.
> > > Could we generate an UUID for btrfs subvol with 'filesystem UUID' + 'subvol ID'?
> >
> > You really need to ask this question of btrfs developers. 'mountd'
> > already has a special-case exception for btrfs, to prefer the uuid
> > provided by statfs64() rather than the uuid extracted from the block
> > device. It would be quite easy to add another exception.
> > But it would only be reasonable to do that if the btrfs team told us how
> > that wanted us to generate a UUID for a given mount point, and promised
> > that would always provide a unique stable result.
> > This is completely separate from the tmpfs patch you identified.
>
> Thanks a lot for the replay.
>
> Now btrfs statfs64() return 8 byte unique/stable result.
>
> It is based on two parts.
> 1) 16 byte blkid of file system. this is uniq/stable between btrfs filesystems.
> 2) 8 byte of btrfs sub volume objectid. this is uniq/stable inside a
> btrfs filesystem.
>
> the code of linux/fs/btrfs
> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>
> /* We treat it as constant endianness (it doesn't matter _which_)
> because we want the fsid to come out the same whether mounted
> on a big-endian or little-endian host */
> buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> /* Mask in the root object ID too, to disambiguate subvols */
> buf->f_fsid.val[0] ^=
> BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> buf->f_fsid.val[1] ^=
> BTRFS_I(d_inode(dentry))->root->root_key.objectid;
>
>
> for nfs, we need a 16 byte UUID now.
>
> The best way I though:
> 16 byte blkid , math add 8 byte btrfs sub volume objectid.
> but there is yet no a simple/easy way to get the raw value of 'btrfs sub
> volume objectid'.

The btrfs subvol objectid (8byte) can be extracted from the
statfs.f_fsid(8 byte) with the help of blkid(16btye) of the btrfs file
system just do a revert cacl in btrfs_statfs().

if we need 8 byte id for btrfs subvol, just use statfs.f_fsid.

if we need 16 byte id for btrfs subvol, use the blkid(16btye) of the
btrfs filesystem, plus the btrfs subvol objectid (8byte) , and keep
the result in 16 byte.

Best Regards
Wang Yugui ([email protected])
2021/06/16



> A simple but good enough way:
> 1) first 8 byte copy from blkid
> 2) second 8 byte copy from btrfs_statfs()
> the uniq/stable of multiple subvolume inside a btrfs filesystem is kept.
>
> Best Regards
> Wang Yugui ([email protected])
> 2021/06/15


2021-06-17 02:47:47

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

nfs need to treat btrfs subvols as different filesystems, so nfs need
crossmnt feature to support multiple btrfs subvol auto export?

It is yet not clear what prevent nfs/crossmnt from work well.

1, stat() and the result 'struct stat'
btrfs subvol support it well.
multiple subvols will have different st_dev of 'struct stat'.
/bin/find works well too.

2, statfs() and the result 'struct statfs'
btrfs subvol support it well.
multiple subvols will have different f_fsid of 'struct statfs'.

3, stx_mnt_id of statx
btrfs subvol does NOT support it well.
but stx_mnt_id seems yet not used.

4, d_mountpoint() in kernel
d_mountpoint() seems not support btrfs subvol.
but we can add some dirty fix such as.

+//#define BTRFS_FIRST_FREE_OBJECTID 256ULL
+//#define BTRFS_SUPER_MAGIC 0x9123683E
+static inline bool is_btrfs_subvol_d(const struct dentry *dentry)
+{
+ return dentry->d_inode && dentry->d_inode->i_ino == 256ULL &&
+ dentry->d_sb && dentry->d_sb->s_magic == 0x9123683E;
+}

so the problem list is yet not clear.

Best Regards
Wang Yugui ([email protected])
2021/06/17

> On Sun, 13 Jun 2021, Wang Yugui wrote:
> > Hi,
> >
> > Any idea about auto export multiple btrfs snapshots?
> >
> > One related patch is yet not merged to nfs-utils 2.5.3.
> > From: "NeilBrown" <[email protected]>
> > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> >
> > In this patch, an UUID is auto generated when a tmpfs have no UUID.
> >
> > for btrfs, multiple subvolume snapshot have the same filesystem UUID.
> > Could we generate an UUID for btrfs subvol with 'filesystem UUID' + 'subvol ID'?
>
> You really need to ask this question of btrfs developers. 'mountd'
> already has a special-case exception for btrfs, to prefer the uuid
> provided by statfs64() rather than the uuid extracted from the block
> device. It would be quite easy to add another exception.
> But it would only be reasonable to do that if the btrfs team told us how
> that wanted us to generate a UUID for a given mount point, and promised
> that would always provide a unique stable result.
>
> This is completely separate from the tmpfs patch you identified.
>
> NeilBrown
>
>
> >
> > Best Regards
> > Wang Yugui ([email protected])
> > 2021/06/13
> >
> >
> >


2021-06-17 03:03:17

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Wed, 16 Jun 2021, Wang Yugui wrote:
> Hi, NeilBrown
>
> > On Sun, 13 Jun 2021, Wang Yugui wrote:
> > > Hi,
> > >
> > > Any idea about auto export multiple btrfs snapshots?
> > >
> > > One related patch is yet not merged to nfs-utils 2.5.3.
> > > From: "NeilBrown" <[email protected]>
> > > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> > >
> > > In this patch, an UUID is auto generated when a tmpfs have no UUID.
> > >
> > > for btrfs, multiple subvolume snapshot have the same filesystem UUID.
> > > Could we generate an UUID for btrfs subvol with 'filesystem UUID' + 'subvol ID'?
> >
> > You really need to ask this question of btrfs developers. 'mountd'
> > already has a special-case exception for btrfs, to prefer the uuid
> > provided by statfs64() rather than the uuid extracted from the block
> > device. It would be quite easy to add another exception.
> > But it would only be reasonable to do that if the btrfs team told us how
> > that wanted us to generate a UUID for a given mount point, and promised
> > that would always provide a unique stable result.
> > This is completely separate from the tmpfs patch you identified.
>
> Thanks a lot for the replay.
>
> Now btrfs statfs64() return 8 byte unique/stable result.
>
> It is based on two parts.
> 1) 16 byte blkid of file system. this is uniq/stable between btrfs filesystems.
> 2) 8 byte of btrfs sub volume objectid. this is uniq/stable inside a
> btrfs filesystem.
>
> the code of linux/fs/btrfs
> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>
> /* We treat it as constant endianness (it doesn't matter _which_)
> because we want the fsid to come out the same whether mounted
> on a big-endian or little-endian host */
> buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> /* Mask in the root object ID too, to disambiguate subvols */
> buf->f_fsid.val[0] ^=
> BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> buf->f_fsid.val[1] ^=
> BTRFS_I(d_inode(dentry))->root->root_key.objectid;
>
>
> for nfs, we need a 16 byte UUID now.
>
> The best way I though:
> 16 byte blkid , math add 8 byte btrfs sub volume objectid.
> but there is yet no a simple/easy way to get the raw value of 'btrfs sub
> volume objectid'.

I'm a bit confused now. You started out talking about snapshots, but
now you are talking about sub volumes. Are they the same thing?

NFS export of btrfs sub volumes has worked for the past 10 years I
believe.

Can we go back to the beginning. What, exactly, is the problem you are
trying to solve? How can you demonstrate the problem?

NeilBrown


>
> A simple but good enough way:
> 1) first 8 byte copy from blkid
> 2) second 8 byte copy from btrfs_statfs()
> the uniq/stable of multiple subvolume inside a btrfs filesystem is kept.
>
> Best Regards
> Wang Yugui ([email protected])
> 2021/06/15
>
>

2021-06-17 04:29:47

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

> On Wed, 16 Jun 2021, Wang Yugui wrote:
> > Hi, NeilBrown
> >
> > > On Sun, 13 Jun 2021, Wang Yugui wrote:
> > > > Hi,
> > > >
> > > > Any idea about auto export multiple btrfs snapshots?
> > > >
> > > > One related patch is yet not merged to nfs-utils 2.5.3.
> > > > From: "NeilBrown" <[email protected]>
> > > > Subject: [PATCH/RFC v2 nfs-utils] Fix NFSv4 export of tmpfs filesystems.
> > > >
> > > > In this patch, an UUID is auto generated when a tmpfs have no UUID.
> > > >
> > > > for btrfs, multiple subvolume snapshot have the same filesystem UUID.
> > > > Could we generate an UUID for btrfs subvol with 'filesystem UUID' + 'subvol ID'?
> > >
> > > You really need to ask this question of btrfs developers. 'mountd'
> > > already has a special-case exception for btrfs, to prefer the uuid
> > > provided by statfs64() rather than the uuid extracted from the block
> > > device. It would be quite easy to add another exception.
> > > But it would only be reasonable to do that if the btrfs team told us how
> > > that wanted us to generate a UUID for a given mount point, and promised
> > > that would always provide a unique stable result.
> > > This is completely separate from the tmpfs patch you identified.
> >
> > Thanks a lot for the replay.
> >
> > Now btrfs statfs64() return 8 byte unique/stable result.
> >
> > It is based on two parts.
> > 1) 16 byte blkid of file system. this is uniq/stable between btrfs filesystems.
> > 2) 8 byte of btrfs sub volume objectid. this is uniq/stable inside a
> > btrfs filesystem.
> >
> > the code of linux/fs/btrfs
> > static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> >
> > /* We treat it as constant endianness (it doesn't matter _which_)
> > because we want the fsid to come out the same whether mounted
> > on a big-endian or little-endian host */
> > buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> > buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> > /* Mask in the root object ID too, to disambiguate subvols */
> > buf->f_fsid.val[0] ^=
> > BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> > buf->f_fsid.val[1] ^=
> > BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> >
> >
> > for nfs, we need a 16 byte UUID now.
> >
> > The best way I though:
> > 16 byte blkid , math add 8 byte btrfs sub volume objectid.
> > but there is yet no a simple/easy way to get the raw value of 'btrfs sub
> > volume objectid'.
>
> I'm a bit confused now. You started out talking about snapshots, but
> now you are talking about sub volumes. Are they the same thing?
>
> NFS export of btrfs sub volumes has worked for the past 10 years I
> believe.
>
> Can we go back to the beginning. What, exactly, is the problem you are
> trying to solve? How can you demonstrate the problem?
>
> NeilBrown

I nfs/exported a btrfs with 2 subvols and 2 snapshot(subvol).

# btrfs subvolume list /mnt/test
ID 256 gen 53 top level 5 path sub1
ID 260 gen 56 top level 5 path sub2
ID 261 gen 57 top level 5 path .snapshot/sub1-s1
ID 262 gen 57 top level 5 path .snapshot/sub2-s1

and then mount.nfs4 it to /nfs/test.

# /bin/find /nfs/test/
/nfs/test/
find: File system loop detected; ??/nfs/test/sub1?? is part of the same file system loop as ??/nfs/test/??.
/nfs/test/.snapshot
find: File system loop detected; ??/nfs/test/.snapshot/sub1-s1?? is part of the same file system loop as ??/nfs/test/??.
find: File system loop detected; ??/nfs/test/.snapshot/sub2-s1?? is part of the same file system loop as ??/nfs/test/??.
/nfs/test/dir1
/nfs/test/dir1/a.txt
find: File system loop detected; ??/nfs/test/sub2?? is part of the same file system loop as ??/nfs/test/??

/bin/find report 'File system loop detected'. so I though there is
something wrong.

but when I checked the file content through /mnt/test and /nfs/test,
the file through /mnt/test/xxx and /nfs/test/xxx return the same result.

I have used nfs/crossmnt, and then I thought that btrfs subvol/snapshot
support is through 'nfs/crossmnt' feature. but in fact, it is not
through nfs/crossmnt feature?

/bin/find report 'File system loop detected', it means that vfs cache on
nfs client side will have some problem?

Best Regards
Wang Yugui ([email protected])
2021/06/17


2021-06-18 03:10:36

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Thu, 17 Jun 2021, Wang Yugui wrote:
> > Can we go back to the beginning. What, exactly, is the problem you are
> > trying to solve? How can you demonstrate the problem?
> >
> > NeilBrown
>
> I nfs/exported a btrfs with 2 subvols and 2 snapshot(subvol).
>
> # btrfs subvolume list /mnt/test
> ID 256 gen 53 top level 5 path sub1
> ID 260 gen 56 top level 5 path sub2
> ID 261 gen 57 top level 5 path .snapshot/sub1-s1
> ID 262 gen 57 top level 5 path .snapshot/sub2-s1
>
> and then mount.nfs4 it to /nfs/test.
>
> # /bin/find /nfs/test/
> /nfs/test/
> find: File system loop detected; '/nfs/test/sub1' is part of the same file system loop as '/nfs/test/'.
> /nfs/test/.snapshot
> find: File system loop detected; '/nfs/test/.snapshot/sub1-s1' is part of the same file system loop as '/nfs/test/'.
> find: File system loop detected; '/nfs/test/.snapshot/sub2-s1' is part of the same file system loop as '/nfs/test/'.
> /nfs/test/dir1
> /nfs/test/dir1/a.txt
> find: File system loop detected; '/nfs/test/sub2' is part of the same file system loop as '/nfs/test/'
>
> /bin/find report 'File system loop detected'. so I though there is
> something wrong.

Certainly something is wrong. The error message implies that some
directory is reporting the same dev an ino as an ancestor directory.
Presumably /nfs/test and /nfs/test/sub1.
Can you confirm that please. e.g. run the command

stat /nfs/test /nfs/test/sub1

and examine the output.

As sub1 is considered a different file system, it should have a
different dev number. NFS will assign a different device number only
when the server reports a different fsid. The Linux NFS server will
report a different fsid if d_mountpoint() is 'true' for the dentry, and
follow_down() results in no change the the vfsmnt,dentry in a 'struct
path'.

You have already said that d_mountpoint doesn't work for btrfs, so that
is part of the problem. NFSD doesn't trust d_mountpoint completely as
it only reports that the dentry is a mountpoint in some namespace, not
necessarily in this namespace. So you really need to fix
nfsd_mountpoint.

I suggest you try adding your "dirty fix" to nfsd_mountpoint() so that
it reports the root of a btrfs subvol as a mountpoint, and see if that
fixes the problem. It should change the problem at least. You would
need to get nfsd_mountpoint() to return '1' in this case, not '2'.

NeilBrown

2021-06-18 09:12:45

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

> On Thu, 17 Jun 2021, Wang Yugui wrote:
> > > Can we go back to the beginning. What, exactly, is the problem you are
> > > trying to solve? How can you demonstrate the problem?
> > >
> > > NeilBrown
> >
> > I nfs/exported a btrfs with 2 subvols and 2 snapshot(subvol).
> >
> > # btrfs subvolume list /mnt/test
> > ID 256 gen 53 top level 5 path sub1
> > ID 260 gen 56 top level 5 path sub2
> > ID 261 gen 57 top level 5 path .snapshot/sub1-s1
> > ID 262 gen 57 top level 5 path .snapshot/sub2-s1
> >
> > and then mount.nfs4 it to /nfs/test.
> >
> > # /bin/find /nfs/test/
> > /nfs/test/
> > find: File system loop detected; '/nfs/test/sub1' is part of the same file system loop as '/nfs/test/'.
> > /nfs/test/.snapshot
> > find: File system loop detected; '/nfs/test/.snapshot/sub1-s1' is part of the same file system loop as '/nfs/test/'.
> > find: File system loop detected; '/nfs/test/.snapshot/sub2-s1' is part of the same file system loop as '/nfs/test/'.
> > /nfs/test/dir1
> > /nfs/test/dir1/a.txt
> > find: File system loop detected; '/nfs/test/sub2' is part of the same file system loop as '/nfs/test/'
> >
> > /bin/find report 'File system loop detected'. so I though there is
> > something wrong.
>
> Certainly something is wrong. The error message implies that some
> directory is reporting the same dev an ino as an ancestor directory.
> Presumably /nfs/test and /nfs/test/sub1.
> Can you confirm that please. e.g. run the command
>
> stat /nfs/test /nfs/test/sub1
> and examine the output.

# stat /nfs/test /nfs/test/sub1
File: /nfs/test
Size: 42 Blocks: 32 IO Block: 32768 directory
Device: 36h/54d Inode: 256 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-06-18 13:50:55.409457648 +0800
Modify: 2021-06-13 10:05:10.830825901 +0800
Change: 2021-06-13 10:05:10.830825901 +0800
Birth: -
File: /nfs/test/sub1
Size: 8 Blocks: 0 IO Block: 32768 directory
Device: 36h/54d Inode: 256 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-06-18 13:51:14.463621411 +0800
Modify: 2021-06-12 21:59:10.598089917 +0800
Change: 2021-06-12 21:59:10.598089917 +0800
Birth: -

same 'Device/Inode' are reported.


but the local btrfs mount,
# stat /mnt/test/ /mnt/test/sub1
File: /mnt/test/
Size: 42 Blocks: 32 IO Block: 4096 directory
Device: 33h/51d Inode: 256 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-06-18 13:50:55.409457648 +0800
Modify: 2021-06-13 10:05:10.830825901 +0800
Change: 2021-06-13 10:05:10.830825901 +0800
Birth: -
File: /mnt/test/sub1
Size: 8 Blocks: 0 IO Block: 4096 directory
Device: 34h/52d Inode: 256 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-06-18 13:51:14.463621411 +0800
Modify: 2021-06-12 21:59:10.598089917 +0800
Change: 2021-06-12 21:59:10.598089917 +0800
Birth: -

'stat' command should cause nfs/crossmnt to happen auto, and then return
the 'stat' result?


> As sub1 is considered a different file system, it should have a
> different dev number. NFS will assign a different device number only
> when the server reports a different fsid. The Linux NFS server will
> report a different fsid if d_mountpoint() is 'true' for the dentry, and
> follow_down() results in no change the the vfsmnt,dentry in a 'struct
> path'.
>
> You have already said that d_mountpoint doesn't work for btrfs, so that
> is part of the problem. NFSD doesn't trust d_mountpoint completely as
> it only reports that the dentry is a mountpoint in some namespace, not
> necessarily in this namespace. So you really need to fix
> nfsd_mountpoint.
>
> I suggest you try adding your "dirty fix" to nfsd_mountpoint() so that
> it reports the root of a btrfs subvol as a mountpoint, and see if that
> fixes the problem. It should change the problem at least. You would
> need to get nfsd_mountpoint() to return '1' in this case, not '2'.
>
> NeilBrown

I changed the return value from 2 to 1.
if (nfsd4_is_junction(dentry))
return 1;
+ if (is_btrfs_subvol_d(dentry))
+ return 1;
if (d_mountpoint(dentry))

but the crossmnt still does not happen auto.

I tried to mount the subvol manual,
# mount.nfs4 T7610:/mnt/test/sub1 /nfs/test/sub1
mount.nfs4: Stale file handle

we add trace to is_btrfs_subvol_d(), it works as expected.
+static inline bool is_btrfs_subvol_d(const struct dentry *dentry)
+{
+ bool ret= dentry->d_inode && dentry->d_inode->i_ino == 256ULL &&
+ dentry->d_sb && dentry->d_sb->s_magic == 0x9123683E;
+ printk(KERN_INFO "is_btrfs_subvol_d(%s)=%d\n", dentry->d_name.name, ret);
+ return ret;
+}

It seems more fixes are needed.

Best Regards
Wang Yugui ([email protected])
2021/06/18


2021-06-18 13:35:28

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

> > On Thu, 17 Jun 2021, Wang Yugui wrote:
> > > > Can we go back to the beginning. What, exactly, is the problem you are
> > > > trying to solve? How can you demonstrate the problem?
> > > >
> > > > NeilBrown
> > >
> > > I nfs/exported a btrfs with 2 subvols and 2 snapshot(subvol).
> > >
> > > # btrfs subvolume list /mnt/test
> > > ID 256 gen 53 top level 5 path sub1
> > > ID 260 gen 56 top level 5 path sub2
> > > ID 261 gen 57 top level 5 path .snapshot/sub1-s1
> > > ID 262 gen 57 top level 5 path .snapshot/sub2-s1
> > >
> > > and then mount.nfs4 it to /nfs/test.
> > >
> > > # /bin/find /nfs/test/
> > > /nfs/test/
> > > find: File system loop detected; '/nfs/test/sub1' is part of the same file system loop as '/nfs/test/'.
> > > /nfs/test/.snapshot
> > > find: File system loop detected; '/nfs/test/.snapshot/sub1-s1' is part of the same file system loop as '/nfs/test/'.
> > > find: File system loop detected; '/nfs/test/.snapshot/sub2-s1' is part of the same file system loop as '/nfs/test/'.
> > > /nfs/test/dir1
> > > /nfs/test/dir1/a.txt
> > > find: File system loop detected; '/nfs/test/sub2' is part of the same file system loop as '/nfs/test/'
> > >
> > > /bin/find report 'File system loop detected'. so I though there is
> > > something wrong.
> >
> > Certainly something is wrong. The error message implies that some
> > directory is reporting the same dev an ino as an ancestor directory.
> > Presumably /nfs/test and /nfs/test/sub1.
> > Can you confirm that please. e.g. run the command
> >
> > stat /nfs/test /nfs/test/sub1
> > and examine the output.
>
> # stat /nfs/test /nfs/test/sub1
> File: /nfs/test
> Size: 42 Blocks: 32 IO Block: 32768 directory
> Device: 36h/54d Inode: 256 Links: 1
> Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-06-18 13:50:55.409457648 +0800
> Modify: 2021-06-13 10:05:10.830825901 +0800
> Change: 2021-06-13 10:05:10.830825901 +0800
> Birth: -
> File: /nfs/test/sub1
> Size: 8 Blocks: 0 IO Block: 32768 directory
> Device: 36h/54d Inode: 256 Links: 1
> Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-06-18 13:51:14.463621411 +0800
> Modify: 2021-06-12 21:59:10.598089917 +0800
> Change: 2021-06-12 21:59:10.598089917 +0800
> Birth: -
>
> same 'Device/Inode' are reported.
>
>
> but the local btrfs mount,
> # stat /mnt/test/ /mnt/test/sub1
> File: /mnt/test/
> Size: 42 Blocks: 32 IO Block: 4096 directory
> Device: 33h/51d Inode: 256 Links: 1
> Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-06-18 13:50:55.409457648 +0800
> Modify: 2021-06-13 10:05:10.830825901 +0800
> Change: 2021-06-13 10:05:10.830825901 +0800
> Birth: -
> File: /mnt/test/sub1
> Size: 8 Blocks: 0 IO Block: 4096 directory
> Device: 34h/52d Inode: 256 Links: 1
> Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-06-18 13:51:14.463621411 +0800
> Modify: 2021-06-12 21:59:10.598089917 +0800
> Change: 2021-06-12 21:59:10.598089917 +0800
> Birth: -
>
> 'stat' command should cause nfs/crossmnt to happen auto, and then return
> the 'stat' result?
>
>
> > As sub1 is considered a different file system, it should have a
> > different dev number. NFS will assign a different device number only
> > when the server reports a different fsid. The Linux NFS server will
> > report a different fsid if d_mountpoint() is 'true' for the dentry, and
> > follow_down() results in no change the the vfsmnt,dentry in a 'struct
> > path'.
> >
> > You have already said that d_mountpoint doesn't work for btrfs, so that
> > is part of the problem. NFSD doesn't trust d_mountpoint completely as
> > it only reports that the dentry is a mountpoint in some namespace, not
> > necessarily in this namespace. So you really need to fix
> > nfsd_mountpoint.
> >
> > I suggest you try adding your "dirty fix" to nfsd_mountpoint() so that
> > it reports the root of a btrfs subvol as a mountpoint, and see if that
> > fixes the problem. It should change the problem at least. You would
> > need to get nfsd_mountpoint() to return '1' in this case, not '2'.
> >
> > NeilBrown
>
> I changed the return value from 2 to 1.
> if (nfsd4_is_junction(dentry))
> return 1;
> + if (is_btrfs_subvol_d(dentry))
> + return 1;
> if (d_mountpoint(dentry))
>
> but the crossmnt still does not happen auto.
>
> I tried to mount the subvol manual,
> # mount.nfs4 T7610:/mnt/test/sub1 /nfs/test/sub1
> mount.nfs4: Stale file handle
>
> we add trace to is_btrfs_subvol_d(), it works as expected.
> +static inline bool is_btrfs_subvol_d(const struct dentry *dentry)
> +{
> + bool ret= dentry->d_inode && dentry->d_inode->i_ino == 256ULL &&
> + dentry->d_sb && dentry->d_sb->s_magic == 0x9123683E;
> + printk(KERN_INFO "is_btrfs_subvol_d(%s)=%d\n", dentry->d_name.name, ret);
> + return ret;
> +}
>
> It seems more fixes are needed.

for a normal crossmnt,
/mnt/test btrfs
/mn/test/xfs1 xfs
this xfs1 have 2 inode,
1) inode in xfs /mn/test/xfs, as the root.
2) inode in btrfs /mnt/test, as a directory.
when /mn/test/xfs1 is mounted, nfs client with nocrossmnt option will
show 2).

but for a btrfs subvol,
/mnt/test btrfs
/mnt/test/sub1 btrfs subvol
this sub1 have just 1 inode
1) inode in /mnt/test/sub1, as the root

This difference break the nfs support of btrfs multiple subvol?

Best Regards
Wang Yugui ([email protected])
2021/06/18

2021-06-19 10:00:28

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

> > > On Thu, 17 Jun 2021, Wang Yugui wrote:
> > > > > Can we go back to the beginning. What, exactly, is the problem you are
> > > > > trying to solve? How can you demonstrate the problem?
> > > > >
> > > > > NeilBrown
> > > >
> > > > I nfs/exported a btrfs with 2 subvols and 2 snapshot(subvol).
> > > >
> > > > # btrfs subvolume list /mnt/test
> > > > ID 256 gen 53 top level 5 path sub1
> > > > ID 260 gen 56 top level 5 path sub2
> > > > ID 261 gen 57 top level 5 path .snapshot/sub1-s1
> > > > ID 262 gen 57 top level 5 path .snapshot/sub2-s1
> > > >
> > > > and then mount.nfs4 it to /nfs/test.
> > > >
> > > > # /bin/find /nfs/test/
> > > > /nfs/test/
> > > > find: File system loop detected; '/nfs/test/sub1' is part of the same file system loop as '/nfs/test/'.
> > > > /nfs/test/.snapshot
> > > > find: File system loop detected; '/nfs/test/.snapshot/sub1-s1' is part of the same file system loop as '/nfs/test/'.
> > > > find: File system loop detected; '/nfs/test/.snapshot/sub2-s1' is part of the same file system loop as '/nfs/test/'.
> > > > /nfs/test/dir1
> > > > /nfs/test/dir1/a.txt
> > > > find: File system loop detected; '/nfs/test/sub2' is part of the same file system loop as '/nfs/test/'
> > > >
> > > > /bin/find report 'File system loop detected'. so I though there is
> > > > something wrong.
> > >
> > > Certainly something is wrong. The error message implies that some
> > > directory is reporting the same dev an ino as an ancestor directory.
> > > Presumably /nfs/test and /nfs/test/sub1.
> > > Can you confirm that please. e.g. run the command
> > >
> > > stat /nfs/test /nfs/test/sub1
> > > and examine the output.
> >
> > # stat /nfs/test /nfs/test/sub1
> > File: /nfs/test
> > Size: 42 Blocks: 32 IO Block: 32768 directory
> > Device: 36h/54d Inode: 256 Links: 1
> > Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> > Access: 2021-06-18 13:50:55.409457648 +0800
> > Modify: 2021-06-13 10:05:10.830825901 +0800
> > Change: 2021-06-13 10:05:10.830825901 +0800
> > Birth: -
> > File: /nfs/test/sub1
> > Size: 8 Blocks: 0 IO Block: 32768 directory
> > Device: 36h/54d Inode: 256 Links: 1
> > Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> > Access: 2021-06-18 13:51:14.463621411 +0800
> > Modify: 2021-06-12 21:59:10.598089917 +0800
> > Change: 2021-06-12 21:59:10.598089917 +0800
> > Birth: -
> >
> > same 'Device/Inode' are reported.
> >
> >
> > but the local btrfs mount,
> > # stat /mnt/test/ /mnt/test/sub1
> > File: /mnt/test/
> > Size: 42 Blocks: 32 IO Block: 4096 directory
> > Device: 33h/51d Inode: 256 Links: 1
> > Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> > Access: 2021-06-18 13:50:55.409457648 +0800
> > Modify: 2021-06-13 10:05:10.830825901 +0800
> > Change: 2021-06-13 10:05:10.830825901 +0800
> > Birth: -
> > File: /mnt/test/sub1
> > Size: 8 Blocks: 0 IO Block: 4096 directory
> > Device: 34h/52d Inode: 256 Links: 1
> > Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> > Access: 2021-06-18 13:51:14.463621411 +0800
> > Modify: 2021-06-12 21:59:10.598089917 +0800
> > Change: 2021-06-12 21:59:10.598089917 +0800
> > Birth: -
> >
> > 'stat' command should cause nfs/crossmnt to happen auto, and then return
> > the 'stat' result?
> >
> >
> > > As sub1 is considered a different file system, it should have a
> > > different dev number. NFS will assign a different device number only
> > > when the server reports a different fsid. The Linux NFS server will
> > > report a different fsid if d_mountpoint() is 'true' for the dentry, and
> > > follow_down() results in no change the the vfsmnt,dentry in a 'struct
> > > path'.
> > >
> > > You have already said that d_mountpoint doesn't work for btrfs, so that
> > > is part of the problem. NFSD doesn't trust d_mountpoint completely as
> > > it only reports that the dentry is a mountpoint in some namespace, not
> > > necessarily in this namespace. So you really need to fix
> > > nfsd_mountpoint.
> > >
> > > I suggest you try adding your "dirty fix" to nfsd_mountpoint() so that
> > > it reports the root of a btrfs subvol as a mountpoint, and see if that
> > > fixes the problem. It should change the problem at least. You would
> > > need to get nfsd_mountpoint() to return '1' in this case, not '2'.
> > >
> > > NeilBrown
> >
> > I changed the return value from 2 to 1.
> > if (nfsd4_is_junction(dentry))
> > return 1;
> > + if (is_btrfs_subvol_d(dentry))
> > + return 1;
> > if (d_mountpoint(dentry))
> >
> > but the crossmnt still does not happen auto.
> >
> > I tried to mount the subvol manual,
> > # mount.nfs4 T7610:/mnt/test/sub1 /nfs/test/sub1
> > mount.nfs4: Stale file handle
> >
> > we add trace to is_btrfs_subvol_d(), it works as expected.
> > +static inline bool is_btrfs_subvol_d(const struct dentry *dentry)
> > +{
> > + bool ret= dentry->d_inode && dentry->d_inode->i_ino == 256ULL &&
> > + dentry->d_sb && dentry->d_sb->s_magic == 0x9123683E;
> > + printk(KERN_INFO "is_btrfs_subvol_d(%s)=%d\n", dentry->d_name.name, ret);
> > + return ret;
> > +}
> >
> > It seems more fixes are needed.
>
> for a normal crossmnt,
> /mnt/test btrfs
> /mn/test/xfs1 xfs
> this xfs1 have 2 inode,
> 1) inode in xfs /mn/test/xfs, as the root.
> 2) inode in btrfs /mnt/test, as a directory.
> when /mn/test/xfs1 is mounted, nfs client with nocrossmnt option will
> show 2).
>
> but for a btrfs subvol,
> /mnt/test btrfs
> /mnt/test/sub1 btrfs subvol
> this sub1 have just 1 inode
> 1) inode in /mnt/test/sub1, as the root
>
> This difference break the nfs support of btrfs multiple subvol?

When os shutdown, btrfs subvol in nfs client is firstly unmounted,
then the btrfs subvol entry in nfs client will have a unmounted status.

this btrfs subvol unmounted status in nfs client does NOT exist in btrfs,
we could use a dummy inode value(BTRFS_LAST_FREE_OBJECTID/-255ULL).

btrfs subvol entry mounted status
st_dev : from subvol
st_ino : form subvol (255ULL)

btrfs subvol entry mounted status
st_dev : from parent
st_ino : dummy inode ( -255ULL)

Best Regards
Wang Yugui ([email protected])
2021/06/19


2021-06-20 12:27:54

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

> It seems more fixes are needed.

when compare btrfs subvol with xfs crossmnt, we found out
a new feature difference.

/mnt/test xfs
/mnt/text/xfs2 another xfs(crossmnt)
nfsd4_encode_dirent_fattr() report "/mnt/test/xfs2" + "/";


but
/mnt/test btrfs
/mnt/test/sub1 btrfs subvol
nfsd4_encode_dirent_fattr() report "/mnt/test/" + "sub1";

for '/mnt/test/sub1', nfsd should treat the mountpoint as
'/mn/test/sub1', rather than '/mnt/test'?

I'm sorry that yet no patch is avaliable, kernel source is quite
difficult for me.

Best Regards
Wang Yugui ([email protected])
2021/06/20

2021-06-21 04:52:55

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Fri, 18 Jun 2021, Wang Yugui wrote:
> Hi,
>
> > On Thu, 17 Jun 2021, Wang Yugui wrote:
> > > > Can we go back to the beginning. What, exactly, is the problem you are
> > > > trying to solve? How can you demonstrate the problem?
> > > >
> > > > NeilBrown
> > >
> > > I nfs/exported a btrfs with 2 subvols and 2 snapshot(subvol).
> > >
> > > # btrfs subvolume list /mnt/test
> > > ID 256 gen 53 top level 5 path sub1
> > > ID 260 gen 56 top level 5 path sub2
> > > ID 261 gen 57 top level 5 path .snapshot/sub1-s1
> > > ID 262 gen 57 top level 5 path .snapshot/sub2-s1
> > >
> > > and then mount.nfs4 it to /nfs/test.
> > >
> > > # /bin/find /nfs/test/
> > > /nfs/test/
> > > find: File system loop detected; '/nfs/test/sub1' is part of the same file system loop as '/nfs/test/'.
> > > /nfs/test/.snapshot
> > > find: File system loop detected; '/nfs/test/.snapshot/sub1-s1' is part of the same file system loop as '/nfs/test/'.
> > > find: File system loop detected; '/nfs/test/.snapshot/sub2-s1' is part of the same file system loop as '/nfs/test/'.
> > > /nfs/test/dir1
> > > /nfs/test/dir1/a.txt
> > > find: File system loop detected; '/nfs/test/sub2' is part of the same file system loop as '/nfs/test/'
> > >
> > > /bin/find report 'File system loop detected'. so I though there is
> > > something wrong.
> >
> > Certainly something is wrong. The error message implies that some
> > directory is reporting the same dev an ino as an ancestor directory.
> > Presumably /nfs/test and /nfs/test/sub1.
> > Can you confirm that please. e.g. run the command
> >
> > stat /nfs/test /nfs/test/sub1
> > and examine the output.
>
> # stat /nfs/test /nfs/test/sub1
> File: /nfs/test
> Size: 42 Blocks: 32 IO Block: 32768 directory
> Device: 36h/54d Inode: 256 Links: 1
> Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-06-18 13:50:55.409457648 +0800
> Modify: 2021-06-13 10:05:10.830825901 +0800
> Change: 2021-06-13 10:05:10.830825901 +0800
> Birth: -
> File: /nfs/test/sub1
> Size: 8 Blocks: 0 IO Block: 32768 directory
> Device: 36h/54d Inode: 256 Links: 1
> Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-06-18 13:51:14.463621411 +0800
> Modify: 2021-06-12 21:59:10.598089917 +0800
> Change: 2021-06-12 21:59:10.598089917 +0800
> Birth: -
>
> same 'Device/Inode' are reported.
>
>
> but the local btrfs mount,
> # stat /mnt/test/ /mnt/test/sub1
> File: /mnt/test/
> Size: 42 Blocks: 32 IO Block: 4096 directory
> Device: 33h/51d Inode: 256 Links: 1
> Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-06-18 13:50:55.409457648 +0800
> Modify: 2021-06-13 10:05:10.830825901 +0800
> Change: 2021-06-13 10:05:10.830825901 +0800
> Birth: -
> File: /mnt/test/sub1
> Size: 8 Blocks: 0 IO Block: 4096 directory
> Device: 34h/52d Inode: 256 Links: 1
> Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2021-06-18 13:51:14.463621411 +0800
> Modify: 2021-06-12 21:59:10.598089917 +0800
> Change: 2021-06-12 21:59:10.598089917 +0800
> Birth: -
>
> 'stat' command should cause nfs/crossmnt to happen auto, and then return
> the 'stat' result?
>
>
> > As sub1 is considered a different file system, it should have a
> > different dev number. NFS will assign a different device number only
> > when the server reports a different fsid. The Linux NFS server will
> > report a different fsid if d_mountpoint() is 'true' for the dentry, and
> > follow_down() results in no change the the vfsmnt,dentry in a 'struct
> > path'.
> >
> > You have already said that d_mountpoint doesn't work for btrfs, so that
> > is part of the problem. NFSD doesn't trust d_mountpoint completely as
> > it only reports that the dentry is a mountpoint in some namespace, not
> > necessarily in this namespace. So you really need to fix
> > nfsd_mountpoint.
> >
> > I suggest you try adding your "dirty fix" to nfsd_mountpoint() so that
> > it reports the root of a btrfs subvol as a mountpoint, and see if that
> > fixes the problem. It should change the problem at least. You would
> > need to get nfsd_mountpoint() to return '1' in this case, not '2'.
> >
> > NeilBrown
>
> I changed the return value from 2 to 1.
> if (nfsd4_is_junction(dentry))
> return 1;
> + if (is_btrfs_subvol_d(dentry))
> + return 1;
> if (d_mountpoint(dentry))
>
> but the crossmnt still does not happen auto.
>
> I tried to mount the subvol manual,
> # mount.nfs4 T7610:/mnt/test/sub1 /nfs/test/sub1
> mount.nfs4: Stale file handle
>
> we add trace to is_btrfs_subvol_d(), it works as expected.
> +static inline bool is_btrfs_subvol_d(const struct dentry *dentry)
> +{
> + bool ret= dentry->d_inode && dentry->d_inode->i_ino == 256ULL &&
> + dentry->d_sb && dentry->d_sb->s_magic == 0x9123683E;
> + printk(KERN_INFO "is_btrfs_subvol_d(%s)=%d\n", dentry->d_name.name, ret);
> + return ret;
> +}
>
> It seems more fixes are needed.

I think the problem is that the submount doesn't appear in /proc/mounts.
"nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
filesystem to the mount point. To do this it walks through /proc/mounts
checking the uuid of each filesystem. If a filesystem isn't listed
there, it obviously fails.

I guess you could add code to nfs-utils to do whatever "btrfs subvol
list" does to make up for the fact that btrfs doesn't register in
/proc/mounts.

NeilBrown

2021-06-21 05:14:08

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

> > It seems more fixes are needed.
>
> I think the problem is that the submount doesn't appear in /proc/mounts.
> "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> filesystem to the mount point. To do this it walks through /proc/mounts
> checking the uuid of each filesystem. If a filesystem isn't listed
> there, it obviously fails.
>
> I guess you could add code to nfs-utils to do whatever "btrfs subvol
> list" does to make up for the fact that btrfs doesn't register in
> /proc/mounts.

Another approach might be to just change svcxdr_encode_fattr3() and
nfsd4_encode_fattr() in the 'FSIDSOJURCE_UUID' case to check if
dentry->d_inode has a different btrfs volume id to
exp->ex_path.dentry->d_inode.
If it does, then mix the volume id into the fsid somehow.

With that, you wouldn't want the first change I suggested.

NeilBrown

2021-06-21 08:35:22

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

> > > It seems more fixes are needed.
> >
> > I think the problem is that the submount doesn't appear in /proc/mounts.
> > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> > filesystem to the mount point. To do this it walks through /proc/mounts
> > checking the uuid of each filesystem. If a filesystem isn't listed
> > there, it obviously fails.
> >
> > I guess you could add code to nfs-utils to do whatever "btrfs subvol
> > list" does to make up for the fact that btrfs doesn't register in
> > /proc/mounts.
>
> Another approach might be to just change svcxdr_encode_fattr3() and
> nfsd4_encode_fattr() in the 'FSIDSOJURCE_UUID' case to check if
> dentry->d_inode has a different btrfs volume id to
> exp->ex_path.dentry->d_inode.
> If it does, then mix the volume id into the fsid somehow.
>
> With that, you wouldn't want the first change I suggested.

This is what I have done. and it is based on linux 5.10.44

but it still not work, so still more jobs needed.

Best Regards
Wang Yugui ([email protected])
2021/06/21


Attachments:
0001-nfsd-btrfs-subvol-support.patch (4.63 kB)
0002-trace-nfsd-btrfs-subvol-support.txt (3.35 kB)
Download all attachments

2021-06-21 14:41:03

by Frank Filz

[permalink] [raw]
Subject: RE: any idea about auto export multiple btrfs snapshots?

> I think the problem is that the submount doesn't appear in /proc/mounts.
> "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a filesystem to
> the mount point. To do this it walks through /proc/mounts checking the uuid of
> each filesystem. If a filesystem isn't listed there, it obviously fails.
>
> I guess you could add code to nfs-utils to do whatever "btrfs subvol list" does to
> make up for the fact that btrfs doesn't register in /proc/mounts.
>
> NeilBrown

I've been watching this with interest for the nfs-ganesha project. We recently were made aware that we weren't working with btrfs subvols, and I added code so that in addition to using getmntent (essentially /proc/mounts) to populate filesystems, we also scan for btrfs subvols and with that we are able to export subvols. My question is does a snapshot look any different than a subvol? If they show up in the subvol list then we shouldn't need to do anything more for nfs-ganesha, but if there's something else needed to discover them, then we may need additional code in nfs-ganesha. I have not yet had a chance to check out exporting a snapshot yet.

Thanks

Frank Filz

2021-06-21 14:56:43

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

> > I think the problem is that the submount doesn't appear in /proc/mounts.
> > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a filesystem to
> > the mount point. To do this it walks through /proc/mounts checking the uuid of
> > each filesystem. If a filesystem isn't listed there, it obviously fails.
> >
> > I guess you could add code to nfs-utils to do whatever "btrfs subvol list" does to
> > make up for the fact that btrfs doesn't register in /proc/mounts.
> >
> > NeilBrown
>
> I've been watching this with interest for the nfs-ganesha project. We recently were made aware that we weren't working with btrfs subvols, and I added code so that in addition to using getmntent (essentially /proc/mounts) to populate filesystems, we also scan for btrfs subvols and with that we are able to export subvols. My question is does a snapshot look any different than a subvol? If they show up in the subvol list then we shouldn't need to do anything more for nfs-ganesha, but if there's something else needed to discover them, then we may need additional code in nfs-ganesha. I have not yet had a chance to check out exporting a snapshot yet.

> My question is does a snapshot look any different than a subvol?

No difference between btrfs subvol and snapshot in theory.

but the btrfs subvol number in product env is usually fixed,
and the btrfs snapshot number is usually dynamic.

For fixed-number btrfs subvol/snapshot, it is OK to put them in the same
hierarchy, and then export all in /etc/exports static, as a walk around.

For dynamic-number btrfs snapshot number, we needs a dynamic way to
export them in nfs.

Best Regards
Wang Yugui ([email protected])
2021/06/21


2021-06-21 17:50:23

by Frank Filz

[permalink] [raw]
Subject: RE: any idea about auto export multiple btrfs snapshots?

> > > I think the problem is that the submount doesn't appear in
/proc/mounts.
> > > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> > > filesystem to the mount point. To do this it walks through
> > > /proc/mounts checking the uuid of each filesystem. If a filesystem
isn't listed
> there, it obviously fails.
> > >
> > > I guess you could add code to nfs-utils to do whatever "btrfs subvol
> > > list" does to make up for the fact that btrfs doesn't register in
/proc/mounts.
> > >
> > > NeilBrown
> >
> > I've been watching this with interest for the nfs-ganesha project. We
recently
> were made aware that we weren't working with btrfs subvols, and I added
code
> so that in addition to using getmntent (essentially /proc/mounts) to
populate
> filesystems, we also scan for btrfs subvols and with that we are able to
export
> subvols. My question is does a snapshot look any different than a subvol?
If they
> show up in the subvol list then we shouldn't need to do anything more for
nfs-
> ganesha, but if there's something else needed to discover them, then we
may
> need additional code in nfs-ganesha. I have not yet had a chance to check
out
> exporting a snapshot yet.
>
> > My question is does a snapshot look any different than a subvol?
>
> No difference between btrfs subvol and snapshot in theory.
>
> but the btrfs subvol number in product env is usually fixed, and the btrfs
> snapshot number is usually dynamic.
>
> For fixed-number btrfs subvol/snapshot, it is OK to put them in the same
> hierarchy, and then export all in /etc/exports static, as a walk around.
>
> For dynamic-number btrfs snapshot number, we needs a dynamic way to export
> them in nfs.

OK thanks for the information. I think they will just work in nfs-ganesha as
long as the snapshots or subvols are mounted within an nfs-ganesha export or
are exported explicitly. nfs-ganesha has the equivalent of knfsd's
nohide/crossmnt options and when nfs-ganesha detects crossing a filesystem
boundary will lookup the filesystem via getmntend and listing btrfs subvols
and then expose that filesystem (via the fsid attribute) to the clients
where at least the Linux nfs client will detect a filesystem boundary and
create a new mount entry for it.

Frank

2021-06-21 22:42:35

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

>
> OK thanks for the information. I think they will just work in nfs-ganesha as
> long as the snapshots or subvols are mounted within an nfs-ganesha export or
> are exported explicitly. nfs-ganesha has the equivalent of knfsd's
> nohide/crossmnt options and when nfs-ganesha detects crossing a filesystem
> boundary will lookup the filesystem via getmntend and listing btrfs subvols
> and then expose that filesystem (via the fsid attribute) to the clients
> where at least the Linux nfs client will detect a filesystem boundary and
> create a new mount entry for it.


Not only exported explicitly, but also kept in the same hierarchy.

If we export
/mnt/test #the btrfs
/mnt/test/sub1 # the btrfs subvol 1
/mnt/test/sub2 # the btrfs subvol 2

we need to make sure we will not access '/mnt/test/sub1' through '/mnt/test'
from nfs client.

current safe export:
#/mnt/test #the btrfs, not exported
/mnt/test/sub1 # the btrfs subvol 1
/mnt/test/sub2 # the btrfs subvol 2


Best Regards
Wang Yugui ([email protected])
2021/06/22


2021-06-22 01:29:28

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Mon, 21 Jun 2021, Wang Yugui wrote:
> Hi,
>
> > > > It seems more fixes are needed.
> > >
> > > I think the problem is that the submount doesn't appear in /proc/mounts.
> > > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> > > filesystem to the mount point. To do this it walks through /proc/mounts
> > > checking the uuid of each filesystem. If a filesystem isn't listed
> > > there, it obviously fails.
> > >
> > > I guess you could add code to nfs-utils to do whatever "btrfs subvol
> > > list" does to make up for the fact that btrfs doesn't register in
> > > /proc/mounts.
> >
> > Another approach might be to just change svcxdr_encode_fattr3() and
> > nfsd4_encode_fattr() in the 'FSIDSOJURCE_UUID' case to check if
> > dentry->d_inode has a different btrfs volume id to
> > exp->ex_path.dentry->d_inode.
> > If it does, then mix the volume id into the fsid somehow.
> >
> > With that, you wouldn't want the first change I suggested.
>
> This is what I have done. and it is based on linux 5.10.44
>
> but it still not work, so still more jobs needed.
>

The following is more what I had in mind. It doesn't quite work and I
cannot work out why.

If you 'stat' a file inside the subvol, then 'find' will not complete.
If you don't, then it will.

Doing that 'stat' changes the st_dev number of the main filesystem,
which seems really weird.
I'm probably missing something obvious. Maybe a more careful analysis
of what is changing when will help.

NeilBrown


diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 9421dae22737..790a3357525d 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -15,6 +15,7 @@
#include <linux/slab.h>
#include <linux/namei.h>
#include <linux/module.h>
+#include <linux/statfs.h>
#include <linux/exportfs.h>
#include <linux/sunrpc/svc_xprt.h>

@@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
int err;
struct auth_domain *dom = NULL;
struct svc_export exp = {}, *expp;
+ struct kstatfs statfs;
int an_int;

if (mesg[mlen-1] != '\n')
@@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
err = kern_path(buf, 0, &exp.ex_path);
if (err)
goto out1;
+ err = vfs_statfs(&exp.ex_path, &statfs);
+ if (err)
+ goto out3;
+ exp.ex_fsid64 = statfs.f_fsid;

exp.ex_client = dom;
exp.cd = cd;
@@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
new->ex_anon_uid = item->ex_anon_uid;
new->ex_anon_gid = item->ex_anon_gid;
new->ex_fsid = item->ex_fsid;
+ new->ex_fsid64 = item->ex_fsid64;
new->ex_devid_map = item->ex_devid_map;
item->ex_devid_map = NULL;
new->ex_uuid = item->ex_uuid;
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index ee0e3aba4a6e..d3eb9a599918 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -68,6 +68,7 @@ struct svc_export {
kuid_t ex_anon_uid;
kgid_t ex_anon_gid;
int ex_fsid;
+ __kernel_fsid_t ex_fsid64;
unsigned char * ex_uuid; /* 16 byte fsid */
struct nfsd4_fs_locations ex_fslocs;
uint32_t ex_nflavors;
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7abeccb975b2..8144e6037eae 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2869,6 +2869,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
if (err)
goto out_nfserr;
if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
+ FATTR4_WORD0_FSID |
FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
(bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
FATTR4_WORD1_SPACE_TOTAL))) {
@@ -3024,6 +3025,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
case FSIDSOURCE_UUID:
p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
EX_UUID_LEN);
+ if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+ statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
+ /* looks like a btrfs subvol */
+ p[-2] ^= statfs.f_fsid.val[0];
+ p[-1] ^= statfs.f_fsid.val[1];
+ }
break;
}
}

2021-06-22 03:23:31

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,


> On Mon, 21 Jun 2021, Wang Yugui wrote:
> > Hi,
> >
> > > > > It seems more fixes are needed.
> > > >
> > > > I think the problem is that the submount doesn't appear in /proc/mounts.
> > > > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> > > > filesystem to the mount point. To do this it walks through /proc/mounts
> > > > checking the uuid of each filesystem. If a filesystem isn't listed
> > > > there, it obviously fails.
> > > >
> > > > I guess you could add code to nfs-utils to do whatever "btrfs subvol
> > > > list" does to make up for the fact that btrfs doesn't register in
> > > > /proc/mounts.
> > >
> > > Another approach might be to just change svcxdr_encode_fattr3() and
> > > nfsd4_encode_fattr() in the 'FSIDSOJURCE_UUID' case to check if
> > > dentry->d_inode has a different btrfs volume id to
> > > exp->ex_path.dentry->d_inode.
> > > If it does, then mix the volume id into the fsid somehow.
> > >
> > > With that, you wouldn't want the first change I suggested.
> >
> > This is what I have done. and it is based on linux 5.10.44
> >
> > but it still not work, so still more jobs needed.
> >
>
> The following is more what I had in mind. It doesn't quite work and I
> cannot work out why.
>
> If you 'stat' a file inside the subvol, then 'find' will not complete.
> If you don't, then it will.
>
> Doing that 'stat' changes the st_dev number of the main filesystem,
> which seems really weird.
> I'm probably missing something obvious. Maybe a more careful analysis
> of what is changing when will help.

we compare the trace output between crossmnt and btrfs subvol with some
trace, we found out that we need to add the subvol support to
follow_down().

btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().

Best Regards
Wang Yugui ([email protected])
2021/06/22


> NeilBrown
>
>
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index 9421dae22737..790a3357525d 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -15,6 +15,7 @@
> #include <linux/slab.h>
> #include <linux/namei.h>
> #include <linux/module.h>
> +#include <linux/statfs.h>
> #include <linux/exportfs.h>
> #include <linux/sunrpc/svc_xprt.h>
>
> @@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
> int err;
> struct auth_domain *dom = NULL;
> struct svc_export exp = {}, *expp;
> + struct kstatfs statfs;
> int an_int;
>
> if (mesg[mlen-1] != '\n')
> @@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
> err = kern_path(buf, 0, &exp.ex_path);
> if (err)
> goto out1;
> + err = vfs_statfs(&exp.ex_path, &statfs);
> + if (err)
> + goto out3;
> + exp.ex_fsid64 = statfs.f_fsid;
>
> exp.ex_client = dom;
> exp.cd = cd;
> @@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
> new->ex_anon_uid = item->ex_anon_uid;
> new->ex_anon_gid = item->ex_anon_gid;
> new->ex_fsid = item->ex_fsid;
> + new->ex_fsid64 = item->ex_fsid64;
> new->ex_devid_map = item->ex_devid_map;
> item->ex_devid_map = NULL;
> new->ex_uuid = item->ex_uuid;
> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
> index ee0e3aba4a6e..d3eb9a599918 100644
> --- a/fs/nfsd/export.h
> +++ b/fs/nfsd/export.h
> @@ -68,6 +68,7 @@ struct svc_export {
> kuid_t ex_anon_uid;
> kgid_t ex_anon_gid;
> int ex_fsid;
> + __kernel_fsid_t ex_fsid64;
> unsigned char * ex_uuid; /* 16 byte fsid */
> struct nfsd4_fs_locations ex_fslocs;
> uint32_t ex_nflavors;
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 7abeccb975b2..8144e6037eae 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2869,6 +2869,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> if (err)
> goto out_nfserr;
> if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
> + FATTR4_WORD0_FSID |
> FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
> (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
> FATTR4_WORD1_SPACE_TOTAL))) {
> @@ -3024,6 +3025,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> case FSIDSOURCE_UUID:
> p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
> EX_UUID_LEN);
> + if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> + statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
> + /* looks like a btrfs subvol */
> + p[-2] ^= statfs.f_fsid.val[0];
> + p[-1] ^= statfs.f_fsid.val[1];
> + }
> break;
> }
> }


Attachments:
0001-nfsd-btrfs-subvol-support.txt (5.39 kB)
0002-trace-nfsd-btrfs-subvol-support.txt (5.76 kB)
Download all attachments

2021-06-22 07:15:40

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

> > >
> > > > > > It seems more fixes are needed.
> > > > >
> > > > > I think the problem is that the submount doesn't appear in /proc/mounts.
> > > > > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> > > > > filesystem to the mount point. To do this it walks through /proc/mounts
> > > > > checking the uuid of each filesystem. If a filesystem isn't listed
> > > > > there, it obviously fails.
> > > > >
> > > > > I guess you could add code to nfs-utils to do whatever "btrfs subvol
> > > > > list" does to make up for the fact that btrfs doesn't register in
> > > > > /proc/mounts.
> > > >
> > > > Another approach might be to just change svcxdr_encode_fattr3() and
> > > > nfsd4_encode_fattr() in the 'FSIDSOJURCE_UUID' case to check if
> > > > dentry->d_inode has a different btrfs volume id to
> > > > exp->ex_path.dentry->d_inode.
> > > > If it does, then mix the volume id into the fsid somehow.
> > > >
> > > > With that, you wouldn't want the first change I suggested.
> > >
> > > This is what I have done. and it is based on linux 5.10.44
> > >
> > > but it still not work, so still more jobs needed.
> > >
> >
> > The following is more what I had in mind. It doesn't quite work and I
> > cannot work out why.
> >
> > If you 'stat' a file inside the subvol, then 'find' will not complete.
> > If you don't, then it will.
> >
> > Doing that 'stat' changes the st_dev number of the main filesystem,
> > which seems really weird.
> > I'm probably missing something obvious. Maybe a more careful analysis
> > of what is changing when will help.
>
> we compare the trace output between crossmnt and btrfs subvol with some
> trace, we found out that we need to add the subvol support to
> follow_down().
>
> btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().

btrfs subvol crossmnt begin to work, although buggy.

some subvol is crossmnt-ed, some subvol is yet not, and some dir is
wrongly crossmnt-ed

'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin
to happen.

This is the current patch based on 5.10.44.
At least nfsd_follow_up() is buggy.

Best Regards
Wang Yugui ([email protected])
2021/06/22


Attachments:
0001-nfsd-btrfs-subvol-support.txt (5.90 kB)
0002-trace-nfsd-btrfs-subvol-support.txt (5.76 kB)
Download all attachments

2021-06-22 17:36:23

by Frank Filz

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On 6/21/21 3:41 PM, Wang Yugui wrote:
> Hi,
>
>> OK thanks for the information. I think they will just work in nfs-ganesha as
>> long as the snapshots or subvols are mounted within an nfs-ganesha export or
>> are exported explicitly. nfs-ganesha has the equivalent of knfsd's
>> nohide/crossmnt options and when nfs-ganesha detects crossing a filesystem
>> boundary will lookup the filesystem via getmntend and listing btrfs subvols
>> and then expose that filesystem (via the fsid attribute) to the clients
>> where at least the Linux nfs client will detect a filesystem boundary and
>> create a new mount entry for it.
>
> Not only exported explicitly, but also kept in the same hierarchy.
>
> If we export
> /mnt/test #the btrfs
> /mnt/test/sub1 # the btrfs subvol 1
> /mnt/test/sub2 # the btrfs subvol 2
>
> we need to make sure we will not access '/mnt/test/sub1' through '/mnt/test'
> from nfs client.
>
> current safe export:
> #/mnt/test #the btrfs, not exported
> /mnt/test/sub1 # the btrfs subvol 1
> /mnt/test/sub2 # the btrfs subvol 2
>

What's the problem with exporting /mnt/test AND then exporting sub1 and
sub2 as crossmnt exports? As far as I can tell, that seems to work just
fine with nfs-ganesha.

Frank

2021-06-22 22:49:27

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

> On 6/21/21 3:41 PM, Wang Yugui wrote:
> > Hi,
> >
> >> OK thanks for the information. I think they will just work in nfs-ganesha as
> >> long as the snapshots or subvols are mounted within an nfs-ganesha export or
> >> are exported explicitly. nfs-ganesha has the equivalent of knfsd's
> >> nohide/crossmnt options and when nfs-ganesha detects crossing a filesystem
> >> boundary will lookup the filesystem via getmntend and listing btrfs subvols
> >> and then expose that filesystem (via the fsid attribute) to the clients
> >> where at least the Linux nfs client will detect a filesystem boundary and
> >> create a new mount entry for it.
> >
> > Not only exported explicitly, but also kept in the same hierarchy.
> >
> > If we export
> > /mnt/test #the btrfs
> > /mnt/test/sub1 # the btrfs subvol 1
> > /mnt/test/sub2 # the btrfs subvol 2
> >
> > we need to make sure we will not access '/mnt/test/sub1' through '/mnt/test'
> > from nfs client.
> >
> > current safe export:
> > #/mnt/test #the btrfs, not exported
> > /mnt/test/sub1 # the btrfs subvol 1
> > /mnt/test/sub2 # the btrfs subvol 2
> >
>
> What's the problem with exporting /mnt/test AND then exporting sub1 and sub2 as crossmnt exports? As far as I can tell, that seems to work just fine with nfs-ganesha.

I'm not sure what will happen on nfs-ganesha.

crossmnt(kernel nfsd) failed to work when exporting /mnt/test,/mnt/test/sub1,
/mnt/test/sub2.

# /bin/find /nfs/test/
/nfs/test/
find: File system loop detected; ??/nfs/test/sub1?? is part of the same file system loop as ??/nfs/test/??.
/nfs/test/.snapshot
find: File system loop detected; ??/nfs/test/.snapshot/sub1-s1?? is part of the same file system loop as ??/nfs/test/??.
find: File system loop detected; ??/nfs/test/.snapshot/sub2-s1?? is part of the same file system loop as ??/nfs/test/??.
/nfs/test/dir1
/nfs/test/dir1/a.txt
find: File system loop detected; ??/nfs/test/sub2?? is part of the same file system loop as ??/nfs/test/??

/bin/find report 'File system loop detected', it means that vfs cache(
based on st_dev + st_ino?) on nfs client side will have some problem?

In fact, I was exporting /mnt/test for years too.
but btrfs subvols means multiple filesystems(different st_dev), in theory, we
needs to use it based on nfs crossmnt.

Best Regards
Wang Yugui ([email protected])
2021/06/23

2021-06-23 00:59:27

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Tue, 22 Jun 2021, Wang Yugui wrote:
> >
> > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().
>
> btrfs subvol crossmnt begin to work, although buggy.
>
> some subvol is crossmnt-ed, some subvol is yet not, and some dir is
> wrongly crossmnt-ed
>
> 'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin
> to happen.
>
> This is the current patch based on 5.10.44.
> At least nfsd_follow_up() is buggy.
>

I don't think the approach you are taking makes sense. Let me explain
why.

The problem is that applications on the NFS client can see different
files or directories on the same (apparent) filesystem with the same
inode number. Most application won't care and NFS itself doesn't get
confused by the duplicate inode numbers, but 'find' and similar programs
(probably 'tar' for example) do get upset.

This happens because BTRFS reuses inode numbers in subvols which it
presents to the kernel as all part of the one filesystem (or at least,
all part of the one mount point). NFSD only sees one filesystem, and so
reports the same filesystem-id (fsid) for all objects. The NFS client
then sees that the fsid is the same and tells applications that the
objects are all in the one filesystem.

To fix this, we need to make sure that nfsd reports a different fsid for
objects in different subvols. There are two obvious ways to do this.

One is to teach nfsd to recognize btrfs subvolumes exactly like separate
filesystems (as nfsd already ensure each filesystem gets its own fsid).
This is the approach of my first suggestion. It requires changing
nfsd_mountpoint() and follow_up() and any other code that is aware of
different filesytems. As I mentioned, it also requires changing mountd
to be able to extract a list of subvols from btrfs because they don't
appear in /proc/mounts.

As you might know an NFS filehandle has 3 parts: a header, a filesystem
identifier, and an inode identifier. This approach would involve giving
different subvols different filesystem identifiers in the filehandle.
This, it turns out is a very big change - bigger than I at first
imagined.

The second obvious approach is to leave the filehandles unchanged and to
continue to treat an entire btrfs filesystem as a single filesystem
EXCEPT when reporting the fsid to the NFS client. All we *really* need
to do is make sure the client sees a different fsid when it enters a
part of the filesystem which re-uses inode numbers. This is what my
latest patch did.

Your patch seems to combine ideas from both approaches. It includes my
code to replace the fsid, but also intercepts follow_up etc. This
cannot be useful.

As I noted when I posted it, there is a problem with my patch. I now
understand that problem.

When NFS sees that fsid change it needs to create 2 inodes for that
directory. One inode will be in the parent filesystem and will be
marked as an auto-mount point so that any lookup below that directory
will trigger an internal mount. The other inode is the root of the
child filesystem. It gets mounted on the first inode.

With normal filesystem mounts, there really is an inode in the parent
filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID
attribute. This fileid will be different from all other inode numbers
in the parent filesystem.

With BTRFS there is no inode in the parent volume (as far as I know) so
there is nothing useful to return for MOUNTED_ON_FILEID. This results
in NFS using the same inode number for the inode in the parent
filesystem as the inode in the child filesystem. For btrfs, this will
be 256. As there is already an inode in the parent filesystem with inum
256, 'find' complains.

The following patch addresses this by adding code to nfsd when it
determines MOUINTD_ON_FILEID to choose an number that should be unused
in btrfs. With this change, 'find' seems to work correctly with NFSv4
mounts of btrfs.

This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID
attribute - strictly speaking, the NFSv3 protocol doesn't support
crossing mount points, though the Linux implementation does allow it.

So this patch works and, I think, is the best we can do in terms of
functionality. I don't like the details of the implementation though.
It requires NFSD to know too much about BTRFS internals.

I think I would like btrfs to make it clear where a subvol started,
maybe by setting DCACHE_MOUNTED on the dentry. This flag is only a
hint, not a promise of anything, so other code should get confused.
This would have nfsd calling vfs_statfs quite so often .... maybe that
isn't such a big deal.

More importantly, there needs to be some way for NFSD to find an inode
number to report for the MOUNTED_ON_FILEID. This needs to be a number
not used elsewhere in the filesystem. It might be safe to use the
same fileid for all subvols (as my patch currently does), but we would
need to confirm that 'find' and 'tar' don't complain about that or
mishandle it. If it is safe to use the same fileid, then a new field in
the superblock to store it might work. If a different fileid is needed,
the we might need a new field in 'struct kstatfs', so vfs_statfs can
report it.

Anyway, here is my current patch. It includes support for NFSv3 as well
as NFSv4.

NeilBrown

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 9421dae22737..790a3357525d 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -15,6 +15,7 @@
#include <linux/slab.h>
#include <linux/namei.h>
#include <linux/module.h>
+#include <linux/statfs.h>
#include <linux/exportfs.h>
#include <linux/sunrpc/svc_xprt.h>

@@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
int err;
struct auth_domain *dom = NULL;
struct svc_export exp = {}, *expp;
+ struct kstatfs statfs;
int an_int;

if (mesg[mlen-1] != '\n')
@@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
err = kern_path(buf, 0, &exp.ex_path);
if (err)
goto out1;
+ err = vfs_statfs(&exp.ex_path, &statfs);
+ if (err)
+ goto out3;
+ exp.ex_fsid64 = statfs.f_fsid;

exp.ex_client = dom;
exp.cd = cd;
@@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
new->ex_anon_uid = item->ex_anon_uid;
new->ex_anon_gid = item->ex_anon_gid;
new->ex_fsid = item->ex_fsid;
+ new->ex_fsid64 = item->ex_fsid64;
new->ex_devid_map = item->ex_devid_map;
item->ex_devid_map = NULL;
new->ex_uuid = item->ex_uuid;
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index ee0e3aba4a6e..d3eb9a599918 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -68,6 +68,7 @@ struct svc_export {
kuid_t ex_anon_uid;
kgid_t ex_anon_gid;
int ex_fsid;
+ __kernel_fsid_t ex_fsid64;
unsigned char * ex_uuid; /* 16 byte fsid */
struct nfsd4_fs_locations ex_fslocs;
uint32_t ex_nflavors;
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 0a5ebc52e6a9..f11ba3434fd6 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -367,10 +367,18 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
case FSIDSOURCE_FSID:
fsid = (u64)fhp->fh_export->ex_fsid;
break;
- case FSIDSOURCE_UUID:
+ case FSIDSOURCE_UUID: {
+ struct kstatfs statfs;
+
fsid = ((u64 *)fhp->fh_export->ex_uuid)[0];
fsid ^= ((u64 *)fhp->fh_export->ex_uuid)[1];
+ if (fh_getstafs(fhp, &statfs) == 0 &&
+ (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] ||
+ statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1]))
+ /* looks like a btrfs subvol */
+ fsid = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1];
break;
+ }
default:
fsid = (u64)huge_encode_dev(fhp->fh_dentry->d_sb->s_dev);
}
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7abeccb975b2..5f614d1b362e 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -42,6 +42,7 @@
#include <linux/sunrpc/svcauth_gss.h>
#include <linux/sunrpc/addr.h>
#include <linux/xattr.h>
+#include <linux/btrfs_tree.h>
#include <uapi/linux/xattr.h>

#include "idmap.h"
@@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
if (err)
goto out_nfserr;
if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
+ FATTR4_WORD0_FSID |
FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
(bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
+ FATTR4_WORD1_MOUNTED_ON_FILEID |
FATTR4_WORD1_SPACE_TOTAL))) {
err = vfs_statfs(&path, &statfs);
if (err)
@@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
case FSIDSOURCE_UUID:
p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
EX_UUID_LEN);
+ if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+ statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
+ /* looks like a btrfs subvol */
+ p[-2] ^= statfs.f_fsid.val[0];
+ p[-1] ^= statfs.f_fsid.val[1];
+ }
break;
}
}
@@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
goto out_nfserr;
ino = parent_stat.ino;
}
+ if (fsid_source(fhp) == FSIDSOURCE_UUID &&
+ (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+ statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]))
+ /* btrfs subvol pseudo mount point */
+ ino = BTRFS_FIRST_FREE_OBJECTID-1;
+
p = xdr_encode_hyper(p, ino);
}
#ifdef CONFIG_NFSD_PNFS
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index b21b76e6b9a8..82b76b0b7bec 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
AT_STATX_SYNC_AS_STAT));
}

+static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs)
+{
+ struct path p = {.mnt = fh->fh_export->ex_path.mnt,
+ .dentry = fh->fh_dentry};
+ return nfserrno(vfs_statfs(&p, statfs));
+}
+
static inline int nfsd_create_is_exclusive(int createmode)
{
return createmode == NFS3_CREATE_EXCLUSIVE

2021-06-23 06:15:25

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

This patch works very well. Thanks a lot.
- crossmnt of btrfs subvol works as expected.
- nfs/umount subvol works well.
- pseudo mount point inode(255) is good.

I test it in 5.10.45 with a few minor rebase.
( see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch,
just fs/nfsd/nfs3xdr.c rebase)

But when I tested it with another btrfs system without subvol but with
more data, 'find /nfs/test' caused a OOPS . and this OOPS will not
happen just without this patch.

The data in this filesystem is created/left by xfstest(FSTYP=nfs,
TEST_DEV).

#nfs4 option: default mount.nfs4, nfs-utils-2.3.3

# access btrfs directly
$ find /mnt/test | wc -l
6612

# access btrfs through nfs
$ find /nfs/test | wc -l

[ 466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 466.172123] #PF: supervisor read access in kernel mode
[ 466.177857] #PF: error_code(0x0000) - not-present page
[ 466.183601] PGD 0 P4D 0
[ 466.186443] Oops: 0000 [#1] SMP NOPTI
[ 466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
[ 466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
[ 466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
[ 466.212067] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9
f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83
[ 466.233061] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246
[ 466.238894] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00
[ 466.246872] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[ 466.254848] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001
[ 466.262825] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000
[ 466.270802] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018
[ 466.278779] FS: 0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000
[ 466.287823] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 466.294246] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0
[ 466.302222] Call Trace:
[ 466.304970] nfsd4_encode_fattr+0x15ac/0x1940 [nfsd]
[ 466.310557] ? btrfs_verify_level_key+0xad/0xf0 [btrfs]
[ 466.316413] ? btrfs_search_slot+0x8e3/0x900 [btrfs]
[ 466.321973] nfsd4_encode_dirent+0x160/0x3b0 [nfsd]
[ 466.327434] nfsd_readdir+0x199/0x240 [nfsd]
[ 466.332215] ? nfsd4_encode_getattr+0x30/0x30 [nfsd]
[ 466.337771] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
[ 466.343714] ? security_prepare_creds+0x6f/0xa0
[ 466.348788] nfsd4_encode_readdir+0xd9/0x1c0 [nfsd]
[ 466.354250] nfsd4_encode_operation+0x9b/0x1b0 [nfsd]
[ 466.360430] nfsd4_proc_compound+0x4e3/0x710 [nfsd]
[ 466.366352] nfsd_dispatch+0xd4/0x180 [nfsd]
[ 466.371620] svc_process_common+0x392/0x6c0 [sunrpc]
[ 466.377650] ? svc_recv+0x3c4/0x8a0 [sunrpc]
[ 466.382883] ? nfsd_svc+0x300/0x300 [nfsd]
[ 466.387908] ? nfsd_destroy+0x60/0x60 [nfsd]
[ 466.393126] svc_process+0xb7/0xf0 [sunrpc]
[ 466.398234] nfsd+0xe8/0x140 [nfsd]
[ 466.402555] kthread+0x116/0x130
[ 466.406579] ? kthread_park+0x80/0x80
[ 466.411091] ret_from_fork+0x1f/0x30
[ 466.415499] Modules linked in: acpi_ipmi rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rfkill intel_rapl_m
sr intel_rapl_common iTCO_wdt intel_pmc_bxt iTCO_vendor_support dcdbas ipmi_ssif sb_edac x86_pkg_temp_thermal
intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_si rapl intel_cstate mei_me ipmi_devintf intel_uncore j
oydev mei ipmi_msghandler lpc_ich acpi_power_meter nvme_rdma nvme_fabrics rdma_cm iw_cm ib_cm rdmavt nfsd rdma
_rxe ib_uverbs ip6_udp_tunnel udp_tunnel ib_core auth_rpcgss nfs_acl lockd grace nfs_ssc ip_tables xfs mgag200
drm_kms_helper crct10dif_pclmul crc32_pclmul btrfs cec crc32c_intel xor bnx2x raid6_pq drm igb mpt3sas ghash_
clmulni_intel pcspkr nvme megaraid_sas mdio nvme_core dca raid_class i2c_algo_bit scsi_transport_sas wmi dm_mu
ltipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sunrpc i2c_dev
[ 466.499551] CR2: 0000000000000004
[ 466.503759] ---[ end trace 91eb52bf0cb65801 ]---
[ 466.511948] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
[ 466.517714] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9
f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83
[ 466.539753] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246
[ 466.546122] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00
[ 466.554625] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[ 466.563096] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001
[ 466.571572] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000
[ 466.580024] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018
[ 466.588487] FS: 0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000
[ 466.598032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 466.604973] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0
[ 466.613467] Kernel panic - not syncing: Fatal exception
[ 466.807651] Kernel Offset: 0x12000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xfffff
fffbfffffff)
[ 466.823190] ---[ end Kernel panic - not syncing: Fatal exception ]---


Best Regards
Wang Yugui ([email protected])
2021/06/23

> On Tue, 22 Jun 2021, Wang Yugui wrote:
> > >
> > > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().
> >
> > btrfs subvol crossmnt begin to work, although buggy.
> >
> > some subvol is crossmnt-ed, some subvol is yet not, and some dir is
> > wrongly crossmnt-ed
> >
> > 'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin
> > to happen.
> >
> > This is the current patch based on 5.10.44.
> > At least nfsd_follow_up() is buggy.
> >
>
> I don't think the approach you are taking makes sense. Let me explain
> why.
>
> The problem is that applications on the NFS client can see different
> files or directories on the same (apparent) filesystem with the same
> inode number. Most application won't care and NFS itself doesn't get
> confused by the duplicate inode numbers, but 'find' and similar programs
> (probably 'tar' for example) do get upset.
>
> This happens because BTRFS reuses inode numbers in subvols which it
> presents to the kernel as all part of the one filesystem (or at least,
> all part of the one mount point). NFSD only sees one filesystem, and so
> reports the same filesystem-id (fsid) for all objects. The NFS client
> then sees that the fsid is the same and tells applications that the
> objects are all in the one filesystem.
>
> To fix this, we need to make sure that nfsd reports a different fsid for
> objects in different subvols. There are two obvious ways to do this.
>
> One is to teach nfsd to recognize btrfs subvolumes exactly like separate
> filesystems (as nfsd already ensure each filesystem gets its own fsid).
> This is the approach of my first suggestion. It requires changing
> nfsd_mountpoint() and follow_up() and any other code that is aware of
> different filesytems. As I mentioned, it also requires changing mountd
> to be able to extract a list of subvols from btrfs because they don't
> appear in /proc/mounts.
>
> As you might know an NFS filehandle has 3 parts: a header, a filesystem
> identifier, and an inode identifier. This approach would involve giving
> different subvols different filesystem identifiers in the filehandle.
> This, it turns out is a very big change - bigger than I at first
> imagined.
>
> The second obvious approach is to leave the filehandles unchanged and to
> continue to treat an entire btrfs filesystem as a single filesystem
> EXCEPT when reporting the fsid to the NFS client. All we *really* need
> to do is make sure the client sees a different fsid when it enters a
> part of the filesystem which re-uses inode numbers. This is what my
> latest patch did.
>
> Your patch seems to combine ideas from both approaches. It includes my
> code to replace the fsid, but also intercepts follow_up etc. This
> cannot be useful.
>
> As I noted when I posted it, there is a problem with my patch. I now
> understand that problem.
>
> When NFS sees that fsid change it needs to create 2 inodes for that
> directory. One inode will be in the parent filesystem and will be
> marked as an auto-mount point so that any lookup below that directory
> will trigger an internal mount. The other inode is the root of the
> child filesystem. It gets mounted on the first inode.
>
> With normal filesystem mounts, there really is an inode in the parent
> filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID
> attribute. This fileid will be different from all other inode numbers
> in the parent filesystem.
>
> With BTRFS there is no inode in the parent volume (as far as I know) so
> there is nothing useful to return for MOUNTED_ON_FILEID. This results
> in NFS using the same inode number for the inode in the parent
> filesystem as the inode in the child filesystem. For btrfs, this will
> be 256. As there is already an inode in the parent filesystem with inum
> 256, 'find' complains.
>
> The following patch addresses this by adding code to nfsd when it
> determines MOUINTD_ON_FILEID to choose an number that should be unused
> in btrfs. With this change, 'find' seems to work correctly with NFSv4
> mounts of btrfs.
>
> This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID
> attribute - strictly speaking, the NFSv3 protocol doesn't support
> crossing mount points, though the Linux implementation does allow it.
>
> So this patch works and, I think, is the best we can do in terms of
> functionality. I don't like the details of the implementation though.
> It requires NFSD to know too much about BTRFS internals.
>
> I think I would like btrfs to make it clear where a subvol started,
> maybe by setting DCACHE_MOUNTED on the dentry. This flag is only a
> hint, not a promise of anything, so other code should get confused.
> This would have nfsd calling vfs_statfs quite so often .... maybe that
> isn't such a big deal.
>
> More importantly, there needs to be some way for NFSD to find an inode
> number to report for the MOUNTED_ON_FILEID. This needs to be a number
> not used elsewhere in the filesystem. It might be safe to use the
> same fileid for all subvols (as my patch currently does), but we would
> need to confirm that 'find' and 'tar' don't complain about that or
> mishandle it. If it is safe to use the same fileid, then a new field in
> the superblock to store it might work. If a different fileid is needed,
> the we might need a new field in 'struct kstatfs', so vfs_statfs can
> report it.
>
> Anyway, here is my current patch. It includes support for NFSv3 as well
> as NFSv4.
>
> NeilBrown
>
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index 9421dae22737..790a3357525d 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -15,6 +15,7 @@
> #include <linux/slab.h>
> #include <linux/namei.h>
> #include <linux/module.h>
> +#include <linux/statfs.h>
> #include <linux/exportfs.h>
> #include <linux/sunrpc/svc_xprt.h>
>
> @@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
> int err;
> struct auth_domain *dom = NULL;
> struct svc_export exp = {}, *expp;
> + struct kstatfs statfs;
> int an_int;
>
> if (mesg[mlen-1] != '\n')
> @@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
> err = kern_path(buf, 0, &exp.ex_path);
> if (err)
> goto out1;
> + err = vfs_statfs(&exp.ex_path, &statfs);
> + if (err)
> + goto out3;
> + exp.ex_fsid64 = statfs.f_fsid;
>
> exp.ex_client = dom;
> exp.cd = cd;
> @@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
> new->ex_anon_uid = item->ex_anon_uid;
> new->ex_anon_gid = item->ex_anon_gid;
> new->ex_fsid = item->ex_fsid;
> + new->ex_fsid64 = item->ex_fsid64;
> new->ex_devid_map = item->ex_devid_map;
> item->ex_devid_map = NULL;
> new->ex_uuid = item->ex_uuid;
> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
> index ee0e3aba4a6e..d3eb9a599918 100644
> --- a/fs/nfsd/export.h
> +++ b/fs/nfsd/export.h
> @@ -68,6 +68,7 @@ struct svc_export {
> kuid_t ex_anon_uid;
> kgid_t ex_anon_gid;
> int ex_fsid;
> + __kernel_fsid_t ex_fsid64;
> unsigned char * ex_uuid; /* 16 byte fsid */
> struct nfsd4_fs_locations ex_fslocs;
> uint32_t ex_nflavors;
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 0a5ebc52e6a9..f11ba3434fd6 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -367,10 +367,18 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
> case FSIDSOURCE_FSID:
> fsid = (u64)fhp->fh_export->ex_fsid;
> break;
> - case FSIDSOURCE_UUID:
> + case FSIDSOURCE_UUID: {
> + struct kstatfs statfs;
> +
> fsid = ((u64 *)fhp->fh_export->ex_uuid)[0];
> fsid ^= ((u64 *)fhp->fh_export->ex_uuid)[1];
> + if (fh_getstafs(fhp, &statfs) == 0 &&
> + (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] ||
> + statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1]))
> + /* looks like a btrfs subvol */
> + fsid = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1];
> break;
> + }
> default:
> fsid = (u64)huge_encode_dev(fhp->fh_dentry->d_sb->s_dev);
> }
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 7abeccb975b2..5f614d1b362e 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -42,6 +42,7 @@
> #include <linux/sunrpc/svcauth_gss.h>
> #include <linux/sunrpc/addr.h>
> #include <linux/xattr.h>
> +#include <linux/btrfs_tree.h>
> #include <uapi/linux/xattr.h>
>
> #include "idmap.h"
> @@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> if (err)
> goto out_nfserr;
> if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
> + FATTR4_WORD0_FSID |
> FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
> (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
> + FATTR4_WORD1_MOUNTED_ON_FILEID |
> FATTR4_WORD1_SPACE_TOTAL))) {
> err = vfs_statfs(&path, &statfs);
> if (err)
> @@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> case FSIDSOURCE_UUID:
> p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
> EX_UUID_LEN);
> + if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> + statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
> + /* looks like a btrfs subvol */
> + p[-2] ^= statfs.f_fsid.val[0];
> + p[-1] ^= statfs.f_fsid.val[1];
> + }
> break;
> }
> }
> @@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> goto out_nfserr;
> ino = parent_stat.ino;
> }
> + if (fsid_source(fhp) == FSIDSOURCE_UUID &&
> + (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> + statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]))
> + /* btrfs subvol pseudo mount point */
> + ino = BTRFS_FIRST_FREE_OBJECTID-1;
> +
> p = xdr_encode_hyper(p, ino);
> }
> #ifdef CONFIG_NFSD_PNFS
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index b21b76e6b9a8..82b76b0b7bec 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
> AT_STATX_SYNC_AS_STAT));
> }
>
> +static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs)
> +{
> + struct path p = {.mnt = fh->fh_export->ex_path.mnt,
> + .dentry = fh->fh_dentry};
> + return nfserrno(vfs_statfs(&p, statfs));
> +}
> +
> static inline int nfsd_create_is_exclusive(int createmode)
> {
> return createmode == NFS3_CREATE_EXCLUSIVE


Attachments:
0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch (10.20 kB)

2021-06-23 06:30:32

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Wed, 23 Jun 2021, Wang Yugui wrote:
> Hi,
>
> This patch works very well. Thanks a lot.
> - crossmnt of btrfs subvol works as expected.
> - nfs/umount subvol works well.
> - pseudo mount point inode(255) is good.
>
> I test it in 5.10.45 with a few minor rebase.
> ( see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch,
> just fs/nfsd/nfs3xdr.c rebase)
>
> But when I tested it with another btrfs system without subvol but with
> more data, 'find /nfs/test' caused a OOPS . and this OOPS will not
> happen just without this patch.
>
> The data in this filesystem is created/left by xfstest(FSTYP=nfs,
> TEST_DEV).
>
> #nfs4 option: default mount.nfs4, nfs-utils-2.3.3
>
> # access btrfs directly
> $ find /mnt/test | wc -l
> 6612
>
> # access btrfs through nfs
> $ find /nfs/test | wc -l
>
> [ 466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
> [ 466.172123] #PF: supervisor read access in kernel mode
> [ 466.177857] #PF: error_code(0x0000) - not-present page
> [ 466.183601] PGD 0 P4D 0
> [ 466.186443] Oops: 0000 [#1] SMP NOPTI
> [ 466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
> [ 466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
> [ 466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]

in nfsd4_encode_fattr there is code:

if ((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) && !fhp) {
tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
status = nfserr_jukebox;
if (!tempfh)
goto out;
fh_init(tempfh, NFS4_FHSIZE);
status = fh_compose(tempfh, exp, dentry, NULL);
if (status)
goto out;
fhp = tempfh;
}

Change that to test for (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID) as
well.

NeilBrown

2021-06-23 09:34:26

by Wang Yugui

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Hi,

> On Wed, 23 Jun 2021, Wang Yugui wrote:
> > Hi,
> >
> > This patch works very well. Thanks a lot.
> > - crossmnt of btrfs subvol works as expected.
> > - nfs/umount subvol works well.
> > - pseudo mount point inode(255) is good.
> >
> > I test it in 5.10.45 with a few minor rebase.
> > ( see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch,
> > just fs/nfsd/nfs3xdr.c rebase)
> >
> > But when I tested it with another btrfs system without subvol but with
> > more data, 'find /nfs/test' caused a OOPS . and this OOPS will not
> > happen just without this patch.
> >
> > The data in this filesystem is created/left by xfstest(FSTYP=nfs,
> > TEST_DEV).
> >
> > #nfs4 option: default mount.nfs4, nfs-utils-2.3.3
> >
> > # access btrfs directly
> > $ find /mnt/test | wc -l
> > 6612
> >
> > # access btrfs through nfs
> > $ find /nfs/test | wc -l
> >
> > [ 466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
> > [ 466.172123] #PF: supervisor read access in kernel mode
> > [ 466.177857] #PF: error_code(0x0000) - not-present page
> > [ 466.183601] PGD 0 P4D 0
> > [ 466.186443] Oops: 0000 [#1] SMP NOPTI
> > [ 466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
> > [ 466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
> > [ 466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
>
> in nfsd4_encode_fattr there is code:
>
> if ((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) && !fhp) {
> tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
> status = nfserr_jukebox;
> if (!tempfh)
> goto out;
> fh_init(tempfh, NFS4_FHSIZE);
> status = fh_compose(tempfh, exp, dentry, NULL);
> if (status)
> goto out;
> fhp = tempfh;
> }
>
> Change that to test for (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID) as
> well.
>
> NeilBrown


It works well.

- if ((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) && !fhp) {
+ if (((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) ||
+ (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID))
+ && !fhp) {
tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
status = nfserr_jukebox;
if (!tempfh)


And I tested some case about statfs.f_fsid conflict between btrfs
filesystem, and it works well too.

Is it safe in theory too?

test case:
two btrfs filesystem with just 1 bit diff of UUID
# blkid /dev/sdb1 /dev/sdb2
/dev/sdb1: UUID="35327ecf-a5a7-4617-a160-1fdbfd644940" UUID_SUB="a831ebde-1e66-4592-bfde-7a86fd6478b5" BLOCK_SIZE="4096" TYPE="btrfs" PARTLABEL="primary" PARTUUID="3e30a849-88db-4fb3-92e6-b66bfbe9cb98"
/dev/sdb2: UUID="35327ecf-a5a7-4617-a160-1fdbfd644941" UUID_SUB="31e07d66-a656-48a8-b1fb-5b438565238e" BLOCK_SIZE="4096" TYPE="btrfs" PARTLABEL="primary" PARTUUID="93a2db85-065a-4ecf-89d4-6a8dcdb8ff99"

both have 3 subvols.
# btrfs subvolume list /mnt/test
ID 256 gen 13 top level 5 path sub1
ID 257 gen 13 top level 5 path sub2
ID 258 gen 13 top level 5 path sub3
# btrfs subvolume list /mnt/scratch
ID 256 gen 13 top level 5 path sub1
ID 257 gen 13 top level 5 path sub2
ID 258 gen 13 top level 5 path sub3


statfs.f_fsid.c is the source of 'statfs' command.

# statfs /mnt/test/sub1 /mnt/test/sub2 /mnt/test/sub3 /mnt/scratch/sub1 /mnt/scratch/sub2 /mnt/scratch/sub3
/mnt/test/sub1
f_fsid=0x9452611458c30e57
/mnt/test/sub2
f_fsid=0x9452611458c30e56
/mnt/test/sub3
f_fsid=0x9452611458c30e55
/mnt/scratch/sub1
f_fsid=0x9452611458c30e56
/mnt/scratch/sub2
f_fsid=0x9452611458c30e57
/mnt/scratch/sub3
f_fsid=0x9452611458c30e54

statfs.f_fsid is uniq inside a btrfs and it's subvols.
but statfs.f_fsid is NOT uniq between btrfs filesystems because just 1
bit diff of UUID.

'find /mnt/test/' and 'find /mnt/scratch/' works as expected.


Best Regards
Wang Yugui ([email protected])
2021/06/23


Attachments:
statfs.f_fsid.c (698.00 B)

2021-06-23 15:39:16

by J. Bruce Fields

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

Is there any hope of solving this problem within btrfs?

It doesn't seem like it should have been that difficult for it to give
subvolumes separate superblocks and vfsmounts.

But this has come up before, and I think the answer may have been that
it's just too late to fix.

--b.

2021-06-23 22:07:54

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> Is there any hope of solving this problem within btrfs?
>
> It doesn't seem like it should have been that difficult for it to give
> subvolumes separate superblocks and vfsmounts.
>
> But this has come up before, and I think the answer may have been that
> it's just too late to fix.

It is never too late to do the right thing!

Probably the best approach to fixing this completely on the btrfs side
would be to copy the auto-mount approach used in NFS. NFS sees multiple
different volumes on the server and transparently creates new vfsmounts,
using the automount infrastructure to mount and unmount them. BTRFS
effective sees multiple volumes on the block device and could do the
same thing.

I can only think of one change to the user-space API (other than
/proc/mounts contents) that this would cause and I suspect it could be
resolved if needed.

Currently when you 'stat' the mountpoint of a btrfs subvol you see the
root of that subvol. However when you 'stat' the mountpoint of an NFS
sub-filesystem (before any access below there) you see the mountpoint
(s_dev matches the parent). This is how automounts are expected to work
and if btrfs were switched to use automounts for subvols, stating the
mountpoint would initially show the mountpoint, not the subvol root.

If this were seen to be a problem I doubt it would be hard to add
optional functionality to automount so that 'stat' triggers the mount.

All we really need is:
1/ someone to write the code
2/ someone to review the code
3/ someone to accept the code

How hard can it be :-)

NeilBrown

2021-06-23 22:26:29

by J. Bruce Fields

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Thu, Jun 24, 2021 at 08:04:57AM +1000, NeilBrown wrote:
> On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> > Is there any hope of solving this problem within btrfs?
> >
> > It doesn't seem like it should have been that difficult for it to give
> > subvolumes separate superblocks and vfsmounts.
> >
> > But this has come up before, and I think the answer may have been that
> > it's just too late to fix.
>
> It is never too late to do the right thing!
>
> Probably the best approach to fixing this completely on the btrfs side
> would be to copy the auto-mount approach used in NFS. NFS sees multiple
> different volumes on the server and transparently creates new vfsmounts,
> using the automount infrastructure to mount and unmount them. BTRFS
> effective sees multiple volumes on the block device and could do the
> same thing.

Yes, that makes sense to me.

> I can only think of one change to the user-space API (other than
> /proc/mounts contents) that this would cause and I suspect it could be
> resolved if needed.
>
> Currently when you 'stat' the mountpoint of a btrfs subvol you see the
> root of that subvol. However when you 'stat' the mountpoint of an NFS
> sub-filesystem (before any access below there) you see the mountpoint
> (s_dev matches the parent). This is how automounts are expected to work
> and if btrfs were switched to use automounts for subvols, stating the
> mountpoint would initially show the mountpoint, not the subvol root.
>
> If this were seen to be a problem I doubt it would be hard to add
> optional functionality to automount so that 'stat' triggers the mount.

One other thing I'm not sure about: how do cold cache lookups of
filehandles for (possibly not-yet-mounted) subvolumes work?

> All we really need is:
> 1/ someone to write the code
> 2/ someone to review the code
> 3/ someone to accept the code

Hah. Still, the special exceptions for btrfs seem to be accumulating.
I wonder if that's happening outside nfs as well.

--b.

2021-06-23 23:29:46

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> On Thu, Jun 24, 2021 at 08:04:57AM +1000, NeilBrown wrote:
> > On Thu, 24 Jun 2021, J. Bruce Fields wrote:
>
> One other thing I'm not sure about: how do cold cache lookups of
> filehandles for (possibly not-yet-mounted) subvolumes work?

Ahhhh... that's a good point. Filehandle lookup depends on the target
filesystem being mounted. NFS exporting filesystems which are
auto-mounted on demand would be ... interesting.

That argues in favour of nfsd treating a btrfs filesystem as a single
filesystem and gaining some knowledge about different subvolumes within
a filesystem.

This has implications for NFS re-export. If a filehandle is received
for an NFS filesystem that needs to be automounted, I expect it would
fail.

Or do we want to introduce a third level in the filehandle: filesystem,
subvol, inode. So just the "filesystem" is used to look things up in
/proc/mounts, but "filesystem+subvol" is used to determine the fsid.

Maybe another way to state this is that the filesystem could identify a
number of bytes from the fs-local part of the filehandle that should be
mixed in to the fsid. That might be a reasonably clean interface.

>
> > All we really need is:
> > 1/ someone to write the code
> > 2/ someone to review the code
> > 3/ someone to accept the code
>
> Hah. Still, the special exceptions for btrfs seem to be accumulating.
> I wonder if that's happening outside nfs as well.

I have some colleagues who work on btrfs and based on my occasional
discussions, I think that: yes, btrfs is a bit "special". There are a
number of corner-cases where it doesn't quite behave how one would hope.
This is probably inevitable given they way it is pushing the boundaries
of functionality. It can be a challenge to determine if that "hope" is
actually reasonable, and to figure out a good solution that meets the
need cleanly without imposing performance burdens elsewhere.

NeilBrown

2021-06-23 23:39:32

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Wed, 23 Jun 2021, Wang Yugui wrote:
> Hi,
>
> > On Wed, 23 Jun 2021, Wang Yugui wrote:
> > > Hi,
> > >
> > > This patch works very well. Thanks a lot.
> > > - crossmnt of btrfs subvol works as expected.
> > > - nfs/umount subvol works well.
> > > - pseudo mount point inode(255) is good.
> > >
> > > I test it in 5.10.45 with a few minor rebase.
> > > ( see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch,
> > > just fs/nfsd/nfs3xdr.c rebase)
> > >
> > > But when I tested it with another btrfs system without subvol but with
> > > more data, 'find /nfs/test' caused a OOPS . and this OOPS will not
> > > happen just without this patch.
> > >
> > > The data in this filesystem is created/left by xfstest(FSTYP=nfs,
> > > TEST_DEV).
> > >
> > > #nfs4 option: default mount.nfs4, nfs-utils-2.3.3
> > >
> > > # access btrfs directly
> > > $ find /mnt/test | wc -l
> > > 6612
> > >
> > > # access btrfs through nfs
> > > $ find /nfs/test | wc -l
> > >
> > > [ 466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
> > > [ 466.172123] #PF: supervisor read access in kernel mode
> > > [ 466.177857] #PF: error_code(0x0000) - not-present page
> > > [ 466.183601] PGD 0 P4D 0
> > > [ 466.186443] Oops: 0000 [#1] SMP NOPTI
> > > [ 466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
> > > [ 466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
> > > [ 466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
> >
> > in nfsd4_encode_fattr there is code:
> >
> > if ((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) && !fhp) {
> > tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
> > status = nfserr_jukebox;
> > if (!tempfh)
> > goto out;
> > fh_init(tempfh, NFS4_FHSIZE);
> > status = fh_compose(tempfh, exp, dentry, NULL);
> > if (status)
> > goto out;
> > fhp = tempfh;
> > }
> >
> > Change that to test for (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID) as
> > well.
> >
> > NeilBrown
>
>
> It works well.
>
> - if ((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) && !fhp) {
> + if (((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) ||
> + (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID))
> + && !fhp) {
> tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
> status = nfserr_jukebox;
> if (!tempfh)

Good. Thanks for testing.

>
>
> And I tested some case about statfs.f_fsid conflict between btrfs
> filesystem, and it works well too.
>
> Is it safe in theory too?

Probably .... :-)

>
> test case:
> two btrfs filesystem with just 1 bit diff of UUID
> # blkid /dev/sdb1 /dev/sdb2
> /dev/sdb1: UUID="35327ecf-a5a7-4617-a160-1fdbfd644940" UUID_SUB="a831ebde-1e66-4592-bfde-7a86fd6478b5" BLOCK_SIZE="4096" TYPE="btrfs" PARTLABEL="primary" PARTUUID="3e30a849-88db-4fb3-92e6-b66bfbe9cb98"
> /dev/sdb2: UUID="35327ecf-a5a7-4617-a160-1fdbfd644941" UUID_SUB="31e07d66-a656-48a8-b1fb-5b438565238e" BLOCK_SIZE="4096" TYPE="btrfs" PARTLABEL="primary" PARTUUID="93a2db85-065a-4ecf-89d4-6a8dcdb8ff99"

Having two UUIDs that differ in just one bit would be somewhat unusual.

>
> both have 3 subvols.
> # btrfs subvolume list /mnt/test
> ID 256 gen 13 top level 5 path sub1
> ID 257 gen 13 top level 5 path sub2
> ID 258 gen 13 top level 5 path sub3
> # btrfs subvolume list /mnt/scratch
> ID 256 gen 13 top level 5 path sub1
> ID 257 gen 13 top level 5 path sub2
> ID 258 gen 13 top level 5 path sub3
>
>
> statfs.f_fsid.c is the source of 'statfs' command.

You can use "stat -f".

>
> # statfs /mnt/test/sub1 /mnt/test/sub2 /mnt/test/sub3 /mnt/scratch/sub1 /mnt/scratch/sub2 /mnt/scratch/sub3
> /mnt/test/sub1
> f_fsid=0x9452611458c30e57
> /mnt/test/sub2
> f_fsid=0x9452611458c30e56
> /mnt/test/sub3
> f_fsid=0x9452611458c30e55
> /mnt/scratch/sub1
> f_fsid=0x9452611458c30e56
> /mnt/scratch/sub2
> f_fsid=0x9452611458c30e57
> /mnt/scratch/sub3
> f_fsid=0x9452611458c30e54
>
> statfs.f_fsid is uniq inside a btrfs and it's subvols.
> but statfs.f_fsid is NOT uniq between btrfs filesystems because just 1
> bit diff of UUID.

Maybe we should be using a hash function to mix the various numbers into
the fsid rather than a simple xor.

In general we have no guarantee that the stable identifier for each
filesystem is unique. We rely on randomness and large numbers of bits
making collisions extremely unlikely.
Using xor doesn't help, but we would need some hash scheme that is
guaranteed to be stable. Maybe the Jenkins Hash.

NeilBrown


>
> 'find /mnt/test/' and 'find /mnt/scratch/' works as expected.
>
>
> Best Regards
> Wang Yugui ([email protected])
> 2021/06/23
>
>

2021-06-23 23:43:01

by Frank Filz

[permalink] [raw]
Subject: RE: any idea about auto export multiple btrfs snapshots?

> On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> > On Thu, Jun 24, 2021 at 08:04:57AM +1000, NeilBrown wrote:
> > > On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> >
> > One other thing I'm not sure about: how do cold cache lookups of
> > filehandles for (possibly not-yet-mounted) subvolumes work?
>
> Ahhhh... that's a good point. Filehandle lookup depends on the target
> filesystem being mounted. NFS exporting filesystems which are auto-mounted
> on demand would be ... interesting.
>
> That argues in favour of nfsd treating a btrfs filesystem as a single filesystem
> and gaining some knowledge about different subvolumes within a filesystem.
>
> This has implications for NFS re-export. If a filehandle is received for an NFS
> filesystem that needs to be automounted, I expect it would fail.
>
> Or do we want to introduce a third level in the filehandle: filesystem, subvol,
> inode. So just the "filesystem" is used to look things up in /proc/mounts, but
> "filesystem+subvol" is used to determine the fsid.
>
> Maybe another way to state this is that the filesystem could identify a number of
> bytes from the fs-local part of the filehandle that should be mixed in to the fsid.
> That might be a reasonably clean interface.

Hmm, and interesting problem I hadn't considered for nfs-ganesha. Ganesha can handle a lookup into a filesystem (we treat subvols as filesystems) that was not mounted when we started (when we startup we scan mnttab and the btrfs subvol list and add any filesystems belonging to the configured exports) by re-scanning mnttab and the btrfs subvol list.

But what if Ganesha restarted, and then after that, a filesystem that a client had a handle for was not mounted at restart time, but is mounted by the time the client tries to use the handle... That would be easy for us to fix, if a handle specifies an unknown fsid, trigger a filesystem rescan.

> > > All we really need is:
> > > 1/ someone to write the code
> > > 2/ someone to review the code
> > > 3/ someone to accept the code
> >
> > Hah. Still, the special exceptions for btrfs seem to be accumulating.
> > I wonder if that's happening outside nfs as well.
>
> I have some colleagues who work on btrfs and based on my occasional
> discussions, I think that: yes, btrfs is a bit "special". There are a number of
> corner-cases where it doesn't quite behave how one would hope.
> This is probably inevitable given they way it is pushing the boundaries of
> functionality. It can be a challenge to determine if that "hope" is actually
> reasonable, and to figure out a good solution that meets the need cleanly
> without imposing performance burdens elsewhere.

What other special cases does btrfs have that cause nfs servers pain? I know their handle is big but the only special case code nfs-ganesha has at the moment is listing the subvols as part of the filesystem scan.

Frank


2021-06-24 00:02:46

by J. Bruce Fields

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Thu, Jun 24, 2021 at 09:29:01AM +1000, NeilBrown wrote:
> On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> > On Thu, Jun 24, 2021 at 08:04:57AM +1000, NeilBrown wrote:
> > > On Thu, 24 Jun 2021, J. Bruce Fields wrote:
> >
> > One other thing I'm not sure about: how do cold cache lookups of
> > filehandles for (possibly not-yet-mounted) subvolumes work?
>
> Ahhhh... that's a good point. Filehandle lookup depends on the target
> filesystem being mounted. NFS exporting filesystems which are
> auto-mounted on demand would be ... interesting.
>
> That argues in favour of nfsd treating a btrfs filesystem as a single
> filesystem and gaining some knowledge about different subvolumes within
> a filesystem.
>
> This has implications for NFS re-export. If a filehandle is received
> for an NFS filesystem that needs to be automounted, I expect it would
> fail.

Yeah, that's why this is on my mind. Currently we can only re-export
filesystems that are explicitly mounted and exported with an fsid=
option.

I had an idea that it would be interesting to run nfsd in a mode where
all it does is re-export everything exported by one single server. In
theory then you no longer need to do any encapsulation, so you avoid the
maximum-filehandle-length problem. When you get an unfamiliar
filehandle, you pass it on to the original server and get back an fsid,
and if it's one you haven't seen before you have to cook up a new
vfsmount and stuff somehow. I ran aground trying to understand how to
do that in a way that wasn't too complicated.

Anyway.

--b.

> Or do we want to introduce a third level in the filehandle: filesystem,
> subvol, inode. So just the "filesystem" is used to look things up in
> /proc/mounts, but "filesystem+subvol" is used to determine the fsid.
>
> Maybe another way to state this is that the filesystem could identify a
> number of bytes from the fs-local part of the filehandle that should be
> mixed in to the fsid. That might be a reasonably clean interface.
>
> >
> > > All we really need is:
> > > 1/ someone to write the code
> > > 2/ someone to review the code
> > > 3/ someone to accept the code
> >
> > Hah. Still, the special exceptions for btrfs seem to be accumulating.
> > I wonder if that's happening outside nfs as well.
>
> I have some colleagues who work on btrfs and based on my occasional
> discussions, I think that: yes, btrfs is a bit "special". There are a
> number of corner-cases where it doesn't quite behave how one would hope.
> This is probably inevitable given they way it is pushing the boundaries
> of functionality. It can be a challenge to determine if that "hope" is
> actually reasonable, and to figure out a good solution that meets the
> need cleanly without imposing performance burdens elsewhere.
>
> NeilBrown

2021-06-24 22:00:35

by Patrick Goetz

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?



On 6/23/21 5:04 PM, NeilBrown wrote:
> On Thu, 24 Jun 2021, J. Bruce Fields wrote:
>> Is there any hope of solving this problem within btrfs?
>>
>> It doesn't seem like it should have been that difficult for it to give
>> subvolumes separate superblocks and vfsmounts.
>>
>> But this has come up before, and I think the answer may have been that
>> it's just too late to fix.
>
> It is never too late to do the right thing!
>
> Probably the best approach to fixing this completely on the btrfs side
> would be to copy the auto-mount approach used in NFS. NFS sees multiple
> different volumes on the server and transparently creates new vfsmounts,
> using the automount infrastructure to mount and unmount them.

I'm very confused about what you're talking about. Is this documented
somewhere? I mean, I do use autofs, but see that as a separate software
system working with NFS.


> BTRFS
> effective sees multiple volumes on the block device and could do the
> same thing.
>
> I can only think of one change to the user-space API (other than
> /proc/mounts contents) that this would cause and I suspect it could be
> resolved if needed.
>
> Currently when you 'stat' the mountpoint of a btrfs subvol you see the
> root of that subvol. However when you 'stat' the mountpoint of an NFS
> sub-filesystem (before any access below there) you see the mountpoint
> (s_dev matches the parent). This is how automounts are expected to work
> and if btrfs were switched to use automounts for subvols, stating the
> mountpoint would initially show the mountpoint, not the subvol root.
>
> If this were seen to be a problem I doubt it would be hard to add
> optional functionality to automount so that 'stat' triggers the mount.
>
> All we really need is:
> 1/ someone to write the code
> 2/ someone to review the code
> 3/ someone to accept the code
>
> How hard can it be :-)
>
> NeilBrown
>

2021-06-24 23:28:16

by NeilBrown

[permalink] [raw]
Subject: Re: any idea about auto export multiple btrfs snapshots?

On Fri, 25 Jun 2021, Patrick Goetz wrote:
>
> On 6/23/21 5:04 PM, NeilBrown wrote:
> >
> > Probably the best approach to fixing this completely on the btrfs side
> > would be to copy the auto-mount approach used in NFS. NFS sees multiple
> > different volumes on the server and transparently creates new vfsmounts,
> > using the automount infrastructure to mount and unmount them.
>
> I'm very confused about what you're talking about. Is this documented
> somewhere? I mean, I do use autofs, but see that as a separate software
> system working with NFS.
>

autofs (together with the user-space automountd) is a special filesystem
that provides automount functionality to the sysadmin.
It makes use of some core automount functionality in the Linux VFS.
This functionality is referred to as "managed" dentries.
See "Revalidation and automounts" in https://lwn.net/Articles/649115/.

autofs makes use of this functionality to provide automounts. NFS makes
use of this same functionality to provide the same mount-point structure
on the client that it finds on the server.

I don't think there is any documentation specifically about NFS using
this infrastructure. It should be largely transparent to users.

Suppose that on the server "/export/foo" is a mount of some
filesystem, and you nfs4 mount "server:/export" to "/import" on the
client.
Then you will at first see only "/import" in /proc/mounts on client.
If you "ls -ld /import/foo" you will still only see /import.
But if you "ls -l /import/foo" so it lists the contents of that other
filesytem, then check /proc/mounts, you will now see "/import" and
"/import/foo".

After a while (between 500 and 1000 seconds I think) of not accessing
/import/foo, that entry will disappear from /proc/mounts.

I'm sure you will recognise this as very similar to autofs behaviour.
It uses the same core functionality. The timeout for inactive NFS
sub-filesystems to be unmounted can be controlled via
/proc/sys/fs/nfs/nfs_mountpoint_timeout and, since Linux 5.7, via the
nfs_mountpoint_expiry_timeout module parameter.
These aren't documented.

Note that I'm no longer sure that btrfs using automount like this would
actually make things easier for nfsd. But in some ways I think it would
be the "right" thing to do.

NeilBrown