From 7f674853edde79f37589586ef219b8650e409677 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Wed, 23 Jun 2021 10:59:00 +1000
Subject: [PATCH] any idea about auto export multiple btrfs snapshots?

On Tue, 22 Jun 2021, Wang Yugui wrote:
> >
> > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().
>
> btrfs subvol crossmnt begin to work, although buggy.
>
> some subvol is crossmnt-ed, some subvol is yet not, and some dir is
> wrongly crossmnt-ed
>
> 'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin
> to happen.
>
> This is the current patch based on 5.10.44.
> At least nfsd_follow_up() is buggy.
>

I don't think the approach you are taking makes sense.  Let me explain
why.

The problem is that applications on the NFS client can see different
files or directories on the same (apparent) filesystem with the same
inode number.  Most application won't care and NFS itself doesn't get
confused by the duplicate inode numbers, but 'find' and similar programs
(probably 'tar' for example) do get upset.

This happens because BTRFS reuses inode numbers in subvols which it
presents to the kernel as all part of the one filesystem (or at least,
all part of the one mount point).  NFSD only sees one filesystem, and so
reports the same filesystem-id (fsid) for all objects.  The NFS client
then sees that the fsid is the same and tells applications that the
objects are all in the one filesystem.

To fix this, we need to make sure that nfsd reports a different fsid for
objects in different subvols.  There are two obvious ways to do this.

One is to teach nfsd to recognize btrfs subvolumes exactly like separate
filesystems (as nfsd already ensure each filesystem gets its own fsid).
This is the approach of my first suggestion.  It requires changing
nfsd_mountpoint() and follow_up() and any other code that is aware of
different filesytems.  As I mentioned, it also requires changing mountd
to be able to extract a list of subvols from btrfs because they don't
appear in /proc/mounts.

As you might know an NFS filehandle has 3 parts: a header, a filesystem
identifier, and an inode identifier.  This approach would involve giving
different subvols different filesystem identifiers in the filehandle.
This, it turns out is a very big change - bigger than I at first
imagined.

The second obvious approach is to leave the filehandles unchanged and to
continue to treat an entire btrfs filesystem as a single filesystem
EXCEPT when reporting the fsid to the NFS client.  All we *really* need
to do is make sure the client sees a different fsid when it enters a
part of the filesystem which re-uses inode numbers.  This is what my
latest patch did.

Your patch seems to combine ideas from both approaches.  It includes my
code to replace the fsid, but also intercepts follow_up etc.  This
cannot be useful.

As I noted when I posted it, there is a problem with my patch.  I now
understand that problem.

When NFS sees that fsid change it needs to create 2 inodes for that
directory.  One inode will be in the parent filesystem and will be
marked as an auto-mount point so that any lookup below that directory
will trigger an internal mount.  The other inode is the root of the
child filesystem.  It gets mounted on the first inode.

With normal filesystem mounts, there really is an inode in the parent
filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID
attribute.  This fileid will be different from all other inode numbers
in the parent filesystem.

With BTRFS there is no inode in the parent volume (as far as I know) so
there is nothing useful to return for MOUNTED_ON_FILEID.  This results
in NFS using the same inode number for the inode in the parent
filesystem as the inode in the child filesystem.  For btrfs, this will
be 256.  As there is already an inode in the parent filesystem with inum
256, 'find' complains.

The following patch addresses this by adding code to nfsd when it
determines MOUINTD_ON_FILEID to choose an number that should be unused
in btrfs.  With this change, 'find' seems to work correctly with NFSv4
mounts of btrfs.

This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID
attribute - strictly speaking, the NFSv3 protocol doesn't support
crossing mount points, though the Linux implementation does allow it.

So this patch works and, I think, is the best we can do in terms of
functionality.  I don't like the details of the implementation though.
It requires NFSD to know too much about BTRFS internals.

I think I would like btrfs to make it clear where a subvol started,
maybe by setting DCACHE_MOUNTED on the dentry.  This flag is only a
hint, not a promise of anything, so other code should get confused.
This would have nfsd calling vfs_statfs quite so often ....  maybe that
isn't such a big deal.

More importantly, there needs to be some way for NFSD to find an inode
number to report for the MOUNTED_ON_FILEID.  This needs to be a number
not used elsewhere in the filesystem.  It might be safe to use the
same fileid for all subvols (as my patch currently does), but we would
need to confirm that 'find' and 'tar' don't complain about that or
mishandle it.  If it is safe to use the same fileid, then a new field in
the superblock to store it might work.  If a different fileid is needed,
the we might need a new field in 'struct kstatfs', so vfs_statfs can
report it.

Anyway, here is my current patch.  It includes support for NFSv3 as well
as NFSv4.

NeilBrown
---
 fs/nfsd/export.c  |  7 +++++++
 fs/nfsd/export.h  |  1 +
 fs/nfsd/nfs3xdr.c | 10 +++++++++-
 fs/nfsd/nfs4xdr.c | 15 +++++++++++++++
 fs/nfsd/vfs.h     |  7 +++++++
 5 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 9421dae22737..790a3357525d 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/namei.h>
 #include <linux/module.h>
+#include <linux/statfs.h>
 #include <linux/exportfs.h>
 #include <linux/sunrpc/svc_xprt.h>
 
@@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 	int err;
 	struct auth_domain *dom = NULL;
 	struct svc_export exp = {}, *expp;
+	struct kstatfs statfs;
 	int an_int;
 
 	if (mesg[mlen-1] != '\n')
@@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 	err = kern_path(buf, 0, &exp.ex_path);
 	if (err)
 		goto out1;
+	err = vfs_statfs(&exp.ex_path, &statfs);
+	if (err)
+		goto out3;
+	exp.ex_fsid64 = statfs.f_fsid;
 
 	exp.ex_client = dom;
 	exp.cd = cd;
@@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
 	new->ex_anon_uid = item->ex_anon_uid;
 	new->ex_anon_gid = item->ex_anon_gid;
 	new->ex_fsid = item->ex_fsid;
+	new->ex_fsid64 = item->ex_fsid64;
 	new->ex_devid_map = item->ex_devid_map;
 	item->ex_devid_map = NULL;
 	new->ex_uuid = item->ex_uuid;
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index ee0e3aba4a6e..d3eb9a599918 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -68,6 +68,7 @@ struct svc_export {
 	kuid_t			ex_anon_uid;
 	kgid_t			ex_anon_gid;
 	int			ex_fsid;
+	__kernel_fsid_t		ex_fsid64;
 	unsigned char *		ex_uuid; /* 16 byte fsid */
 	struct nfsd4_fs_locations ex_fslocs;
 	uint32_t		ex_nflavors;
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 0a5ebc52e6a9..f11ba3434fd6 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -153,11 +153,19 @@ static __be32 *encode_fsid(__be32 *p, struct svc_fh *fhp)
 	case FSIDSOURCE_FSID:
 		p = xdr_encode_hyper(p, (u64) fhp->fh_export->ex_fsid);
 		break;
-	case FSIDSOURCE_UUID:
+	case FSIDSOURCE_UUID: {
+		struct kstatfs statfs;
+
 		f = ((u64*)fhp->fh_export->ex_uuid)[0];
 		f ^= ((u64*)fhp->fh_export->ex_uuid)[1];
+		if (fh_getstafs(fhp, &statfs) == 0 &&
+		    (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] ||
+		     statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1]))
+			/* looks like a btrfs subvol */
+			f = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1];
 		p = xdr_encode_hyper(p, f);
 		break;
+		}
 	}
 	return p;
 }
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7abeccb975b2..5f614d1b362e 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -42,6 +42,7 @@
 #include <linux/sunrpc/svcauth_gss.h>
 #include <linux/sunrpc/addr.h>
 #include <linux/xattr.h>
+#include <linux/btrfs_tree.h>
 #include <uapi/linux/xattr.h>
 
 #include "idmap.h"
@@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 	if (err)
 		goto out_nfserr;
 	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
+		       FATTR4_WORD0_FSID |
 			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
 	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
+		       FATTR4_WORD1_MOUNTED_ON_FILEID |
 		       FATTR4_WORD1_SPACE_TOTAL))) {
 		err = vfs_statfs(&path, &statfs);
 		if (err)
@@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 		case FSIDSOURCE_UUID:
 			p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
 								EX_UUID_LEN);
+			if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+			    statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
+				/* looks like a btrfs subvol */
+				p[-2] ^= statfs.f_fsid.val[0];
+				p[-1] ^= statfs.f_fsid.val[1];
+			}
 			break;
 		}
 	}
@@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 				goto out_nfserr;
 			ino = parent_stat.ino;
 		}
+		if (fsid_source(fhp) == FSIDSOURCE_UUID &&
+		    (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+		     statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]))
+			    /* btrfs subvol pseudo mount point */
+			    ino = BTRFS_FIRST_FREE_OBJECTID-1;
+
 		p = xdr_encode_hyper(p, ino);
 	}
 #ifdef CONFIG_NFSD_PNFS
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index b21b76e6b9a8..82b76b0b7bec 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
 				    AT_STATX_SYNC_AS_STAT));
 }
 
+static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs)
+{
+	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
+			 .dentry = fh->fh_dentry};
+	return nfserrno(vfs_statfs(&p, statfs));
+}
+
 static inline int nfsd_create_is_exclusive(int createmode)
 {
 	return createmode == NFS3_CREATE_EXCLUSIVE
-- 
2.30.2