by NeilBrown

Subject: Re: any idea about auto export multiple btrfs snapshots?

> > It seems more fixes are needed.
>
> I think the problem is that the submount doesn't appear in /proc/mounts.
> "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> filesystem to the mount point. To do this it walks through /proc/mounts
> checking the uuid of each filesystem. If a filesystem isn't listed
> there, it obviously fails.
>
> I guess you could add code to nfs-utils to do whatever "btrfs subvol
> list" does to make up for the fact that btrfs doesn't register in
> /proc/mounts.

Another approach might be to just change svcxdr_encode_fattr3() and
nfsd4_encode_fattr() in the 'FSIDSOURCE_UUID' case to check if
dentry->d_inode has a different btrfs volume id to
exp->ex_path.dentry->d_inode.
If it does, then mix the volume id into the fsid somehow.

With that, you wouldn't want the first change I suggested.
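
As a rough illustration of "mix the volume id into the fsid somehow", here is a minimal userspace sketch. The struct fsid_pair type and the mix_fsid() helper are hypothetical names invented for this example, and XOR-ing into the last eight bytes of the UUID is just one possible way to do the mixing.

```c
#include <stdint.h>
#include <string.h>

#define EX_UUID_LEN 16

/* Hypothetical stand-in for the two 32-bit words of a statfs f_fsid. */
struct fsid_pair {
	uint32_t val[2];
};

/*
 * Copy the export UUID into out[]. If the object's fsid differs from the
 * fsid of the export root, XOR the object's fsid into the last 8 bytes so
 * that the reported fsid changes at the volume boundary.
 */
static void mix_fsid(uint8_t out[EX_UUID_LEN], const uint8_t uuid[EX_UUID_LEN],
		     const struct fsid_pair *root, const struct fsid_pair *obj)
{
	uint32_t tail[2];

	memcpy(out, uuid, EX_UUID_LEN);
	if (obj->val[0] == root->val[0] && obj->val[1] == root->val[1])
		return;		/* same volume: report the UUID unchanged */

	memcpy(tail, out + 8, sizeof(tail));
	tail[0] ^= obj->val[0];
	tail[1] ^= obj->val[1];
	memcpy(out + 8, tail, sizeof(tail));
}
```

Objects in the same volume as the export root keep the plain UUID, so existing clients would see no change unless they cross into a different volume.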

NeilBrown

2021-06-21 08:35:22

Hi,

> On Mon, 21 Jun 2021, Wang Yugui wrote:
> > Hi,
> >
> > > > > It seems more fixes are needed.
> > > >
> > > > I think the problem is that the submount doesn't appear in /proc/mounts.
> > > > "nfsd_fh()" in nfs-utils needs to be able to map from the uuid for a
> > > > filesystem to the mount point. To do this it walks through /proc/mounts
> > > > checking the uuid of each filesystem. If a filesystem isn't listed
> > > > there, it obviously fails.
> > > >
> > > > I guess you could add code to nfs-utils to do whatever "btrfs subvol
> > > > list" does to make up for the fact that btrfs doesn't register in
> > > > /proc/mounts.
> > >
> > > Another approach might be to just change svcxdr_encode_fattr3() and
> > > > nfsd4_encode_fattr() in the 'FSIDSOURCE_UUID' case to check if
> > > dentry->d_inode has a different btrfs volume id to
> > > exp->ex_path.dentry->d_inode.
> > > If it does, then mix the volume id into the fsid somehow.
> > >
> > > With that, you wouldn't want the first change I suggested.
> >
> > This is what I have done; it is based on Linux 5.10.44.
> >
> > But it still does not work, so more work is needed.
> >
>
> The following is more what I had in mind. It doesn't quite work and I
> cannot work out why.
>
> If you 'stat' a file inside the subvol, then 'find' will not complete.
> If you don't, then it will.
>
> Doing that 'stat' changes the st_dev number of the main filesystem,
> which seems really weird.
> I'm probably missing something obvious. Maybe a more careful analysis
> of what is changing when will help.

We compared the trace output of crossmnt with that of a btrfs subvol,
and found that we need to add subvol support to follow_down().

A btrfs subvol should be treated as a virtual 'mount point' for nfsd in follow_down().

Best Regards
Wang Yugui ([email protected])
2021/06/22

> NeilBrown
>
>
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index 9421dae22737..790a3357525d 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -15,6 +15,7 @@
> #include <linux/slab.h>
> #include <linux/namei.h>
> #include <linux/module.h>
> +#include <linux/statfs.h>
> #include <linux/exportfs.h>
> #include <linux/sunrpc/svc_xprt.h>
>
> @@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
> int err;
> struct auth_domain *dom = NULL;
> struct svc_export exp = {}, *expp;
> + struct kstatfs statfs;
> int an_int;
>
> if (mesg[mlen-1] != '\n')
> @@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
> err = kern_path(buf, 0, &exp.ex_path);
> if (err)
> goto out1;
> + err = vfs_statfs(&exp.ex_path, &statfs);
> + if (err)
> + goto out3;
> + exp.ex_fsid64 = statfs.f_fsid;
>
> exp.ex_client = dom;
> exp.cd = cd;
> @@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
> new->ex_anon_uid = item->ex_anon_uid;
> new->ex_anon_gid = item->ex_anon_gid;
> new->ex_fsid = item->ex_fsid;
> + new->ex_fsid64 = item->ex_fsid64;
> new->ex_devid_map = item->ex_devid_map;
> item->ex_devid_map = NULL;
> new->ex_uuid = item->ex_uuid;
> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
> index ee0e3aba4a6e..d3eb9a599918 100644
> --- a/fs/nfsd/export.h
> +++ b/fs/nfsd/export.h
> @@ -68,6 +68,7 @@ struct svc_export {
> kuid_t ex_anon_uid;
> kgid_t ex_anon_gid;
> int ex_fsid;
> + __kernel_fsid_t ex_fsid64;
> unsigned char * ex_uuid; /* 16 byte fsid */
> struct nfsd4_fs_locations ex_fslocs;
> uint32_t ex_nflavors;
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 7abeccb975b2..8144e6037eae 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2869,6 +2869,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> if (err)
> goto out_nfserr;
> if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
> + FATTR4_WORD0_FSID |
> FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
> (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
> FATTR4_WORD1_SPACE_TOTAL))) {
> @@ -3024,6 +3025,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> case FSIDSOURCE_UUID:
> p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
> EX_UUID_LEN);
> + if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> + statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
> + /* looks like a btrfs subvol */
> + p[-2] ^= statfs.f_fsid.val[0];
> + p[-1] ^= statfs.f_fsid.val[1];
> + }
> break;
> }
> }

Attachments:

0001-nfsd-btrfs-subvol-support.txt (5.39 kB)
0002-trace-nfsd-btrfs-subvol-support.txt (5.76 kB)

2021-06-22 07:15:40

Hi,

This patch works very well. Thanks a lot.
- crossmnt of btrfs subvol works as expected.
- nfs/umount subvol works well.
- pseudo mount point inode(255) is good.

I tested it on 5.10.45 with a few minor rebase changes
(see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch;
only fs/nfsd/nfs3xdr.c needed rebasing).

But when I tested it on another btrfs filesystem, one without subvols but
with more data, 'find /nfs/test' caused an oops. This oops does not
happen without this patch.

The data in this filesystem was created and left behind by xfstests
(FSTYP=nfs, TEST_DEV).

#nfs4 option: default mount.nfs4, nfs-utils-2.3.3

# access btrfs directly
$ find /mnt/test | wc -l
6612

# access btrfs through nfs
$ find /nfs/test | wc -l

[ 466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 466.172123] #PF: supervisor read access in kernel mode
[ 466.177857] #PF: error_code(0x0000) - not-present page
[ 466.183601] PGD 0 P4D 0
[ 466.186443] Oops: 0000 [#1] SMP NOPTI
[ 466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
[ 466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
[ 466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
[ 466.212067] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9 f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83
[ 466.233061] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246
[ 466.238894] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00
[ 466.246872] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[ 466.254848] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001
[ 466.262825] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000
[ 466.270802] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018
[ 466.278779] FS: 0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000
[ 466.287823] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 466.294246] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0
[ 466.302222] Call Trace:
[ 466.304970] nfsd4_encode_fattr+0x15ac/0x1940 [nfsd]
[ 466.310557] ? btrfs_verify_level_key+0xad/0xf0 [btrfs]
[ 466.316413] ? btrfs_search_slot+0x8e3/0x900 [btrfs]
[ 466.321973] nfsd4_encode_dirent+0x160/0x3b0 [nfsd]
[ 466.327434] nfsd_readdir+0x199/0x240 [nfsd]
[ 466.332215] ? nfsd4_encode_getattr+0x30/0x30 [nfsd]
[ 466.337771] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
[ 466.343714] ? security_prepare_creds+0x6f/0xa0
[ 466.348788] nfsd4_encode_readdir+0xd9/0x1c0 [nfsd]
[ 466.354250] nfsd4_encode_operation+0x9b/0x1b0 [nfsd]
[ 466.360430] nfsd4_proc_compound+0x4e3/0x710 [nfsd]
[ 466.366352] nfsd_dispatch+0xd4/0x180 [nfsd]
[ 466.371620] svc_process_common+0x392/0x6c0 [sunrpc]
[ 466.377650] ? svc_recv+0x3c4/0x8a0 [sunrpc]
[ 466.382883] ? nfsd_svc+0x300/0x300 [nfsd]
[ 466.387908] ? nfsd_destroy+0x60/0x60 [nfsd]
[ 466.393126] svc_process+0xb7/0xf0 [sunrpc]
[ 466.398234] nfsd+0xe8/0x140 [nfsd]
[ 466.402555] kthread+0x116/0x130
[ 466.406579] ? kthread_park+0x80/0x80
[ 466.411091] ret_from_fork+0x1f/0x30
[ 466.415499] Modules linked in: acpi_ipmi rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rfkill intel_rapl_msr intel_rapl_common iTCO_wdt intel_pmc_bxt iTCO_vendor_support dcdbas ipmi_ssif sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_si rapl intel_cstate mei_me ipmi_devintf intel_uncore joydev mei ipmi_msghandler lpc_ich acpi_power_meter nvme_rdma nvme_fabrics rdma_cm iw_cm ib_cm rdmavt nfsd rdma_rxe ib_uverbs ip6_udp_tunnel udp_tunnel ib_core auth_rpcgss nfs_acl lockd grace nfs_ssc ip_tables xfs mgag200 drm_kms_helper crct10dif_pclmul crc32_pclmul btrfs cec crc32c_intel xor bnx2x raid6_pq drm igb mpt3sas ghash_clmulni_intel pcspkr nvme megaraid_sas mdio nvme_core dca raid_class i2c_algo_bit scsi_transport_sas wmi dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sunrpc i2c_dev
[ 466.499551] CR2: 0000000000000004
[ 466.503759] ---[ end trace 91eb52bf0cb65801 ]---
[ 466.511948] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
[ 466.517714] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9 f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83
[ 466.539753] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246
[ 466.546122] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00
[ 466.554625] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[ 466.563096] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001
[ 466.571572] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000
[ 466.580024] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018
[ 466.588487] FS: 0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000
[ 466.598032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 466.604973] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0
[ 466.613467] Kernel panic - not syncing: Fatal exception
[ 466.807651] Kernel Offset: 0x12000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 466.823190] ---[ end Kernel panic - not syncing: Fatal exception ]---

Best Regards
Wang Yugui ([email protected])
2021/06/23

> On Tue, 22 Jun 2021, Wang Yugui wrote:
> > >
> > > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().
> >
> > The btrfs subvol crossmnt begins to work, although it is buggy.
> >
> > Some subvols are crossmnt-ed, some are not yet, and some dirs are
> > wrongly crossmnt-ed.
> >
> > 'stat /nfs/test /nfs/test/sub1' will cause the btrfs subvol crossmnt
> > to begin to happen.
> >
> > This is the current patch, based on 5.10.44.
> > At least nfsd_follow_up() is buggy.
> >
>
> I don't think the approach you are taking makes sense. Let me explain
> why.
>
> The problem is that applications on the NFS client can see different
> files or directories on the same (apparent) filesystem with the same
> inode number. Most applications won't care and NFS itself doesn't get
> confused by the duplicate inode numbers, but 'find' and similar programs
> (probably 'tar' for example) do get upset.
>
> This happens because BTRFS reuses inode numbers in subvols which it
> presents to the kernel as all part of the one filesystem (or at least,
> all part of the one mount point). NFSD only sees one filesystem, and so
> reports the same filesystem-id (fsid) for all objects. The NFS client
> then sees that the fsid is the same and tells applications that the
> objects are all in the one filesystem.
>
> To fix this, we need to make sure that nfsd reports a different fsid for
> objects in different subvols. There are two obvious ways to do this.
>
> One is to teach nfsd to recognize btrfs subvolumes exactly like separate
> filesystems (as nfsd already ensures each filesystem gets its own fsid).
> This is the approach of my first suggestion. It requires changing
> nfsd_mountpoint() and follow_up() and any other code that is aware of
> different filesystems. As I mentioned, it also requires changing mountd
> to be able to extract a list of subvols from btrfs because they don't
> appear in /proc/mounts.
>
> As you might know an NFS filehandle has 3 parts: a header, a filesystem
> identifier, and an inode identifier. This approach would involve giving
> different subvols different filesystem identifiers in the filehandle.
> This, it turns out, is a very big change - bigger than I at first
> imagined.
>
> The second obvious approach is to leave the filehandles unchanged and to
> continue to treat an entire btrfs filesystem as a single filesystem
> EXCEPT when reporting the fsid to the NFS client. All we *really* need
> to do is make sure the client sees a different fsid when it enters a
> part of the filesystem which re-uses inode numbers. This is what my
> latest patch did.
>
> Your patch seems to combine ideas from both approaches. It includes my
> code to replace the fsid, but also intercepts follow_up etc. This
> cannot be useful.
>
> As I noted when I posted it, there is a problem with my patch. I now
> understand that problem.
>
> When NFS sees that fsid change it needs to create 2 inodes for that
> directory. One inode will be in the parent filesystem and will be
> marked as an auto-mount point so that any lookup below that directory
> will trigger an internal mount. The other inode is the root of the
> child filesystem. It gets mounted on the first inode.
>
> With normal filesystem mounts, there really is an inode in the parent
> filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID
> attribute. This fileid will be different from all other inode numbers
> in the parent filesystem.
>
> With BTRFS there is no inode in the parent volume (as far as I know) so
> there is nothing useful to return for MOUNTED_ON_FILEID. This results
> in NFS using the same inode number for the inode in the parent
> filesystem as the inode in the child filesystem. For btrfs, this will
> be 256. As there is already an inode in the parent filesystem with inum
> 256, 'find' complains.
>
> The following patch addresses this by adding code to nfsd when it
> determines MOUNTED_ON_FILEID to choose a number that should be unused
> in btrfs. With this change, 'find' seems to work correctly with NFSv4
> mounts of btrfs.
>
> This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID
> attribute - strictly speaking, the NFSv3 protocol doesn't support
> crossing mount points, though the Linux implementation does allow it.
>
> So this patch works and, I think, is the best we can do in terms of
> functionality. I don't like the details of the implementation though.
> It requires NFSD to know too much about BTRFS internals.
>
> I think I would like btrfs to make it clear where a subvol started,
> maybe by setting DCACHE_MOUNTED on the dentry. This flag is only a
> hint, not a promise of anything, so other code shouldn't get confused.
> This would avoid having nfsd call vfs_statfs quite so often .... maybe that
> isn't such a big deal.
>
> More importantly, there needs to be some way for NFSD to find an inode
> number to report for the MOUNTED_ON_FILEID. This needs to be a number
> not used elsewhere in the filesystem. It might be safe to use the
> same fileid for all subvols (as my patch currently does), but we would
> need to confirm that 'find' and 'tar' don't complain about that or
> mishandle it. If it is safe to use the same fileid, then a new field in
> the superblock to store it might work. If a different fileid is needed,
> then we might need a new field in 'struct kstatfs', so vfs_statfs can
> report it.
>
> Anyway, here is my current patch. It includes support for NFSv3 as well
> as NFSv4.
>
> NeilBrown
>
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index 9421dae22737..790a3357525d 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -15,6 +15,7 @@
> #include <linux/slab.h>
> #include <linux/namei.h>
> #include <linux/module.h>
> +#include <linux/statfs.h>
> #include <linux/exportfs.h>
> #include <linux/sunrpc/svc_xprt.h>
>
> @@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
> int err;
> struct auth_domain *dom = NULL;
> struct svc_export exp = {}, *expp;
> + struct kstatfs statfs;
> int an_int;
>
> if (mesg[mlen-1] != '\n')
> @@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
> err = kern_path(buf, 0, &exp.ex_path);
> if (err)
> goto out1;
> + err = vfs_statfs(&exp.ex_path, &statfs);
> + if (err)
> + goto out3;
> + exp.ex_fsid64 = statfs.f_fsid;
>
> exp.ex_client = dom;
> exp.cd = cd;
> @@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
> new->ex_anon_uid = item->ex_anon_uid;
> new->ex_anon_gid = item->ex_anon_gid;
> new->ex_fsid = item->ex_fsid;
> + new->ex_fsid64 = item->ex_fsid64;
> new->ex_devid_map = item->ex_devid_map;
> item->ex_devid_map = NULL;
> new->ex_uuid = item->ex_uuid;
> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
> index ee0e3aba4a6e..d3eb9a599918 100644
> --- a/fs/nfsd/export.h
> +++ b/fs/nfsd/export.h
> @@ -68,6 +68,7 @@ struct svc_export {
> kuid_t ex_anon_uid;
> kgid_t ex_anon_gid;
> int ex_fsid;
> + __kernel_fsid_t ex_fsid64;
> unsigned char * ex_uuid; /* 16 byte fsid */
> struct nfsd4_fs_locations ex_fslocs;
> uint32_t ex_nflavors;
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 0a5ebc52e6a9..f11ba3434fd6 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -367,10 +367,18 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
> case FSIDSOURCE_FSID:
> fsid = (u64)fhp->fh_export->ex_fsid;
> break;
> - case FSIDSOURCE_UUID:
> + case FSIDSOURCE_UUID: {
> + struct kstatfs statfs;
> +
> fsid = ((u64 *)fhp->fh_export->ex_uuid)[0];
> fsid ^= ((u64 *)fhp->fh_export->ex_uuid)[1];
> + if (fh_getstafs(fhp, &statfs) == 0 &&
> + (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] ||
> + statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1]))
> + /* looks like a btrfs subvol */
> + fsid = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1];
> break;
> + }
> default:
> fsid = (u64)huge_encode_dev(fhp->fh_dentry->d_sb->s_dev);
> }
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 7abeccb975b2..5f614d1b362e 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -42,6 +42,7 @@
> #include <linux/sunrpc/svcauth_gss.h>
> #include <linux/sunrpc/addr.h>
> #include <linux/xattr.h>
> +#include <linux/btrfs_tree.h>
> #include <uapi/linux/xattr.h>
>
> #include "idmap.h"
> @@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> if (err)
> goto out_nfserr;
> if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
> + FATTR4_WORD0_FSID |
> FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
> (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
> + FATTR4_WORD1_MOUNTED_ON_FILEID |
> FATTR4_WORD1_SPACE_TOTAL))) {
> err = vfs_statfs(&path, &statfs);
> if (err)
> @@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> case FSIDSOURCE_UUID:
> p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
> EX_UUID_LEN);
> + if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> + statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
> + /* looks like a btrfs subvol */
> + p[-2] ^= statfs.f_fsid.val[0];
> + p[-1] ^= statfs.f_fsid.val[1];
> + }
> break;
> }
> }
> @@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> goto out_nfserr;
> ino = parent_stat.ino;
> }
> + if (fsid_source(fhp) == FSIDSOURCE_UUID &&
> + (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> + statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]))
> + /* btrfs subvol pseudo mount point */
> + ino = BTRFS_FIRST_FREE_OBJECTID-1;
> +
> p = xdr_encode_hyper(p, ino);
> }
> #ifdef CONFIG_NFSD_PNFS
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index b21b76e6b9a8..82b76b0b7bec 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
> AT_STATX_SYNC_AS_STAT));
> }
>
> +static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs)
> +{
> + struct path p = {.mnt = fh->fh_export->ex_path.mnt,
> + .dentry = fh->fh_dentry};
> + return nfserrno(vfs_statfs(&p, statfs));
> +}
> +
> static inline int nfsd_create_is_exclusive(int createmode)
> {
> return createmode == NFS3_CREATE_EXCLUSIVE

Attachments:

0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch (10.20 kB)

2021-06-23 06:30:32

by NeilBrown

Subject: Re: any idea about auto export multiple btrfs snapshots?

On Wed, 23 Jun 2021, Wang Yugui wrote:
> Hi,
>
> This patch works very well. Thanks a lot.
> - crossmnt of btrfs subvol works as expected.
> - nfs/umount subvol works well.
> - pseudo mount point inode(255) is good.
>
> I tested it on 5.10.45 with a few minor rebase changes
> (see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch;
> only fs/nfsd/nfs3xdr.c needed rebasing).
>
> But when I tested it on another btrfs filesystem, one without subvols but
> with more data, 'find /nfs/test' caused an oops. This oops does not
> happen without this patch.
>
> The data in this filesystem was created and left behind by xfstests
> (FSTYP=nfs, TEST_DEV).
>
> #nfs4 option: default mount.nfs4, nfs-utils-2.3.3
>
> # access btrfs directly
> $ find /mnt/test | wc -l
> 6612
>
> # access btrfs through nfs
> $ find /nfs/test | wc -l
>
> [ 466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
> [ 466.172123] #PF: supervisor read access in kernel mode
> [ 466.177857] #PF: error_code(0x0000) - not-present page
> [ 466.183601] PGD 0 P4D 0
> [ 466.186443] Oops: 0000 [#1] SMP NOPTI
> [ 466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
> [ 466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
> [ 466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]

in nfsd4_encode_fattr there is code:

	if ((bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) && !fhp) {
		tempfh = kmalloc(sizeof(struct svc_fh), GFP_KERNEL);
		status = nfserr_jukebox;
		if (!tempfh)
			goto out;
		fh_init(tempfh, NFS4_FHSIZE);
		status = fh_compose(tempfh, exp, dentry, NULL);
		if (status)
			goto out;
		fhp = tempfh;
	}

Change that to test for (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID) as
well.
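
The quoted oops shows fsid_source() faulting at address 4 with RDI zero, which is consistent with fhp being NULL when nfsd4_encode_fattr() is reached from readdir. A minimal userspace sketch of the widened test follows. need_tempfh() is a hypothetical helper, not a kernel function, and the bit values are written out on the assumption that they follow the NFSv4 attribute numbers (fsid 8, filehandle 19, mounted_on_fileid 55, i.e. bit 23 of the second bitmap word).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions, derived from the NFSv4 attribute numbers. */
#define FATTR4_WORD0_FSID		(1UL << 8)
#define FATTR4_WORD0_FILEHANDLE		(1UL << 19)
#define FATTR4_WORD1_MOUNTED_ON_FILEID	(1UL << 23)

/*
 * Hypothetical helper: returns true when the encoder has no filehandle
 * yet but one of the requested attributes needs one.
 */
static bool need_tempfh(uint32_t bmval0, uint32_t bmval1, const void *fhp)
{
	if (fhp != NULL)
		return false;	/* caller already supplied a filehandle */

	return (bmval0 & (FATTR4_WORD0_FILEHANDLE | FATTR4_WORD0_FSID)) ||
	       (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID);
}
```

With the extra bmval1 test, a request for MOUNTED_ON_FILEID alone would also take the tempfh path, so the later fsid_source(fhp) call would no longer see a NULL pointer.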

NeilBrown

2021-06-23 09:34:26

On Fri, 25 Jun 2021, Patrick Goetz wrote:
>
> On 6/23/21 5:04 PM, NeilBrown wrote:
> >
> > Probably the best approach to fixing this completely on the btrfs side
> > would be to copy the auto-mount approach used in NFS. NFS sees multiple
> > different volumes on the server and transparently creates new vfsmounts,
> > using the automount infrastructure to mount and unmount them.
>
> I'm very confused about what you're talking about. Is this documented
> somewhere? I mean, I do use autofs, but see that as a separate software
> system working with NFS.
>

autofs (together with the user-space automountd) is a special filesystem
that provides automount functionality to the sysadmin.
It makes use of some core automount functionality in the Linux VFS.
This functionality is referred to as "managed" dentries.
See "Revalidation and automounts" in https://lwn.net/Articles/649115/.

autofs makes use of this functionality to provide automounts. NFS makes
use of this same functionality to provide the same mount-point structure
on the client that it finds on the server.

I don't think there is any documentation specifically about NFS using
this infrastructure. It should be largely transparent to users.

Suppose that on the server "/export/foo" is a mount of some
filesystem, and you nfs4 mount "server:/export" to "/import" on the
client.
Then you will at first see only "/import" in /proc/mounts on the client.
If you "ls -ld /import/foo" you will still only see /import.
But if you "ls -l /import/foo" so it lists the contents of that other
filesystem, then check /proc/mounts, you will now see "/import" and
"/import/foo".

After a while (between 500 and 1000 seconds I think) of not accessing
/import/foo, that entry will disappear from /proc/mounts.
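
To watch this from a program rather than by reading /proc/mounts by hand, a small sketch using the glibc getmntent() interface could look like the following. is_mount_point() is a hypothetical helper written for this example; /import/foo is the path from the scenario above.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if dir is listed as a mount point in mounts_file,
 * 0 if it is not, and -1 if the file cannot be opened.
 */
static int is_mount_point(const char *mounts_file, const char *dir)
{
	struct mntent *ent;
	int found = 0;
	FILE *fp = setmntent(mounts_file, "r");

	if (fp == NULL)
		return -1;

	/* Walk every entry and compare its mount directory with dir. */
	while ((ent = getmntent(fp)) != NULL) {
		if (strcmp(ent->mnt_dir, dir) == 0) {
			found = 1;
			break;
		}
	}
	endmntent(fp);
	return found;
}
```

Under the scenario described, is_mount_point("/proc/mounts", "/import/foo") would be expected to return 0 before the "ls -l", 1 afterwards, and 0 again once the idle timeout has expired.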

I'm sure you will recognise this as very similar to autofs behaviour.
It uses the same core functionality. The timeout for inactive NFS
sub-filesystems to be unmounted can be controlled via
/proc/sys/fs/nfs/nfs_mountpoint_timeout and, since Linux 5.7, via the
nfs_mountpoint_expiry_timeout module parameter.
These aren't documented.

Note that I'm no longer sure that btrfs using automount like this would
actually make things easier for nfsd. But in some ways I think it would
be the "right" thing to do.

NeilBrown