On 7/15/21 12:37 AM, NeilBrown wrote:
>
> Hi all,
> the problem this patch addresses has been discussed on both the NFS list
> and the BTRFS list, so I'm sending this to both. I'd be very happy for
> people experiencing the problem (NFS export of BTRFS subvols) who are
> in a position to rebuild the kernel on their NFS server to test this
> and report success (or otherwise).
>
> While I've tried to write this patch so that it *could* land upstream
> (and could definitely land in a distro franken-kernel if needed), I'm
> not completely sure it *should* land upstream. It includes some deep
> knowledge of BTRFS into NFSD code. This could be removed later once
> proper APIs are designed and provided. I can see arguments either way
> and wonder what others think.
>
> BTRFS developers: please examine the various claims I have made about
> BTRFS and correct any that are wrong. The observation that
> getdents can report the same inode number of unrelated files
> (a file and a subvol in my case) is ... interesting.
>
> NFSD developers: please comment on anything else.
>
> Others: as I said: testing would be great! :-)
>
> Subject: [PATCH] NFSD: handle BTRFS subvolumes better.
>
> A single BTRFS mount can present as multiple "volumes". i.e. multiple
> sets of objects with potentially overlapping inode number spaces.
> The st_dev presented to user-space via the stat(2) family of calls is
> different for each internal volume, as is the f_fsid reported by
> statfs().
>
> However nfsd doesn't look at st_dev or the fsid (other than for the
> export point - typically the mount point), so it doesn't notice the
> different filesystems. Importantly, it doesn't report a different fsid
> to the NFS client.
>
> This leads to the NFS client reusing inode numbers, and applications
> like "find" and "du" complaining, particularly when they find a
> directory with the same st_ino and st_dev as an ancestor. This
> typically happens with the root of a sub-volume as the root of every
> volume in BTRFS has the same inode number (256).
>
> To fix this, we need to report a different fsid for each subvolume, but
> need to use the same fsid that we currently use for the top-level
> volume. Changing this (by rebooting a server to new code) might
> confuse the client. I don't think it would be a major problem (stale
> filehandles shouldn't happen), but it is best avoided.
>
> Determining the fsid to use is a bit awkward....
>
> There is limited space in the protocol (32 bits for NFSv3, 64 for NFSv4)
> so we cannot append the subvolume fsid. The best option seems to be to
> hash it in. This patch uses a simple 'xor', but possibly a Jenkins hash
> would be better.
>
> For BTRFS (and other) filesystems the current fsid is a hash (xor) of
> the uuid provided from userspace by mountd. This is derived from the
> statfs fsid. If we use the statfs fsid for subvolumes and xor this in,
> we risk erasing useful unique information. So I have chosen not to use
> the statfs fsid.
>
> Ideally we should have an API for the filesystem to report if it uses
> multiple subvolumes, and to provide a unique identifier for each. For
> now, this patch calls exportfs_encode_fh(). If the returned fsid type
> is NOT one of those used by BTRFS, then we assume the st_fsid cannot
> change, and use the current behaviour.
>
> If the type IS one that BTRFS uses, we use intimate knowledge of BTRFS
> to extract the root_object_id from the filehandle and record that with
> the export information. Then when exporting an fsid, we check if
> subvolumes are enabled and if the current dentry has a different
> root_object_id to the exported volume. If it does, the root_object_id
> is hashed (xor) into the reported fsid.
>
> When an NFSv4 client sees that the fsid has changed, it will ask for the
> MOUNTED_ON_FILEID. With the Linux NFS client, this is visible to
> userspace as an automount point, until content within the directory is
> accessed and the automount is triggered. Currently the MOUNTED_ON_FILEID
> for these subvolume roots is the same as that of the root: 256. This will
> cause find et al. to complain until the automount actually gets mounted.
>
> So this patch reports the MOUNTED_ON_FILEID in such cases to be a magic
> number that appears to be appropriate for BTRFS:
> BTRFS_FIRST_FREE_OBJECTID - 1
>
> Again, we really want an API to get this from the filesystem. Changing
> it later has no cost, so we don't need any commitment from the btrfs team
> that this is what they will provide if/when we do get such an API.
>
> This same problem (of an automount point with a duplicate inode number)
> also exists for NFSv3. This problem cannot be resolved completely on
> the server as NFSv3 doesn't have a well defined "MOUNTED_ON_FILEID"
> concept, but we can come close. The inode number returned by READDIR is
> likely to be the mounted-on-fileid. With READDIR_PLUS, two fileids are
> returned, the one from the readdir, and (optionally) another from
> 'stat'. Linux-NFS checks these match and if not, it treats the first as
> a mounted-on-fileid.
>
> Interestingly BTRFS getdents() *DOES* report a different inode number
> for subvol roots than is returned by stat(). These aren't actually
> unique (!!!!) but in at least one case, they are different from
> ancestors, so this is sufficient.
>
> NFSD currently SUPPRESSES the stat information if the inode number is
> different. This is because there is room for a file to be renamed between
> the readdir call and the lookup_one_len() prior to getattr, and the
> results could be confusing. However for the case of a BTRFS filesystem
> with an inode number of 256, the value of reporting the difference seems
> to exceed the cost of any confusion caused by a race (if that is even
> possible in this case).
> So this patch allows the two fileids to be different when 256 is found
> on BTRFS.
>
> With this patch a 'du' or 'find' in an NFS-mounted btrfs filesystem
> which has snapshot subvols works correctly for both NFSv4 and NFSv3.
> Fortunately the problematic programs tend to trigger READDIR_PLUS and so
> benefit from the detection of the MOUNTED_ON_FILEID which it provides.
>
> Signed-off-by: NeilBrown <[email protected]>
I'm going to restate what I think the problem is you're having just so I'm sure
we're on the same page.
1. We export a btrfs volume via nfsd that has multiple subvolumes.
2. We run find, and when we stat a file, nfsd doesn't send along our bogus
st_dev, it sends its own thing (I assume?). This confuses du/find because you
get the same inode number with different parents.
Is this correct? If that's the case then it'd be relatively straightforward to
add another callback into export_operations to grab this fsid right? Hell we
could simply return the objectid of the root since that's unique across the
entire file system. We already do our magic FH encoding to make sure we keep
all this straight for NFS, another callback to give that info isn't going to
kill us. Thanks,
Josef
On Thu, Jul 15, 2021 at 10:09:37AM -0400, Josef Bacik wrote:
> I'm going to restate what I think the problem is you're having just so I'm
> sure we're on the same page.
>
> 1. We export a btrfs volume via nfsd that has multiple subvolumes.
> 2. We run find, and when we stat a file, nfsd doesn't send along our bogus
> st_dev, it sends its own thing (I assume?). This confuses du/find because
> you get the same inode number with different parents.
>
> Is this correct? If that's the case then it'd be relatively straightforward
> to add another callback into export_operations to grab this fsid right?
> Hell we could simply return the objectid of the root since that's unique
> across the entire file system. We already do our magic FH encoding to make
> sure we keep all this straight for NFS, another callback to give that info
> isn't going to kill us. Thanks,
Hell no. btrfs is broken plain and simple, and we've been arguing about
this for years without progress. btrfs needs to stop claiming different
st_dev inside the same mount, otherwise hell is going to break loose left
right and center, and this is just one of the many cases where it does.
On 7/15/21 12:45 PM, Christoph Hellwig wrote:
> On Thu, Jul 15, 2021 at 10:09:37AM -0400, Josef Bacik wrote:
>> I'm going to restate what I think the problem is you're having just so I'm
>> sure we're on the same page.
>>
>> 1. We export a btrfs volume via nfsd that has multiple subvolumes.
>> 2. We run find, and when we stat a file, nfsd doesn't send along our bogus
>> st_dev, it sends its own thing (I assume?). This confuses du/find because
>> you get the same inode number with different parents.
>>
>> Is this correct? If that's the case then it'd be relatively straightforward
>> to add another callback into export_operations to grab this fsid right?
>> Hell we could simply return the objectid of the root since that's unique
>> across the entire file system. We already do our magic FH encoding to make
>> sure we keep all this straight for NFS, another callback to give that info
>> isn't going to kill us. Thanks,
>
> Hell no. btrfs is broken plain and simple, and we've been arguing about
> this for years without progress. btrfs needs to stop claiming different
> st_dev inside the same mount, otherwise hell is going to break loose left
> right and center, and this is just one of the many cases where it does.
>
Because there's no alternative. We need a way to tell userspace they've
wandered into a different inode namespace. There's no argument that what we're
doing is ugly, but there's never been a clear "do X instead". Just a lot of
whinging that btrfs is broken. This makes userspace happy and is simple and
straightforward. I'm open to alternatives, but there have been 0 workable
alternatives proposed in the last decade of complaining about it. Thanks,
Josef
On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> Because there's no alternative. We need a way to tell userspace they've
> wandered into a different inode namespace. There's no argument that what
> we're doing is ugly, but there's never been a clear "do X instead". Just a
> lot of whinging that btrfs is broken. This makes userspace happy and is
> simple and straightforward. I'm open to alternatives, but there have been 0
> workable alternatives proposed in the last decade of complaining about it.
Make sure we cross a vfsmount when crossing the "st_dev" domain so
that it is properly reported. Suggested many times and ignored all
the time because it requires a bit of work.
On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
>> Because there's no alternative. We need a way to tell userspace they've
>> wandered into a different inode namespace. There's no argument that what
>> we're doing is ugly, but there's never been a clear "do X instead". Just a
>> lot of whinging that btrfs is broken. This makes userspace happy and is
>> simple and straightforward. I'm open to alternatives, but there have been 0
>> workable alternatives proposed in the last decade of complaining about it.
>
> Make sure we cross a vfsmount when crossing the "st_dev" domain so
> that it is properly reported. Suggested many times and ignored all
> the time because it requires a bit of work.
>
You keep telling me this but forgetting that I did all this work when you
originally suggested it. The problem I ran into was the automount stuff
requires that we have a completely different superblock for every vfsmount.
This is fine for things like nfs or samba where the automount literally points
to a completely different mount, but doesn't work for btrfs where it's on the
same file system. If you have 1000 subvolumes and run sync() you're going to
write the superblock 1000 times for the same file system. You are going to
reclaim inodes on the same file system 1000 times. You are going to reclaim
dcache on the same filesystem 1000 times. You are also going to pin 1000
dentries/inodes into memory whenever you wander into these things because the
super is going to hold them open.
This is not a workable solution. It's not a matter of simply tying into
existing infrastructure, we'd have to completely rework how the VFS deals with
this stuff in order to be reasonable. And when I brought this up to Al he told
me I was insane and we absolutely had to have a different SB for every vfsmount,
which means we can't use vfsmount for this, which means we don't have any other
options. Thanks,
Josef
On Fri, 16 Jul 2021, Josef Bacik wrote:
> On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> > On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> >> Because there's no alternative. We need a way to tell userspace they've
> >> wandered into a different inode namespace. There's no argument that what
> >> we're doing is ugly, but there's never been a clear "do X instead". Just a
> >> lot of whinging that btrfs is broken. This makes userspace happy and is
> >> simple and straightforward. I'm open to alternatives, but there have been 0
> >> workable alternatives proposed in the last decade of complaining about it.
> >
> > Make sure we cross a vfsmount when crossing the "st_dev" domain so
> > that it is properly reported. Suggested many times and ignored all
> > the time because it requires a bit of work.
> >
>
> You keep telling me this but forgetting that I did all this work when you
> originally suggested it. The problem I ran into was the automount stuff
> requires that we have a completely different superblock for every vfsmount.
> This is fine for things like nfs or samba where the automount literally points
> to a completely different mount, but doesn't work for btrfs where it's on the
> same file system. If you have 1000 subvolumes and run sync() you're going to
> write the superblock 1000 times for the same file system. You are going to
> reclaim inodes on the same file system 1000 times. You are going to reclaim
> dcache on the same filesystem 1000 times. You are also going to pin 1000
> dentries/inodes into memory whenever you wander into these things because the
> super is going to hold them open.
>
> This is not a workable solution. It's not a matter of simply tying into
> existing infrastructure, we'd have to completely rework how the VFS deals with
> this stuff in order to be reasonable. And when I brought this up to Al he told
> me I was insane and we absolutely had to have a different SB for every vfsmount,
> which means we can't use vfsmount for this, which means we don't have any other
> options. Thanks,
When I was first looking at this, I thought that separate vfsmnts
and auto-mounting was the way to go "just like NFS". NFS still shares a
lot between the multiple superblocks - certainly it shares the same
connection to the server.
But I dropped the idea when Bruce pointed out that nfsd is not set up to
export auto-mounted filesystems. It needs to be able to find a
filesystem given a UUID (extracted from a filehandle), and it does this
by walking through the mount table to find one that matches. So unless
all btrfs subvols were mounted all the time (which I wouldn't propose),
it would need major work to fix.
NFSv4 describes the fsid as having a "major" and "minor" component.
We've never treated these as having an important meaning - just extra
bits to encode uniqueness in. Maybe we should have used "major" for the
vfsmnt, and kept "minor" for the subvol.....
The idea for a single vfsmnt exposing multiple inode-name-spaces does
appeal to me. The "st_dev" is just part of the name, and already a
fairly blurry part. Thanks to bind mounts, multiple mounts can have the
same st_dev. I see no intrinsic reason that a single mount should not
have multiple fsids, provided that a coherent picture is provided to
userspace which doesn't contain too many surprises.
NeilBrown
On Fri, 16 Jul 2021, Josef Bacik wrote:
>
> I'm going to restate what I think the problem is you're having just so I'm sure
> we're on the same page.
>
> 1. We export a btrfs volume via nfsd that has multiple subvolumes.
> 2. We run find, and when we stat a file, nfsd doesn't send along our bogus
> st_dev, it sends its own thing (I assume?). This confuses du/find because you
> get the same inode number with different parents.
>
> Is this correct? If that's the case then it'd be relatively straightforward to
> add another callback into export_operations to grab this fsid right? Hell we
> could simply return the objectid of the root since that's unique across the
> entire file system. We already do our magic FH encoding to make sure we keep
> all this straight for NFS, another callback to give that info isn't going to
> kill us. Thanks,
Fairly close.
As well as the fsid I need a "mounted-on" inode number, so one callback
to provide both would do.
If zero was reported, that would be equivalent to not providing the
callback.
- Is "u64" always enough for the subvol-id?
- Should we make these details available to user-space with a new STATX
flag?
- Should it be a new export_operations callback, or new fields in
"struct kstat" ??
... though having asked those questions, I begin to wonder if I took a
wrong turn.
I can already get some fsid information from statfs, though it is only
64 bits and for BTRFS it combines the filesystem uuid and the subvol
id. For that reason I avoided it.
But I'm already caching the fsid for the export-point. If, when I find
a different fsid lower down, I xor the result with the export-point
fsid, the result would be fairly clean (the xor difference between the
two subvol ids) and could be safely mixed into the fsid we currently
report.
So all I REALLY need from btrfs is a "mounted-on" inode number, matching
what readdir() reports.
I wouldn't argue AGAINST getting cleaner fsid information. A 128-bit
uuid and a 64bit subvol id would be ideal.
I'd rather see them as new STATX flags than a new export_operations
callback.
NeilBrown
On Thu, Jul 15, 2021 at 02:01:11PM -0400, Josef Bacik wrote:
> This is not a workable solution. It's not a matter of simply tying into
> existing infrastructure, we'd have to completely rework how the VFS deals
> with this stuff in order to be reasonable. And when I brought this up to Al
> he told me I was insane and we absolutely had to have a different SB for
> every vfsmount, which means we can't use vfsmount for this, which means we
> don't have any other options. Thanks,
Then fix the problem another way. The problem is known, old and keeps
breaking stuff. Don't paper over it, fix it.
On 7/15/21 6:37 PM, NeilBrown wrote:
> On Fri, 16 Jul 2021, Josef Bacik wrote:
>> On 7/15/21 1:24 PM, Christoph Hellwig wrote:
>>> On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
>>>> Because there's no alternative. We need a way to tell userspace they've
>>>> wandered into a different inode namespace. There's no argument that what
>>>> we're doing is ugly, but there's never been a clear "do X instead". Just a
>>>> lot of whinging that btrfs is broken. This makes userspace happy and is
>>>> simple and straightforward. I'm open to alternatives, but there have been 0
>>>> workable alternatives proposed in the last decade of complaining about it.
>>>
>>> Make sure we cross a vfsmount when crossing the "st_dev" domain so
>>> that it is properly reported. Suggested many times and ignored all
>>> the time because it requires a bit of work.
>>>
>>
>> You keep telling me this but forgetting that I did all this work when you
>> originally suggested it. The problem I ran into was the automount stuff
>> requires that we have a completely different superblock for every vfsmount.
>> This is fine for things like nfs or samba where the automount literally points
>> to a completely different mount, but doesn't work for btrfs where it's on the
>> same file system. If you have 1000 subvolumes and run sync() you're going to
>> write the superblock 1000 times for the same file system. You are going to
>> reclaim inodes on the same file system 1000 times. You are going to reclaim
>> dcache on the same filesystem 1000 times. You are also going to pin 1000
>> dentries/inodes into memory whenever you wander into these things because the
>> super is going to hold them open.
>>
>> This is not a workable solution. It's not a matter of simply tying into
>> existing infrastructure, we'd have to completely rework how the VFS deals with
>> this stuff in order to be reasonable. And when I brought this up to Al he told
>> me I was insane and we absolutely had to have a different SB for every vfsmount,
>> which means we can't use vfsmount for this, which means we don't have any other
>> options. Thanks,
>
> When I was first looking at this, I thought that separate vfsmnts
> and auto-mounting was the way to go "just like NFS". NFS still shares a
> lot between the multiple superblocks - certainly it shares the same
> connection to the server.
>
> But I dropped the idea when Bruce pointed out that nfsd is not set up to
> export auto-mounted filesystems. It needs to be able to find a
> filesystem given a UUID (extracted from a filehandle), and it does this
> by walking through the mount table to find one that matches. So unless
> all btrfs subvols were mounted all the time (which I wouldn't propose),
> it would need major work to fix.
>
> NFSv4 describes the fsid as having a "major" and "minor" component.
> We've never treated these as having an important meaning - just extra
> bits to encode uniqueness in. Maybe we should have used "major" for the
> vfsmnt, and kept "minor" for the subvol.....
>
> The idea for a single vfsmnt exposing multiple inode-name-spaces does
> appeal to me. The "st_dev" is just part of the name, and already a
> fairly blurry part. Thanks to bind mounts, multiple mounts can have the
> same st_dev. I see no intrinsic reason that a single mount should not
> have multiple fsids, provided that a coherent picture is provided to
> userspace which doesn't contain too many surprises.
>
Ok so setting aside btrfs for the moment, how does NFS deal with exporting a
directory that has multiple other file systems under that tree? I assume the
same sort of problem doesn't occur, but why is that? Is it because it's a
different vfsmount/sb or is there some other magic making this work? Thanks,
Josef
On Fri, Jul 16, 2021 at 08:37:07AM +1000, NeilBrown wrote:
> On Fri, 16 Jul 2021, Josef Bacik wrote:
> > On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> > > On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> > >> Because there's no alternative. We need a way to tell userspace they've
> > >> wandered into a different inode namespace. There's no argument that what
> > >> we're doing is ugly, but there's never been a clear "do X instead". Just a
> > >> lot of whinging that btrfs is broken. This makes userspace happy and is
> > >> simple and straightforward. I'm open to alternatives, but there have been 0
> > >> workable alternatives proposed in the last decade of complaining about it.
> > >
> > > Make sure we cross a vfsmount when crossing the "st_dev" domain so
> > > that it is properly reported. Suggested many times and ignored all
> > > the time because it requires a bit of work.
> > >
> >
> > You keep telling me this but forgetting that I did all this work when you
> > originally suggested it. The problem I ran into was the automount stuff
> > requires that we have a completely different superblock for every vfsmount.
> > This is fine for things like nfs or samba where the automount literally points
> > to a completely different mount, but doesn't work for btrfs where it's on the
> > same file system. If you have 1000 subvolumes and run sync() you're going to
> > write the superblock 1000 times for the same file system. You are going to
> > reclaim inodes on the same file system 1000 times. You are going to reclaim
> > dcache on the same filesystem 1000 times. You are also going to reclaim
> > dentries/inodes into memory whenever you wander into these things because the
> > super is going to hold them open.
> >
> > This is not a workable solution. It's not a matter of simply tying into
> > existing infrastructure, we'd have to completely rework how the VFS deals with
> > this stuff in order to be reasonable. And when I brought this up to Al he told
> > me I was insane and we absolutely had to have a different SB for every vfsmount,
> > which means we can't use vfsmount for this, which means we don't have any other
> > options. Thanks,
>
> When I was first looking at this, I thought that separate vfsmnts
> and auto-mounting was the way to go "just like NFS". NFS still shares a
> lot between the multiple superblocks - certainly it shares the same
> connection to the server.
>
> But I dropped the idea when Bruce pointed out that nfsd is not set up to
> export auto-mounted filesystems.
Yes. I wish it was.... But we'd need some way to look up a
not-currently-mounted filesystem by filehandle:
> It needs to be able to find a
> filesystem given a UUID (extracted from a filehandle), and it does this
> by walking through the mount table to find one that matches. So unless
> all btrfs subvols were mounted all the time (which I wouldn't propose),
> it would need major work to fix.
>
> NFSv4 describes the fsid as having a "major" and "minor" component.
> We've never treated these as having an important meaning - just extra
> bits to encode uniqueness in. Maybe we should have used "major" for the
> vfsmnt, and kept "minor" for the subvol.....
So nfsd would use the "major" ID to find the parent export, and then
btrfs would use the "minor" ID to identify the subvolume?
--b.
> The idea for a single vfsmnt exposing multiple inode-name-spaces does
> appeal to me. The "st_dev" is just part of the name, and already a
> fairly blurry part. Thanks to bind mounts, multiple mounts can have the
> same st_dev. I see no intrinsic reason that a single mount should not
> have multiple fsids, provided that a coherent picture is provided to
> userspace which doesn't contain too many surprises.
On Mon, Jul 19, 2021 at 11:40:28AM -0400, Josef Bacik wrote:
> Ok so setting aside btrfs for the moment, how does NFS deal with
> exporting a directory that has multiple other file systems under
> that tree? I assume the same sort of problem doesn't occur, but why
> is that? Is it because it's a different vfsmount/sb or is there
> some other magic making this work? Thanks,
There are two main ways an NFS client can look up a file: by name or by
filehandle. The former's the normal filesystem directory lookup that
we're used to. If the name refers to a mountpoint, the server can cross
into the mounted filesystem like anyone else.
It's the lookup by filehandle that's interesting. Typically the
filehandle includes a UUID and an inode number. The server looks up the
UUID with some help from mountd, and that gives a superblock that nfsd
can use for the inode lookup.
As Neil says, mountd does that basically by searching among mounted
filesystems for one with that uuid.
So if you wanted to be able to handle a uuid for a filesystem that's not
even mounted yet, you'd need some new mechanism to look up such uuids.
That's something we don't currently support but that we'd need to
support if BTRFS subvolumes were automounted. (And it might have other
uses as well.)
But I'm not entirely sure if that answers your question....
--b.
On 7/19/21 4:00 PM, J. Bruce Fields wrote:
> On Mon, Jul 19, 2021 at 11:40:28AM -0400, Josef Bacik wrote:
>> Ok so setting aside btrfs for the moment, how does NFS deal with
>> exporting a directory that has multiple other file systems under
>> that tree? I assume the same sort of problem doesn't occur, but why
>> is that? Is it because it's a different vfsmount/sb or is there
>> some other magic making this work? Thanks,
>
> There are two main ways an NFS client can look up a file: by name or by
> filehandle. The former's the normal filesystem directory lookup that
> we're used to. If the name refers to a mountpoint, the server can cross
> into the mounted filesystem like anyone else.
>
> It's the lookup by filehandle that's interesting. Typically the
> filehandle includes a UUID and an inode number. The server looks up the
> UUID with some help from mountd, and that gives a superblock that nfsd
> can use for the inode lookup.
>
> As Neil says, mountd does that basically by searching among mounted
> filesystems for one with that uuid.
>
> So if you wanted to be able to handle a uuid for a filesystem that's not
> even mounted yet, you'd need some new mechanism to look up such uuids.
>
> That's something we don't currently support but that we'd need to
> support if BTRFS subvolumes were automounted. (And it might have other
> uses as well.)
>
> But I'm not entirely sure if that answers your question....
>
Right, because btrfs handles the filehandles itself properly with the
export_operations and we encode the subvolume ids into those things to make
sure we can always do the proper lookup.
I suppose the real problem is that NFS is exposing the inode->i_ino to the
client without understanding that it's on a different subvolume.
Our trick of simply allocating an anonymous bdev every time you wander into a
subvolume to get a unique st_dev doesn't help you guys because you are looking
for mounted file systems.
I'm not concerned about the FH case, because for that it's already been crafted
by btrfs and we know what to do with it, so it's always going to be correct.
The actual problem is that we can do
getattr(/file1)
getattr(/snap/file1)
on the client and the NFS server just blindly sends i_ino with the same fsid
because / and /snap are the same fsid.
Which brings us back to what HCH is complaining about. In his view if we had a
vfsmount for /snap then you would know that it was a different fs. However that
would only actually work if we generated a completely different superblock and
thus gave /snap a unique fsid, right?
If we did the automount thing, and the NFS server went down and came back up and
got a getattr(/snap/file1) from a previously generated FH it would still work
right, because it would come into the export_operations with the format that
btrfs is expecting and it would be able to do the lookup. This FH lookup would
do the automount magic it needs to and then NFS would have the fsid it needs,
correct? Thanks,
Josef
On Tue, 20 Jul 2021, Josef Bacik wrote:
> On 7/19/21 4:00 PM, J. Bruce Fields wrote:
> > On Mon, Jul 19, 2021 at 11:40:28AM -0400, Josef Bacik wrote:
> >> Ok so setting aside btrfs for the moment, how does NFS deal with
> >> exporting a directory that has multiple other file systems under
> >> that tree? I assume the same sort of problem doesn't occur, but why
> >> is that? Is it because it's a different vfsmount/sb or is there
> >> some other magic making this work? Thanks,
> >
> > There are two main ways an NFS client can look up a file: by name or by
> > filehandle. The former's the normal filesystem directory lookup that
> > we're used to. If the name refers to a mountpoint, the server can cross
> > into the mounted filesystem like anyone else.
> >
> > It's the lookup by filehandle that's interesting. Typically the
> > filehandle includes a UUID and an inode number. The server looks up the
> > UUID with some help from mountd, and that gives a superblock that nfsd
> > can use for the inode lookup.
> >
> > As Neil says, mountd does that basically by searching among mounted
> > filesystems for one with that uuid.
> >
> > So if you wanted to be able to handle a uuid for a filesystem that's not
> > even mounted yet, you'd need some new mechanism to look up such uuids.
> >
> > That's something we don't currently support but that we'd need to
> > support if BTRFS subvolumes were automounted. (And it might have other
> > uses as well.)
> >
> > But I'm not entirely sure if that answers your question....
> >
>
> Right, because btrfs handles the filehandles itself properly with the
> export_operations and we encode the subvolume ids into those things to make
> sure we can always do the proper lookup.
>
> I suppose the real problem is that NFS is exposing the inode->i_ino to the
> client without understanding that it's on a different subvolume.
>
> Our trick of simply allocating an anonymous bdev every time you wander into a
> subvolume to get a unique st_dev doesn't help you guys because you are looking
> for mounted file systems.
>
> I'm not concerned about the FH case, because for that it's already been crafted
> by btrfs and we know what to do with it, so it's always going to be correct.
>
> The actual problem is that we can do
>
> getattr(/file1)
> getattr(/snap/file1)
>
> on the client and the NFS server just blind sends i_ino with the same fsid
> because / and /snap are the same fsid.
>
> Which brings us back to what HCH is complaining about. In his view if we had a
> vfsmount for /snap then you would know that it was a different fs. However that
> would only actually work if we generated a completely different superblock and
> thus gave /snap a unique fsid, right?
No, I don't think it needs to be a different superblock to have a
vfsmount. (Though I don't know whether a separate superblock is needed
to keep HCH happy.)
If I "mount --bind /snap /snap" then I've created a vfsmnt with the
upper and lower directories identical - same inode, same superblock.
This is an existence-proof that you don't need a separate super-block.
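To make concrete why a duplicated (st_dev, st_ino) pair breaks userspace, here is a minimal sketch (hypothetical, in Python; not nfsd or GNU find code) of the ancestor check that tools like "find" and "du" perform while walking a tree:

```python
# Sketch of cycle detection by (st_dev, st_ino): a walker remembers the
# pair for every directory it enters, so a subvolume root that reuses an
# ancestor's pair (e.g. inode 256 with an unchanged st_dev, as seen over
# NFS) is reported as a filesystem loop even though none exists.
import os

def walk(path, seen=None):
    seen = set() if seen is None else seen
    st = os.stat(path)
    entries = [path]
    if os.path.isdir(path) and not os.path.islink(path):
        key = (st.st_dev, st.st_ino)
        if key in seen:
            return ["loop detected at " + path]
        seen.add(key)
        for name in sorted(os.listdir(path)):
            entries += walk(os.path.join(path, name), seen)
    return entries
```

(Real find tracks only the current ancestor chain rather than a global set; the simplification doesn't change the point about duplicated pairs.)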
>
> If we did the automount thing, and the NFS server went down and came back up and
> got a getattr(/snap/file1) from a previously generated FH it would still work
> right, because it would come into the export_operations with the format that
> btrfs is expecting and it would be able to do the lookup. This FH lookup would
> do the automount magic it needs to and then NFS would have the fsid it needs,
> correct? Thanks,
Not quite.
An NFS filehandle (as generated by linux-nfsd) has two components (plus
a header). The filesystem-part and the file-part.
The filesystem-part is managed by userspace (/usr/sbin/mountd). The
code relies on every filesystem appearing in /proc/self/mounts.
The bytes chosen are based either on the uuid reported by 'libblkid' or
on the fsid reported by statfs(), according to a black-list of
filesystems for which libblkid is not useful. This list includes btrfs.
The file-part is managed in the kernel using export_operations.
For any given 'struct path' in the kernel, a filehandle is generated
(conceptually) by finding the closest vfsmnt (close to the inode, far
from the root) and asking user-space to map that, then passing the
inode to the filesystem and asking it to map that.
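The two-part structure can be sketched as follows. This is a conceptual model only; the byte layout is invented for illustration and is NOT the actual knfsd filehandle format:

```python
# Conceptual sketch of the two filehandle components described above:
# the filesystem-part (chosen by user-space/mountd) and the file-part
# (chosen by the filesystem's export_operations).
import struct

FH_FMT = "<IQI"   # fsid, inode number, generation (illustrative layout)

def encode_fh(fsid, ino, gen):
    # filesystem-part: fsid; file-part: (ino, gen)
    return struct.pack(FH_FMT, fsid, ino, gen)

def decode_fh(fh):
    fsid, ino, gen = struct.unpack(FH_FMT, fh)
    return fsid, (ino, gen)

fh = encode_fh(0x1234, 256, 1)
assert decode_fh(fh) == (0x1234, (256, 1))
```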
So, in your example, if /snap were a mount point, the kernel would ask
mountd to determine the filesystem-part of /snap, and the fact that the
file-part from btrfs contained the objectid for snap would just be
redundant information. If /snap couldn't be found in /proc/self/mounts
after a server restart, the filehandle would be stale.
If btrfs were to use automounts and create the vfsmnts that one might
normally expect, then nfsd would need there to be two different sorts of
mount points, ideally visible in /proc/mounts (maybe a new flag that
appears in the list of mount options? "internal" ??).
- there needs to be the current sort of mountpoint, which is expected
  to be present after a reboot, and is likely to introduce a new
  filesystem, and
- there are these "new" mountpoints which are on-demand and expose
something that is (in some sense) part of the same filesystem.
The key property that NFSd would depend on is that these mount points
do NOT introduce a new name-space for file-handles (in the sense of
export_operations).
To expand on that last point:
- If a filehandle is requested for an inode above the "new" mountpoint
and another "below" the new mountpoint, they are guaranteed to be
different.
- If a filehandle that was "below" the new mountpoint is passed to
exportfs_decode_fh() together with the vfsmnt that was *above* the
mountpoint, then it somehow does "the right thing". Probably
that would require changing exportfs_decode_fh() to return a
'struct path' rather than just a 'struct dentry *'.
When nfsd detected one of these "internal" mountpoints during a lookup,
it would *not* call out to user-space to create a new export, but it
*would* ensure that a new fsid was reported for all inodes in the new
vfsmnt.
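A toy model of that lookup-time behaviour (purely hypothetical; the names and the fsid derivation are invented for illustration):

```python
# Toy model of the behaviour described above: crossing an "internal"
# mountpoint keeps the same export but yields a fresh fsid, while an
# ordinary mountpoint needs a user-space upcall to mountd.
class Mount:
    def __init__(self, internal, subvol_id=0):
        self.internal = internal
        self.subvol_id = subvol_id

def fsid_for(export_fsid, subvol_id):
    # illustrative derivation only; any scheme that is unique per
    # (export, subvolume) pair would do
    return (export_fsid << 32) | subvol_id

def cross_mountpoint(mnt, export_fsid):
    if mnt.internal:
        return ("same-export", fsid_for(export_fsid, mnt.subvol_id))
    return ("upcall-to-mountd", None)

assert cross_mountpoint(Mount(True, 257), 7) == ("same-export", (7 << 32) | 257)
assert cross_mountpoint(Mount(False), 7) == ("upcall-to-mountd", None)
```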
NeilBrown
On Mon, 19 Jul 2021, Christoph Hellwig wrote:
> On Thu, Jul 15, 2021 at 02:01:11PM -0400, Josef Bacik wrote:
> > This is not a workable solution. It's not a matter of simply tying into
> > existing infrastructure, we'd have to completely rework how the VFS deals
> > with this stuff in order to be reasonable. And when I brought this up to Al
> > he told me I was insane and we absolutely had to have a different SB for
> > every vfsmount, which means we can't use vfsmount for this, which means we
> > don't have any other options. Thanks,
>
> Then fix the problem another way. The problem is known, old and keeps
> breaking stuff. Don't paper over it, fix it.
Do you have any pointers to other breakage caused by this particular
behaviour of btrfs? It would help to have all requirements clearly on the
table while designing a solution.
Thanks,
NeilBrown
On Tue, 20 Jul 2021, J. Bruce Fields wrote:
> On Fri, Jul 16, 2021 at 08:37:07AM +1000, NeilBrown wrote:
> > On Fri, 16 Jul 2021, Josef Bacik wrote:
> > > On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> > > > On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> > > >> Because there's no alternative. We need a way to tell userspace they've
> > > >> wandered into a different inode namespace. There's no argument that what
> > > >> we're doing is ugly, but there's never been a clear "do X instead". Just a
> > > >> lot of whinging that btrfs is broken. This makes userspace happy and is
> > > >> simple and straightforward. I'm open to alternatives, but there have been 0
> > > >> workable alternatives proposed in the last decade of complaining about it.
> > > >
> > > > Make sure we cross a vfsmount when crossing the "st_dev" domain so
> > > > that it is properly reported. Suggested many times and ignored all
> > > > the time beause it requires a bit of work.
> > > >
> > >
> > > You keep telling me this but forgetting that I did all this work when you
> > > originally suggested it. The problem I ran into was the automount stuff
> > > requires that we have a completely different superblock for every vfsmount.
> > > This is fine for things like nfs or samba where the automount literally points
> > > to a completely different mount, but doesn't work for btrfs where it's on the
> > > same file system. If you have 1000 subvolumes and run sync() you're going to
> > > write the superblock 1000 times for the same file system. You are going to
> > > reclaim inodes on the same file system 1000 times. You are going to reclaim
> > > dcache on the same filesystem 1000 times. You are also going to pin 1000
> > > dentries/inodes into memory whenever you wander into these things because the
> > > super is going to hold them open.
> > >
> > > This is not a workable solution. It's not a matter of simply tying into
> > > existing infrastructure, we'd have to completely rework how the VFS deals with
> > > this stuff in order to be reasonable. And when I brought this up to Al he told
> > > me I was insane and we absolutely had to have a different SB for every vfsmount,
> > > which means we can't use vfsmount for this, which means we don't have any other
> > > options. Thanks,
> >
> > When I was first looking at this, I thought that separate vfsmnts
> > and auto-mounting was the way to go "just like NFS". NFS still shares a
> > lot between the multiple superblock - certainly it shares the same
> > connection to the server.
> >
> > But I dropped the idea when Bruce pointed out that nfsd is not set up to
> > export auto-mounted filesystems.
>
> Yes. I wish it was.... But we'd need some way to look up a
> not-currently-mounted filesystem by filehandle:
>
> > It needs to be able to find a
> > filesystem given a UUID (extracted from a filehandle), and it does this
> > by walking through the mount table to find one that matches. So unless
> > all btrfs subvols were mounted all the time (which I wouldn't propose),
> > it would need major work to fix.
> >
> > NFSv4 describes the fsid as having a "major" and "minor" component.
> > We've never treated these as having an important meaning - just extra
> > bits to encode uniqueness in. Maybe we should have used "major" for the
> > vfsmnt, and kept "minor" for the subvol.....
>
> So nfsd would use the "major" ID to find the parent export, and then
> btrfs would use the "minor" ID to identify the subvolume?
Maybe, though I don't think it would be really useful - just a
thought-bubble.
As the spec doesn't define any behaviour of these two numbers, there is
no point trying to impose any.
But (as described in another email) I think we do need to clearly
differentiate between "volume" and "subvolume" in the Linux API.
We cannot really use "different mount point" to mean "different volume"
as bind mounts broke that model long ago.
I think that "different st_dev" means "different subvolume" is a core
requirement as many applications assume that. So the question is "how
to determine if two objects in different subvolumes are still in the
same volume". This is something that nfsd needs to know.
NeilBrown
On Tue, Jul 20, 2021 at 09:54:44AM +1000, NeilBrown wrote:
> Do you have any pointers to other breakage caused by this particular
> behaviour of btrfs? It would help to have all requirements clearly on the
> table while designing a solution.
A quick google find:
https://lore.kernel.org/linux-btrfs/[email protected]/T/
https://savannah.gnu.org/bugs/?50859
https://github.com/coreos/bugs/issues/301
https://bugs.kde.org/show_bug.cgi?id=317127
https://github.com/borgbackup/borg/issues/4009
https://bugs.python.org/issue37339
http://mail.openjdk.java.net/pipermail/nio-dev/2017-June/004292.html
and that is just the first two or three pages of trivial search results.
On Tue, 20 Jul 2021, Christoph Hellwig wrote:
> On Tue, Jul 20, 2021 at 09:54:44AM +1000, NeilBrown wrote:
> > Do you have any pointers to other breakage caused by this particular
> > behaviour of btrfs? It would help to have all requirements clearly on the
> > table while designing a solution.
>
> A quick google find:
>
> https://lore.kernel.org/linux-btrfs/[email protected]/T/
> https://savannah.gnu.org/bugs/?50859
> https://github.com/coreos/bugs/issues/301
> https://bugs.kde.org/show_bug.cgi?id=317127
> https://github.com/borgbackup/borg/issues/4009
> https://bugs.python.org/issue37339
> http://mail.openjdk.java.net/pipermail/nio-dev/2017-June/004292.html
>
> and that is just the first two or three pages of trivial search results.
>
Thanks a lot for these! Very helpful.
The details vary, but the core problem seems to be that the device
number found in /proc/self/mountinfo is the same for all mounts from a
given btrfs filesystem, no matter which subvol happens to be found at or
beneath that mountpoint. So it can even happen that 'stat' on a
mountpoint returns different numbers from what is found for that
mountpoint in /proc/self/mountinfo.
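The mismatch can be checked directly. A small sketch (parsing only the mountinfo fields it needs, and assuming the mountpoint path contains no escape sequences):

```python
# Compare the device number that /proc/self/mountinfo reports for a
# mountpoint with what stat(2) returns for the same path. On most
# filesystems the two agree; when a btrfs subvol sits at (or below) the
# mountpoint they can differ, which is the inconsistency described above.
import os

def mountinfo_dev(mountpoint):
    with open("/proc/self/mountinfo") as f:
        for line in f:
            fields = line.split()
            # field 3 is "major:minor", field 5 is the mount point
            if fields[4] == mountpoint:
                major, minor = fields[2].split(":")
                return os.makedev(int(major), int(minor))
    return None

def stat_dev(mountpoint):
    return os.stat(mountpoint).st_dev
```

Comparing mountinfo_dev("/") with stat_dev("/") shows a match on ordinary filesystems; on a btrfs subvolume mount it need not match.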
To address these issues we would need to:
1/ make every btrfs subvol which is not already a mountpoint into an
automount point which mounts the subvol (similar to the use of
automount in NFS).
2/ either give each subvol a separate 'struct super_block' (which is
apparently a bad idea) or change show_mountinfo() to allow an
alternate dev_t to be used. e.g. some new s_op which is given
mnt->mnt_root and returns a dev_t. If the new s_op is not
available, sb->s_dev is used.
For nfsd to be able to work with this, those automount points need to
have an inode in the parent filesystem with a distinct inode number, and
the mount must be marked in some way that nfsd can tell that it is
"internal". Possibly a helper function that tests if mnt_parent has the
same mnt.mnt_sb would be sufficient, though it might be nice to export
this fact to user-space somehow.
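A userspace model of that helper (hypothetical; the real check would live in the kernel against struct mount):

```python
# Model of the suggested helper: a mountpoint is "internal" when its
# parent mount shares the same superblock.
class SuperBlock:
    pass

class Mount:
    def __init__(self, sb, parent=None):
        self.mnt_sb = sb
        self.mnt_parent = parent

def is_internal_mount(mnt):
    return (mnt.mnt_parent is not None
            and mnt.mnt_sb is mnt.mnt_parent.mnt_sb)

sb = SuperBlock()
root = Mount(sb)
subvol = Mount(sb, parent=root)          # btrfs subvol automount: same sb
nfs = Mount(SuperBlock(), parent=root)   # ordinary mount: different sb
assert is_internal_mount(subvol)
assert not is_internal_mount(nfs)
```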
Also exportfs_decode_fh() needs to be enhanced, probably to return a
'struct path'.
Does anything there seem unreasonable to you?
Thanks,
NeilBrown
On Tue, Jul 20, 2021 at 05:17:12PM +1000, NeilBrown wrote:
> Does anything there seem unreasonable to you?
This is what I've been asking for for years.
On Thu, Jul 15, 2021 at 02:01:11PM -0400, Josef Bacik wrote:
> The problem I ran into was the automount stuff requires that we have a
> completely different superblock for every vfsmount. This is fine for
> things like nfs or samba where the automount literally points to a
> completely different mount, but doesn't work for btrfs where it's on
> the same file system. If you have 1000 subvolumes and run sync()
> you're going to write the superblock 1000 times for the same file
> system.
Dumb question: why do you have to write the superblock 1000 times, and
why is that slower than writing to 1000 different filesystems?
> You are
> going to reclaim inodes on the same file system 1000 times. You are
> going to reclaim dcache on the same filesystem 1000 times. You are
> also going to pin 1000 dentries/inodes into memory whenever you
> wander into these things because the super is going to hold them
> open.
That last part at least is the same for the 1000-different-filesystems
case, isn't it?
--b.
> This is not a workable solution. It's not a matter of simply tying
> into existing infrastructure, we'd have to completely rework how the
> VFS deals with this stuff in order to be reasonable. And when I
> brought this up to Al he told me I was insane and we absolutely had
> to have a different SB for every vfsmount, which means we can't use
> vfsmount for this, which means we don't have any other options.
> Thanks,
>
> Josef
On Tue, 20 Jul 2021, Christoph Hellwig wrote:
> On Tue, Jul 20, 2021 at 05:17:12PM +1000, NeilBrown wrote:
> > Does anything there seem unreasonable to you?
>
> This is what I've been asking for for years.
>
>
Excellent - we seem to be on the same page.
I'll aim to have some preliminary patches for review within a week.
Thanks,
NeilBrown