2024-02-21 00:51:53

by Kent Overstreet

[permalink] [raw]
Subject: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

Recently we had a pretty long discussion on statx extensions, which
eventually got a bit offtopic but nevertheless hashed out all the major
issues.

To summarize:
- guaranteeing inode number uniqueness is becoming increasingly
infeasible, we need a bit to tell userspace "inode number is not
unique, use filehandle instead"
- we need a new field (st_vol, volume ID) for subvolumes - subvolumes
aren't filesystems and st_dev is problematic too
- I'd like a bit for flagging subvolume roots, as well

That's basic stuff. Beyond that, it would be useful to standardize APIs
for
- recursively enumerating subvolumes (without walking full fs
heirarchy), and translating from volume IDs to paths
- exposing snapshot tree structure - which is yet another tree
structure we need to expose, completely unrelated to normal fs path
tree structure
- snapshot recovery - i.e. atomically replacing one subvolume with
another subvolume that was a snapshot
- setting default subvolume root
- exposing disk usage of snapshots

& probably more.

Hoping to get some real participation from the btrfs crew - if they
could talk about what they've done and what else they see a need for,
that would be wonderful. Additionally, we tend to not have a lot of
userspace people at LSF, but standardizing and improving these APIs is
something userspace would _very_ much like us to do, so in addition to
the usual crew I'm hoping to bring Neal Gompa to share that perspective.

Cheers,
Kent


2024-02-21 15:29:13

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Wed, 21 Feb 2024 at 01:51, Kent Overstreet <[email protected]> wrote:
>
> Recently we had a pretty long discussion on statx extensions, which
> eventually got a bit offtopic but nevertheless hashed out all the major
> issues.
>
> To summarize:
> - guaranteeing inode number uniqueness is becoming increasingly
> infeasible, we need a bit to tell userspace "inode number is not
> unique, use filehandle instead"

This is a tough one. POSIX says "The st_ino and st_dev fields taken
together uniquely identify the file within the system."

Adding a bit that says "from now the above POSIX rule is invalid"
doesn't instantly fix all the existing applications that rely on it.

Linux did manage to extend st_ino from 32 to 64 bits, but even in that
case it's not clear how many instances of

stat(path1, &st);
unsigned int ino = st.st_ino;
stat(path2, &st);
if (ino == st.st_ino)
...

are waiting to blow up one fine day. Of course the code should have
used ino_t, but I think this pattern is not that uncommon.

All in all, I don't think adding a flag to statx is the right answer.
It entitles filesystem developers to be sloppy about st_ino
uniqueness, which is not a good idea. I think what overlayfs is
doing (see documentation) is generally the right direction. It makes
various compromises but not to uniqueness, and we haven't had
complaints (fingers crossed).

Nudging userspace developers to use file handles would also be good,
but they should do so unconditionally, not based on a flag that has no
well defined meaning.

Thanks,
Miklos

2024-02-21 21:05:30

by Kent Overstreet

[permalink] [raw]
Subject: Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Wed, Feb 21, 2024 at 04:06:34PM +0100, Miklos Szeredi wrote:
> On Wed, 21 Feb 2024 at 01:51, Kent Overstreet <[email protected]> wrote:
> >
> > Recently we had a pretty long discussion on statx extensions, which
> > eventually got a bit offtopic but nevertheless hashed out all the major
> > issues.
> >
> > To summarize:
> > - guaranteeing inode number uniqueness is becoming increasingly
> > infeasible, we need a bit to tell userspace "inode number is not
> > unique, use filehandle instead"
>
> This is a tough one. POSIX says "The st_ino and st_dev fields taken
> together uniquely identify the file within the system."
>
> Adding a bit that says "from now the above POSIX rule is invalid"
> doesn't instantly fix all the existing applications that rely on it.

Even POSIX must bend when faced with reality. 64 bits is getting
uncomfortably cramped already and with filesystems getting bigger it's
going to break sooner or later.

We don't want to be abusing st_dev, and snapshots and inode number
sharding mean we're basically out of bits today.

> doing (see documentation) is generally the right direction. It makes
> various compromises but not to uniqueness, and we haven't had
> complaints (fingers crossed).

I haven't seen anything in overlayfs that looked like a real solution,
just hacks that would break sooner or later if more filesystems are
being stacked.

> Nudging userspace developers to use file handles would also be good,
> but they should do so unconditionally, not based on a flag that has no
> well defined meaning.

If we define it, it has a perfectly well defined meaning.

I wouldn't be against telling userspace to use file handles
unconditionally; they should only need to query it for a file that has
handlinks, anyways.

But I think we _do_ need this bit, if nothing else, as exactly that
nudge.

2024-02-21 23:35:03

by Josef Bacik

[permalink] [raw]
Subject: Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Wed, Feb 21, 2024 at 04:06:34PM +0100, Miklos Szeredi wrote:
> On Wed, 21 Feb 2024 at 01:51, Kent Overstreet <[email protected]> wrote:
> >
> > Recently we had a pretty long discussion on statx extensions, which
> > eventually got a bit offtopic but nevertheless hashed out all the major
> > issues.
> >
> > To summarize:
> > - guaranteeing inode number uniqueness is becoming increasingly
> > infeasible, we need a bit to tell userspace "inode number is not
> > unique, use filehandle instead"
>
> This is a tough one. POSIX says "The st_ino and st_dev fields taken
> together uniquely identify the file within the system."
>

Which is what btrfs has done forever, and we've gotten yelled at forever for
doing it. We have a compromise and a way forward, but it's not a widely held
view that changing st_dev to give uniqueness is an acceptable solution. It may
have been for overlayfs because you guys are already doing something special,
but it's not an option that is afforded the rest of us.

> Adding a bit that says "from now the above POSIX rule is invalid"
> doesn't instantly fix all the existing applications that rely on it.
>
> Linux did manage to extend st_ino from 32 to 64 bits, but even in that
> case it's not clear how many instances of
>
> stat(path1, &st);
> unsigned int ino = st.st_ino;
> stat(path2, &st);
> if (ino == st.st_ino)
> ...
>
> are waiting to blow up one fine day. Of course the code should have
> used ino_t, but I think this pattern is not that uncommon.
>
> All in all, I don't think adding a flag to statx is the right answer.
> It entitles filesystem developers to be sloppy about st_ino
> uniqueness, which is not a good idea. I think what overlayfs is
> doing (see documentation) is generally the right direction. It makes
> various compromises but not to uniqueness, and we haven't had
> complaints (fingers crossed).

Again, you haven't, I have, consistently and constantly for a decade.

> Nudging userspace developers to use file handles would also be good,
> but they should do so unconditionally, not based on a flag that has no
> well defined meaning.

I think that's what we're trying to do, define it properly. We now have 2 file
systems in tree that have this sort of behavior. It's not a new or crazy thing
(well I suppose it is when you consider the lifetime of file systems), having a
way for user space developers that care to properly identify they've wandered
across a subvolume boundary could be useful.

As for the proposal itself, we talk about this every year. We're all more or
less onboard with the idea, the code just needs to be written. Write the code
and post the patches, I assume that there won't be much pushback, probably could
even get it into Christian's tree in some branch or another before LSF. Thanks,

Josef

2024-02-22 09:14:48

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Wed, 21 Feb 2024 at 22:08, Josef Bacik <[email protected]> wrote:
>
> On Wed, Feb 21, 2024 at 04:06:34PM +0100, Miklos Szeredi wrote:
> > On Wed, 21 Feb 2024 at 01:51, Kent Overstreet <[email protected]> wrote:
> > >
> > > Recently we had a pretty long discussion on statx extensions, which
> > > eventually got a bit offtopic but nevertheless hashed out all the major
> > > issues.
> > >
> > > To summarize:
> > > - guaranteeing inode number uniqueness is becoming increasingly
> > > infeasible, we need a bit to tell userspace "inode number is not
> > > unique, use filehandle instead"
> >
> > This is a tough one. POSIX says "The st_ino and st_dev fields taken
> > together uniquely identify the file within the system."
> >
>
> Which is what btrfs has done forever, and we've gotten yelled at forever for
> doing it. We have a compromise and a way forward, but it's not a widely held
> view that changing st_dev to give uniqueness is an acceptable solution. It may
> have been for overlayfs because you guys are already doing something special,
> but it's not an option that is afforded the rest of us.

Overlayfs tries hard not to use st_dev to give uniqueness and instead
partitions the 64bit st_ino space within the same st_dev. There are
various fallback cases, some involve switching st_dev and some using
non-persistent st_ino.

What overlayfs does may or may not be applicable to btrfs/bcachefs,
but that's not my point. My point is that adding a flag to statx does
not solve anything. You can't just say that from now on btrfs
doesn't have use unique st_ino/st_dev because we've just indicated
that in statx and everything is fine. That will trigger the
no-regressions rule and then it's game over. At least I would expect
that to happen.

What we can do instead is introduce a new API that is better, and
thankfully we already have one in the form of file handles. The
problem I see is that you think you can get away with then reverting
back st_dev to be uniform across subvolumes. But you can't. I see
two options:

a) do some hacks, like overlayfs does

b) introduce a new "st_dev_v2" that will do the right thing and
applications can move over.

Thanks,
Miklos

2024-02-22 09:44:53

by Kent Overstreet

[permalink] [raw]
Subject: Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu, Feb 22, 2024 at 10:14:20AM +0100, Miklos Szeredi wrote:
> On Wed, 21 Feb 2024 at 22:08, Josef Bacik <[email protected]> wrote:
> >
> > On Wed, Feb 21, 2024 at 04:06:34PM +0100, Miklos Szeredi wrote:
> > > On Wed, 21 Feb 2024 at 01:51, Kent Overstreet <[email protected]> wrote:
> > > >
> > > > Recently we had a pretty long discussion on statx extensions, which
> > > > eventually got a bit offtopic but nevertheless hashed out all the major
> > > > issues.
> > > >
> > > > To summarize:
> > > > - guaranteeing inode number uniqueness is becoming increasingly
> > > > infeasible, we need a bit to tell userspace "inode number is not
> > > > unique, use filehandle instead"
> > >
> > > This is a tough one. POSIX says "The st_ino and st_dev fields taken
> > > together uniquely identify the file within the system."
> > >
> >
> > Which is what btrfs has done forever, and we've gotten yelled at forever for
> > doing it. We have a compromise and a way forward, but it's not a widely held
> > view that changing st_dev to give uniqueness is an acceptable solution. It may
> > have been for overlayfs because you guys are already doing something special,
> > but it's not an option that is afforded the rest of us.
>
> Overlayfs tries hard not to use st_dev to give uniqueness and instead
> partitions the 64bit st_ino space within the same st_dev. There are
> various fallback cases, some involve switching st_dev and some using
> non-persistent st_ino.

Yeah no, you can't crap multiple 64 bit inode number spaces into 64
bits: pigeonhole principle.

We need something better than "hacks".

> What overlayfs does may or may not be applicable to btrfs/bcachefs,
> but that's not my point. My point is that adding a flag to statx does
> not solve anything. You can't just say that from now on btrfs
> doesn't have use unique st_ino/st_dev because we've just indicated
> that in statx and everything is fine. That will trigger the
> no-regressions rule and then it's game over. At least I would expect
> that to happen.
>
> What we can do instead is introduce a new API that is better,

This isn't a serious proposal.

2024-02-22 10:25:50

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu, 22 Feb 2024 at 10:42, Kent Overstreet <[email protected]> wrote:

> Yeah no, you can't crap multiple 64 bit inode number spaces into 64
> bits: pigeonhole principle.

Obviously not. And I have no idea about the inode number allocation
strategy of bcachefs and how many bits would be needed for subvolumes,
etc.. I was just telling what overlayfs does and why. It's a
pragmatic solution that works. I'd very much like to move to better
interfaces, but creating good interfaces is never easy.

> We need something better than "hacks".

That's the end goal, obviously. But we also need to take care of
legacy. Always have.

> This isn't a serious proposal.

If not, then what is?

BTW to expand on the st_dev_v2 idea, it can be done by adding a
STATX_DEV_V2 query mask.

That way userspace can ask for the uniform stx_dev if it wants,
knowing full well that stx_ino will be non-unique within that
filesystem. Then kernel is free to return with or without
STATX_DEV_V2, which is basically what you proposed. Except it's now
negotiated and not forced upon legacy interfaces.

The other issue is adding subvolume ID. You seem to think that it's
okay to add that to statx and let userspace use (st_ino, st_subvoid)
to identify the inode. I'm saying this is wrong, because it doesn't
work in the general case.

It doesn't work for overlayfs, for example, and we definitely want to
avoid having userspace do filesystem specific things *if it isn't
absolutely necessary*. So for example "tar" should not care about
subvolumes as long as it's not been explicitly told to care. And that
means for hard link detection if should using the file handle +
st_dev_v2 instead of st_ino + st_subvolid + st_dev_v2. So if that
field is added to statx it must come with a stern warning about this
type of usage.

Thanks,
Miklos

2024-02-22 11:03:36

by Jan Kara

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu 22-02-24 04:42:07, Kent Overstreet wrote:
> On Thu, Feb 22, 2024 at 10:14:20AM +0100, Miklos Szeredi wrote:
> > On Wed, 21 Feb 2024 at 22:08, Josef Bacik <[email protected]> wrote:
> > >
> > > On Wed, Feb 21, 2024 at 04:06:34PM +0100, Miklos Szeredi wrote:
> > > > On Wed, 21 Feb 2024 at 01:51, Kent Overstreet <[email protected]> wrote:
> > > > >
> > > > > Recently we had a pretty long discussion on statx extensions, which
> > > > > eventually got a bit offtopic but nevertheless hashed out all the major
> > > > > issues.
> > > > >
> > > > > To summarize:
> > > > > - guaranteeing inode number uniqueness is becoming increasingly
> > > > > infeasible, we need a bit to tell userspace "inode number is not
> > > > > unique, use filehandle instead"
> > > >
> > > > This is a tough one. POSIX says "The st_ino and st_dev fields taken
> > > > together uniquely identify the file within the system."
> > > >
> > >
> > > Which is what btrfs has done forever, and we've gotten yelled at forever for
> > > doing it. We have a compromise and a way forward, but it's not a widely held
> > > view that changing st_dev to give uniqueness is an acceptable solution. It may
> > > have been for overlayfs because you guys are already doing something special,
> > > but it's not an option that is afforded the rest of us.
> >
> > Overlayfs tries hard not to use st_dev to give uniqueness and instead
> > partitions the 64bit st_ino space within the same st_dev. There are
> > various fallback cases, some involve switching st_dev and some using
> > non-persistent st_ino.
>
> Yeah no, you can't crap multiple 64 bit inode number spaces into 64
> bits: pigeonhole principle.
>
> We need something better than "hacks".

I agree we should have a better long-term plan than finding ways how to
cram things into 64-bits inos. However I don't see a realistic short-term
solution other than that.

To explicit: Currently, tar and patch and very likely other less well-known
tools are broken on bcachefs due to non-unique inode numbers. If you want
ot fix them, either you find ways how bcachefs can cram things into 64-bit
ino_t or you go and modify these tools (or prod maintainers or whatever) to
not depend on ino_t for uniqueness. The application side of things isn't
going to magically fix itself by us telling "bad luck, ino_t isn't unique
anymore".

> > What overlayfs does may or may not be applicable to btrfs/bcachefs,
> > but that's not my point. My point is that adding a flag to statx does
> > not solve anything. You can't just say that from now on btrfs
> > doesn't have use unique st_ino/st_dev because we've just indicated
> > that in statx and everything is fine. That will trigger the
> > no-regressions rule and then it's game over. At least I would expect
> > that to happen.
> >
> > What we can do instead is introduce a new API that is better,
>
> This isn't a serious proposal.

I think for "unique inode identifier" we don't even have to come up with
new APIs. The file handle + fsid pair is an established way to do this,
fanotify successfully uses this as object identifier and Amir did quite
some work for this to be usable for vast majority of filesystems (including
virtual ones). The problem is with rewriting all these applications to use
it. If statx flag telling whether inode numbers are unique helps with that,
sure we can add it, but that seems like a trivial part of the problem.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-22 11:19:42

by Kent Overstreet

[permalink] [raw]
Subject: Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu, Feb 22, 2024 at 11:25:12AM +0100, Miklos Szeredi wrote:
> On Thu, 22 Feb 2024 at 10:42, Kent Overstreet <[email protected]> wrote:
>
> > Yeah no, you can't crap multiple 64 bit inode number spaces into 64
> > bits: pigeonhole principle.
>
> Obviously not. And I have no idea about the inode number allocation
> strategy of bcachefs and how many bits would be needed for subvolumes,
> etc.. I was just telling what overlayfs does and why. It's a
> pragmatic solution that works. I'd very much like to move to better
> interfaces, but creating good interfaces is never easy.

You say "creating good interfaces is never easy" - but we've got a
proposal, that's bounced around a fair bit, and you aren't saying
anything concrete.

> > We need something better than "hacks".
>
> That's the end goal, obviously. But we also need to take care of
> legacy. Always have.

So what are you proposing?

> > This isn't a serious proposal.
>
> If not, then what is?
>
> BTW to expand on the st_dev_v2 idea, it can be done by adding a
> STATX_DEV_V2 query mask.

Didn't you see Josef just say they're trying to get away from st_dev?

> The other issue is adding subvolume ID. You seem to think that it's
> okay to add that to statx and let userspace use (st_ino, st_subvoid)
> to identify the inode. I'm saying this is wrong, because it doesn't
> work in the general case.

No, I explicitly said that when INO_NOT_UNIQUE is set the _filehandle_
would be the means to identify the file.

2024-02-22 11:31:33

by Kent Overstreet

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu, Feb 22, 2024 at 12:01:38PM +0100, Jan Kara wrote:
> On Thu 22-02-24 04:42:07, Kent Overstreet wrote:
> > On Thu, Feb 22, 2024 at 10:14:20AM +0100, Miklos Szeredi wrote:
> > > On Wed, 21 Feb 2024 at 22:08, Josef Bacik <[email protected]> wrote:
> > > >
> > > > On Wed, Feb 21, 2024 at 04:06:34PM +0100, Miklos Szeredi wrote:
> > > > > On Wed, 21 Feb 2024 at 01:51, Kent Overstreet <[email protected]> wrote:
> > > > > >
> > > > > > Recently we had a pretty long discussion on statx extensions, which
> > > > > > eventually got a bit offtopic but nevertheless hashed out all the major
> > > > > > issues.
> > > > > >
> > > > > > To summarize:
> > > > > > - guaranteeing inode number uniqueness is becoming increasingly
> > > > > > infeasible, we need a bit to tell userspace "inode number is not
> > > > > > unique, use filehandle instead"
> > > > >
> > > > > This is a tough one. POSIX says "The st_ino and st_dev fields taken
> > > > > together uniquely identify the file within the system."
> > > > >
> > > >
> > > > Which is what btrfs has done forever, and we've gotten yelled at forever for
> > > > doing it. We have a compromise and a way forward, but it's not a widely held
> > > > view that changing st_dev to give uniqueness is an acceptable solution. It may
> > > > have been for overlayfs because you guys are already doing something special,
> > > > but it's not an option that is afforded the rest of us.
> > >
> > > Overlayfs tries hard not to use st_dev to give uniqueness and instead
> > > partitions the 64bit st_ino space within the same st_dev. There are
> > > various fallback cases, some involve switching st_dev and some using
> > > non-persistent st_ino.
> >
> > Yeah no, you can't crap multiple 64 bit inode number spaces into 64
> > bits: pigeonhole principle.
> >
> > We need something better than "hacks".
>
> I agree we should have a better long-term plan than finding ways how to
> cram things into 64-bits inos. However I don't see a realistic short-term
> solution other than that.
>
> To explicit: Currently, tar and patch and very likely other less well-known
> tools are broken on bcachefs due to non-unique inode numbers. If you want
> ot fix them, either you find ways how bcachefs can cram things into 64-bit
> ino_t or you go and modify these tools (or prod maintainers or whatever) to
> not depend on ino_t for uniqueness. The application side of things isn't
> going to magically fix itself by us telling "bad luck, ino_t isn't unique
> anymore".

My intent is to make a real effort towards getting better interfaces
going, prod those maintainers, _then_ look at adding those hacks (that
will necessarily be short term solutions since 64 bits is already
looking cramped).

Kind of seems to me like you guys think kicking the can down the road
is supposed to to be the first priority, and that's not what I'm doing
here.

> > > What overlayfs does may or may not be applicable to btrfs/bcachefs,
> > > but that's not my point. My point is that adding a flag to statx does
> > > not solve anything. You can't just say that from now on btrfs
> > > doesn't have use unique st_ino/st_dev because we've just indicated
> > > that in statx and everything is fine. That will trigger the
> > > no-regressions rule and then it's game over. At least I would expect
> > > that to happen.
> > >
> > > What we can do instead is introduce a new API that is better,
> >
> > This isn't a serious proposal.
>
> I think for "unique inode identifier" we don't even have to come up with
> new APIs. The file handle + fsid pair is an established way to do this,
> fanotify successfully uses this as object identifier and Amir did quite
> some work for this to be usable for vast majority of filesystems (including
> virtual ones). The problem is with rewriting all these applications to use
> it. If statx flag telling whether inode numbers are unique helps with that,
> sure we can add it, but that seems like a trivial part of the problem.

Basically, it turns it into a checklist item for userspace. "Are we
still doing things the old way (and we've got a documented list of where
that will break", or "are we handling the INO_NOT_UNIQUE case
correctly".

In general when semantics change, APIs ought to change.

2024-02-22 11:44:41

by Jan Kara

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu 22-02-24 06:27:14, Kent Overstreet wrote:
> On Thu, Feb 22, 2024 at 12:01:38PM +0100, Jan Kara wrote:
> > On Thu 22-02-24 04:42:07, Kent Overstreet wrote:
> > > On Thu, Feb 22, 2024 at 10:14:20AM +0100, Miklos Szeredi wrote:
> > > > On Wed, 21 Feb 2024 at 22:08, Josef Bacik <[email protected]> wrote:
> > > > >
> > > > > On Wed, Feb 21, 2024 at 04:06:34PM +0100, Miklos Szeredi wrote:
> > > > > > On Wed, 21 Feb 2024 at 01:51, Kent Overstreet <[email protected]> wrote:
> > > > > > >
> > > > > > > Recently we had a pretty long discussion on statx extensions, which
> > > > > > > eventually got a bit offtopic but nevertheless hashed out all the major
> > > > > > > issues.
> > > > > > >
> > > > > > > To summarize:
> > > > > > > - guaranteeing inode number uniqueness is becoming increasingly
> > > > > > > infeasible, we need a bit to tell userspace "inode number is not
> > > > > > > unique, use filehandle instead"
> > > > > >
> > > > > > This is a tough one. POSIX says "The st_ino and st_dev fields taken
> > > > > > together uniquely identify the file within the system."
> > > > > >
> > > > >
> > > > > Which is what btrfs has done forever, and we've gotten yelled at forever for
> > > > > doing it. We have a compromise and a way forward, but it's not a widely held
> > > > > view that changing st_dev to give uniqueness is an acceptable solution. It may
> > > > > have been for overlayfs because you guys are already doing something special,
> > > > > but it's not an option that is afforded the rest of us.
> > > >
> > > > Overlayfs tries hard not to use st_dev to give uniqueness and instead
> > > > partitions the 64bit st_ino space within the same st_dev. There are
> > > > various fallback cases, some involve switching st_dev and some using
> > > > non-persistent st_ino.
> > >
> > > Yeah no, you can't crap multiple 64 bit inode number spaces into 64
> > > bits: pigeonhole principle.
> > >
> > > We need something better than "hacks".
> >
> > I agree we should have a better long-term plan than finding ways how to
> > cram things into 64-bits inos. However I don't see a realistic short-term
> > solution other than that.
> >
> > To explicit: Currently, tar and patch and very likely other less well-known
> > tools are broken on bcachefs due to non-unique inode numbers. If you want
> > ot fix them, either you find ways how bcachefs can cram things into 64-bit
> > ino_t or you go and modify these tools (or prod maintainers or whatever) to
> > not depend on ino_t for uniqueness. The application side of things isn't
> > going to magically fix itself by us telling "bad luck, ino_t isn't unique
> > anymore".
>
> My intent is to make a real effort towards getting better interfaces
> going, prod those maintainers, _then_ look at adding those hacks (that
> will necessarily be short term solutions since 64 bits is already
> looking cramped).

OK, fine by me :) So one thing is still not quite clear to me - how do you
expect the INO_NOT_UNIQUE flag to be used by these apps? Do you expect them
to use st_dev + st_ino by default and fall back to fsid + fhandle only when
INO_NOT_UNIQUE is set?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-22 11:55:50

by Kent Overstreet

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu, Feb 22, 2024 at 12:44:17PM +0100, Jan Kara wrote:
> On Thu 22-02-24 06:27:14, Kent Overstreet wrote:
> > On Thu, Feb 22, 2024 at 12:01:38PM +0100, Jan Kara wrote:
> > > On Thu 22-02-24 04:42:07, Kent Overstreet wrote:
> > > > On Thu, Feb 22, 2024 at 10:14:20AM +0100, Miklos Szeredi wrote:
> > > > > On Wed, 21 Feb 2024 at 22:08, Josef Bacik <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, Feb 21, 2024 at 04:06:34PM +0100, Miklos Szeredi wrote:
> > > > > > > On Wed, 21 Feb 2024 at 01:51, Kent Overstreet <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Recently we had a pretty long discussion on statx extensions, which
> > > > > > > > eventually got a bit offtopic but nevertheless hashed out all the major
> > > > > > > > issues.
> > > > > > > >
> > > > > > > > To summarize:
> > > > > > > > - guaranteeing inode number uniqueness is becoming increasingly
> > > > > > > > infeasible, we need a bit to tell userspace "inode number is not
> > > > > > > > unique, use filehandle instead"
> > > > > > >
> > > > > > > This is a tough one. POSIX says "The st_ino and st_dev fields taken
> > > > > > > together uniquely identify the file within the system."
> > > > > > >
> > > > > >
> > > > > > Which is what btrfs has done forever, and we've gotten yelled at forever for
> > > > > > doing it. We have a compromise and a way forward, but it's not a widely held
> > > > > > view that changing st_dev to give uniqueness is an acceptable solution. It may
> > > > > > have been for overlayfs because you guys are already doing something special,
> > > > > > but it's not an option that is afforded the rest of us.
> > > > >
> > > > > Overlayfs tries hard not to use st_dev to give uniqueness and instead
> > > > > partitions the 64bit st_ino space within the same st_dev. There are
> > > > > various fallback cases, some involve switching st_dev and some using
> > > > > non-persistent st_ino.
> > > >
> > > > Yeah no, you can't crap multiple 64 bit inode number spaces into 64
> > > > bits: pigeonhole principle.
> > > >
> > > > We need something better than "hacks".
> > >
> > > I agree we should have a better long-term plan than finding ways how to
> > > cram things into 64-bits inos. However I don't see a realistic short-term
> > > solution other than that.
> > >
> > > To explicit: Currently, tar and patch and very likely other less well-known
> > > tools are broken on bcachefs due to non-unique inode numbers. If you want
> > > ot fix them, either you find ways how bcachefs can cram things into 64-bit
> > > ino_t or you go and modify these tools (or prod maintainers or whatever) to
> > > not depend on ino_t for uniqueness. The application side of things isn't
> > > going to magically fix itself by us telling "bad luck, ino_t isn't unique
> > > anymore".
> >
> > My intent is to make a real effort towards getting better interfaces
> > going, prod those maintainers, _then_ look at adding those hacks (that
> > will necessarily be short term solutions since 64 bits is already
> > looking cramped).
>
> OK, fine by me :) So one thing is still not quite clear to me - how do you
> expect the INO_NOT_UNIQUE flag to be used by these apps? Do you expect them
> to use st_dev + st_ino by default and fall back to fsid + fhandle only when
> INO_NOT_UNIQUE is set?

Shouldn't matter. If they care about performance and they're in some
strange situation where the syscal overhead matters, they can use
fhandle only when the bit is set, but I'd personally prefer to see
everyone on the same codepath and just always using fh.

2024-02-22 12:49:14

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu, 22 Feb 2024 at 12:01, Jan Kara <[email protected]> wrote:

> I think for "unique inode identifier" we don't even have to come up with
> new APIs. The file handle + fsid pair is an established way to do this,

Why not uuid?

fsid seems to be just a degraded uuid. We can do better with statx
and/or statmount.

> fanotify successfully uses this as object identifier and Amir did quite
> some work for this to be usable for vast majority of filesystems (including

Vast majority != all. Also even uuid is just a statistically unique
identifier, while st_dev was guaranteed to be unique (but not
persistent, like uuid).

If we are going to start fixing userspace, then we better make sure to
use the right interfaces, that won't have issues in the future.

Thanks,
Miklos

2024-02-22 13:11:11

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu, 22 Feb 2024 at 12:55, Kent Overstreet <[email protected]> wrote:
>
> On Thu, Feb 22, 2024 at 12:44:17PM +0100, Jan Kara wrote:
> > On Thu 22-02-24 06:27:14, Kent Overstreet wrote:

> > > My intent is to make a real effort towards getting better interfaces
> > > going, prod those maintainers, _then_ look at adding those hacks (that
> > > will necessarily be short term solutions since 64 bits is already
> > > looking cramped).
> >
> > OK, fine by me :) So one thing is still not quite clear to me - how do you
> > expect the INO_NOT_UNIQUE flag to be used by these apps? Do you expect them
> > to use st_dev + st_ino by default and fall back to fsid + fhandle only when
> > INO_NOT_UNIQUE is set?
>
> Shouldn't matter. If they care about performance and they're in some
> strange situation where the syscal overhead matters,

If it's expensive, then just make the overhead smaller (by adding fh
and uuid to statx(2), for example).

Using st_ino is also racy in some filesystems, due to the fact that
the ino can be reused. If userspace is converted, it should be
converted properly, there's just no excuse to add conditional code
like that, which makes things more complex and less reliable.

Thanks,
Miklos

2024-02-22 16:08:53

by Jan Kara

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu 22-02-24 13:48:45, Miklos Szeredi wrote:
> On Thu, 22 Feb 2024 at 12:01, Jan Kara <[email protected]> wrote:
>
> > I think for "unique inode identifier" we don't even have to come up with
> > new APIs. The file handle + fsid pair is an established way to do this,
>
> Why not uuid?
>
> fsid seems to be just a degraded uuid. We can do better with statx
> and/or statmount.

fanotify uses fsid because we have standard interface for querying fsid
(statfs(2)) and because not all filesystems (in particular virtual ones)
bother with uuid. At least the first thing is being changed now.

> > fanotify successfully uses this as object identifier and Amir did quite
> > some work for this to be usable for vast majority of filesystems (including
>
> Vast majority != all.

True. If we are going to use this scheme more widely, we need to have a
look whether the remaining cases need fixing or we can just ignore them.
They were not very interesting for fanotify so we moved on.

> Also even uuid is just a statistically unique
> identifier, while st_dev was guaranteed to be unique (but not
> persistent, like uuid).

Well, everything is just statistically true in this world :) If you have
conflicting uuids, you are likely to see also other problems so I would not
be too concerned about that.

> If we are going to start fixing userspace, then we better make sure to
> use the right interfaces, that won't have issues in the future.

I agree we should give this a good thought which identification of a
filesystem is the best.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-22 19:25:41

by Josef Bacik

[permalink] [raw]
Subject: Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu, Feb 22, 2024 at 10:14:20AM +0100, Miklos Szeredi wrote:
> On Wed, 21 Feb 2024 at 22:08, Josef Bacik <[email protected]> wrote:
> >
> > On Wed, Feb 21, 2024 at 04:06:34PM +0100, Miklos Szeredi wrote:
> > > On Wed, 21 Feb 2024 at 01:51, Kent Overstreet <[email protected]> wrote:
> > > >
> > > > Recently we had a pretty long discussion on statx extensions, which
> > > > eventually got a bit offtopic but nevertheless hashed out all the major
> > > > issues.
> > > >
> > > > To summarize:
> > > > - guaranteeing inode number uniqueness is becoming increasingly
> > > > infeasible, we need a bit to tell userspace "inode number is not
> > > > unique, use filehandle instead"
> > >
> > > This is a tough one. POSIX says "The st_ino and st_dev fields taken
> > > together uniquely identify the file within the system."
> > >
> >
> > Which is what btrfs has done forever, and we've gotten yelled at forever for
> > doing it. We have a compromise and a way forward, but it's not a widely held
> > view that changing st_dev to give uniqueness is an acceptable solution. It may
> > have been for overlayfs because you guys are already doing something special,
> > but it's not an option that is afforded the rest of us.
>
> Overlayfs tries hard not to use st_dev to give uniqueness and instead
> partitions the 64bit st_ino space within the same st_dev. There are
> various fallback cases, some involve switching st_dev and some using
> non-persistent st_ino.
>
> What overlayfs does may or may not be applicable to btrfs/bcachefs,
> but that's not my point. My point is that adding a flag to statx does
> not solve anything. You can't just say that from now on btrfs
> doesn't have use unique st_ino/st_dev because we've just indicated
> that in statx and everything is fine. That will trigger the
> no-regressions rule and then it's game over. At least I would expect
> that to happen.

Right, nobody is arguing that. Our plan is to

1) Introduce some sort of statx mechanism to expose this information.
2) Introduce an incompat fs feature flag to give unique inode numbers for people
that want them, and there stop doing the st_dev thing we currently do.
3) Continue doing the st_dev thing we currently do for the non-feature flag
case.
4) Wait 50 years and then maybe stop doing the st_dev thing once people have
adopted whatever statx flag has been introduced. God willing I will have
been hit by a bus well before then.

Nobody is arguing about reverting existing behavior, we're stuck with it.

We're trying to define how we'd like it to look going forward, and start working
towards that. Thanks,

Josef

2024-02-26 08:15:23

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu, 22 Feb 2024 at 16:48, Josef Bacik <[email protected]> wrote:

> Right, nobody is arguing that. Our plan is to
>
> 1) Introduce some sort of statx mechanism to expose this information.
> 2) Introduce an incompat fs feature flag to give unique inode numbers for people
> that want them, and there stop doing the st_dev thing we currently do.

I don't get it. What does the filesystem (the actual bits on disk)
have anything to do with how st_dev is exposed to userspace
applications?

This is not a filesystem feature, this is an interface feature. And I
even doubt that salvaging st_dev is worth it. Userspace should just
be converted to use something else. In other words st_ino *and*
st_dev are legacy and we need to find superior alternatives.

Seems like there's an agreement about file handle being able to replace st_ino.

I'm not quite sure fsid or uuid can replace st_dev, but that's up for
discussion.

Thanks,
Miklos

2024-02-26 08:28:11

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [Lsf-pc] [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more

On Thu, 22 Feb 2024 at 17:08, Jan Kara <[email protected]> wrote:

> > If we are going to start fixing userspace, then we better make sure to
> > use the right interfaces, that won't have issues in the future.
>
> I agree we should give this a good thought which identification of a
> filesystem is the best.

To find mount boundaries statx.stx_mnt_id (especially with
STATX_MNT_ID_UNIQUE) is perfect.

By supplying stx_mnt_id to statmount(2) it's possible to get the
device number associated with that filesystem (statmount.sb_dev_*). I
think it's what Josef wants btrfs to return as st_dev.

And statx could return that in stx_dev_*, with an interface feature
flag, same as we've done with stx_mnt_id. I.e. STATX_DEV_NOHACK would
force the vfs to replace anything the filesystem put in kstat.dev with
sb->s_dev.

Thanks,
Miklos