On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> The i_version field in the kernel has had different semantics over
> the decades, but we're now proposing to expose it to userland via
> statx. This means that we need a clear, consistent definition of
> what it means and when it should change.
>
> Update the comments in iversion.h to describe how a conformant
> i_version implementation is expected to behave. This definition
> suits the current users of i_version (NFSv4 and IMA), but is
> loose enough to allow for a wide range of possible implementations.
>
> Cc: Colin Walters <[email protected]>
> Cc: NeilBrown <[email protected]>
> Cc: Trond Myklebust <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Link: https://lore.kernel.org/linux-xfs/[email protected]/#t
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> include/linux/iversion.h | 23 +++++++++++++++++++++--
> 1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 3bfebde5a1a6..45e93e1b4edc 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -9,8 +9,19 @@
> * ---------------------------
> * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> - * appear different to observers if there was a change to the inode's data or
> - * metadata since it was last queried.
> + * appear different to observers if there was an explicit change to the inode's
> + * data or metadata since it was last queried.
> + *
> + * An explicit change is one that would ordinarily result in a change to the
> + * inode status change time (aka ctime). The version must appear to change, even
> + * if the ctime does not (since the whole point is to avoid missing updates due
> + * to timestamp granularity). If POSIX mandates that the ctime must change due
> + * to an operation, then the i_version counter must be incremented as well.
> + *
> + * A conformant implementation is allowed to increment the counter in other
> + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> + * whether caches are up to date. Spurious increments can cause false cache
> + * invalidations.
"not optimal", but never-the-less allowed - that's "unspecified
behaviour" if I've ever seen it. How is userspace supposed to
know/deal with this?
Indeed, this loophole clause doesn't exist in the man pages that
define what statx.stx_ino_version means. The man pages explicitly
define that stx_ino_version only ever changes when stx_ctime
changes.
IOWs, the behaviour userspace developers are going to expect *does
not include* stx_ino_version changing it more often than ctime is
changed. Hence a kernel iversion implementation that bumps the
counter more often than ctime changes *is not conformant with the
statx version counter specification*. IOWs, we can't export such
behaviour to userspace *ever* - it is a non-conformant
implementation.
Hence I think anything that bumps iversion outside the bounds of the
statx definition should be declared as such:
"Non-conformant iversion implementations:
- MUST NOT be exported by statx() to userspace
- MUST be -tolerated- by kernel internal applications that
use iversion for their own purposes."
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote:
> On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> > The i_version field in the kernel has had different semantics over
> > the decades, but we're now proposing to expose it to userland via
> > statx. This means that we need a clear, consistent definition of
> > what it means and when it should change.
> >
> > Update the comments in iversion.h to describe how a conformant
> > i_version implementation is expected to behave. This definition
> > suits the current users of i_version (NFSv4 and IMA), but is
> > loose enough to allow for a wide range of possible implementations.
> >
> > Cc: Colin Walters <[email protected]>
> > Cc: NeilBrown <[email protected]>
> > Cc: Trond Myklebust <[email protected]>
> > Cc: Dave Chinner <[email protected]>
> > Link: https://lore.kernel.org/linux-xfs/[email protected]/#t
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > include/linux/iversion.h | 23 +++++++++++++++++++++--
> > 1 file changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 3bfebde5a1a6..45e93e1b4edc 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -9,8 +9,19 @@
> > * ---------------------------
> > * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > - * appear different to observers if there was a change to the inode's data or
> > - * metadata since it was last queried.
> > + * appear different to observers if there was an explicit change to the inode's
> > + * data or metadata since it was last queried.
> > + *
> > + * An explicit change is one that would ordinarily result in a change to the
> > + * inode status change time (aka ctime). The version must appear to change, even
> > + * if the ctime does not (since the whole point is to avoid missing updates due
> > + * to timestamp granularity). If POSIX mandates that the ctime must change due
> > + * to an operation, then the i_version counter must be incremented as well.
> > + *
> > + * A conformant implementation is allowed to increment the counter in other
> > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> > + * whether caches are up to date. Spurious increments can cause false cache
> > + * invalidations.
>
> "not optimal", but never-the-less allowed - that's "unspecified
> behaviour" if I've ever seen it. How is userspace supposed to
> know/deal with this?
>
> Indeed, this loophole clause doesn't exist in the man pages that
> define what statx.stx_ino_version means. The man pages explicitly
> define that stx_ino_version only ever changes when stx_ctime
> changes.
>
We can fix the manpage to make this more clear.
> IOWs, the behaviour userspace developers are going to expect *does
> not include* stx_ino_version changing it more often than ctime is
> changed. Hence a kernel iversion implementation that bumps the
> counter more often than ctime changes *is not conformant with the
> statx version counter specification*. IOWs, we can't export such
> behaviour to userspace *ever* - it is a non-conformant
> implementation.
>
Nonsense. The statx version counter specification is *whatever we decide
to make it*. If we define it to allow for spurious version bumps, then
these implementations would be conformant.
Given that you can't tell what or how much changed in the inode whenever
the value changes, allowing it to be bumped on non-observable changes is
ok and the counter is still useful. When you see it change you need to
go stat/read/getxattr etc, to see what actually happened anyway.
Most applications won't be interested in every possible explicit change
that can happen to an inode. It's likely these applications would check
the parts of the inode they're interested in, and then go back to
waiting for the next bump if the change wasn't significant to them.
> Hence I think anything that bumps iversion outside the bounds of the
> statx definition should be declared as such:
>
> "Non-conformant iversion implementations:
> - MUST NOT be exported by statx() to userspace
> - MUST be -tolerated- by kernel internal applications that
> use iversion for their own purposes."
>
I think this is more strict than is needed. An implementation that bumps
this value more often than is necessary is still useful. It's not
_ideal_, but it still meets the needs of NFSv4, IMA and other potential
users of it. After all, this is basically the definition of i_version
today and it's still useful, even if atime update i_version bumps are
currently harmful for performance.
--
Jeff Layton <[email protected]>
On Mon, 29 Aug 2022, Jeff Layton wrote:
> On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote:
> > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> > > The i_version field in the kernel has had different semantics over
> > > the decades, but we're now proposing to expose it to userland via
> > > statx. This means that we need a clear, consistent definition of
> > > what it means and when it should change.
> > >
> > > Update the comments in iversion.h to describe how a conformant
> > > i_version implementation is expected to behave. This definition
> > > suits the current users of i_version (NFSv4 and IMA), but is
> > > loose enough to allow for a wide range of possible implementations.
> > >
> > > Cc: Colin Walters <[email protected]>
> > > Cc: NeilBrown <[email protected]>
> > > Cc: Trond Myklebust <[email protected]>
> > > Cc: Dave Chinner <[email protected]>
> > > Link: https://lore.kernel.org/linux-xfs/[email protected]/#t
> > > Signed-off-by: Jeff Layton <[email protected]>
> > > ---
> > > include/linux/iversion.h | 23 +++++++++++++++++++++--
> > > 1 file changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 3bfebde5a1a6..45e93e1b4edc 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -9,8 +9,19 @@
> > > * ---------------------------
> > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > - * appear different to observers if there was a change to the inode's data or
> > > - * metadata since it was last queried.
> > > + * appear different to observers if there was an explicit change to the inode's
> > > + * data or metadata since it was last queried.
> > > + *
> > > + * An explicit change is one that would ordinarily result in a change to the
> > > + * inode status change time (aka ctime). The version must appear to change, even
> > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > + * to timestamp granularity). If POSIX mandates that the ctime must change due
> > > + * to an operation, then the i_version counter must be incremented as well.
> > > + *
> > > + * A conformant implementation is allowed to increment the counter in other
> > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> > > + * whether caches are up to date. Spurious increments can cause false cache
> > > + * invalidations.
> >
> > "not optimal", but never-the-less allowed - that's "unspecified
> > behaviour" if I've ever seen it. How is userspace supposed to
> > know/deal with this?
> >
> > Indeed, this loophole clause doesn't exist in the man pages that
> > define what statx.stx_ino_version means. The man pages explicitly
> > define that stx_ino_version only ever changes when stx_ctime
> > changes.
> >
>
> We can fix the manpage to make this more clear.
>
> > IOWs, the behaviour userspace developers are going to expect *does
> > not include* stx_ino_version changing it more often than ctime is
> > changed. Hence a kernel iversion implementation that bumps the
> > counter more often than ctime changes *is not conformant with the
> > statx version counter specification*. IOWs, we can't export such
> > behaviour to userspace *ever* - it is a non-conformant
> > implementation.
> >
>
> Nonsense. The statx version counter specification is *whatever we decide
> to make it*. If we define it to allow for spurious version bumps, then
> these implementations would be conformant.
>
> Given that you can't tell what or how much changed in the inode whenever
> the value changes, allowing it to be bumped on non-observable changes is
> ok and the counter is still useful. When you see it change you need to
> go stat/read/getxattr etc, to see what actually happened anyway.
>
> Most applications won't be interested in every possible explicit change
> that can happen to an inode. It's likely these applications would check
> the parts of the inode they're interested in, and then go back to
> waiting for the next bump if the change wasn't significant to them.
>
>
> > Hence I think anything that bumps iversion outside the bounds of the
> > statx definition should be declared as such:
> >
> > "Non-conformant iversion implementations:
> > - MUST NOT be exported by statx() to userspace
> > - MUST be -tolerated- by kernel internal applications that
> > use iversion for their own purposes."
> >
>
> I think this is more strict than is needed. An implementation that bumps
> this value more often than is necessary is still useful. It's not
> _ideal_, but it still meets the needs of NFSv4, IMA and other potential
> users of it. After all, this is basically the definition of i_version
> today and it's still useful, even if atime update i_version bumps are
> currently harmful for performance.
Why do you want to let it be OK? Who is hurt by it being "more strict
than needed"? There is an obvious cost in not being strict as an
implementation can be compliant but completely useless (increment every
nanosecond). So there needs to be a clear benefit to balance this. Who
benefits by not being strict?
Also: Your spec doesn't say it must increase, only it must be different.
So would as hash of all data and metadata be allowed (sysfs might be
able to provide that, but probably wouldn't bother).
Also: if stray updates are still conformant, can occasional repeated
values be still conformant? I would like for a high-precision ctime
timestamp to be acceptable, but as time can go backwards it is currently
not conformant (even though the xfs iversion which is less useful is
actually conformant).
NeilBrown
On Mon, Aug 29, 2022 at 06:39:04AM -0400, Jeff Layton wrote:
> On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote:
> > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> > > The i_version field in the kernel has had different semantics over
> > > the decades, but we're now proposing to expose it to userland via
> > > statx. This means that we need a clear, consistent definition of
> > > what it means and when it should change.
> > >
> > > Update the comments in iversion.h to describe how a conformant
> > > i_version implementation is expected to behave. This definition
> > > suits the current users of i_version (NFSv4 and IMA), but is
> > > loose enough to allow for a wide range of possible implementations.
> > >
> > > Cc: Colin Walters <[email protected]>
> > > Cc: NeilBrown <[email protected]>
> > > Cc: Trond Myklebust <[email protected]>
> > > Cc: Dave Chinner <[email protected]>
> > > Link: https://lore.kernel.org/linux-xfs/[email protected]/#t
> > > Signed-off-by: Jeff Layton <[email protected]>
> > > ---
> > > include/linux/iversion.h | 23 +++++++++++++++++++++--
> > > 1 file changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 3bfebde5a1a6..45e93e1b4edc 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -9,8 +9,19 @@
> > > * ---------------------------
> > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > - * appear different to observers if there was a change to the inode's data or
> > > - * metadata since it was last queried.
> > > + * appear different to observers if there was an explicit change to the inode's
> > > + * data or metadata since it was last queried.
> > > + *
> > > + * An explicit change is one that would ordinarily result in a change to the
> > > + * inode status change time (aka ctime). The version must appear to change, even
> > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > + * to timestamp granularity). If POSIX mandates that the ctime must change due
> > > + * to an operation, then the i_version counter must be incremented as well.
> > > + *
> > > + * A conformant implementation is allowed to increment the counter in other
> > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> > > + * whether caches are up to date. Spurious increments can cause false cache
> > > + * invalidations.
> >
> > "not optimal", but never-the-less allowed - that's "unspecified
> > behaviour" if I've ever seen it. How is userspace supposed to
> > know/deal with this?
> >
> > Indeed, this loophole clause doesn't exist in the man pages that
> > define what statx.stx_ino_version means. The man pages explicitly
> > define that stx_ino_version only ever changes when stx_ctime
> > changes.
> >
>
> We can fix the manpage to make this more clear.
>
> > IOWs, the behaviour userspace developers are going to expect *does
> > not include* stx_ino_version changing it more often than ctime is
> > changed. Hence a kernel iversion implementation that bumps the
> > counter more often than ctime changes *is not conformant with the
> > statx version counter specification*. IOWs, we can't export such
> > behaviour to userspace *ever* - it is a non-conformant
> > implementation.
> >
>
> Nonsense. The statx version counter specification is *whatever we decide
> to make it*.
Yes, but...
> If we define it to allow for spurious version bumps, then
> these implementations would be conformant.
... that's _not how you defined stx_ino_version to behave_!
> Given that you can't tell what or how much changed in the inode whenever
> the value changes, allowing it to be bumped on non-observable changes is
> ok and the counter is still useful. When you see it change you need to
> go stat/read/getxattr etc, to see what actually happened anyway.
IDGI. If this is acceptible, then you're forcing userspace into
"store and filter" implementations as the only viable method of
using the change notification usefully.
That means atime is just another attribute in the "store and
filter" algorithm, so if this is how we define stx_ino_version
behaviour, why carve out an explicit exception for atime?
> Most applications won't be interested in every possible explicit change
> that can happen to an inode. It's likely these applications would check
> the parts of the inode they're interested in, and then go back to
> waiting for the next bump if the change wasn't significant to them.
Yes, that is exactly my point.
You make the argument that we must not bump iversion in certain
situations (atime) because it will cause spurious cache
invalidations, but then say it is OK to bump it in others regardless
of the fact that it will cause spurious cache invalidations. And you
justify this latter behaviour by saying it is up to the application
to avoid spurious invalidations by using "store and filter"
algorithms.
If the application has to store state and filter changes indicated
by stx_ino_version changing, then by definition *it must be capable
of filtering iversion bumps as a result of atime changes*.
The iversion exception carved out for atime requires the application
to implement "store and filter" algorithms only if it needs to care
about atime changes. The "invisible bump" exception carved out here
*requires* applications to implement "store and filter" algorithms
to filter out invisible bumps.
Hence if we combine both these behaviours, atime bumping iversion
appears to userspace exactly the same as "invisible bump occurred,
followed by access that changes atime". IOWs, userspace cannot tell the
difference between a filesystem implementation that doesn't bump
iversion on atime but has invisible bump, and a filesystem that
bumps iversion on atime updates and so it always needs to filter
atime changes if it doesn't care about them.
Hence if stx_ino_version can have invisible bumps, it makes no
difference to userspace if atime updates bump iversion or not. They
will have to filter atime if they don't care about it, and they have
to store the new stx_ino_version every time they filter out an
invisible bump that doesn't change anything their filters care
about (e.g. atime!).
At which point I have to ask: if we are expecting userspace to
filter out invisible iversion bumps because that's allowed,
conformant behaviour, then why aren't we requiring both the NFS
server and IMA applications to filter spurious iversion bumps as
well?
> > Hence I think anything that bumps iversion outside the bounds of the
> > statx definition should be declared as such:
> >
> > "Non-conformant iversion implementations:
> > - MUST NOT be exported by statx() to userspace
> > - MUST be -tolerated- by kernel internal applications that
> > use iversion for their own purposes."
> >
>
> I think this is more strict than is needed. An implementation that bumps
> this value more often than is necessary is still useful.
I never said that non-conformant implementations aren't useful. What
I said is they aren't conformant with the provided definition of
stx_ino_version, and as a result we should not allow them to be
exposed to userspace.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, 2022-08-30 at 08:58 +1000, NeilBrown wrote:
> On Mon, 29 Aug 2022, Jeff Layton wrote:
> > On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote:
> > > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> > > > The i_version field in the kernel has had different semantics over
> > > > the decades, but we're now proposing to expose it to userland via
> > > > statx. This means that we need a clear, consistent definition of
> > > > what it means and when it should change.
> > > >
> > > > Update the comments in iversion.h to describe how a conformant
> > > > i_version implementation is expected to behave. This definition
> > > > suits the current users of i_version (NFSv4 and IMA), but is
> > > > loose enough to allow for a wide range of possible implementations.
> > > >
> > > > Cc: Colin Walters <[email protected]>
> > > > Cc: NeilBrown <[email protected]>
> > > > Cc: Trond Myklebust <[email protected]>
> > > > Cc: Dave Chinner <[email protected]>
> > > > Link: https://lore.kernel.org/linux-xfs/[email protected]/#t
> > > > Signed-off-by: Jeff Layton <[email protected]>
> > > > ---
> > > > include/linux/iversion.h | 23 +++++++++++++++++++++--
> > > > 1 file changed, 21 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 3bfebde5a1a6..45e93e1b4edc 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -9,8 +9,19 @@
> > > > * ---------------------------
> > > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > > - * appear different to observers if there was a change to the inode's data or
> > > > - * metadata since it was last queried.
> > > > + * appear different to observers if there was an explicit change to the inode's
> > > > + * data or metadata since it was last queried.
> > > > + *
> > > > + * An explicit change is one that would ordinarily result in a change to the
> > > > + * inode status change time (aka ctime). The version must appear to change, even
> > > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > > + * to timestamp granularity). If POSIX mandates that the ctime must change due
> > > > + * to an operation, then the i_version counter must be incremented as well.
> > > > + *
> > > > + * A conformant implementation is allowed to increment the counter in other
> > > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> > > > + * whether caches are up to date. Spurious increments can cause false cache
> > > > + * invalidations.
> > >
> > > "not optimal", but never-the-less allowed - that's "unspecified
> > > behaviour" if I've ever seen it. How is userspace supposed to
> > > know/deal with this?
> > >
> > > Indeed, this loophole clause doesn't exist in the man pages that
> > > define what statx.stx_ino_version means. The man pages explicitly
> > > define that stx_ino_version only ever changes when stx_ctime
> > > changes.
> > >
> >
> > We can fix the manpage to make this more clear.
> >
> > > IOWs, the behaviour userspace developers are going to expect *does
> > > not include* stx_ino_version changing it more often than ctime is
> > > changed. Hence a kernel iversion implementation that bumps the
> > > counter more often than ctime changes *is not conformant with the
> > > statx version counter specification*. IOWs, we can't export such
> > > behaviour to userspace *ever* - it is a non-conformant
> > > implementation.
> > >
> >
> > Nonsense. The statx version counter specification is *whatever we decide
> > to make it*. If we define it to allow for spurious version bumps, then
> > these implementations would be conformant.
> >
> > Given that you can't tell what or how much changed in the inode whenever
> > the value changes, allowing it to be bumped on non-observable changes is
> > ok and the counter is still useful. When you see it change you need to
> > go stat/read/getxattr etc, to see what actually happened anyway.
> >
> > Most applications won't be interested in every possible explicit change
> > that can happen to an inode. It's likely these applications would check
> > the parts of the inode they're interested in, and then go back to
> > waiting for the next bump if the change wasn't significant to them.
> >
> >
> > > Hence I think anything that bumps iversion outside the bounds of the
> > > statx definition should be declared as such:
> > >
> > > "Non-conformant iversion implementations:
> > > - MUST NOT be exported by statx() to userspace
> > > - MUST be -tolerated- by kernel internal applications that
> > > use iversion for their own purposes."
> > >
> >
> > I think this is more strict than is needed. An implementation that bumps
> > this value more often than is necessary is still useful. It's not
> > _ideal_, but it still meets the needs of NFSv4, IMA and other potential
> > users of it. After all, this is basically the definition of i_version
> > today and it's still useful, even if atime update i_version bumps are
> > currently harmful for performance.
>
> Why do you want to let it be OK? Who is hurt by it being "more strict
> than needed"? There is an obvious cost in not being strict as an
> implementation can be compliant but completely useless (increment every
> nanosecond). So there needs to be a clear benefit to balance this. Who
> benefits by not being strict?
>
Other filesystems that may not be able to provide the strict semantics
required. I don't have any names to name here -- I'm just trying to
ensure that we don't paint ourselves into a corner with rules that are
more strict than we really need.
If the consensus is that we should keep the definition strict, then Ican
live with that. That might narrow the number of filesystems that can
provide this attribute though.
> Also: Your spec doesn't say it must increase, only it must be different.
> So would as hash of all data and metadata be allowed (sysfs might be
> able to provide that, but probably wouldn't bother).
>
> Also: if stray updates are still conformant, can occasional repeated
> values be still conformant? I would like for a high-precision ctime
> timestamp to be acceptable, but as time can go backwards it is currently
> not conformant (even though the xfs iversion which is less useful is
> actually conformant).
>
Yes, saying only that it must be different is intentional. What we
really want is for consumers to treat this as an opaque value for the
most part [1]. Therefore an implementation based on hashing would
conform to the spec, I'd think, as long as all of the relevant info is
part of the hash.
OTOH, hash collisions could be an issue here and I think we need to
avoid allowing duplicate values. When it comes to watching for changes
to an inode, false positives are generally ok since that should just
affect performance but not functionality.
False negatives are a different matter. They can lead to cache coherency
issues with NFS, or missing remeasurements in IMA. Userland applications
that use this could be subject to similar issues.
--
Jeff Layton <[email protected]>
[1]: which may not be reasonable in the case of write delegations on
NFSv4, since the client is expected to increment it when it has cached
local changes.
On Tue, 2022-08-30 at 11:04 +1000, Dave Chinner wrote:
> On Mon, Aug 29, 2022 at 06:39:04AM -0400, Jeff Layton wrote:
> > On Mon, 2022-08-29 at 17:56 +1000, Dave Chinner wrote:
> > > On Fri, Aug 26, 2022 at 05:46:57PM -0400, Jeff Layton wrote:
> > > > The i_version field in the kernel has had different semantics over
> > > > the decades, but we're now proposing to expose it to userland via
> > > > statx. This means that we need a clear, consistent definition of
> > > > what it means and when it should change.
> > > >
> > > > Update the comments in iversion.h to describe how a conformant
> > > > i_version implementation is expected to behave. This definition
> > > > suits the current users of i_version (NFSv4 and IMA), but is
> > > > loose enough to allow for a wide range of possible implementations.
> > > >
> > > > Cc: Colin Walters <[email protected]>
> > > > Cc: NeilBrown <[email protected]>
> > > > Cc: Trond Myklebust <[email protected]>
> > > > Cc: Dave Chinner <[email protected]>
> > > > Link: https://lore.kernel.org/linux-xfs/[email protected]/#t
> > > > Signed-off-by: Jeff Layton <[email protected]>
> > > > ---
> > > > include/linux/iversion.h | 23 +++++++++++++++++++++--
> > > > 1 file changed, 21 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 3bfebde5a1a6..45e93e1b4edc 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -9,8 +9,19 @@
> > > > * ---------------------------
> > > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > > - * appear different to observers if there was a change to the inode's data or
> > > > - * metadata since it was last queried.
> > > > + * appear different to observers if there was an explicit change to the inode's
> > > > + * data or metadata since it was last queried.
> > > > + *
> > > > + * An explicit change is one that would ordinarily result in a change to the
> > > > + * inode status change time (aka ctime). The version must appear to change, even
> > > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > > + * to timestamp granularity). If POSIX mandates that the ctime must change due
> > > > + * to an operation, then the i_version counter must be incremented as well.
> > > > + *
> > > > + * A conformant implementation is allowed to increment the counter in other
> > > > + * cases, but this is not optimal. NFSv4 and IMA both use this value to determine
> > > > + * whether caches are up to date. Spurious increments can cause false cache
> > > > + * invalidations.
> > >
> > > "not optimal", but never-the-less allowed - that's "unspecified
> > > behaviour" if I've ever seen it. How is userspace supposed to
> > > know/deal with this?
> > >
> > > Indeed, this loophole clause doesn't exist in the man pages that
> > > define what statx.stx_ino_version means. The man pages explicitly
> > > define that stx_ino_version only ever changes when stx_ctime
> > > changes.
> > >
> >
> > We can fix the manpage to make this more clear.
> >
> > > IOWs, the behaviour userspace developers are going to expect *does
> > > not include* stx_ino_version changing it more often than ctime is
> > > changed. Hence a kernel iversion implementation that bumps the
> > > counter more often than ctime changes *is not conformant with the
> > > statx version counter specification*. IOWs, we can't export such
> > > behaviour to userspace *ever* - it is a non-conformant
> > > implementation.
> > >
> >
> > Nonsense. The statx version counter specification is *whatever we decide
> > to make it*.
>
> Yes, but...
>
> > If we define it to allow for spurious version bumps, then
> > these implementations would be conformant.
>
> ... that's _not how you defined stx_ino_version to behave_!
>
I certainly didn't say that it must _only_ be incremented when the ctime
would change, only that if the ctime would change that it must be
incremented.
The weasel-words make all the difference. But, point taken, the spec
should be explicit about this. I'll plan to revise the manpage patch and
resend it.
> > Given that you can't tell what or how much changed in the inode whenever
> > the value changes, allowing it to be bumped on non-observable changes is
> > ok and the counter is still useful. When you see it change you need to
> > go stat/read/getxattr etc, to see what actually happened anyway.
>
> IDGI. If this is acceptible, then you're forcing userspace into
> "store and filter" implementations as the only viable method of
> using the change notification usefully.
>
Well, that's all it's really useful for anyway. The counter itself has
tells you nothing other than that something changed.
> That means atime is just another attribute in the "store and
> filter" algorithm, so if this is how we define stx_ino_version
> behaviour, why carve out an explicit exception for atime?
>
Because atime updates are particularly problematic. Ideally, we'd filter
out all implicit updates, but that may not be feasible in all cases.
> > Most applications won't be interested in every possible explicit change
> > that can happen to an inode. It's likely these applications would check
> > the parts of the inode they're interested in, and then go back to
> > waiting for the next bump if the change wasn't significant to them.
>
> Yes, that is exactly my point.
>
> You make the argument that we must not bump iversion in certain
> situations (atime) because it will cause spurious cache
> invalidations, but then say it is OK to bump it in others regardless
> of the fact that it will cause spurious cache invalidations. And you
> justify this latter behaviour by saying it is up to the application
> to avoid spurious invalidations by using "store and filter"
> algorithms.
>
> If the application has to store state and filter changes indicated
> by stx_ino_version changing, then by definition *it must be capable
> of filtering iversion bumps as a result of atime changes*.
>
> The iversion exception carved out for atime requires the application
> to implement "store and filter" algorithms only if it needs to care
> about atime changes. The "invisible bump" exception carved out here
> *requires* applications to implement "store and filter" algorithms
> to filter out invisible bumps.
>
> Hence if we combine both these behaviours, atime bumping iversion
> appears to userspace exactly the same as "invisible bump occurred,
> followed by access that changes atime". IOWs, userspace cannot tell the
> difference between a filesystem implementation that doesn't bump
> iversion on atime but has invisible bump, and a filesystem that
> bumps iversion on atime updates and so it always needs to filter
> atime changes if it doesn't care about them.
>
> Hence if stx_ino_version can have invisible bumps, it makes no
> difference to userspace if atime updates bump iversion or not. They
> will have to filter atime if they don't care about it, and they have
> to store the new stx_ino_version every time they filter out an
> invisible bump that doesn't change anything their filters care
> about (e.g. atime!).
>
> At which point I have to ask: if we are expecting userspace to
> filter out invisible iversion bumps because that's allowed,
> conformant behaviour, then why aren't we requiring both the NFS
> server and IMA applications to filter spurious iversion bumps as
> well?
>
I think you're reading too much into my attempt to carve out atime from
i_version updates. That is purely a pragmatic attempt to staunch the
worst of the bleeding from this problem.
atime updates are _very_ frequent, and the default relatime behavior
ensures that each NFS client reading file data or dir info after a
change to it will end up downloading it at least twice: once to read the
file itself, and then again after it drops its cache due to the atime
update from the prior read.
In an ideal world, an implementation would only bump the i_version on
explicit changes. If an implicit change only happens infrequently, or
always happens in very close succession after an explicit change (such
that readers can't race in as easily) then that's less harmful for
performance.
The i_version value itself can't tell you anything about the inode. It's
only useful for comparing to an earlier sample to see if it is has
changed. This attribute is mostly useful in the context of a store-and-
invalidate kind of system. We want to keep the invalidations to a
minimum, but eliminating spurious updates is more of an optimization
problem than one of correctness.
It would be best if we could eliminate all spurious updates, but I think
the stx_ino_version is just as useful without being that strict, and
that leaves the door open for other implementations that aren't able to
filter out all spurious updates.
> > > Hence I think anything that bumps iversion outside the bounds of the
> > > statx definition should be declared as such:
> > >
> > > "Non-conformant iversion implementations:
> > > - MUST NOT be exported by statx() to userspace
> > > - MUST be -tolerated- by kernel internal applications that
> > > use iversion for their own purposes."
> > >
> >
> > I think this is more strict than is needed. An implementation that bumps
> > this value more often than is necessary is still useful.
>
> I never said that non-conformant implementations aren't useful. What
> I said is they aren't conformant with the provided definition of
> stx_ino_version, and as a result we should not allow them to be
> exposed to userspace.
As I said above, I'll respin the manpage patch to better specify what a
conformant implementation can and can't do, and we can discuss from
there.
--
Jeff Layton <[email protected]>