On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > Yes, saying only that it must be different is intentional. What we
> > really want is for consumers to treat this as an opaque value for the
> > most part [1]. Therefore an implementation based on hashing would
> > conform to the spec, I'd think, as long as all of the relevant info is
> > part of the hash.
>
> It'd conform, but it might not be as useful as an increasing value.
>
> E.g. a client can use that to work out which of a series of reordered
> write replies is the most recent, and I seem to recall that can prevent
> unnecessary invalidations in some cases.
>
That's a good point; the linux client does this. That said, NFSv4 has a
way for the server to advertise its change attribute behavior [1]
(though nfsd hasn't implemented this yet). We don't have a good way to
do that in userland for now.
This is another place where fsinfo() would have been nice to have. I
think until we have something like that, we'd want to keep our promises
to userland to a minimum.
[1]: https://www.rfc-editor.org/rfc/rfc7862.html#section-12.2.3 . I
guess I should look at plumbing this in for IS_I_VERSION inodes...
--
Jeff Layton <[email protected]>
On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > > Yes, saying only that it must be different is intentional. What we
> > > really want is for consumers to treat this as an opaque value for the
> > > most part [1]. Therefore an implementation based on hashing would
> > > conform to the spec, I'd think, as long as all of the relevant info is
> > > part of the hash.
> >
> > It'd conform, but it might not be as useful as an increasing value.
> >
> > E.g. a client can use that to work out which of a series of reordered
> > write replies is the most recent, and I seem to recall that can prevent
> > unnecessary invalidations in some cases.
> >
>
> That's a good point; the linux client does this. That said, NFSv4 has a
> way for the server to advertise its change attribute behavior [1]
> (though nfsd hasn't implemented this yet).
It was implemented and reverted. The issue was that I thought nfsd
should mix in the ctime to prevent the change attribute going backwards
on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but Trond was
concerned about the possibility of time going backwards. See
1631087ba872 "Revert "nfsd4: support change_attr_type attribute"".
There's some mailing list discussion to that I'm not turning up right
now.
Did NFSv4 add change_attr_type because some implementations needed the
unordered case, or because they realized ordering was useful but wanted
to keep backwards compatibility? I don't know which it was.
--b.
> We don't have a good way to
> do that in userland for now.
>
> This is another place where fsinfo() would have been nice to have. I
> think until we have something like that, we'd want to keep our promises
> to userland to a minimum.
>
> [1]: https://www.rfc-editor.org/rfc/rfc7862.html#section-12.2.3 . I
> guess I should look at plumbing this in for IS_I_VERSION inodes...
>
> --
> Jeff Layton <[email protected]>
>
>
On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > > > Yes, saying only that it must be different is intentional. What
> > > > we
> > > > really want is for consumers to treat this as an opaque value
> > > > for the
> > > > most part [1]. Therefore an implementation based on hashing
> > > > would
> > > > conform to the spec, I'd think, as long as all of the relevant
> > > > info is
> > > > part of the hash.
> > >
> > > It'd conform, but it might not be as useful as an increasing
> > > value.
> > >
> > > E.g. a client can use that to work out which of a series of
> > > reordered
> > > write replies is the most recent, and I seem to recall that can
> > > prevent
> > > unnecessary invalidations in some cases.
> > >
> >
> > That's a good point; the linux client does this. That said, NFSv4
> > has a
> > way for the server to advertise its change attribute behavior [1]
> > (though nfsd hasn't implemented this yet).
>
> It was implemented and reverted. The issue was that I thought nfsd
> should mix in the ctime to prevent the change attribute going
> backwards
> on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but Trond
> was
> concerned about the possibility of time going backwards. See
> 1631087ba872 "Revert "nfsd4: support change_attr_type attribute"".
> There's some mailing list discussion to that I'm not turning up right
> now.
My main concern was that some filesystems (e.g. ext3) were failing to
provide sufficient timestamp resolution to actually label the resulting
'change attribute' as being updated monotonically. If the time stamp
doesn't change when the file data or metadata are changed, then the
client has to perform extra checks to try to figure out whether or not
its caches are up to date.
>
> Did NFSv4 add change_attr_type because some implementations needed
> the
> unordered case, or because they realized ordering was useful but
> wanted
> to keep backwards compatibility? I don't know which it was.
We implemented it because, as implied above, knowledge of whether or
not the change attribute behaves monotonically, or strictly
monotonically, enables a number of optimisations.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote:
> On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > > > > Yes, saying only that it must be different is intentional. What
> > > > > we
> > > > > really want is for consumers to treat this as an opaque value
> > > > > for the
> > > > > most part [1]. Therefore an implementation based on hashing
> > > > > would
> > > > > conform to the spec, I'd think, as long as all of the relevant
> > > > > info is
> > > > > part of the hash.
> > > >
> > > > It'd conform, but it might not be as useful as an increasing
> > > > value.
> > > >
> > > > E.g. a client can use that to work out which of a series of
> > > > reordered
> > > > write replies is the most recent, and I seem to recall that can
> > > > prevent
> > > > unnecessary invalidations in some cases.
> > > >
> > >
> > > That's a good point; the linux client does this. That said, NFSv4
> > > has a
> > > way for the server to advertise its change attribute behavior [1]
> > > (though nfsd hasn't implemented this yet).
> >
> > It was implemented and reverted. The issue was that I thought nfsd
> > should mix in the ctime to prevent the change attribute going
> > backwards
> > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but Trond
> > was
> > concerned about the possibility of time going backwards. See
> > 1631087ba872 "Revert "nfsd4: support change_attr_type attribute"".
> > There's some mailing list discussion to that I'm not turning up right
> > now.
https://lore.kernel.org/linux-nfs/[email protected]/
is what I was thinking of but it isn't actually that interesting.
> My main concern was that some filesystems (e.g. ext3) were failing to
> provide sufficient timestamp resolution to actually label the resulting
> 'change attribute' as being updated monotonically. If the time stamp
> doesn't change when the file data or metadata are changed, then the
> client has to perform extra checks to try to figure out whether or not
> its caches are up to date.
That's a different issue from the one you were raising in that
discussion.
> > Did NFSv4 add change_attr_type because some implementations needed
> > the
> > unordered case, or because they realized ordering was useful but
> > wanted
> > to keep backwards compatibility? I don't know which it was.
>
> We implemented it because, as implied above, knowledge of whether or
> not the change attribute behaves monotonically, or strictly
> monotonically, enables a number of optimisations.
Of course, but my question was about the value of the old behavior, not
about the value of the monotonic behavior.
Put differently, if we could redesign the protocol from scratch would we
actually have included the option of non-monotonic behavior?
--b.
On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote:
> > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > > > > > Yes, saying only that it must be different is intentional.
> > > > > > What
> > > > > > we
> > > > > > really want is for consumers to treat this as an opaque
> > > > > > value
> > > > > > for the
> > > > > > most part [1]. Therefore an implementation based on hashing
> > > > > > would
> > > > > > conform to the spec, I'd think, as long as all of the
> > > > > > relevant
> > > > > > info is
> > > > > > part of the hash.
> > > > >
> > > > > It'd conform, but it might not be as useful as an increasing
> > > > > value.
> > > > >
> > > > > E.g. a client can use that to work out which of a series of
> > > > > reordered
> > > > > write replies is the most recent, and I seem to recall that
> > > > > can
> > > > > prevent
> > > > > unnecessary invalidations in some cases.
> > > > >
> > > >
> > > > That's a good point; the linux client does this. That said,
> > > > NFSv4
> > > > has a
> > > > way for the server to advertise its change attribute behavior
> > > > [1]
> > > > (though nfsd hasn't implemented this yet).
> > >
> > > It was implemented and reverted. The issue was that I thought
> > > nfsd
> > > should mix in the ctime to prevent the change attribute going
> > > backwards
> > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but
> > > Trond
> > > was
> > > concerned about the possibility of time going backwards. See
> > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > attribute"".
> > > There's some mailing list discussion to that I'm not turning up
> > > right
> > > now.
>
> https://lore.kernel.org/linux-nfs/[email protected]/
> is what I was thinking of but it isn't actually that interesting.
>
> > My main concern was that some filesystems (e.g. ext3) were failing
> > to
> > provide sufficient timestamp resolution to actually label the
> > resulting
> > 'change attribute' as being updated monotonically. If the time
> > stamp
> > doesn't change when the file data or metadata are changed, then the
> > client has to perform extra checks to try to figure out whether or
> > not
> > its caches are up to date.
>
> That's a different issue from the one you were raising in that
> discussion.
>
> > > Did NFSv4 add change_attr_type because some implementations
> > > needed
> > > the
> > > unordered case, or because they realized ordering was useful but
> > > wanted
> > > to keep backwards compatibility? I don't know which it was.
> >
> > We implemented it because, as implied above, knowledge of whether
> > or
> > not the change attribute behaves monotonically, or strictly
> > monotonically, enables a number of optimisations.
>
> Of course, but my question was about the value of the old behavior,
> not
> about the value of the monotonic behavior.
>
> Put differently, if we could redesign the protocol from scratch would
> we
> actually have included the option of non-monotonic behavior?
>
If we could design the filesystems from scratch, we probably would not.
The protocol ended up being as it is because people were trying to make
it as easy to implement as possible.
So if we could design the filesystem from scratch, we would have
probably designed it along the lines of what AFS does.
i.e. each explicit change is accompanied by a single bump of the change
attribute, so that the clients can not only decide the order of the
resulting changes, but also if they have missed a change (that might
have been made by a different client).
However that would be a requirement that is likely to be very specific
to distributed caches (and hence distributed filesystems). I doubt
there are many user space applications that would need that high
precision. Maybe MPI, but that's the only candidate I can think of for
now?
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote:
> On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote:
> > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton wrote:
> > > > > > > Yes, saying only that it must be different is intentional.
> > > > > > > What
> > > > > > > we
> > > > > > > really want is for consumers to treat this as an opaque
> > > > > > > value
> > > > > > > for the
> > > > > > > most part [1]. Therefore an implementation based on hashing
> > > > > > > would
> > > > > > > conform to the spec, I'd think, as long as all of the
> > > > > > > relevant
> > > > > > > info is
> > > > > > > part of the hash.
> > > > > >
> > > > > > It'd conform, but it might not be as useful as an increasing
> > > > > > value.
> > > > > >
> > > > > > E.g. a client can use that to work out which of a series of
> > > > > > reordered
> > > > > > write replies is the most recent, and I seem to recall that
> > > > > > can
> > > > > > prevent
> > > > > > unnecessary invalidations in some cases.
> > > > > >
> > > > >
> > > > > That's a good point; the linux client does this. That said,
> > > > > NFSv4
> > > > > has a
> > > > > way for the server to advertise its change attribute behavior
> > > > > [1]
> > > > > (though nfsd hasn't implemented this yet).
> > > >
> > > > It was implemented and reverted.? The issue was that I thought
> > > > nfsd
> > > > should mix in the ctime to prevent the change attribute going
> > > > backwards
> > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but
> > > > Trond
> > > > was
> > > > concerned about the possibility of time going backwards.? See
> > > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > > attribute"".
> > > > There's some mailing list discussion to that I'm not turning up
> > > > right
> > > > now.
> >
> > https://lore.kernel.org/linux-nfs/[email protected]/
> > is what I was thinking of but it isn't actually that interesting.
> >
> > > My main concern was that some filesystems (e.g. ext3) were failing
> > > to
> > > provide sufficient timestamp resolution to actually label the
> > > resulting
> > > 'change attribute' as being updated monotonically. If the time
> > > stamp
> > > doesn't change when the file data or metadata are changed, then the
> > > client has to perform extra checks to try to figure out whether or
> > > not
> > > its caches are up to date.
> >
> > That's a different issue from the one you were raising in that
> > discussion.
> >
> > > > Did NFSv4 add change_attr_type because some implementations
> > > > needed
> > > > the
> > > > unordered case, or because they realized ordering was useful but
> > > > wanted
> > > > to keep backwards compatibility?? I don't know which it was.
> > >
> > > We implemented it because, as implied above, knowledge of whether
> > > or
> > > not the change attribute behaves monotonically, or strictly
> > > monotonically, enables a number of optimisations.
> >
> > Of course, but my question was about the value of the old behavior,
> > not
> > about the value of the monotonic behavior.
> >
> > Put differently, if we could redesign the protocol from scratch would
> > we
> > actually have included the option of non-monotonic behavior?
> >
>
> If we could design the filesystems from scratch, we probably would not.
> The protocol ended up being as it is because people were trying to make
> it as easy to implement as possible.
>
> So if we could design the filesystem from scratch, we would have
> probably designed it along the lines of what AFS does.
> i.e. each explicit change is accompanied by a single bump of the change
> attribute, so that the clients can not only decide the order of the
> resulting changes, but also if they have missed a change (that might
> have been made by a different client).
>
> However that would be a requirement that is likely to be very specific
> to distributed caches (and hence distributed filesystems). I doubt
> there are many user space applications that would need that high
> precision. Maybe MPI, but that's the only candidate I can think of for
> now?
>
The fact that NFS kept this more loosely-defined is what allowed us to
elide some of the i_version bumps and regain a fair bit of performance
for local filesystems [1]. If the change attribute had been more
strictly defined like you mention, then that particular optimization
would not have been possible.
This sort of thing is why I'm a fan of not defining this any more
strictly than we require. Later on, maybe we'll come up with a way for
filesystems to advertise that they can offer stronger guarantees.
--
Jeff Layton <[email protected]>
[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02a9ad1f15d
On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote:
> On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote:
> > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote:
> > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton
> > > > > > > wrote:
> > > > > > > > Yes, saying only that it must be different is
> > > > > > > > intentional.
> > > > > > > > What
> > > > > > > > we
> > > > > > > > really want is for consumers to treat this as an opaque
> > > > > > > > value
> > > > > > > > for the
> > > > > > > > most part [1]. Therefore an implementation based on
> > > > > > > > hashing
> > > > > > > > would
> > > > > > > > conform to the spec, I'd think, as long as all of the
> > > > > > > > relevant
> > > > > > > > info is
> > > > > > > > part of the hash.
> > > > > > >
> > > > > > > It'd conform, but it might not be as useful as an
> > > > > > > increasing
> > > > > > > value.
> > > > > > >
> > > > > > > E.g. a client can use that to work out which of a series
> > > > > > > of
> > > > > > > reordered
> > > > > > > write replies is the most recent, and I seem to recall
> > > > > > > that
> > > > > > > can
> > > > > > > prevent
> > > > > > > unnecessary invalidations in some cases.
> > > > > > >
> > > > > >
> > > > > > That's a good point; the linux client does this. That said,
> > > > > > NFSv4
> > > > > > has a
> > > > > > way for the server to advertise its change attribute
> > > > > > behavior
> > > > > > [1]
> > > > > > (though nfsd hasn't implemented this yet).
> > > > >
> > > > > It was implemented and reverted. The issue was that I
> > > > > thought
> > > > > nfsd
> > > > > should mix in the ctime to prevent the change attribute going
> > > > > backwards
> > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but
> > > > > Trond
> > > > > was
> > > > > concerned about the possibility of time going backwards. See
> > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > > > attribute"".
> > > > > There's some mailing list discussion to that I'm not turning
> > > > > up
> > > > > right
> > > > > now.
> > >
> > > https://lore.kernel.org/linux-nfs/[email protected]/
> > > is what I was thinking of but it isn't actually that interesting.
> > >
> > > > My main concern was that some filesystems (e.g. ext3) were
> > > > failing
> > > > to
> > > > provide sufficient timestamp resolution to actually label the
> > > > resulting
> > > > 'change attribute' as being updated monotonically. If the time
> > > > stamp
> > > > doesn't change when the file data or metadata are changed, then
> > > > the
> > > > client has to perform extra checks to try to figure out whether
> > > > or
> > > > not
> > > > its caches are up to date.
> > >
> > > That's a different issue from the one you were raising in that
> > > discussion.
> > >
> > > > > Did NFSv4 add change_attr_type because some implementations
> > > > > needed
> > > > > the
> > > > > unordered case, or because they realized ordering was useful
> > > > > but
> > > > > wanted
> > > > > to keep backwards compatibility? I don't know which it was.
> > > >
> > > > We implemented it because, as implied above, knowledge of
> > > > whether
> > > > or
> > > > not the change attribute behaves monotonically, or strictly
> > > > monotonically, enables a number of optimisations.
> > >
> > > Of course, but my question was about the value of the old
> > > behavior,
> > > not
> > > about the value of the monotonic behavior.
> > >
> > > Put differently, if we could redesign the protocol from scratch
> > > would
> > > we
> > > actually have included the option of non-monotonic behavior?
> > >
> >
> > If we could design the filesystems from scratch, we probably would
> > not.
> > The protocol ended up being as it is because people were trying to
> > make
> > it as easy to implement as possible.
> >
> > So if we could design the filesystem from scratch, we would have
> > probably designed it along the lines of what AFS does.
> > i.e. each explicit change is accompanied by a single bump of the
> > change
> > attribute, so that the clients can not only decide the order of the
> > resulting changes, but also if they have missed a change (that
> > might
> > have been made by a different client).
> >
> > However that would be a requirement that is likely to be very
> > specific
> > to distributed caches (and hence distributed filesystems). I doubt
> > there are many user space applications that would need that high
> > precision. Maybe MPI, but that's the only candidate I can think of
> > for
> > now?
> >
>
> The fact that NFS kept this more loosely-defined is what allowed us
> to
> elide some of the i_version bumps and regain a fair bit of
> performance
> for local filesystems [1]. If the change attribute had been more
> strictly defined like you mention, then that particular optimization
> would not have been possible.
>
> This sort of thing is why I'm a fan of not defining this any more
> strictly than we require. Later on, maybe we'll come up with a way
> for
> filesystems to advertise that they can offer stronger guarantees.
What 'eliding of the bumps' are we talking about here? If it results in
unreliable behaviour, then I propose we just drop the whole concept and
go back to using the ctime. The change attribute is only useful if it
results in a reliable mechanism for detecting changes. Once you "elide
away" the word "reliable", then it has no value beyond what ctime
already does.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Tue, 2022-08-30 at 17:47 +0000, Trond Myklebust wrote:
> On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote:
> > On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote:
> > > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> > > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust wrote:
> > > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton wrote:
> > > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields wrote:
> > > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton
> > > > > > > > wrote:
> > > > > > > > > Yes, saying only that it must be different is
> > > > > > > > > intentional.
> > > > > > > > > What
> > > > > > > > > we
> > > > > > > > > really want is for consumers to treat this as an opaque
> > > > > > > > > value
> > > > > > > > > for the
> > > > > > > > > most part [1]. Therefore an implementation based on
> > > > > > > > > hashing
> > > > > > > > > would
> > > > > > > > > conform to the spec, I'd think, as long as all of the
> > > > > > > > > relevant
> > > > > > > > > info is
> > > > > > > > > part of the hash.
> > > > > > > >
> > > > > > > > It'd conform, but it might not be as useful as an
> > > > > > > > increasing
> > > > > > > > value.
> > > > > > > >
> > > > > > > > E.g. a client can use that to work out which of a series
> > > > > > > > of
> > > > > > > > reordered
> > > > > > > > write replies is the most recent, and I seem to recall
> > > > > > > > that
> > > > > > > > can
> > > > > > > > prevent
> > > > > > > > unnecessary invalidations in some cases.
> > > > > > > >
> > > > > > >
> > > > > > > That's a good point; the linux client does this. That said,
> > > > > > > NFSv4
> > > > > > > has a
> > > > > > > way for the server to advertise its change attribute
> > > > > > > behavior
> > > > > > > [1]
> > > > > > > (though nfsd hasn't implemented this yet).
> > > > > >
> > > > > > It was implemented and reverted.? The issue was that I
> > > > > > thought
> > > > > > nfsd
> > > > > > should mix in the ctime to prevent the change attribute going
> > > > > > backwards
> > > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()), but
> > > > > > Trond
> > > > > > was
> > > > > > concerned about the possibility of time going backwards.? See
> > > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > > > > attribute"".
> > > > > > There's some mailing list discussion to that I'm not turning
> > > > > > up
> > > > > > right
> > > > > > now.
> > > >
> > > > https://lore.kernel.org/linux-nfs/[email protected]/
> > > > is what I was thinking of but it isn't actually that interesting.
> > > >
> > > > > My main concern was that some filesystems (e.g. ext3) were
> > > > > failing
> > > > > to
> > > > > provide sufficient timestamp resolution to actually label the
> > > > > resulting
> > > > > 'change attribute' as being updated monotonically. If the time
> > > > > stamp
> > > > > doesn't change when the file data or metadata are changed, then
> > > > > the
> > > > > client has to perform extra checks to try to figure out whether
> > > > > or
> > > > > not
> > > > > its caches are up to date.
> > > >
> > > > That's a different issue from the one you were raising in that
> > > > discussion.
> > > >
> > > > > > Did NFSv4 add change_attr_type because some implementations
> > > > > > needed
> > > > > > the
> > > > > > unordered case, or because they realized ordering was useful
> > > > > > but
> > > > > > wanted
> > > > > > to keep backwards compatibility?? I don't know which it was.
> > > > >
> > > > > We implemented it because, as implied above, knowledge of
> > > > > whether
> > > > > or
> > > > > not the change attribute behaves monotonically, or strictly
> > > > > monotonically, enables a number of optimisations.
> > > >
> > > > Of course, but my question was about the value of the old
> > > > behavior,
> > > > not
> > > > about the value of the monotonic behavior.
> > > >
> > > > Put differently, if we could redesign the protocol from scratch
> > > > would
> > > > we
> > > > actually have included the option of non-monotonic behavior?
> > > >
> > >
> > > If we could design the filesystems from scratch, we probably would
> > > not.
> > > The protocol ended up being as it is because people were trying to
> > > make
> > > it as easy to implement as possible.
> > >
> > > So if we could design the filesystem from scratch, we would have
> > > probably designed it along the lines of what AFS does.
> > > i.e. each explicit change is accompanied by a single bump of the
> > > change
> > > attribute, so that the clients can not only decide the order of the
> > > resulting changes, but also if they have missed a change (that
> > > might
> > > have been made by a different client).
> > >
> > > However that would be a requirement that is likely to be very
> > > specific
> > > to distributed caches (and hence distributed filesystems). I doubt
> > > there are many user space applications that would need that high
> > > precision. Maybe MPI, but that's the only candidate I can think of
> > > for
> > > now?
> > >
> >
> > The fact that NFS kept this more loosely-defined is what allowed us
> > to
> > elide some of the i_version bumps and regain a fair bit of
> > performance
> > for local filesystems [1]. If the change attribute had been more
> > strictly defined like you mention, then that particular optimization
> > would not have been possible.
> >
> > This sort of thing is why I'm a fan of not defining this any more
> > strictly than we require. Later on, maybe we'll come up with a way
> > for
> > filesystems to advertise that they can offer stronger guarantees.
>
> What 'eliding of the bumps' are we talking about here? If it results in
> unreliable behaviour, then I propose we just drop the whole concept and
> go back to using the ctime. The change attribute is only useful if it
> results in a reliable mechanism for detecting changes. Once you "elide
> away" the word "reliable", then it has no value beyond what ctime
> already does.
>
I'm talking about the scheme to optimize away i_version updates when the
current one has never been queried:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02a9ad1f15d
There's nothing unreliable about it.
--
Jeff Layton <[email protected]>
On Tue, 2022-08-30 at 13:53 -0400, Jeff Layton wrote:
> On Tue, 2022-08-30 at 17:47 +0000, Trond Myklebust wrote:
> > On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote:
> > > On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote:
> > > > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust
> > > > > wrote:
> > > > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton
> > > > > > > wrote:
> > > > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields
> > > > > > > > wrote:
> > > > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton
> > > > > > > > > wrote:
> > > > > > > > > > Yes, saying only that it must be different is
> > > > > > > > > > intentional.
> > > > > > > > > > What
> > > > > > > > > > we
> > > > > > > > > > really want is for consumers to treat this as an
> > > > > > > > > > opaque
> > > > > > > > > > value
> > > > > > > > > > for the
> > > > > > > > > > most part [1]. Therefore an implementation based on
> > > > > > > > > > hashing
> > > > > > > > > > would
> > > > > > > > > > conform to the spec, I'd think, as long as all of
> > > > > > > > > > the
> > > > > > > > > > relevant
> > > > > > > > > > info is
> > > > > > > > > > part of the hash.
> > > > > > > > >
> > > > > > > > > It'd conform, but it might not be as useful as an
> > > > > > > > > increasing
> > > > > > > > > value.
> > > > > > > > >
> > > > > > > > > E.g. a client can use that to work out which of a
> > > > > > > > > series
> > > > > > > > > of
> > > > > > > > > reordered
> > > > > > > > > write replies is the most recent, and I seem to
> > > > > > > > > recall
> > > > > > > > > that
> > > > > > > > > can
> > > > > > > > > prevent
> > > > > > > > > unnecessary invalidations in some cases.
> > > > > > > > >
> > > > > > > >
> > > > > > > > That's a good point; the linux client does this. That
> > > > > > > > said,
> > > > > > > > NFSv4
> > > > > > > > has a
> > > > > > > > way for the server to advertise its change attribute
> > > > > > > > behavior
> > > > > > > > [1]
> > > > > > > > (though nfsd hasn't implemented this yet).
> > > > > > >
> > > > > > > It was implemented and reverted. The issue was that I
> > > > > > > thought
> > > > > > > nfsd
> > > > > > > should mix in the ctime to prevent the change attribute
> > > > > > > going
> > > > > > > backwards
> > > > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()),
> > > > > > > but
> > > > > > > Trond
> > > > > > > was
> > > > > > > concerned about the possibility of time going backwards.
> > > > > > > See
> > > > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > > > > > attribute"".
> > > > > > > There's some mailing list discussion to that I'm not
> > > > > > > turning
> > > > > > > up
> > > > > > > right
> > > > > > > now.
> > > > >
> > > > > https://lore.kernel.org/linux-nfs/[email protected]/
> > > > > is what I was thinking of but it isn't actually that
> > > > > interesting.
> > > > >
> > > > > > My main concern was that some filesystems (e.g. ext3) were
> > > > > > failing
> > > > > > to
> > > > > > provide sufficient timestamp resolution to actually label
> > > > > > the
> > > > > > resulting
> > > > > > 'change attribute' as being updated monotonically. If the
> > > > > > time
> > > > > > stamp
> > > > > > doesn't change when the file data or metadata are changed,
> > > > > > then
> > > > > > the
> > > > > > client has to perform extra checks to try to figure out
> > > > > > whether
> > > > > > or
> > > > > > not
> > > > > > its caches are up to date.
> > > > >
> > > > > That's a different issue from the one you were raising in
> > > > > that
> > > > > discussion.
> > > > >
> > > > > > > Did NFSv4 add change_attr_type because some
> > > > > > > implementations
> > > > > > > needed
> > > > > > > the
> > > > > > > unordered case, or because they realized ordering was
> > > > > > > useful
> > > > > > > but
> > > > > > > wanted
> > > > > > > to keep backwards compatibility? I don't know which it
> > > > > > > was.
> > > > > >
> > > > > > We implemented it because, as implied above, knowledge of
> > > > > > whether
> > > > > > or
> > > > > > not the change attribute behaves monotonically, or strictly
> > > > > > monotonically, enables a number of optimisations.
> > > > >
> > > > > Of course, but my question was about the value of the old
> > > > > behavior,
> > > > > not
> > > > > about the value of the monotonic behavior.
> > > > >
> > > > > Put differently, if we could redesign the protocol from
> > > > > scratch
> > > > > would
> > > > > we
> > > > > actually have included the option of non-monotonic behavior?
> > > > >
> > > >
> > > > If we could design the filesystems from scratch, we probably
> > > > would
> > > > not.
> > > > The protocol ended up being as it is because people were trying
> > > > to
> > > > make
> > > > it as easy to implement as possible.
> > > >
> > > > So if we could design the filesystem from scratch, we would
> > > > have
> > > > probably designed it along the lines of what AFS does.
> > > > i.e. each explicit change is accompanied by a single bump of
> > > > the
> > > > change
> > > > attribute, so that the clients can not only decide the order of
> > > > the
> > > > resulting changes, but also if they have missed a change (that
> > > > might
> > > > have been made by a different client).
> > > >
> > > > However that would be a requirement that is likely to be very
> > > > specific
> > > > to distributed caches (and hence distributed filesystems). I
> > > > doubt
> > > > there are many user space applications that would need that
> > > > high
> > > > precision. Maybe MPI, but that's the only candidate I can think
> > > > of
> > > > for
> > > > now?
> > > >
> > >
> > > The fact that NFS kept this more loosely-defined is what allowed
> > > us
> > > to
> > > elide some of the i_version bumps and regain a fair bit of
> > > performance
> > > for local filesystems [1]. If the change attribute had been more
> > > strictly defined like you mention, then that particular
> > > optimization
> > > would not have been possible.
> > >
> > > This sort of thing is why I'm a fan of not defining this any more
> > > strictly than we require. Later on, maybe we'll come up with a
> > > way
> > > for
> > > filesystems to advertise that they can offer stronger guarantees.
> >
> > What 'eliding of the bumps' are we talking about here? If it
> > results in
> > unreliable behaviour, then I propose we just drop the whole concept
> > and
> > go back to using the ctime. The change attribute is only useful if
> > it
> > results in a reliable mechanism for detecting changes. Once you
> > "elide
> > away" the word "reliable", then it has no value beyond what ctime
> > already does.
> >
>
> I'm talking about the scheme to optimize away i_version updates when
> the
> current one has never been queried:
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02a9ad1f15d
>
> There's nothing unreliable about it.
Not really seeing why that would be incompatible with the idea of
bumping on every change. The I_VERSION_QUERIED is just a hint to tell
you that at the very least you need to sync the next metadata update
after someone peeked at the value. You could still continue to cache
updates after that, and only sync them once a O_SYNC or an fsync() call
explicitly requires you to do so.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Tue, 2022-08-30 at 18:25 +0000, Trond Myklebust wrote:
> On Tue, 2022-08-30 at 13:53 -0400, Jeff Layton wrote:
> > On Tue, 2022-08-30 at 17:47 +0000, Trond Myklebust wrote:
> > > On Tue, 2022-08-30 at 13:02 -0400, Jeff Layton wrote:
> > > > On Tue, 2022-08-30 at 15:43 +0000, Trond Myklebust wrote:
> > > > > On Tue, 2022-08-30 at 11:17 -0400, J. Bruce Fields wrote:
> > > > > > On Tue, Aug 30, 2022 at 02:58:27PM +0000, Trond Myklebust
> > > > > > wrote:
> > > > > > > On Tue, 2022-08-30 at 10:44 -0400, J. Bruce Fields wrote:
> > > > > > > > On Tue, Aug 30, 2022 at 09:50:02AM -0400, Jeff Layton
> > > > > > > > wrote:
> > > > > > > > > On Tue, 2022-08-30 at 09:24 -0400, J. Bruce Fields
> > > > > > > > > wrote:
> > > > > > > > > > On Tue, Aug 30, 2022 at 07:40:02AM -0400, Jeff Layton
> > > > > > > > > > wrote:
> > > > > > > > > > > Yes, saying only that it must be different is
> > > > > > > > > > > intentional.
> > > > > > > > > > > What
> > > > > > > > > > > we
> > > > > > > > > > > really want is for consumers to treat this as an
> > > > > > > > > > > opaque
> > > > > > > > > > > value
> > > > > > > > > > > for the
> > > > > > > > > > > most part [1]. Therefore an implementation based on
> > > > > > > > > > > hashing
> > > > > > > > > > > would
> > > > > > > > > > > conform to the spec, I'd think, as long as all of
> > > > > > > > > > > the
> > > > > > > > > > > relevant
> > > > > > > > > > > info is
> > > > > > > > > > > part of the hash.
> > > > > > > > > >
> > > > > > > > > > It'd conform, but it might not be as useful as an
> > > > > > > > > > increasing
> > > > > > > > > > value.
> > > > > > > > > >
> > > > > > > > > > E.g. a client can use that to work out which of a
> > > > > > > > > > series
> > > > > > > > > > of
> > > > > > > > > > reordered
> > > > > > > > > > write replies is the most recent, and I seem to
> > > > > > > > > > recall
> > > > > > > > > > that
> > > > > > > > > > can
> > > > > > > > > > prevent
> > > > > > > > > > unnecessary invalidations in some cases.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > That's a good point; the linux client does this. That
> > > > > > > > > said,
> > > > > > > > > NFSv4
> > > > > > > > > has a
> > > > > > > > > way for the server to advertise its change attribute
> > > > > > > > > behavior
> > > > > > > > > [1]
> > > > > > > > > (though nfsd hasn't implemented this yet).
> > > > > > > >
> > > > > > > > It was implemented and reverted.? The issue was that I
> > > > > > > > thought
> > > > > > > > nfsd
> > > > > > > > should mix in the ctime to prevent the change attribute
> > > > > > > > going
> > > > > > > > backwards
> > > > > > > > on reboot (see fs/nfsd/nfsfh.h:nfsd4_change_attribute()),
> > > > > > > > but
> > > > > > > > Trond
> > > > > > > > was
> > > > > > > > concerned about the possibility of time going backwards.?
> > > > > > > > See
> > > > > > > > 1631087ba872 "Revert "nfsd4: support change_attr_type
> > > > > > > > attribute"".
> > > > > > > > There's some mailing list discussion to that I'm not
> > > > > > > > turning
> > > > > > > > up
> > > > > > > > right
> > > > > > > > now.
> > > > > >
> > > > > > https://lore.kernel.org/linux-nfs/[email protected]/
> > > > > > is what I was thinking of but it isn't actually that
> > > > > > interesting.
> > > > > >
> > > > > > > My main concern was that some filesystems (e.g. ext3) were
> > > > > > > failing
> > > > > > > to
> > > > > > > provide sufficient timestamp resolution to actually label
> > > > > > > the
> > > > > > > resulting
> > > > > > > 'change attribute' as being updated monotonically. If the
> > > > > > > time
> > > > > > > stamp
> > > > > > > doesn't change when the file data or metadata are changed,
> > > > > > > then
> > > > > > > the
> > > > > > > client has to perform extra checks to try to figure out
> > > > > > > whether
> > > > > > > or
> > > > > > > not
> > > > > > > its caches are up to date.
> > > > > >
> > > > > > That's a different issue from the one you were raising in
> > > > > > that
> > > > > > discussion.
> > > > > >
> > > > > > > > Did NFSv4 add change_attr_type because some
> > > > > > > > implementations
> > > > > > > > needed
> > > > > > > > the
> > > > > > > > unordered case, or because they realized ordering was
> > > > > > > > useful
> > > > > > > > but
> > > > > > > > wanted
> > > > > > > > to keep backwards compatibility?? I don't know which it
> > > > > > > > was.
> > > > > > >
> > > > > > > We implemented it because, as implied above, knowledge of
> > > > > > > whether
> > > > > > > or
> > > > > > > not the change attribute behaves monotonically, or strictly
> > > > > > > monotonically, enables a number of optimisations.
> > > > > >
> > > > > > Of course, but my question was about the value of the old
> > > > > > behavior,
> > > > > > not
> > > > > > about the value of the monotonic behavior.
> > > > > >
> > > > > > Put differently, if we could redesign the protocol from
> > > > > > scratch
> > > > > > would
> > > > > > we
> > > > > > actually have included the option of non-monotonic behavior?
> > > > > >
> > > > >
> > > > > If we could design the filesystems from scratch, we probably
> > > > > would
> > > > > not.
> > > > > The protocol ended up being as it is because people were trying
> > > > > to
> > > > > make
> > > > > it as easy to implement as possible.
> > > > >
> > > > > So if we could design the filesystem from scratch, we would
> > > > > have
> > > > > probably designed it along the lines of what AFS does.
> > > > > i.e. each explicit change is accompanied by a single bump of
> > > > > the
> > > > > change
> > > > > attribute, so that the clients can not only decide the order of
> > > > > the
> > > > > resulting changes, but also if they have missed a change (that
> > > > > might
> > > > > have been made by a different client).
> > > > >
> > > > > However that would be a requirement that is likely to be very
> > > > > specific
> > > > > to distributed caches (and hence distributed filesystems). I
> > > > > doubt
> > > > > there are many user space applications that would need that
> > > > > high
> > > > > precision. Maybe MPI, but that's the only candidate I can think
> > > > > of
> > > > > for
> > > > > now?
> > > > >
> > > >
> > > > The fact that NFS kept this more loosely-defined is what allowed
> > > > us
> > > > to
> > > > elide some of the i_version bumps and regain a fair bit of
> > > > performance
> > > > for local filesystems [1]. If the change attribute had been more
> > > > strictly defined like you mention, then that particular
> > > > optimization
> > > > would not have been possible.
> > > >
> > > > This sort of thing is why I'm a fan of not defining this any more
> > > > strictly than we require. Later on, maybe we'll come up with a
> > > > way
> > > > for
> > > > filesystems to advertise that they can offer stronger guarantees.
> > >
> > > What 'eliding of the bumps' are we talking about here? If it
> > > results in
> > > unreliable behaviour, then I propose we just drop the whole concept
> > > and
> > > go back to using the ctime. The change attribute is only useful if
> > > it
> > > results in a reliable mechanism for detecting changes. Once you
> > > "elide
> > > away" the word "reliable", then it has no value beyond what ctime
> > > already does.
> > >
> >
> > I'm talking about the scheme to optimize away i_version updates when
> > the
> > current one has never been queried:
> >
> > ???
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02a9ad1f15d
> >
> > There's nothing unreliable about it.
>
> Not really seeing why that would be incompatible with the idea of
> bumping on every change. The I_VERSION_QUERIED is just a hint to tell
> you that at the very least you need to sync the next metadata update
> after someone peeked at the value. You could still continue to cache
> updates after that, and only sync them once a O_SYNC or an fsync() call
> explicitly requires you to do so.
>
Good point! It's not implemented that way today, but we could change it
to do that if it were useful. I think it'd be slightly more costly
CPU-wise when the update isn't going to disk, since you'd now have to
update the value on every change instead of just skipping it, but I
doubt anyone would notice the extra overhead.
--
Jeff Layton <[email protected]>