On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> Moving the discussion to fsdevel.
>
> Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> generic 'noiversion' option cannot be used to achieve that. It is
> processed before it reaches btrfs superblock callback, where
> MS_I_VERSION is forced.
>
> The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> to which I object.
I was talking to Mingming about this on today's ext4 conference call,
and one of the reasons why ext4 turns off i_version update by default
is because it does a real number on our performance as well --- and
furthermore, the only real user of the field from what we can tell is
NFSv4, which not all that many ext4 users actually care about.
This has caused pain for the nfsv4 folks since it means that they need
to tell people to use a special mount option for ext4 if they are
actually using this for nfsv4, and I suspect they won't be all that
eager to hear that btrfs is going to go the same way.
This however got us thinking --- even if NFSv4 is depending on
i_version, it doesn't actually _look_ at that field all that often.
It's only going to look at it in response to a client's getattr
call, and that in turn is used so the client can do its local disk
cache invalidation if any of the data blocks of the inode have changed.
So what if we have a per-inode flag that says "don't update I_VERSION",
which is off by default, but which gets set after the i_version has been
updated at least once, so the i_version field won't be updated again ---
at least until something has actually looked at the i_version field,
at which point the "don't update I_VERSION" flag will get cleared again.
So basically, if we know there are no microphones in the forest, we
don't need to make the tree fall. However, if someone has sampled the
i_version field, then the next time the inode gets updated, we will
update the i_version field so the NFSv4 client can hear the sound of
the tree crashing to the forest floor and so it can invalidate its
local cache of the file. :-)
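In rough code, the write path would become something like this (the
i_version_seen name is made up for illustration, and locking is
ignored):

    /* on any modification of the inode's data or metadata */
    if (inode->i_version_seen) {
            inode->i_version++;
            inode->i_version_seen = false;
    }

with any read of the i_version field setting inode->i_version_seen
back to true.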
This should significantly improve the performance of using the
i_version field if the file system is being exported via NFSv4, and if
NFSv4 is not in use, no one will be looking at the i_version field, so
the performance impact will be very slight, and thus we could enable
i_version updates by default for btrfs and ext4.
And this should make the distribution folks happy, since it will unify
the behavior of all file systems, and make life easier for users who
won't need to set certain magic mount options depending on what file
system they are using and whether they are using NFSv4 or not.
Does this sound reasonable?
Cheers,
- Ted
On Tue, Jun 23, 2015 at 12:32:41PM -0400, Theodore Ts'o wrote:
> On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> > Moving the discussion to fsdevel.
> >
> > Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> > generic 'noiversion' option cannot be used to achieve that. It is
> > processed before it reaches btrfs superblock callback, where
> > MS_I_VERSION is forced.
> >
> > The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> > to which I object.
>
> I was talking to Mingming about this on today's ext4 conference call,
> and one of the reasons why ext4 turns off i_version update by default
> is because it does a real number on our performance as well --- and
> furthermore, the only real user of the field from what we can tell is
> NFSv4, which not all that many ext4 users actually care about.
>
> This has caused pain for the nfsv4 folks since it means that they need
> to tell people to use a special mount option for ext4 if they are
> actually using this for nfsv4, and I suspect they won't be all that
> eager to hear that btrfs is going to go the same way.
>
> This however got us thinking --- even if NFSv4 is depending on
> i_version, it doesn't actually _look_ at that field all that often.
> It's only going to look at it in response to a client's getattr
> call, and that in turn is used so the client can do its local disk
> cache invalidation if any of the data blocks of the inode have changed.
>
> So what if we have a per-inode flag that says "don't update I_VERSION",
> which is off by default, but which gets set after the i_version has been
> updated at least once, so the i_version field won't be updated again ---
> at least until something has actually looked at the i_version field,
> at which point the "don't update I_VERSION" flag will get cleared again.
>
> So basically, if we know there are no microphones in the forest, we
> don't need to make the tree fall. However, if someone has sampled the
> i_version field, then the next time the inode gets updated, we will
> update the i_version field so the NFSv4 client can hear the sound of
> the tree crashing to the forest floor and so it can invalidate its
> local cache of the file. :-)
>
> This should significantly improve the performance of using the
> i_version field if the file system is being exported via NFSv4, and if
> NFSv4 is not in use, no one will be looking at the i_version field, so
> the performance impact will be very slight, and thus we could enable
> i_version updates by default for btrfs and ext4.
>
> And this should make the distribution folks happy, since it will unify
> the behavior of all file systems, and make life easier for users who
> won't need to set certain magic mount options depending on what file
> system they are using and whether they are using NFSv4 or not.
>
> Does this sound reasonable?
I agree, this is a promising way to fix the whole thing.
Regarding the client's getattr: I found that inode->i_version is not
read by generic_fillattr(), so I'm not quite sure how we get to change
the flag...
Thanks,
-liubo
>
> Cheers,
>
> - Ted
On Tue, Jun 23, 2015 at 12:32:41PM -0400, Theodore Ts'o wrote:
> This has caused pain for the nfsv4 folks since it means that they need
> to tell people to use a special mount option for ext4 if they are
> actually using this for nfsv4, and I suspect they won't be all that
> eager to hear that btrfs is going to go the same way.
I did not mean to change the default behaviour with respect to NFS;
that would be a regression.
> This however got us thinking --- even if NFSv4 is depending on
> i_version, it doesn't actually _look_ at that field all that often.
> It's only going to look at it in response to a client's getattr
> call, and that in turn is used so the client can do its local disk
> cache invalidation if any of the data blocks of the inode have changed.
>
> So what if we have a per-inode flag that says "don't update I_VERSION",
> which is off by default, but which gets set after the i_version has been
> updated at least once, so the i_version field won't be updated again ---
> at least until something has actually looked at the i_version field,
> at which point the "don't update I_VERSION" flag will get cleared again.
This sounds similar to what Dave proposed, a per-inode I_VERSION
attribute that can be changed through chattr. Though the negated meaning
of the flag could be confusing; I had to reread the paragraph.
> This should significantly improve the performance of using the
> i_version field if the file system is being exported via NFSv4, and if
> NFSv4 is not in use, no one will be looking at the i_version field, so
> the performance impact will be very slight, and thus we could enable
> i_version updates by default for btrfs and ext4.
The btrfs default is to update i_version, and the usecase gets fixed by
the per-inode attribute. But from your description above I think that
this might not be enough for ext4. The reason I see is the different
defaults: you want to turn it on by default but not impose any
performance penalty for that, while for our usecase it's sufficient to
selectively turn it off.
I've tried to locate the source of the performance drop for ext4 +
iversion, but was not successful, so I don't know if the proposed fix
is complete.
> And this should make the distribution folks happy, since it will unify
> the behavior of all file systems, and make life easier for users who
> won't need to set certain magic mount options depending on what file
> system they are using and whether they are using NFSv4 or not.
>
> Does this sound reasonable?
It does; the unified behaviour wrt NFS is definitely a good thing.
On Wed, Jun 24, 2015 at 08:02:15PM +0200, David Sterba wrote:
>
> This sounds similar to what Dave proposed, a per-inode I_VERSION
> attribute that can be changed through chattr. Though the negated meaning
> of the flag could be confusing; I had to reread the paragraph.
Dave did not specify an I_VERSION attribute that would be stored on
disk. Instead he talked about an inode flag that would be set when the
struct inode is created by the file system.
This would allow file systems to permanently configure (on a per-inode
basis) whether or not a particular inode would require a forced
i_version update any time the inode's data or metadata is modified. I
suppose you could initialize the inode flag from an on-disk
attribute, but that wasn't implied by Dave's proposal, at least as I
understood it.
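So the plumbing might look something like this (S_NOIVERSION is a
made-up flag name, and fs_wants_noiversion() stands in for however the
file system makes the decision, possibly from an on-disk attribute):

    /* when the file system instantiates the in-memory inode */
    if (fs_wants_noiversion(raw_inode))
            inode->i_flags |= S_NOIVERSION;    /* hypothetical flag */

    /* later, in the generic inode update path */
    if (IS_I_VERSION(inode) && !(inode->i_flags & S_NOIVERSION))
            inode_inc_iversion(inode);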
> > This should significantly improve the performance of using the
> > i_version field if the file system is being exported via NFSv4, and if
> > NFSv4 is not in use, no one will be looking at the i_version field, so
> > the performance impact will be very slight, and thus we could enable
> > i_version updates by default for btrfs and ext4.
>
> The btrfs default is to update i_version, and the usecase gets fixed
> by the per-inode attribute. But from your description above I think
> that this might not be enough for ext4. The reason I see is the
> different defaults: you want to turn it on by default but not impose
> any performance penalty for that, while for our usecase it's
> sufficient to selectively turn it off.
The problem with selectively turning it off is that the user might
decide, for a particular file which is getting lots of updates, to turn
off i_version updates --- but then at some subsequent time, when that
file is part of a file system which is exported via NFSv4, clients will
mysteriously break because i_version was manually turned off.
So this is going to be a potential support nightmare for enterprise
distro help desks --- the user will report that a particular file is
constantly getting corrupted, due to the NFSv4 cache invalidation
getting broken, and it might not be obvious why this is only happening
with this one file; the reason being that with btrfs, the i_version
update for that particular file was selectively turned off. I don't
think it's a good design when it is easy for the user to set a flag
which breaks functionality in a potentially confusing way, especially
when the net result is potential data corruption.
This is why I would much rather have the default be on, but with
minimal (preferably not measurable) performance overhead. It's the
best of both worlds.
- Ted
On Wed, Jun 24, 2015 at 07:17:50PM -0400, Theodore Ts'o wrote:
> On Wed, Jun 24, 2015 at 08:02:15PM +0200, David Sterba wrote:
> >
> > This sounds similar to what Dave proposed, a per-inode I_VERSION
> > attribute that can be changed through chattr. Though the negated meaning
> > of the flag could be confusing; I had to reread the paragraph.
>
> Dave did not specify an I_VERSION attribute that would be stored on
> disk. Instead he talked about an inode flag that would be set when the
> struct inode is created by the file system.
Right.
> This would allow file systems to permanently configure (on a per-inode
> basis) whether or not a particular inode would require a forced
> i_version update any time the inode's data or metadata is modified. I
> suppose you could initialize the inode flag from an on-disk
> attribute, but that wasn't implied by Dave's proposal, at least as I
> understood it.
It enables filesystems to do this. If btrfs want to add an on-disk
flag to turn off I_VERSION on a per-inode basis, or imply it from
some other on-disk flag, then they are welcome to do so and the
above infrastructure change will support it.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Jun 23, 2015 at 12:32:41PM -0400, Theodore Ts'o wrote:
> On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> > Moving the discussion to fsdevel.
> >
> > Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> > generic 'noiversion' option cannot be used to achieve that. It is
> > processed before it reaches btrfs superblock callback, where
> > MS_I_VERSION is forced.
> >
> > The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> > to which I object.
>
> I was talking to Mingming about this on today's ext4 conference call,
> and one of the reasons why ext4 turns off i_version update by default
> is because it does a real number on our performance as well --- and
> furthermore, the only real user of the field from what we can tell is
> NFSv4, which not all that many ext4 users actually care about.
>
> This has caused pain for the nfsv4 folks since it means that they need
> to tell people to use a special mount option for ext4 if they are
> actually using this for nfsv4, and I suspect they won't be all that
> eager to hear that btrfs is going to go the same way.
Yes, thanks for looking into this!
> This however got us thinking --- even if NFSv4 is depending on
> i_version, it doesn't actually _look_ at that field all that often.
Most clients will query it on every write. (I just took a quick look at
the code and I believe the Linux client's requesting it immediately
after every write, except in the O_DIRECT and delegated cases.)
> It's only going to look at it in response to a client's getattr
> call, and that in turn is used so the client can do its local disk
> cache invalidation if any of the data blocks of the inode have changed.
>
> So what if we have a per-inode flag that says "don't update I_VERSION",
> which is off by default, but which gets set after the i_version has been
> updated at least once, so the i_version field won't be updated again ---
> at least until something has actually looked at the i_version field,
> at which point the "don't update I_VERSION" flag will get cleared again.
>
> So basically, if we know there are no microphones in the forest, we
> don't need to make the tree fall. However, if someone has sampled the
> i_version field, then the next time the inode gets updated, we will
> update the i_version field so the NFSv4 client can hear the sound of
> the tree crashing to the forest floor and so it can invalidate its
> local cache of the file. :-)
>
> This should significantly improve the performance of using the
> i_version field if the file system is being exported via NFSv4, and if
> NFSv4 is not in use, no one will be looking at the i_version field, so
> the performance impact will be very slight, and thus we could enable
> i_version updates by default for btrfs and ext4.
>
> And this should make the distribution folks happy, since it will unify
> the behavior of all file systems, and make life easier for users who
> won't need to set certain magic mount options depending on what file
> system they are using and whether they are using NFSv4 or not.
>
> Does this sound reasonable?
Just to make sure I understand, the logic is something like:
to read the i_version:

    inode->i_version_seen = true;
    return inode->i_version;

to update the i_version:

    /*
     * If nobody's seen this value of i_version then we can
     * keep using it, otherwise we need a new one:
     */
    if (inode->i_version_seen)
            inode->i_version++;
    inode->i_version_seen = false;
Looks OK to me. As I say I'd expect i_version_seen == true to end up
being the common case in a lot of v4 workloads, so I'm more skeptical of
the claim of a performance improvement in the v4 case.
Could maintaining the new flag be a significant drag in itself? If not,
then I guess we're not making things any worse there, so fine.
--b.
On Thu, Jun 25, 2015 at 02:46:44PM -0400, J. Bruce Fields wrote:
> > Does this sound reasonable?
>
> Just to make sure I understand, the logic is something like:
>
> to read the i_version:
>
>     inode->i_version_seen = true;
>     return inode->i_version;
>
> to update the i_version:
>
>     /*
>      * If nobody's seen this value of i_version then we can
>      * keep using it, otherwise we need a new one:
>      */
>     if (inode->i_version_seen)
>             inode->i_version++;
>     inode->i_version_seen = false;
Yep, that's what I was proposing.
> Looks OK to me. As I say I'd expect i_version_seen == true to end up
> being the common case in a lot of v4 workloads, so I'm more skeptical of
> the claim of a performance improvement in the v4 case.
Well, so long as we require i_version to be committed to disk on every
single disk write, we're going to be trading off:
* client-side performance of the advanced NFSv4 caching for reads
* server-side performance for writes
* data robustness in case of the server crashing and the client-side cache
getting out of sync with the server after the crash
I don't see any way around that. (So for example, with lazy mtime
updates we wouldn't be updating the inode after every single
non-allocating write; enabling i_version updates will trash that
optimization.)
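(To illustrate, the update_time logic in fs/inode.c goes roughly like
this; a simplified sketch from memory, not the exact code:

    int sync_it = 0;

    if (!timespec_equal(&inode->i_mtime, &now))
            sync_it = S_MTIME;              /* mtime really changed */
    if (!timespec_equal(&inode->i_ctime, &now))
            sync_it |= S_CTIME;
    if (IS_I_VERSION(inode))
            sync_it |= S_VERSION;           /* unconditional when enabled */
    if (!sync_it)
            return 0;                       /* inode stays clean */

so with MS_I_VERSION set we end up dirtying the inode on every write,
even when the timestamps haven't changed.)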
I just want to reduce to a bare minimum the performance hit in the
case where NFSv4 exports are not being used (since that is true in a
very *large* number of ext4 deployments --- i.e., every single Android
handset using ext4 :-), such that we can leave i_version updates
turned on by default.
> Could maintaining the new flag be a significant drag in itself? If not,
> then I guess we're not making things any worse there, so fine.
I don't think so; it's a bit in the in-memory inode, so I don't think
that should be an issue.
- Ted
On Thu, Jun 25, 2015 at 06:12:57PM -0400, Theodore Ts'o wrote:
> On Thu, Jun 25, 2015 at 02:46:44PM -0400, J. Bruce Fields wrote:
> > Looks OK to me. As I say I'd expect i_version_seen == true to end up
> > being the common case in a lot of v4 workloads, so I'm more skeptical of
> > the claim of a performance improvement in the v4 case.
>
> Well, so long as we require i_version to be committed to disk on every
> single disk write, we're going to be trading off:
>
> * client-side performance of the advanced NFSv4 caching for reads
> * server-side performance for writes
> * data robustness in case of the server crashing and the client-side cache
> getting out of sync with the server after the crash
>
> I don't see any way around that. (So for example, with lazy mtime
> updates we wouldn't be updating the inode after every single
> non-allocating write; enabling i_version updates will trash that
> optimization.)
>
> I just want to reduce to a bare minimum the performance hit in the
> case where NFSv4 exports are not being used (since that is true in a
> very *large* number of ext4 deployments --- i.e., every single Android
> handset using ext4 :-), such that we can leave i_version updates
> turned on by default.
Definitely understood. I think it's a good idea.
> > Could maintaining the new flag be a significant drag in itself? If not,
> > then I guess we're not making things any worse there, so fine.
>
> I don't think so; it's a bit in the in-memory inode, so I don't think
> that should be an issue.
OK!
--b.