2012-05-14 14:06:18

by J. Bruce Fields

[permalink] [raw]
Subject: [PATCH] ext4: turn on i_version updates by default

knfsd needs i_version updates on, as will userspace nfs servers and
probably others.

The only effects are that inode->i_version is bumped (under the i_lock)
in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
more frequently than once per jiffy on write (see file_update_time).
However the latter appears to be mostly a no-op in that case.

So, simplify our life and just keep this feature turned on all the time.

Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/ext4/super.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

If people are worried that the performance impact isn't obvious, I'll
find the time somehow to test this properly....

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e1fb1d5..a99f827 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1483,7 +1483,7 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
sbi->s_mount_flags |= EXT4_MF_FS_ABORTED;
return 1;
case Opt_i_version:
- sb->s_flags |= MS_I_VERSION;
+ /* no-op; this is on by default now */
return 1;
case Opt_journal_dev:
if (is_remount) {
@@ -2979,6 +2979,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto out_free_orig;
}
sb->s_fs_info = sbi;
+ sb->s_flags |= MS_I_VERSION;
sbi->s_mount_opt = 0;
sbi->s_resuid = EXT4_DEF_RESUID;
sbi->s_resgid = EXT4_DEF_RESGID;
--
1.7.9.5



2012-05-14 15:02:12

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On 2012-05-14, at 8:06, "J. Bruce Fields" <[email protected]> wrote:
> knfsd needs i_version updates on, as will userspace nfs servers and
> probably others.
>
> The only effects are that inode->i_version is bumped (under the i_lock)
> in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
> more frequently than once per jiffy on write (see file_update_time).
> However the latter appears to be mostly a no-op in that case.

I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?

This is one of the reasons that the i_version update is conditional. If someone is exporting a filesystem from userspace the should be able to turn this on as a mount option, and knfsd could do it from inside the kernel. Why add overhead when it is not needed?

> So, simplify our life and just keep this feature turned on all the time.
>
> Signed-off-by: J. Bruce Fields <[email protected]>
> ---
> fs/ext4/super.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> If people are worried that the performance impact isn't obvious, I'll
> find the time somehow to test this properly....
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index e1fb1d5..a99f827 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1483,7 +1483,7 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> sbi->s_mount_flags |= EXT4_MF_FS_ABORTED;
> return 1;
> case Opt_i_version:
> - sb->s_flags |= MS_I_VERSION;
> + /* no-op; this is on by default now */
> return 1;
> case Opt_journal_dev:
> if (is_remount) {
> @@ -2979,6 +2979,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> goto out_free_orig;
> }
> sb->s_fs_info = sbi;
> + sb->s_flags |= MS_I_VERSION;
> sbi->s_mount_opt = 0;
> sbi->s_resuid = EXT4_DEF_RESUID;
> sbi->s_resgid = EXT4_DEF_RESGID;
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-05-14 15:23:34

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, May 14, 2012 at 09:02:12AM -0600, Andreas Dilger wrote:
> On 2012-05-14, at 8:06, "J. Bruce Fields" <[email protected]> wrote:
> > knfsd needs i_version updates on, as will userspace nfs servers and
> > probably others.
> >
> > The only effects are that inode->i_version is bumped (under the i_lock)
> > in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
> > more frequently than once per jiffy on write (see file_update_time).
> > However the latter appears to be mostly a no-op in that case.
>
> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?

There's no reason it should be, should it, if we already just dirtied
the inode a moment ago?

> This is one of the reasons that the i_version update is conditional.
> If someone is exporting a filesystem from userspace the should be able
> to turn this on as a mount option, and knfsd could do it from inside
> the kernel. Why add overhead when it is not needed?

Any user of the change attribute also wants it to function correctly
while they're away.

And if it at all possible I'd rather have it be something that Just
Works rather than something that requires extra configuration.

--b.

2012-05-14 17:27:42

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On 2012-05-14, at 9:23 AM, J. Bruce Fields wrote:
> On Mon, May 14, 2012 at 09:02:12AM -0600, Andreas Dilger wrote:
>> On 2012-05-14, at 8:06, "J. Bruce Fields" <[email protected]> wrote:
>>> knfsd needs i_version updates on, as will userspace nfs servers and
>>> probably others.
>>>
>>> The only effects are that inode->i_version is bumped (under the i_lock)
>>> in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
>>> more frequently than once per jiffy on write (see file_update_time).
>>> However the latter appears to be mostly a no-op in that case.
>>
>> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?
>
> There's no reason it should be, should it, if we already just dirtied
> the inode a moment ago?

Ideally not, but the way ext[34]_mark_inode_dirty() is implemented
is that it copies the whole in-core inode to the on-disk inode every
time it is marked dirty. That ensures that the on-disk inode is
up-to-date when the journal flushes the blocks to disk, but is not
an ideal implementation. It has been this way since the first ext3
implementation was done.

As a result, dirtying the inode very frequently for ext[34] is
currently expensive and should be avoided.

I _think_ that the ext4 metadata checksum patches have changed this
to only flag the inode dirty and run a pre-commit callback to copy
the in-core inode to the on-disk inode. I'm not sure what the
current status of that patch is, nor how easily it could be split
from that patch series and land separately.

>> This is one of the reasons that the i_version update is conditional.
>> If someone is exporting a filesystem from userspace the should be able
>> to turn this on as a mount option, and knfsd could do it from inside
>> the kernel. Why add overhead when it is not needed?
>
> Any user of the change attribute also wants it to function correctly
> while they're away.

It would only need to change once, however, not continuously.
Is there any way to know when a consumer has sampled the version?
That way the on-disk version could be bumped once after the version
was referenced, and wouldn't have to be changed thousands of times
per second, nor at all if nothing is using the version.

The MS_I_VERSION is intended to be used to indicate that i_version
needs to be updated. I can imagine that it might make sense to
make this flag "sticky" on a filesystem, so that once it is used
for NFSv4 the version will be bumped once for an inode change even
if MS_I_VERSION is not in use, but that is sufficient for NFSv4
and it does not have to be a permanent drag on the filesystem.

> And if it at all possible I'd rather have it be something that Just
> Works rather than something that requires extra configuration.

Sure, but this is only useful for NFSv4, but costs everyone using
ext4 continuous overhead, so it isn't a clear-cut case to enable
the version just on the thought that NFS might one day be used on
any particular filesystem.

Cheers, Andreas




2012-05-14 17:58:22

by Theodore Y. Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > And if it at all possible I'd rather have it be something that Just
> > Works rather than something that requires extra configuration.
>
> Sure, but this is only useful for NFSv4, but costs everyone using
> ext4 continuous overhead, so it isn't a clear-cut case to enable
> the version just on the thought that NFS might one day be used on
> any particular filesystem.

It's not a matter of "NFSv4 might one day be used"; if we don't turn
on i_version updates until the file system is actually exported via
NFSv4, there would be no deleterious effects.

I always thought that was going to be the plan; that there would be
some flag that would be set in struct super_block when the file system
was exported that would enable i_version updates.

That way we satisfy the "no extra configuration" needed requirement,
which I agree is ideal, but we also don't waste any CPU overhead if
the file system is not exported via NFSv4. I tried to implement
anything along these lines because I don't care enough, and I don't
use NFSv4 personally....

- Ted

2012-05-14 18:33:17

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, May 14, 2012 at 01:58:22PM -0400, Ted Ts'o wrote:
> On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > > And if it at all possible I'd rather have it be something that Just
> > > Works rather than something that requires extra configuration.
> >
> > Sure, but this is only useful for NFSv4, but costs everyone using
> > ext4 continuous overhead, so it isn't a clear-cut case to enable
> > the version just on the thought that NFS might one day be used on
> > any particular filesystem.
>
> It's not a matter of "NFSv4 might one day be used"; if we don't turn
> on i_version updates until the file system is actually exported via
> NFSv4, there would be no deleterious effects.
>
> I always thought that was going to be the plan; that there would be
> some flag that would be set in struct super_block when the file system
> was exported that would enable i_version updates.
>
> That way we satisfy the "no extra configuration" needed requirement,
> which I agree is ideal, but we also don't waste any CPU overhead if
> the file system is not exported via NFSv4. I tried to implement
> anything along these lines because I don't care enough, and I don't
> use NFSv4 personally....
>

Seems like this is just a bad place to be doing inode_inc_iversion(). If
MS_IVERSION is set we will update iversion in file_update_time() and then call
mark_inode_dirty which will jack up the iversion again. In btrfs we just change
it wherever we change ctime and that way you don't really notice the extra
overhead since you are doing it in paths where you are changing a bunch of stuff
in the inode already, and mostly where you hold the i_mutex so you aren't going
to be hitting any contention on the i_lock. Thanks,

Josef

2012-05-14 18:48:02

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, 14 May 2012 14:33:17 -0400
Josef Bacik <[email protected]> wrote:

> On Mon, May 14, 2012 at 01:58:22PM -0400, Ted Ts'o wrote:
> > On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > > > And if it at all possible I'd rather have it be something that Just
> > > > Works rather than something that requires extra configuration.
> > >
> > > Sure, but this is only useful for NFSv4, but costs everyone using
> > > ext4 continuous overhead, so it isn't a clear-cut case to enable
> > > the version just on the thought that NFS might one day be used on
> > > any particular filesystem.
> >
> > It's not a matter of "NFSv4 might one day be used"; if we don't turn
> > on i_version updates until the file system is actually exported via
> > NFSv4, there would be no deleterious effects.
> >
> > I always thought that was going to be the plan; that there would be
> > some flag that would be set in struct super_block when the file system
> > was exported that would enable i_version updates.
> >
> > That way we satisfy the "no extra configuration" needed requirement,
> > which I agree is ideal, but we also don't waste any CPU overhead if
> > the file system is not exported via NFSv4. I tried to implement
> > anything along these lines because I don't care enough, and I don't
> > use NFSv4 personally....
> >
>
> Seems like this is just a bad place to be doing inode_inc_iversion(). If
> MS_IVERSION is set we will update iversion in file_update_time() and then call
> mark_inode_dirty which will jack up the iversion again. In btrfs we just change
> it wherever we change ctime and that way you don't really notice the extra
> overhead since you are doing it in paths where you are changing a bunch of stuff
> in the inode already, and mostly where you hold the i_mutex so you aren't going
> to be hitting any contention on the i_lock. Thanks,
>

Well, you do incur a bit more overhead in btrfs too:

------------------[snip]----------------------
if (!timespec_equal(&inode->i_mtime, &now))
sync_it = S_MTIME;

if (!timespec_equal(&inode->i_ctime, &now))
sync_it |= S_CTIME;

if (IS_I_VERSION(inode))
sync_it |= S_VERSION;
------------------[snip]----------------------

So you'll end up with sync_it being 0 if i_version updates are
disabled, and the mtime/ctime didn't visibly change.

If your jiffies are coarse-grained enough, then you might get "lucky"
rather often, but is that a case worth optimizing for? How often does
it happen that you mark the inode dirty, flush it to disk and then
re-mark it dirty within the same jiffy?

If that's a common occurrence then you might see some performance
impact here from turning i_version on all the time. Otherwise, it seems
like it wouldn't make that big a difference.

--
Jeff Layton <[email protected]>

2012-05-14 18:51:38

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, May 14, 2012 at 02:48:02PM -0400, Jeff Layton wrote:
> On Mon, 14 May 2012 14:33:17 -0400
> Josef Bacik <[email protected]> wrote:
>
> > On Mon, May 14, 2012 at 01:58:22PM -0400, Ted Ts'o wrote:
> > > On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > > > > And if it at all possible I'd rather have it be something that Just
> > > > > Works rather than something that requires extra configuration.
> > > >
> > > > Sure, but this is only useful for NFSv4, but costs everyone using
> > > > ext4 continuous overhead, so it isn't a clear-cut case to enable
> > > > the version just on the thought that NFS might one day be used on
> > > > any particular filesystem.
> > >
> > > It's not a matter of "NFSv4 might one day be used"; if we don't turn
> > > on i_version updates until the file system is actually exported via
> > > NFSv4, there would be no deleterious effects.
> > >
> > > I always thought that was going to be the plan; that there would be
> > > some flag that would be set in struct super_block when the file system
> > > was exported that would enable i_version updates.
> > >
> > > That way we satisfy the "no extra configuration" needed requirement,
> > > which I agree is ideal, but we also don't waste any CPU overhead if
> > > the file system is not exported via NFSv4. I tried to implement
> > > anything along these lines because I don't care enough, and I don't
> > > use NFSv4 personally....
> > >
> >
> > Seems like this is just a bad place to be doing inode_inc_iversion(). If
> > MS_IVERSION is set we will update iversion in file_update_time() and then call
> > mark_inode_dirty which will jack up the iversion again. In btrfs we just change
> > it wherever we change ctime and that way you don't really notice the extra
> > overhead since you are doing it in paths where you are changing a bunch of stuff
> > in the inode already, and mostly where you hold the i_mutex so you aren't going
> > to be hitting any contention on the i_lock. Thanks,
> >
>
> Well, you do incur a bit more overhead in btrfs too:
>
> ------------------[snip]----------------------
> if (!timespec_equal(&inode->i_mtime, &now))
> sync_it = S_MTIME;
>
> if (!timespec_equal(&inode->i_ctime, &now))
> sync_it |= S_CTIME;
>
> if (IS_I_VERSION(inode))
> sync_it |= S_VERSION;
> ------------------[snip]----------------------
>
> So you'll end up with sync_it being 0 if i_version updates are
> disabled, and the mtime/ctime didn't visibly change.
>
> If your jiffies are coarse-grained enough, then you might get "lucky"
> rather often, but is that a case worth optimizing for? How often does
> it happen that you mark the inode dirty, flush it to disk and then
> re-mark it dirty within the same jiffy?
>

Well sync_it just means do we need to call mark_inode_dirty, not necessarily do
we need to write it to disk, so you are just updating a field in memory. Now if
your jiffies are coarse enough for you to not notice the ctime update then yes
you are incurring an extra lock/inc/unlock, but this is called in the write path
where you are going to do much more latency inducting operations than locking
and unlocking a generally uncontended spin lock. Thanks,

Josef

2012-05-14 18:54:00

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, May 14, 2012 at 02:33:17PM -0400, Josef Bacik wrote:
> On Mon, May 14, 2012 at 01:58:22PM -0400, Ted Ts'o wrote:
> > On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > > > And if it at all possible I'd rather have it be something that Just
> > > > Works rather than something that requires extra configuration.
> > >
> > > Sure, but this is only useful for NFSv4, but costs everyone using
> > > ext4 continuous overhead, so it isn't a clear-cut case to enable
> > > the version just on the thought that NFS might one day be used on
> > > any particular filesystem.
> >
> > It's not a matter of "NFSv4 might one day be used"; if we don't turn
> > on i_version updates until the file system is actually exported via
> > NFSv4, there would be no deleterious effects.
> >
> > I always thought that was going to be the plan; that there would be
> > some flag that would be set in struct super_block when the file system
> > was exported that would enable i_version updates.
> >
> > That way we satisfy the "no extra configuration" needed requirement,
> > which I agree is ideal, but we also don't waste any CPU overhead if
> > the file system is not exported via NFSv4. I tried to implement
> > anything along these lines because I don't care enough, and I don't
> > use NFSv4 personally....
> >
>
> Seems like this is just a bad place to be doing inode_inc_iversion(). If
> MS_IVERSION is set we will update iversion in file_update_time() and then call
> mark_inode_dirty which will jack up the iversion again.

Agreed, that's weird.

> In btrfs we just change
> it wherever we change ctime and that way you don't really notice the extra
> overhead since you are doing it in paths where you are changing a bunch of stuff
> in the inode already, and mostly where you hold the i_mutex so you aren't going
> to be hitting any contention on the i_lock. Thanks,

I don't think they're worried about the inode_inc_iversion() calls
themselves, but the behavior of file_update_time():

if (!timespec_equal(&inode->i_mtime, &now))
sync_it = S_MTIME;

if (!timespec_equal(&inode->i_ctime, &now))
sync_it |= S_CTIME;

if (IS_I_VERSION(inode))
sync_it |= S_VERSION;

if (!sync_it)
return;
...
mark_inode_dirty_sync(inode);

So now mark_inode_dirty_sync() is called on every update, instead of
merely on every update that sees a time change (so at most once a
jiffy).

So mark_inode_dirty_sync (and hence ->dirty_inode = ext4_dirty_inode)
may get called more often if you're writing very frequently.

I'm a bit surprised that's expected to add significant overhead to the
write.

I guess I should stare at the code and try to follow Andreas's
explanation....

--b.

2012-05-14 19:05:18

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, May 14, 2012 at 02:54:00PM -0400, J. Bruce Fields wrote:
> On Mon, May 14, 2012 at 02:33:17PM -0400, Josef Bacik wrote:
> > On Mon, May 14, 2012 at 01:58:22PM -0400, Ted Ts'o wrote:
> > > On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > > > > And if it at all possible I'd rather have it be something that Just
> > > > > Works rather than something that requires extra configuration.
> > > >
> > > > Sure, but this is only useful for NFSv4, but costs everyone using
> > > > ext4 continuous overhead, so it isn't a clear-cut case to enable
> > > > the version just on the thought that NFS might one day be used on
> > > > any particular filesystem.
> > >
> > > It's not a matter of "NFSv4 might one day be used"; if we don't turn
> > > on i_version updates until the file system is actually exported via
> > > NFSv4, there would be no deleterious effects.
> > >
> > > I always thought that was going to be the plan; that there would be
> > > some flag that would be set in struct super_block when the file system
> > > was exported that would enable i_version updates.
> > >
> > > That way we satisfy the "no extra configuration" needed requirement,
> > > which I agree is ideal, but we also don't waste any CPU overhead if
> > > the file system is not exported via NFSv4. I tried to implement
> > > anything along these lines because I don't care enough, and I don't
> > > use NFSv4 personally....
> > >
> >
> > Seems like this is just a bad place to be doing inode_inc_iversion(). If
> > MS_IVERSION is set we will update iversion in file_update_time() and then call
> > mark_inode_dirty which will jack up the iversion again.
>
> Agreed, that's weird.
>
> > In btrfs we just change
> > it wherever we change ctime and that way you don't really notice the extra
> > overhead since you are doing it in paths where you are changing a bunch of stuff
> > in the inode already, and mostly where you hold the i_mutex so you aren't going
> > to be hitting any contention on the i_lock. Thanks,
>
> I don't think they're worried about the inode_inc_iversion() calls
> themselves, but the behavior of file_update_time():
>
> if (!timespec_equal(&inode->i_mtime, &now))
> sync_it = S_MTIME;
>
> if (!timespec_equal(&inode->i_ctime, &now))
> sync_it |= S_CTIME;
>
> if (IS_I_VERSION(inode))
> sync_it |= S_VERSION;
>
> if (!sync_it)
> return;
> ...
> mark_inode_dirty_sync(inode);
>
> So now mark_inode_dirty_sync() is called on every update, instead of
> merely on every update that sees a time change (so at most once a
> jiffy).
>
> So mark_inode_dirty_sync (and hence ->dirty_inode = ext4_dirty_inode)
> may get called more often if you're writing very frequently.
>
> I'm a bit surprised that's expected to add significant overhead to the
> write.
>
> I guess I should stare at the code and try to follow Andreas's
> explanation....
>

It shouldn't, let's be honest, most systems aren't going to have such a coarse
jiffie counter that they'll be able to get away with doing 2 calls to write() or
->page_mkwrite() in the same jiffie and skip the update to mtime/ctime anyway.
If they do they are damned lucky, and again the amount of overhead added even if
they are should be negligible since 99% of us all incur the overhead from having
to update mtime/ctime anyway. Thanks,

Josef

2012-05-14 21:27:47

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On 2012-05-14, at 1:05 PM, Josef Bacik wrote:
> On Mon, May 14, 2012 at 02:54:00PM -0400, J. Bruce Fields wrote:
>> I don't think they're worried about the inode_inc_iversion() calls
>> themselves, but the behavior of file_update_time():
>>
>> if (!timespec_equal(&inode->i_mtime, &now))
>> sync_it = S_MTIME;
>>
>> if (!timespec_equal(&inode->i_ctime, &now))
>> sync_it |= S_CTIME;
>>
>> if (IS_I_VERSION(inode))
>> sync_it |= S_VERSION;
>>
>> if (!sync_it)
>> return;
>> ...
>> mark_inode_dirty_sync(inode);
>>
>> So now mark_inode_dirty_sync() is called on every update, instead of
>> merely on every update that sees a time change (so at most once a
>> jiffy).
>>
>> So mark_inode_dirty_sync (and hence ->dirty_inode = ext4_dirty_inode)
>> may get called more often if you're writing very frequently.
>>
>> I'm a bit surprised that's expected to add significant overhead to the
>> write.
>
> It shouldn't, let's be honest, most systems aren't going to have such
> a coarse jiffie counter that they'll be able to get away with doing
> 2 calls to write() or ->page_mkwrite() in the same jiffie and skip the
> update to mtime/ctime anyway. If they do they are damned lucky, and
> again the amount of overhead added even if they are should be
> negligible since 99% of us all incur the overhead from having
> to update mtime/ctime anyway. Thanks,

Seriously? The whole reason the above checks for timespec_equal()
are there is to avoid calling mark_inode_dirty_sync() thousands of
times per second. If doing write() calls in the same jiffie were
so rare as you suggest then I don't think such an optimization
would ever have appeared in the first place.

For writes to a high-IOPS device (e.g. SSD) can run far higher than
1000 IOPS, and this is an important use case that people care about
today, so why add useless overhead when it isn't needed?

Cheers, Andreas




2012-05-14 23:09:01

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, 2012-05-14 at 09:02 -0600, Andreas Dilger wrote:
> On 2012-05-14, at 8:06, "J. Bruce Fields" <[email protected]> wrote:
> > knfsd needs i_version updates on, as will userspace nfs servers and
> > probably others.
> >
> > The only effects are that inode->i_version is bumped (under the i_lock)
> > in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
> > more frequently than once per jiffy on write (see file_update_time).
> > However the latter appears to be mostly a no-op in that case.
>
> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?
>
> This is one of the reasons that the i_version update is conditional. If someone is exporting a filesystem from userspace the should be able to turn this on as a mount option, and knfsd could do it from inside the kernel. Why add overhead when it is not needed?

No. Having knfsd doing something like that under the covers is a BAD
idea. It is way too easy to get into situations where someone starts
changing files after the mount and before knfsd is started. As soon as
you allow files to be changed without i_version changing, then you are
setting yourself up for future corruption.

Ideally, an NFSv4-exported filesystem should be required to set the
tune2fs mount_opts to include the 'i_version' flag to make it hard to
inadvertently mount that filesystem without it.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2012-05-14 23:33:04

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On 2012-05-14, at 5:08 PM, Myklebust, Trond wrote:
> On Mon, 2012-05-14 at 09:02 -0600, Andreas Dilger wrote:
>> On 2012-05-14, at 8:06, "J. Bruce Fields" <[email protected]> wrote:
>>> knfsd needs i_version updates on, as will userspace nfs servers and
>>> probably others.
>>>
>>> The only effects are that inode->i_version is bumped (under the i_lock)
>>> in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
>>> more frequently than once per jiffy on write (see file_update_time).
>>> However the latter appears to be mostly a no-op in that case.
>>
>> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?
>>
>> This is one of the reasons that the i_version update is conditional.
>> If someone is exporting a filesystem from userspace the should be
>> able to turn this on as a mount option, and knfsd could do it from inside the kernel. Why add overhead when it is not needed?
>
> No. Having knfsd doing something like that under the covers is a BAD
> idea. It is way too easy to get into situations where someone starts
> changing files after the mount and before knfsd is started. As soon as
> you allow files to be changed without i_version changing, then you are
> setting yourself up for future corruption.
>
> Ideally, an NFSv4-exported filesystem should be required to set the
> tune2fs mount_opts to include the 'i_version' flag to make it hard to
> inadvertently mount that filesystem without it.

I said as much in another reply - that once i_version is used on
a filesystem, it should be made "sticky" (i.e. permanently enabled
for that filesystem). However, until that time it shouldn't be
enabled just because it might one day be used.

Even better than just blindly bumping the i_version on every change,
it would be better to have users of i_version (i.e. knfsd) flag the
inode with "needs i_version update" then read the version. When the
filesystem/VFS bumps i_version the next time it can clear this flag
and not update i_version again until after the next time i_version
is actually used.

Cheers, Andreas




2012-05-14 23:54:35

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, May 14, 2012 at 05:33:04PM -0600, Andreas Dilger wrote:
> I said as much in another reply - that once i_version is used on
> a filesystem, it should be made "sticky" (i.e. permanently enabled
> for that filesystem). However, until that time it shouldn't be
> enabled just because it might one day be used.
>
> Even better than just blindly bumping the i_version on every change,
> it would be better to have users of i_version (i.e. knfsd) flag the
> inode with "needs i_version update" then read the version. When the
> filesystem/VFS bumps i_version the next time it can clear this flag
> and not update i_version again until after the next time i_version
> is actually used.

I really don't want to do anything more complicated than necessary.

What would be the worst-case test for the extra inode dirtying, so we
can see what the numbers actually are?

--b.

2012-05-15 00:13:34

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, 2012-05-14 at 17:33 -0600, Andreas Dilger wrote:
> Even better than just blindly bumping the i_version on every change,
> it would be better to have users of i_version (i.e. knfsd) flag the
> inode with "needs i_version update" then read the version. When the
> filesystem/VFS bumps i_version the next time it can clear this flag
> and not update i_version again until after the next time i_version
> is actually used.

That doubly penalises actual users of i_version: not only does the
filesystem have to bump it on a file change, but it now has to set and
write this flag on the next read of i_version.

If those reads of i_version were really few and far between, then I
agree that it might be a solution, but most NFS clients will ask for the
updated i_version after every change that they make to the file because
they need to track that value for cache consistency management purposes.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2012-05-15 10:30:21

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon 14-05-12 19:54:32, J. Bruce Fields wrote:
> On Mon, May 14, 2012 at 05:33:04PM -0600, Andreas Dilger wrote:
> > I said as much in another reply - that once i_version is used on
> > a filesystem, it should be made "sticky" (i.e. permanently enabled
> > for that filesystem). However, until that time it shouldn't be
> > enabled just because it might one day be used.
> >
> > Even better than just blindly bumping the i_version on every change,
> > it would be better to have users of i_version (i.e. knfsd) flag the
> > inode with "needs i_version update" then read the version. When the
> > filesystem/VFS bumps i_version the next time it can clear this flag
> > and not update i_version again until after the next time i_version
> > is actually used.
>
> I really don't want to do anything more complicated than necessary.
>
> What would be the worst-case test for the extra inode dirtying, so we
> can see what the numbers actually are?
Something like:

int fd, i;
struct timeval tv[2];

fd = open("file", O_CREAT | O_RDWR, 0644);
if (fd < 0)
return 1;
for (i = 0; i < 1000000; i++) {
gettimeofday(tv);
tv[1] = tv[0];
if (futimes(fd, tv) < 0)
return 1;
}
return 0;

And see how long does it take with and without i_version?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-05-15 12:36:17

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Tue, May 15, 2012 at 12:30:21PM +0200, Jan Kara wrote:
> On Mon 14-05-12 19:54:32, J. Bruce Fields wrote:
> > On Mon, May 14, 2012 at 05:33:04PM -0600, Andreas Dilger wrote:
> > > I said as much in another reply - that once i_version is used on
> > > a filesystem, it should be made "sticky" (i.e. permanently enabled
> > > for that filesystem). However, until that time it shouldn't be
> > > enabled just because it might one day be used.
> > >
> > > Even better than just blindly bumping the i_version on every change,
> > > it would be better to have users of i_version (i.e. knfsd) flag the
> > > inode with "needs i_version update" then read the version. When the
> > > filesystem/VFS bumps i_version the next time it can clear this flag
> > > and not update i_version again until after the next time i_version
> > > is actually used.
> >
> > I really don't want to do anything more complicated than necessary.
> >
> > What would be the worst-case test for the extra inode dirtying, so we
> > can see what the numbers actually are?
> Something like:
>
> int fd, i;
> struct timeval tv[2];
>
> fd = open("file", O_CREAT | O_RDWR, 0644);
> if (fd < 0)
> return 1;
> for (i = 0; i < 1000000; i++) {
> gettimeofday(tv);
> tv[1] = tv[0];
> if (futimes(fd, tv) < 0)
> return 1;
> }
> return 0;
>
> And see how long does it take with and without i_version?

The complaint I hear from Andreas is that we'll cause file_update_time()
to call mark_inode_dirty() more often.

I don't believe futimes() calls file_update_time().

So maybe replace that futimes() by a one-byte write?

--b.

2012-05-15 13:29:02

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, May 14, 2012 at 03:27:47PM -0600, Andreas Dilger wrote:
> On 2012-05-14, at 1:05 PM, Josef Bacik wrote:
> > On Mon, May 14, 2012 at 02:54:00PM -0400, J. Bruce Fields wrote:
> >> I don't think they're worried about the inode_inc_iversion() calls
> >> themselves, but the behavior of file_update_time():
> >>
> >> if (!timespec_equal(&inode->i_mtime, &now))
> >> sync_it = S_MTIME;
> >>
> >> if (!timespec_equal(&inode->i_ctime, &now))
> >> sync_it |= S_CTIME;
> >>
> >> if (IS_I_VERSION(inode))
> >> sync_it |= S_VERSION;
> >>
> >> if (!sync_it)
> >> return;
> >> ...
> >> mark_inode_dirty_sync(inode);
> >>
> >> So now mark_inode_dirty_sync() is called on every update, instead of
> >> merely on every update that sees a time change (so at most once a
> >> jiffy).
> >>
> >> So mark_inode_dirty_sync (and hence ->dirty_inode = ext4_dirty_inode)
> >> may get called more often if you're writing very frequently.
> >>
> >> I'm a bit surprised that's expected to add significant overhead to the
> >> write.
> >
> > It shouldn't, let's be honest, most systems aren't going to have such
> > a coarse jiffie counter that they'll be able to get away with doing
> > 2 calls to write() or ->page_mkwrite() in the same jiffie and skip the
> > update to mtime/ctime anyway. If they do they are damned lucky, and
> > again the amount of overhead added even if they are should be
> > negligible since 99% of us all incur the overhead from having
> > to update mtime/ctime anyway. Thanks,
>
> Seriously? The whole reason the above checks for timespec_equal()
> are there is to avoid calling mark_inode_dirty_sync() thousands of
> times per second. If doing write() calls in the same jiffie were
> so rare as you suggest then I don't think such an optimization
> would ever have appeared in the first place.
>

So here is the commit log

commit ce06e0b21d6732a2bab10a585a3ec6909499be28
Author: Andi Kleen <[email protected]>
Date: Fri Sep 18 13:05:48 2009 -0700

vfs: optimize touch_time() too

Do a similar optimization as earlier for touch_atime. Getting the lock in
mnt_get_write is relatively costly, so try all avenues to avoid it first.

This patch is careful to still only update inode fields inside the lock
region.

This didn't show up in benchmarks, but it's easy enough to do.

Notice that last bit? I'm sure maybe at some point in the future we'll be able
to see the overhead, but right now I highly doubt we can, and if we can I'd like
to see benchmarks before blanket dismissing a feature that would be super
helpful for people exporting nfs volumes. Thanks,

Josef

2012-05-15 14:43:47

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Tue 15-05-12 08:35:50, J. Bruce Fields wrote:
> On Tue, May 15, 2012 at 12:30:21PM +0200, Jan Kara wrote:
> > On Mon 14-05-12 19:54:32, J. Bruce Fields wrote:
> > > On Mon, May 14, 2012 at 05:33:04PM -0600, Andreas Dilger wrote:
> > > > I said as much in another reply - that once i_version is used on
> > > > a filesystem, it should be made "sticky" (i.e. permanently enabled
> > > > for that filesystem). However, until that time it shouldn't be
> > > > enabled just because it might one day be used.
> > > >
> > > > Even better than just blindly bumping the i_version on every change,
> > > > it would be better to have users of i_version (i.e. knfsd) flag the
> > > > inode with "needs i_version update" then read the version. When the
> > > > filesystem/VFS bumps i_version the next time it can clear this flag
> > > > and not update i_version again until after the next time i_version
> > > > is actually used.
> > >
> > > I really don't want to do anything more complicated than necessary.
> > >
> > > What would be the worst-case test for the extra inode dirtying, so we
> > > can see what the numbers actually are?
> > Something like:
> >
> > int fd, i;
> > struct timeval tv[2];
> >
> > fd = open("file", O_CREAT | O_RDWR, 0644);
> > if (fd < 0)
> > return 1;
> > for (i = 0; i < 1000000; i++) {
> > gettimeofday(tv);
> > tv[1] = tv[0];
> > if (futimes(fd, tv) < 0)
> > return 1;
> > }
> > return 0;
> >
> > And see how long does it take with and without i_version?
>
> The complaint I hear from Andreas is that we'll cause file_update_time()
> to call mark_inode_dirty() more often.
>
> I don't believe futimes() calls file_update_time().
Yeah, right. I didn't check and I was wrong...

> So maybe replace that futimes() by a one-byte write?
Yes, "pwrite(fd, "data", 1, 0);" should be then a proper test instead of
futimes().

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-05-15 17:33:42

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> On 2012-05-14, at 9:23 AM, J. Bruce Fields wrote:
> > On Mon, May 14, 2012 at 09:02:12AM -0600, Andreas Dilger wrote:
> >> On 2012-05-14, at 8:06, "J. Bruce Fields" <[email protected]> wrote:
> >>> knfsd needs i_version updates on, as will userspace nfs servers and
> >>> probably others.
> >>>
> >>> The only effects are that inode->i_version is bumped (under the i_lock)
> >>> in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
> >>> more frequently than once per jiffy on write (see file_update_time).
> >>> However the latter appears to be mostly a no-op in that case.
> >>
> >> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?
> >
> > There's no reason it should be, should it, if we already just dirtied
> > the inode a moment ago?
>
> Ideally not, but the way ext[34]_mark_inode_dirty() is implemented
> is that it copies the whole in-core inode to the on-disk inode every
> time it is marked dirty. That ensures that the on-disk inode is
> up-to-date when the journal flushes the blocks to disk, but is not
> an ideal implementation. It has been this way since the first ext3
> implementation was done.
>
> As a result, dirtying the inode very frequently for ext[34] is
> currently expensive and should be avoided.
>
> I _think_ that the ext4 metadata checksum patches have changed this
> to only flag the inode dirty and run a pre-commit callback to copy
> the in-core inode to the on-disk inode. I'm not sure what the
> current status of that patch is, nor how easily it could be split
> from that patch series and land separately.

I did some searching, found a couple of versions of the metadata
checksum patches, but no patch that looked like what you're describing.
Any idea where that is?

--b.

2012-05-15 18:05:44

by Marco Stornelli

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

Il 15/05/2012 15:28, Josef Bacik ha scritto:
> On Mon, May 14, 2012 at 03:27:47PM -0600, Andreas Dilger wrote:
>> On 2012-05-14, at 1:05 PM, Josef Bacik wrote:
>>> On Mon, May 14, 2012 at 02:54:00PM -0400, J. Bruce Fields wrote:
>>>> I don't think they're worried about the inode_inc_iversion() calls
>>>> themselves, but the behavior of file_update_time():
>>>>
>>>> if (!timespec_equal(&inode->i_mtime,&now))
>>>> sync_it = S_MTIME;
>>>>
>>>> if (!timespec_equal(&inode->i_ctime,&now))
>>>> sync_it |= S_CTIME;
>>>>
>>>> if (IS_I_VERSION(inode))
>>>> sync_it |= S_VERSION;
>>>>
>>>> if (!sync_it)
>>>> return;
>>>> ...
>>>> mark_inode_dirty_sync(inode);
>>>>
>>>> So now mark_inode_dirty_sync() is called on every update, instead of
>>>> merely on every update that sees a time change (so at most once a
>>>> jiffy).
>>>>
>>>> So mark_inode_dirty_sync (and hence ->dirty_inode = ext4_dirty_inode)
>>>> may get called more often if you're writing very frequently.
>>>>
>>>> I'm a bit surprised that's expected to add significant overhead to the
>>>> write.
>>>
>>> It shouldn't, let's be honest, most systems aren't going to have such
>>> a coarse jiffie counter that they'll be able to get away with doing
>>> 2 calls to write() or ->page_mkwrite() in the same jiffie and skip the
>>> update to mtime/ctime anyway. If they do they are damned lucky, and
>>> again the amount of overhead added even if they are should be
>>> negligible since 99% of us all incur the overhead from having
>>> to update mtime/ctime anyway. Thanks,
>>
>> Seriously? The whole reason the above checks for timespec_equal()
>> are there is to avoid calling mark_inode_dirty_sync() thousands of
>> times per second. If doing write() calls in the same jiffie were
>> so rare as you suggest then I don't think such an optimization
>> would ever have appeared in the first place.
>>
>

Only a really really stupid question (I don't know NFS protocol well
enough). In 3.3 kernel, I see that only ext4 uses MS_I_VERSION, so I
wonder: if i_version change it's needed for exportable fs and so for
nfs, other exportable fs? Is this only a particular problem for ext4? I
mean, it doesn't seems a blocking problem (or we could have a lot of
traffic on fs-devel :) ), it seems a "more compliant behavior". If this
considerations is right, I think the current behavior of ext4 is ok.

Marco

Marco

2012-05-15 18:50:37

by djwong

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Tue, May 15, 2012 at 01:33:40PM -0400, J. Bruce Fields wrote:
> On Mon, May 14, 2012 at 11:27:42AM -0600, Andreas Dilger wrote:
> > On 2012-05-14, at 9:23 AM, J. Bruce Fields wrote:
> > > On Mon, May 14, 2012 at 09:02:12AM -0600, Andreas Dilger wrote:
> > >> On 2012-05-14, at 8:06, "J. Bruce Fields" <[email protected]> wrote:
> > >>> knfsd needs i_version updates on, as will userspace nfs servers and
> > >>> probably others.
> > >>>
> > >>> The only effects are that inode->i_version is bumped (under the i_lock)
> > >>> in more places, and that ->dirty_inode(I_DIRTY_DATASYNC) may be called
> > >>> more frequently than once per jiffy on write (see file_update_time).
> > >>> However the latter appears to be mostly a no-op in that case.
> > >>
> > >> I thought this can have noticeable performance impact, since ext4_mark_inode_dirty() is quite heavyweight?
> > >
> > > There's no reason it should be, should it, if we already just dirtied
> > > the inode a moment ago?
> >
> > Ideally not, but the way ext[34]_mark_inode_dirty() is implemented
> > is that it copies the whole in-core inode to the on-disk inode every
> > time it is marked dirty. That ensures that the on-disk inode is
> > up-to-date when the journal flushes the blocks to disk, but is not
> > an ideal implementation. It has been this way since the first ext3
> > implementation was done.
> >
> > As a result, dirtying the inode very frequently for ext[34] is
> > currently expensive and should be avoided.
> >
> > I _think_ that the ext4 metadata checksum patches have changed this
> > to only flag the inode dirty and run a pre-commit callback to copy
> > the in-core inode to the on-disk inode. I'm not sure what the
> > current status of that patch is, nor how easily it could be split
> > from that patch series and land separately.
>
> I did some searching, found a couple of versions of the metadata
> checksum patches, but no patch that looked like what you're describing.
> Any idea where that is?

That _was_ going to be the basis of phase 2 of my metadata checksum patchset,
but since I've been moved to other projects, I don't see that on my plate in
the near future. :/

(tldr: It doesn't exist.)

--D
>
> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2012-05-15 19:18:12

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] ext4: turn on i_version updates by default

On Tue, May 15, 2012 at 07:59:11PM +0200, Marco Stornelli wrote:
> Only a really really stupid question (I don't know NFS protocol well
> enough). In 3.3 kernel, I see that only ext4 uses MS_I_VERSION, so I
> wonder: if i_version change it's needed for exportable fs and so for
> nfs, other exportable fs?

Yes, it's needed for others as well. I believe btrfs and xfs are both
adding it.

We're currently using ctime for the nfs change attribute. That's
effectively jiffy granularity. So to see the problem at a minimum you'd
need two writes to be processed within one jiffy, and a stat to come
between them. But that's a correctness problem, and we'd like to see it
fixed before it becomes more common.

More generally, it's useful to be able to ask whether a file changed
without rereading all its data, and a clock that registers every change
and is consistent across a filesystem sounds difficult to scale. We may
eventually find we need something like this outside nfs.

--b.