2009-11-22 22:20:47

by J. Bruce Fields

Subject: i_version, NFSv4 change attribute

Since commit c654b8a9cba6002aad1c01919e4928a79a4a6dcf, the NFS server has
made use of ext4's i_version when available.

However, the new i_version support is available only when the filesystem
is mounted with the i_version mount option. And the change attribute is
required for completely correct NFSv4 operation, which we'd prefer to
offer by default!

I recall having a conversation with Ted Ts'o about ways to do this
without affecting non-NFS-exported filesystems: maybe by providing some
sort of persistent flag on the superblock which the nfsd code could turn
on automatically if desired?

But first it would be useful to know whether there is in fact any
disadvantage to just having the i_version on all the time. Jean Noel
Cordenner did some tests a couple years ago:

http://www.bullopensource.org/ext4/change_attribute/index.html

and didn't find any significant difference. But I don't know if those
results were convincing (I don't understand what we're looking for
here); any suggestions for workloads that would exercise the worst
cases?

--b.


2009-11-23 11:48:30

by Theodore Ts'o

Subject: Re: i_version, NFSv4 change attribute

On Sun, Nov 22, 2009 at 05:20:47PM -0500, J. Bruce Fields wrote:
> However, the new i_version support is available only when the filesystem
> is mounted with the i_version mount option. And the change attribute is
> required for completely correct NFSv4 operation, which we'd prefer to
> offer by default!
>
> I recall having a conversation with Ted Ts'o about ways to do this
> without affecting non-NFS-exported filesystems: maybe by providing some
> sort of persistent flag on the superblock which the nfsd code could turn
> on automatically if desired?
>
> But first it would be useful to know whether there is in fact any
> disadvantage to just having the i_version on all the time. Jean Noel
> Cordenner did some tests a couple years ago:
>
> http://www.bullopensource.org/ext4/change_attribute/index.html
>
> and didn't find any significant difference. But I don't know if those
> results were convincing (I don't understand what we're looking for
> here); any suggestions for workloads that would exercise the worst
> cases?

Hmmm.... the workload that would probably see the biggest hit would be
one where we have multiple processes/threads running on different CPUs
that are modifying the inode in parallel (i.e., the sort of workload
that a database with multiple clients would naturally see).

The test which Bull did above used a workload which was very heavy on
creating and destroying files, and it was only a two processor system
(not that it mattered; it looks like the fileop benchmark was
single-threaded anyway). The test I would do is something like a 4 or
8 processor test, with lots of parallel I/O to the same file (at which
point we would probably end up bottlenecking on inode->i_lock).

It would seem to me that a simple way of fixing this would be to use
atomic64 type for inode->i_version, so we don't have to take the
spinlock and bounce cache lines each time i_version gets updated.
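
For reference, the increment helper in include/linux/fs.h currently
looks roughly like this:

    static inline void inode_inc_iversion(struct inode *inode)
    {
            spin_lock(&inode->i_lock);
            inode->i_version++;
            spin_unlock(&inode->i_lock);
    }

so what I have in mind is only an untested sketch along these lines,
assuming inode->i_version itself is converted to an atomic64_t
everywhere it is read or written:

    static inline void inode_inc_iversion(struct inode *inode)
    {
            /* lockless increment; no inode->i_lock round trip per update */
            atomic64_inc(&inode->i_version);
    }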

What we might decide, at the end of the day, is that for common fs
workloads no one is going to notice, and for the parallel intensive
workloads (i.e., databases), people will be tuning for this anyway, so
we can make i_version the default, and noi_version an option people
can use to turn off i_version if they are optimizing for the database
workload.

A similar tuning knob that I should add is one that allows us to set
a custom value for sb->s_time_gran, so that we don't have to dirty the
inode and engage the jbd2 machinery after *every* single write. Once
I add that, or if you use i_version on a file system with a 128-byte
inode so the mtime update granularity is a second, I suspect the cost
of i_version will be especially magnified, and the database people
will very much want to turn off i_version.
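
To spell out why s_time_gran matters here: the VFS only dirties the
inode when the granularity-truncated timestamp actually moves.
Simplified from current_fs_time() and file_update_time(), not the
literal code:

    /* the mtime we would store is truncated to the fs granularity */
    struct timespec now = timespec_trunc(current_kernel_time(),
                                         inode->i_sb->s_time_gran);

    /* only if the truncated time differs do we dirty the inode and,
     * on ext4, engage jbd2 */
    if (!timespec_equal(&inode->i_mtime, &now)) {
            inode->i_mtime = now;
            mark_inode_dirty_sync(inode);
    }

With s_time_gran at one second, back-to-back writes within the same
second see an identical truncated timestamp and skip the update
entirely.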

And that brings up another potential compromise --- what if we only
update i_version every n milliseconds? That way if the file is being
modified to the tune of hundreds of thousands of updates a second,
NFSv4 clients will see a change fairly quickly, within n milliseconds,
but we won't be incrementing i_version at incredibly high rates. I
suspect that would violate NFSv4 protocol specs somewhere, but would
it cause seriously noticeable breakage that would be user visible? If
not, maybe that's something we should allow the user to set, perhaps
not as the default? Just a thought.
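
Very roughly, I'm thinking of something like the following, where
i_version_jiffies is a hypothetical new per-inode field and the
interval would come from a mount option or sysctl; completely
untested:

    static void inode_maybe_inc_iversion(struct inode *inode,
                                         unsigned int interval_msecs)
    {
            unsigned long interval = msecs_to_jiffies(interval_msecs);

            /* bump the counter at most once per interval */
            if (time_after(jiffies, inode->i_version_jiffies + interval)) {
                    inode->i_version_jiffies = jiffies;
                    inode_inc_iversion(inode);
            }
    }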

Regards,

- Ted

2009-11-23 16:44:01

by J. Bruce Fields

Subject: Re: i_version, NFSv4 change attribute

On Mon, Nov 23, 2009 at 06:48:31AM -0500, Theodore Ts'o wrote:
> On Sun, Nov 22, 2009 at 05:20:47PM -0500, J. Bruce Fields wrote:
> > However, the new i_version support is available only when the filesystem
> > is mounted with the i_version mount option. And the change attribute is
> > required for completely correct NFSv4 operation, which we'd prefer to
> > offer by default!
> >
> > I recall having a conversation with Ted Ts'o about ways to do this
> > without affecting non-NFS-exported filesystems: maybe by providing some
> > sort of persistent flag on the superblock which the nfsd code could turn
> > on automatically if desired?
> >
> > But first it would be useful to know whether there is in fact any
> > disadvantage to just having the i_version on all the time. Jean Noel
> > Cordenner did some tests a couple years ago:
> >
> > http://www.bullopensource.org/ext4/change_attribute/index.html
> >
> > and didn't find any significant difference. But I don't know if those
> > results were convincing (I don't understand what we're looking for
> > here); any suggestions for workloads that would exercise the worst
> > cases?
>
> Hmmm.... the workload that would probably see the biggest hit would be
> one where we have multiple processes/threads running on different CPUs
> that are modifying the inode in parallel (i.e., the sort of workload
> that a database with multiple clients would naturally see).

Got it, thanks. Is there an existing easy-to-setup workload I could
start with, or would it be sufficient to try the simplest possible code
that met the above description? (E.g., fork a process for each CPU,
each just overwriting byte 0 as fast as possible, and count total writes
performed per second?)
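
Something like the following is what I had in mind (user-space sketch,
minimal error handling; a real run would probably check the clock less
often than once per write):

    #define _XOPEN_SOURCE 500
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <time.h>
    #include <sys/wait.h>

    #define SECONDS 10

    int main(int argc, char **argv)
    {
            int ncpus = sysconf(_SC_NPROCESSORS_ONLN);
            int fd = open(argc > 1 ? argv[1] : "testfile",
                          O_RDWR | O_CREAT, 0644);
            int i;

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* one writer per CPU, all rewriting byte 0 of the same file */
            for (i = 0; i < ncpus; i++) {
                    if (fork() == 0) {
                            unsigned long long count = 0;
                            time_t stop = time(NULL) + SECONDS;
                            char c = 0;

                            while (time(NULL) < stop) {
                                    pwrite(fd, &c, 1, 0);
                                    count++;
                            }
                            printf("writer %d: %llu writes/sec\n", i,
                                   count / SECONDS);
                            _exit(0);
                    }
            }
            while (wait(NULL) > 0)
                    ;
            return 0;
    }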

> The test which Bull did above used a workload which was very heavy on
> creating and destroying files, and it was only a two processor system
> (not that it mattered; it looks like the fileop benchmark was
> single-threaded anyway). The test I would do is something like a 4 or
> 8 processor test, with lots of parallel I/O to the same file (at which
> point we would probably end up bottlenecking on inode->i_lock).
>
> It would seem to me that a simple way of fixing this would be to use
> atomic64 type for inode->i_version, so we don't have to take the
> spinlock and bounce cache lines each time i_version gets updated.

The current locking does seem like overkill.

> What we might decide, at the end of the day, is that for common fs
> workloads no one is going to notice, and for the parallel intensive
> workloads (i.e., databases), people will be tuning for this anyway, so
> we can make i_version the default, and noi_version an option people
> can use to turn off i_version if they are optimizing for the database
> workload.
>
> A similar tuning knob that I should add is one that allows us to set
> a custom value for sb->s_time_gran, so that we don't have to dirty the
> inode and engage the jbd2 machinery after *every* single write. Once
> I add that, or if you use i_version on a file system with a 128-byte
> inode so the mtime update granularity is a second, I suspect the cost
> of i_version will be especially magnified, and the database people
> will very much want to turn off i_version.
>
> And that brings up another potential compromise --- what if we only
> update i_version every n milliseconds? That way if the file is being
> modified to the tune of hundreds of thousands of updates a second,
> NFSv4 clients will see a change fairly quickly, within n milliseconds,
> but we won't be incrementing i_version at incredibly high rates. I
> suspect that would violate NFSv4 protocol specs somewhere, but would
> it cause seriously noticeable breakage that would be user visible? If
> not, maybe that's something we should allow the user to set, perhaps
> not as the default? Just a thought.

That knob would affect the probability of the breakage, but not
necessarily the seriousness. The race is:

- clock tick
      write
      read and check i_version
      write
- clock tick

If the second write doesn't modify the i_version, the client will never
see it. (Unless a later write bumps i_version again.)

If the side we want to optimize is the modifications, I wonder if we
could do all the i_version increments on *read* of i_version?:

- writes (and other inode modifications) set an "i_version_dirty"
flag.
- reads of i_version clear the i_version_dirty flag, increment
i_version, and return the result.

As long as the reader sees i_version_dirty set only after it sees the
write that caused it, I think it all works?
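
In rough pseudo-kernel-code (the i_version_flags field and the
I_VERSION_DIRTY bit are made up for illustration, and the locking is
hand-waved):

    #define I_VERSION_DIRTY 0

    /* write path: instead of bumping i_version, just note it's stale */
    static inline void inode_dirty_iversion(struct inode *inode)
    {
            set_bit(I_VERSION_DIRTY, &inode->i_version_flags);
    }

    /* read path (e.g. nfsd filling in the change attribute) */
    static inline u64 inode_read_iversion(struct inode *inode)
    {
            if (test_and_clear_bit(I_VERSION_DIRTY, &inode->i_version_flags))
                    inode_inc_iversion(inode);
            return inode->i_version;
    }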

--b.

2009-11-23 16:59:07

by J. Bruce Fields

Subject: Re: i_version, NFSv4 change attribute

On Mon, Nov 23, 2009 at 11:44:45AM -0500, J. Bruce Fields wrote:
> If the side we want to optimize is the modifications, I wonder if we
> could do all the i_version increments on *read* of i_version?:
>
> - writes (and other inode modifications) set an "i_version_dirty"
> flag.
> - reads of i_version clear the i_version_dirty flag, increment
> i_version, and return the result.
>
> As long as the reader sees i_version_dirty set only after it sees the
> write that caused it, I think it all works?

Except I'm a little confused about how i_version behaves (and should
behave) on reboot. Assuming it's currently correct, I think it's enough
to also increment i_version as above when writing out the inode.

--b.

2009-11-23 18:11:19

by Trond Myklebust

Subject: Re: i_version, NFSv4 change attribute

On Mon, 2009-11-23 at 11:44 -0500, J. Bruce Fields wrote:
> If the side we want to optimize is the modifications, I wonder if we
> could do all the i_version increments on *read* of i_version?:
>
> - writes (and other inode modifications) set an "i_version_dirty"
> flag.
> - reads of i_version clear the i_version_dirty flag, increment
> i_version, and return the result.
>
> As long as the reader sees i_version_dirty set only after it sees the
> write that caused it, I think it all works?

That probably won't make much of a difference to performance. Most NFSv4
clients will have every WRITE followed by a GETATTR operation in the
same compound, so your i_version_dirty flag will always immediately get
cleared.

The question is, though, why does the jbd2 machinery need to be engaged
on _every_ write? The NFS clients don't care if we lose an i_version
count due to a sudden server reboot, since that will trigger a rewrite
of the dirty data anyway once the server comes back up again.
As long as the i_version is guaranteed to be written to stable storage
on a successful call to fsync(), then the NFS data integrity
requirements are fully satisfied.

Trond


2009-11-23 18:19:51

by J. Bruce Fields

Subject: Re: i_version, NFSv4 change attribute

On Mon, Nov 23, 2009 at 01:11:19PM -0500, Trond Myklebust wrote:
> On Mon, 2009-11-23 at 11:44 -0500, J. Bruce Fields wrote:
> > If the side we want to optimize is the modifications, I wonder if we
> > could do all the i_version increments on *read* of i_version?:
> >
> > - writes (and other inode modifications) set an "i_version_dirty"
> > flag.
> > - reads of i_version clear the i_version_dirty flag, increment
> > i_version, and return the result.
> >
> > As long as the reader sees i_version_dirty set only after it sees the
> > write that caused it, I think it all works?
>
> That probably won't make much of a difference to performance. Most NFSv4
> clients will have every WRITE followed by a GETATTR operation in the
> same compound, so your i_version_dirty flag will always immediately get
> cleared.

I was only thinking about non-NFS performance.

> The question is, though, why does the jbd2 machinery need to be engaged
> on _every_ write?

Is it?

I thought I remembered a journaling issue from previous discussions, but
Ted seemed concerned just about the overhead of an additional spinlock,
and looking at the code, the only test of I_VERSION that I can see is in
ext4_mark_iloc_dirty(), which indeed just takes a spinlock and updates
the i_version.

--b.

> The NFS clients don't care if we lose an i_version count due to a
> sudden server reboot, since that will trigger a rewrite of the dirty
> data anyway once the server comes back up again. As long as the
> i_version is guaranteed to be written to stable storage on a
> successful call to fsync(), then the NFS data integrity requirements
> are fully satisfied.

2009-11-23 18:34:59

by Theodore Ts'o

Subject: Re: i_version, NFSv4 change attribute

On Mon, Nov 23, 2009 at 11:44:45AM -0500, J. Bruce Fields wrote:
>
> Got it, thanks. Is there an existing easy-to-setup workload I could
> start with, or would it be sufficient to try the simplest possible code
> that met the above description? (E.g., fork a process for each CPU,
> each just overwriting byte 0 as fast as possible, and count total writes
> performed per second?)

We were actually talking about this on the ext4 call today. The
problem is that there isn't a ready-made benchmark that will easily
measure this. A database benchmark would show it (and we may have some
results from the DB2 folks indicating the cost of updating the
timestamps with nanosecond granularity), but those of course aren't
easy to run.

The simplest possible workload that you have proposed is the worst
case, and I have no doubt that it will show the contention on
inode->i_lock from inode_inc_iversion(), and I bet we'll see a big
improvement when we change inode->i_version to be an atomic64 type.
It will probably also show the overhead of ext4_mark_inode_dirty()
being called all the time.

Perhaps a slightly fairer and more realistic benchmark would do a
write to byte zero followed by an fsync(), and measure both the CPU
time per write and the writes per second. Either will do the
job, though, and I'd recommend grabbing oprofile and lockstat
measurements to see what bottlenecks we are hitting with that
workload.
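
The core loop I have in mind is something like this (sketch only; run
one instance per CPU against the same file, and take writes per second
from running it under time(1)):

    #define _XOPEN_SOURCE 500
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/resource.h>

    int main(void)
    {
            int fd = open("testfile", O_RDWR | O_CREAT, 0644);
            unsigned long i, n = 100000;
            char c = 0;
            struct rusage ru;

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            for (i = 0; i < n; i++) {
                    pwrite(fd, &c, 1, 0);   /* overwrite byte 0 ... */
                    fsync(fd);              /* ... and force it out */
            }
            getrusage(RUSAGE_SELF, &ru);
            printf("%lu writes: %ld.%06lds user, %ld.%06lds sys CPU\n",
                   n, (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
                   (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
            return 0;
    }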

> If the side we want to optimize is the modifications, I wonder if we
> could do all the i_version increments on *read* of i_version?:
>
> - writes (and other inode modifications) set an "i_version_dirty"
> flag.
> - reads of i_version clear the i_version_dirty flag, increment
> i_version, and return the result.
>
> As long as the reader sees i_version_dirty set only after it sees the
> write that caused it, I think it all works?

I can see two potential problems with that. One is that this implies
that the read needs to kick off a journal operation, which means that
the act of reading i_version might cause the caller to sleep (not all
the time, but in some cases, such as if we get unlucky and need to do
a journal commit or checkpoint before we can update i_version). I
don't know if the NFSv4 server code would be happy with that!

The second problem is what happens if we crash before a read happens.

On the ext4 call, Andreas suggested trying to do this work at commit
time. This would either mean that i_version would only get updated at
the commit interval (default: 5 seconds), or that i_version might be
updated more frequently than that, but we would defer as much as
possible to the commit time, since it's already the case that if we
crash before the commit happens, i_version could end up going
backwards (since we may have returned i_version numbers that were
never committed).

I'm not entirely convinced how much this will actually help: even if
we don't do the copy to the journaled bufferheads on every sys_write(),
we still have to reserve space in the transaction for the inode update,
and we will end up taking various journal locks on every sys_write()
anyway. We'll have to code it up to see whether or not it helps, and
how painful it is to actually implement.

What I'm hoping we'll find is that for a typical desktop workload
i_version updates don't really hurt, so we can enable them by default
for desktop workloads. My concern is that with database workloads,
i_version updates may be especially hurtful for certain high-dollar-value
(but rare) benchmarks, such as TPC-C/TPC-H.

- Ted

2009-11-23 18:37:44

by Trond Myklebust

Subject: Re: i_version, NFSv4 change attribute

On Mon, 2009-11-23 at 13:19 -0500, J. Bruce Fields wrote:
> On Mon, Nov 23, 2009 at 01:11:19PM -0500, Trond Myklebust wrote:
> > On Mon, 2009-11-23 at 11:44 -0500, J. Bruce Fields wrote:
> > > If the side we want to optimize is the modifications, I wonder if we
> > > could do all the i_version increments on *read* of i_version?:
> > >
> > > - writes (and other inode modifications) set an "i_version_dirty"
> > > flag.
> > > - reads of i_version clear the i_version_dirty flag, increment
> > > i_version, and return the result.
> > >
> > > As long as the reader sees i_version_dirty set only after it sees the
> > > write that caused it, I think it all works?
> >
> > That probably won't make much of a difference to performance. Most NFSv4
> > clients will have every WRITE followed by a GETATTR operation in the
> > same compound, so your i_version_dirty flag will always immediately get
> > cleared.
>
> I was only thinking about non-NFS performance.

I would think that running a high-performance database _and_ NFS server
on the same machine would tend to be very much a corner case anyway.
In most setups I'm aware of, the database and NFS server tend to be
completely separate machines.

> > The question is, though, why does the jbd2 machinery need to be engaged
> > on _every_ write?
>
> Is it?

See Ted's email. As I read it, his concern was that if they allow people
to reduce the a/m/ctime resolution, then the i_version would still
force them to dirty the inode on every write...

Trond


2009-11-23 18:51:05

by Theodore Ts'o

Subject: Re: i_version, NFSv4 change attribute

On Mon, Nov 23, 2009 at 01:19:51PM -0500, J. Bruce Fields wrote:
> > The question is, though, why does the jbd2 machinery need to be engaged
> > on _every_ write?
>
> Is it?
>
> I thought I remembered a journaling issue from previous discussions, but
> Ted seemed concerned just about the overhead of an additional
> spinlock, and looking at the code, the only test of I_VERSION that I can
> see indeed is in ext4_mark_iloc_dirty(), and indeed just takes a
> spinlock and updates the i_version.

There are two concerns. One is the inode->i_lock overhead: at the
time we added i_version, the atomic64 type didn't exist yet, so the
only simple way it could have been implemented was by taking the
spinlock. This we can fix, and I think it's a no-brainer to switch
it to an atomic64, especially on the most common Intel platforms.

The second problem is the jbd2 machinery, which gets engaged when the
inode changes, which in the case of sys_write() means whenever i_version
or i_mtime gets changed. At the moment, if we are using a 256-byte inode
with ext4, we will be updating i_mtime on every single write, and so
when ext4_setattr() (called from notify_change()) notices that i_mtime
has changed, we engage the entire jbd2 machinery for every single
write.

This is not the case for a 128-byte inode, since there sb->s_time_gran
is set to one second, so we would only be updating the inode and
engaging the jbd2 machinery once a second. That holds for both ext3 and
ext4 with 128-byte inodes.
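
I.e., roughly this logic at mount time (illustrative only, not the
literal ext4_fill_super() code):

    /* old 128-byte inodes have no room for nanosecond timestamps */
    if (EXT4_INODE_SIZE(sb) > EXT4_GOOD_OLD_INODE_SIZE)
            sb->s_time_gran = 1;            /* nanoseconds */
    else
            sb->s_time_gran = NSEC_PER_SEC; /* one second */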

Now, all of this having been said, Fedora 11 and 12 have been using
ext4 as the default filesystem, and for generic desktop usage, people
haven't been screaming about the increased CPU overhead implied by
engaging the jbd2 machinery on every sys_write().

However, we have had a report that some enterprise database developers
have noticed the increased overhead in ext4, and this is on our list
of things that require some performance tuning. Hence my comments
about a mount option to adjust s_time_gran for the benefit of database
workloads, and once we have that mount option, since enabling i_version
would mean once again needing to update the inode at every single
write(2) call, we would be back with the same problem.

Maybe we can find a way to be more clever about doing some (but not
all) of the jbd2 work on each sys_write(), and deferring as much as
possible to the commit handling. We need to do some investigating to
see if that's possible. Even if it isn't, though, my gut tells me
that we will probably be able to enable i_version by default for
desktop workloads, and tell database server folks that they should
mount with the mount options "noi_version,time_gran=1s", or some such.

I'd like to do some testing to confirm my intuition first, of course,
but that's how I'm currently leaning. Does that make sense?

Regards,

- Ted

2009-11-25 20:47:56

by J. Bruce Fields

Subject: Re: i_version, NFSv4 change attribute

On Mon, Nov 23, 2009 at 01:51:05PM -0500, Theodore Ts'o wrote:
> Now, all of this having been said, Fedora 11 and 12 have been using
> ext4 as the default filesystem, and for generic desktop usage, people
> haven't been screaming about the increased CPU overhead implied by
> engaging the jbd2 machinery on every sys_write().
>
> However, we have had a report that some enterprise database developers
> have noticed the increased overhead in ext4, and this is on our list
> of things that require some performance tuning. Hence my comments
> about a mount option to adjust s_time_gran for the benefit of database
> workloads, and once we have that mount option, since enabling i_version
> would mean once again needing to update the inode at every single
> write(2) call, we would be back with the same problem.
>
> Maybe we can find a way to be more clever about doing some (but not
> all) of the jbd2 work on each sys_write(), and deferring as much as
> possible to the commit handling. We need to do some investigating to
> see if that's possible. Even if it isn't, though, my gut tells me
> that we will probably be able to enable i_version by default for
> desktop workloads, and tell database server folks that they should
> mount with the mount options "noi_version,time_gran=1s", or some such.
>
> I'd like to do some testing to confirm my intuition first, of course,
> but that's how I'm currently leaning. Does that make sense?

I think so, thanks.

So do I have this todo list approximately right?:

1. Use an atomic type instead of a spinlock for i_version, and
do some before-and-after benchmarking of writes (following your
suggestions in
http://marc.info/?l=linux-ext4&m=125900130605891&w=2)

2. Turn on i_version by default. (At this point it shouldn't be
making things any worse than the high-resolution timestamps
are.)

3. Find someone to run database benchmarks, and work on
noi_version,time_gran=1s (or whatever) options for their case.

I wish I could volunteer at least for #1, but embarrassingly I don't have
much more than dual-core machines lying around right now to test with.

--b.