From: tytso@mit.edu
Subject: Re: i_version, NFSv4 change attribute
Date: Mon, 23 Nov 2009 06:48:31 -0500
Message-ID: <20091123114831.GA2532@thunk.org>
References: <20091122222047.GB21944@fieldses.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org
To: "J. Bruce Fields" <bfields@fieldses.org>
Content-Disposition: inline
In-Reply-To: <20091122222047.GB21944@fieldses.org>
Sender: linux-ext4-owner@vger.kernel.org

On Sun, Nov 22, 2009 at 05:20:47PM -0500, J. Bruce Fields wrote:
> However, the new i_version support is available only when the filesystem
> is mounted with the i_version mount option.  And the change attribute is
> required for completely correct NFSv4 operation, which we'd prefer to
> offer by default!
> 
> I recall having a conversation with Ted Ts'o about ways to do this
> without affecting non-NFS-exported filesystems: maybe by providing some
> sort of persistent flag on the superblock which the nfsd code could turn
> on automatically if desired?
> 
> But first it would be useful to know whether there is in fact any
> disadvantage to just having the i_version on all the time.  Jean Noel
> Cordenner did some tests a couple years ago:
> 
> 	http://www.bullopensource.org/ext4/change_attribute/index.html
> 
> and didn't find any significant difference.  But I don't know if those
> results were convincing (I don't understand what we're looking for
> here); any suggestions for workloads that would exercise the worst
> cases?

Hmmm.... the workload that would probably take the biggest hit would be
one where we have multiple processes/threads running on different CPUs
that are modifying the inode in parallel (i.e., the sort of workload
that a database with multiple clients would naturally see).

The test which Bull did above used a workload which was very heavy on
creating and destroying files, and it was only a two processor system
(not that it mattered; it looks like the fileop benchmark was
single-threaded anyway).  The test I would do is something like a 4 or
8 processor test, with lots of parallel I/O to the same file (at which
point we would probably end up bottlenecking on inode->i_lock).
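
A rough sketch of that kind of workload, assuming POSIX threads and
pwrite(); the file path, block size, and every name below are made up
for illustration, and timing (e.g. with clock_gettime() around the
call) is left to the caller:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLOCK 4096
#define SKETCH_MAX_THREADS 64

struct writer_arg {
	int fd;		/* shared descriptor: every thread hits the same inode */
	int id;		/* thread index, also selects the block it rewrites */
	int nwrites;	/* how many times to rewrite that block */
	long written;	/* bytes successfully written by this thread */
};

static void *writer_main(void *p)
{
	struct writer_arg *a = p;
	char buf[SKETCH_BLOCK];

	memset(buf, 'a' + a->id % 26, sizeof(buf));
	for (int i = 0; i < a->nwrites; i++) {
		ssize_t n = pwrite(a->fd, buf, sizeof(buf),
				   (off_t)a->id * SKETCH_BLOCK);
		if (n > 0)
			a->written += n;
	}
	return NULL;
}

/* Returns the total number of bytes written, or -1 on setup failure. */
static long run_parallel_writers(const char *path, int nthreads, int nwrites)
{
	pthread_t tid[SKETCH_MAX_THREADS];
	struct writer_arg arg[SKETCH_MAX_THREADS];
	long total = 0;
	int started = 0;

	assert(path != NULL);
	if (nthreads < 1 || nthreads > SKETCH_MAX_THREADS)
		return -1;
	int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	for (int i = 0; i < nthreads; i++) {
		arg[i] = (struct writer_arg){ fd, i, nwrites, 0 };
		if (pthread_create(&tid[i], NULL, writer_main, &arg[i]) != 0)
			break;
		started++;
	}
	for (int i = 0; i < started; i++) {
		pthread_join(tid[i], NULL);
		total += arg[i].written;
	}
	close(fd);
	return total;
}
```

Running this with 4 or 8 threads on a file system mounted with and
without i_version, and comparing wall-clock time, would be one way to
look for the contention described above.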

It would seem to me that a simple way of fixing this would be to use
an atomic64_t for inode->i_version, so we don't have to take the
spinlock and bounce cache lines each time i_version gets updated.

What we might decide, at the end of the day, is that for common fs
workloads no one is going to notice, and for the parallel intensive
workloads (i.e., databases), people will be tuning for this anyway, so
we can make i_version the default, and noi_version an option people
can use to turn off i_version if they are optimizing for the database
workload.

A similar tuning knob that I should add is one that allows us to set
a custom value for sb->s_time_gran, so that we don't have to dirty the
inode and engage the jbd2 machinery after *every* single write.  Once
I add that, or if you use i_version on a file system with a 128-byte
inode so the mtime update granularity is a second, I suspect the cost
of i_version will be especially magnified, and the database people
will very much want to turn off i_version.
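
To illustrate why a coarser granularity saves work, here is a small
sketch, assuming the granularity is expressed in nanoseconds and is at
most one second; struct sketch_time and both helpers are hypothetical
and only mimic the idea, not the kernel's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timestamp: seconds plus nanoseconds (0..999999999). */
struct sketch_time {
	int64_t sec;
	int64_t nsec;
};

/* Round a timestamp down to a multiple of gran_ns (1 ns up to 1 s). */
static struct sketch_time sketch_trunc(struct sketch_time t, int64_t gran_ns)
{
	assert(gran_ns >= 1 && gran_ns <= 1000000000LL);
	t.nsec -= t.nsec % gran_ns;
	return t;
}

/*
 * An mtime update only needs to dirty the inode when the truncated
 * "now" differs from the mtime already stored.
 */
static bool sketch_mtime_needs_update(struct sketch_time old_mtime,
				      struct sketch_time now,
				      int64_t gran_ns)
{
	struct sketch_time t = sketch_trunc(now, gran_ns);

	return t.sec != old_mtime.sec || t.nsec != old_mtime.nsec;
}
```

With a one-second granularity, every write within the same second
leaves mtime alone, so an unconditional i_version bump would be the
only thing still dirtying the inode on each write.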

And that brings up another potential compromise --- what if we only
update i_version every n milliseconds?  That way if the file is being
modified to the tune of hundreds of thousands of updates a second,
NFSv4 clients will see a change fairly quickly, within n milliseconds,
but we won't be incrementing i_version at incredibly high rates.  I
suspect that would violate NFSv4 protocol specs somewhere, but would
it cause seriously noticeable breakage that would be user visible?  If
not, maybe that's something we should allow the user to set, perhaps
not as the default?  Just a thought.

Regards,

							- Ted