Date: Wed, 18 Aug 2010 13:32:03 -0400
From: "J. Bruce Fields" <bfields@fieldses.org>
To: Neil Brown <neilb@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>,
        "Patrick J. LoPresti" <lopresti@gmail.com>,
        Andi Kleen <andi@firstfloor.org>, linux-fsdevel@vger.kernel.org,
        linux-nfs@vger.kernel.org, linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: Proposal: Use hi-res clock for file timestamps
Message-ID: <20100818173203.GC32430@fieldses.org>
References: <AANLkTimnyXKahtjaFeSsgcq=xMy-pP3na1jidQhZ-dt2@mail.gmail.com>
 <87aaolwar8.fsf@basil.nowhere.org>
 <20100817174134.GA23176@fieldses.org>
 <20100817182920.GD18161@basil.fritz.box>
 <20100817190447.GA28049@fieldses.org>
 <AANLkTi=w1UA5ZZDBigpxMiL7A7DnbnQhLkg62JZpC6Ri@mail.gmail.com>
 <20100817203941.729830b7@lxorguk.ukuu.org.uk>
 <20100817192937.GD26609@fieldses.org>
 <20100818155359.66b9ddb6@notabene>
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <20100818155359.66b9ddb6@notabene>
Sender: linux-nfs-owner@vger.kernel.org
MIME-Version: 1.0

On Wed, Aug 18, 2010 at 03:53:59PM +1000, Neil Brown wrote:
> I'm not sure you even want to pay for a per-filesystem atomic access when
> updating mtime.  mnt_want_write - called at the same time - seems to go to
> some lengths to avoid an atomic operation.
> 
> I think that nfsd should be the only place that has to pay the atomic
> penalty, as it is where the need is.
> 
> I imagine something like this:
>  - Create a global struct timespec which is protected by a seqlock
>    Call it current_nfsd_time or similar.
>  - file_update_time reads this and uses it if it is newer than
>    current_fs_time.
>  - nfsd updates it whenever it reads an mtime out of an inode that matches
>    current_fs_time to the granularity of 1/HZ.

We can also skip the update whenever current_nfsd_time is greater than
the inode's mtime--that's enough to ensure that the next
file_update_time() call will get a time different from the inode's
current mtime.

And that means that a sequence like

	file_update_time()
	N nfsd_getattr()'s

doesn't make N updates to current_nfsd_time, when only 1 was necessary.
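
To make that concrete, here is a rough user-space model of the scheme with
the skip rule added.  It is only a sketch: every name in it (model_inode,
model_file_update_time, model_nfsd_getattr, coarse_now, clock_updates) is
made up for illustration, times are plain nanosecond counts, HZ=1000 is
assumed, the seqlock is left out, and "bump" is read as "one ns past the
larger of the two clocks", which is one interpretation of the rule rather
than anything settled in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
typedef int64_t ns_t;

#define NS_PER_JIFFY 1000000LL	/* assumption: HZ=1000 */

static ns_t coarse_now;		/* stands in for current_kernel_time */
static ns_t current_nfsd_time;	/* proposed global clock, seqlock omitted */
static unsigned clock_updates;	/* counts writes to the global clock */

struct model_inode {
	ns_t mtime;
};

static ns_t later_of(ns_t a, ns_t b)
{
	return a > b ? a : b;
}

/* Writer side: use the nfsd clock when it is newer than the coarse clock. */
static void model_file_update_time(struct model_inode *inode)
{
	inode->mtime = later_of(current_nfsd_time, coarse_now);
}

/* Reader side: return mtime, bumping the global clock only when needed. */
static ns_t model_nfsd_getattr(const struct model_inode *inode)
{
	/* mtime falls in the current jiffy (coarse_now is jiffy-aligned) */
	bool recent = inode->mtime >= coarse_now;

	/* Skip rule: a clock already past mtime needs no further update. */
	if (recent && current_nfsd_time <= inode->mtime) {
		ns_t next = later_of(current_nfsd_time, coarse_now) + 1;

		/* Never let the clock run a full jiffy ahead of real time. */
		if (next < coarse_now + NS_PER_JIFFY) {
			current_nfsd_time = next;
			clock_updates++;
		}
	}
	return inode->mtime;
}
```

In this model, one model_file_update_time() followed by three
model_nfsd_getattr() calls writes the global clock once, and the next
model_file_update_time() then stores an mtime 1 ns later than the one
those getattrs returned.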

>    If the current value is before current_kernel_time, it is set to
>    current_kernel_time, otherwise tv_nsec is incremented - unless that
>    increases beyond jiffies_to_usec(1)*1000 beyond current_kernel_time.

... which would only happen on hardware that could process a getattr and
a data update per nanosecond continuously for a jiffy.
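
As a back-of-the-envelope check (helper names below are made up, and the
only assumption is that each getattr/update pair costs one 1 ns increment):
a jiffy holds 10^9/HZ nanoseconds, so using them all up within one jiffy
takes 10^9 pairs per second whatever HZ is.

```c
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000LL

/* Nanoseconds of headroom in one jiffy at the given tick rate. */
static int64_t model_ns_per_jiffy(int64_t hz)
{
	return MODEL_NSEC_PER_SEC / hz;
}

/*
 * getattr/update pairs per second needed to use up a whole jiffy of
 * 1 ns increments: (pairs per jiffy) * (jiffies per second).
 */
static int64_t model_pairs_per_sec_to_overflow(int64_t hz)
{
	return model_ns_per_jiffy(hz) * hz;
}
```

For HZ=100, 250 and 1000 the headroom is 10,000,000, 4,000,000 and
1,000,000 ns per jiffy respectively, and the required rate comes out to
10^9 pairs per second in every case.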

>  - the global 'struct timespec' is zeroed whenever system time is set
>    backwards.

OK, got it, I think: so this is the same as a global version of Alan's
clock, except that the extra ticks only happen when they need to.

The properties it satisfies:

	- It's still a single global clock, so it's consistent between
	  files.
	- It degenerates to jiffies in the absence of getattr's from
	  nfsd.
	- It need only invalidate the other cpus' cached value of the
	  clock on the first getattr of a file that follows less than a
	  jiffy after an update of the file's data.
	- Absent utime(), time going backwards, or futuristic hardware,
	  it guarantees that two nfsd reads of an inode's mtime will
	  return different values iff the inode's data was modified in
	  between the two.

Shortcomings:

	- The clock advances only in units of either 1 jiffy or 1 ns.
	  This will look odd.  But when the alternative is units of 1
	  jiffy or 0 ns, it seems an improvement....
	- A slowdown due to file_update_time() marking inodes dirty
	  more frequently?
	- Doesn't help with ext3.  Oh well.

Would the extra expense rule out treating sys_stat() the same as nfsd?
It would be nice to be able to solve the same problem for userspace
nfsd's (or any other application that might be using mtime to save
rereading data).

--b.