2010-08-13 18:25:51

by Patrick J. LoPresti

Subject: Proposal: Use hi-res clock for file timestamps

For concreteness, let me start with the patch I have in mind. Call it
"patch version 1".


--- linux-2.6.32.13-0.4/kernel/time.c.orig	2010-08-13 10:52:50.000000000 -0700
+++ linux-2.6.32.13-0.4/kernel/time.c	2010-08-13 10:53:20.000000000 -0700
@@ -229,7 +229,8 @@ SYSCALL_DEFINE1(adjtimex, struct timex _
  */
 struct timespec current_fs_time(struct super_block *sb)
 {
-	struct timespec now = current_kernel_time();
+	struct timespec now;
+	getnstimeofday(&now);
 	return timespec_trunc(now, sb->s_time_gran);
 }
 EXPORT_SYMBOL(current_fs_time);

...

I recently spent nearly a week tracking down an NFS cache coherence
problem in an application:

http://www.spinics.net/lists/linux-nfs/msg14974.html

Here is what caused my problem:

1) File dir/A is created locally on NFS server.
2) NFS client does LOOKUP on file dir/B, gets ENOENT.
3) File dir/B is created locally on NFS server.

In my case, these all happened in less than 4 milliseconds (much less,
actually). Since HZ on my system is 250, the file creation in step
(3) failed to update the ctime/mtime on the directory. The result is
that the NFS client's "dentry lookup cache" became stale, but did not
know it was stale (since it relies on the directory ctime/mtime to
detect that). Worse, the staleness persists even if additional
changes are made to the directory from the NFS client, thanks to NFS
v3's "weak cache consistency" optimizations.

Why did this take me a week to diagnose? Because I am using XFS, and
I know XFS and NFS use nanosecond resolution for file timestamps. It
never occurred to me that, here in 2010, Linux would have an actual
file timestamp resolution 6.5 orders of magnitude worse.

I know, I know, "use NFS v4 and i_version". But that is not the
point. The point is that 4 milliseconds is a very long time these
days; an awful lot of file system operations can happen in such an
interval.

I am guessing the objection to the above patch will be: "Waaah it's
slow!" My responses would be:

1) Anybody who cares about file system performance is already using
"noatime" or "relatime", which mitigates the hit greatly.

2) Correctness is more important than performance, and 4 milliseconds
is just embarrassing.

3) On the 99.99% of Linux systems that are post-1990 x86, it is not
slow at all, and the performance difference will be utterly
undetectable in the real world.

When was XFS designed? It has nanosecond timestamps. When was NFS
designed? It has nanosecond timestamps. Even ext4 has nanosecond
timestamps... But what is the point if 22 bits' worth will forever be
meaningless?
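
The "6.5 orders of magnitude" and "22 bits" figures follow from the tick length at HZ=250. A minimal sketch of that arithmetic; the helper names `tick_ns` and `bits_needed` are made up for illustration and are not kernel functions:

```c
#include <stdint.h>

/* Nanoseconds covered by one timer tick for a given HZ value. */
static uint64_t tick_ns(unsigned int hz)
{
    return 1000000000ULL / hz;
}

/* Smallest number of bits b such that 2^b >= n, i.e. how many
 * low-order bits of a nanosecond field a tick of n ns leaves unused. */
static unsigned int bits_needed(uint64_t n)
{
    unsigned int bits = 0;

    while ((1ULL << bits) < n)
        bits++;
    return bits;
}
```

With HZ=250, `tick_ns(250)` is 4,000,000 ns and `bits_needed(4000000)` is 22, since 2^21 < 4,000,000 <= 2^22.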

If the above patch is too slow for some architectures, how about
making it a configuration option? Call it "CONFIG_1980S_FILE_TICK",
have it default to YES on the architectures that care and NO on
anything remotely modern and sane.

OK that's my proposal. Bash away.

- Pat


2010-08-17 19:24:20

by Alan

Subject: Re: Proposal: Use hi-res clock for file timestamps

> The problem with "increment mtime by a nanosecond when necessary" is
> that timestamps can wind up out of order. As in:

Surely that depends on your implementation ?

> 1) Do a bunch of operations on file A
> 2) Do one operation on file B
>
> Imagine each operation on A incrementing its timestamp by a nanosecond
> "just because". If all of these operations happen in less than 4 ms,
> you can wind up with the timestamp on B being EARLIER than the
> timestamp on A. That is a big no-no (think "make" or anything else
> relying on timestamps for relative times).


[time resolution bits of data][value incremented value for that time]


if (time_now == time_last)
return { time_last , ++ct };
else {
ct = 0;
time_last = time_now
return { time_last , 0 };
}

providing it is done with the same 'ct' across the fs and you can't do
enough ops/second to wrap the nanosecs - which should be fine for now,
your ordering is still safe is it not ?
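
The pseudocode above could be fleshed out roughly as follows. This is only a single-threaded sketch: the type and function names (`struct stamp`, `struct stamp_clock`, `stamp_next`, `stamp_cmp`) are hypothetical, and a real version would need atomic or locked access to the shared counter.

```c
#include <stdint.h>

/* A timestamp made of a coarse time plus a per-tick sequence number. */
struct stamp {
    uint64_t tick;  /* coarse time, e.g. in jiffies */
    uint32_t seq;   /* operations already stamped within this tick */
};

/* The shared state: one instance for the whole filesystem. */
struct stamp_clock {
    uint64_t time_last;
    uint32_t ct;
};

/* Return the next stamp given the current coarse time. */
static struct stamp stamp_next(struct stamp_clock *c, uint64_t time_now)
{
    struct stamp s;

    if (time_now == c->time_last) {
        c->ct++;                 /* same tick: bump the counter */
    } else {
        c->ct = 0;               /* new tick: restart the counter */
        c->time_last = time_now;
    }
    s.tick = c->time_last;
    s.seq = c->ct;
    return s;
}

/* Order stamps by tick first, then by sequence number. */
static int stamp_cmp(struct stamp a, struct stamp b)
{
    if (a.tick != b.tick)
        return a.tick < b.tick ? -1 : 1;
    if (a.seq != b.seq)
        return a.seq < b.seq ? -1 : 1;
    return 0;
}
```

Because every file draws from the same counter, several operations on A followed by one on B within a single tick still leave B with the later stamp.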

> If you can prove that the last modification on B happens after the
> last modification on A, then it is very bad for the mtime on B to be
> earlier than the mtime on A. I guarantee that will break things in
> the real world.

Alan

2010-08-18 19:32:55

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, Aug 18, 2010 at 09:25:08PM +0200, Andi Kleen wrote:
> On Wed, Aug 18, 2010 at 02:54:56PM -0400, J. Bruce Fields wrote:
> > On Wed, Aug 18, 2010 at 07:50:40PM +0200, Andi Kleen wrote:
> > > > - nfsd updates it whenever it reads an mtime out of an inode that matches
> > > > current_fs_time to the granularity of 1/HZ.
> > >
> > > That means you have a very very hot cache line on a larger system
> > > if there are a lot of mtime changes. Probably a bad idea.
> >
> > Only if those mtime changes are also followed immediately by nfsd reads
> > of the mtime.
>
> If multiple writers are changing the same location in quick succession
> you have a hot cache line that gets bounced around. It doesn't need reads,
> although reads make it even worse.

OK, at this point one of us is confused, and I'm not sure which.

Is the "same location" that you're referring to the current_nfsd_time?

Neil's suggestion is to only modify current_nfsd_time on nfsd getattr,
*not* on the write operation that modifies the file data.

Or are you talking about something else?

> There's a lot of effort currently to make the VFS more parallel
> and less synchronized and it would be bad to regress here again.

Understood.

--b.

2010-08-17 17:43:39

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, Aug 17, 2010 at 04:54:03PM +0200, Andi Kleen wrote:
> "Patrick J. LoPresti" <[email protected]> writes:
>
> >
> > 1) Anybody who cares about file system performance is already using
> > "noatime" or "relatime", which mitigates the hit greatly.
>
> Consider mtime.
>
> > If the above patch is too slow for some architectures, how about
> > making it a configuration option? Call it "CONFIG_1980S_FILE_TICK",
> > have it default to YES on the architectures that care and NO on
> > anything remotely modern and sane.
> >
> > OK that's my proposal. Bash away.
>
> I suspect it will be a performance disaster on x86 for VFS intensive
> applications on capable file systems. VFS is very performance
> critical. These checks lurk on unexpected places too, e.g. on /dev
> access.
>
> Even TSC is much slower than just reading the variable.
>
> Also you should check if the file system granularity
> even supports it, it's completely wasted on ext3 for example.

Agreed, ext3's probably a lost cause here.

> Maybe as a optional sysctl, default to off.

OK, so that leaves us with the race, even on newer filesystems:

1. File is modified, mtime updated
2. Client fetches mtime to revalidate cache
3. File is modified again, mtime updated
4. Client fetches new mtime to revalidate cache

If step 3 doesn't change the mtime, then step 4 (no matter how much
later it is performed) will return the wrong result, and client
applications will see stale data.

If we want to avoid that race, every modification of file data must
result in the mtime being updated to something different from the last
mtime seen by the client.

(A slight window between data modification and mtime update may be OK,
as long as the update happens eventually, and before the change is
committed to disk--close-to-open semantics mean that NFS clients can
live with not seeing changes until data is written to disk.)
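
To make the four-step race concrete, here is a toy userspace model, not kernel or NFS code. All names (`server_file`, `client_cache`, `server_write`, `client_revalidate`) are invented for the illustration, and the client is assumed to revalidate purely by comparing mtimes.

```c
#include <stdint.h>

struct server_file {
    uint64_t mtime_ns;  /* timestamp as recorded by the server */
    int      contents;  /* stand-in for the file data */
};

struct client_cache {
    uint64_t mtime_ns;  /* mtime seen at the last revalidation */
    int      contents;  /* cached copy of the data */
};

/* Coarse clock: the real time rounded down to a tick boundary. */
static uint64_t coarse_now(uint64_t real_ns, uint64_t tick_ns)
{
    return real_ns - real_ns % tick_ns;
}

/* Steps 1 and 3: modify the file and stamp it with the coarse clock. */
static void server_write(struct server_file *f, int contents,
                         uint64_t real_ns, uint64_t tick_ns)
{
    f->contents = contents;
    f->mtime_ns = coarse_now(real_ns, tick_ns);
}

/* Steps 2 and 4: refetch the data only if the mtime has changed. */
static void client_revalidate(struct client_cache *c,
                              const struct server_file *f)
{
    if (c->mtime_ns != f->mtime_ns) {
        c->mtime_ns = f->mtime_ns;
        c->contents = f->contents;
    }
}
```

If both writes land in the same 4 ms tick, the second `client_revalidate()` sees an unchanged mtime and keeps the old contents, no matter how much later it runs. With a 1 ns tick the same sequence picks up the new data.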

Possible responses:

- Tell everyone to use NFSv4 (and make sure we have
changeattr/i_version working correctly).
- Use a finer-grained time source. (I believe you when you say
the TSC is too slow, but maybe we should run some tests to
make sure.)
- Increment mtime by a nanosecond when necessary.
- ?

--b.

2010-08-13 19:09:46

by john stultz

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Fri, 2010-08-13 at 11:57 -0700, Patrick J. LoPresti wrote:
> On Fri, Aug 13, 2010 at 11:45 AM, john stultz <[email protected]> wrote:
> > On those TSC broken systems that use the hpet or acpi_pm, a
> > getnstimeofday call can take 0.5-1.3us, so the penalty can be quite
> > severe.
>
> So you are saying my proposal is a bad idea forever? (But then why
> even bother having nanosecond resolution on ext4?)
>
> Or that it is a bad idea for now?

I'm not judging the idea as good/bad, just providing information for
context.

> Or that it needs to be refined? Maybe use hi-res precision on systems
> where it is known to be fast?
>
> > And even with the TSC, expect some performance impact, as
> > reading hardware and doing the multiply is more costly then just
> > fetching a value from memory.
>
> Relative to file system operations? Seriously? What performance hit
> would you expect on real-world applications?
> Something like 0.1% (10 nsec / 10 usec) worst case?

If you can show this does not affect performance in benchmarks, etc, I'm
sure it will be easier to push the patch. As outside of performance, I
don't think there's much of an issue with the change.

So other than "show some numbers", my only thought that might make the
patch more attractive is that rather than a global change, or a static
CONFIG_ option, would it maybe make more sense as a mount option?
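
As a rough first look at "show some numbers", the cost of a precise clock read versus a tick-granular one can be compared from userspace. This is a sketch under assumptions: it measures the userspace `clock_gettime()` path rather than the in-kernel `getnstimeofday()`/`current_kernel_time()` calls, and it relies on the Linux-specific `CLOCK_REALTIME_COARSE`, which to my understanding returns the time as of the last tick. The helper name `avg_ns_per_call` is made up.

```c
#define _GNU_SOURCE
#include <time.h>

/* Average cost, in nanoseconds, of one clock_gettime() call on `clk`. */
static double avg_ns_per_call(clockid_t clk, long iterations)
{
    struct timespec start, end, scratch;
    double elapsed_ns;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations; i++)
        clock_gettime(clk, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
               + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling it once with `CLOCK_REALTIME` and once with `CLOCK_REALTIME_COARSE` gives a per-call figure for each on the machine at hand. The results depend heavily on the active clocksource, so they say little about other hardware.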

thanks
-john


2010-08-18 18:53:58

by Andi Kleen

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, Aug 18, 2010 at 07:20:58PM +0100, David Woodhouse wrote:
> On Tue, 2010-08-17 at 20:29 +0200, Andi Kleen wrote:
> >
> > > - Increment mtime by a nanosecond when necessary.
> >
> > You cannot be more precise than the backing file system: this causes
> > non monotonity when the inodes are flushed (has happened in the past)
>
> Um, can't you? You can't *store* timestamps which are more precise, but
> they can be in cache can't they?

No you can't. The initial implementation did that and it broke someone's
make. After that the VFS was fixed to never be more precise than the backing
file system.

-Andi

2010-08-19 22:55:18

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, Aug 18, 2010 at 08:17:14PM -0700, john stultz wrote:
> On Wed, 2010-08-18 at 22:31 -0400, J. Bruce Fields wrote:
> > On Wed, Aug 18, 2010 at 06:41:02PM -0700, john stultz wrote:
> > > On Wed, Aug 18, 2010 at 11:12 AM, J. Bruce Fields <[email protected]> wrote:
> > > > I'm completely ignorant about higher-resolution time sources. Any
> > > > recommended reading? What resolution do they actually provide, what's
> > > > the expense of reading them, how reliable are they, and how do the
> > > > answers to those questions vary across different hardware and kernel
> > > > versions? A quick look at drivers/clocksource/ doesn't suggest
> > > > simple answers.
> > >
> > > Yea, there aren't simple answers. Clocksource hardware varies
> > > drastically in resolution and access time across systems and
> > > architectures. Further, clocksources may change while the system is
> > > up, so we don't really expose the hardware resolution.
> > >
> > > On x86, access latency varies from ~50ns (TSC) to ~1.3us (ACPI PM).
> > > (And that is ignoring the PIT, which can be 18us per call - luckily
> > > almost no hardware uses that). The resolution similarly scales from
> > > sub-ns (TSC @ > 1ghz cpus) to ~279ns (ACPI PM). Of course, across
> > > architectures you will see even more variance.
> >
> > The race in question occurs when you manage to check mtime between two
> > file data updates, with all three operations occurring within a clock
> > tick.
> >
> > No idea if that's feasible in hundreds of nanoseconds.
>
> I think this is what Andi meant that you'll always race with time and
> that version counters are the only real solution here.

Yeah. That'll work for NFSv4. But if possible it'd be nice to have a
solution for NFSv3.

As compared to using a higher-resolution time source, a solution for
mtime based on a global counter would provide better guarantees (on
filesystems that can store the extra bits), and perform better. (What
is the worst-case latency if we're bouncing a cache line back and forth
between two CPU's?) Though I guess the possible performance hit would
rule it out for users that didn't specifically ask for it. (So, no help
for userspace nfs servers, make, or whoever else might (wisely or not)
already depend on mtime detecting changes reliably.)

> > I'm also not sure how to judge the access latency. Certainly a
> > microsecond is a lot compared to just reading a cached mtime value.
> >
> > Will we ever see them go backwards? (So if I know I wrote to file B
> > after writing to file A, is there ever a case where I could end up with
> > an earlier mtime on B than A?)
>
> You should not. However, there have been bugs in the past, and there
> will probably be a few more in the future.
>
> There are also theoretical issues with SMP systems where the TSCs are
> not perfectly synced, but the window for those races should be small
> (ie: smaller then can be detected - otherwise we'll throw out the TSC).

Got it. Thanks for your help!

--b.

2010-08-18 19:25:10

by Andi Kleen

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, Aug 18, 2010 at 02:54:56PM -0400, J. Bruce Fields wrote:
> On Wed, Aug 18, 2010 at 07:50:40PM +0200, Andi Kleen wrote:
> > > - nfsd updates it whenever it reads an mtime out of an inode that matches
> > > current_fs_time to the granularity of 1/HZ.
> >
> > That means you have a very very hot cache line on a larger system
> > if there are a lot of mtime changes. Probably a bad idea.
>
> Only if those mtime changes are also followed immediately by nfsd reads
> of the mtime.

If multiple writers are changing the same location in quick succession
you have a hot cache line that gets bounced around. It doesn't need reads,
although reads make it even worse.

There's a lot of effort currently to make the VFS more parallel
and less synchronized and it would be bad to regress here again.

-Andi
--
[email protected] -- Speaking for myself only.

2010-08-18 18:57:09

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, Aug 18, 2010 at 07:50:40PM +0200, Andi Kleen wrote:
> > - nfsd updates it whenever it reads an mtime out of an inode that matches
> > current_fs_time to the granularity of 1/HZ.
>
> That means you have a very very hot cache line on a larger system
> if there are a lot of mtime changes. Probably a bad idea.

Only if those mtime changes are also followed immediately by nfsd reads
of the mtime.

That will be the typical case for nfsd writes, though.

--b.

2010-08-19 00:52:30

by NeilBrown

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Thu, 19 Aug 2010 09:41:36 +1000
Neil Brown <[email protected]> wrote:

> So I agree that this is probably more of an issue for directories than for
> files, and that implementing it just for directories would be a sensible
> first step with lower expected overhead - just my reasoning seems to be a bit
> different.

Just to be sure we are on the same page:
file_update_time would always refer to current_nfsd_time, but nfsd would
only update current_nfsd_time when a directory was examined (and the other
conditions were met).


So my current thinking on how this would look - names have been changed:

- global timespec 'current_fs_precise_time' is zeroed when
current_kernel_time moves backwards and is protected by a seqlock

- current_fs_time would be
now = max(current_kernel_time(), current_fs_precise_time)
return timespec_trunc(now, sb->s_time_gran)
(with appropriate seqlock protection)

- new function in fs/inode.c
get_precise_time(timestamp)
cft = current_fs_time()
if (timestamp == cft)
write_seqlock()
if cft == current_fs_precise_time
current_fs_precise_time.tv_nsec++
else if cft > current_fs_precise_time
current_fs_precise_time = cft
write_sequnlock()
return timestamp

- nfsd xdr response routine does
ts = inode->i_mtime
if (S_ISDIR(inode->i_mode))
ts = get_precise_time(ts)
xdr_encode_timespec(ts)


get_precise_time() probably needs a bit more subtlety to handle different
s_time_gran values and possible races, but I think it is fairly close.
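
A single-threaded userspace sketch of the scheme outlined above, with times held as plain nanosecond counts and the seqlock and s_time_gran handling left out. The names `fs_clock`, `fs_current_time` and `fs_note_observed` are hypothetical. One deliberate departure, which is my assumption about the intent: the sketch always moves the precise time one nanosecond past the observed value. Merely setting it equal to the current time, as the second branch of the pseudocode does, would let the very next update reuse the timestamp that was just observed.

```c
#include <stdint.h>

struct fs_clock {
    uint64_t kernel_time_ns;   /* stand-in for current_kernel_time() */
    uint64_t precise_time_ns;  /* stand-in for current_fs_precise_time */
};

/* Time used for mtime updates: whichever of the two clocks is later. */
static uint64_t fs_current_time(const struct fs_clock *c)
{
    return c->kernel_time_ns > c->precise_time_ns
         ? c->kernel_time_ns : c->precise_time_ns;
}

/* Called when a reader such as nfsd hands `timestamp` to a client.
 * If that value is what the next update would also get, advance the
 * precise clock so the next update is stamped differently. */
static uint64_t fs_note_observed(struct fs_clock *c, uint64_t timestamp)
{
    uint64_t cft = fs_current_time(c);

    if (timestamp == cft)
        c->precise_time_ns = cft + 1;
    return timestamp;
}
```

When nobody observes a timestamp, `precise_time_ns` never moves and the clock degenerates to the tick-granular kernel time; the extra nanosecond steps only appear after an observation that would otherwise collide.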

Then if we ever had an xstat or similar that could ask for precise
timestamps, it just makes a similar call to get_precise_time.
Also if we added code later to use a hires timer on hardware where it was
efficient, get_precise_time could test for that and become a no-op

Yes, I should probably turn this into a patch ... maybe another day.

NeilBrown

2010-08-17 19:39:14

by Alan

Subject: Re: Proposal: Use hi-res clock for file timestamps

> I am having trouble seeing why this is a better idea than a simple
> mount option to obtain decent resolution timestamps. (Not that we
> can't have both...) Is there any objection to the mount option I am
> proposing?

I have none. I doubt I'd use it as it would be too expensive on system
performance for some of my boxes, while having an incrementing value is
cheap.

I don't see the two as conflicting - in fact the bits you need to do the
mount option are the bits you also need to do the counter version as
well. One fixes ordering at no real cost, the other adds high res
timestamps, both are useful.

Alan

2010-08-17 18:29:27

by Andi Kleen

Subject: Re: Proposal: Use hi-res clock for file timestamps

> OK, so that leaves us with the race, even on newer filesystems:
>
> 1. File is modified, mtime updated
> 2. Client fetches mtime to revalidate cache
> 3. File is modified again, mtime updated
> 4. Client fetches new mtime to revalidate cache

You'll always have a race window with time, the only way around
that would be a version number.

> - Tell everyone to use NFSv4 (and make sure we have
> changeattr/i_version working correctly).
> - Use a finer-grained time source. (I believe you when you say
> the TSC is too slow, but maybe we should run some tests to
> make sure.)

It depends on the CPU too.

> - Increment mtime by a nanosecond when necessary.

You cannot be more precise than the backing file system: this causes
non-monotonicity when the inodes are flushed (has happened in the past)

-Andi

2010-08-13 18:57:45

by Patrick J. LoPresti

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Fri, Aug 13, 2010 at 11:45 AM, john stultz <[email protected]> wrote:
>
> Your stats are off here. The only fast clocksource on x86 is the TSC,
> and its busted on many, many systems. The cpu vendors have only
> recently taken it seriously and resolved the majority of problems
> (however, issues still remain on large numa systems, but its much
> better then the story was 3-7 years ago).

Thank you for the correction. Still, the number of systems where TSC
works is large, it is growing over time, and.... Really now,
milliseconds? In 2010? On some Apple iToy, perhaps...

> On those TSC broken systems that use the hpet or acpi_pm, a
> getnstimeofday call can take 0.5-1.3us, so the penalty can be quite
> severe.

So you are saying my proposal is a bad idea forever? (But then why
even bother having nanosecond resolution on ext4?)

Or that it is a bad idea for now?

Or that it needs to be refined? Maybe use hi-res precision on systems
where it is known to be fast?

> And even with the TSC, expect some performance impact, as
> reading hardware and doing the multiply is more costly then just
> fetching a value from memory.

Relative to file system operations? Seriously? What performance hit
would you expect on real-world applications?
Something like 0.1% (10 nsec / 10 usec) worst case?

- Pat

2010-08-19 02:44:28

by NeilBrown

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, 18 Aug 2010 22:08:03 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Thu, Aug 19, 2010 at 10:52:18AM +1000, Neil Brown wrote:
> > On Thu, 19 Aug 2010 09:41:36 +1000
> > Neil Brown <[email protected]> wrote:
> >
> > > So I agree that this is probably more of an issue for directories than for
> > > files, and that implementing it just for directories would be a sensible
> > > first step with lower expected overhead - just my reasoning seems to be a bit
> > > different.
> >
> > Just to be sure we are on the same page:
> > file_update_time would always refer to current_nfsd_time, but nfsd would
> > only update current_nfsd_time when a directory was examined (and the other
> > conditions were met).
> >
> >
> > So my current thinking on how this would look - names have been changed:
> >
> > - global timespec 'current_fs_precise_time' is zeroed when
> > current_kernel_time moves backwards and is protected by a seqlock
> >
> > - current_fs_time would be
> > now = max(current_kernel_time(), current_fs_precise_time)
> > return timespec_trunc(now, sb->s_time_gran)
> > (with appropriate seqlock protection)
> >
> > - new function in fs/inode.c
> > get_precise_time(timestamp)
>
> Odd name for something that returns nothing of interest;
> bump_precise_time() might be closer?
>
> And unique_time might be better than precise_time, since the property
> we're asking for is that mtime on a changed file be new? (Or
> versioned_time?)

Agreed on both counts, though I'm not keen on 'bump' myself.
got_unique_time()
because that is what we just did... I prefer the name to reflect why the
function is called, rather than what the function is expected to do about it.
never_use_this_timestamp_again(timestamp)
:-?


>
> > cft = current_fs_time()
> > if (timestamp == cft)
> /*
> * Make sure the next mtime stored will be
> * something different from timestamp:
> */
> > write_seqlock()
> > if cft == current_fs_precise_time
> > current_fs_precise_time.tv_nsec++
> > else if cft > current_fs_precise_time
>
> What's the cft < current_fs_precise_time case?

The current_fs_precise_time has been incremented with a resolution higher
than s_time_gran. i.e. s_time_gran > 1.
I'm not really sure what we want to do about that.
Maybe we should be incrementing tv_nsec by s_time_gran as long as that is
significantly less than jiffies_to_usec(1)*1000, but I don't know what I mean
by 'significantly'.

The only values I can find for s_time_gran in current code are 1, 100, 1000
and 1000000000.
All those are either way bigger than a jiffy or significantly smaller, but
suppose a filesystem came along that chose 1000000 (i.e. millisecond
timestamps) - should we increment tv_nsec by 1000000, or not, or cross that
bridge when we come to it?

For reference:
default is 1000000000 (this would cover ext2, ext3, reiserfs, fat, sysv, ...)
cifs, smbfs, ntfs are 100
udf, ceph are 1000
rest (btrfs, ext4, gfs2, jfs, nilfs, ocfs2, xfs and virtual filesystems) are 1
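
For readers unfamiliar with s_time_gran: it is the step, in nanoseconds, to which a timestamp is truncated before being stored. A simplified userspace sketch of that truncation, loosely modelled on what timespec_trunc() does; the struct and function names here are stand-ins, and invalid granularities are not checked:

```c
#define NSEC_PER_SEC 1000000000L

/* Minimal stand-in for struct timespec. */
struct ts {
    long sec;
    long nsec;
};

/* Truncate `t` to a multiple of `gran` nanoseconds.
 * gran is assumed to lie in the range 1..NSEC_PER_SEC. */
static struct ts trunc_to_gran(struct ts t, long gran)
{
    if (gran <= 1) {
        /* nanosecond granularity: nothing to drop */
    } else if (gran == NSEC_PER_SEC) {
        t.nsec = 0;                /* whole seconds only */
    } else {
        t.nsec -= t.nsec % gran;   /* round down to a multiple of gran */
    }
    return t;
}
```

With tv_nsec = 123456789, a granularity of 100 keeps 123456700, 1000 keeps 123456000, and 1000000000 keeps 0.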

NeilBrown


2010-08-18 18:17:58

by Chuck Lever

Subject: Re: Proposal: Use hi-res clock for file timestamps


On Aug 18, 2010, at 1:32 PM, J. Bruce Fields wrote:

> On Wed, Aug 18, 2010 at 03:53:59PM +1000, Neil Brown wrote:
>> I'm not sure you even want to pay for a per-filesystem atomic access when
>> updating mtime. mnt_want_write - called at the same time - seems to go to
>> some lengths to avoid an atomic operation.
>>
>> I think that nfsd should be the only place that has to pay the atomic
>> penalty, as it is where the need is.
>>
>> I imagine something like this:
>> - Create a global struct timespec which is protected by a seqlock
>> Call it current_nfsd_time or similar.
>> - file_update_time reads this and uses it if it is newer than
>> current_fs_time.
>> - nfsd updates it whenever it reads an mtime out of an inode that matches
>> current_fs_time to the granularity of 1/HZ.
>
> We can also skip the update whenever current_nfsd_time is greater than
> the inode's mtime--that's enough to ensure that the next
> file_update_time() call will get a time different from the inode's
> current mtime.

Would it help if we only did this for directories, for now?

Files have close-to-open. Directories... don't. So we have the problem where directory changes (i.e. file creation and deletion) take a long time (sometimes an infinitely long time) to propagate to clients. Plus: directories don't change very often, so using fine-grained timestamps only on directories wouldn't impact heavy I/O workloads.

> And that means that a sequence like
>
> file_update_time()
> N nfsd_getattr()'s
>
> doesn't make N updates to current_nfsd_time, when only 1 was necessary.
>
>> If the current value is before current_kernel_time, it
>> is set to current_kernel_time, otherwise tv_nsec is incremented -
>> unless that increases
>> beyond jiffies_to_usec(1)*1000 beyond current_kernel_time.
>
> ... which would only happen on hardware that could process a getattr and
> a data update per nanosecond continuously for a jiffy.
>
>> - the global 'struct timespec' is zeroed whenever system time is set
>> backwards.
>
> OK, got it, I think: so this is the same as a global version of Alan's
> clock, except that the extra ticks only happen when they need to.
>
> The properties it satisfies:
>
> - It's still a single global clock, so it's consistent between
> files.
> - It degenerates to jiffies in the absence of getattr's from
> nfsd.
> - It need only invalidate the other cpus' cached value of the
> clock on the first getattr of a file that follows less than a
> jiffy after an update of the file's data.
> - Absent utime(), time going backwards, or futuristic hardware,
> it guarantees that two nfsd reads of an inode's mtime will
> return different values iff the inode's data was modified in
> between the two.
>
> Shortcomings:
>
> - The clock advances in units only of either 1 jiffy or 1 ns.
> This will look odd. But when the alternative is units of 1
> jiffy or 0 ns, it seems an improvement....
> - A slowdown due to file_update_time() marking inodes
> dirty more frequently?
> - Doesn't help with ext3. Oh well.
>
> Would the extra expense rule out treating sys_stat() the same as nfsd?
> It would be nice to be able to solve the same problem for userspace
> nfsd's (or any other application that might be using mtime to save
> rereading data).
>
> --b.

--
chuck[dot]lever[at]oracle[dot]com





2010-08-18 18:21:08

by David Woodhouse

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, 2010-08-17 at 20:29 +0200, Andi Kleen wrote:
>
> > - Increment mtime by a nanosecond when necessary.
>
> You cannot be more precise than the backing file system: this causes
> non monotonity when the inodes are flushed (has happened in the past)

Um, can't you? You can't *store* timestamps which are more precise, but
they can be in cache can't they?

And since you're not going to drop it from cache and bring it back in
again within 4ms, that ought to suffice?

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation


2010-08-18 14:46:58

by Patrick J. LoPresti

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, Aug 17, 2010 at 10:53 PM, Neil Brown <[email protected]> wrote:
>
> I imagine something like this:
>  - Create a global struct timespec which is protected by a seqlock
>    Call it current_nfsd_time or similar.
>  - file_update_time reads this and uses it if it is newer than
>    current_fs_time.
>  - nfsd updates it whenever it reads an mtime out of an inode that matches
>    current_fs_time to the granularity of 1/HZ.

I think nfsd can simply update current_nfsd_time whenever the mtime it
reads from an inode is >= current_nfsd_time. (The invariant you need
to maintain is that whenever nfsd reads an mtime, any timestamps
produced after that have a later time. So just code it that way
directly.)

>    If the current value is before current_kernel_time, it
>    is set to current_kernel_time, otherwise tv_nsec is incremented -
>    unless that increases
>    beyond jiffies_to_usec(1)*1000 beyond current_kernel_time.
>  - the global 'struct timespec' is zeroed whenever system time is set
>    backwards.

I believe this works.

> [[You could probably make ext3 work reasonably well by adding a mount option
>  which:
>    - advertises s_time_gran as 1
>    - when storing: rounds timestamps up to the next second if tv_nsec != 0
>    - when loading, setting the timestamp to the current time if the stored
>      number matches current_kernel_time().tv_sec+1
>  You would get occasional forward jumps in mtime, but usually when you
>  aren't looking, and at least you would not get real changes that are not
>  reflected in mtime
> ]]

But I do not believe this works.

1) Modify file A
2) Modify file B
3) File A experiences one of those "occasional forward jumps in mtime"
(inode evicted + read back within 1 second)
4) mtimes on A and B are now out of order -- very bad
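
The four steps above can be played through with a toy model of the quoted store/load rules. This is illustrative only: the function names `coarse_store` and `coarse_load` are invented, and times are plain nanosecond counts.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Storing: round up to the next whole second if there is a
 * sub-second part. Returns the on-disk value in seconds. */
static uint64_t coarse_store(uint64_t mtime_ns)
{
    return (mtime_ns + NSEC_PER_SEC - 1) / NSEC_PER_SEC;
}

/* Loading: if the stored second is "current second + 1", substitute
 * the current time; otherwise use the stored second as-is. */
static uint64_t coarse_load(uint64_t stored_sec, uint64_t now_ns)
{
    if (stored_sec == now_ns / NSEC_PER_SEC + 1)
        return now_ns;
    return stored_sec * NSEC_PER_SEC;
}
```

Take A modified at 10.2 s and B at 10.3 s. If A's inode is evicted (stored as 11) and read back at 10.5 s, `coarse_load()` returns 10.5 s, so A now appears newer than B even though B was modified later.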

As Bruce mentioned, ext3 is a lost cause.

Regardless of any of this, however, the first step is to provide a
mount option to select the timestamp algorithm... Because it is still
absurd that I cannot have accurate timestamps on my files here in the
21st century.

Once that is done, the rest is just providing the alternative
implementations and choosing defaults.

- Pat

2010-08-18 18:14:50

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, Aug 17, 2010 at 12:34:58PM -0700, Patrick J. LoPresti wrote:
> On Tue, Aug 17, 2010 at 12:39 PM, Alan Cox <[email protected]> wrote:
>
> >        if (time_now == time_last)
> >                return { time_last , ++ct };
> >        else {
> >                ct = 0;
> >                time_last = time_now
> >                return { time_last , 0 };
> >        }
> >
> > providing it is done with the same 'ct' across the fs and you can't do
> > enough ops/second to wrap the nanosecs - which should be fine for now,
> > your ordering is still safe is it not ?
>
> Yes, that would work. Assuming you use atomic counters, else there
> is a risk of the visible time ticking backwards. It seems like a lot
> of effort just to avoid having accurate timestamps on your files,
> though.
>
> I am having trouble seeing why this is a better idea than a simple
> mount option to obtain decent resolution timestamps. (Not that we
> can't have both...) Is there any objection to the mount option I am
> proposing?

I'm completely ignorant about higher-resolution time sources. Any
recommended reading? What resolution do they actually provide, what's
the expense of reading them, how reliable are they, and how do the
answers to those questions vary across different hardware and kernel
versions? A quick look at drivers/clocksource/ doesn't suggest
simple answers.

-b.

2010-08-17 19:06:52

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, Aug 17, 2010 at 08:29:20PM +0200, Andi Kleen wrote:
> > OK, so that leaves us with the race, even on newer filesystems:
> >
> > 1. File is modified, mtime updated
> > 2. Client fetches mtime to revalidate cache
> > 3. File is modified again, mtime updated
> > 4. Client fetches new mtime to revalidate cache
>
> You'll always have a race window with time, the only way around
> that would be a version number.

Agreed, but as a practical matter, nanosecond resolution would extend
the useful lifetime of NFSv3 by quite a bit.

> > - Tell everyone to use NFSv4 (and make sure we have
> > changeattr/i_version working correctly).
> > - Use a finer-grained time source. (I believe you when you say
> > the TSC is too slow, but maybe we should run some tests to
> > make sure.)
>
> It depends on the CPU too.

If we wanted to look into this, what would you suggest (hardware,
workload) to demonstrate the worst case? (Or are the results from the
TSC or any other higher-precision time source likely to be useless for
other reasons?)

> > - Increment mtime by a nanosecond when necessary.
>
> You cannot be more precise than the backing file system: this causes
> non-monotonicity when the inodes are flushed (has happened in the past)

Right, I think that we probably have to give up ext3 as a lost cause.
But perhaps we could get away with a hack like this on filesystems that
can store nanoseconds.

--b.

2010-08-18 23:41:46

by NeilBrown

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, 18 Aug 2010 14:15:51 -0400
Chuck Lever <[email protected]> wrote:

>
> On Aug 18, 2010, at 1:32 PM, J. Bruce Fields wrote:
>
> > On Wed, Aug 18, 2010 at 03:53:59PM +1000, Neil Brown wrote:
> >> I'm not sure you even want to pay for a per-filesystem atomic access when
> >> updating mtime. mnt_want_write - called at the same time - seems to go to
> >> some lengths to avoid an atomic operation.
> >>
> >> I think that nfsd should be the only place that has to pay the atomic
> >> penalty, as it is where the need is.
> >>
> >> I imagine something like this:
> >> - Create a global struct timespec which is protected by a seqlock
> >> Call it current_nfsd_time or similar.
> >> - file_update_time reads this and uses it if it is newer than
> >> current_fs_time.
> >> - nfsd updates it whenever it reads an mtime out of an inode that matches
> >> current_fs_time to the granularity of 1/HZ.
> >
> > We can also skip the update whenever current_nfsd_time is greater than
> > the inode's mtime--that's enough to ensure that the next
> > file_update_time() call will get a time different from the inode's
> > current mtime.
>
> Would it help if we only did this for directories, for now?
>
> Files have close-to-open. Directories... don't. So we have the problem
> where directory changes (i.e. file creation and deletion) take a long
> time (sometimes an infinitely long time) to propagate to clients. Plus:
> directories don't change very often, so using fine-grained time stamps
> only on directories wouldn't impact heavy I/O workloads.

I don't quite see how close-to-open really affects this issue - it still
relies on the timestamps and so can cache old data if a file update didn't
change the timestamp.

In my mind the difference is that near-concurrent access to files usually
involves file locking which flushes caches (and if it doesn't then you have
bigger problems) while near-concurrent access to directories relies on the
natural atomicity of dir operations so no locking or flushing occurs.

So I agree that this is probably more of an issue for directories than for
files, and that implementing it just for directories would be a sensible
first step with lower expected overhead - just my reasoning seems to be a bit
different.

Thanks,
NeilBrown

2010-08-13 20:52:53

by Jim Rees

Subject: Re: Proposal: Use hi-res clock for file timestamps

john stultz wrote:

> How about using getnstimeofday() only if the kernel clocksource is tsc?
> Presumably anyone running into this problem would have modern high
> performance hardware with working tsc.

This might be difficult, as clocksources can change while the system is
running. Further, any checks for specific clocksources would be very
architecture specific. A fast-timekeeping flag could be used, but would
be a fairly subjective metric (ie: would there be issues if the fast
clock on a slower system is slower than a slow clock on a fast system,
etc).

That's partly why I suggested a mount option.

It just would be nice if the kernel could "do the right thing" without the
user having to set some option or sysctl. The problem is infrequently seen,
difficult to diagnose, and not obvious as to whether it's a client or server
issue. I don't think your typical user, or even your typical professional
sysadmin, is going to get this right if it requires a setting.

Having said that, it's not obvious to me how to do this automatically.

2010-08-13 20:26:31

by john stultz

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Fri, 2010-08-13 at 15:57 -0400, Jim Rees wrote:
> Patrick J. LoPresti wrote:
>
> On Fri, Aug 13, 2010 at 11:45 AM, john stultz <[email protected]> wrote:
>
> > On those TSC broken systems that use the hpet or acpi_pm, a
> > getnstimeofday call can take 0.5-1.3us, so the penalty can be quite
> > severe.
>
> So you are saying my proposal is a bad idea forever? (But then why
> even bother having nanosecond resolution on ext4?)
>
> How about using getnstimeofday() only if the kernel clocksource is tsc?
> Presumably anyone running into this problem would have modern high
> performance hardware with working tsc.

This might be difficult, as clocksources can change while the system is
running. Further, any checks for specific clocksources would be very
architecture specific. A fast-timekeeping flag could be used, but would
be a fairly subjective metric (ie: would there be issues if the fast
clock on a slower system is slower than a slow clock on a fast system,
etc).

That's partly why I suggested a mount option.

thanks
-john




2010-08-18 18:32:22

by Patrick J. LoPresti

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, Aug 18, 2010 at 11:20 AM, David Woodhouse <[email protected]> wrote:
>
> Um, can't you? You can't *store* timestamps which are more precise, but
> they can be in cache can't they?

No. That is how Linux used to work, and it caused many problems,
which is why the current_fs_time() function was invented.

> And since you're not going to drop it from cache and bring it back in
> again within 4ms, that ought to suffice?

Not the problem. As usual, the problem is out-of-order timestamps:

1) Modify file A
2) Modify file B
3) File B's inode gets evicted, truncating its timestamp to disk resolution
4) Call stat() on B, bringing it back in with truncated resolution

And boom, B appears to be OLDER than A. Which is not allowed.

This is exactly what happened when Linux first added sub-second
timestamps to the generic VFS layer. Many complaints about "make"
rebuilding files unnecessarily, among other things. Eventually it got
fixed by the introduction of current_fs_time().
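
The four steps above can be reproduced in a few lines of userspace C.
This is only a sketch under stated assumptions: trunc_ts() is a
hypothetical stand-in for the kernel's timespec_trunc(), the timestamp
values are invented, and the caller chooses the on-disk granularity
(1 second models a filesystem such as ext3).

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical stand-in for timespec_trunc(): round tv_nsec down to a
 * multiple of gran (in nanoseconds). */
static struct timespec trunc_ts(struct timespec t, long gran)
{
	assert(gran >= 1 && gran <= NSEC_PER_SEC);
	t.tv_nsec -= t.tv_nsec % gran;
	return t;
}

/* Return <0, 0 or >0 as a is earlier than, equal to, or later than b. */
static int ts_cmp(struct timespec a, struct timespec b)
{
	if (a.tv_sec != b.tv_sec)
		return a.tv_sec < b.tv_sec ? -1 : 1;
	if (a.tv_nsec != b.tv_nsec)
		return a.tv_nsec < b.tv_nsec ? -1 : 1;
	return 0;
}

/* Old behaviour: full resolution in cache, truncated only on eviction.
 * Returns 1 if B ends up looking older than A. */
static int evicted_b_looks_older(long disk_gran)
{
	struct timespec a = { .tv_sec = 100, .tv_nsec = 250000000 }; /* 1) */
	struct timespec b = { .tv_sec = 100, .tv_nsec = 750000000 }; /* 2) */

	b = trunc_ts(b, disk_gran);	/* 3) and 4): evict B, stat() reloads it */
	return ts_cmp(b, a) < 0;
}

/* current_fs_time() behaviour: truncate when the timestamp is assigned,
 * so a later eviction cannot change the value any further. */
static int truncated_at_update_looks_older(long disk_gran)
{
	struct timespec a = { .tv_sec = 100, .tv_nsec = 250000000 };
	struct timespec b = { .tv_sec = 100, .tv_nsec = 750000000 };

	a = trunc_ts(a, disk_gran);
	b = trunc_ts(b, disk_gran);
	b = trunc_ts(b, disk_gran);	/* eviction and reload: no change */
	return ts_cmp(b, a) < 0;
}
```

With a 1-second granularity the first function reports the inversion
and the second does not; with a granularity of 1 ns neither does.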

- Pat

2010-08-19 03:17:23

by john stultz

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, 2010-08-18 at 22:31 -0400, J. Bruce Fields wrote:
> On Wed, Aug 18, 2010 at 06:41:02PM -0700, john stultz wrote:
> > On Wed, Aug 18, 2010 at 11:12 AM, J. Bruce Fields <[email protected]> wrote:
> > > I'm completely ignorant about higher-resolution time sources. Any
> > > recommended reading? What resolution do they actually provide, what's
> > > the expense of reading them, how reliable are they, and how do the
> > > answers to those questions vary across different hardware and kernel
> > > versions? A quick look at drivers/clocksource/ doesn't suggest
> > > simple answers.
> >
> > Yea, there aren't simple answers. Clocksource hardware varies
> > drastically in resolution and access time across systems and
> > architectures. Further, clocksources may change while the system is
> > up, so we don't really expose the hardware resolution.
> >
> > On x86, access latency varies from ~50ns (TSC) to ~1.3us (ACPI PM).
> > (And that is ignoring the PIT, which can be 18us per call - luckily
> > almost no hardware uses that). The resolution similarly scales from
> > sub-ns (TSC @ > 1ghz cpus) to ~279ns (ACPI PM). Of course, across
> > architectures you will see even more variance.
>
> The race in question occurs when you manage to check mtime between two
> file data updates, with all three operations occurring within a clock
> tick.
>
> No idea if that's feasible in hundreds of nanoseconds.

I think this is what Andi meant that you'll always race with time and
that version counters are the only real solution here.

> I'm also not sure how to judge the access latency. Certainly a
> microsecond is a lot compared to just reading a cached mtime value.
>
> Will we ever see them go backwards? (So if I know I wrote to file B
> after writing to file A, is there ever a case where I could end up with
> an earlier mtime on B than A?)

You should not. However, there have been bugs in the past, and there
will probably be a few more in the future.

There are also theoretical issues with SMP systems where the TSCs are
not perfectly synced, but the window for those races should be small
(ie: smaller than can be detected - otherwise we'll throw out the TSC).


thanks
-john



2010-08-18 05:54:09

by NeilBrown

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, 17 Aug 2010 15:29:38 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Tue, Aug 17, 2010 at 08:39:41PM +0100, Alan Cox wrote:
> > > The problem with "increment mtime by a nanosecond when necessary" is
> > > that timestamps can wind up out of order. As in:
> >
> > Surely that depends on your implementation ?
> >
> > > 1) Do a bunch of operations on file A
> > > 2) Do one operation on file B
> > >
> > > Imagine each operation on A incrementing its timestamp by a nanosecond
> > > "just because". If all of these operations happen in less than 4 ms,
> > > you can wind up with the timestamp on B being EARLIER than the
> > > timestamp on A. That is a big no-no (think "make" or anything else
> > > relying on timestamps for relative times).
> >
> >
> > [time resolution bits of data][value incremented value for that time]
> >
> >
> > >        if (time_now == time_last)
> > >                return { time_last , ++ct };
> > >        else {
> > >                ct = 0;
> > >                time_last = time_now;
> > >                return { time_last , 0 };
> > >        }
> >
> > providing it is done with the same 'ct' across the fs and you can't do
> > enough ops/second to wrap the nanosecs - which should be fine for now,
> > your ordering is still safe is it not ?
>
> Right, so if I understand correctly, you're proposing a time source
> that's global to the filesystem and that guarantees it will always
> return a unique value by incrementing the nanoseconds field if jiffies
> haven't changed since the last time it was called.
>
> (Does it really need to be global across all filesystems? Or is it
> unreasonable to expect your unbelievably-fast make's to behave well when
> sources and targets live on different filesystems?)
>

I'm not sure you even want to pay for a per-filesystem atomic access when
updating mtime. mnt_want_write - called at the same time - seems to go to
some lengths to avoid an atomic operation.

I think that nfsd should be the only place that has to pay the atomic
penalty, as it is where the need is.

I imagine something like this:
- Create a global struct timespec which is protected by a seqlock.
  Call it current_nfsd_time or similar.
- file_update_time reads this and uses it if it is newer than
  current_fs_time.
- nfsd updates it whenever it reads an mtime out of an inode that matches
  current_fs_time to the granularity of 1/HZ.
  If the current value is before current_kernel_time, it is set to
  current_kernel_time; otherwise tv_nsec is incremented, unless that
  would take it more than jiffies_to_usec(1)*1000 beyond
  current_kernel_time.
- the global 'struct timespec' is zeroed whenever system time is set
  backwards.

Then - providing the fs stores nanosecond timestamps - we should have stable,
globally ordered, precise (if not entirely accurate) time stamps, and a
penalty would only be paid when nfsd actually needs the information.


[[You could probably make ext3 work reasonably well by adding a mount option
which:
- advertises s_time_gran as 1
- when storing: rounds timestamps up to the next second if tv_nsec != 0
- when loading: sets the timestamp to the current time if the stored
number matches current_kernel_time().tv_sec+1
You would get occasional forward jumps in mtime, but usually when you
aren't looking, and at least you would not get real changes that are not
reflected in mtime
]]
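
The bracketed ext3 idea reads roughly as follows in userspace C. A
sketch only: ext3_store_sketch() and ext3_load_sketch() are hypothetical
names, and the `now` argument stands in for current_kernel_time().

```c
#include <assert.h>
#include <time.h>

/* The on-disk format holds whole seconds only.  Round up when there is
 * a fractional part, so the stored time is never earlier than the
 * in-memory one. */
static time_t ext3_store_sketch(struct timespec t)
{
	assert(t.tv_nsec >= 0 && t.tv_nsec < 1000000000L);
	return t.tv_nsec != 0 ? t.tv_sec + 1 : t.tv_sec;
}

/* A stored value one second ahead of the clock can only be the result
 * of rounding up, so report the current time instead of a timestamp
 * that lies in the future. */
static struct timespec ext3_load_sketch(time_t stored, struct timespec now)
{
	struct timespec t = { .tv_sec = stored, .tv_nsec = 0 };

	if (stored == now.tv_sec + 1)
		t = now;
	return t;
}
```

The substitution at load time is where the occasional forward jumps
come from: the value reported depends on when the inode is reloaded.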

NeilBrown

2010-08-18 17:34:21

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, Aug 18, 2010 at 03:53:59PM +1000, Neil Brown wrote:
> I'm not sure you even want to pay for a per-filesystem atomic access when
> updating mtime. mnt_want_write - called at the same time - seems to go to
> some lengths to avoid an atomic operation.
>
> I think that nfsd should be the only place that has to pay the atomic
> penalty, as it is where the need is.
>
> I imagine something like this:
> - Create a global struct timespec which is protected by a seqlock
> Call it current_nfsd_time or similar.
> - file_update_time reads this and uses it if it is newer than
> current_fs_time.
> - nfsd updates it whenever it reads an mtime out of an inode that matches
> current_fs_time to the granularity of 1/HZ.

We can also skip the update whenever current_nfsd_time is greater than
the inode's mtime--that's enough to ensure that the next
file_update_time() call will get a time different from the inode's
current mtime.

And that means that a sequence like

file_update_time()
N nfsd_getattr()'s

doesn't make N updates to current_nfsd_time, when only 1 was necessary.

> If the current value is before current_kernel_time, it
> is set to current_kernel_time, otherwise tv_nsec is incremented -
> unless that increases
> beyond jiffies_to_usec(1)*1000 beyond current_kernel_time.

... which would only happen on hardware that could process a getattr and
a data update per nanosecond continuously for a jiffy.

> - the global 'struct timespec' is zeroed whenever system time is set
> backwards.

OK, got it, I think: so this is the same as a global version of Alan's
clock, except that the extra ticks only happen when they need to.

The properties it satisfies:

- It's still a single global clock, so it's consistent between
files.
- It degenerates to jiffies in the absence of getattr's from
nfsd.
- It need only invalidate the other cpus' cached value of the
clock on the first getattr of a file that follows less than a
jiffy after an update of the file's data.
- Absent utime(), time going backwards, or futuristic hardware,
it guarantees that two nfsd reads of an inode's mtime will
return different values iff the inode's data was modified in
between the two.

Shortcomings:

- The clock advances in units only of either 1 jiffy or 1 ns.
This will look odd. But when the alternative is units of 1
jiffy or 0 ns, it seems an improvement....
- A slowdown due to file_update_time() marking inodes dirty more
frequently?
- Doesn't help with ext3. Oh well.
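
For anyone who wants to poke at the properties listed above, here is a
single-threaded userspace model. It is a sketch, not kernel code: there
is no seqlock, time is a plain nanosecond count, all names are
hypothetical, and "advance the clock" is interpreted as "set it to
mtime + 1", which is one reading of the proposal rather than something
the thread spells out.

```c
#include <assert.h>

typedef long long ns_t;		/* nanoseconds since an arbitrary epoch */
#define TICK_NS 4000000LL	/* one jiffy at HZ=250 */

static ns_t coarse_now;		/* models current_kernel_time() */
static ns_t precise_time;	/* models current_nfsd_time */
static int bumps;		/* how often precise_time was written */

/* Timestamp that file_update_time() would store. */
static ns_t fs_time(void)
{
	return precise_time > coarse_now ? precise_time : coarse_now;
}

/* Called when nfsd is about to report mtime to a client. */
static void nfsd_saw_mtime(ns_t mtime)
{
	assert(coarse_now % TICK_NS == 0);
	if (mtime < coarse_now)
		return;	/* from an earlier jiffy: the next update differs anyway */
	if (precise_time > mtime)
		return;	/* shortcut: the clock is already past this mtime */
	if (mtime + 1 >= coarse_now + TICK_NS)
		return;	/* cap: stay less than one jiffy ahead */
	precise_time = mtime + 1;
	bumps++;
}

/* Advance the coarse clock by one jiffy. */
static void tick(void)
{
	coarse_now += TICK_NS;
}
```

Calling fs_time(), then nfsd_saw_mtime() N times on the result, then
fs_time() again within the same jiffy yields a second timestamp 1 ns
later than the first, with a single write to precise_time. After tick()
the model falls back to the coarse clock.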

Would the extra expense rule out treating sys_stat() the same as nfsd?
It would be nice to be able to solve the same problem for userspace
nfsd's (or any other application that might be using mtime to save
rereading data).

--b.

2010-08-15 01:50:51

by Bret Towe

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Fri, Aug 13, 2010 at 1:53 PM, Patrick J. LoPresti <[email protected]> wrote:
> On Fri, Aug 13, 2010 at 12:09 PM, john stultz <[email protected]> wrote:
>>
>> So other than "show some numbers", my only thought that might make the
>> patch more attractive is that rather than a global change, or a static
>> CONFIG_ option, would it maybe make more sense as a mount option?
>
> I really like this idea.
>
> Consider the following "revision 2" of my proposal:
>
> 1) Add a function pointer "current_fs_time" to struct super_block.
>
> 2) Replace all calls of the form:
>
>     current_fs_time(sb);
>
> with
>
>     sb->current_fs_time(sb);
>
> 3) Arrange for the default value to point to the current implementation.
>
> These first three could be one patch. They change no functionality;
> they just enable the next step.
>
> Finally:
>
> 4) Add a mount option to cause sb->current_fs_time(sb) to use the
> hi-res implementation.
>
> Comments?

I'm not sure how NFS works, but if this is a client-side issue I don't
see anything wrong with a CONFIG_ item. If it's server-side, it might be
better off as a procfs or sysfs tunable, so that reboots are not
required to change the setting.

Performance-wise, why would there be any difference? The same number of
bits is being written to the disk drive, no?

2010-08-13 19:57:16

by Jim Rees

Subject: Re: Proposal: Use hi-res clock for file timestamps

Patrick J. LoPresti wrote:

On Fri, Aug 13, 2010 at 11:45 AM, john stultz <[email protected]> wrote:

> On those TSC broken systems that use the hpet or acpi_pm, a
> getnstimeofday call can take 0.5-1.3us, so the penalty can be quite
> severe.

So you are saying my proposal is a bad idea forever? (But then why
even bother having nanosecond resolution on ext4?)

How about using getnstimeofday() only if the kernel clocksource is tsc?
Presumably anyone running into this problem would have modern high
performance hardware with working tsc.

2010-08-19 22:48:18

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Thu, Aug 19, 2010 at 12:44:13PM +1000, Neil Brown wrote:
> On Wed, 18 Aug 2010 22:08:03 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Thu, Aug 19, 2010 at 10:52:18AM +1000, Neil Brown wrote:
> > > On Thu, 19 Aug 2010 09:41:36 +1000
> > > Neil Brown <[email protected]> wrote:
> > >
> > > > So I agree that this is probably more of an issue for directories than for
> > > > files, and that implementing it just for directories would be a sensible
> > > > first step with lower expected overhead - just my reasoning seems to be a bit
> > > > different.
> > >
> > > Just to be sure we are on the same page:
> > > file_update_time would always refer to current_nfsd_time, but nfsd would
> > > only update current_nfsd_time when a directory was examined (and the other
> > > conditions were met).
> > >
> > >
> > > So my current thinking on how this would look - names have been changed:
> > >
> > > - global timespec 'current_fs_precise_time' is zeroed when
> > > current_kernel_time moves backwards and is protected by a seqlock
> > >
> > > - current_fs_time would be
> > > now = max(current_kernel_time(), current_fs_precise_time)
> > > return timespec_trunc(now, sb->s_time_gran)
> > > (with appropriate seqlock protection)
> > >
> > > - new function in fs/inode.c
> > > get_precise_time(timestamp)
> >
> > Odd name for something that returns nothing of interest;
> > bump_precise_time() might be closer?
> >
> > And unique_time might be better than precise_time, since the property
> > we're asking for is that mtime on a changed file be new? (Or
> > versioned_time?)
>
> Agreed on both counts, though I'm not keen on 'bump' myself.
> got_unique_time()
> because that is what we just did... I prefer the name to reflect why the
> function is called, rather than what the function is expected to do about it.
> never_use_this_timestamp_again(timestamp)
> :-?

Maybe "retire" for a pithier version of never_use_again:

/**
 * retire_timestamp - prevent a timestamp from being reused as an mtime.
 * @timestamp
 *
 * Advance the clock used to generate mtimes to guarantee that the
 * given timestamp will not be reused on any future mtime update.
 * This allows the given timestamp to be passed back to users such as
 * nfs clients which need the guarantee that mtimes will always change
 * on file updates.
 *
 * Depending on the filesystem's s_time_gran this may not be an ironclad
 * guarantee.
 */

?

>
>
> >
> > > cft = current_fs_time()
> > > if (timestamp == cft)
> > /*
> > * Make sure the next mtime stored will be
> > * something different from timestamp:
> > */
> > > write_seqlock()
> > > if cft == current_fs_precise_time
> > > current_fs_precise_time.tv_nsec++
> > > else if cft > current_fs_precise_time
> >
> > What's the cft < current_fs_precise_time case?
>
> The current_fs_precise_time has been incremented with a resolution higher
> than s_time_gran. i.e. s_time_gran > 1.
> I'm not really sure what we want to do about that.
> Maybe we should be incrementing tv_nsec by s_time_gran as long as that is
> significantly less than jiffies_to_usec(1)*1000, but I don't know what I mean
> by 'significantly'.

How about just scratching "significantly" and saying "less"? As long as
we know jiffies is the default time source for mtimes, that should be
safe, shouldn't it?

> The only values I can find for s_time_gran in current code are 1, 100, 1000
> and 1000000000.

I didn't even know there were any other than 1 and a billion. OK!

> All those are either way bigger than a jiffy or significantly smaller, but
> suppose a filesystem came along that chose 1000000 (i.e. millisecond
> timestamps) - should we increment tv_nsec by 1000000, or not, or cross that
> bridge when we come to it?
>
> For reference:
> default is 1000000000 (this would cover ext2, ext3, reiserfs, fat, sysv, ...)
> cifs, smbfs, ntfs are 100
> udf, ceph are 1000
> rest (btrfs, ext4, gfs2, jfs, nilfs, ocfs2, xfs and virtual filesystems) are 1

Interesting list, thanks!

--b.

2010-08-13 20:53:59

by Patrick J. LoPresti

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Fri, Aug 13, 2010 at 12:09 PM, john stultz <[email protected]> wrote:
>
> So other than "show some numbers", my only thought that might make the
> patch more attractive is that rather than a global change, or a static
> CONFIG_ option, would it maybe make more sense as a mount option?

I really like this idea.

Consider the following "revision 2" of my proposal:

1) Add a function pointer "current_fs_time" to struct super_block.

2) Replace all calls of the form:

current_fs_time(sb);

with

sb->current_fs_time(sb);

3) Arrange for the default value to point to the current implementation.

These first three could be one patch. They change no functionality;
they just enable the next step.

Finally:

4) Add a mount option to cause sb->current_fs_time(sb) to use the
hi-res implementation.

Comments?
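
A rough userspace sketch of what steps 1 through 4 could look like,
purely for illustration. Everything here is an assumption: struct
sb_sketch stands in for struct super_block, sim_now is a simulated
clock with an invented value, and "hires_time" is a made-up mount
option name, not an existing one.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#define TICK_NS 4000000L	/* one jiffy at HZ=250 */

/* Simulated "true" time, so the sketch is deterministic. */
static struct timespec sim_now = { .tv_sec = 1281725151, .tv_nsec = 123456789 };

struct sb_sketch {
	long s_time_gran;	/* timestamp granularity in ns */
	struct timespec (*current_fs_time)(struct sb_sketch *sb); /* step 1 */
};

/* Round tv_nsec down to a multiple of gran, like timespec_trunc(). */
static struct timespec trunc_ts(struct timespec t, long gran)
{
	t.tv_nsec -= t.tv_nsec % gran;
	return t;
}

/* Current behaviour: the clock only advances once per jiffy. */
static struct timespec fs_time_coarse(struct sb_sketch *sb)
{
	struct timespec t = sim_now;

	t.tv_nsec -= t.tv_nsec % TICK_NS;
	return trunc_ts(t, sb->s_time_gran);
}

/* Hi-res behaviour: read the clock at full resolution. */
static struct timespec fs_time_hires(struct sb_sketch *sb)
{
	return trunc_ts(sim_now, sb->s_time_gran);
}

/* Steps 3 and 4: default to the coarse clock, switch on a mount option. */
static void sb_init(struct sb_sketch *sb, long gran, const char *opts)
{
	assert(sb != NULL && gran >= 1);
	sb->s_time_gran = gran;
	sb->current_fs_time = fs_time_coarse;
	if (opts != NULL && strstr(opts, "hires_time") != NULL)
		sb->current_fs_time = fs_time_hires;
}
```

Callers would then use sb->current_fs_time(sb) everywhere (step 2), and
a filesystem with 1-second granularity gets the same answer from either
implementation.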

- Pat

2010-08-17 14:54:05

by Andi Kleen

Subject: Re: Proposal: Use hi-res clock for file timestamps

"Patrick J. LoPresti" <[email protected]> writes:

>
> 1) Anybody who cares about file system performance is already using
> "noatime" or "relatime", which mitigates the hit greatly.

Consider mtime.

> If the above patch is too slow for some architectures, how about
> making it a configuration option? Call it "CONFIG_1980S_FILE_TICK",
> have it default to YES on the architectures that care and NO on
> anything remotely modern and sane.
>
> OK that's my proposal. Bash away.

I suspect it will be a performance disaster on x86 for VFS intensive
applications on capable file systems. VFS is very performance
critical. These checks lurk in unexpected places too, e.g. on /dev
access.

Even TSC is much slower than just reading the variable.

Also you should check if the file system granularity
even supports it, it's completely wasted on ext3 for example.

Maybe as an optional sysctl, default to off.

-Andi

--
[email protected] -- Speaking for myself only.

2010-08-17 19:18:14

by Patrick J. LoPresti

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, Aug 17, 2010 at 12:04 PM, J. Bruce Fields <[email protected]> wrote:
>
> Right, I think that we probably have to give up ext3 as a lost cause.
> But perhaps we could get away with a hack like this on filesystems that
> can store nanoseconds.

I do not think so.

The problem with "increment mtime by a nanosecond when necessary" is
that timestamps can wind up out of order. As in:

1) Do a bunch of operations on file A
2) Do one operation on file B

Imagine each operation on A incrementing its timestamp by a nanosecond
"just because". If all of these operations happen in less than 4 ms,
you can wind up with the timestamp on B being EARLIER than the
timestamp on A. That is a big no-no (think "make" or anything else
relying on timestamps for relative times).

If you can prove that the last modification on B happens after the
last modification on A, then it is very bad for the mtime on B to be
earlier than the mtime on A. I guarantee that will break things in
the real world.
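
A small userspace sketch of that failure, assuming a naive per-file
rule of "add one nanosecond if the coarse clock would not change the
mtime". The function names and the clock value are invented for the
example; time is a plain nanosecond count.

```c
#include <assert.h>

typedef long long ns_t;	/* nanoseconds since an arbitrary epoch */

/* Naive per-inode rule: use the coarse clock if it moves mtime forward,
 * otherwise bump the old mtime by one nanosecond. */
static ns_t bump_per_file(ns_t old_mtime, ns_t coarse_now)
{
	assert(coarse_now >= 0);
	return coarse_now > old_mtime ? coarse_now : old_mtime + 1;
}

/* Returns 1 if B, modified last, ends up with an earlier mtime than A. */
static int b_ends_up_earlier_than_a(int ops_on_a)
{
	const ns_t coarse_now = 40000000; /* frozen: all within one jiffy */
	ns_t mtime_a = 0, mtime_b = 0;
	int i;

	for (i = 0; i < ops_on_a; i++)		/* 1) operations on file A */
		mtime_a = bump_per_file(mtime_a, coarse_now);
	mtime_b = bump_per_file(mtime_b, coarse_now);	/* 2) one on file B */
	return mtime_b < mtime_a;
}
```

With a single operation on A the two mtimes are equal; with two or more
operations on A inside the same jiffy, B comes out earlier even though
it was modified last.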

As you say, high-resolution timestamps "will extend the useful
lifetime of NFSv3 by quite a bit". They are also a good idea in
principle, IMO. Correctness is almost always more important than
performance.

- Pat

2010-08-19 02:33:21

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, Aug 18, 2010 at 06:41:02PM -0700, john stultz wrote:
> On Wed, Aug 18, 2010 at 11:12 AM, J. Bruce Fields <[email protected]> wrote:
> > I'm completely ignorant about higher-resolution time sources.  Any
> > recommended reading?  What resolution do they actually provide, what's
> > the expense of reading them, how reliable are they, and how do the
> > answers to those questions vary across different hardware and kernel
> > versions?  A quick look at drivers/clocksource/ doesn't suggest
> > simple answers.
>
> Yea, there aren't simple answers. Clocksource hardware varies
> drastically in resolution and access time across systems and
> architectures. Further, clocksources may change while the system is
> up, so we don't really expose the hardware resolution.
>
> On x86, access latency varies from ~50ns (TSC) to ~1.3us (ACPI PM).
> (And that is ignoring the PIT, which can be 18us per call - luckily
> almost no hardware uses that). The resolution similarly scales from
> sub-ns (TSC @ > 1ghz cpus) to ~279ns (ACPI PM). Of course, across
> architectures you will see even more variance.

The race in question occurs when you manage to check mtime between two
file data updates, with all three operations occurring within a clock
tick.

No idea if that's feasible in hundreds of nanoseconds.

I'm also not sure how to judge the access latency. Certainly a
microsecond is a lot compared to just reading a cached mtime value.

Will we ever see them go backwards? (So if I know I wrote to file B
after writing to file A, is there ever a case where I could end up with
an earlier mtime on B than A?)

--b.

2010-08-17 19:47:08

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, Aug 17, 2010 at 12:43:10PM -0700, Patrick J. LoPresti wrote:
> On Tue, Aug 17, 2010 at 12:54 PM, Alan Cox <[email protected]> wrote:
> >> Is there any objection to the mount option I am proposing?
> >
> > I have none. I doubt I'd use it as it would be too expensive on system
> > performance for some of my boxes, while having an incrementing value is
> > cheap.
> >
> > I don't see the two as conflicting - in fact the bits you need to do the
> > mount option are the bits you also need to do the counter version as
> > well. One fixes ordering at no real cost, the other adds high res
> > timestamps, both are useful.
>
> A mount option could also allow a choice of timestamp resolutions:
>
> Traditional (i.e., fast)
> Alan Cox NFS hack (a tad slower but should fix NFS)
> High-res time (slowest but most accurate)
>
> I will work on a patch this week (weekend at the latest).

I kind of hate to have mount options that are required for nfs exports
to work correctly; it soon makes things too complicated for users to
reliably get right, so distributions end up setting them, and then we
all end up taking the performance tradeoff anyway.

But a mount-option-based version may at least be useful for further
experiments.

--b.

2010-08-19 01:41:04

by john stultz

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, Aug 18, 2010 at 11:12 AM, J. Bruce Fields <[email protected]> wrote:
> I'm completely ignorant about higher-resolution time sources. Any
> recommended reading? What resolution do they actually provide, what's
> the expense of reading them, how reliable are they, and how do the
> answers to those questions vary across different hardware and kernel
> versions? A quick look at drivers/clocksource/ doesn't suggest
> simple answers.

Yea, there aren't simple answers. Clocksource hardware varies
drastically in resolution and access time across systems and
architectures. Further, clocksources may change while the system is
up, so we don't really expose the hardware resolution.

On x86, access latency varies from ~50ns (TSC) to ~1.3us (ACPI PM).
(And that is ignoring the PIT, which can be 18us per call - luckily
almost no hardware uses that). The resolution similarly scales from
sub-ns (TSC @ > 1ghz cpus) to ~279ns (ACPI PM). Of course, across
architectures you will see even more variance.
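
For a rough feel of the read cost on a given machine, something like
the userspace sketch below can be used. It is only an approximation:
it times clock_gettime() from userspace rather than getnstimeofday() in
the kernel, so the numbers include syscall or vDSO overhead, and
avg_read_cost_ns() is a name invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Average cost in nanoseconds of one clock_gettime(id) call, measured
 * over iters calls with CLOCK_MONOTONIC.  Returns -1.0 if the clock
 * cannot be read. */
static double avg_read_cost_ns(clockid_t id, long iters)
{
	struct timespec start, end, scratch;
	long i;

	assert(iters > 0);
	if (clock_gettime(id, &scratch) != 0)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++)
		clock_gettime(id, &scratch);
	clock_gettime(CLOCK_MONOTONIC, &end);

	return ((end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec)) / iters;
}
```

On Linux, comparing avg_read_cost_ns(CLOCK_REALTIME, 1000000) with the
same call for CLOCK_REALTIME_COARSE gives a userspace analogue of the
hi-res versus tick-based difference, and repeating it after changing
the clocksource shows how much the hardware matters.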

thanks
-john

2010-08-17 19:35:00

by Patrick J. LoPresti

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, Aug 17, 2010 at 12:39 PM, Alan Cox <[email protected]> wrote:

>        if (time_now == time_last)
>                return { time_last , ++ct };
>        else {
>                ct = 0;
>                time_last = time_now;
>                return { time_last , 0 };
>        }
>
> providing it is done with the same 'ct' across the fs and you can't do
> enough ops/second to wrap the nanosecs - which should be fine for now,
> your ordering is still safe is it not ?

Yes, that would work. Assuming you use atomic counters, else there
is a risk of the visible time ticking backwards. It seems like a lot
of effort just to avoid having accurate timestamps on your files,
though.
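
To illustrate the atomicity point, here is a userspace sketch of the
counter idea using C11 atomics. It is a variant, not Alan's exact code:
the tick and the counter are folded into a single 64-bit nanosecond
value, and unique_fs_time() and last_issued are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Last timestamp handed out, in nanoseconds. */
static _Atomic uint64_t last_issued;

/* Return a timestamp strictly greater than any previously returned one.
 * coarse_ns is the tick-resolution clock reading in nanoseconds. */
static uint64_t unique_fs_time(uint64_t coarse_ns)
{
	uint64_t prev = atomic_load(&last_issued);
	uint64_t next;

	do {
		/* New tick: use it.  Same tick: one ns past the last value. */
		next = coarse_ns > prev ? coarse_ns : prev + 1;
		/* On failure prev is reloaded and next is recomputed. */
	} while (!atomic_compare_exchange_weak(&last_issued, &prev, next));

	assert(next > prev);
	return next;
}
```

Because the read, the comparison and the update are published with one
compare-and-swap, two CPUs can never both hand out the same value, and
no caller sees a value lower than one already returned.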

I am having trouble seeing why this is a better idea than a simple
mount option to obtain decent resolution timestamps. (Not that we
can't have both...) Is there any objection to the mount option I am
proposing?

For the Nth time, I am willing to produce and test the patch, but not
if there is zero chance of it being accepted.

- Pat

2010-08-17 19:43:12

by Patrick J. LoPresti

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, Aug 17, 2010 at 12:54 PM, Alan Cox <[email protected]> wrote:
>> Is there any objection to the mount option I am proposing?
>
> I have none. I doubt I'd use it as it would be too expensive on system
> performance for some of my boxes, while having an incrementing value is
> cheap.
>
> I don't see the two as conflicting - in fact the bits you need to do the
> mount option are the bits you also need to do the counter version as
> well. One fixes ordering at no real cost, the other adds high res
> timestamps, both are useful.

A mount option could also allow a choice of timestamp resolutions:

Traditional (i.e., fast)
Alan Cox NFS hack (a tad slower but should fix NFS)
High-res time (slowest but most accurate)

I will work on a patch this week (weekend at the latest).

Thanks, Alan.

- Pat

2010-08-19 02:10:18

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Thu, Aug 19, 2010 at 10:52:18AM +1000, Neil Brown wrote:
> On Thu, 19 Aug 2010 09:41:36 +1000
> Neil Brown <[email protected]> wrote:
>
> > So I agree that this is probably more of an issue for directories than for
> > files, and that implementing it just for directories would be a sensible
> > first step with lower expected overhead - just my reasoning seems to be a bit
> > different.
>
> Just to be sure we are on the same page:
> file_update_time would always refer to current_nfsd_time, but nfsd would
> only update current_nfsd_time when a directory was examined (and the other
> conditions were met).
>
>
> So my current thinking on how this would look - names have been changed:
>
> - global timespec 'current_fs_precise_time' is zeroed when
> current_kernel_time moves backwards and is protected by a seqlock
>
> - current_fs_time would be
> now = max(current_kernel_time(), current_fs_precise_time)
> return timespec_trunc(now, sb->s_time_gran)
> (with appropriate seqlock protection)
>
> - new function in fs/inode.c
> get_precise_time(timestamp)

Odd name for something that returns nothing of interest;
bump_precise_time() might be closer?

And unique_time might be better than precise_time, since the property
we're asking for is that the mtime on a changed file be new? (Or
versioned_time?)

> cft = current_fs_time()
> if (timestamp == cft)
/*
* Make sure the next mtime stored will be
* something different from timestamp:
*/
> write_seqlock()
> if cft == current_fs_precise_time
> current_fs_precise_time.tv_nsec++
> else if cft > current_fs_precise_time

What's the cft < current_fs_precise_time case?

--b.

> current_fs_precise_time = cft
> write_sequnlock()
> return timestamp
>
> - nfsd xdr response routine does
> ts = inode->i_mtime
> if (S_ISDIR(inode->i_mode))
> ts = get_precise_time(ts)
> xdr_encode_timespec(ts)
>
>
> get_precise_time() probably needs a bit more subtlety to handle different
> s_time_gran values and possible races, but I think it is fairly close.
>
> Then if we ever had an xstat or similar that could ask for precise
> timestamps, it just makes a similar call to get_precise_time.
> Also if we added code later to use a hires timer on hardware where it was
> efficient, get_precise_time could test for that and become a no-op
>
> Yes, I should probably turn this into a patch ... maybe another day.
>
> NeilBrown
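To make the pseudocode above concrete, here is a minimal single-threaded userspace sketch. It is not a kernel patch: `coarse_now` is a hypothetical stand-in for current_kernel_time(), the seqlock and the s_time_gran truncation are left out, and so is the "zero it when the clock moves backwards" rule. One deliberate deviation: as literally written, the `else if cft > current_fs_precise_time` branch stores cft itself, which appears to leave the next current_fs_time() equal to the timestamp just handed out. The sketch stores cft + 1 ns in both branches, which is an assumption about the intent stated in the inline comment ("make sure the next mtime stored will be something different").

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical stand-in for current_kernel_time(); set by hand by the caller. */
static struct timespec coarse_now;
/* Global bump value from the pseudocode; seqlock protection omitted here. */
static struct timespec current_fs_precise_time;

static int ts_cmp(struct timespec a, struct timespec b)
{
    if (a.tv_sec != b.tv_sec)
        return a.tv_sec < b.tv_sec ? -1 : 1;
    if (a.tv_nsec != b.tv_nsec)
        return a.tv_nsec < b.tv_nsec ? -1 : 1;
    return 0;
}

/* now = max(current_kernel_time(), current_fs_precise_time) */
static struct timespec current_fs_time(void)
{
    return ts_cmp(coarse_now, current_fs_precise_time) >= 0
        ? coarse_now : current_fs_precise_time;
}

/* Called when a reader (nfsd in the thread) is about to expose 'timestamp'. */
static struct timespec get_precise_time(struct timespec timestamp)
{
    struct timespec cft = current_fs_time();

    if (ts_cmp(timestamp, cft) == 0) {
        /* Assumed intent: the next stamp must differ, so store cft + 1 ns. */
        cft.tv_nsec++;
        if (cft.tv_nsec >= NSEC_PER_SEC) {
            cft.tv_sec++;
            cft.tv_nsec -= NSEC_PER_SEC;
        }
        current_fs_precise_time = cft;
    }
    return timestamp;
}
```

In this single-threaded form cft can never be smaller than current_fs_precise_time, because current_fs_time() takes the maximum of the two; that case could only arise from a race, which is where the seqlock would matter.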

2010-08-18 23:47:47

by NeilBrown

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Wed, 18 Aug 2010 13:32:03 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Wed, Aug 18, 2010 at 03:53:59PM +1000, Neil Brown wrote:
> > I'm not sure you even want to pay for a per-filesystem atomic access when
> > updating mtime. mnt_want_write - called at the same time - seems to go to
> > some lengths to avoid an atomic operation.
> >
> > I think that nfsd should be the only place that has to pay the atomic
> > penalty, as it is where the need is.
> >
> > I imagine something like this:
> > - Create a global struct timespec which is protected by a seqlock
> > Call it current_nfsd_time or similar.
> > - file_update_time reads this and uses it if it is newer than
> > current_fs_time.
> > - nfsd updates it whenever it reads an mtime out of an inode that matches
> > current_fs_time to the granularity of 1/HZ.
>
> We can also skip the update whenever current_nfsd_time is greater than
> the inode's mtime--that's enough to ensure that the next
> file_update_time() call will get a time different from the inode's
> current mtime.

Yes, I agree with you and Patrick - very sensible optimisation.

>
> Would the extra expense rule out treating sys_stat() the same as nfsd?
> It would be nice to be able to solve the same problem for userspace
> nfsd's (or any other application that might be using mtime to save
> rereading data).

It would be nice, but I would be loath to add any cost to 'stat' unless we
knew it was needed.
If we had an xstat() which could explicitly ask for
high-precision-time-stamps, then yes - otherwise maybe not.

(or maybe define a system:linux.xxxx xattr which would read as a
high-precision time stamp... I seem to be warming to the idea of using the
xattr interface for enhancing stat).

NeilBrown

2010-08-17 19:31:46

by J. Bruce Fields

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, Aug 17, 2010 at 08:39:41PM +0100, Alan Cox wrote:
> > The problem with "increment mtime by a nanosecond when necessary" is
> > that timestamps can wind up out of order. As in:
>
> Surely that depends on your implementation ?
>
> > 1) Do a bunch of operations on file A
> > 2) Do one operation on file B
> >
> > Imagine each operation on A incrementing its timestamp by a nanosecond
> > "just because". If all of these operations happen in less than 4 ms,
> > you can wind up with the timestamp on B being EARLIER than the
> > timestamp on A. That is a big no-no (think "make" or anything else
> > relying on timestamps for relative times).
>
>
> [time resolution bits of data][value incremented value for that time]
>
>
> if (time_now == time_last)
> return { time_last , ++ct };
> else {
> ct = 0;
> time_last = time_now
> return { time_last , 0 };
> }
>
> providing it is done with the same 'ct' across the fs and you can't do
> enough ops/second to wrap the nanosecs - which should be fine for now,
> your ordering is still safe is it not ?

Right, so if I understand correctly, you're proposing a time source
that's global to the filesystem and that guarantees it will always
return a unique value by incrementing the nanoseconds field if jiffies
haven't changed since the last time it was called.

(Does it really need to be global across all filesystems? Or is it
unreasonable to expect your unbelievably-fast make's to behave well when
sources and targets live on different filesystems?)
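The quoted counter scheme can be sketched in plain C as follows. This is only an illustration under stated assumptions: `struct stamp`, `struct stamp_source`, `next_stamp` and `stamp_cmp` are hypothetical names, the coarse clock value is passed in by the caller, and there is no locking, so it shows the ordering property but not how a real filesystem would serialize access to the shared counter.

```c
#include <stdint.h>

/* Hypothetical timestamp: a coarse tick plus a tie-breaker within the tick. */
struct stamp {
    int64_t  tick;  /* coarse clock value, e.g. jiffies */
    uint32_t ct;    /* incremented for each stamp issued in the same tick */
};

/* One of these per filesystem (or global); not thread-safe in this sketch. */
struct stamp_source {
    int64_t  time_last;
    uint32_t ct;
};

/* Issue a stamp that is unique and ordered for this source. */
static struct stamp next_stamp(struct stamp_source *src, int64_t time_now)
{
    struct stamp s;

    if (time_now == src->time_last) {
        src->ct++;                  /* same tick: bump the tie-breaker */
    } else {
        src->time_last = time_now;  /* new tick: restart the counter */
        src->ct = 0;
    }
    s.tick = src->time_last;
    s.ct = src->ct;
    return s;
}

/* Compare by tick first, then by counter. */
static int stamp_cmp(struct stamp a, struct stamp b)
{
    if (a.tick != b.tick)
        return a.tick < b.tick ? -1 : 1;
    if (a.ct != b.ct)
        return a.ct < b.ct ? -1 : 1;
    return 0;
}
```

Because every operation draws from the same counter, a single operation on file B after many operations on file A still gets a later stamp than any of A's, which is the ordering that per-file nanosecond increments would lose.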

--b.

2010-08-17 19:36:54

by Alan

Subject: Re: Proposal: Use hi-res clock for file timestamps

> (Does it really need to be global across all filesystems? Or is it
> unreasonable to expect your unbelievably-fast make's to behave well when
> sources and targets live on different filesystems?)

I don't believe it does for the NFS semantics. You can't do it globally
because then you get weirdness between local file systems that support
u/nsecs and those that don't.

It's enough to fix NFS I believe.

Alan

2010-08-18 17:50:45

by Andi Kleen

Subject: Re: Proposal: Use hi-res clock for file timestamps

> - nfsd updates it whenever it reads an mtime out of an inode that matches
> current_fs_time to the granularity of 1/HZ.

That means you have a very very hot cache line on a larger system
if there are a lot of mtime changes. Probably a bad idea.

-Andi

2010-08-13 18:45:56

by john stultz

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Fri, Aug 13, 2010 at 11:25 AM, Patrick J. LoPresti
<[email protected]> wrote:
> 3) On the 99.99% of Linux systems that are post-1990 x86, it is not
> slow at all, and the performance difference will be utterly
> undetectable in the real world.

Your stats are off here. The only fast clocksource on x86 is the TSC,
and it's busted on many, many systems. The cpu vendors have only
recently taken it seriously and resolved the majority of problems
(however, issues still remain on large numa systems, but it's much
better than the story was 3-7 years ago).

On those TSC-broken systems that use the hpet or acpi_pm, a
getnstimeofday call can take 0.5-1.3us, so the penalty can be quite
severe. And even with the TSC, expect some performance impact, as
reading hardware and doing the multiply is more costly than just
fetching a value from memory.
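A rough way to see this cost from userspace is to time repeated clock reads. The helper below is a sketch, not a rigorous benchmark: `ns_per_call` is a hypothetical name, it measures clock_gettime() rather than the in-kernel getnstimeofday(), and the result includes loop and call overhead.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Average cost in nanoseconds of one clock_gettime(id, ...) call. */
static double ns_per_call(clockid_t id, long iters)
{
    struct timespec start, end, scratch;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        clock_gettime(id, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (double)(end.tv_sec - start.tv_sec) * 1e9
                   + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed / (double)iters;
}
```

On Linux, comparing ns_per_call(CLOCK_REALTIME, n) with ns_per_call(CLOCK_REALTIME_COARSE, n) gives a userspace analogue of the two paths discussed here: the first reads the clocksource, the second returns the value saved at the last tick. CLOCK_REALTIME_COARSE is Linux-specific and may need _GNU_SOURCE to be visible, and the numbers depend heavily on which clocksource the machine is using.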

thanks
-john

2010-08-17 18:50:43

by Patrick J. LoPresti

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Tue, Aug 17, 2010 at 11:29 AM, Andi Kleen <[email protected]> wrote:
>
> You cannot be more precise than the backing file system: this causes
> non-monotonicity when the inodes are flushed (has happened in the past)

True. But non-toy filesystems designed post-1990 support nanosecond
timestamps. One-second resolution is a disaster for NFS servers and
always has been. I wonder how many man-hours have been wasted dealing
with the problem... I personally have seen it dozens of times in my
career, but "use XFS" was always a solution. Until now, that is, when
300usec round trip network times, 3GHz processors, and "weak cache
consistency" optimizations have conspired to bring it back, thanks to
4msec resolution timestamps.

Even aside from any NFS issues, I myself would prefer accurate
timestamps over a 10% boost for tight loops calling "utimes()" or
whatever. But maybe that is just me.

Anyway, to repeat my revised proposal:

1) Add a "sb_current_fs_time" member to "struct super_block". Make it
a pointer to a function returning "struct timespec". Have it default
to the current low-resolution implementation.

2) Modify the function "current_fs_time(struct super_block *sb)" to
just "return sb->sb_current_fs_time(sb)". Might as well inline it
too.

3) Add a mount option to allow the selection of the high-res time
source; i.e., to set sb_current_fs_time to point to the high-res
implementation.
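The three steps can be mocked up in userspace roughly like this. Everything here is a hypothetical sketch rather than kernel code: `struct super_block_sketch` stands in for struct super_block, `low_res_time` fakes a HZ=250 tick by rounding the wall clock down to 4 ms, and `set_time_source` stands in for whatever the mount-option parsing would do.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

#define TICK_NSEC 4000000L  /* 1/HZ for HZ=250, as in the original report */

struct super_block_sketch;
typedef struct timespec (*fs_time_fn)(struct super_block_sketch *sb);

struct super_block_sketch {
    long       s_time_gran;         /* timestamp granularity in ns */
    fs_time_fn sb_current_fs_time;  /* step 1: per-superblock time source */
};

/* Round tv_nsec down to a multiple of gran (gran <= 1 means no rounding). */
static struct timespec trunc_to_gran(struct timespec t, long gran)
{
    if (gran > 1)
        t.tv_nsec -= t.tv_nsec % gran;
    return t;
}

/* Default source: simulated tick-granular clock. */
static struct timespec low_res_time(struct super_block_sketch *sb)
{
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    now = trunc_to_gran(now, TICK_NSEC);
    return trunc_to_gran(now, sb->s_time_gran);
}

/* Optional source: full-resolution clock read. */
static struct timespec high_res_time(struct super_block_sketch *sb)
{
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    return trunc_to_gran(now, sb->s_time_gran);
}

/* Step 2: current_fs_time just dispatches through the pointer. */
static inline struct timespec current_fs_time(struct super_block_sketch *sb)
{
    return sb->sb_current_fs_time(sb);
}

/* Step 3: what a mount option would select. */
static void set_time_source(struct super_block_sketch *sb, int hires)
{
    sb->sb_current_fs_time = hires ? high_res_time : low_res_time;
}
```

The indirection means filesystems that never ask for the high-res source keep the existing cost, apart from one indirect call per timestamp.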

Would patches implementing this stand a realistic chance of being accepted?

- Pat

2010-08-14 16:45:11

by Patrick J. LoPresti

Subject: Re: Proposal: Use hi-res clock for file timestamps

On Fri, Aug 13, 2010 at 1:53 PM, Patrick J. LoPresti <[email protected]> wrote:
>
> Consider the following "revision 2" of my proposal:
>

In case I was not clear...

I am volunteering to implement this myself, but only if there is a
realistic chance my patches will be accepted.

- Pat