2017-03-30 17:30:12

by David Howells

[permalink] [raw]
Subject: Apparent backward time travel in timestamps on file creation

Hi Thomas, John,

I've been writing a testcase for xfstests to test statx. However, it's turned
up what I think is a bug in the kernel's time-tracking system. If I do:

date +%s.%N
touch foo
dump-timestamps foo

such that foo is created, sometimes the atime, mtime and ctime timestamps on
foo will be *before* the time printed by 'date'.

For example:

[root@andromeda ~]# Z=/b/zebra6; date +%s.%N; touch $Z; /tmp/dump-timestamps $Z
1490894656.267225764
st_atime: 1490894656.267032686
st_mtime: 1490894656.267032686
st_ctime: 1490894656.267032686

As can be seen, the three file timestamps are -193078 nsec from the prior
clock time. This was with git commit:

89970a04d70c6c9e5e4492fd4096c0b5630a478c

the current head of Linus's tree.

I'm sure I've seen a case where tv_sec differed by 1 when running my xfstests
script, but it's hard to reproduce that. It also occurs with other file
types, not just regular files. It occurs on both ext4 and xfs.

I've attached the source for dump-timestamps below.

David
---
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
int main(int argc, char *argv[])
{
struct stat st;

if (argc != 2) {
fprintf(stderr, "Format: %s <file>\n", argv[0]);
exit(2);
}
if (stat(argv[1], &st) == -1) {
perror(argv[1]);
exit(1);
}
printf("st_atime: %ld.%09ld\n", st.st_atim.tv_sec, st.st_atim.tv_nsec);
printf("st_mtime: %ld.%09ld\n", st.st_mtim.tv_sec, st.st_mtim.tv_nsec);
printf("st_ctime: %ld.%09ld\n", st.st_ctim.tv_sec, st.st_ctim.tv_nsec);
return 0;
}


2017-03-30 18:08:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

On Thu, Mar 30, 2017 at 10:30 AM, David Howells <[email protected]> wrote:
>
> I've been writing a testcase for xfstests to test statx. However, it's turned
> up what I think is a bug in the kernel's time-tracking system. If I do:
>
> date +%s.%N
> touch foo
> dump-timestamps foo
>
> such that foo is created, sometimes the atime, mtime and ctime timestamps on
> foo will be *before* the time printed by 'date'.

I don't think our filesystem timestamps have ever been equivalent to
"gettimeofday()".

The "gettimeofday()" times are actually fairly expensive to calculate
(although we've optimized that code heavily, because it's a very
common system call in many loads). We try to give gettimeofday() much
higher precision than any other time in the system: it not only
participates in all the NTP stuff, we also actually read the hardware
time register (hopefully the TSC, but it can be any time source) and
interpolate that very carefully to give a high-quality clock value
that is *much* higher precision than the timer tick.

In contrast, the filesystem times are based on CURRENT_TIME (and the
modern variation: "current_time(inode)") that is a completely
different animal. It truncates the time to the inode time granularity,
yes, but there's a much more fundamental thing going on: instead of
using that exact gettimeofday() thing, it just uses
"current_kernel_time()".

And current_kernel_time() doesn't do *any* of the fancy "interpolate
high-quality hardware clock" stuff. No, it just uses "xtime" that is
updated by the timer interrupt (ok, that's slightly simplified,
'xtime' is no longer just a single global, we have a per-timekeeping
one these days, but it's historically and conceptually what we're
doing).

The difference can be quite noticeable - basically the
"gettimeofday()" time will interpolate within timer ticks, while
"xtime" is just the truncated "time at timer tick" value _without_ the
correction.

And none of this is new. It goes back to forever. Any Linux version
that had gettimeofday() with the finer-grained offset. Not 0.01 (I
checked).

[ Goes digging. Ok, the gettimeofday timer interpolation was added in
April, 1993 in version 0.99.9 ]

So it's been like this for about 24 years now.

Linus

2017-03-30 18:49:13

by John Stultz

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

On Thu, Mar 30, 2017 at 10:30 AM, David Howells <[email protected]> wrote:
> Hi Thomas, John,
>
> I've been writing a testcase for xfstests to test statx. However, it's turned
> up what I think is a bug in the kernel's time-tracking system. If I do:
>
> date +%s.%N
> touch foo
> dump-timestamps foo
>
> such that foo is created, sometimes the atime, mtime and ctime timestamps on
> foo will be *before* the time printed by 'date'.
>
> For example:
>
> [root@andromeda ~]# Z=/b/zebra6; date +%s.%N; touch $Z; /tmp/dump-timestamps $Z
> 1490894656.267225764
> st_atime: 1490894656.267032686
> st_mtime: 1490894656.267032686
> st_ctime: 1490894656.267032686
>
> As can be seen, the three file timestamps are -193078 nsec from the prior
> clock time. This was with git commit:

Linus covered this already, as its a granularity difference, but it is
something that seems to crop up every few years.

If you want a timestamp that matches the granularity of what the
filesystem uses, you can use clock_gettime(CLOCK_REALTIME_COARSE,..),
which returns the same "time-at-the-last-tick" that filesystem
timestamps use (with the same performance benefit).

Though one slight correction to Linus' comment is that both filesystem
timestamps and CLOCK_REALTIME_COARSE are NTP corrected, the main
performance gain is that one doesn't have go out and read the
clocksource and covert the cycles to nanoseconds.

thanks
-john

2017-03-30 19:35:32

by David Howells

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

Linus Torvalds <[email protected]> wrote:

> The difference can be quite noticeable - basically the
> "gettimeofday()" time will interpolate within timer ticks, while
> "xtime" is just the truncated "time at timer tick" value _without_ the
> correction.

Is there any way to determine the error bar, do you know? Or do I just make
up a fudge factor?

Obviously the filesystem truncation would need to be taken into account (and
this can be different for different timestamps within the same filesystem).

David

2017-03-30 19:52:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

On Thu, Mar 30, 2017 at 12:35 PM, David Howells <[email protected]> wrote:
> Linus Torvalds <[email protected]> wrote:
>
>> The difference can be quite noticeable - basically the
>> "gettimeofday()" time will interpolate within timer ticks, while
>> "xtime" is just the truncated "time at timer tick" value _without_ the
>> correction.
>
> Is there any way to determine the error bar, do you know? Or do I just make
> up a fudge factor?

Hmm. The traditional error bar is just HZ. Which depends on
architecture, but I don't think we've ever had anything below 100 on
x86 (on other architectures it's been lower, like 24Hz on MIPS).

I guess you could have missed timer ticks that makes it worse - we do
have code to try to handle things like that.

But in general, I think you should be able to more or less rely on
that 100Hz error bar.

I don't think tickless really changes that, because even the full
tickless keeps ticks going on _some_ CPU to make xtime work. John
should know the details better. Do we ever slow down from that?

> Obviously the filesystem truncation would need to be taken into account (and
> this can be different for different timestamps within the same filesystem).

Yeah, for some filesystems the truncation issue is going to be the
bigger one (eg truncate to one second or something).

But likely those are not the ones you'd care to necessarily test. FAT?

Linus

2017-03-30 20:03:56

by John Stultz

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

On Thu, Mar 30, 2017 at 12:35 PM, David Howells <[email protected]> wrote:
> Linus Torvalds <[email protected]> wrote:
>
>> The difference can be quite noticeable - basically the
>> "gettimeofday()" time will interpolate within timer ticks, while
>> "xtime" is just the truncated "time at timer tick" value _without_ the
>> correction.
>
> Is there any way to determine the error bar, do you know? Or do I just make
> up a fudge factor?

Again, I'd utilize clock_gettime(CLOCK_REALTIME_COARSE, ...) for this
instead of playing with fudge factors.

thanks
-john

2017-03-30 20:04:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

On Thu, Mar 30, 2017 at 12:52 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Mar 30, 2017 at 12:35 PM, David Howells <[email protected]> wrote:
>>
>> Is there any way to determine the error bar, do you know? Or do I just make
>> up a fudge factor?
>
> Hmm. The traditional error bar is just HZ

Oh, I'm just being stupid.

The error bar can be huge, for the simple reason that the filesystem
you are testing may not be sharing a clock with the CPU at _all_.

IOW, think network filesystems.

You basically should never test a file time with current time. Yes,
yes, you'd hope that people run NTP or other things and that clocks
are "close enough", but in general the only thing you should test a
file timestamp with is another file timestamp on the same filesystem.

I guess with NTP it's not unreasonable to still say "0.01s is a good
target". I remember having seen people whose clocks were off by *days*
(it shows up in SCM timestamps going backwards l^), but you'd hope
that is really really rare.

Linus

2017-03-30 20:06:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

On Thu, Mar 30, 2017 at 1:03 PM, John Stultz <[email protected]> wrote:
>
> Again, I'd utilize clock_gettime(CLOCK_REALTIME_COARSE, ...) for this
> instead of playing with fudge factors.

See my note on network filesystems. It won't help.

fudgefactors would at least work. Allow people to set them large for
filesystems with low granularity, or networks with bad time
synchronization.

Linus

2017-03-30 20:13:52

by David Howells

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

Linus Torvalds <[email protected]> wrote:

> The error bar can be huge, for the simple reason that the filesystem
> you are testing may not be sharing a clock with the CPU at _all_.
>
> IOW, think network filesystems.

Can't I just not do the tests when the filesystem is a network fs? I don't
think it should be a problem for disk filesystems on network-attached storage.

David

2017-03-30 20:30:16

by David Lang

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

On Thu, 30 Mar 2017, David Howells wrote:

> Linus Torvalds <[email protected]> wrote:
>
>> The error bar can be huge, for the simple reason that the filesystem
>> you are testing may not be sharing a clock with the CPU at _all_.
>>
>> IOW, think network filesystems.
>
> Can't I just not do the tests when the filesystem is a network fs? I don't
> think it should be a problem for disk filesystems on network-attached storage.

it's not trivial to detect if a filesystem is local or network (you would have
to do calls to figure out what filesystem you are on, then have a list to define
what's local and what's remote, that list would become out of date as new
filesystems are added)

David Lang

2017-03-30 21:13:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

On Thu, Mar 30, 2017 at 1:13 PM, David Howells <[email protected]> wrote:
> Linus Torvalds <[email protected]> wrote:
>
>> The error bar can be huge, for the simple reason that the filesystem
>> you are testing may not be sharing a clock with the CPU at _all_.
>>
>> IOW, think network filesystems.
>
> Can't I just not do the tests when the filesystem is a network fs? I don't
> think it should be a problem for disk filesystems on network-attached storage.

So I actually think that the whole "check timestamps" would be
interesting as a test across a lot of filesystems - including very
much network filesystems - but I think such a test should be largely
informational rather than about correctness.

Now, there definitely are correctness issues too wrt file timestamps
(ie the whole "writes should update mtime" kind of testing), and I
think many of those cound be extended to check relative timestamps on
the same filesystem. For example, if you write to file A first, and to
file B second, it would certainly be odd and interesting if file B now
has a modification time that is before file A.

I say "odd and interesting" because I could see it happening
(writeback ordering), and there are certainly very real issues wrt
things like open/close consistency ordering. For example, if you write
to A, then _close_ A, and then open B and write to B, if the
timestamps are still in the wrong order that really could cause major
real problems for simple things like "make".

So there's a lot of valid and interesting timestamp data that could be
tested for.

The "is it in sync with gettimeofday()" is interesting too, even if
the answer is that you don't expect it to be _perfectly_ in sync. A
test that just reports maximum slop might be an interesting test, and
could show real problems (maybe bad network time synchronization, but
maybe actual bugs in our internal xtime handling even for local
filesystems!).

There are just plain possible *oddities* with file time stamps too.
For example, we've definitely had situations where when you wrote to a
file, the mtime was updated to be precise to within a millisecond -
but then 30 seconds later, when the inode was written back, the mtime
might be truncated to the on-disk format that might be just seconds
(or hundreds of seconds), so now the mtime ON THE SAME FILE might be
going backwards!

So I do think that testing for odd cases like this can be interesting,
but I think it should be primarily be seen as a "reporting" issue than
a "correctness" issue. Then a human might react to the report and say
"that looks _really_ odd and is _way_ out of line of the expected
clock skew" and elevate it to a correctness issue.

So maybe you shouldn't do any real "fudge factor", and just report the
raw numbers. Things like just your comment

"As can be seen, the three file timestamps are -0.000193078 sec
from the prior CPU timestamp"

is just interesting, and as long as it's in that fairly small range,
it's not a bug but it's a good reminder that filesystem timestamps
aren't the same as gettimeofday().

And then if your tool starts reporting times that are off by seconds
or minutes, people might say "Hey, that's not right.." and find
something.

Linus

2017-03-30 22:22:22

by David Howells

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

Linus Torvalds <[email protected]> wrote:

> The "is it in sync with gettimeofday()" is interesting too, even if
> the answer is that you don't expect it to be _perfectly_ in sync. A
> test that just reports maximum slop might be an interesting test, and
> could show real problems (maybe bad network time synchronization, but
> maybe actual bugs in our internal xtime handling even for local
> filesystems!).

I wonder if multi-cpu systems might show interesting differences between CPUs
too. I would hope not since xtime is based on a global variable.

> And then if your tool starts reporting times that are off by seconds
> or minutes, people might say "Hey, that's not right.." and find
> something.

More likely never see it as the output is hidden away by xfstests. Probably
xfstests needs to gain some way of lending prominence to information of this
type.

David

2017-03-31 12:35:31

by Bob Peterson

[permalink] [raw]
Subject: Re: Apparent backward time travel in timestamps on file creation

----- Original Message -----
| On Thu, Mar 30, 2017 at 1:13 PM, David Howells <[email protected]> wrote:
| > Linus Torvalds <[email protected]> wrote:
| >
| >> The error bar can be huge, for the simple reason that the filesystem
| >> you are testing may not be sharing a clock with the CPU at _all_.
| >>
| >> IOW, think network filesystems.
| >
| > Can't I just not do the tests when the filesystem is a network fs? I don't
| > think it should be a problem for disk filesystems on network-attached
| > storage.
|
| So I actually think that the whole "check timestamps" would be
| interesting as a test across a lot of filesystems - including very
| much network filesystems - but I think such a test should be largely
| informational rather than about correctness.
|
| Now, there definitely are correctness issues too wrt file timestamps
| (ie the whole "writes should update mtime" kind of testing), and I
| think many of those cound be extended to check relative timestamps on
| the same filesystem. For example, if you write to file A first, and to
| file B second, it would certainly be odd and interesting if file B now
| has a modification time that is before file A.

This can happen, and it's not just network file systems. This issue
is also a concern of GFS2 where we have shared storage. We like to think
ntp will keep things relatively sane, but still, we've had issues in the
past where time discrepancies have caused confusion:

File X is created on node 1, but due to clock drift, node 2 sees that
file as having been created in the future, etc.

It's even more worrisome outside the kernel where software (e.g. in the
past, parts of the cluster infrastructure) would calculate negative
time values, interpret them as an "nearly infinite amount of time" having
passed, and then various watchdogs nuking nodes.

I remember a long time ago someone was up in arms because of weird
effects they were seeing, and it boiled down to not using any time sync
and one of their cluster nodes had the wrong month, or some such.

Regards,

Bob Peterson
Red Hat File Systems