LinuxLists.cc - Nanosecond resolution for stat(2)

2002-09-23 21:44:27

Subject: Nanosecond resolution for stat(2)

[The original message for this included the patch and didn't make it to
l-k likely because it was too big. Reposted with out-of-line patch]

Linux currently uses second resolution (time_t) for st_[cam]time in stat(2).
This is a problem for highly parallel make -j, which cannot detect cases
when a make rule runs for less than a second. GNU make supports finer
grained timestamps for this on other OSes, but Linux doesn't support them
so far. This patch changes that. We also have several filesystems
in tree now that support better than second resolution for [cam]time in
their on disk inode (XFS, JFS, NFSv3 and VFAT).
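
To make the make -j problem concrete, here is a minimal sketch in C that
compares two timestamps first at second resolution and then with the
nanosecond field included. The helper names cmp_sec and cmp_timespec are
invented for illustration; they are not taken from the patch or from GNU make.

```c
#include <time.h>

/* Hypothetical helper: compare at one-second resolution only.
 * Returns <0, 0 or >0, like strcmp. */
static int cmp_sec(const struct timespec *a, const struct timespec *b)
{
    return (a->tv_sec > b->tv_sec) - (a->tv_sec < b->tv_sec);
}

/* Hypothetical helper: compare seconds first, then nanoseconds. */
static int cmp_timespec(const struct timespec *a, const struct timespec *b)
{
    if (a->tv_sec != b->tv_sec)
        return (a->tv_sec > b->tv_sec) - (a->tv_sec < b->tv_sec);
    return (a->tv_nsec > b->tv_nsec) - (a->tv_nsec < b->tv_nsec);
}
```

Two files written 0.5 s apart within the same second compare equal with
cmp_sec, so a tool cannot tell which one is newer; cmp_timespec orders them.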

This patch extends the VFS and stat(2) to add nsec timestamps.

Why nsecs? First, to be compatible with Solaris. Second, when you add a new
32bit field there is no reason to stop at msec. It just uses
a POSIX struct timespec. This matches what the filesystems (NFSv3,JFS,XFS)
do.

The real resolution is currently a jiffy because it just uses xtime
instead of calling gettimeofday. In 2.5 that is 1ms, which should
hopefully be good enough. If not we can change it later to use do_gettimeofday.
do_gettimeofday unfortunately takes a readlock currently on most architectures,
so before doing that it would be a good idea to fix at least i386 to use
lockless gettimeofday (implementations of that exist already). But xtime
should be good enough for now.
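
As a rough userspace model of what tick granularity means for the nanosecond
field (this is not kernel code, and tick_granular is a made-up name), a
timestamp taken from a clock that only advances once per tick is effectively
rounded down to the last tick boundary:

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical model: round a timestamp down to the last tick boundary.
 * hz is assumed to divide NSEC_PER_SEC evenly (100 and 1000 both do). */
static struct timespec tick_granular(struct timespec ts, long hz)
{
    long nsec_per_tick = NSEC_PER_SEC / hz;

    ts.tv_nsec -= ts.tv_nsec % nsec_per_tick;
    return ts;
}
```

With hz = 1000, every tv_nsec value is a multiple of 1,000,000, so the
nanosecond field carries millisecond information only.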

I chose to reuse the "reserved for year 2036" fields in stat64 for nsec, because
y2036 will need many other system call and glibc changes anyways
(e.g. new time, new gettimeofday, glibc support) so adding a new stat64
by then won't be a big deal. The newer architectures have enough

The current kernels fill the fields now reused for nsec always with 0,
so there is perfect compatibility.

On stat64 these fields are always there because everybody uses the glibc
layout. With stat on 64bit architectures it is unfortunately mixed.
The newer 64bit architectures use the stat64 layout. The older ones
unfortunately didn't reserve fields for this (this is mainly alpha,
I think). For now alpha has no way to get at the nsec values. Fixing
it probably requires a new stat64 call for alpha.

I had to add a preprocessor symbol for this case.

I fixed all the architectures for it.

The old utimes system call already supported timeval, so it works fine
(that is us instead of ns resolution, but should be good enough for now).
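
A struct timeval carries seconds plus microseconds, so converting it to a
struct timespec is a multiplication by 1000. The sketch below shows that
conversion; the function name timeval_to_ts is an assumption for illustration
and does not come from the patch.

```c
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: widen a microsecond timeval to a nanosecond
 * timespec. The low three decimal digits of tv_nsec are always zero. */
static struct timespec timeval_to_ts(struct timeval tv)
{
    struct timespec ts;

    ts.tv_sec = tv.tv_sec;
    ts.tv_nsec = (long)tv.tv_usec * 1000L;
    return ts;
}
```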

I changed the inode and iattr fields to struct timespec and fixed all the
file systems and other code that accessed them. The rounding from seconds
is in general a bit crude: it should round, but values are currently
just truncated.

Some drivers (like mouse drivers or tty) do dubious inode [mac]time
accesses on the on-disk inode without even marking it dirty. This is
likely a bug. I fixed some of them but left others alone for now;
they should probably all be fixed.

[Linus noted that the tty drivers do this to keep 'w' updated. The
patch keeps this. It's probably nonsense for the mouse drivers and
partly removed there.]

I didn't fix Intermezzo completely because it didn't compile at all.

This patch could in theory affect benchmarks a bit. Andrew Morton previously
did an optimization to put inodes only once a second onto the dirty list
when their [mca]time changes. With this patch they will be put on the dirty
list each jiffy (1ms), so in the worst case 1000 times as often. The
cost in this is mainly in taking the locks and putting the inode onto
the dirty list. On many FS which do not have better than a second
resolution this makes no sense, because they only change the value once a
second anyways. If this should be a problem a new update_time file/inode
operation may need to be added. I didn't do this for now.
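
One way to picture the check such an operation could make is sketched below:
only treat a timestamp update as dirtying the inode when the value the
filesystem would actually store changes. This is purely an assumption about
how it might look; stored_time_changes and the gran_ns parameter are
hypothetical and appear nowhere in the patch.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical check: does the on-disk representation change?
 * gran_ns is the assumed filesystem timestamp granularity in
 * nanoseconds, between 1 and 1000000000 (one second). */
static bool stored_time_changes(struct timespec old, struct timespec updated,
                                long gran_ns)
{
    if (old.tv_sec != updated.tv_sec)
        return true;
    return old.tv_nsec / gran_ns != updated.tv_nsec / gran_ns;
}
```

With a one-second granularity, two updates 1ms apart within the same second
store the same value, so the second one would not need to touch the dirty list.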

The kernel internally always keeps the nsec (or rather 1ms) resolution
stamp. When a filesystem doesn't support it in its inode (like ext2)
and the inode is flushed to disk and then reloaded then an application
that is nanosecond aware could in theory see a backwards jumping time.
I didn't do anything against that yet, because it looks more
like a theoretical problem to me. If it should be one in practice
it could be fixed by rounding the time up in this case.
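
The effect can be modelled in a few lines of C. The sketch assumes a
filesystem that stores whole seconds only; both function names are invented
for illustration, and reload_rounded_up is just one possible reading of
"rounding the time up".

```c
#include <time.h>

/* Model of flushing an inode to a seconds-only filesystem and reading
 * it back: the nanosecond part is lost, so the time can move backwards. */
static struct timespec reload_truncated(struct timespec in_core)
{
    struct timespec on_disk = { .tv_sec = in_core.tv_sec, .tv_nsec = 0 };

    return on_disk;
}

/* One reading of the rounding-up fix: store the next full second
 * whenever any sub-second part would otherwise be dropped. */
static struct timespec reload_rounded_up(struct timespec in_core)
{
    struct timespec on_disk = { .tv_sec = in_core.tv_sec, .tv_nsec = 0 };

    if (in_core.tv_nsec != 0)
        on_disk.tv_sec += 1;
    return on_disk;
}
```

An in-core time of 50.4s comes back as 50.0s from reload_truncated (a jump
backwards) and as 51.0s from reload_rounded_up (a jump forwards).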

Patch for 2.5.38 can be found at
ftp://ftp.firstfloor.org/pub/ak/v2.5/nsec-2.5.38-1.gz

-Andi

2002-09-24 04:13:01

by Andrew Pimlott

Subject: Re: Nanosecond resolution for stat(2)

On Mon, Sep 23, 2002 at 11:48:36PM +0200, Andi Kleen wrote:
> The kernel internally always keeps the nsec (or rather 1ms) resolution
> stamp. When a filesystem doesn't support it in its inode (like ext2)
> and the inode is flushed to disk and then reloaded then an application
> that is nanosecond aware could in theory see a backwards jumping time.
> I didn't do anything against that yet, because it looks more
> like a theoretical problem for me.

I assume you mean "theoretical" that an application would care, not
that it would happen. (Unless I misunderstand, it is nearly
guaranteed to happen every time an inode is evicted after a
[mac]time update.)

I fear that there are applications that will be harmed by any
spurious change in [mac]time, even if it's not backwards. Apps that
trigger on any change in mtime may trigger twice for every change.
Eg, I suspect there is some scenario in which an rsync-like
application that supports nanoseconds could suffer (just in
performance, but still).

> If it should be one in practice
> it could be fixed by rounding the time up in this case.

This would mean that even "legacy" programs that only use second
resolution would be affected, which seems worse. At least programs
that recognize the nanosecond field are more likely to know about
the issue.

Andrew

2002-09-24 04:31:45

by Mark Mielke

Subject: Re: Nanosecond resolution for stat(2)

On Tue, Sep 24, 2002 at 12:05:28AM -0400, Andrew Pimlott wrote:
> On Mon, Sep 23, 2002 at 11:48:36PM +0200, Andi Kleen wrote:
> > The kernel internally always keeps the nsec (or rather 1ms) resolution
> > stamp. When a filesystem doesn't support it in its inode (like ext2)
> > and the inode is flushed to disk and then reloaded then an application
> > that is nanosecond aware could in theory see a backwards jumping time.
> > I didn't do anything against that yet, because it looks more
> > like a theoretical problem for me.
> ...
> I fear that there are applications that will be harmed by any
> spurious change in [mac]time, even if it's not backwards. Apps that
> trigger on any change in mtime may trigger twice for every change.
> Eg, I suspect there is some scenario in which an rsync-like
> application that supports nanoseconds could suffer (just in
> performance, but still).

The behaviour does seem wrong. Resolution should not be faked to be
more accurate than the granularity offered by the underlying file
system. Timestamps can be persistently stored, or stored for longer
periods of time, for all sorts of reasons beyond 'make', each with
consequences that cannot be determined here.

What would it take to get microsecond or better time stored in ext[23]?

mark

--
Mark Mielke | Neighbourhood Coder
Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-09-24 05:08:31

by NeilBrown

Subject: Re: Nanosecond resolution for stat(2)

On Monday September 23, [email protected] wrote:
>
> The kernel internally always keeps the nsec (or rather 1ms) resolution
> stamp. When a filesystem doesn't support it in its inode (like ext2)
> and the inode is flushed to disk and then reloaded then an application
> that is nanosecond aware could in theory see a backwards jumping time.
> I didn't do anything against that yet, because it looks more
> like a theoretical problem for me. If it should be one in practice
> it could be fixed by rounding the time up in this case.

Would it make sense, when loading a time from disk, for the low order,
non-stored bits of the time to be initialised high rather than low,
i.e. to 999,999,999 rather than 0?
This way time stamps would never seem to jump backwards, only
forwards, which seems less likely to cause confusion and will mean that a
change is not missed (I'm thinking NFS here where cache correctness
depends heavily on mtime).

Also, would it make sense, for filesystems that don't store the full
resolution, to make that forward jump appear as early as
possible? I.e. if the mtime (ctime/atime) is earlier than the current
time at the resolution of the filesystem, then make the mtime appear to
be what it would be if reloaded from storage... Maybe an example
would help.

Assuming an internal resolution of 1 millisecond (to save on digits)
and a stored resolution of 1 second:

time      change is made    Apparent timestamp

23.100          X                23.100
23.200                           23.100
23.300          X                23.300
23.500          X                23.500
23.900                           23.500
24.001                           23.999
25.000                           23.999

Thus the only incorrect observation that an application can make is
that there is an extra change at the end of a second when other
changes were made. I think this is better than an apparent change
suddenly becoming visible many minutes after the time of that apparent
change, and definitely better than a timestamp moving backwards.
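
The rule behind the example can be sketched as a small C function working in
milliseconds, with a stored resolution of one second as in the example. The
name apparent_ms is hypothetical; the sketch only models the proposal and is
not kernel code.

```c
#include <stdint.h>

#define MS_PER_SEC 1000

/* Hypothetical model of the "early forward jump": once the current
 * second is later than the second of the last change, report the
 * timestamp as it would look after a reload with the non-stored low
 * bits set high (xx.999). Otherwise report the precise in-core value. */
static int64_t apparent_ms(int64_t mtime_ms, int64_t now_ms)
{
    int64_t mtime_sec = mtime_ms / MS_PER_SEC;

    if (now_ms / MS_PER_SEC > mtime_sec)
        return mtime_sec * MS_PER_SEC + (MS_PER_SEC - 1);
    return mtime_ms;
}
```

With the last change at 23.500, apparent_ms(23500, 23900) stays at 23500,
while apparent_ms(23500, 24001) and apparent_ms(23500, 25000) both give 23999.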

NeilBrown

2002-09-24 05:18:25

by Andreas Dilger

Subject: Re: Nanosecond resolution for stat(2)

On Sep 24, 2002 00:35 -0400, Mark Mielke wrote:
> The behaviour does seem wrong. Resolution should not be faked to be
> more accurate than the granularity offered by the underlying file
> system. Timestamps can be persistently stored, or stored for longer
> periods of time, for all sorts of reasons beyond 'make', each with
> consequences that cannot be determined here.
>
> What would it take to get microsecond or better time stored in ext[23]?

Not very much. We have been thinking about this for a while already.

The microsecond-resolution times would be stored in a "large inode"
or in an extended attribute if the inode is a regular-sized one. The
latter would be a pretty big performance hit for most applications if
it were only the u-second data that were being stored in the EA space.
We are also looking at a better method of storing the EA data so that
it is more efficient than the current EA implementation, but that is
mostly tangential to your concerns.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-09-24 12:05:15

by Andi Kleen

Subject: Re: Nanosecond resolution for stat(2)

Andrew Pimlott <[email protected]> writes:
>
> I assume you mean "theoretical" that an application would care, not
> that it would happen. (Unless I misunderstand, it is nearly
> guaranteed to happen every time an inode is evicted after a
> [mac]time update.)

Only when you use an old file system. I guess even ext2 and ext3 will
learn about nsec timestamps eventually (e.g. by increasing the inode to
the next power of two, because there are many other contenders for new
inode fields too).

>
> I fear that there are applications that will be harmed by any
> spurious change in [mac]time, even if it's not backwards. Apps that

When they only look at the second part nothing will change for them.

If applications that do look at nsecs have serious problems with this
backwards jumping it's still possible to add upwards rounding. I didn't
want to do this for the first cut to avoid over engineering.
>
> > If it should be one in practice
> > it could be fixed by rounding the time up in this case.
>
> This would mean that even "legacy" programs that only use second
> resolution would be affected, which seems worse. At least programs
> that recognize the nanosecond field are more likely to know about
> the issue.

Well, there is some change in behaviour. You have to live with that.

-Andi

2002-09-24 12:16:35

by Andi Kleen

Subject: Re: Nanosecond resolution for stat(2)

Neil Brown <[email protected]> writes:

> Would it make sense, when loading a time from disk, for the low order,
> non-stored bits of the time to be initialised high rather than low.
> i.e. to 999,999,999 rather than 0.
> This way time stamps would never seem to jump backwards, only
> forwards, which seems less likely to cause confusion and will mean that a
> change is not missed (I'm thinking NFS here where cache correctness
> depends heavily on mtime).

It would be possible, but I would only add it when there are actual
problems with the current solution.
It looks like a bit of a hack.

>
> Also, would it make sense, for filesystems that don't store the full
> resolution, to make that forward jump appear as early as
> possible. i.e. if the mtime (ctime/atime) is earlier than the current
> time at the resolution of the filesystem, then make the mtime appear to
> be what it would be if reloaded from storage... Maybe an example
> would help.

That would require many more callbacks into the filesystem. The VFS changes
the times, but it doesn't know what the resolution of the file system is.

In theory it could help a bit when the rounding is needed, but again
to avoid overengineering early I would like to first see if the current
KISS way works out well enough.

[note that when you want the "only flush once a second" optimization back
you would need some of these callbacks anyways, so when doing it it may
be a good idea to merge it with early increase]

> Thus the only incorrect observation that an application can make is
> that there is an extra change at the end of a second when other
> changes were made. I think this is better than an apparent change
> suddenly becoming visible many minutes after the time of that apparent
> change, and definitely better than a timestamp moving backwards.

Probably yes. But it's also much more intrusive for the kernel code.

It also has the drawback that the application can see times in the future
in stat, e.g. when someone does

touch file
stat(file, &st)
gettimeofday(&now)
tvsub(&diff, &now, &st.st_mtime_ts)

Then diff may actually be negative. Which could also in theory break stuff.
So you would trade one breakage for another. Let's see if KISS works out
first :-)
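
A small C sketch of that subtraction, under the assumption that the timestamp
was reported with its low bits set high (23.999999999) while the clock still
reads 23.2. The helper ts_diff_ns is a made-up stand-in for tvsub above.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: signed difference a - b in nanoseconds. */
static int64_t ts_diff_ns(struct timespec a, struct timespec b)
{
    return ((int64_t)a.tv_sec - (int64_t)b.tv_sec) * 1000000000LL
           + (a.tv_nsec - b.tv_nsec);
}
```

With now = 23.2s and a reported mtime of 23.999999999s the result is
negative, so code that assumes "now - mtime >= 0" would see a file from the
future.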

-Andi