2002-10-28 09:41:01

by Paul Eggert

Subject: nanosecond file timestamp resolution in filesystems, GNU make, etc.

> Date: Sun, 27 Oct 2002 10:36:51 -0500
> From: Andrew Pimlott
>
> I thought I'd forward these in case you'd like to add anything.
> Andi Kleen is adding nanosecond timestamps to Linux (finally!), and
> I have some concerns about how his implementation might affect
> programs like make....

Thanks for mentioning this, as I don't normally read linux-kernel.
I'll add my comments below and CC: to linux-kernel. Please reply to
me as well as to linux-kernel if you want me in on the thread.


> From: Andi Kleen
> Date: Sun Oct 27 2002 - 02:01:25 EST

> When an inode is flushed on an old fs with only second resolution the
> subsecond part is truncated. This has the drawback that an inode
> timestamp can jump backwards on reload as seen by user space.

I see this as a real flaw. Several programs (and not just GNU make)
rely on file timestamps not being altered. For example, GNU diff uses
file timestamps to decide whether two files might be the same file; if
their timestamps differ, then they must not be the same file. If a
file's timestamp might snap back to the previous second, a
nanosecond-aware GNU diff will go astray. Similarly, GNU tar uses
file timestamps to decide whether a file has changed and needs to be
dumped; if the file timestamp jumps back, a nanosecond-aware GNU tar
will erroneously decide that the file has changed and will dump it
unnecessarily.

> Another way would be to round on flush, but that also has some problems :-

Rounding is even worse. GNU Make assumes truncation, i.e. it assumes
that a timestamp is truncated (floored, actually) when it is stored on
a non-nanosecond-aware filesystem.

> In my current patchkit I just chose to truncate because that was the
> easiest and the other more complicated solutions didn't offer any
> compelling advantage.

Can't you truncate/floor to filesystem timestamp resolution
immediately, i.e., before the inode is flushed? That would address
the problems that I see.
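
As an illustration of the flooring rule discussed here, the following is a minimal sketch in C. The helper name ts_floor and its res_ns parameter are hypothetical and do not come from the kernel patch or from GNU make; the sketch assumes a normalized struct timespec and a resolution that divides one second evenly.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: floor a normalized timestamp to a multiple of
   res_ns nanoseconds.  res_ns must divide 10^9 evenly, for example
   1 (nanoseconds), 1000000 (milliseconds) or 1000000000 (seconds). */
struct timespec ts_floor(struct timespec ts, long res_ns)
{
    assert(res_ns > 0 && 1000000000L % res_ns == 0);
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);

    /* tv_nsec is never negative in a normalized timespec, so dropping
       the remainder always moves the timestamp toward the past. */
    ts.tv_nsec -= ts.tv_nsec % res_ns;
    return ts;
}
```

If a timestamp were floored like this at the moment it is set, a later flush and reload of the inode would return the same value that user space had already seen.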


> Date: Sun, 27 Oct 2002 10:20:38 -0500
> From: Andrew Pimlott

> Even the messages to the GNU make mailing list when Paul Eggert
> implemented nanosecond support didn't include a specific rationale.

Partly because it was there. But also because it avoids some bugs
when "make" assumes that a file is up-to-date because its timestamp
equals some other file's timestamp. E.g.:

    cp foo1.c foo.c
    make foo.o
    cp foo2.c foo.c
    make foo.o

This works only if timestamps have sufficient resolution.
Admittedly this small example is strained, but I hope you get the idea.
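
To make the failure concrete, here is a simplified sketch in C of a make-style up-to-date check. The function names ts_cmp and needs_rebuild are hypothetical and this is not GNU make's actual code; it only assumes the rule described above, that a target whose timestamp equals its prerequisite's is treated as up to date.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Three-way comparison of two normalized timestamps. */
int ts_cmp(struct timespec a, struct timespec b)
{
    assert(a.tv_nsec >= 0 && a.tv_nsec < 1000000000L);
    assert(b.tv_nsec >= 0 && b.tv_nsec < 1000000000L);

    if (a.tv_sec != b.tv_sec)
        return a.tv_sec < b.tv_sec ? -1 : 1;
    if (a.tv_nsec != b.tv_nsec)
        return a.tv_nsec < b.tv_nsec ? -1 : 1;
    return 0;
}

/* Simplified rule: rebuild only when the prerequisite is strictly
   newer than the target.  Equal timestamps count as up to date. */
bool needs_rebuild(struct timespec target, struct timespec prereq)
{
    return ts_cmp(prereq, target) > 0;
}
```

Suppose foo.o is built at 10.5 s and foo.c is replaced at 10.8 s. With nanosecond timestamps needs_rebuild returns true; with one-second timestamps both values are stored as 10 s and the second "make foo.o" does nothing.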


> I tend to prefer the proposal to set the nanosecond field to 10^9-1.

We used this trick in one place in GNU make (look for the comment in
remake.c that says "Avoid spurious rebuilds due to low resolution time
stamps") but it is an application-specific hack; I'm dubious that it
is a good idea in general-purpose filesystem code.

My personal experience is that it's hard to read and write code that
futzes with timestamps of various resolutions, and there is a real
advantage to sticking with a simple rule that always takes the floor
when going to a lower-precision timestamp, even if that rule has
suboptimal results in some cases.


2002-10-28 10:22:15

by Jamie Lokier

Subject: Re: nanosecond file timestamp resolution in filesystems, GNU make, etc.

Paul Eggert wrote:
> My personal experience is that it's hard to read and write code that
> futzes with timestamps of various resolutions, and there is a real
> advantage to sticking with a simple rule that always takes the floor
> when going to a lower-precision timestamp, even if that rule has
> suboptimal results in some cases.

I have to disagree. The whole point of accurate timestamps is that a
program can reliably ask "are my assumptions about the contents of
this file still valid?", "do I have to read the file to revalidate my
assumptions?"

When you don't have accurate timestamps, or resolutions are mixed,
then it's not possible to answer this question. The only correct
behaviour for a program, such as a cacheing dynamic web server, a
cacheing JIT compiler or something like Make, is to round the
timestamp _up_.

Which is fine so long as you can write this in the application code:

    if (ts_nanoseconds == 0)
        ts_nanoseconds = 1e9-1;

That's a rare enough occurrence when timestamps have nanosecond
accuracy that the glitch is not a problem.

Unfortunately that application code breaks when the filesystem may
have timestamps with resolution better than 1 second, but worse than 1
nanosecond. Then the application just can't do the right thing,
unless it knows what rounding was applied by the kernel/filesystem, so
it can change that rounding in a safe direction.
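
The breakage can be seen in a small sketch in C. The function name round_up_if_whole_second is hypothetical; it simply wraps the two-line test above so that it can be applied to a struct timespec.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical wrapper around the application-side hack: treat a
   zero nanosecond field as "truncated to a whole second" and move it
   to the last nanosecond of that second. */
struct timespec round_up_if_whole_second(struct timespec ts)
{
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);

    if (ts.tv_nsec == 0)
        ts.tv_nsec = 1000000000L - 1;
    return ts;
}
```

A timestamp truncated to whole seconds, such as 5.000000000, becomes 5.999999999. A timestamp truncated to milliseconds, such as 5.123000000, has a nonzero nanosecond field and passes through unchanged, even though the real modification time could be as late as 5.123999999.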

So for applications which actually _depend_ on accurate timestamps for
reliability, I see only two valid things for the kernel to do:

1. Round timestamps _up_.

2. Or, round timestamps down _and_ store the rounding resolution in
struct stat, in addition to the timestamp.

I'm in favour of 1.

(With Paul's suggestion of just rounding down without telling me when
that happened, AFAICT my _reliable_ cacheing applications must simply
ignore the nanoseconds field, which is a bit unfortunate isn't it?)

-- Jamie

2002-10-28 11:01:50

by Andi Kleen

Subject: Re: nanosecond file timestamp resolution in filesystems, GNU make, etc.

Jamie Lokier writes:

> Unfortunately that application code breaks when the filesystem may
> have timestamps with resolution better than 1 second, but worse than 1
> nanosecond.

The current resolution is jiffies, which tends to be 1ms.

> Then the application just can't do the right thing,
> unless it knows what rounding was applied by the kernel/filesystem, so
> it can change that rounding in a safe direction.

The rounding is always truncation. So the application can just assume
that.

-Andi

2002-10-28 10:59:31

by Andi Kleen

Subject: Re: nanosecond file timestamp resolution in filesystems, GNU make, etc.

Paul Eggert writes:
>
> > Another way would be to round on flush, but that also has some problems :-
>
> Rounding is even worse. GNU Make assumes truncation, i.e. it assumes
> that a timestamp is truncated (floored, actually) when it is stored on
> a non-nanosecond-aware filesystem.

That is what my patchkit does currently, so I guess it should work fine.

>
> > In my current patchkit I just chose to truncate because that was the
> > easiest and the other more complicated solutions didn't offer any
> > compelling advantage.
>
> Can't you truncate/floor to filesystem timestamp resolution
> immediately, i.e., before the inode is flushed? That would address
> the problems that I see.

That would complicate the timestamp management in the kernel considerably.
I'm not sure if I want to do that, probably not.

-Andi

2002-10-28 12:50:35

by Jamie Lokier

Subject: Re: nanosecond file timestamp resolution in filesystems, GNU make, etc.

Andi Kleen wrote:
> > Unfortunately that application code breaks when the filesystem may
> > have timestamps with resolution better than 1 second, but worse than 1
> > nanosecond.
>
> The current resolution is jiffies, which tends to be 1ms.
>
> > Then the application just can't do the right thing,
> > unless it knows what rounding was applied by the kernel/filesystem, so
> > it can change that rounding in a safe direction.
>
> The rounding is always truncation. So the application can just assume
> that.

This is fine when you are comparing two files with the same timestamp
resolution, but when the resolutions are different you need to know
what they are.

Come to think of it, rounding up is no better than rounding down when
comparing two files. The application needs to round one of them up
and one of them down, in order to make reliable tests of the form "is
this file definitely newer than this other file".
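
One way to read "round one of them up and one of them down" is sketched below in C. The names latest_possible and definitely_newer are hypothetical, and the sketch assumes the kernel always floors, so a stored timestamp is already a lower bound on the real modification time. The caller is assumed to know res_b_ns, the coarsest resolution to which b's timestamp may have been floored.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Latest instant consistent with a stored timestamp whose real value
   may have been floored to a multiple of res_ns (res_ns must divide
   10^9 evenly). */
struct timespec latest_possible(struct timespec ts, long res_ns)
{
    assert(res_ns > 0 && 1000000000L % res_ns == 0);
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);

    ts.tv_nsec -= ts.tv_nsec % res_ns;  /* start of the bucket */
    ts.tv_nsec += res_ns - 1;           /* last nanosecond of it */
    return ts;
}

/* True only if a must have been modified after b: a's stored value
   (its lower bound) exceeds b's upper bound. */
bool definitely_newer(struct timespec a, struct timespec b, long res_b_ns)
{
    struct timespec hi = latest_possible(b, res_b_ns);

    if (a.tv_sec != hi.tv_sec)
        return a.tv_sec > hi.tv_sec;
    return a.tv_nsec > hi.tv_nsec;
}
```

With b stored as 10.000000000 at one-second resolution, a at 10.500000000 is not definitely newer, because b could really be 10.9. If b's resolution is known to be one nanosecond, the same pair compares as definitely newer.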

For those kinds of tests, the application needs to know a lower bound
of the resolution. Note that a jiffy is not suitable as the lower
bound, because that part of the timestamp is dropped when the inode is
dropped from memory.

The other kind of test is a comparison of one file against its
own modification time when something derived from the file was last
cached. (This is appropriate for server requests and JIT compiler
launching, for example).

This time there is only one resolution. Nevertheless, to make a
reliable test of the form "have the contents of the file definitely
not been modified since mtime T", neither form of rounding on the
kernel side is sufficient: the application needs to know the
resolution.
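
A sketch in C of how an application might use a known resolution for this single-file test follows. It is one possible interpretation, not something proposed in the thread: the names bucket_start and cache_definitely_fresh are hypothetical, read_time is assumed to be the moment the application began reading the file, and the application clock is assumed to agree with the clock that stamps the file.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Floor ts to a multiple of res_ns (res_ns must divide 10^9 evenly). */
struct timespec bucket_start(struct timespec ts, long res_ns)
{
    assert(res_ns > 0 && 1000000000L % res_ns == 0);
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);

    ts.tv_nsec -= ts.tv_nsec % res_ns;
    return ts;
}

/* True only if the cached contents cannot be stale.  The mtime must
   still fall in the same resolution bucket, and the file must have
   been read after that bucket ended; otherwise a later write could
   have produced the same stored timestamp. */
bool cache_definitely_fresh(struct timespec cached_mtime,
                            struct timespec read_time,
                            struct timespec current_mtime,
                            long res_ns)
{
    struct timespec c = bucket_start(cached_mtime, res_ns);
    struct timespec m = bucket_start(current_mtime, res_ns);

    if (c.tv_sec != m.tv_sec || c.tv_nsec != m.tv_nsec)
        return false;                   /* mtime moved to another bucket */

    long bucket_end_nsec = c.tv_nsec + res_ns - 1;
    if (read_time.tv_sec != c.tv_sec)
        return read_time.tv_sec > c.tv_sec;
    return read_time.tv_nsec > bucket_end_nsec;
}
```

Comparing bucket starts rather than raw values also tolerates an in-memory timestamp that is later truncated when the inode is reloaded from disk.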

I think that in all cases, for the application to make useful
decisions it needs to know the resolution of the timestamps in any
particular struct stat. If those resolutions change when an inode is
flushed from memory, that should change the resolution reported in
struct stat.

So I propose: add a field to struct stat indicating the resolution of
the timestamps in it. It can go on the end.

-- Jamie

2002-10-28 14:09:17

by Andi Kleen

Subject: Re: nanosecond file timestamp resolution in filesystems, GNU make, etc.

> So I propose: add a field to struct stat indicating the resolution of
> the timestamps in it. It can go on the end.

It's impossible. There is no space left in struct stat64.
And adding a new syscall just for that would be severe overkill.

But what you could do if you really wanted that: implement kernel POSIX
pathconf()/fpathconf() and implement it as a parameter to that.

A kernel pathconf would be needed for some other reasons anyway, e.g. to return
proper max hard link counts (currently glibc hardcodes the parameters
for various fs in user space and it always breaks the LSB test suite
for new file systems). Ulrich Drepper could probably give you other
reasons on why it is needed if you ask him nicely.
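
For reference, this is roughly how an application queries the hard link limit through the existing POSIX interface. It is a minimal sketch in C: pathconf() and _PC_LINK_MAX are standard, while the wrapper name max_hard_links is hypothetical. No timestamp-resolution parameter is shown, since the thread only proposes one.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical wrapper: ask for the maximum hard link count of the
   filesystem containing path.  Returns -1 with errno set on error,
   or -1 with errno left at 0 if the limit is indeterminate. */
long max_hard_links(const char *path)
{
    assert(path != NULL);

    errno = 0;
    return pathconf(path, _PC_LINK_MAX);
}
```

A per-filesystem timestamp resolution could in principle be exposed through the same call with a new parameter name, which is the approach suggested above.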

I personally have no plans to implement it, however, because it looks like
kernel bloat to me :-)

-Andi

2002-10-28 14:22:57

by Maciej W. Rozycki

Subject: Re: nanosecond file timestamp resolution in filesystems, GNU make, etc.

On Mon, 28 Oct 2002, Andi Kleen wrote:

> > So I propose: add a field to struct stat indicating the resolution of
> > the timestamps in it. It can go on the end.
>
> It's impossible. There is no space left in struct stat64
> And adding a new syscall just for that would be severe overkill.

Well, possibly more stuff could benefit from new stat syscalls, like a
st_gen member for inode generations. And as someone suggested, a version
number or a length could be specified by the calls this time to permit
less disturbing expansion in the future.
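
The idea of passing a length can be sketched in C as follows. Everything here is hypothetical: struct stat_ext, its fields, and stat_ext_copy are invented for illustration and do not describe any existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical extensible record.  New members are only ever added
   at the end, so older callers see a valid prefix. */
struct stat_ext {
    long long mtime_sec;
    long      mtime_nsec;
    long      resolution_ns;   /* hypothetical timestamp resolution */
    unsigned  generation;      /* hypothetical st_gen */
};

/* Copy at most size bytes of the full record into the caller's
   buffer, so a caller built against a shorter layout is never
   overrun.  Returns the number of bytes written. */
size_t stat_ext_copy(void *buf, size_t size, const struct stat_ext *full)
{
    assert(buf != NULL && full != NULL);

    size_t n = size < sizeof *full ? size : sizeof *full;
    memcpy(buf, full, n);
    return n;
}
```

A caller that only knows about the first two members passes their combined size and never sees the later fields; a newer caller passes the full size and gets everything.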

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-10-28 15:11:31

by Jamie Lokier

Subject: Re: nanosecond file timestamp resolution in filesystems, GNU make, etc.

Andi Kleen wrote:
> > So I propose: add a field to struct stat indicating the resolution of
> > the timestamps in it. It can go on the end.
> It's impossible. There is no space left in struct stat64
> And adding a new syscall just for that would be severe overkill.

There's the 4 bytes following dev_t in Glibc's definition (called __pad1
in Glibc) which isn't used by Glibc or anything else. That's still
free even when the kernel changes to 64-bit dev_t.

And if you don't want to do that, i.e. you want to guarantee that
the unused word continues to be zero for older programs (though I
can't think of any reason why it would matter), that's what the
(currently unused) flags argument to stat64() is for.

> But what you could do if you really wanted that: implement kernel
> POSIX pathconf()/fpathconf() and implement it as a parameter to
> that.

Ugh.

> I personally have no plans to implement it [pathconf], however,
> because it looks like kernel bloat to me :-)

Given the choice of using fpathconf, I'd rather accept that the
nanoseconds field is not reliable, hence I must ignore it. It does
seem a waste though - sometimes you are returning good information I
can use, other times you're not, and I can't tell the difference :-(

I'd much rather do this right. What do you think of storing the
resolution in the unused word called __pad1 in Glibc?

(Btw, it wouldn't bloat the in-core kernel inode, because only a
couple of flag bits are needed there to distinguish known resolution
values).

-- Jamie

2002-10-28 15:08:03

by Jamie Lokier

Subject: Re: nanosecond file timestamp resolution in filesystems, GNU make, etc.

Maciej W. Rozycki wrote:
> > It's impossible. There is no space left in struct stat64
> > And adding a new syscall just for that would be severe overkill.
>
> Well, possibly more stuff could benefit from new stat syscalls, like a
> st_gen member for inode generations. And as someone suggested, a version
> number or a length could be specified by the calls this time to permit
> less disturbing expansion in the future.

It's already there. The kernel stat64() syscall has a flags argument,
which is unused at the moment. I presume it's for this purpose.

Glibc already uses a version number for its stat() calls, to permit
binary compatible extensions on the user side.

So all the mechanism is there AFAIK.

-- Jamie

2002-10-28 15:33:54

by Maciej W. Rozycki

Subject: Re: nanosecond file timestamp resolution in filesystems, GNU make, etc.

On Mon, 28 Oct 2002, Jamie Lokier wrote:

> > Well, possibly more stuff could benefit from new stat syscalls, like a
> > st_gen member for inode generations. And as someone suggested, a version
> > number or a length could be specified by the calls this time to permit
> > less disturbing expansion in the future.
>
> It's already there. The kernel stat64() syscall has a flags argument,
> which is unused at the moment. I presume it's for this purpose.

Hmm, I haven't thought of this argument to be used this way. Actually it
isn't currently initialized by glibc in any way, which makes its utility
questionable.

> Glibc already uses a version number for its stat() calls, to permit
> binary compatible extensions on the user side.

Well, it used to use xstat() functions that provided versioning since the
old days and now ELF symbol versioning is used, too, so the userland is
long prepared.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-10-28 15:53:07

by Jamie Lokier

Subject: Re: nanosecond file timestamp resolution in filesystems, GNU make, etc.

Maciej W. Rozycki wrote:
> > It's already there. The kernel stat64() syscall has a flags argument,
> > which is unused at the moment. I presume it's for this purpose.
>
> Hmm, I haven't thought of this argument to be used this way. Actually it
> isn't currently initialized by glibc in any way, which makes its utility
> questionable.

You are right. I just checked dietlibc and uclibc - neither of them
initialises the flags argument. It should be deleted from the kernel,
because nobody can use it.

On the bright side (for my specific request of st_resolution), it
seems every architecture has a different size reserved for st_dev and
st_rdev in struct stat64:

- i386: 12 bytes (8 bytes used by Glibc)
- SPARC: 8 bytes (all needed for 8 byte dev_t, but space elsewhere)
- ARM: 12 bytes
- MIPS: 16 bytes (!)
- S390: 12 bytes
- Alpha: not obvious what's used - is int 64 bits wide on Alpha?
  If it's not, other changes are needed for 64-bit dev_t anyway.

All the architectures I've looked at have two words available in
struct stat64, if they have struct stat64, but the available space is
in different places for each architecture.

-- Jamie