2001-10-03 07:53:16

by Andi Kleen

Subject: Finegrained a/c/mtime was Re: Directory notification problem

Alex Larsson <[email protected]> writes:

> I discovered a problem with the dnotify API while fixing a FAM bug today.
>
> The problem occurs when you want to watch a file in a directory, and that
> file is changed several times in the same second. When I get the directory
> notify signal on the directory I need to stat the file to see if the
> change was actually in the file. If the file already changed in the
> current second the stat() result will be identical to the previous stat()
> call, since the resolution of mtime and ctime is one second.
>
> This leads to missed notifications, leaving clients (such as Nautilus or
> Konqueror) displaying a state that does not represent the current state.
>
> The only userspace solution I see is to delay all change notifications to
> the end of the second, so that clients always read the correct state. This
> is somewhat contrary to the idea of FAM though, as it does not give
> instant feedback.
>
> Is there any possibility of extending struct stat with a generation
> counter? Or is there another solution to this problem?

make has similar problems with parallel builds on bigger multiprocessor
machines. Solaris 7 fixed this problem by adding new fields to stat
that contain the ms for mtime/atime/ctime. There are even filesystems
on Linux that already support fine-grained timestamps, e.g. XFS has it
as ns on disk. The problem is that the VFS doesn't support it currently,
so it always sets the ns parts to zero. Fixing it for m/c/atime requires
new system calls for utime and stat64.
For stat it also requires a changed glibc ABI -- the glibc/2.4 stat64
structure reserved an additional 4 bytes for every timestamp, but these
either need to be used to give more seconds for the year 2038 problem
or be used for the ms fractions. y2038 is somewhat important too.

[In theory the existing additional bytes could be used for both on a
big endian host if you manage to define a numeric 48-bit type in gcc
and are satisfied with 16-bit ms resolution, but such a hack would
probably cause problems, e.g. with other compilers. It would be
possible on little endian too, but only for mtime and ctime, as there
is no unused field in front of st_atime. Overall I think a new stat
call is better. The ugly thing is just that the glibc ABI needs
updating too.]

Solving it properly is a 2.5 thing.

-Andi



2001-10-03 08:07:18

by Ulrich Drepper

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

Andi Kleen <[email protected]> writes:

> For stat is also requires a changed glibc ABI -- the glibc/2.4 stat64

Not only stat64, also plain stat.

> structure reserved an additional 4 bytes for every timestamp, but these
> either need to be used to give more seconds for the year 2038 problem
> or be used for the ms fractions. y2038 is somewhat important too.

The fields are meant for nanoseconds. The y2038 problem will definitely
be solved by time-shifting or making time_t unsigned. In any case it is
nothing of importance here and now, especially since not many of the
systems running today with a 32-bit time_t will still be in use then.
For the rest, I'm sure that in 37 years there will be one ABI change or
another.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

2001-10-03 13:45:48

by Eric W. Biederman

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

Ulrich Drepper <[email protected]> writes:

> Andi Kleen <[email protected]> writes:
>
> > For stat is also requires a changed glibc ABI -- the glibc/2.4 stat64
>
> Not only stat64, also plain stat.
>
> > structure reserved an additional 4 bytes for every timestamp, but these
> > either need to be used to give more seconds for the year 2038 problem
> > or be used for the ms fractions. y2038 is somewhat important too.
>
> The fields are meant for nanoseconds. The y2038 will definitely be
> solved by time-shifting or making time_t unsigned. In any way nothing
> of importance here and now. Especially since there won't be many
> systems which are running today and which have a 32-bit time_t be used
> then. For the rest I'm sure that in 37 years there will be the one or
> the other ABI change.

Right. Given current uptimes and being optimistic the fix for y2038
is probably needed by 2030 or just a little later. But in any case
64 bit systems should be maxing out by then, and the conversion to 128
bit systems should have already happened on the server side. 32 bit
systems will likely be limited to embedded and legacy systems by then.

Eric

2001-10-03 14:11:43

by Kirill Ratkin

Subject: Netfilter problem

Hi.

I get a strange error when I try to check the protocol type
in a netfilter hook function.

I see this message:
kping.c: In function `knet_hook':
kping.c:116: dereferencing pointer to incomplete type
make: *** [kping.o] Error 1

This is part of my code:
static
unsigned int knet_hook(unsigned int hooknum,
                       struct sk_buff** p_skb,
                       const struct net_device* p_in,
                       const struct net_device* p_out,
                       int (*okfn)(struct sk_buff* ))
{
    ...
    if((*p_skb)->nh.iph->protocol==
       (unsigned char)IPPROTO_ICMP)
    {
        printk("<1>ICMP Packet killed\n");
        return NF_DROP;
    }
    ...
}

It compiled on version 2.4.1.

I don't understand why ... .



2001-10-03 15:15:03

by Alexander Larsson

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

On 3 Oct 2001, Ulrich Drepper wrote:

> Andi Kleen <[email protected]> writes:
>
> > structure reserved an additional 4 bytes for every timestamp, but these
> > either need to be used to give more seconds for the year 2038 problem
> > or be used for the ms fractions. y2038 is somewhat important too.
>
> The fields are meant for nanoseconds. The y2038 will definitely be
> solved by time-shifting or making time_t unsigned. In any way nothing
> of importance here and now. Especially since there won't be many
> systems which are running today and which have a 32-bit time_t be used
> then. For the rest I'm sure that in 37 years there will be the one or
> the other ABI change.

Is a nanoseconds field the right choice though? In reality you might not
have a nanosecond resolution timer, so you would miss changes that appear
on shorter timescale than the timer resolution. Wouldn't a generation
counter, increased when ctime was updated, be a better solution?

/ Alex


2001-10-03 15:22:43

by Gerhard Mack

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

On 3 Oct 2001, Eric W. Biederman wrote:

> Ulrich Drepper <[email protected]> writes:
>
> > Andi Kleen <[email protected]> writes:
> >
> > > For stat is also requires a changed glibc ABI -- the glibc/2.4 stat64
> >
> > Not only stat64, also plain stat.
> >
> > > structure reserved an additional 4 bytes for every timestamp, but these
> > > either need to be used to give more seconds for the year 2038 problem
> > > or be used for the ms fractions. y2038 is somewhat important too.
> >
> > The fields are meant for nanoseconds. The y2038 will definitely be
> > solved by time-shifting or making time_t unsigned. In any way nothing
> > of importance here and now. Especially since there won't be many
> > systems which are running today and which have a 32-bit time_t be used
> > then. For the rest I'm sure that in 37 years there will be the one or
> > the other ABI change.
>
> Right. Given current uptimes and being optimistic the fix for y2038
> is probably needed by 2030 or just a little later. But in any case
> 64 bit systems should be maxing out by then, and the conversion to 128
> bit systems should have already happened on the server side. 32 bit
> systems will likely be limited to embedded and legacy systems by then.
>
> Eric

Why do I get the feeling no one has learned from the problems the computer
industry had with 2 digit date fields?

Odds are legacy systems will be running something people for whatever
reason couldn't replace.


Gerhard

--
Gerhard Mack

[email protected]

<>< As a computer I find your faith in technology amusing.

2001-10-03 17:45:54

by Bernd Eckenfels

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

> Alex Larsson <[email protected]> writes:
>> I discovered a problem with the dnotify API while fixing a FAM bug today.
>>
>> The problem occurs when you want to watch a file in a directory, and that
>> file is changed several times in the same second. When I get the directory
>> notify signal on the directory I need to stat the file to see if the
>> change was actually in the file. If the file already changed in the
>> current second the stat() result will be identical to the previous stat()
>> call, since the resolution of mtime and ctime is one second.

If you simply check the mtime and the file size you have the two most
relevant parts. If neither of those changes, this means that programs using
the dnotify API most likely do not need to act. After all, it is not an
auditing facility but a notifier for things like reloading directory
listings. The only thing I can imagine causing problems is a self-reloading
config file. But in that case dnotify is overkill anyway and a 1 sec delay
can be assumed to be reasonable.

Greetings
Bernd

2001-10-03 21:26:05

by Andi Kleen

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

On Wed, Oct 03, 2001 at 11:15:04AM -0400, Alex Larsson wrote:
> Is a nanoseconds field the right choice though? In reality you might not
> have a nanosecond resolution timer, so you would miss changes that appear
> on shorter timescale than the timer resolution. Wouldn't a generation
> counter, increased when ctime was updated, be a better solution?

Nearly any CPU has a cycle counter built in now, which gives you ns-like
resolution. In theory you could still get collisions on MP systems,
but the window is small enough that it can be ignored in practice.

-Andi

2001-10-03 21:42:38

by Luigi Genoni

Subject: Re: Netfilter problem

Strange!
It compiled correctly for me with all 2.4 kernels, with gcc 2.95.3 and gcc
3.0.0/1.

Luigi


On Wed, 3 Oct 2001, Kirill Ratkin wrote:

> Hi.
>
> I've a strange error when I try to check protocol type
> in netfilter hook function.
>
> I see this message:
> kping.c: In function `knet_hook':
> kping.c:116: dereferencing pointer to incomplete type
> make: *** [kping.o] Error 1
>
> This is part of my code:
> static
> unsigned int knet_hook(unsigned int hooknum,
> struct sk_buff** p_skb,
> const struct net_device* p_in,
> const struct net_device* p_out,
> int (*okfn)(struct sk_buff* ))
> {
> ...
> if((*p_skb)->nh.iph->protocol==
> (unsigned char)IPPROTO_ICMP)
> {
> printk("<1>ICMP Packet killed\n");
> return NF_DROP;
> }
> ...
> }
>
> It had compiled on 2.4.1 version.
>
> I don't understand why ... .

2001-10-05 12:50:09

by Padraig Brady

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

Andi Kleen wrote:

>On Wed, Oct 03, 2001 at 11:15:04AM -0400, Alex Larsson wrote:
>
>>Is a nanoseconds field the right choice though? In reality you might not
>>have a nanosecond resolution timer, so you would miss changes that appear
>>on shorter timescale than the timer resolution. Wouldn't a generation
>>counter, increased when ctime was updated, be a better solution?
>>
>
>Near any CPU has a cycle counter builtin now, which gives you ns like
>resolution. In theory you could still get collisions on MP systems,
>but window is small enough that it can be ignored in practice.
>
>-Andi
>
But the point is, you would only ever want nanosecond resolution to make
sure you notice all changes to a file. A more general (and much simpler)
solution would be to gen_count++ every time a file is modified. What other
applications would require better than second resolution on files?

Padraig.

2001-10-05 13:01:39

by Andi Kleen

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

On Fri, Oct 05, 2001 at 01:44:20PM +0100, Padraig Brady wrote:
> Andi Kleen wrote:
>
> >On Wed, Oct 03, 2001 at 11:15:04AM -0400, Alex Larsson wrote:
> >
> >>Is a nanoseconds field the right choice though? In reality you might not
> >>have a nanosecond resolution timer, so you would miss changes that appear
> >>on shorter timescale than the timer resolution. Wouldn't a generation
> >>counter, increased when ctime was updated, be a better solution?
> >>
> >
> >Near any CPU has a cycle counter builtin now, which gives you ns like
> >resolution. In theory you could still get collisions on MP systems,
> >but window is small enough that it can be ignored in practice.
> >
> >-Andi
> >
> But the point is you, only ever would want nano second resolution to make
> sure you notice all changes to a file. A more general (and much simpler)
> solution would be to gen_count++ every time a file's modified. What other
> applications would require better than second resolution on files?

The main advantage of using a real timestamp instead of a generation
counter is that we would be compatible with Unixware/Solaris/... Their
API is fine, so I see no advantage in inventing a new incompatible one.

Another advantage of using the real time instead of a counter is that
you can easily merge both values into a single 64-bit value and do
arithmetic on it in user space. With a generation counter you would need
to work with number pairs, which is much more complex
[or alternatively reset the generation counter every second in the kernel
to get a flat time range again, which would be racy, ugly and complicated
in the kernel because it would need additional timestamps].

Also, a rdtsc/get_timestamp or in the worst case a jiffies read is really
not complex to code in the kernel; what makes you think it is?

-Andi
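
A small sketch of the "single 64-bit value" argument, assuming a 32-bit seconds field and a nanoseconds field as discussed in this thread. The names `stamp_to_ns` and `stamp_diff_ns` are made up for the example and are not an existing API.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Merge a (seconds, nanoseconds) pair into one flat 64-bit count of
 * nanoseconds. With 32-bit seconds the result is below 2^63, so it can
 * also be handled as a signed value. */
uint64_t stamp_to_ns(uint32_t sec, uint32_t nsec)
{
    assert(nsec < NSEC_PER_SEC);
    return (uint64_t)sec * NSEC_PER_SEC + nsec;
}

/* Signed distance a - b in nanoseconds: positive if a is newer. */
int64_t stamp_diff_ns(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b);
}
```

Once merged, "newer than", "older than" and "how much newer" are ordinary integer comparisons and subtractions, which is the convenience a (seconds, counter) pair does not offer.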

2001-10-05 13:11:30

by Andrew Pimlott

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

On Fri, Oct 05, 2001 at 01:44:20PM +0100, Padraig Brady wrote:
> But the point is you, only ever would want nano second resolution to make
> sure you notice all changes to a file. A more general (and much simpler)
> solution would be to gen_count++ every time a file's modified. What other
> applications would require better than second resolution on files?

Correlating file timestamps with an event log. Comparing timestamps
on different files (make). Real time is _much_ more useful (not to
mention convenient) than a generation count; and given that we've
survived with second resolution so far, I think the hypothetical
collisions on a nanosecond scale are ignorable.

Andrew

2001-10-05 13:20:51

by Padraig Brady

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

Andi Kleen wrote:

>On Fri, Oct 05, 2001 at 01:44:20PM +0100, Padraig Brady wrote:
>
>>Andi Kleen wrote:
>>
>>>On Wed, Oct 03, 2001 at 11:15:04AM -0400, Alex Larsson wrote:
>>>
>>>>Is a nanoseconds field the right choice though? In reality you might not
>>>>have a nanosecond resolution timer, so you would miss changes that appear
>>>>on shorter timescale than the timer resolution. Wouldn't a generation
>>>>counter, increased when ctime was updated, be a better solution?
>>>>
>>>Near any CPU has a cycle counter builtin now, which gives you ns like
>>>resolution. In theory you could still get collisions on MP systems,
>>>but window is small enough that it can be ignored in practice.
>>>
>>>-Andi
>>>
>>But the point is you, only ever would want nano second resolution to make
>>sure you notice all changes to a file. A more general (and much simpler)
>>solution would be to gen_count++ every time a file's modified. What other
>>applications would require better than second resolution on files?
>>
>
>The main advantage of using a real timestamp instead of a generation
>counter is that we would be compatible to Unixware/Solaris/... Their
>API is fine, so I see no advantage in inventing a new incompatible one.
>
Even so I can't see a need to have this resolution for mtime, and as you
pointed out there can still be races on SMP systems and timing resolutions
are system dependent anyway.

>
>Another advantage of using the real time instead of a counter is that
>you can easily merge the both values into a single 64bit value and do
>arithmetic on it in user space. With a generation counter you would need
>to work with number pairs, which is much more complex.
>
??
if (file->mtime != mtime || file->gen_count != gen_count)
    file_changed=1;

>
>[or alternatively reset the generation counter every second in the kernel
>to get a flat time range again,
>which would be racy and ugly and complicated in the kernel because it
>would need additional timestamps]
>
No need as long as it doesn't wrap within the mtime resolution (1 second).

>
>Also a rdtsc/get_timestamp or in the worst case a jiffie read is really
>not complex to code in kernel, what makes you think it is?
>
Sorry, by more complex I meant more instructions/CPU expensive.

>
>
>-Andi
>
Padraig.

2001-10-05 14:38:53

by Andi Kleen

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

> >Another advantage of using the real time instead of a counter is that
> >you can easily merge the both values into a single 64bit value and do
> >arithmetic on it in user space. With a generation counter you would need
> >to work with number pairs, which is much more complex.
> >
> ??
> if (file->mtime != mtime || file->gen_count != gen_count)
> file_changed=1;

And how would you implement "newer than" and "older than" with a generation
count that doesn't reset at an always-fixed time interval (= requiring
additional timestamps in the kernel)?

-Andi

2001-10-05 15:05:45

by Padraig Brady

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

Andi Kleen wrote:

>>>Another advantage of using the real time instead of a counter is that
>>>you can easily merge the both values into a single 64bit value and do
>>>arithmetic on it in user space. With a generation counter you would need
>>>to work with number pairs, which is much more complex.
>>>
>>??
>>if (file->mtime != mtime || file->gen_count != gen_count)
>> file_changed=1;
>>
>
>And how would you implement "newer than" and "older than" with a generation
>count that doesn't reset in a always fixed time interval (=requiring
>additional timestamps in kernel)?
>
>-Andi
>
Well IMHO "newer than", "older than" applications have until now
made do with second resolution, and that's all that's required?

Padraig.

2001-10-05 19:12:20

by Andi Kleen

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

On Fri, Oct 05, 2001 at 04:00:08PM +0100, Padraig Brady wrote:
> Andi Kleen wrote:
>
> >>>Another advantage of using the real time instead of a counter is that
> >>>you can easily merge the both values into a single 64bit value and do
> >>>arithmetic on it in user space. With a generation counter you would need
> >>>to work with number pairs, which is much more complex.
> >>>
> >>??
> >>if (file->mtime != mtime || file->gen_count != gen_count)
> >> file_changed=1;
> >>
> >
> >And how would you implement "newer than" and "older than" with a generation
> >count that doesn't reset in a always fixed time interval (=requiring
> >additional timestamps in kernel)?
> >
> >-Andi
> >
> Well IMHO "newer than", "older than" applications have until now
> done with second resolution, and that's all that's required?

No they haven't. GNU make supports nsec mtime on Solaris and apparently
some other OSes too, because second-granularity mtime can be a big
problem with make -j<bignumber> on a big SMP box. make has to distinguish
"is older" from "is newer"; "not equal" alone doesn't cut it.

[If you think it does, modify your make to replace the "is older" check
for dependencies with "is not equal" and see what happens.]



-Andi

2001-10-05 20:22:18

by Bernd Eckenfels

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

In article <[email protected]> you wrote:
>> if (file->mtime != mtime || file->gen_count != gen_count)
>> file_changed=1;

> And how would you implement "newer than" and "older than" with a generation
> count that doesn't reset in a always fixed time interval (=requiring
> additional timestamps in kernel)?

newer:

if ((file->mtime < mtime) || ((file->mtime == mtime) && (file->gen_count < gen_count)))

The advantage here is that it can even contain some useful info like "x
modifications".

Greetings
Bernd
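
The comparison above can be written out as a self-contained helper. This is only a sketch: `struct change_stamp` and `stamp_newer` are hypothetical names, and it assumes the counter does not wrap within one second. It also only orders two stamps that share the same counter, e.g. two snapshots of one file; with a per-inode counter, two different files modified in the same second are not ordered by it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical pair of second-resolution mtime plus generation count. */
struct change_stamp {
    time_t   mtime;
    uint32_t gen_count;
};

/* Lexicographic comparison: true if a is strictly newer than b.
 * The seconds decide first; the generation count only breaks ties
 * between stamps taken within the same second. */
bool stamp_newer(const struct change_stamp *a, const struct change_stamp *b)
{
    assert(a != NULL && b != NULL);
    if (a->mtime != b->mtime)
        return a->mtime > b->mtime;
    return a->gen_count > b->gen_count;
}
```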

2001-10-08 08:44:59

by Padraig Brady

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

Andi Kleen wrote:

>On Fri, Oct 05, 2001 at 04:00:08PM +0100, Padraig Brady wrote:
>
>>Andi Kleen wrote:
>>
>>>>>Another advantage of using the real time instead of a counter is that
>>>>>you can easily merge the both values into a single 64bit value and do
>>>>>arithmetic on it in user space. With a generation counter you would need
>>>>>to work with number pairs, which is much more complex.
>>>>>
>>>>??
>>>>if (file->mtime != mtime || file->gen_count != gen_count)
>>>> file_changed=1;
>>>>
>>>And how would you implement "newer than" and "older than" with a generation
>>>count that doesn't reset in a always fixed time interval (=requiring
>>>additional timestamps in kernel)?
>>>
>>>-Andi
>>>
>>Well IMHO "newer than", "older than" applications have until now
>>done with second resolution, and that's all that's required?
>>
>
>No they haven't. GNU make supports nsec mtime on Solaris and apparently
>some other OS too, because the second granuality mtime can be a big
>problem with make -j<bignumber> on a big SMP box. make has to distingush
>"is older" from "is newer"; "not equal" alone doesn't cut it.
>
>[If you think it is modify your make to replace the "is older" check
>for dependencies with "is not equal" and see what happens]
>

OK, agreed. In this case the complete state/relationship between files must
be maintained independently of the userspace app, i.e. in the filesystem.
But won't you then have the same problem with synchronising nanosecond times
between the various processors (which could be on the other side of a network
cable in some configurations)? So perhaps the best solution is to maintain
both: a generation count, which would do for the many apps that just care
whether the file has changed relative to some moment in time and not relative
to other files on the filesystem, and, for make-type applications, the
full-resolution timestamp. However, the latter will still have the
synchronisation/portability/CPU expense issues discussed previously.

Padraig.

2001-10-08 09:04:18

by Padraig Brady

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

Padraig Brady wrote:

> Andi Kleen wrote:
>
>> On Fri, Oct 05, 2001 at 04:00:08PM +0100, Padraig Brady wrote:
>>
>>> Andi Kleen wrote:
>>>
>>>>>> Another advantage of using the real time instead of a counter is
>>>>>> that you can easily merge the both values into a single 64bit
>>>>>> value and do
>>>>>> arithmetic on it in user space. With a generation counter you
>>>>>> would need to work with number pairs, which is much more complex.
>>>>>
>>>>> ??
>>>>> if (file->mtime != mtime || file->gen_count != gen_count)
>>>>> file_changed=1;
>>>>>
>>>> And how would you implement "newer than" and "older than" with a
>>>> generation
>>>> count that doesn't reset in a always fixed time interval (=requiring
>>>> additional timestamps in kernel)?
>>>> -Andi
>>>>
>>> Well IMHO "newer than", "older than" applications have until now
>>> done with second resolution, and that's all that's required?
>>>
>>
>> No they haven't. GNU make supports nsec mtime on Solaris and apparently
>> some other OS too, because the second granuality mtime can be a big
>> problem with make -j<bignumber> on a big SMP box. make has to distingush
>> "is older" from "is newer"; "not equal" alone doesn't cut it.
>>
>> [If you think it is modify your make to replace the "is older" check
>> for dependencies with "is not equal" and see what happens]
>>
>
> OK agreed, in this case the, complete state/relationship between
> files, must be
> maintained independently of the userspace app, i.e. in the filesystem.
> But wont
> you then have the same problem with synchronising nanosecond times
> between
> the various processors (which could be the other side of a network
> cable in some
> configurations)? So perhaps the best solution is to maintain both a
> generation
> count which would do for many apps who just care if the file has
> changed relative
> to some moment it time and not relative to another file(s) on the
> filesystem .
> Then for make type applications you could maintain the full resolution
> timestamp,
> however this will still have the synchronisation/portability/CPU
> expense issues
> discussed previously.


Given that it's VERY hard to synchronise timings to nanosecond or even
millisecond resolution over distributed systems, or even within the same
filesystem, how about synchronising the timestamps to the particular
filesystem and not the universe? I.e. instead of incrementing a "generation
count" in each inode, you could increment a global filesystem count every
time a file is modified in the filesystem, and then store this count in the
particular inode being modified. This would give you exact ordering
relationships between files in the same filesystem, and would work perfectly
every time for both "types" of apps mentioned above. Outside the filesystem
you can then resort to just the (second resolution) timestamp.

Padraig.

2001-10-08 10:04:07

by Trond Myklebust

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

>>>>> " " == Padraig Brady <[email protected]> writes:

> you then have the same problem with synchronising nanosecond
> times between the various processors (which could be the other
> side of a network cable in some configurations)? So perhaps the
> best solution is to maintain both a generation count which would
> do for many apps who just care if the file has changed relative
> to some moment it time and not relative to another file(s) on
> the filesystem. Then for make type applications you could
> maintain the full resolution timestamp, however this will still
> have the synchronisation/portability/CPU expense issues
> discussed previously.

This `generation count' idea for file change stamping will eventually
have to go into the kernel if only because things like NFSv4 will
require it.

Meanwhile though, you're going to have to look elsewhere than ordinary
NFS to be able to share the generation information over your
network. The current protocols support microsecond (v2) / nanosecond (v3)
timestamps only.

Cheers,
Trond

2001-10-13 15:25:04

by Jamie Lokier

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

This note explains how to implement file timestamps in such a way that
modifications to a file can always be detected reliably. Currently,
programs such as `make' and other interesting applications cannot give
absolute guarantees of detecting changed files.

Andi Kleen says we can ignore the risk; I disagree, as there are some
applications that cannot be trusted if the risk is plausible, and it can
be fixed easily.

Bernd Eckenfels wrote:
> If you simply check the mtime and the file size you have the two most
> relevant parts. If neighter of those changes this means that programs using
> the dnotify api most likely do not need to act.
                  ^^^^^^^^^^^

In other words, the API is broken for certain applications.

One that springs to mind is transparent caching of JIT-compiled code
between interpreter invocations. If dnotify misses the notification
sometimes, the caching ceases to be transparent, and you have to switch
it off for reliable behaviour, a major efficiency loss.

Microsecond resolution, of course, does not fix this problem.

Alex Larsson wrote:
> Is a nanoseconds field the right choice though? In reality you might not
> have a nanosecond resolution timer, so you would miss changes that appear
> on shorter timescale than the timer resolution. Wouldn't a generation
> counter, increased when ctime was updated, be a better solution?

As has been pointed out, it would not be compatible with other unix
systems and existing software, and timestamps have nice audit trail
possibilities.

I didn't realise there was enough precision left in ext2 inodes for
nanosecond timestamps.

Timestamps have _many_ problems: the main problem is that you can't
guarantee to reliably detect a changed file. For some interesting
applications this is fatal.

However, you can fix timestamps and keep the best benefits of timestamps
and counters:

- high resolution timestamps.

- whenever there is a change event, check whether the timestamp
would be advanced. If not, delay the change (i.e. inside the
write() call) until the clock time has advanced to the next
high-resolution unit.

- if you use nanoseconds, this will never occur on current machines
and only rarely on faster machines.

- spinning is an acceptable way to delay for such a short time.

- it's not necessary to delay if nobody read the mtime since the last
timestamp update, which will nearly always be the case. So even on
extremely fast future machines, you would hardly ever pause.

cheers,
-- Jamie

2001-10-13 16:12:47

by Andi Kleen

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

In article <[email protected]>,
Jamie Lokier <[email protected]> writes:
> Andi Kleen says we can ignore the risk; I disagree, as there are some
> applications that cannot be trusted if the risk is plausible, and it can
> be fixed easily.

You're misquoting me badly. I said we can ignore the risk of two
nanosecond resolution timestamps being changed by two different CPUs
with out-of-sync cycle counters on an SMP system, where the CPUs are fast
enough to free/acquire the inode lock in a smaller time than they're out
of sync (= giving two file changes with the same ns timestamp).
I implied that on systems that don't have a cycle counter and use
jiffy-resolution gettimeofday it can also be ignored, because they're
unlikely to be SMP and are dying out anyway.

-Andi


2001-10-13 19:40:51

by Jamie Lokier

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

Andi Kleen wrote:
> > Andi Kleen says we can ignore the risk; I disagree, as there are some
> > applications that cannot be trusted if the risk is plausible, and it can
> > be fixed easily.
>
> You're misquoting me badly. I said we can ignore the risk of two
> nanosecond-resolution timestamps being changed by two different CPUs
> with out-of-sync cycle counters on an SMP system, where the CPUs are fast
> enough to free/acquire the inode lock in less time than they're out of
> sync (= giving two file changes with the same ns timestamp).
> I implied that on systems that don't have a cycle counter and which use
> jiffy-resolution gettimeofday it can also be ignored, because they're
> unlikely to be SMP and are dying out anyway.

Andi, sorry I misrepresented your statement.

I misread your original as saying that the risks due to SMP nanosecond
scale synchronisation problems can be ignored, and inferred from that
that the small risk of one SMP process modifying a file while another
checks the timestamp can be ignored too. I misread it this way because
others have suggested higher resolution solves the problem, and I
believe it does not.

As you say above, multiple modifications within a single tick are not a
problem, do not have to be tracked, and therefore do not require SMP
synchronisation.

The SMP risk of missing a change after checking the timestamp is among
the risks I consider critical for an application which must not miss the
fact that a file has changed. I do not want us to repeat the mistake of
1 second at a smaller timescale.

cheers,
-- Jamie

2001-10-16 20:03:01

by Riley Williams

Subject: Re: Finegrained a/c/mtime was Re: Directory notification problem

Hi Gerhard.

>>>> For stat it also requires a changed glibc ABI -- the glibc/2.4
>>>> stat64 structure reserved an additional 4 bytes for every
>>>> timestamp, but these either need to be used to give more seconds
>>>> for the year 2038 problem or be used for the ms fractions. y2038
>>>> is somewhat important too.

>>> The fields are meant for nanoseconds. The y2038 will definitely be
>>> solved by time-shifting or making time_t unsigned. In any way
>>> nothing of importance here and now. Especially since there won't be
>>> many systems which are running today and which have a 32-bit time_t
>>> be used then. For the rest I'm sure that in 37 years there will be
>>> the one or the other ABI change.

>> Right. Given current uptimes and being optimistic the fix for y2038
>> is probably needed by 2030 or just a little later. But in any case
>> 64 bit systems should be maxing out by then, and the conversion to
>> 128 bit systems should have already happened on the server side.
>> 32 bit systems will likely be limited to embedded and legacy systems
>> by then.

> Why do I get the feeling no one has learned from the problems the
> computer industry had with 2 digit date fields?

Precisely my feeling. Let's see what the various field widths do for the
y2038 problem, assuming a signed field and that we retain the current
date origin of Jan 1 00:00:00 UTC 1970 for the new routines:

Field Width   Rollover Date       Time
~~~~~~~~~~~   ~~~~~~~~~~~~~   ~~~~~~~~
    32          19 Jan 2038    3:14:08
    33           7 Feb 2106    6:28:16
    34          16 Mar 2242   12:56:32
    35          30 May 2514    1:53:04
    36          26 Oct 3058    3:46:08
    37          20 Aug 4147    7:32:16
    38           8 Apr 6325   15:04:32
    39         14 Jul 10680    6:09:04
    40         25 Jan 19391   12:18:08
    41         20 Feb 36812    0:36:16
    42         10 Apr 71654    1:12:32
    43        19 Jul 141338    2:25:04
    44          4 Feb 280707    4:50:08
    45          8 Mar 559444    9:40:16

I somehow don't see the need to go any further with this table...
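
The rollover dates above can be recomputed with a few lines of Python. This is a sketch under stated assumptions: a signed field counting whole seconds from 1970-01-01 00:00:00 UTC, the proleptic Gregorian calendar, and no leap seconds. The function name `rollover` is made up for illustration. Because Python's `datetime` stops at year 9999, the sketch leans on the fact that the Gregorian calendar repeats exactly every 146097 days (400 years).

```python
from datetime import date, timedelta

DAYS_PER_400_YEARS = 146097  # length of one full Gregorian cycle


def rollover(bits):
    """Return (year, month, day, hour, minute, second) at which a signed
    `bits`-wide count of seconds since the 1970 epoch first overflows."""
    days, secs = divmod(2 ** (bits - 1), 86400)
    # Strip whole 400-year cycles so the remainder fits into datetime.date.
    cycles, rest = divmod(days, DAYS_PER_400_YEARS)
    d = date(1970, 1, 1) + timedelta(days=rest)
    hh, rem = divmod(secs, 3600)
    mm, ss = divmod(rem, 60)
    return (d.year + 400 * cycles, d.month, d.day, hh, mm, ss)


if __name__ == "__main__":
    for bits in range(32, 46):
        print(bits, rollover(bits))
```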

We can get some really decent rollover dates by expanding the field
width, and the basic question comes down to how far ahead we wish to
push the problem - noting that the WinXX Y2K problem has only been
pushed back to be the Y10K problem now.

The other side of the equation is that we need to increase the
resolution with which we give out timestamps, and it appears to me that
the simplest means would be to change the kernel to use a smaller unit
to record timestamps. The current set of calls would then convert this
to seconds, and we would provide a new set of calls that returned the
raw values as used in the kernel.
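
As a rough illustration of that split between old and new calls, here is a sketch that assumes the raw value is an integer count of 100 ns units since the epoch. The unit and the names `UNITS_PER_SECOND`, `legacy_seconds` and `raw_to_sec_nsec` are assumptions chosen for the example, not an existing interface.

```python
UNITS_PER_SECOND = 10_000_000      # assumption: one unit is 100 ns
NSEC_PER_UNIT = 1_000_000_000 // UNITS_PER_SECOND


def legacy_seconds(raw):
    """What an existing one-second-resolution call would report."""
    return raw // UNITS_PER_SECOND


def raw_to_sec_nsec(raw):
    """Split a raw timestamp into whole seconds and a nanosecond remainder."""
    secs, frac = divmod(raw, UNITS_PER_SECOND)
    return secs, frac * NSEC_PER_UNIT
```

Because the unit divides one second exactly, both conversions are a single integer division with no rounding error.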

Assuming the field widths have to be a complete number of bytes, we need
to determine what the minimum resolution is to allow us to record times
up to 00:00:00 GMT on the 1st of January in whatever year we wish to be
able to record up to. Here's what we would need to use for the given
field sizes to handle up to the following years:

Field   Year   Year   Year   Year   Year   Year   Year   Year   Year
Width   2038   2500   5000  10000  25000  50000 100000 250000 500000
~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~
   32    1 s
   40   4 ms  31 ms 174 ms 461 ms
   48  16 us 119 us 680 us 1.8 ms 5.1 ms  11 ms  22 ms  56 ms 112 ms
   56  60 ns 465 ns 2.7 us 7.1 us  21 us  43 us  86 us 218 us 437 us
   64 233 ps 1.8 ns  11 ns  28 ns  79 ns 165 ns 336 ns 849 ns 1.8 us
   72 909 fs 7.1 ps  41 ps 108 ps 308 ps 642 ps 1.4 ns 3.4 ns 6.7 ns
   80 3.6 fs  28 fs 159 fs 420 fs 1.2 ps 2.6 ps 5.2 ps  13 ps  27 ps
~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~
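
The entries in this table follow from dividing the number of seconds between 1970 and the target year by the number of positive values a signed field can hold. A sketch of that arithmetic, assuming the proleptic Gregorian calendar and no leap seconds (the function names are made up for illustration; the table's figures appear to be rounded up):

```python
def leap_years_before(year):
    """Number of Gregorian leap years in the range 1 .. year-1."""
    y = year - 1
    return y // 4 - y // 100 + y // 400


def seconds_until(year):
    """Seconds from 1970-01-01 00:00:00 to 1 January of `year`."""
    days = 365 * (year - 1970) + leap_years_before(year) - leap_years_before(1970)
    return days * 86400


def min_unit(bits, year):
    """Smallest time unit (in seconds) with which a signed `bits`-wide
    counter starting at the 1970 epoch still reaches `year`."""
    return seconds_until(year) / 2 ** (bits - 1)


if __name__ == "__main__":
    for bits in (32, 40, 48, 56, 64, 72, 80):
        print(bits, min_unit(bits, 2038))
```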

I note that with the recent Y2K changes, WinXX software will next hit
rollover in case (C), and we don't want to be worse than that. Also, to
keep the conversion routines for the current functions simple, we need
to choose an interval that divides exactly into one second.

I would therefore conclude that we could aim for any of the following:

Field Width   Unit of Time   Rollover Month
~~~~~~~~~~~   ~~~~~~~~~~~~   ~~~~~~~~~~~~~~

40 bits             500 ms   May  10680
                       1 s   Sep  19390

48 bits            2500 us   Apr  13119
                      5 ms   Jul  24268 *
                     10 ms   Jan  46567
                     25 ms   Jul 113462
                    125 ms   Sep 559432

56 bits              10 us   Nov  13386
                     25 us   Feb  30512 *
                     50 us   Mar  59054
                    100 us   May 116138
                    500 us   Nov 572811

64 bits              50 ns   Jul  16583
                    100 ns   Feb  31197 *
                    250 ns   Oct  75037
                    500 ns   Jul 148105
                   1000 ns   Jan 294241
                   2500 ns   Jul 732647

72 bits             125 ps   Sep  11322
                    250 ps   May  20675
                    500 ps   Sep  39380 *
                      1 ns   May  76791
                     10 ns   Oct 750183

80 bits             500 fs   Feb  11547
                   1000 fs   Apr  21124
                   1250 fs   Nov  25912 *
                   2500 fs   Sep  49855
                      5 ps   May  97741
                     10 ps   Sep 193512
                     25 ps   Nov 480826

Allowing that WinXX software is now only susceptible to the Y10K
problem, we can't afford to do worse than that, and the sooner we
sort this out, the better for all concerned as far as I can tell.

My personal choices at each field width would be those marked with an
asterisk, and this is based on the principle of using the shortest time
interval possible that is consistent with being able to record up to
around AD 25000 in a signed field.

My overall preference would be to go straight to 64 bit date fields and
define them as storing the time in units of 100 nanoseconds, but it has
apparently been decided that we will use 48 bit fields, if what I've
seen on this list is correct.

> Odds are legacy systems will be running something people for
> whatever reason couldn't replace.

Most probably...

Best wishes from Riley.