2002-10-27 12:07:06

by Andi Kleen

Subject: New nanosecond stat patch for 2.5.44


Move time_t members in struct stat to struct timespec and allow subsecond
timestamps for files. Too big to post on the list, because it edits
a lot of file systems and drivers in a straightforward way.

This is required for reliable "make" on fast computers.

File systems that support nsec storage are currently: XFS, JFS, NFSv3
(if the filesystem on the server supports it), VFAT (not quite nanosecond),
CIFS (unit in 100ns which is above what Linux supports), SMBFS (for
newer servers).

This is proposed for 2.6.

Changes against the last version:
- Now always take xtime_lock when accessing the whole of xtime
- Port to 2.5.44
- New filesystems supported: CIFS, AFS

ftp://ftp.firstfloor.org/pub/ak/v2.5/nsec-2.5.44-1.bz2


-Andi
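
A minimal sketch of the "make" problem mentioned above, assuming a
make-like tool that rebuilds only when the source is strictly newer than
the target. The function name and the example times are illustrative and
not taken from the patch.

```python
NSEC_PER_SEC = 1_000_000_000


def needs_rebuild(source_mtime_ns, target_mtime_ns, resolution_ns):
    """Return True if a make-like tool would rebuild the target.

    Both timestamps are first truncated to resolution_ns, the way a
    filesystem with coarser timestamps would have stored them.
    """
    src = source_mtime_ns - source_mtime_ns % resolution_ns
    tgt = target_mtime_ns - target_mtime_ns % resolution_ns
    # Rebuild only when the source is strictly newer than the target.
    return src > tgt


# Target built at t = 100.2 s, source edited again at t = 100.7 s.
target_ns = 100 * NSEC_PER_SEC + 200_000_000
source_ns = 100 * NSEC_PER_SEC + 700_000_000

missed = needs_rebuild(source_ns, target_ns, NSEC_PER_SEC)  # 1 s stamps
caught = needs_rebuild(source_ns, target_ns, 1)             # 1 ns stamps
print("1 s resolution rebuilds:", missed)
print("1 ns resolution rebuilds:", caught)
```

With one-second stamps both files appear to have the same time, so the
later edit goes unnoticed; with nanosecond stamps the edit is visible.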


2002-10-27 14:27:14

by Andi Kleen

Subject: Re: New nanosecond stat patch for 2.5.44 - new patch II

> ftp://ftp.firstfloor.org/pub/ak/v2.5/nsec-2.5.44-1.bz2

This version unfortunately had some problems. I removed it now
and replaced it with

ftp://ftp.firstfloor.org/pub/ak/v2.5/nsec-2.5.44-2.bz2

If you already got -1 please redownload.

Thank you,
-Andi

2002-10-27 21:45:52

by Andreas Dilger

Subject: Re: New nanosecond stat patch for 2.5.44

On Oct 27, 2002 13:13 +0100, Andi Kleen wrote:
> Move time_t members in struct stat to struct timespec and allow subsecond
> timestamps for files. Too big to post on the list, because it edits
> a lot of file systems and drivers in a straight forward way.
>
> This is required for reliable "make" on fast computers.
>
> File systems that support nsec storage are currently: XFS, JFS, NFSv3
> (if the filesystem on the server supports it), VFAT (not quite nanosecond),
> CIFS (unit in 100ns which is above what linux supports), SMBFS (for
> newer servers)

Two notes I might make about this:
1) It would be good if it were possible to select this with a config
option (I don't care which way the default goes), so that people who
don't need/care about the increased resolution don't need the extra
space in their inodes and minor extra overhead. To make this a lot
easier to code, it would help to have something akin to inode_update_time()
which does all of the i_[acm]time updates as appropriate.
2) Updating i_atime based on comparing the nsec timestamp is going to be
a killer. I think AKPM saw dramatic performance improvements when he
changed the code to only do the update once/second, and even though
you are "only" updating the atime if the times are different, in
practice this will always be the case. Even without the "per superblock interval"
you suggest we should probably only update the atime once a second (I
don't think anything is keyed off such high resolution atimes, unlike
make and mtime/ctime).
3) The fields you are usurping in struct stat are actually there for the
Y2038 problem (when time_t wraps). At least that's what Ted said when
we were looking into nsec times for ext2/3. Granted, we may all be
using 64-bit systems by 2038... I've always thought 64 bits is much
too large for time_t, so we could always use 20 or 30 bits for sub-second
times, and the remaining bits for extending time_t at the high end,
and mask those off for now, but that is a separate issue...

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
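
The bit-splitting idea in note 3 can be sketched as follows. This is only
an illustration under assumed choices: 30 bits for nanoseconds (since
2**30 exceeds 10**9), the remaining 34 bits for unsigned seconds, and
hypothetical helper names. The message does not specify an actual layout.

```python
NSEC_PER_SEC = 1_000_000_000
NSEC_BITS = 30                    # assumed split: 2**30 > 10**9
SEC_BITS = 64 - NSEC_BITS         # 34 bits remain for the seconds
NSEC_MASK = (1 << NSEC_BITS) - 1


def pack_time(sec, nsec):
    """Pack (sec, nsec) into one 64-bit integer (hypothetical layout)."""
    if not 0 <= nsec < NSEC_PER_SEC:
        raise ValueError("nsec out of range")
    if not 0 <= sec < (1 << SEC_BITS):
        raise ValueError("sec does not fit in the remaining bits")
    return (sec << NSEC_BITS) | nsec


def unpack_time(packed):
    """Split a packed value back into (sec, nsec)."""
    return packed >> NSEC_BITS, packed & NSEC_MASK


# A seconds value just past the signed 32-bit limit still round-trips.
packed = pack_time(2**31, 999_999_999)
print(unpack_time(packed))
```

With this particular split the seconds field covers about 544 years from
its epoch, compared with about 68 years for a signed 32-bit time_t.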

2002-10-27 22:50:46

by H. Peter Anvin

Subject: Re: New nanosecond stat patch for 2.5.44

Followup to: <[email protected]>
By author: Andreas Dilger <[email protected]>
In newsgroup: linux.dev.kernel
>
> 3) The fields you are usurping in struct stat are actually there for the
> Y2038 problem (when time_t wraps). At least that's what Ted said when
> we were looking into nsec times for ext2/3. Granted, we may all be
> using 64-bit systems by 2038... I've always thought 64 bits is much
> to large for time_t, so we could always use 20 or 30 bits for sub-second
> times, and the remaining bits for extending time_t at the high end,
> and mask those off for now, but that is a separate issue...
>

64-bit time_t is nice because you don't *ever* need to worry about
overflow; it's capable of handling times on a galactic lifespan
scale. It's overkill, of course, but it's the *right* kind of
overkill.

We probably need to revamp struct stat anyway, to support a larger
dev_t, and possibly a larger ino_t (we should account for 64-bit ino_t
at least if we have to redesign the structure.) At that point I would
really like to advocate for int64_t ts_sec and uint32_t ts_nsec and
quite possibly a int32_t ts_taidelta to deal with leap seconds... I'd
personally like struct timespec to look like the above everywhere.

-hpa


--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>
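
A rough sketch of the timestamp layout hpa advocates, using Python's
struct module to show the size and range. The field names ts_sec, ts_nsec
and ts_taidelta come from the message; the little-endian packing, the
helper names and the example values are assumptions for illustration.

```python
import struct

# int64_t ts_sec, uint32_t ts_nsec, int32_t ts_taidelta, no padding.
TIMESPEC_FMT = "<qIi"


def pack_timespec(ts_sec, ts_nsec, ts_taidelta=0):
    """Serialize the three fields into 16 bytes."""
    return struct.pack(TIMESPEC_FMT, ts_sec, ts_nsec, ts_taidelta)


def unpack_timespec(data):
    """Return (ts_sec, ts_nsec, ts_taidelta) from 16 bytes."""
    return struct.unpack(TIMESPEC_FMT, data)


size = struct.calcsize(TIMESPEC_FMT)
# Arbitrary example: half a second before the epoch, with an example offset.
blob = pack_timespec(-1, 500_000_000, 32)
print(size, unpack_timespec(blob))

# Range of a signed 64-bit second counter, in years of 365.25 days.
years = 2**63 / (365.25 * 24 * 3600)
print(f"about {years:.3g} years in each direction")
```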

2002-10-27 23:10:18

by Horst H. von Brand

Subject: Re: New nanosecond stat patch for 2.5.44

Andreas Dilger <[email protected]> said:
> On Oct 27, 2002 13:13 +0100, Andi Kleen wrote:
> > Move time_t members in struct stat to struct timespec and allow subsecond
> > timestamps for files. Too big to post on the list, because it edits
> > a lot of file systems and drivers in a straight forward way.
> >
> > This is required for reliable "make" on fast computers.
> >
> > File systems that support nsec storage are currently: XFS, JFS, NFSv3
> > (if the filesystem on the server supports it), VFAT (not quite nanosecond),
> > CIFS (unit in 100ns which is above what linux supports), SMBFS (for
> > newer servers)
>
> Two notes I might make about this:
> 1) It would be good if it were possible to select this with a config
> option (I don't care which way the default goes), so that people who
> don't need/care about the increased resolution don't need the extra
> space in their inodes and minor extra overhead. To make this a lot
> easier to code, having something akin to the inode_update_time()
> which does all of the i_[acm]time updates as appropriate.

Please don't. Do not create incompatible versions of the same filesystem
just because they were written on kernels compiled with different
configurations. Superblock flags might be OK, but what is the point then?
Better mount flags (mount with/without finegrained timestamps)?

[....]

> 3) The fields you are usurping in struct stat are actually there for the
> Y2038 problem (when time_t wraps). At least that's what Ted said when
> we were looking into nsec times for ext2/3. Granted, we may all be
> using 64-bit systems by 2038... I've always thought 64 bits is much
> to large for time_t, so we could always use 20 or 30 bits for sub-second
> times, and the remaining bits for extending time_t at the high end,
> and mask those off for now, but that is a separate issue...

IMVHO, keeping fields in filesystems' inodes for 36 years in the future is
daydreaming. Not even the filesystems in the just 11 year old Linux have
survived unscathed... and by '38 we'll probably be at ext8 or so, on
64-bit CPUs.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2002-10-28 01:17:23

by Chris Friesen

Subject: Re: New nanosecond stat patch for 2.5.44

H. Peter Anvin wrote:

> We probably need to revamp struct stat anyway, to support a larger
> dev_t, and possibly a larger ino_t (we should account for 64-bit ino_t
> at least if we have to redesign the structure.) At that point I would
> really like to advocate for int64_t ts_sec and uint32_t ts_nsec and
> quite possibly a int32_t ts_taidelta to deal with leap seconds... I'd
> personally like struct timespec to look like the above everywhere.

For filesystems can we get away with just the 64-bit nanoseconds? By my
calculations that gives something like 584 years--do we need to worry
about files older than that?

Chris



--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]
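
The 584-year figure can be checked with a few lines of arithmetic. The
sketch assumes an unsigned 64-bit nanosecond counter and a year of 365.25
days; a signed counter covers half that span.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def span_years(bits, units_per_second):
    """Years representable by a counter of the given width and unit."""
    return 2**bits / units_per_second / SECONDS_PER_YEAR


unsigned_ns = span_years(64, 10**9)   # unsigned 64-bit nanoseconds
signed_ns = span_years(63, 10**9)     # signed: one bit lost to the sign
print(f"unsigned: {unsigned_ns:.1f} years, signed: {signed_ns:.1f} years")
```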

2002-10-28 04:37:24

by Andi Kleen

Subject: Re: New nanosecond stat patch for 2.5.44

Andreas Dilger <[email protected]> writes:

> On Oct 27, 2002 13:13 +0100, Andi Kleen wrote:
> > Move time_t members in struct stat to struct timespec and allow subsecond
> > timestamps for files. Too big to post on the list, because it edits
> > a lot of file systems and drivers in a straight forward way.
> >
> > This is required for reliable "make" on fast computers.
> >
> > File systems that support nsec storage are currently: XFS, JFS, NFSv3
> > (if the filesystem on the server supports it), VFAT (not quite nanosecond),
> > CIFS (unit in 100ns which is above what linux supports), SMBFS (for
> > newer servers)
>
> Two notes I might make about this:
> 1) It would be good if it were possible to select this with a config
> option (I don't care which way the default goes), so that people who
> don't need/care about the increased resolution don't need the extra
> space in their inodes and minor extra overhead. To make this a lot
> easier to code, having something akin to the inode_update_time()
> which does all of the i_[acm]time updates as appropriate.

You're joking right? That's twelve bytes of more state per struct inode
and I bet even with the most insidious micro benchmark you won't be
able to detect a difference in speed from the basic manipulation.

What could hurt a bit is that the "only flush atime once a second"
optimization is gone currently. The right way to address that would
be a mount option "atime_flush_interval", not a CONFIG.

> 2) Updating i_atime based on comparing the nsec timestamp is going to be
> a killer. I think AKPM saw dramatic performance improvements when he
> changed the code to only do the update once/second, and even though
> you are "only" updating the atime if the times are different, in
> practise this will be always. Even without the "per superblock interval"
> you suggest we should probably only update the atime once a second (I
> don't think anything is keyed off such high resolution atimes, unlike
> make and mtime/ctime).

Again I wrote about this in my original mail. Please see
ftp://ftp.firstfloor.org/pub/ak/v2.5/nsec.notes

Basically I agree with you that it's a problem (although "killer" seems
to be an exaggeration to me). The right solution IMHO would be to implement
a new super block field / mount option that specifies the atime flush
interval (basically a generalized noatime). Then you can say you only want it flushed
every 60s and the result will be much faster than what we have currently.

Some file systems already implement intelligent atime flushing (like XFS)
and they don't need it.

But I didn't want to mix such a patch into the big patchkit. When the nsec
patchkit is integrated and benchmarks show it is a problem I will submit
a follow up patch instead.


> 3) The fields you are usurping in struct stat are actually there for the
> Y2038 problem (when time_t wraps). At least that's what Ted said when
> we were looking into nsec times for ext2/3. Granted, we may all be
> using 64-bit systems by 2038... I've always thought 64 bits is much
> to large for time_t, so we could always use 20 or 30 bits for sub-second
> times, and the remaining bits for extending time_t at the high end,
> and mask those off for now, but that is a separate issue...

I wrote about this in my original notes (perhaps I should repost them,
I think they are still on the ftp server). For year 2038 we will need
lots of new syscalls: new time(2), new gettimeofday(2) and lots of
others. When all these are added then a new stat isn't that big
a problem. Also glibc currently doesn't know how to use these fields
for y2038, so all user programs need to be relinked anyways. Again
when that happens it's no big issue to add a new stat. I bet when
y2038 comes there will be other reasons for a new stat too.
So it's fine to reuse these fields.

The make problem is much more pressing.

-Andi

2002-10-28 04:41:07

by Andi Kleen

Subject: Re: New nanosecond stat patch for 2.5.44

Chris Friesen <[email protected]> writes:

> H. Peter Anvin wrote:
>
> > We probably need to revamp struct stat anyway, to support a larger
> > dev_t, and possibly a larger ino_t (we should account for 64-bit ino_t
> > at least if we have to redesign the structure.) At that point I would
> > really like to advocate for int64_t ts_sec and uint32_t ts_nsec and
> > quite possibly a int32_t ts_taidelta to deal with leap seconds... I'd
> > personally like struct timespec to look like the above everywhere.
>
> For filesystems can we get away with just the 64-bit nanoseconds? By my
> calculations that gives something like 584 years--do we need to worry
> about files older than that?

The current timestamps on 32bit systems are 32bit. 64bit nanoseconds
would take the same room as 32bit second + 32bit nanosecond. And it would
be incompatible with current glibc (with which the additional nanosecond
fields are perfectly compatible - they are zeroed currently). Also glibc
would need to convert it to a timespec for Solaris compatibility anyways
and would need an unnecessary division for that.

The same thing applies to file system storage. Storing in nanoseconds
(like e.g. NTFS or CIFS do - they store 64bit in 100ns units since 1601)
would require slow divisions to convert from the user visible format,
needs the same space and has no advantage as far as I can see.

-Andi
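
To make the division cost concrete, here is a small sketch of converting
between a 64-bit count of 100ns units since 1601 (the NTFS/CIFS style
mentioned above) and a seconds/nanoseconds pair. The function names are
illustrative; the epoch offset is the number of seconds between
1601-01-01 and 1970-01-01.

```python
EPOCH_DIFF_SECS = 11_644_473_600   # seconds from 1601-01-01 to 1970-01-01
TICKS_PER_SEC = 10_000_000         # one tick = 100 ns


def ticks_to_timespec(ticks):
    """Convert 100ns ticks since 1601 to (sec, nsec) since 1970."""
    # This divmod is the division that a sec/nsec on-disk format avoids.
    secs, rem = divmod(ticks, TICKS_PER_SEC)
    return secs - EPOCH_DIFF_SECS, rem * 100


def timespec_to_ticks(sec, nsec):
    """Convert (sec, nsec) since 1970 to 100ns ticks since 1601."""
    # Anything finer than 100 ns is dropped here.
    return (sec + EPOCH_DIFF_SECS) * TICKS_PER_SEC + nsec // 100


unix_epoch_ticks = timespec_to_ticks(0, 0)
print(unix_epoch_ticks, ticks_to_timespec(unix_epoch_ticks))
```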

2002-10-28 05:32:04

by Andreas Dilger

Subject: Re: New nanosecond stat patch for 2.5.44

On Oct 28, 2002 05:42 +0100, Andi Kleen wrote:
> Andreas Dilger <[email protected]> writes:
> > 1) It would be good if it were possible to select this with a config
> > option (I don't care which way the default goes), so that people who
> > don't need/care about the increased resolution don't need the extra
> > space in their inodes and minor extra overhead. To make this a lot
> > easier to code, having something akin to the inode_update_time()
> > which does all of the i_[acm]time updates as appropriate.
>
> You're joking right? That's twelve bytes of more state per struct inode
> and I bet even with the most insidious micro benchmark you won't be
> able to detect a difference in speed from the basic manipulation.

Except that people have a lot of inodes in their slab caches... It's not
so much the processing overhead as the extra memory. struct inode is
bloated enough without adding more into it that isn't necessarily useful
for some people (people who don't have lots of RAM, or don't use any
filesystems which support the higher resolution, or are slow enough that
compiles don't have problems, or don't compile at all)...

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-10-28 06:29:33

by Rob Landley

Subject: Re: New nanosecond stat patch for 2.5.44

On Sunday 27 October 2002 19:23, Chris Friesen wrote:
> H. Peter Anvin wrote:
> > We probably need to revamp struct stat anyway, to support a larger
> > dev_t, and possibly a larger ino_t (we should account for 64-bit ino_t
> > at least if we have to redesign the structure.) At that point I would
> > really like to advocate for int64_t ts_sec and uint32_t ts_nsec and
> > quite possibly a int32_t ts_taidelta to deal with leap seconds... I'd
> > personally like struct timespec to look like the above everywhere.
>
> For filesystems can we get away with just the 64-bit nanoseconds? By my
> calculations that gives something like 584 years--do we need to worry
> about files older than that?

1) The hard drive is only about 50 years old, so there aren't any files older
than that at the moment:
http://www.mdhc.scu.edu/100th/reyjohnson.htm

2) This thing is unlikely to be a problem in our lifetimes, our
grandchildren's lifetimes, or our great grandchildren's lifetimes (barring
unforeseen advances in active telomere reconstruction and a regenerative
interpretation of DNA that somehow looks at it as a blueprint rather than a
recipe).

3) If any current hardware or software is still in use in the year 2554, it
will be seriously overdue for an upgrade.

Rob

--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffeinated jello. Well why not?

2002-10-28 17:06:55

by Andreas Dilger

Subject: Re: New nanosecond stat patch for 2.5.44

On Oct 27, 2002 20:16 -0300, Horst von Brand wrote:
> Andreas Dilger <[email protected]> said:
> > 1) It would be good if it were possible to select this with a config
> > option (I don't care which way the default goes), so that people who
> > don't need/care about the increased resolution don't need the extra
> > space in their inodes and minor extra overhead. To make this a lot
> > easier to code, having something akin to the inode_update_time()
> > which does all of the i_[acm]time updates as appropriate.
>
> Please don't. Do not create incompatible versions of the same filesystem
> just because they were written on kernels compiled with different
> configurations. Superblock flags might be OK, but what is the point then?
> Better mount flags (mount with/without finegrained timestamps)?

I don't say anything about creating incompatible versions of the same
filesystem. Configuring out nsec timestamps is no different than what
we have today. Many filesystems do not support nsec timestamps anyways.

I just see this as one of many hundreds of "tiny" features that are
added to Linux that could easily be made a config option when they
are first added, but all just end up adding a tiny bit of bloat for
people that don't need it.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-10-29 14:55:49

by Bill Davidsen

Subject: Re: New nanosecond stat patch for 2.5.44

On Sun, 27 Oct 2002, Andreas Dilger wrote:

> Two notes I might make about this:
> 1) It would be good if it were possible to select this with a config
> option (I don't care which way the default goes), so that people who
> don't need/care about the increased resolution don't need the extra
> space in their inodes and minor extra overhead. To make this a lot
> easier to code, having something akin to the inode_update_time()
> which does all of the i_[acm]time updates as appropriate.

Am I missing something? That would make it two file types, no? I bet
there's more overhead in handling that problem than just writing the time.

> 2) Updating i_atime based on comparing the nsec timestamp is going to be
> a killer. I think AKPM saw dramatic performance improvements when he
> changed the code to only do the update once/second, and even though
> you are "only" updating the atime if the times are different, in
> practise this will be always. Even without the "per superblock interval"
> you suggest we should probably only update the atime once a second (I
> don't think anything is keyed off such high resolution atimes, unlike
> make and mtime/ctime).

find -anewer seems to use as much resolution as it has. More to the point,
what is the overhead of updating the time when an i/o is done? It would
seem pretty trivial.

If you are willing to give up a flag bit you could store the time in some
native unit (machine type dependent) when an i/o is done, then do the
convert to ns when it's used, such as compare, close, etc. You could have
an inode walker thread do the convert in background if that seems needed.
There are probably other ways to reduce overhead, those just came to mind.
I think it's a pretty low impact problem with some effort on making it so.

> 3) The fields you are usurping in struct stat are actually there for the
> Y2038 problem (when time_t wraps). At least that's what Ted said when
> we were looking into nsec times for ext2/3. Granted, we may all be
> using 64-bit systems by 2038... I've always thought 64 bits is much
> to large for time_t, so we could always use 20 or 30 bits for sub-second
> times, and the remaining bits for extending time_t at the high end,
> and mask those off for now, but that is a separate issue...

As you say, but good that you brought it up!

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
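
One way to read Bill's "flag bit" suggestion is sketched below: record a
raw machine-dependent tick count on the hot path and convert it to
nanoseconds only when the value is actually needed. The class and its
attribute names are hypothetical and do not correspond to anything in the
patch.

```python
NSEC_PER_SEC = 1_000_000_000


class LazyStamp:
    """Timestamp kept in raw ticks until somebody asks for nanoseconds."""

    def __init__(self, ticks_per_sec):
        self.ticks_per_sec = ticks_per_sec
        self.raw = 0
        self.ns = 0
        self.is_raw = False   # the "flag bit": True while unconverted

    def touch(self, raw_ticks):
        # Hot path (every I/O): just store the raw counter value.
        self.raw = raw_ticks
        self.is_raw = True

    def as_ns(self):
        # Cold path (stat, compare, write-out): convert once and cache.
        if self.is_raw:
            self.ns = self.raw * NSEC_PER_SEC // self.ticks_per_sec
            self.is_raw = False
        return self.ns


stamp = LazyStamp(ticks_per_sec=1000)
stamp.touch(1500)          # 1500 ticks at 1000 ticks per second
print(stamp.as_ns())
```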

2002-10-29 16:27:28

by Andreas Dilger

Subject: Re: New nanosecond stat patch for 2.5.44

On Oct 29, 2002 10:01 -0500, Bill Davidsen wrote:
> On Sun, 27 Oct 2002, Andreas Dilger wrote:
> > 1) It would be good if it were possible to select this with a config
> > option (I don't care which way the default goes), so that people who
> > don't need/care about the increased resolution don't need the extra
> > space in their inodes and minor extra overhead. To make this a lot
> > easier to code, having something akin to the inode_update_time()
> > which does all of the i_[acm]time updates as appropriate.
>
> Am I missing something? That would make it two file types, no? I bet
> there's more overhead in handling that problem than just writing the time.

Not necessarily. Most filesystems don't even have space for storing a
sub-second time resolution, so having the extra time resolution is
irrelevant. For filesystems which do have room for sub-second timestamps
they currently just fill in 0 there, and if the sub-second time is here
they will fill in that field, so still no incompatible on-disk formats.

As for ext3 having sub-second timestamps, this will be done in a way
which makes it compatible with older filesystems, so whether those
timestamps are written or not written, the filesystem will still be
readable on older kernels.

The "inode" space that I'm referring to is the in-memory inode struct,
and the presence of that would be determined at compile time. Granted,
it would only be 12 bytes added to the inode, but if you have thousands
or millions of inodes resident you start to feel the pinch.

> > 2) Updating i_atime based on comparing the nsec timestamp is going to be
> > a killer. I think AKPM saw dramatic performance improvements when he
> > changed the code to only do the update once/second, and even though
> > you are "only" updating the atime if the times are different, in
> > practise this will be always. Even without the "per superblock interval"
> > you suggest we should probably only update the atime once a second (I
> > don't think anything is keyed off such high resolution atimes, unlike
> > make and mtime/ctime).
>
> find -anewer seems to use as much resolution as it has. More to the point,
> what is the overhead of updating the time when an i/o is done? It would
> seem pretty trivial.

It would be trivial if you are already updating the inode (and we should
optimize for this case), but if you are reading a file in 5-byte chunks
and you update the atime a thousand times a second it most certainly IS
a lot of overhead. We currently limit atime updates to 1/second by
checking if the atime has changed or not. The proposed patch checks if
the atime.ts_nsec has changed, and it most certainly will have, so this
will always be updating the atime on disk.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
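
The difference between the two atime checks can be simulated in a few
lines. This is a toy model, not kernel code: it assumes an update (a
dirtied inode) happens whenever the newly computed atime differs from the
stored one, and the function name is made up for the example.

```python
NSEC_PER_SEC = 1_000_000_000


def count_atime_updates(read_times_ns, compare_nsec):
    """Count how often the stored atime changes over a series of reads."""
    updates = 0
    stored = None
    for now in read_times_ns:
        # Either keep the full nanosecond value or only the whole second.
        new = now if compare_nsec else now - now % NSEC_PER_SEC
        if new != stored:
            stored = new
            updates += 1      # in the kernel this would dirty the inode
    return updates


# 1000 small reads, 1 ms apart, all within the same second.
reads = [i * 1_000_000 for i in range(1000)]
per_second = count_atime_updates(reads, compare_nsec=False)
per_nsec = count_atime_updates(reads, compare_nsec=True)
print(per_second, per_nsec)
```

Comparing whole seconds gives one update for the burst, while comparing
the nanosecond field gives one update per read.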

2002-10-29 20:32:10

by Bill Davidsen

Subject: Re: New nanosecond stat patch for 2.5.44

On Tue, 29 Oct 2002, Andreas Dilger wrote:

> On Oct 29, 2002 10:01 -0500, Bill Davidsen wrote:
> > On Sun, 27 Oct 2002, Andreas Dilger wrote:
> > > 1) It would be good if it were possible to select this with a config
> > > option (I don't care which way the default goes), so that people who
> > > don't need/care about the increased resolution don't need the extra
> > > space in their inodes and minor extra overhead. To make this a lot
> > > easier to code, having something akin to the inode_update_time()
> > > which does all of the i_[acm]time updates as appropriate.
> >
> > Am I missing something? That would make it two file types, no? I bet
> > there's more overhead in handling that problem than just writing the time.
>
> Not necessarily. Most filesystems don't even have space for storing a
> sub-second time resolution, so having the extra time resolution is
> irrelevant. For filesystems which do have room for sub-second timestamps
> they currently just fill in 0 there, and if the sub-second time is here
> they will fill in that field, so still no incompatible on-disk formats.

That was my concern.

> As for ext3 having sub-second timestamps, this will be done in a way
> which makes it compatible with older filesystem, so whether those
> timestamps are written or not written, the filesystem will still be
> readable on older kernels.

I was more thinking of a kernel compiled without the hi-res timer code, if
that should be done as an option.

> The "inode" space that I'm referring to is the in-memory inode struct,
> and the presence of that would be determined at compile time. Granted,
> it would only be 12 bytes added to the inode, but if you have thousands
> or millions of inodes resident you start to feel the pinch.

I admit to being one of the "thousands" people, and even if I have 100k
inodes (more likely to be 10% of that) it's in the order of a MB, and any
machine which has 100k inodes open is likely to be large enough to ignore
a MB. One advantage of keeping the HRT in the in-core inode is that it
allows parallel make to work correctly even on a filesystem which doesn't
have space to save that information.

Feel free to tell me if that last isn't true.

> > > 2) Updating i_atime based on comparing the nsec timestamp is going to be
> > > a killer. I think AKPM saw dramatic performance improvements when he
> > > changed the code to only do the update once/second, and even though
> > > you are "only" updating the atime if the times are different, in
> > > practise this will be always. Even without the "per superblock interval"
> > > you suggest we should probably only update the atime once a second (I
> > > don't think anything is keyed off such high resolution atimes, unlike
> > > make and mtime/ctime).
> >
> > find -anewer seems to use as much resolution as it has. More to the point,
> > what is the overhead of updating the time when an i/o is done? It would
> > seem pretty trivial.
>
> It would be trivial if you are already updating the inode (and we should
> optimize for this case), but if you are reading a file in 5-byte chunks
> and you update the atime a thousand times a second it most certainly IS
> a lot of overhead. We currently limit atime updates to 1/second by
> checking if the atime has changed or not. The proposed patch checks if
> the atime.ts_nsec has changed, and it most certainly will have, so this
> will always be updating the atime on disk.

1 - any program which does unbuffered 5 byte reads is probably going to
beat the machine to death anyway. Then the sysadmin will mount noatime.

2 - The patch isn't written in stone; going back to one update per second
shouldn't matter except in the case of network or devices shared between
multiple systems (3.0?). Processes on the same machine would use the
in-core information.

3 - updating once/sec could still be default, with HRT being a mount
option like noatime.

4 - the time could be stored in register values, ticks, or whatever else,
avoiding any conversion to ns. Then the time could be converted only when
the inode was read, written out, etc.

I'd really like your comments on these, you probably see things I've
missed.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
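
The memory figures being argued about are easy to tabulate. The sketch
takes the 12 bytes per in-memory inode quoted in the thread as given and
ignores any allocator rounding; the inode counts are the ones mentioned by
Bill and Andreas.

```python
EXTRA_BYTES_PER_INODE = 12   # figure quoted in the thread


def extra_memory_bytes(resident_inodes):
    """Extra memory used by the nanosecond fields for resident inodes."""
    return resident_inodes * EXTRA_BYTES_PER_INODE


for count in (10_000, 100_000, 1_000_000):
    megabytes = extra_memory_bytes(count) / 1_000_000
    print(f"{count:>9} inodes: {megabytes:.2f} MB")
```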

2002-10-30 00:39:49

by Jamie Lokier

Subject: Re: New nanosecond stat patch for 2.5.44

Bill Davidsen wrote:
> I admit to being one of the "thousands" people, and even if I have 100k
> inodes (more likely to be 10% of that) it's in the order of a MB, and any
> machine which has 100k inodes open is likely to be large enough to ignore
> a MB. One advantage of keeping the HRT in the in-core inode is that it
> allows parallel make to work correctly even on a filesystem which doesn't
> have space to save that information.
>
> Feel free to tell me if that last isn't true.

It isn't true if the parallel make actually uses your RAM for
something, thus flushing some of the inodes from RAM.

Admittedly it is no worse than we have at the moment. However, at the
moment it is possible to construct a "make" or other program of that
ilk which can always make a safe decision: if it's ambiguous whether a
file needs to be remade, then remake the file.

As soon as we have inode time stamp resolution being spontaneously
lowered (because some of the inodes are flushed from RAM and some
aren't), then it's not possible to make a safe program like that
anymore, unless you simply ignore the high resolution time stamps
_all_ the time, even when they are present.

You can just do that - it's correct behaviour. But it would be better
to use the high precision when available, as that reduces the number
of unnecessary remakes.

> 4 - the time could be stored in register values, ticks, or whatever else,
> avoiding any conversion to ns. Then the time could be converted only when
> the inode was read, written out, etc.
>
> I'd really like your comments on these, you probably see things I've
> missed.

I know of exactly one application which depends on atime information:
checking whether you have new mail in your inbox. That's done by
comparing atime and mtime on the mailbox. Mail readers read the file
after writing it, MTAs will simply write it.

For this to function correctly, what's important is that the atime is
updated to be at least the mtime. So for nanosecond atime updates, it
makes sense that the _first_ read following a write should update the
atime -- if not using the current clock, then simply copying the mtime
value.

-- Jamie
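
A small sketch of the mailbox check Jamie describes, assuming the usual
convention that mail counts as new when mtime is later than atime. The
function names are illustrative; the second function models his suggestion
that the first read after a write must leave atime at least equal to
mtime, either from the clock or by copying mtime.

```python
def has_new_mail(atime_ns, mtime_ns):
    """New mail if the mailbox was written after it was last read."""
    return mtime_ns > atime_ns


def atime_after_first_read(atime_ns, mtime_ns, now_ns=None):
    """atime to store on the first read following a write.

    If no clock value is supplied, fall back to copying mtime. In both
    cases the result is never earlier than mtime.
    """
    candidate = mtime_ns if now_ns is None else now_ns
    return max(atime_ns, mtime_ns, candidate)


# Mailbox last read at 5.1 s, new message delivered at 5.3 s.
atime, mtime = 5_100_000_000, 5_300_000_000
before = has_new_mail(atime, mtime)
atime = atime_after_first_read(atime, mtime)   # reader opens the mailbox
after = has_new_mail(atime, mtime)
print(before, after)
```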

2002-10-30 21:07:12

by Bill Davidsen

Subject: Re: New nanosecond stat patch for 2.5.44

On Wed, 30 Oct 2002, Jamie Lokier wrote:

> Bill Davidsen wrote:
> > I admit to being one of the "thousands" people, and even if I have 100k
> > inodes (more likely to be 10% of that) it's in the order of a MB, and any
> > machine which has 100k inodes open is likely to be large enough to ignore
> > a MB. One advantage of keeping the HRT in the in-core inode is that it
> > allows parallel make to work correctly even on a filesystem which doesn't
> > have space to save that information.
> >
> > Feel free to tell me if that last isn't true.
>
> It isn't true if the parallel make actually uses your RAM for
> something, thus flushing some of the inodes from RAM.

Hopefully it is being smart about doing that, or rather not doing that.
But that would be a good thing to add to my responsiveness benchmark, to
access a file, do a stat, and then do another stat later. Thanks for the
idea, I expect to release a new version sometime this weekend.

> Admittedly it is no worse than we have at the moment. However, at the
> moment it is possible, to construct a "make" or other program of that
> ilk which can always make a safe decision: if it's ambiguous whether a
> file needs to be remade, then remake the file.
>
> As soon as we have inodes time stamp resolution being spontanously
> lowered (because some of the inodes are flushed from RAM and some
> aren't), then it's not possible to make a safe program like that
> anymore, unless you simply ignore the high resolution time stamps
> _all_ the time, even when they are present.
>
> You can just do that - it's correct behaviour. But it would be better
> to use the high precision when available, as that reduces the number
> of unnecessary remakes.

I have to think about the point you raise of doing it one way or the other
but not mixing. I had assumed that the inode of a file which was open
would remain in core, and I want to look at the code before I form an
opinion. If the file is not open or the inode is a non-file...

> > 4 - the time could be stored in register values, ticks, or whatever else,
> > avoiding any conversion to ns. Then the time could be converted only when
> > the inode was read, written out, etc.
> >
> > I'd really like your comments on these, you probably see things I've
> > missed.
>
> I know of exactly one application which depends on atime information:
> checking whether you have new mail in your inbox. That's done by
> comparing atime and mtime on the mailbox. Mail readers read the file
> after writing it, MTAs will simply write it.
>
> For this to function correctly, what's important is that the atime is
> updated to be at least the mtime. So for nanosecond atime updates, it
> makes sense that the _first_ read following a write should update the
> atime -- if not using the current clock, then simply copying the mtime
> value.

I think you may have missed the point of (4), some of the overhead of
keeping HRT is the conversion of data to ns from some machine dependent
information. Where possible the base information, such as a register,
could be stored with a flag, avoiding the "convert to ns" CPU usage. The
conversion could be done when the data was used, before save, at the time
of a stat, etc. I have the feeling that would take some of the sting out
of keeping HRT. It doesn't matter if it's atime, mtime or ctime, the atime
was in response to "nobody uses HRT atime" in an earlier post.
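
Roughly what (4) could look like, as a sketch only. The struct and
function names are hypothetical, and it assumes a counter that runs at
a constant, known rate (ticks_per_sec), which is not true of every
machine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-core stamp: raw counter ticks or converted ns. */
struct lazy_stamp {
    bool     is_raw;  /* true: 'value' holds raw ticks, not ns */
    uint64_t value;
};

/* Cheap path, taken on every update: store the raw counter, no math. */
static void lazy_stamp_set_raw(struct lazy_stamp *s, uint64_t ticks)
{
    s->is_raw = true;
    s->value = ticks;
}

/*
 * Slow path, taken on stat() or before writeback: convert once and
 * cache the result. The split into seconds and remainder keeps the
 * multiply from overflowing for rates below roughly 18 GHz.
 */
static uint64_t lazy_stamp_ns(struct lazy_stamp *s, uint64_t ticks_per_sec)
{
    if (s->is_raw) {
        uint64_t sec = s->value / ticks_per_sec;
        uint64_t rem = s->value % ticks_per_sec;

        s->value = sec * 1000000000ULL
                 + rem * 1000000000ULL / ticks_per_sec;
        s->is_raw = false;
    }
    return s->value;
}
```

The flag costs a byte per stamp, and the division only happens for
inodes somebody actually looks at.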

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-10-30 22:12:07

by Jamie Lokier

[permalink] [raw]
Subject: Re: New nanosecond stat patch for 2.5.44

Bill Davidsen wrote:
> I have to think about the point you raise of doing it one way or the other
> but not mixing. I had assumed that the inode of a file which was open
> would remain in core, and I want to look at the code before I form an
> opinion. If the file is not open or the inode is a non-file...

Oh, the inode of a file which is open does remain in core. It's just
that between runs of a program like "make", the files aren't open, are
they?

> I think you may have missed the point of (4), some of the overhead of
> keeping HRT is the conversion of data to ns from some machine dependent
> information. Where possible the base information, such as a register,
> could be stored with a flag, avoiding the "convert to ns" CPU usage. The
> conversion could be done when the data was used, before save, at the time
> of a stat, etc. I have the feeling that would take some of the sting out
> of keeping HRT. It doesn't matter if it's atime, mtime or ctime, the atime
> was in response to "nobody uses HRT atime" in an earlier post.

That's some of the overhead. The other overhead is reading the clock,
which is quite high on x86 when TSC is not available. On a Pentium
with no reliable TSC, I think that the time for a read() system call
is comparable to the time to read the clock.

-- Jamie

2002-10-31 00:28:42

by H. Peter Anvin

[permalink] [raw]
Subject: Re: New nanosecond stat patch for 2.5.44

Followup to: <[email protected]>
By author: Jamie Lokier <[email protected]>
In newsgroup: linux.dev.kernel
>
> That's some of the overhead. The other overhead is reading the clock,
> which is quite high on x86 when TSC is not available. On a Pentium
> with no reliable TSC, I think that the time for a read() system call
> is comparable to the time to read the clock.
>

Typically the way you deal with not having a usably cheap
nanosecond-resolution clock is that you use the best available clock
(say if HZ=1000 you'll increment by 1000000 each timer tick), and then
simply use an atomic counter for the smaller divisions. This makes
the relation "is A newer than B" correct, while avoiding the overhead
of producing exact timestamps below the available resolution.
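
One way this could be sketched in C11, as an illustration rather than
kernel code. The name next_stamp_ns and the single shared 64-bit word
are assumptions; the tick value and the sub-tick counter are folded
into one number of nanoseconds.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Last stamp handed out, in ns. Shared by all callers. */
static _Atomic uint64_t last_stamp_ns;

/*
 * Return a stamp that is strictly newer than every stamp returned
 * before. coarse_now_ns is the cheap clock, e.g. advanced by 1000000
 * per tick with HZ=1000. Within one tick, successive callers get
 * coarse + 1, coarse + 2, ... so "is A newer than B" stays correct.
 */
static uint64_t next_stamp_ns(uint64_t coarse_now_ns)
{
    uint64_t prev = atomic_load(&last_stamp_ns);
    uint64_t next;

    do {
        next = coarse_now_ns > prev ? coarse_now_ns : prev + 1;
        /* On failure, prev is reloaded with the current value. */
    } while (!atomic_compare_exchange_weak(&last_stamp_ns, &prev, next));

    return next;
}
```

The low digits carry ordering only, not real time, and more than
1000000 stamps within a single tick would run ahead of the clock until
the next tick catches up.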

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-11-01 01:52:16

by Bill Davidsen

[permalink] [raw]
Subject: Re: New nanosecond stat patch for 2.5.44

On Wed, 30 Oct 2002, Jamie Lokier wrote:

> Bill Davidsen wrote:
> > I have to think about the point you raise of doing it one way or the other
> > but not mixing. I had assumed that the inode of a file which was open
> > would remain in core, and I want to look at the code before I form an
> > opinion. If the file is not open or the inode is a non-file...
>
> Oh, the inode of a file which is open does remain in core. It's just
> that between runs of a program like "make", the files aren't open, are
> they?

I thought we were talking about parallel make, rather than "between runs."
Your point is valid, but given the certainty that the inode has been
recently used, hopefully the kernel is smart on releasing them.

My first thought is that the commonly used filesystems, other than ext2,
do or will support high resolution time. NFS is its own nasty little
problem.

> > I think you may have missed the point of (4), some of the overhead of
> > keeping HRT is the conversion of data to ns from some machine dependent
> > information. Where possible the base information, such as a register,
> > could be stored with a flag, avoiding the "convert to ns" CPU usage. The
> > conversion could be done when the data was used, before save, at the time
> > of a stat, etc. I have the feeling that would take some of the sting out
> > of keeping HRT. It doesn't matter if it's atime, mtime or ctime, the atime
> > was in response to "nobody uses HRT atime" in an earlier post.
>
> That's some of the overhead. The other overhead is reading the clock,
> which is quite high on x86 when TSC is not available. On a Pentium
> with no reliable TSC, I think that the time for a read() system call
> is comparable to the time to read the clock.

Who uses a CPU without TSC? I guess the embedded folks and the people
using really old systems. There was a suggestion on handling that posted,
but I don't have it handy. Using the field as just a counter was the idea
if I remember correctly. The NUMA folks have their own set of problems, I
won't presume to even have an opinion on how they solve it, but if it
needs doing I'm sure they can do it.

Thinking out loud:
To avoid overhead, the kernel needs to be smart about when the updated
inode info is written to storage. Perhaps on writes, when the data written
actually falls off the elevator or is transferred to a network peer. Until
then the time can stay in memory; if the system goes down the write data is
lost, so having the inode reflect the time of the last completed write to
storage isn't a wildly wrong mtime.

For reads, having some bounded delay between the time of a system call
to read() and the time saved in the inode is of limited impact, as long as
the time to update the inode to storage doesn't get wildly behind the time
of the read. The one second you mentioned is probably aggressive if
anything. That might have to be a tunable.

I haven't forgotten access via execute, I don't know if it differs from
read in practice.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-11-01 03:26:55

by Jamie Lokier

[permalink] [raw]
Subject: Re: New nanosecond stat patch for 2.5.44

Bill Davidsen wrote:
> > Oh, the inode of a file which is open does remain in core. It's just
> > that between runs of a program like "make", the files aren't open, are
> > they?
>
> I thought we were talking about parallel make, rather than "between runs."

A parallel build often does call "make" separately many times, in
parallel but not guaranteed to overlap all file opens. Between those,
the files are closed.

> Your point is valid, but given the certainty that the inode has been
> recently used, hopefully the kernel is smart on releasing them.

That's a "hopefully", and it depends on how much RAM you have as well
as pure luck. I can live with that for building programs at home, but
there are many applications where "hopefully" affecting correctness of
behaviour is not acceptable.

> My first thought is that the commonly used filesystems, other than ext2,
> do or will support high resolution time. NFS is its own nasty little
> problem.

Do they support nanosecond time, though, or do they round it to
microseconds or something like that?

> [stuff about atime]

There seems to be general agreement that atime is not a very important
value, with which I concur. (Why do we even bother with nanosecond atimes?)

I am only concerned about mtime, which is very useful indeed when we
talk about building things which can detect changes to files.

Andi, I believe there is space in every architecture's stat64 (i.e. all
those that have one) for a word describing the mtime resolution. If I
code a patch to create that field, would you be interested?
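
A sketch of how a make-like program might use such a word. No such
field exists today, so source_res_ns is a purely hypothetical input,
and the code assumes stored times are rounded down to the resolution.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * The source's true mtime lies somewhere in
 * [source_ns, source_ns + source_res_ns). The target is only known to
 * be newer if its stamp is at or past the end of that window.
 * Anything else is ambiguous, and the safe answer is to remake.
 */
static bool needs_remake(uint64_t target_ns, uint64_t source_ns,
                         uint64_t source_res_ns)
{
    return target_ns < source_ns + source_res_ns;
}
```

With source_res_ns = 1 this is the plain "target not newer than
source" test; with one-second resolution it remakes anything built
within the same second as the source was stamped.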

-- Jamie

2002-11-06 13:20:36

by Gabriel Paubert

[permalink] [raw]
Subject: Re: New nanosecond stat patch for 2.5.44



On 31 Oct 2002, H. Peter Anvin wrote:

> Followup to: <[email protected]>
> By author: Andreas Dilger <[email protected]>
> In newsgroup: linux.dev.kernel
> >
> > 3) The fields you are usurping in struct stat are actually there for the
> > Y2038 problem (when time_t wraps). At least that's what Ted said when
> > we were looking into nsec times for ext2/3. Granted, we may all be
> > using 64-bit systems by 2038... I've always thought 64 bits is much
> > too large for time_t, so we could always use 20 or 30 bits for sub-second
> > times, and the remaining bits for extending time_t at the high end,
> > and mask those off for now, but that is a separate issue...
> >
>
> 64-bit time_t is nice because you don't *ever* need to worry about
> overflow; it's capable of handling times on a galactic lifespan
> scale. It's overkill, of course, but it's the *right* kind of
> overkill.

Indeed.

>
> We probably need to revamp struct stat anyway, to support a larger
> dev_t, and possibly a larger ino_t (we should account for 64-bit ino_t
> at least if we have to redesign the structure.) At that point I would
> really like to advocate for int64_t ts_sec and uint32_t ts_nsec and
> quite possibly an int32_t ts_taidelta to deal with leap seconds... I'd
> personally like struct timespec to look like the above everywhere.

I basically agree but I suspect that filesystem writers will not be very
happy if you want to use 16 bytes for each timestamp, especially when 8 of
the bytes (the 32 high order bits from the second count and the TAI-UT
offset) do not change very often. (besides that tv_nsec is defined as a
long, i.e. 64 bit on 64 bit machines and _signed_ , stupid if you ask me
but I digress).
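
For reference, the layout under discussion written out in C. The field
names are the ones proposed above; the struct name ts_proposed is made
up here.

```c
#include <assert.h>
#include <stdint.h>

/* Proposed timestamp format, as described in the quoted text. */
struct ts_proposed {
    int64_t  ts_sec;       /* seconds, 64 bits                 */
    uint32_t ts_nsec;      /* nanoseconds within the second    */
    int32_t  ts_taidelta;  /* TAI-UT offset, for leap seconds  */
};

/* 8 + 4 + 4 bytes, no padding needed: 16 bytes per timestamp. */
static_assert(sizeof(struct ts_proposed) == 16,
              "timestamp should occupy 16 bytes");
```

The 8 slowly changing bytes are the upper half of ts_sec plus
ts_taidelta.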

The goal as I understand it is to avoid first the possibility of ambiguous
timestamps, but then we have to be careful also not to break existing
applications (although they are already broken wrt leap seconds).

I don't know how to trim the highly repeated most significant bytes of the
tv_sec field (it's probably file system specific), but 4 bytes can easily
be shaved from the on-disk structure by packing the leap second
information in the high order bits of the nsec field: since the number of
nanoseconds per second is unlikely to ever need more than 30 bits to be
encoded ;-), the 2 most significant bits can be used to encode inserted
leap seconds. Actually 1 bit should be sufficient but some texts claim
that up to 2 leap seconds can be inserted, this has however actually never
happened AFAICT and I believe that NTP for example does not support 2 leap
seconds in a row.

Converting this encoding to the format you suggest for stat(2) is trivial:
it only needs a table of leap seconds. I don't care whether it's in the
kernel or in user space: it's small and grows slowly.

For now I have more problems with the fact that gettimeofday and friends
do not properly handle leap seconds and lead to ambiguous timestamps.
Once this problem (a real killer for astronomical data acquisition, leap
seconds are infrequent but they are a problem) is solved, filesystems can
be updated.

What could be important now is to mask the low 30 bits of the nsec field
and declare the 2 MSB reserved so that no kernel is out in the wild that
simply copies the full nsec field to user space.
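
A sketch of the packing, with made-up macro and function names. The
only facts it relies on are that 999999999 fits in 30 bits and that
the top 2 bits of a 32-bit word are then free.

```c
#include <stdint.h>

#define NSEC_MASK  0x3FFFFFFFu  /* low 30 bits: nanoseconds        */
#define LEAP_SHIFT 30           /* top 2 bits: inserted leap secs  */

/* Store nsec (0..999999999) and a leap count (0..3) in one word. */
static uint32_t pack_nsec(uint32_t nsec, unsigned leap)
{
    return (nsec & NSEC_MASK) | ((uint32_t)(leap & 3u) << LEAP_SHIFT);
}

/* What should reach user space for now: the low 30 bits only. */
static uint32_t unpack_nsec(uint32_t packed)
{
    return packed & NSEC_MASK;
}

/* The reserved bits, once something starts setting them. */
static unsigned unpack_leap(uint32_t packed)
{
    return packed >> LEAP_SHIFT;
}
```

A kernel that always goes through unpack_nsec() before copying to user
space keeps working unchanged if the two high bits are given a meaning
later.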

Regards,
Gabriel.



2002-11-06 17:55:02

by H. Peter Anvin

[permalink] [raw]
Subject: Re: New nanosecond stat patch for 2.5.44

Gabriel Paubert wrote:
>
> I basically agree but I suspect that filesystem writers will not be very
> happy if you want to use 16 bytes for each timestamp, especially when 8 of
> the bytes (the 32 high order bits from the second count and the TAI-UT
> offset) do not change very often. (besides that tv_nsec is defined as a
> long, i.e. 64 bit on 64 bit machines and _signed_ , stupid if you ask me
> but I digress).
>

The filesystem writers can compact things as they see fit. I'm mostly
talking about the stat(2) format.

-hpa