2023-09-21 19:02:51

by Jeffrey Layton

Subject: Re: [GIT PULL v2] timestamp fixes

On Thu, 2023-09-21 at 11:24 -0700, Linus Torvalds wrote:
> On Thu, 21 Sept 2023 at 04:21, Christian Brauner <[email protected]> wrote:
> >
> > [email protected]:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert
>
> So for some reason pr-tracker-bot doesn't seem to have reacted to this
> pull request, but it's in my tree now.
>
> I *do* have one reaction to all of this: now that you have made
> "i_ctime" be something that cannot be accessed directly (and renamed
> it to "__i_ctime"), would you mind horribly going all the way, and do
> the same for i_atime and i_mtime too?
>
> The reason I ask is that I *really* despise "struct timespec64" as a type.
>
> I despise it inherently, but I despise it even more when you really
> use it as another type entirely, and are hiding bits in there.
>
> I despise it because "tv_sec" obviously needs to be 64-bit, but then
> "tv_nsec" is this horrible abomination. It's defined as "long", which
> is all kinds of crazy. It's bogus and historical.
>
> And it's wasteful.
>
> Now, in the case of i_ctime, you took advantage of that waste by using
> one (of the potentially 2..34!) unused bits for that
> "fine-granularity" flag.
>
> But even when you do that, there's up to 33 other bits just lying
> around, wasting space in a very central data structure.
>
> So it would actually be much better to explode the 'struct timespec64'
> into explicit 64-bit seconds field, and an explicit 32-bit field with
> two bits reserved. And not even next to each other, because they don't
> pack well in general.
>
> So instead of
>
>         struct timespec64       i_atime;
>         struct timespec64       i_mtime;
>         struct timespec64       __i_ctime;
>
> where that last one needs accessors to access, just make them *all*
> have helper accessors, and make it be
>
>         u64     i_atime_sec;
>         u64     i_mtime_sec;
>         u64     i_ctime_sec;
>         u32     i_atime_nsec;
>         u32     i_mtime_nsec;
>         u32     i_ctime_nsec;
>
> and now that 'struct inode' should shrink by 12 bytes.
>

I like it.

> Then do this:
>
> #define inode_time(x) \
>         (struct timespec64) { x##_sec, x##_nsec }
>
> and you can now create a timespec64 by just doing
>
> inode_time(inode->i_atime)
>
> or something like that (to help create those accessor functions).
>
> Now, I agree that 12 bytes in the disaster that is 'struct inode' is
> likely a drop in the ocean. We have tons of big things in there (ie
> several list_heads, a whole 'struct address_space' etc etc), so it's
> only twelve bytes out of hundreds.
>
> But dammit, that 'timespec64' really is ugly, and now that you're
> hiding bits in it and it's no longer *really* a 'timespec64', I feel
> like it would be better to do it right, and not mis-use a type that
> has other semantics, and has other problems.
>


We have many, many inodes though, and 12 bytes per adds up!

I'm on board with the idea, but...that's likely to be as big a patch
series as the ctime overhaul was. In fact, it'll touch a lot of the same
code. I can take a stab at that in the near future though.

Since we're on the subject...another thing that bothers me with all of
the timestamp handling is that we don't currently try to mitigate "torn
reads" across the two different words. It seems like you could fetch a
tv_sec value and then get a tv_nsec value that represents an entirely
different timestamp if there are stores between them.

Should we be concerned about this? I suppose we could do something with
a seqlock, though I'd really want to avoid locking on the write side.
--
Jeff Layton <[email protected]>


2023-09-21 20:01:40

by Linus Torvalds

Subject: Re: [GIT PULL v2] timestamp fixes

On Thu, 21 Sept 2023 at 11:51, Jeff Layton <[email protected]> wrote:
>
> We have many, many inodes though, and 12 bytes per adds up!

That was my thinking, but honestly, who knows what other alignment
issues might eat up some - or all - of the theoretical 12 bytes.

It might be, for example, that the inode is already some aligned size,
and that the allocation alignment means that the size wouldn't
*really* shrink at all.

So I just want to make clear that I think the 12 bytes isn't
necessarily there. Maybe you'd get it, maybe it would be hidden by
other things.

My biggest impetus was really that whole abuse of a type that I
already disliked for other reasons.

> I'm on board with the idea, but...that's likely to be as big a patch
> series as the ctime overhaul was. In fact, it'll touch a lot of the same
> code. I can take a stab at that in the near future though.

Yea, it's likely to be fairly big and invasive. That was one of the
reasons for my suggested "inode_time()" macro hack: using the macro
argument concatenation is really a hack to "gather" the pieces based
on name, and while it's odd and not a very typical kernel model, I
think doing it that way might allow the conversion to be slightly less
painful.

You'd obviously have to have the same kind of thing for assignment.

Without that kind of name-based hack, you'd have to create all these
random helper functions that just do the same thing over and over for
the different times, which seems really annoying.
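
Very roughly, and purely as an illustration of the name-pasting idea (the
split _sec/_nsec fields and the helper names below are the proposal above,
not anything that exists today), both directions could look like:

        #define inode_time(x) \
                (struct timespec64) { x##_sec, x##_nsec }

        #define inode_set_time(x, ts) do {      \
                x##_sec  = (ts).tv_sec;         \
                x##_nsec = (ts).tv_nsec;        \
        } while (0)

        /* ...so the per-field accessors become one-liners: */
        static inline struct timespec64 inode_get_mtime(const struct inode *inode)
        {
                return inode_time(inode->i_mtime);
        }

        static inline void inode_set_mtime_to_ts(struct inode *inode,
                                                 struct timespec64 ts)
        {
                inode_set_time(inode->i_mtime, ts);
        }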

> Since we're on the subject...another thing that bothers me with all of
> the timestamp handling is that we don't currently try to mitigate "torn
> reads" across the two different words. It seems like you could fetch a
> tv_sec value and then get a tv_nsec value that represents an entirely
> different timestamp if there are stores between them.

Hmm. I think that's an issue that we have always had in theory, and
have ignored because it's simply not problematic in practice, and
fixing it is *hugely* painful.

I suspect we'd have to use some kind of sequence lock for it (to make
reads be cheap), and while it's _possible_ that having the separate
accessor functions for reading/writing those times might help things
out, I suspect the reading/writing happens for the different times (ie
atime/mtime/ctime) together often enough that you might want to have
the locking done at an outer level, and _not_ do it at the accessor
level.
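
Purely as a sketch of that "outer level" idea (the i_times_seq field and
the two helpers are made up; only inode_get_ctime()/inode_set_ctime_to_ts()
exist today), the read side would be one retry loop covering all the
timestamps of interest, and the write side never blocks on readers:

        /* hypothetical: a single seqcount in the inode guarding all timestamps */
        static void inode_read_times(struct inode *inode, seqcount_t *i_times_seq,
                                     struct timespec64 *mtime, struct timespec64 *ctime)
        {
                unsigned int seq;

                do {
                        seq = read_seqcount_begin(i_times_seq);
                        *mtime = inode->i_mtime;
                        *ctime = inode_get_ctime(inode);
                } while (read_seqcount_retry(i_times_seq, seq));
        }

        /* writers are assumed to already be serialized against each other */
        static void inode_write_times(struct inode *inode, seqcount_t *i_times_seq,
                                      struct timespec64 now)
        {
                write_seqcount_begin(i_times_seq);
                inode->i_mtime = now;
                inode_set_ctime_to_ts(inode, now);
                write_seqcount_end(i_times_seq);
        }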

So I suspect this is a completely separate issue (ie even an accessor
doesn't make the "hugely painful" go away). And probably not worth
worrying about *unless* somebody decides that they really really care
about the race.

That said, one thing that *could* help is if people decide that the
right format for inode times is to just have one 64-bit word that has
"sufficient resolution". That's what we did for "kernel time", ie
"ktime_t" is a 64-bit nanosecond count, and by being just a single
value, it avoids not just the horrible padding with 'struct
timespec64', it is also dense _and_ can be accessed as one atomic
value.

Sadly, that "sufficient resolution" couldn't be nanoseconds, because
64-bit nanoseconds isn't enough of a spread. It's fine for the kernel
time, because 2**63 nanoseconds is 292 years, so it moved the "year
2038" problem to "year 2262".

And that's ok when we're talking about times that are kernel running
times and we have a couple of centuries to say "ok, we'll need to make
it be a bigger type", but when you save the values to disk, things are
different. I suspect filesystem people are *not* willing to deal with
a "year 2262" issue.

But if we were to say that "a tenth of microsecond resolution is
sufficient for inode timestamps", then suddenly 64 bits is *enormous*.
So we could do a

        // tenths of a microsecond since Jan 1, 1970
        typedef s64 fstime_t;

and have a nice dense timestamp format with reasonable - but not
nanosecond - accuracy. Now that 292 year range has become 29,247
years, and filesystem people *might* find the "year-31k" problem
acceptable.
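
For illustration only (made-up names, no rounding, and ignoring pre-1970
times), conversion to and from the existing struct timespec64 stays
trivial - div_s64_rem() is the kernel's usual 64-by-32 division helper:

        /* illustrative: a flat count of 100ns units since Jan 1, 1970 */
        typedef s64 fstime_t;

        #define FSTIME_UNITS_PER_SEC    10000000        /* 10^7 x 100ns = 1s */

        static inline fstime_t fstime_from_timespec64(struct timespec64 ts)
        {
                return (fstime_t)ts.tv_sec * FSTIME_UNITS_PER_SEC + ts.tv_nsec / 100;
        }

        static inline struct timespec64 fstime_to_timespec64(fstime_t t)
        {
                struct timespec64 ts;
                s32 rem;

                ts.tv_sec = div_s64_rem(t, FSTIME_UNITS_PER_SEC, &rem);
                ts.tv_nsec = rem * 100;
                return ts;
        }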

I happen to think that "100ns timestamp resolution on files is
sufficient" is a very reasonable statement, but I suspect that we'll
still find lots of people who say "that's completely unacceptable"
both to that resolution, and to the 31k-year problem.

But wouldn't it be nice to have just one single "fstime_t" for file
timestamps, the same way we have "ktime_t" for CPU timestamps?

Then we'd save even more space in the 'struct inode'....

Linus

2023-09-21 23:51:32

by Linus Torvalds

Subject: Re: [GIT PULL v2] timestamp fixes

On Thu, 21 Sept 2023 at 12:28, Linus Torvalds
<[email protected]> wrote:
>
> And that's ok when we're talking about times that are kernel running
> times and we have a couple of centuries to say "ok, we'll need to make
> it be a bigger type",

Note that the "couple of centuries" here is mostly the machine uptime,
not necessarily "we'll need to change the time in the year 2292".

Although we do also have "ktime_get_real()" which is encoding the
whole "nanoseconds since 1970". That *will* break in 2292.

Anyway, regardless, I am *not* suggesting that ktime_t would be useful
for filesystems, because of this issue.

I *do* suspect that we might consider a "tenth of a microsecond", though.

Resolution-wise, it's pretty much in the "system call time" order of
magnitude, and if we have Linux filesystems around in the year-31k,
I'll happily consider it to be a SEP thing at that point ("somebody
else's problem").

Linus

2023-09-22 12:34:37

by Christian Brauner

Subject: Re: [GIT PULL v2] timestamp fixes

> We have many, many inodes though, and 12 bytes per adds up!
>
> I'm on board with the idea, but...that's likely to be as big a patch
> series as the ctime overhaul was. In fact, it'll touch a lot of the same

Hm, I really think that isn't an issue. Let the series be big. Half of
the meaningful fs patches explode as soon as you change any
i_op/f_op anyway. If it meaningfully cleans up something then it's worth
it.

Hm, we really need automated testing on vfs.git soon so we can run
xfstests for all major filesystems automatically instead of me having to
kick this off manually every time.

2023-09-22 16:08:14

by David Sterba

Subject: Re: [GIT PULL v2] timestamp fixes

On Thu, Sep 21, 2023 at 12:28:13PM -0700, Linus Torvalds wrote:
> On Thu, 21 Sept 2023 at 11:51, Jeff Layton <[email protected]> wrote:
> >
> > We have many, many inodes though, and 12 bytes per adds up!
>
> That was my thinking, but honestly, who knows what other alignment
> issues might eat up some - or all - of the theoretical 12 bytes.
>
> It might be, for example, that the inode is already some aligned size,
> and that the allocation alignment means that the size wouldn't
> *really* shrink at all.
>
> So I just want to make clear that I think the 12 bytes isn't
> necessarily there. Maybe you'd get it, maybe it would be hidden by
> other things.

I think all filesystem developers appreciate it when struct inode shrinks;
it's usually embedded with additional data and the size grows. I'm on a
mission to squeeze btrfs_inode under 1024 so it fits better to the slab
pages and currently it's about 1100 bytes. 1024 is within reach but it
gets harder to find potential space savings.

2023-09-23 08:37:42

by Amir Goldstein

Subject: Re: [GIT PULL v2] timestamp fixes

On Sat, Sep 23, 2023 at 3:43 AM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 21 Sept 2023 at 11:51, Jeff Layton <[email protected]> wrote:
> >
> > We have many, many inodes though, and 12 bytes per adds up!
>
> That was my thinking, but honestly, who knows what other alignment
> issues might eat up some - or all - of the theoretical 12 bytes.
>
> It might be, for example, that the inode is already some aligned size,
> and that the allocation alignment means that the size wouldn't
> *really* shrink at all.
>
> So I just want to make clear that I think the 12 bytes isn't
> necessarily there. Maybe you'd get it, maybe it would be hidden by
> other things.
>
> My biggest impetus was really that whole abuse of a type that I
> already disliked for other reasons.
>
> > I'm on board with the idea, but...that's likely to be as big a patch
> > series as the ctime overhaul was. In fact, it'll touch a lot of the same
> > code. I can take a stab at that in the near future though.
>
> Yea, it's likely to be fairly big and invasive. That was one of the
> reasons for my suggested "inode_time()" macro hack: using the macro
> argument concatenation is really a hack to "gather" the pieces based
> on name, and while it's odd and not a very typical kernel model, I
> think doing it that way might allow the conversion to be slightly less
> painful.
>
> You'd obviously have to have the same kind of thing for assignment.
>
> Without that kind of name-based hack, you'd have to create all these
> random helper functions that just do the same thing over and over for
> the different times, which seems really annoying.
>
> > Since we're on the subject...another thing that bothers me with all of
> > the timestamp handling is that we don't currently try to mitigate "torn
> > reads" across the two different words. It seems like you could fetch a
> > tv_sec value and then get a tv_nsec value that represents an entirely
> > different timestamp if there are stores between them.
>
> Hmm. I think that's an issue that we have always had in theory, and
> have ignored because it's simply not problematic in practice, and
> fixing it is *hugely* painful.
>
> I suspect we'd have to use some kind of sequence lock for it (to make
> reads be cheap), and while it's _possible_ that having the separate
> accessor functions for reading/writing those times might help things
> out, I suspect the reading/writing happens for the different times (ie
> atime/mtime/ctime) together often enough that you might want to have
> the locking done at an outer level, and _not_ do it at the accessor
> level.
>
> So I suspect this is a completely separate issue (ie even an accessor
> doesn't make the "hugely painful" go away). And probably not worth
> worrying about *unless* somebody decides that they really really care
> about the race.
>
> That said, one thing that *could* help is if people decide that the
> right format for inode times is to just have one 64-bit word that has
> "sufficient resolution". That's what we did for "kernel time", ie
> "ktime_t" is a 64-bit nanosecond count, and by being just a single
> value, it avoids not just the horrible padding with 'struct
> timespec64', it is also dense _and_ can be accessed as one atomic
> value.

Just pointing out that xfs has already changed its on-disk timestamp
format to this model (a.k.a. bigtime), but the in-core inode still uses
the timespec64 of course.
The vfs can take inspiration from this model.

>
> Sadly, that "sufficient resolution" couldn't be nanoseconds, because
> 64-bit nanoseconds isn't enough of a spread. It's fine for the kernel
> time, because 2**63 nanoseconds is 292 years, so it moved the "year
> 2038" problem to "year 2262".

Note that xfs_bigtime_to_unix(XFS_BIGTIME_TIME_MAX) is in year
2486, not year 2262, because there was no need to use the 64 bits to
go backwards to year 1678.

>
> And that's ok when we're talking about times that are kernel running
> times and we have a couple of centuries to say "ok, we'll need to make
> it be a bigger type", but when you save the values to disk, things are
> different. I suspect filesystem people are *not* willing to deal with
> a "year 2262" issue.
>

Apparently, they are willing to handle the "year 2486" issue ;)

> But if we were to say that "a tenth of microsecond resolution is
> sufficient for inode timestamps", then suddenly 64 bits is *enormous*.
> So we could do a
>
>         // tenths of a microsecond since Jan 1, 1970
>         typedef s64 fstime_t;
>
> and have a nice dense timestamp format with reasonable - but not
> nanosecond - accuracy. Now that 292 year range has become 29,247
> years, and filesystem people *might* find the "year-31k" problem
> acceptable.
>
> I happen to think that "100ns timestamp resolution on files is
> sufficient" is a very reasonable statement, but I suspect that we'll
> still find lots of people who say "that's completely unacceptable"
> both to that resolution, and to the 31k-year problem.
>

I am guessing that you are aware of the Windows/SMB FILETIME
standard which is 64bit in 100ns units (since 1601).
So the 31k-year "problem" is very widespread already.

But the resolution change is counter to the purpose of multigrain
timestamps - if two syscalls updated the same or two different inodes
within a 100ns tick, apparently, there are some workloads that
care to know about it and fs needs to store this information persistently.

Thanks,
Amir.

2023-09-23 21:46:47

by Linus Torvalds

Subject: Re: [GIT PULL v2] timestamp fixes

On Fri, 22 Sept 2023 at 23:36, Amir Goldstein <[email protected]> wrote:
>
> Apparently, they are willing to handle the "year 2486" issue ;)

Well, we could certainly do the same at the VFS layer.

But I suspect 10ns resolution is entirely overkill, since on a lot of
platforms you don't even have timers with that resolution.

I feel like 100ns is a much more reasonable resolution, and is quite
close to a single system call (think "one thousand cycles at 10GHz").

> But the resolution change is counter to the purpose of multigrain
> timestamps - if two syscalls updated the same or two different inodes
> within a 100ns tick, apparently, there are some workloads that
> care to know about it and fs needs to store this information persistently.

Those workloads are broken garbage, and we should *not* use that kind
of sh*t to decide on VFS internals.

Honestly, if the main reason for the multigrain resolution is
something like that, I think we should forget about MG *entirely*.
Somebody needs to be told to get their act together.

We have *never* guaranteed nanosecond resolution on timestamps, and I
think we should put our foot down and say that we never will.

Partly because we have platforms where that kind of timer resolution
just does not exist.

Partly because it's stupid to expect that kind of resolution anyway.

And partly because any load that assumes that kind of resolution is
already broken.

End result: we should ABSOLUTELY NOT have as a target to support some
insane resolution.

100ns resolution for file access times is - and I'll happily go down
in history for saying this - enough for anybody.

If you need finer resolution than that, you'd better do it yourself in
user space.

And no, this is not a "but some day we'll have terahertz CPU's and
100ns is an eternity". Moore's law is dead, we're not going to see
terahertz CPUs, and people who say "but quantum" have bought into a
technological fairytale.

100ns is plenty, and has the advantage of having a very safe range.

That said, we don't have to do powers-of-ten. In fact, in many ways,
it would probably be a good idea to think of the fractional seconds in
powers of two. That tends to make it cheaper to do conversions,
without having to do a full 64-bit divide (a constant divide turns
into a fancy multiply, but it's still painful on 32-bit
architectures).

So, for example, we could easily make the format be a fixed-point
format with "sign bit, 38 bit seconds, 25 bit fractional seconds",
which gives us about 30ns resolution, and a range of almost 9000
years. Which is nice, in how it covers all of written history and all
four-digit years (we'd keep the 1970 base).

And 30ns resolution really *is* pretty much the limit of a single
system call. I could *wish* we had system calls that fast, or CPU's
that fast. Not the case right now, and sadly doesn't seem to be the
case in the foreseeable future - if ever - either. It would be a really
good problem to have.

And the nice thing about that would be that conversion to timespec64
would be fairly straightforward:

struct timespec64 to_timespec(fstime_t fstime)
{
        struct timespec64 res;
        unsigned int frac;

        frac = fstime & 0x1ffffffu;
        res.tv_sec = fstime >> 25;
        res.tv_nsec = frac * 1000000000ull >> 25;
        return res;
}

fstime_t to_fstime(struct timespec64 a)
{
        fstime_t sec = (fstime_t) a.tv_sec << 25;
        unsigned frac = a.tv_nsec;

        frac = ((unsigned long long) a.tv_nsec << 25) / 1000000000ull;
        return sec | frac;
}

and both of those generate good code (that large divide by a constant
in to_fstime() is not great, but the compiler can turn it into a
multiply).

The above could be improved upon (nicer rounding and overflow
handling, and a few modifications to generate even nicer code), but
it's not horrendous as-is. On x86-64, to_timespec becomes a very
reasonable

        movq    %rdi, %rax
        andl    $33554431, %edi
        imulq   $1000000000, %rdi, %rdx
        sarq    $25, %rax
        shrq    $25, %rdx

and to some degree that's the critical function (that code would show
up in 'stat()').

Of course, I might have screwed up the above conversion functions,
they are untested garbage, but they look close enough to being in the
right ballpark.

Anyway, we really need to push back at any crazies who say "I want
nanosecond resolution, because I'm special and my mother said so".

Linus

2023-09-24 06:31:17

by Amir Goldstein

Subject: Re: [GIT PULL v2] timestamp fixes

On Sat, Sep 23, 2023 at 8:49 PM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, 22 Sept 2023 at 23:36, Amir Goldstein <[email protected]> wrote:
> >
> > Apparently, they are willing to handle the "year 2486" issue ;)
>
> Well, we could certainly do the same at the VFS layer.
>
> But I suspect 10ns resolution is entirely overkill, since on a lot of
> platforms you don't even have timers with that resolution.
>
> I feel like 100ns is a much more reasonable resolution, and is quite
> close to a single system call (think "one thousand cycles at 10GHz").
>

You are right. 100ns is enough resolution for fs timestamps, but:

1. We cannot truncate existing timestamps on existing files
that are stored in 1ns format. Even if the sub-100ns remainder is garbage,
applications depend on tv_nsec not changing, so if the vfs truncates to
100ns, filesystems will need to carry the remainder anyway.

2. The whole idea behind multigrain time is that the 100ns remainder
space is available in on-disk timestamp formats and is not going away,
so better not let it go to waste on garbage and use it for the "change cookie".
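
(Concretely: tv_nsec never needs more than 30 bits, so a 32-bit nanoseconds
field has spare room for flag bits instead of garbage - roughly like this,
with a made-up flag name, not the actual multigrain implementation:)

        #define FS_NSEC_FINE    (1u << 31)      /* hypothetical "fine-grained" flag bit */

        static inline u32 fs_ts_nsec(u32 raw_nsec)
        {
                /* strip flag bits before exposing the value to userspace */
                return raw_nsec & ~FS_NSEC_FINE;
        }

        static inline bool fs_ts_is_fine_grained(u32 raw_nsec)
        {
                return raw_nsec & FS_NSEC_FINE;
        }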

> > But the resolution change is counter to the purpose of multigrain
> > timestamps - if two syscalls updated the same or two different inodes
> > within a 100ns tick, apparently, there are some workloads that
> > care to know about it and fs needs to store this information persistently.
>
> Those workloads are broken garbage, and we should *not* use that kind
> of sh*t to decide on VFS internals.
>

Sorry, I phrased it completely wrong.
The workloads don't expect 1ns resolution.
The workloads just compare timestamps of objects and expect some sane
not-before ordering rules.
If user visible timestamps are truncated to 100ns all is good.

> End result: we should ABSOLUTELY NOT have as a target to support some
> insane resolution.

This is not the target.

I think the target is explained well in this new v8 revision [1]
of the patch set which does exactly what you promote -
it hides the insane resolution in fs timestamps from users.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjGofKT2LcM_YmoMYsH42ETpB5kxQkh+21nCbYc=w1nEg@mail.gmail.com/T/#mbc87e67faf71cc58c6424335333935bf7740d49e

2023-09-24 06:45:52

by Theodore Ts'o

Subject: Re: [GIT PULL v2] timestamp fixes

On Sat, Sep 23, 2023 at 10:48:51AM -0700, Linus Torvalds wrote:
>
> I feel like 100ns is a much more reasonable resolution, and is quite
> close to a single system call (think "one thousand cycles at 10GHz").

FWIW, UUID's (which originally came from Apollo Domain/OS in the
1980's, before getting adopted by OSF/DCE, and then by Linux and
Microsoft) use a 100ns granularity. And the smart folks at Apollo
figured this out some 4 decades ago, and *no* they didn't use units of
a single nanosecond. :-)

100ns granularity is also what ext4 uses for our on-disk format
--- 2**30 is just enough to cover 100ns granularity (with only 7% of
wasted number space), and those two bits are enough for us to encode
timestamps out to the year 2446 using a 64-bit timestamp (and what we do past
2446 is pretty much something I'm happy to let someone else deal with,
as I expect I'll be long dead by then.)

(And if someone does happen to invent some kind of life-extension
technology, I'm happy to fix it up... later. :-)

> That said, we don't have to do powers-of-ten. In fact, in many ways,
> it would probably be a good idea to think of the fractional seconds in
> powers of two. That tends to make it cheaper to do conversions,
> without having to do a full 64-bit divide (a constant divide turns
> into a fancy multiply, but it's still painful on 32-bit
> architectures).

It depends on what conversion we need to do. If we're converting to
userspace's timespec64 data structure, which is denominated in
nanoseconds, it's actually much easier to use decimal 100ns units:

#define EXT4_EPOCH_BITS 2
#define EXT4_EPOCH_MASK ((1 << EXT4_EPOCH_BITS) - 1)
#define EXT4_NSEC_MASK  (~0UL << EXT4_EPOCH_BITS)

static inline __le32 ext4_encode_extra_time(struct timespec64 *time)
{
        u32 extra = ((time->tv_sec - (s32)time->tv_sec) >> 32) & EXT4_EPOCH_MASK;
        return cpu_to_le32(extra | (time->tv_nsec << EXT4_EPOCH_BITS));
}

static inline void ext4_decode_extra_time(struct timespec64 *time,
                                          __le32 extra)
{
        if (unlikely(extra & cpu_to_le32(EXT4_EPOCH_MASK)))
                time->tv_sec += (u64)(le32_to_cpu(extra) & EXT4_EPOCH_MASK) << 32;
        time->tv_nsec = (le32_to_cpu(extra) & EXT4_NSEC_MASK) >> EXT4_EPOCH_BITS;
}

> Of course, I might have screwed up the above conversion functions,
> they are untested garbage, but they look close enough to being in the
> right ballpark.

We actually have kunit tests for ext4_encode_extra_time() and
ext4_decode_extra_time(), mainly because people *have* screwed it up
when making architecture-specific optimizations or when making global
sweeps of VFS code. :-)
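
(Roughly the kind of roundtrip check such a test performs - this is only a
sketch using the helpers quoted above, not the actual in-tree test:)

        static void ext4_extra_time_roundtrip(struct kunit *test)
        {
                /* a post-2038 timestamp, so the epoch bits actually matter */
                struct timespec64 in  = { .tv_sec = 0x100000000LL, .tv_nsec = 123456789 };
                struct timespec64 out = { .tv_sec = (s32)in.tv_sec };  /* low 32 bits, as stored on disk */
                __le32 extra = ext4_encode_extra_time(&in);

                ext4_decode_extra_time(&out, extra);
                KUNIT_EXPECT_EQ(test, out.tv_sec, in.tv_sec);
                KUNIT_EXPECT_EQ(test, out.tv_nsec, in.tv_nsec);
        }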

- Ted

2023-09-24 10:27:06

by Christian Brauner

Subject: Re: [GIT PULL v2] timestamp fixes

> > Those workloads are broken garbage, and we should *not* use that kind
> > of sh*t to decide on VFS internals.
> >
>
> Sorry, I phrased it completely wrong.

Thanks for clearing this up. I had just formulated my own reply but I'll
happily delete it. :)

> The workloads don't expect 1ns resolution.

Yes, they don't. In the revert explanation I just used that number to
illustrate the general ordering problem. The workload that surfaced the
issue is just plain weird of course but it points to a more general
ordering problem.

> The workloads just compare timestamps of objects and expect some sane
> not-before ordering rules.

Yes.

> If user visible timestamps are truncated to 100ns all is good.

Yes.

2023-09-24 13:11:41

by Amir Goldstein

Subject: Re: [GIT PULL v2] timestamp fixes

On Thu, Sep 21, 2023 at 10:02 PM Jeff Layton <[email protected]> wrote:
>
> On Thu, 2023-09-21 at 11:24 -0700, Linus Torvalds wrote:
> > On Thu, 21 Sept 2023 at 04:21, Christian Brauner <[email protected]> wrote:
> > >
> > > [email protected]:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert
> >
> > So for some reason pr-tracker-bot doesn't seem to have reacted to this
> > pull request, but it's in my tree now.
> >
> > I *do* have one reaction to all of this: now that you have made
> > "i_ctime" be something that cannot be accessed directly (and renamed
> > it to "__i_ctime"), would you mind horribly going all the way, and do
> > the same for i_atime and i_mtime too?
> >
> > The reason I ask is that I *really* despise "struct timespec64" as a type.
> >
> > I despise it inherently, but I despise it even more when you really
> > use it as another type entirely, and are hiding bits in there.
> >
> > I despise it because "tv_sec" obviously needs to be 64-bit, but then
> > "tv_nsec" is this horrible abomination. It's defined as "long", which
> > is all kinds of crazy. It's bogus and historical.
> >
> > And it's wasteful.
> >
> > Now, in the case of i_ctime, you took advantage of that waste by using
> > one (of the potentially 2..34!) unused bits for that
> > "fine-granularity" flag.
> >
> > But even when you do that, there's up to 33 other bits just lying
> > around, wasting space in a very central data structure.
> >
> > So it would actually be much better to explode the 'struct timespec64'
> > into explicit 64-bit seconds field, and an explicit 32-bit field with
> > two bits reserved. And not even next to each other, because they don't
> > pack well in general.
> >
> > So instead of
> >
> >         struct timespec64       i_atime;
> >         struct timespec64       i_mtime;
> >         struct timespec64       __i_ctime;
> >
> > where that last one needs accessors to access, just make them *all*
> > have helper accessors, and make it be
> >
> >         u64     i_atime_sec;
> >         u64     i_mtime_sec;
> >         u64     i_ctime_sec;
> >         u32     i_atime_nsec;
> >         u32     i_mtime_nsec;
> >         u32     i_ctime_nsec;
> >
> > and now that 'struct inode' should shrink by 12 bytes.
> >
>
> I like it.
>
> > Then do this:
> >
> > #define inode_time(x) \
> >         (struct timespec64) { x##_sec, x##_nsec }
> >
> > and you can now create a timespec64 by just doing
> >
> > inode_time(inode->i_atime)
> >
> > or something like that (to help create those accessor functions).
> >
> > Now, I agree that 12 bytes in the disaster that is 'struct inode' is
> > likely a drop in the ocean. We have tons of big things in there (ie
> > several list_heads, a whole 'struct address_space' etc etc), so it's
> > only twelve bytes out of hundreds.
> >
> > But dammit, that 'timespec64' really is ugly, and now that you're
> > hiding bits in it and it's no longer *really* a 'timespec64', I feel
> > like it would be better to do it right, and not mis-use a type that
> > has other semantics, and has other problems.
> >
>
>
> We have many, many inodes though, and 12 bytes per adds up!
>
> I'm on board with the idea, but...that's likely to be as big a patch
> series as the ctime overhaul was. In fact, it'll touch a lot of the same
> code. I can take a stab at that in the near future though.
>
> Since we're on the subject...another thing that bothers me with all of
> the timestamp handling is that we don't currently try to mitigate "torn
> reads" across the two different words. It seems like you could fetch a
> tv_sec value and then get a tv_nsec value that represents an entirely
> different timestamp if there are stores between them.
>
> Should we be concerned about this? I suppose we could do something with
> a seqlock, though I'd really want to avoid locking on the write side.

As luck would have it, if my calculations are correct, on x86-64 and with
CONFIG_FS_POSIX_ACL=y, CONFIG_SECURITY=y (as they are on
distro kernels), __i_ctime is split exactly across two cache lines and, maybe even
worse (?), __i_ctime.tv_nsec and the QUERIED bit are on the same
cache line with i_lock :-/

If we reorder the inode timestamps with i_size, we improve the situation
for this specific and very common configuration. Maybe.

Thanks,
Amir.

2023-09-24 15:45:50

by Linus Torvalds

Subject: Re: [GIT PULL v2] timestamp fixes

On Sat, 23 Sept 2023 at 12:30, Theodore Ts'o <[email protected]> wrote:
>
> It depends on what conversion we need to do. If we're converting to
> userspace's timespec64 data structure, which is denominated in
> nanoseconds, it's actually much easier to use decimal 100ns units:

Actually, your data format seems to be a mix of "decimal nanoseconds"
and then a power-of-two seconds (ie bit shift).

Except it looks like ext4 actually does full nanosecond resolution (30
bits for nanoseconds, 34 bits for seconds). Thus the "only a couple of
hundred years of range".

And yes, that's probably close to optimal. It makes it harder to do
*math* on those dates (because you have seconds and 100ns as separate
fields), but for file timestamps that's likely not a real issue.

It was for 'ktime_t()', where with timers etc the whole "subtract and
add times" happens *all* the time, but for file timestamps you
basically have "set time" together with possibly comparing them (and
you can do comparisons without even splitting the fields if you lay
things out reasonably - which ext4 doesn't seem to have done).

So yeah, I think that would be a fine 'fstime_t' for the kernel.

Except we'd do it without the EXT4_EPOCH_MASK conditionals, and I
think it would be better to have a bigger range for seconds. If you
give the seconds field three extra bits, you're comfortable in the
"thousands of years", and you still have 27 bits that can encode a
decimal "100ns".

It means that when you convert a fstime_t to timespec64 at stat()
time, you'd have to do a 32-bit "multiply by 100" to get the actual
nanosecond field, but that's cheap everywhere (and obviously the
shift-and-masks to get the separate fields out of the 64-bit value).
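
With that layout, the earlier to_timespec() sketch gets even simpler - a
shift plus that 32-bit multiply by 100 (again untested and illustrative,
assuming 37 bits of signed seconds over 27 fractional bits):

        struct timespec64 to_timespec(fstime_t fstime)
        {
                struct timespec64 res;

                res.tv_sec = fstime >> 27;                      /* arithmetic shift keeps the sign */
                res.tv_nsec = ((u32)fstime & 0x7ffffffu) * 100; /* decimal 100ns units -> ns */
                return res;
        }

And because the fraction sits in the low bits, two such values still
compare correctly as plain 64-bit integers, without splitting the fields.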

Linus

2023-09-25 11:26:18

by Jeffrey Layton

Subject: Re: [GIT PULL v2] timestamp fixes

On Sat, 2023-09-23 at 10:48 -0700, Linus Torvalds wrote:
> On Fri, 22 Sept 2023 at 23:36, Amir Goldstein <[email protected]> wrote:
> >
> > Apparently, they are willing to handle the "year 2486" issue ;)
>
> Well, we could certainly do the same at the VFS layer.
>
> But I suspect 10ns resolution is entirely overkill, since on a lot of
> platforms you don't even have timers with that resolution.
>
> I feel like 100ns is a much more reasonable resolution, and is quite
> close to a single system call (think "one thousand cycles at 10GHz").
>
> > But the resolution change is counter to the purpose of multigrain
> > timestamps - if two syscalls updated the same or two different inodes
> > within a 100ns tick, apparently, there are some workloads that
> > care to know about it and fs needs to store this information persistently.
>
> Those workloads are broken garbage, and we should *not* use that kind
> of sh*t to decide on VFS internals.
>
> Honestly, if the main reason for the multigrain resolution is
> something like that, I think we should forget about MG *entirely*.
> Somebody needs to be told to get their act together.
>

As I noted in the other thread, the primary reason for this was to fix
XFS's change cookie without having to rev the on-disk format. If we
could also present fine-grained timestamps to userland and nfsd, then
that would also fix a lot of cache-coherency problems with NFSv3, and
may also help some workloads which depend on comparing timestamps
between files. That'd be a wonderful bonus, but I'm not going to lose
too much sleep if we can't make that work.


> We have *never* guaranteed nanosecond resolution on timestamps, and I
> think we should put our foot down and say that we never will.
>
> Partly because we have platforms where that kind of timer resolution
> just does not exist.
>
> Partly because it's stupid to expect that kind of resolution anyway.
>
> And partly because any load that assumes that kind of resolution is
> already broken.
>
> End result: we should ABSOLUTELY NOT have as a target to support some
> insane resolution.
>
> 100ns resolution for file access times is - and I'll happily go down
> in history for saying this - enough for anybody.
>
> If you need finer resolution than that, you'd better do it yourself in
> user space.
>
> And no, this is not a "but some day we'll have terahertz CPU's and
> 100ns is an eternity". Moore's law is dead, we're not going to see
> terahertz CPUs, and people who say "but quantum" have bought into a
> technological fairytale.
>
> 100ns is plenty, and has the advantage of having a very safe range.
>

The catch here is that we have at least some testcases that do things
like set specific values in the mtime and atime, and then test that the
same value is retrievable.

Are we OK with breaking those? If we can always say that the stored
resolution is X and that even values that are explicitly set get
truncated then the v8 set I sent on Friday may be ok.
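
(The pattern is essentially this, as a userspace sketch - set explicit
values, then expect to read back exactly the same ones:)

        #include <fcntl.h>
        #include <sys/stat.h>
        #include <time.h>
        #include <assert.h>

        static void check_timestamp_roundtrip(const char *path)
        {
                struct timespec times[2] = {
                        { .tv_sec = 1234567890, .tv_nsec = 123456789 }, /* atime */
                        { .tv_sec = 1234567890, .tv_nsec = 987654321 }, /* mtime */
                };
                struct stat st;

                utimensat(AT_FDCWD, path, times, 0);
                fstatat(AT_FDCWD, path, &st, 0);

                /* such tests assert the stored nsec values come back unchanged */
                assert(st.st_atim.tv_nsec == times[0].tv_nsec);
                assert(st.st_mtim.tv_nsec == times[1].tv_nsec);
        }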

Of course, that set truncates the values at jiffies granularity (~4ms on
my box). That's well above 100ns, so it's possible that's too coarse for
us to handwave this problem away.


> That said, we don't have to do powers-of-ten. In fact, in many ways,
> it would probably be a good idea to think of the fractional seconds in
> powers of two. That tends to make it cheaper to do conversions,
> without having to do a full 64-bit divide (a constant divide turns
> into a fancy multiply, but it's still painful on 32-bit
> architectures).
>
> So, for example, we could easily make the format be a fixed-point
> format with "sign bit, 38 bit seconds, 25 bit fractional seconds",
> which gives us about 30ns resolution, and a range of almost 9000
> years. Which is nice, in how it covers all of written history and all
> four-digit years (we'd keep the 1970 base).
>
> And 30ns resolution really *is* pretty much the limit of a single
> system call. I could *wish* we had system calls that fast, or CPU's
> that fast. Not the case right now, and sadly doesn't seem to be the
> case in the foreseeable future - if ever - either. It would be a really
> good problem to have.
>
> And the nice thing about that would be that conversion to timespec64
> would be fairly straightforward:
>
> struct timespec64 to_timespec(fstime_t fstime)
> {
>         struct timespec64 res;
>         unsigned int frac;
>
>         frac = fstime & 0x1ffffffu;
>         res.tv_sec = fstime >> 25;
>         res.tv_nsec = frac * 1000000000ull >> 25;
>         return res;
> }
>
> fstime_t to_fstime(struct timespec64 a)
> {
>         fstime_t sec = (fstime_t) a.tv_sec << 25;
>         unsigned frac = a.tv_nsec;
>
>         frac = ((unsigned long long) a.tv_nsec << 25) / 1000000000ull;
>         return sec | frac;
> }
>
> and both of those generate good code (that large divide by a constant
> in to_fstime() is not great, but the compiler can turn it into a
> multiply).
>
> The above could be improved upon (nicer rounding and overflow
> handling, and a few modifications to generate even nicer code), but
> it's not horrendous as-is. On x86-64, to_timespec becomes a very
> reasonable
>
>         movq    %rdi, %rax
>         andl    $33554431, %edi
>         imulq   $1000000000, %rdi, %rdx
>         sarq    $25, %rax
>         shrq    $25, %rdx
>
> and to some degree that's the critical function (that code would show
> up in 'stat()').
>
> Of course, I might have screwed up the above conversion functions,
> they are untested garbage, but they look close enough to being in the
> right ballpark.
>
> Anyway, we really need to push back at any crazies who say "I want
> nanosecond resolution, because I'm special and my mother said so".
>


Yeah, if we're going to establish a floor granularity for timestamps
above 1ns, then making it a power-of-two factor would probably be a good
thing. These calculations are done a _lot_ so we really do want them to
be efficient.
--
Jeff Layton <[email protected]>

2023-09-25 22:31:01

by Linus Torvalds

Subject: Re: [GIT PULL v2] timestamp fixes

On Mon, 25 Sept 2023 at 04:23, Jeff Layton <[email protected]> wrote:
>
> The catch here is that we have at least some testcases that do things
> like set specific values in the mtime and atime, and then test that the
> same value is retrievable.

Yeah, I'm sure that happens. But as you say, we already have
per-filesystem granularity issues that mean there is some non-ns
granularity that those tests have to deal with.

Unless they currently just work on one or two filesystems, of course.

> Of course, that set truncates the values at jiffies granularity (~4ms on
> my box). That's well above 100ns, so it's possible that's too coarse for
> us to handwave this problem away.

Note that depending or enforcing some kind of jiffies granularity is
*particularly* bad, because it's basically a random value.

It will depend on architecture and on configuration. On some
architectures it's a fixed value (some have it at 100, which is, I
think, the original x86 case), on others it's "configurable", but not
really (ie alpha is "configurable" in our Kconfig, but the _hardware_
typically has a fixed clock tick at 1024 Hz, but then there are
platforms that are different, and then there's Qemu that likes a lower
frequency to avoid overhead etc etc).

And then we have the "modern default", which is to ask the user at
config time if they want a 100 / 250 / 300 / 1000 HZ value, and the
actual hw clock tick may be much more dynamic than that.

Anyway, what I'm saying is just that we should *not* limit granularity
to anything that has to do with jiffies. Yes, that's still a real
thing in that it's a "cheap read of the current time", but it should
never be seen as any kind of vfs granularity.

And yes, HZ will be in the "roughly 100-1000" range, so if we're
talking granularities that are microseconds or finer, then at most
you'll have rounding issues - and since any HZ rounding issues should
only be for the cases where we set the time to "now" - rounding
shouldn't be an issue anyway, since it's not a time that is specified
by user space.

End result: try to avoid anything to do with HZ in filesystem code,
unless it's literally about jiffies (which should typically be mostly
about any timeouts a filesystem may have, not about timestamps
themselves).
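
(For reference, filesystems already declare their timestamp granularity to
the VFS via sb->s_time_gran, in nanoseconds, and timestamps are rounded
down to it roughly like this - a simplified sketch, not the exact in-tree
helper:)

        static struct timespec64 truncate_to_granularity(struct timespec64 ts,
                                                         unsigned int gran)
        {
                if (gran == 1)
                        return ts;                      /* full nanosecond resolution */
                if (gran == NSEC_PER_SEC)
                        ts.tv_nsec = 0;                 /* whole seconds only */
                else
                        ts.tv_nsec -= ts.tv_nsec % gran;  /* e.g. gran = 100 for 100ns */
                return ts;
        }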

Linus