2022-06-15 12:49:18

by Arnd Bergmann

Subject: Re: [musl] Question about musl's time() implementation in time.c

On Wed, Jun 15, 2022 at 1:28 AM Rich Felker <[email protected]> wrote:
> On Tue, Jun 14, 2022 at 11:11:32PM +0200, Arnd Bergmann wrote:
> >
> > The thing is that a lot of file systems would still behave the same way
> > because they round times down to a filesystem specific resolution,
> > often one microsecond or one second, while the kernel time accounting
> > is in nanoseconds. There have been discussions about an interface
> > to find out what the actual resolution on a given mount point is (similar
> > to clock_getres), but that never made it in. The guarantees that you
> > get from file systems at the moment are:
>
> It's normal that they may be rounded down to the filesystem timestamp
> granularity. I thought what was going on here was worse.

It gets rounded down twice: first to the start of the current timer
tick, which falls at an arbitrary nanosecond value within the past 10ms,
and then to the resolution of the file system. The result is that the
file timestamp can point to a slightly earlier value, up to max(timer tick
cycle, fs resolution) before the actual nanosecond value. We don't
advertise the timestamp granularity of the file system though, so I would
expect this to be within the expected behavior.
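As a concrete way to observe this (a rough sketch only: the temporary
file name is made up and the visible offset depends on the file system
and kernel configuration), sample CLOCK_REALTIME right before a write
and compare it with the resulting mtime:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    struct timespec before;
    struct stat st;
    int fd;

    clock_gettime(CLOCK_REALTIME, &before);

    fd = open("granularity-demo.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0 || write(fd, "x", 1) != 1)
        return 1;
    fstat(fd, &st);
    close(fd);
    unlink("granularity-demo.tmp");

    printf("CLOCK_REALTIME before write: %lld.%09ld\n",
           (long long)before.tv_sec, before.tv_nsec);
    printf("file mtime after the write:  %lld.%09ld\n",
           (long long)st.st_mtim.tv_sec, st.st_mtim.tv_nsec);
    /* the mtime can lag the CLOCK_REALTIME sample by up to
       max(timer tick, fs timestamp granularity) */
    return 0;
}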

> OK, the time syscall doing the wrong thing here (using a different
> clock that's not correctly ordered with respect to CLOCK_REALTIME)
> seems to be the worst problem here -- if I'm understanding it right.
> The filesystem issue might be a non-issue if it's truly equivalent to
> just having coarser fs timestamp granularity, which is allowed.

Adding the kernel timekeeping maintainers to Cc. I think this is a
reasonable argument, but it goes against the current behavior.

We have four implementations of the time() syscall that one would
commonly encounter:

- The kernel syscall, using (effectively) CLOCK_REALTIME_COARSE
- The kernel vdso, using (effectively) CLOCK_REALTIME_COARSE
- The glibc interface, calling __clock_gettime64(CLOCK_REALTIME_COARSE, ...)
- The musl interface, calling __clock_gettime64(CLOCK_REALTIME, ...)
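
For illustration only, both libc wrappers boil down to roughly the
following shape (a simplified sketch, not the actual glibc or musl
sources; __clock_gettime64 stands in for the internal 64-bit call and
the timespec type is simplified):

time_t time(time_t *t)
{
    struct timespec ts;

    /* glibc passes CLOCK_REALTIME_COARSE here, musl passes
       CLOCK_REALTIME; everything else is the same */
    __clock_gettime64(CLOCK_REALTIME_COARSE, &ts);
    if (t)
        *t = ts.tv_sec;
    return ts.tv_sec;
}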

So even if everyone agrees that the musl implementation is the
correct one, I think both linux and glibc are more likely to stick with
the traditional behavior to avoid breaking user space code such as the
libc-test case that Zev brought up initially. At least Adhemerval's
time() implementation in glibc[1] appears to have done this intentionally,
while the Linux implementation has simply never changed this in an
incompatible way since Linux-0.01 added time() and 0.99.13k added
the high-resolution gettimeofday().

Arnd

[1] https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=0d56378349


2022-06-15 17:01:13

by Adhemerval Zanella

Subject: Re: [musl] Question about musl's time() implementation in time.c



> On 15 Jun 2022, at 05:09, Arnd Bergmann <[email protected]> wrote:
>
> On Wed, Jun 15, 2022 at 1:28 AM Rich Felker <[email protected]> wrote:
>> On Tue, Jun 14, 2022 at 11:11:32PM +0200, Arnd Bergmann wrote:
>>>
>>> The thing is that a lot of file systems would still behave the same way
>>> because they round times down to a filesystem specific resolution,
>>> often one microsecond or one second, while the kernel time accounting
>>> is in nanoseconds. There have been discussions about an interface
>>> to find out what the actual resolution on a given mount point is (similar
>>> to clock_getres), but that never made it in. The guarantees that you
>>> get from file systems at the moment are:
>>
>> It's normal that they may be rounded down to the filesystem timestamp
>> granularity. I thought what was going on here was worse.
>
> It gets rounded down twice: first to the start of the current timer
> tick, which falls at an arbitrary nanosecond value within the past 10ms,
> and then to the resolution of the file system. The result is that the
> file timestamp can point to a slightly earlier value, up to max(timer tick
> cycle, fs resolution) before the actual nanosecond value. We don't
> advertise the timestamp granularity of the file system though, so I would
> expect this to be within the expected behavior.
>
>> OK, the time syscall doing the wrong thing here (using a different
>> clock that's not correctly ordered with respect to CLOCK_REALTIME)
>> seems to be the worst problem here -- if I'm understanding it right.
>> The filesystem issue might be a non-issue if it's truly equivalent to
>> just having coarser fs timestamp granularity, which is allowed.
>
> Adding the kernel timekeeping maintainers to Cc. I think this is a
> reasonable argument, but it goes against the current behavior.
>
> We have four implementations of the time() syscall that one would
> commonly encounter:
>
> - The kernel syscall, using (effectively) CLOCK_REALTIME_COARSE
> - The kernel vdso, using (effectively) CLOCK_REALTIME_COARSE
> - The glibc interface, calling __clock_gettime64(CLOCK_REALTIME_COARSE, ...)
> - The musl interface, calling __clock_gettime64(CLOCK_REALTIME, ...)
>
> So even if everyone agrees that the musl implementation is the
> correct one, I think both linux and glibc are more likely to stick with
> the traditional behavior to avoid breaking user space code such as the
> libc-test case that Zev brought up initially. At least Adhemerval's
> time() implementation in glibc[1] appears to have done this intentionally,
> while the Linux implementation has simply never changed this in an
> incompatible way since Linux-0.01 added time() and 0.99.13k added
> the high-resolution gettimeofday().
>
> Arnd
>
> [1] https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=0d56378349

Indeed, I changed glibc to be consistent on all architectures and to mimic the
kernel's time() syscall behavior, precisely to avoid this issue. We did not have
a consistent implementation before: depending on the architecture and kernel
version, glibc used either CLOCK_REALTIME or CLOCK_REALTIME_COARSE.

If the kernel does change time() to use CLOCK_REALTIME, it would make sense
for glibc's time() to call __clock_gettime64 with CLOCK_REALTIME as well. We
would also need to either disable the time() vDSO usage on x86 and powerpc, or
have the kernel implementation use CLOCK_REALTIME as well.


2022-06-16 09:16:00

by Thomas Gleixner

Subject: Re: [musl] Question about musl's time() implementation in time.c

On Wed, Jun 15 2022 at 14:09, Arnd Bergmann wrote:
> On Wed, Jun 15, 2022 at 1:28 AM Rich Felker <[email protected]> wrote:
> Adding the kernel timekeeping maintainers to Cc. I think this is a
> reasonable argument, but it goes against the current behavior.
>
> We have four implementations of the time() syscall that one would
> commonly encounter:
>
> - The kernel syscall, using (effectively) CLOCK_REALTIME_COARSE
> - The kernel vdso, using (effectively) CLOCK_REALTIME_COARSE
> - The glibc interface, calling __clock_gettime64(CLOCK_REALTIME_COARSE, ...)
> - The musl interface, calling __clock_gettime64(CLOCK_REALTIME, ...)
>
> So even if everyone agrees that the musl implementation is the
> correct one, I think both linux and glibc are more likely to stick with
> the traditional behavior to avoid breaking user space code such as the
> libc-test case that Zev brought up initially. At least Adhemerval's
> time() implementation in glibc[1] appears to have done this intentionally,
> while the Linux implementation has simply never changed this in an
> incompatible way since Linux-0.01 added time() and 0.99.13k added
> the high-resolution gettimeofday().

That's correct. Assuming this call order:

clock_gettime(CLOCK_REALTIME, &tr);
clock_gettime(CLOCK_REALTIME_COARSE, &tc);
tt = time(NULL);

You can observe

tr.tv_sec > tc.tv_sec
tr.tv_sec > tt

but you can never observe

tc.tv_sec > tt

The reason for this is historical, and time() has a distinct performance
advantage: it boils down to a single read and does not require the
sequence count (at least on 64-bit). Coarse REALTIME requires the
seqcount, but avoids the hardware read and the larger math.
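
Roughly, the two fast paths follow this pattern (an illustrative sketch
only; the vdso_data fields and function names below are made up, the
real kernel uses its own vdso data layout and helpers):

/* time(): a single read of the published seconds value, which the
   timekeeping code updates once per tick */
time_t vdso_time_sketch(void)
{
    return READ_ONCE(vdso_data->realtime_sec);
}

/* coarse clock_gettime(): the seqcount loop is needed to get a
   consistent sec/nsec pair, but there is no clocksource read and
   no multiply/shift for the nanosecond delta */
int vdso_realtime_coarse_sketch(struct timespec *ts)
{
    unsigned int seq;

    do {
        seq = read_seqcount_begin(&vdso_data->seq);
        ts->tv_sec  = vdso_data->realtime_coarse_sec;
        ts->tv_nsec = vdso_data->realtime_coarse_nsec;
    } while (read_seqcount_retry(&vdso_data->seq, seq));

    return 0;
}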

The costly part is the hardware read. Before the TSC became usable, the
hardware read was a matter of microseconds, so avoiding it was a
significant performance gain. With a loop of 1e9 reads (including the
loop overhead), as measured with perf on a halfway recent SKL (Skylake),
the average per invocation is:

time()                          7 cycles
clock_gettime(REAL_COARSE)     21 cycles
clock_gettime(REAL) TSC        60 cycles
clock_gettime(REAL) HPET     6092 cycles (~2000 cycles syscall overhead)
clock_gettime(REAL) ACPI_PM  4096 cycles (~2000 cycles syscall overhead)
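
(For reference, numbers of this kind can be reproduced with a loop of the
following shape run under perf stat, dividing the cycle count by the
iteration count; this is a reconstruction, not the exact harness used for
the figures above.)

#include <time.h>

int main(void)
{
    volatile time_t t;
    long i;

    /* build with -O2, run under "perf stat", divide cycles by 1e9;
       swap the loop body for clock_gettime(CLOCK_REALTIME_COARSE, ...)
       or clock_gettime(CLOCK_REALTIME, ...) to compare the other paths */
    for (i = 0; i < 1000000000L; i++)
        t = time(NULL);

    (void)t;
    return 0;
}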

So in the end it boils down to performance and expectations. File
systems have chosen their granularity, and the underlying mechanism used
to get the timestamp, accordingly.

It's clearly not well documented, but I doubt that we can change the
implementation without running into measurable performance regressions.

VDSO-based time() vs. clock_gettime(REAL) with the TSC is almost an
order of magnitude...

Thanks,

tglx