TSC-based x86_64 timekeeping implementation
===========================================
by Vojtech Pavlik and Jiri Bohac
This implementation allows the current time to be approximated by reading the
CPU's TSC even on SMP machines with unsynchronised TSCs. This allows us to
have a very fast gettimeofday() vsyscall on all SMP machines supporting the
RDTSCP instruction (AMD) or having synchronised TSCs (Intel).
Inter-CPU monotonicity can not, however, be guaranteed in a vsyscall, so
vsyscall is not used by default. Still, the syscall version of gettimeofday is
a lot faster using the TSC approximation instead of other hardware timers.
At boot, either the PM timer or HPET (preferred) is chosen as the "Master
Timer" (MT), from which all the time is calculated. As reading either of these
is slow, we want to approximate it using the TSC.
Each CPU updates its idea of the real time in update_timer_caches() called from
the LAPIC ISR. This function reads the real value of the MT and updates the
per-CPU timekeeping variables accordingly. Each CPU maintains its own
"tsc_slope" (a ratio of the MT and TSC frequencies) and a couple of offsets,
allowing us to guess (using guess_mt()) the value of the MT at any time on any
CPU. All this per-cpu data is kept in the vxtime structure.
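A minimal user-space sketch of the linear approximation described above. The struct layout, the 32-bit fixed-point format, and the names mt_cache, update_cache_sketch() and guess_mt_sketch() are assumptions made for illustration; the real vxtime fields and guess_mt() in the patch series may look different.

```c
#include <stdint.h>

#define SLOPE_SHIFT 32	/* assumed: slope has 32 fractional bits */

/* Hypothetical per-CPU cache, refreshed from the timer interrupt. */
struct mt_cache {
	uint64_t tsc_last;	/* TSC value at the last refresh */
	uint64_t mt_last;	/* Master Timer value read at that refresh */
	uint64_t slope;		/* MT ticks per TSC tick, fixed point */
};

/*
 * Refresh the cache from a real (slow) MT read paired with a TSC read.
 * Assumes the MT delta between refreshes fits in 32 bits, otherwise
 * the shift below overflows.
 */
static void update_cache_sketch(struct mt_cache *c, uint64_t tsc_now,
				uint64_t mt_now)
{
	uint64_t dtsc = tsc_now - c->tsc_last;
	uint64_t dmt = mt_now - c->mt_last;

	if (dtsc != 0)
		c->slope = (dmt << SLOPE_SHIFT) / dtsc;
	c->tsc_last = tsc_now;
	c->mt_last = mt_now;
}

/*
 * Estimate the MT value from the TSC alone. Assumes the TSC delta is
 * small enough that dtsc * slope does not overflow 64 bits.
 */
static uint64_t guess_mt_sketch(const struct mt_cache *c, uint64_t tsc_now)
{
	uint64_t dtsc = tsc_now - c->tsc_last;

	return c->mt_last + ((dtsc * c->slope) >> SLOPE_SHIFT);
}
```

Between refreshes only the cheap TSC read is needed; the slow MT read happens once per refresh on each CPU.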
Both the syscall and vsyscall versions of gettimeofday() use the
approximated value of the MT to calculate the time elapsed since the
last timer interrupt. For this purpose, vxtime.mt_wall holds the value
of the MT at the last timer interrupt.
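As a rough illustration of that calculation, the sketch below adds the time elapsed since the last timer interrupt to the wall time recorded at that interrupt. Only mt_wall comes from the description above; the function name, the mt_hz parameter (MT ticks per second) and the use of struct timeval are assumptions for the example.

```c
#include <stdint.h>
#include <sys/time.h>

/*
 * wall_at_tick: wall time recorded at the last timer interrupt
 * mt_wall:      MT value at that interrupt
 * mt_now:       current (approximated) MT value
 * mt_hz:        assumed MT frequency in ticks per second
 */
static struct timeval wall_from_mt_sketch(struct timeval wall_at_tick,
					  uint64_t mt_wall, uint64_t mt_now,
					  uint64_t mt_hz)
{
	/* Elapsed MT ticks converted to microseconds. */
	uint64_t usec = (mt_now - mt_wall) * 1000000ULL / mt_hz;

	/* Add to the recorded wall time and carry into seconds. */
	usec += (uint64_t)wall_at_tick.tv_usec;
	wall_at_tick.tv_sec += (time_t)(usec / 1000000ULL);
	wall_at_tick.tv_usec = (suseconds_t)(usec % 1000000ULL);
	return wall_at_tick;
}
```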
During a CPU frequency change, we cannot trust the TSCs. Therefore, when
we get the pre-change notification, we switch to using the hardware
Master Timer instead of the approximation by setting a flag in
vxtime.tsc_invalid. After the post-change notification we keep using the
hardware MT for a while, until the approximation becomes accurate again.
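The fallback can be pictured as a small state machine. In this sketch only the tsc_invalid flag is taken from the text; the settle counter, its length (SETTLE_UPDATES) and all function names are hypothetical, since the description does not say how "a while" is measured.

```c
#include <stdbool.h>
#include <stdint.h>

enum { SETTLE_UPDATES = 4 };	/* assumed number of cache refreshes to wait */

struct tsc_state {
	bool tsc_invalid;	/* true: read the hardware MT directly */
	bool changing;		/* a frequency change is in progress */
	unsigned settle_left;	/* refreshes left before trusting the TSC */
};

/* Pre-change notification: stop trusting the approximation. */
static void freq_prechange(struct tsc_state *s)
{
	s->tsc_invalid = true;
	s->changing = true;
}

/* Post-change notification: start the settling period. */
static void freq_postchange(struct tsc_state *s)
{
	s->changing = false;
	s->settle_left = SETTLE_UPDATES;
}

/* Called after each per-CPU cache refresh. */
static void cache_updated(struct tsc_state *s)
{
	if (!s->tsc_invalid || s->changing)
		return;
	if (s->settle_left > 0 && --s->settle_left == 0)
		s->tsc_invalid = false;
}

/* Pick the slow hardware value or the fast guess. */
static uint64_t read_mt_sketch(const struct tsc_state *s, uint64_t hw_mt,
			       uint64_t guessed_mt)
{
	return s->tsc_invalid ? hw_mt : guessed_mt;
}
```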
When strict inter-CPU monotonicity is not needed, the vsyscall version of
gettimeofday may be forced using the "nomonotonic" command line parameter.
gettimeofday()'s monotonicity is guaranteed on a single CPU even with the very
fast vsyscall version. Across CPUs, the vsyscall version of gettimeofday is
not guaranteed to be monotonic, but it should be pretty close. Currently, we
get errors of tens/hundreds of microseconds.
We do not rely on either the LAPIC timer or the main timer interrupt
firing at regular intervals (although a small modification would
improve the MT approximation in that case), so we're basically ready for a
tickless kernel.
A patch series follows. Comments welcome.
--
Jiri Bohac <[email protected]>
SUSE Labs, SUSE CZ
* [email protected] <[email protected]> wrote:
> This implementation allows the current time to be approximated by
> reading the CPU's TSC even on SMP machines with unsynchronised TSCs.
> This allows us to have a very fast gettimeofday() vsyscall on all SMP
> machines supporting the RDTSCP instruction (AMD) or having
> synchronised TSCs (Intel).
>
> Inter-CPU monotonicity can not, however, be guaranteed in a vsyscall,
> so vsyscall is not used by default. Still, the syscall version of
> gettimeofday is a lot faster using the TSC approximation instead of
> other hardware timers.
ok, this looks mostly good to me - but this definitely should be based
/on top/ of the x86_64 GTOD code, i.e. on top of these patches in -mm:
generic-vsyscall-gtod-support-for-generic_time.patch
generic-vsyscall-gtod-support-for-generic_time-tidy.patch
time-x86_64-hpet_address-cleanup.patch
revert-x86_64-mm-ignore-long-smi-interrupts-in-clock-calibration.patch
time-x86_64-split-x86_64-kernel-timec-up.patch
time-x86_64-split-x86_64-kernel-timec-up-tidy.patch
time-x86_64-split-x86_64-kernel-timec-up-fix.patch
reapply-x86_64-mm-ignore-long-smi-interrupts-in-clock-calibration.patch
time-x86_64-convert-x86_64-to-use-generic_time.patch
time-x86_64-convert-x86_64-to-use-generic_time-fix.patch
time-x86_64-convert-x86_64-to-use-generic_time-tidy.patch
time-x86_64-hpet-fixup-clocksource-changes.patch
time-x86_64-tsc-fixup-clocksource-changes.patch
time-x86_64-re-enable-vsyscall-support-for-x86_64.patch
time-x86_64-re-enable-vsyscall-support-for-x86_64-tidy.patch
also, note that there is new TSC synchronization check code in -mm as
well:
x86-rewrite-smp-tsc-sync-code.patch
this should be on top of that too. (and on top of the high-res timers
queue)
Ingo
On Thursday 01 February 2007 10:59, [email protected] wrote:
>
> Inter-CPU monotonicity can not, however, be guaranteed in a vsyscall, so
> vsyscall is not used by default.
Only for unsynchronized machines I hope
> Still, the syscall version of gettimeofday is
> a lot faster using the TSC approximation instead of other hardware timers.
Yes that makes sense.
The big strategic problem is how to marry your patchkit to John Stultz's
clocksources work which is also competing for merge. Any thoughts on that?
>When strict inter-CPU monotonicity is not needed, the vsyscall version of
>gettimeofday may be forced using the "nomonotonic" command line parameter.
>gettimeofday()'s monotonicity is guaranteed on a single CPU even with the very
>fast vsyscall version. Across CPUs, the vsyscall version of gettimeofday is
>not guaranteed to be monotonic, but it should be pretty close. Currently, we
>get errors of tens/hundreds of microseconds.
I think a better way to do this would be to define a new CLOCK_THREAD_MONOTONOUS
(or better name) timer for clock_gettime().
[and my currently stalled vdso patches that implement clock_gettime
as a vsyscall]
Then also an application could easily use it with LD_PRELOAD
-Andi
* [email protected] <[email protected]> wrote:
> Inter-CPU monotonicity can not, however, be guaranteed in a vsyscall,
> so vsyscall is not used by default. [...]
note that this is not actually the case. My patch below, on top of -mm,
implements a fully monotonic gettimeofday as an optional vsyscall
feature.
The 'price' paid for it is lower resolution - but it's still good for
those benchmarking TPC-C runs - and /a lot/ simpler. It's also quite a
bit faster than any TSC based vgettimeofday, because it doesn't have to
do an RDTSC (or RDTSCP) instruction or any approximation of the time.
Ingo
---------------------------->
Subject: [patch] x86_64 GTOD: offer scalable vgettimeofday
From: Ingo Molnar <[email protected]>
offer scalable vgettimeofday independently of whether the TSC is
synchronous or not. Off by default. Results in low resolution
gettimeofday().
this patch also fixes an SMP bug in sys_vtime(): we should read
__vsyscall_gtod_data.wall_time_tv.tv_sec only once.
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86_64/kernel/vsyscall.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)
Index: linux/arch/x86_64/kernel/vsyscall.c
===================================================================
--- linux.orig/arch/x86_64/kernel/vsyscall.c
+++ linux/arch/x86_64/kernel/vsyscall.c
@@ -107,6 +107,22 @@ static __always_inline void do_vgettimeo
 	cycle_t now, base, mask, cycle_delta;
 	unsigned long seq, mult, shift, nsec_delta;
 	cycle_t (*vread)(void);
+
+	if (likely(__vsyscall_gtod_data.sysctl_enabled == 2)) {
+		struct timeval tmp;
+
+		do {
+			barrier();
+			*tv = __vsyscall_gtod_data.wall_time_tv;
+			barrier();
+			tmp = __vsyscall_gtod_data.wall_time_tv;
+
+		} while (tmp.tv_usec != tv->tv_usec ||
+					tmp.tv_sec != tv->tv_sec);
+
+		return;
+	}
+
 	do {
 		seq = read_seqbegin(&__vsyscall_gtod_data.lock);
@@ -151,11 +167,19 @@ int __vsyscall(0) vgettimeofday(struct t
  * unlikely */
 time_t __vsyscall(1) vtime(time_t *t)
 {
+	time_t secs;
+
 	if (!__vsyscall_gtod_data.sysctl_enabled)
 		return time_syscall(t);
-	else if (t)
-		*t = __vsyscall_gtod_data.wall_time_tv.tv_sec;
-	return __vsyscall_gtod_data.wall_time_tv.tv_sec;
+
+	/*
+	 * Make sure that what we return is the same number we
+	 * write:
+	 */
+	secs = __vsyscall_gtod_data.wall_time_tv.tv_sec;
+	if (t)
+		*t = secs;
+	return secs;
 }
/* Fast way to get current CPU and node.
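The read-twice-and-compare loop in the first hunk of the patch above can be modelled in user space roughly as follows. This is only a sketch: it uses C11 sequentially consistent atomics in place of the kernel's barrier(), and the type and function names are hypothetical. Like the patch, it detects a concurrent update only when the two copies differ.

```c
#include <stdatomic.h>

/* Stand-in for the shared wall_time_tv; written by a single updater. */
struct shared_tv {
	_Atomic long tv_sec;
	_Atomic long tv_usec;
};

/* Plain private copy handed back to the caller. */
struct tv_copy {
	long tv_sec;
	long tv_usec;
};

/* Load both fields once, one after the other. */
static struct tv_copy load_tv(struct shared_tv *s)
{
	struct tv_copy c;

	c.tv_sec = atomic_load(&s->tv_sec);
	c.tv_usec = atomic_load(&s->tv_usec);
	return c;
}

/* Copy the value twice and retry until both copies agree. */
static struct tv_copy read_stable_tv(struct shared_tv *s)
{
	struct tv_copy first, second;

	do {
		first = load_tv(s);
		second = load_tv(s);
	} while (first.tv_usec != second.tv_usec ||
		 first.tv_sec != second.tv_sec);

	return first;
}
```

No TSC read and no arithmetic are involved, so the result can only be as fine-grained as the rate at which the shared value is updated.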
On Thu, Feb 01, 2007 at 12:20:59PM +0100, Andi Kleen wrote:
> I think a better way to do this would be to define a new CLOCK_THREAD_MONOTONOUS
> (or better name) timer for clock_gettime().
>
> [and my currently stalled vdso patches that implement clock_gettime
> as a vsyscall]
>
> Then also an application could easily use it with LD_PRELOAD
I think a prctl to enable the non-monotonic mode is better than any
LD_PRELOAD trick.
On Thursday 01 February 2007 12:53, Andrea Arcangeli wrote:
> On Thu, Feb 01, 2007 at 12:20:59PM +0100, Andi Kleen wrote:
> > I think a better way to do this would be to define a new CLOCK_THREAD_MONOTONOUS
> > (or better name) timer for clock_gettime().
> >
> > [and my currently stalled vdso patches that implement clock_gettime
> > as a vsyscall]
> >
> > Then also an application could easily use it with LD_PRELOAD
>
> I think a prctl to enable the non monothone mode is better than any
> LD_PRELOAD trick.
I don't think so because having per process state in a vsyscall
is quite costly. You would need to allocate at least one more
page to each process, which I think would be excessive.
-Andi
On Thursday 01 February 2007 12:46, Ingo Molnar wrote:
>
> * [email protected] <[email protected]> wrote:
>
> > Inter-CPU monotonicity can not, however, be guaranteed in a vsyscall,
> > so vsyscall is not used by default. [...]
>
> note that this is not actually the case. My patch below, ontop of -mm,
> implements a fully monotonic gettimeofday as an optional vsyscall
> feature.
>
> The 'price' paid for it is lower resolution - but it's still good for
> those benchmarking TPC-C runs - and /alot/ simpler. It's also quite a
> bit faster than any TSC based vgettimeofday, because it doesnt have to
> do an RDTSC (or RDTSCP) instruction nor any approximation of the time.
I believe that should also be a separate clock_gettime() CLOCK_
Global settings for these things are bad. Even if you run TPC-C you don't
want your other programs that rely on monotonic time to break.
-Andi
* Andi Kleen <[email protected]> wrote:
> > The 'price' paid for it is lower resolution - but it's still good
> > for those benchmarking TPC-C runs - and /alot/ simpler. It's also
> > quite a bit faster than any TSC based vgettimeofday, because it
> > doesnt have to do an RDTSC (or RDTSCP) instruction nor any
> > approximation of the time.
>
> I believe that should be also a separate clock_gettime() CLOCK_
>
> Global settings for these things are bad. Even if you run TPC-C you
> don't want your other programs that rely on monotonic time to break.
yeah. But maybe there should still be an 'easy' option for people
to consciously degrade the resolution of gettimeofday(), in exchange for
more performance. There are systems where gettimeofday already has such
resolution, so apps certainly shouldn't break from this. But i agree with
you: that's why i made this default-off, and the CLOCK_ option could be
a way for apps to reliably get this behavior, independently of the
global setting (hence driving the migration of affected apps to this new
CLOCK_ thing). Hm?
Ingo
On Thursday 01 February 2007 12:46, Ingo Molnar wrote:
>
> * [email protected] <[email protected]> wrote:
>
> > Inter-CPU monotonicity can not, however, be guaranteed in a vsyscall,
> > so vsyscall is not used by default. [...]
>
> note that this is not actually the case. My patch below, ontop of -mm,
> implements a fully monotonic gettimeofday as an optional vsyscall
> feature.
>
> The 'price' paid for it is lower resolution - but it's still good for
> those benchmarking TPC-C runs - and /alot/ simpler.
BTW another comment: I was told that at least one of the big databases
wants ms resolution here. So making your scheme work
would require a regular HZ=1024 interrupt. But that would also make
everything slower again due to CPU overhead, as was learned in the 2.4->2.6 HZ
transition.
So it might not actually be worth it.
-Andi
* Andi Kleen <[email protected]> wrote:
> The big strategic problem is how to marry your patchkit to John
> Stultz's clocksources work which is also competing for merge. Any
> thoughts on that?
the only sane thing would be to do it on top of -mm: the stuff in -mm,
barring some catastrophe, is hopefully destined for v2.6.21. We could do
my quick optional hack for those who want fast gettimeofday now ahead of
that queue - but this approximation thing should definitely be on top.
Ingo
* Andi Kleen <[email protected]> wrote:
> > The 'price' paid for it is lower resolution - but it's still good
> > for those benchmarking TPC-C runs - and /alot/ simpler.
>
> BTW another comment: I was told that at least one of the big databases
> wants ms resolution here. So to make your scheme work would require a
> HZ=1024 regular interrupt. [...]
if resolution is an issue then i can improve this thing to be based off
a separate /optional/ hrtimer, thus if it's enabled it could enable 1000
Hz (and not 1024 Hz) update for the variable. The update resolution
could be tuned via a sysctl trivially, so everyone could tune the
resolution of this to the value desired, and could do so at runtime.
[ It could also be driven by the database right now: from a thread open
/dev/rtc, set it to 1024 HZ, and do a gettimeofday() call in every
tick - that will auto-update the timestamp. ]
> [...] But that would also make everything slower again due to CPU
> overhead as it was learned in the 2.4->2.6 HZ transition.
note that this cost was measured on UP and on older hardware, and the
cost of having a global 1000 Hz update gets linearly cheaper with the
increase of CPUs on SMP: because only one such update has to be running.
The systems those database vendors are interested in typically have a
fair number of CPUs.
Ingo
On Thursday 01 February 2007 13:24, Ingo Molnar wrote:
> if resolution is an issue then i can improve this thing to be based off
> a separate /optional/ hrtimer, thus if it's enabled it could enable 1000
> Hz (and not 1024 Hz) update for the variable. The update resolution
> could be tuned via a sysctl trivially, so everyone could tune the
> resolution of this to the value desired, and could do so runtime.
It would be better to let the application set it without root rights
(afaik W. allows this). Auto tuning beats explicit configuration anytime.
Not sure it's really worth it though.
My thinking was to first gather more requirements about what users actually
want before adding all these new modes.
> [ It could also be driven by the database right now: from a thread open
> /dev/rtc, set it to 1024 HZ, and do a gettimeofday() call in every
> tick - that will auto-update the timestamp. ]
zmailer used to do that (or probably still does) but I always hated
the scheme for some reason :)
> > [...] But that would also make everything slower again due to CPU
> > overhead as it was learned in the 2.4->2.6 HZ transition.
>
> note that this cost was measured on UP and on older hardware, and the
> cost of having a global 1000 Hz update gets linearly cheaper with the
> increase of CPUs on SMP: because only one such update has to be running.
> The systems those database vendors are interested in typically have a
> fair number of CPUs.
Good point. Even on desktop with Multi Core or SMT it should be cheaper now.
-Andi
On Thu, Feb 01, 2007 at 01:02:41PM +0100, Andi Kleen wrote:
> I don't think so because having per process state in a vsyscall
> is quite costly. You would need to allocate at least one more
> page to each process, which I think would be excessive.
You would need one page per cpu and to check a change in a TIF_
bitflag during switch_to (zero cost) and overwrite the vsyscall bit in
the slow path.
If we had a picotimeofday that would be guaranteed monotone... if he
can measure errors with shared memory in smp, it means the measurement
error (LAPIC and tsc frequency estimation) is longer than the time it
takes to bounce a spinlock and reach a second rdtscp. I hoped this
wouldn't happen. Could you send me the app used to reproduce the
non-monotonicity over shared memory with rdtscp? I finally have a (EE)
stepping F to attempt testing it. thanks!
On Thu, Feb 01, 2007 at 12:20:59PM +0100, Andi Kleen wrote:
> On Thursday 01 February 2007 10:59, [email protected] wrote:
>
> >
> > Inter-CPU monotonicity can not, however, be guaranteed in a vsyscall, so
> > vsyscall is not used by default.
>
> Only for unsynchronized machines I hope
yes, sorry, only on unsynchronized machines
> The big strategic problem is how to marry your patchkit to John Stultz's
> clocksources work which is also competing for merge. Any thoughts on that?
I'll look into that next week. Sorry, I wanted to do that a long time
ago, but I spent weeks (over a month) fighting a nasty livelock
in the code. (Moral: think twice before using a spinlock inside
a {do .. while (read_seqretry(..))} loop)
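For readers unfamiliar with the pattern, here is a minimal user-space sketch of a sequence-counter read loop, written with C11 atomics rather than the kernel's read_seqbegin()/read_seqretry(). All names are hypothetical and a single writer is assumed. It only illustrates that the loop body may run several times while a writer is active, so anything done inside it, such as taking a spinlock, is repeated on every retry; it does not reproduce the specific livelock mentioned above.

```c
#include <stdatomic.h>
#include <stdint.h>

struct seq_time {
	atomic_uint seq;	/* odd while a write is in progress */
	_Atomic uint64_t sec;
	_Atomic uint64_t nsec;
};

/* Writer side; assumes exactly one writer. Default (seq_cst) ordering. */
static void seq_time_write(struct seq_time *t, uint64_t sec, uint64_t nsec)
{
	unsigned s = atomic_load(&t->seq);

	atomic_store(&t->seq, s + 1);	/* mark write in progress */
	atomic_store(&t->sec, sec);
	atomic_store(&t->nsec, nsec);
	atomic_store(&t->seq, s + 2);	/* mark write complete */
}

/* Reader side; returns how many times the loop body ran. */
static unsigned seq_time_read(struct seq_time *t, uint64_t *sec,
			      uint64_t *nsec)
{
	unsigned s0, s1, tries = 0;

	do {
		tries++;
		s0 = atomic_load(&t->seq);
		*sec = atomic_load(&t->sec);
		*nsec = atomic_load(&t->nsec);
		s1 = atomic_load(&t->seq);
		/* Retry if a write was in progress or completed meanwhile. */
	} while ((s0 & 1u) || s0 != s1);

	return tries;
}
```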
> >When strict inter-CPU monotonicity is not needed, the vsyscall version of
> >gettimeofday may be forced using the "nomonotonic" command line parameter.
> >gettimeofday()'s monotonicity is guaranteed on a single CPU even with the very
> >fast vsyscall version. Across CPUs, the vsyscall version of gettimeofday is
> >not guaranteed to be monotonic, but it should be pretty close. Currently, we
> >get errors of tens/hundreds of microseconds.
>
> I think a better way to do this would be to define a new CLOCK_THREAD_MONOTONOUS
> (or better name) timer for clock_gettime().
I absolutely agree. Will do that. This should give userspace a
decently accurate and very fast time source.
--
Jiri Bohac <[email protected]>
SUSE Labs, SUSE CZ
On Thu, 2007-02-01 at 15:52 +0100, Jiri Bohac wrote:
> On Thu, Feb 01, 2007 at 12:20:59PM +0100, Andi Kleen wrote:
>>
> > The big strategic problem is how to marry your patchkit to John Stultz's
> > clocksources work which is also competing for merge. Any thoughts on that?
>
> I'll look into that next week. Sorry, I wanted to do that a long time
> ago, but I spent weeks (over a month) fighting a nasty livelock
> in the code. (Morale: think twice before using a spinlock inside
> a {do .. while (read_seqretry(..))} loop)
The first step here shouldn't be too difficult. Just create a _read
function that uses your code to return monotonic TSC cycles (instead of
nanoseconds w/ gettimeofday). Then just create a clocksource structure
for it.
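A user-space sketch of what such a pairing might look like. The struct below is a simplified stand-in, not the kernel's struct clocksource; its mask, mult and shift fields mirror the variables visible in do_vgettimeofday() in the patch quoted earlier in this thread, and fake_read() is a placeholder for a guess_mt()-style read function.

```c
#include <stdint.h>

/* Simplified, hypothetical stand-in for a clocksource descriptor. */
struct clocksource_sketch {
	const char *name;
	uint64_t (*read)(void);	/* returns a monotonic cycle count */
	uint64_t mask;		/* valid bits of the counter */
	uint32_t mult;		/* cycles -> ns multiplier */
	uint32_t shift;		/* cycles -> ns shift */
};

/* Placeholder counter standing in for the approximated MT value. */
static uint64_t fake_cycles;

static uint64_t fake_read(void)
{
	return fake_cycles;
}

/* Convert a cycle delta to nanoseconds: ((now - base) & mask) * mult >> shift. */
static uint64_t cycles_to_ns_sketch(const struct clocksource_sketch *cs,
				    uint64_t base, uint64_t now)
{
	uint64_t delta = (now - base) & cs->mask;

	return (delta * cs->mult) >> cs->shift;
}

/* Example: a 1 MHz counter, i.e. 1000 ns per cycle. */
static const struct clocksource_sketch approx_mt_clock = {
	.name	= "approx-mt",
	.read	= fake_read,
	.mask	= UINT64_MAX,
	.mult	= 1000u << 10,
	.shift	= 10,
};
```

The generic timekeeping code then only needs cycle counts from read(); the conversion to nanoseconds is driven by mult and shift.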
The harder part will be the vsyscall, as you will need extra per cpu
data in the vsyscall read. I had some test code for this situation
awhile back, so if you get the first part functioning correctly (just a
clocksource w/o a vread pointer), I'll gladly help you get the vsyscall
bits working.
thanks
-john
On Thu, Feb 01, 2007 at 08:56:48AM -0800, john stultz wrote:
> On Thu, 2007-02-01 at 15:52 +0100, Jiri Bohac wrote:
> > On Thu, Feb 01, 2007 at 12:20:59PM +0100, Andi Kleen wrote:
> >>
> > > The big strategic problem is how to marry your patchkit to John Stultz's
> > > clocksources work which is also competing for merge. Any thoughts on that?
> >
> > I'll look into that next week. Sorry, I wanted to do that a long time
> > ago, but I spent weeks (over a month) fighting a nasty livelock
> > in the code. (Morale: think twice before using a spinlock inside
> > a {do .. while (read_seqretry(..))} loop)
>
> The first step here shouldn't be too difficult. Just create a _read
> function that uses your code to return monotonic TSC cycles (instead of
> nanoseconds w/ gettimeofday). Then just create a clocksource structure
> for it.
guess_mt() is more or less the function you're looking for. (With the
exception of the cpufreq and mode switching logic.)
> The harder part will be the vsyscall, as you will need extra per cpu
> data in the vsyscall read. I had some test code for this situation
> awhile back, so if you get the first part functioning correctly (just a
> clocksource w/o a vread pointer), I'll gladly help you get the vsyscall
> bits working.
>
> thanks
> -john
>
>
--
Vojtech Pavlik
Director SuSE Labs
On Thu, 01 Feb 2007 10:59:52 +0100 [email protected] wrote:
> TSC-based x86_64 timekeeping implementation
I worry about the relationship between this work and all the time-management
changes in -mm. If Andi were to merge all this stuff under that then I
expect various catastrophes would ensue.
Have you checked to determine the severity of the overlaps?
On Friday 02 February 2007 05:22, Andrew Morton wrote:
> On Thu, 01 Feb 2007 10:59:52 +0100 [email protected] wrote:
>
> > TSC-based x86_64 timekeeping implementation
>
> I worry about the relationship between this work and all the time-management
> changes in -mm. If Andi to were to merge all this stuff under that then I
> expect various catastrophes would ensue.
>
> Have you checked to determine the severity of the overlaps?
The overlap is quite total. They both overhaul the time code completely.
I suspect the way to go is to reimplement Jiri's patch on top of clock sources
(Hopefully now with working algorithms that shouldn't be too hard). But
I'm still waiting for Jiri's assessment on how feasible this is.
-Andi