2004-09-02 21:13:54

by john stultz

[permalink] [raw]
Subject: [RFC] New Time of day proposal (updated 9/2/04)

All,
Here again is my updated time of day proposal. Also to follow is
working code for i386. I wanted to get this out so folks could actually
play around with the core architecture independent code. The i386
specific bits function, but are unfinished. Thus they have been broken
out and should mostly be ignored for this release.

I'd really appreciate any feedback or concerns about this proposal and
the following code.

thanks
-john


Proposal for an architecture independent time of day implementation.
-------------------------------------------------------------------
John Stultz ([email protected])
DRAFT
Thu Sep 2 12:35:11 PDT 2004

Credits:
Keith Mannthey: Aided initial design.
Aided greatly to implementation details.
George Anzinger: Initial review and corrections.
Ulrich Windl: Review and suggestions for clarity.

Many of the time of day related issues that cropped up in 2.5
development occurred where a fix or change was made to a number of
architectures, but missed a few others. Currently every architecture has
its own set of timekeeping functions that basically do the same thing,
only using different (or frequently, not so different) types of
hardware. As hardware has changed, many architectures have had to
re-engineer their time system to handle multiple time and interrupt
sources. With little common infrastructure, either each separate
implementation has its own quirks and bugs, or we end up with a
reasonable quantity of duplicated code. Additionally the lack of a clear
time of day interface has led developers to use jiffies, HZ, and the raw
xtime values to calculate the time of day themselves. This has lead to a
number of troublesome bugs.

With the goal to simplify, streamline and consolidate the time-of-day
infrastructure, I propose the following common implementation across all
arches. This will allow generic bugs to be fixed once, reduce code
duplication, and with many architectures sharing the same time source,
this allows drivers to be written once for multiple architectures.
Additionally it will better delineate the lines between the timer
subsystem and the time-of-day subsystem, opening the door for more
flexible and better timekeeping.

Features of this design:
========================

o Splits time of day management from timer interrupts:
This is necessary for virtualization & tickless systems. It allows us
to no longer care how often clock_interrupt() is called. Missing, early
or lost interrupts do not affect time keeping (within bounds - ie: the
time source cannot overflow). This isolates HZ and jiffies to the timer
subsystem (mostly), as they are frequently and incorrectly used to
calculate time.
Additionally, it allows for dynamic tick interrupts / high-res ticks.
Avoid the need to interpolate between multiple shoddy time sources, and
lets us be agnostic to where the periodic interrupts come from (cleans
up i386 HPET interrupt code).

o Consolidates a large amount of code:
Allows for shared times source implementations, such as: i386, x86-64
and ia64 all use HPET, i386 and x86-64 both have ACPI PM timers, and
i386 and ia64 both have cyclone counters. Time sources are just drivers!
Also work for user space gettimeofday implementations will be able to be
shared across all arches (assuming the hardware time source can be
safely accessed from user space).

o Generic algorithms which use time-source drivers chosen at runtime:
Drivers are just simple hw accessors functions with no internal state
needed. They can be loaded and changed while the system is running, like
normal modules.

o More consistent and readable code:
Drop wall_to_monotonic & xtime in favor of a more simple system_time
and wall_time_offset variables. Where system_time is the monotonically
increasing nanoseconds since boot time and wall_time_offset is the
offset added to system_time to calculate time of day.

o Uses nanoseconds as the kernel's base time unit.
Rather then doing ugly manipulations to timevals or timespecs, this
simplifies math, and gives us plenty of room to grow (64bits of
nanoseconds ~= 584 years).

o Clearly separates the NTP code from the time code:
Creates a clean and clear interface, keeping all the NTP related code
in a single place. Save brains, normal people shouldn't have to think
about the in kernel ntp machinery.


Brief Psudo-code to illustrate the design:
==========================================

Globals:
--------
offset_base: timesource cycle value at last call to timeofday_hook()
system_time: time in ns calculated at last call to timeofday_hook()
wall_offset: offset to monotonic_clock() to get current time of day

Functions:
----------
timeofday_hook()
now = read(); /* read the timesource */
ns = cyc2ns(now - offset_base); /* calc nsecs since last call */
ntp_ns = ntp_scale(ns); /* apply ntp scaling */
system_time += ntp_ns; /* add scaled value to system_time */
ntp_advance(ns); /* advance ntp state machine by ns */
offset_base = now; /* set new offset_base */

monotonic_clock()
now = read(); /* read the timesource */
ns = cyc2ns(now - offset_base); /* calculate nsecs since last hook */
ntp_ns = ntp_scale(ns); /* apply ntp scaling */
return system_time + ntp_ns; /* return system_time and scaled value
*/

settimeofday(desired)
wall_offset = desired - monotonic_clock(); /* set wall offset */

gettimeofday()
return wall_offset + monotonic_clock(); /* return current timeofday */


Points I'm glossing over for now:
====================================================

o Have to convert back to time_val for syscall interface

o ntp_scale(ns): scales ns by NTP scaling factor
- see ntp.c for details
- costly, but correct.

o ntp_advance(ns): advances NTP state machine by ns
- see ntp.c for details

o What is the cost of throwing around 64bit values for everything?
- Do we need an arch specific time structure that varies size
accordingly?

o Some arches (arm, for example) do not have high res timing hardware
- In this case we can have a "jiffies" timesource
- cyc2ns(x) = x*(NSEC_PER_SEC/HZ)
- doesn't work for tickless systems

o vsyscalls/userspace gettimeofday()
- Mark functions and data w/ __vsyscall attribute
- Use linker to put all __vsyscall data in the same set of pages
- Mark those pages user-executable
- Should work for all arches
- ia64 fastcall should be used for timesource implementation only

o suspend/resume
- not yet implemented, but shouldn't be hard

Anything else? What am I missing or just being ignorant of?


2004-09-02 21:18:04

by john stultz

[permalink] [raw]
Subject: [RFC][PATCH] new timeofday i386 hooks (v.A0)

All,
This patch implements the minimal i386 architecture hooks to enable the
new time of day subsystem code. It applies on top of my
linux-2.6.9-rc1_timeofday-core_A0 patch and with this patch applied, you
can test the new time of day subsystem on i386. Basically it adds the
call to timeofday_interrupt_hook() and cuts alot of code out of the
build. The only new code is the sync_persistant_clock() function which
is mostly ripped out of do_timer_interrupt(). Pretty un-interesting.

I look forward to your comments and feedback.

thanks
-john

linux-2.6.9-rc1_timeofday-i386_A0.patch
=======================================
diff -Nru a/arch/i386/Kconfig b/arch/i386/Kconfig
--- a/arch/i386/Kconfig 2004-09-02 13:29:59 -07:00
+++ b/arch/i386/Kconfig 2004-09-02 13:29:59 -07:00
@@ -14,6 +14,10 @@
486, 586, Pentiums, and various instruction-set-compatible chips by
AMD, Cyrix, and others.

+config NEWTOD
+ bool
+ default y
+
config MMU
bool
default y
diff -Nru a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
--- a/arch/i386/kernel/time.c 2004-09-02 13:29:59 -07:00
+++ b/arch/i386/kernel/time.c 2004-09-02 13:29:59 -07:00
@@ -67,6 +67,8 @@

#include "io_ports.h"

+#include <linux/timeofday.h>
+
extern spinlock_t i8259A_lock;
int pit_latch_buggy; /* extern */

@@ -87,6 +89,7 @@

struct timer_opts *cur_timer = &timer_none;

+#ifndef CONFIG_NEWTOD
/*
* This version of gettimeofday has microsecond resolution
* and better than microsecond precision on fast x86 machines with TSC.
@@ -169,6 +172,7 @@
}

EXPORT_SYMBOL(do_settimeofday);
+#endif

static int set_rtc_mmss(unsigned long nowtime)
{
@@ -194,12 +198,39 @@
* Note: This function is required to return accurate
* time even in the absence of multiple timer ticks.
*/
+#ifndef CONFIG_NEWTOD
unsigned long long monotonic_clock(void)
{
return cur_timer->monotonic_clock();
}
EXPORT_SYMBOL(monotonic_clock);
+#endif

+void sync_persistant_clock(struct timespec ts)
+{
+ /*
+ * If we have an externally synchronized Linux clock, then update
+ * CMOS clock accordingly every ~11 minutes. Set_rtc_mmss() has to be
+ * called as close as possible to 500 ms before the new second starts.
+ */
+ if (ts.tv_sec > last_rtc_update + 660 &&
+ (ts.tv_nsec / 1000)
+ >= USEC_AFTER - ((unsigned) TICK_SIZE) / 2 &&
+ (ts.tv_nsec / 1000)
+ <= USEC_BEFORE + ((unsigned) TICK_SIZE) / 2) {
+ /* horrible...FIXME */
+ if (efi_enabled) {
+ if (efi_set_rtc_mmss(ts.tv_sec) == 0)
+ last_rtc_update = ts.tv_sec;
+ else
+ last_rtc_update = ts.tv_sec - 600;
+ } else if (set_rtc_mmss(ts.tv_sec) == 0)
+ last_rtc_update = ts.tv_sec;
+ else
+ last_rtc_update = ts.tv_sec - 600; /* do it again in 60 s */
+ }
+
+}

/*
* timer_interrupt() needs to keep up the real-time clock,
@@ -226,6 +257,7 @@

do_timer_interrupt_hook(regs);

+#ifndef CONFIG_NEWTOD
/*
* If we have an externally synchronized Linux clock, then update
* CMOS clock accordingly every ~11 minutes. Set_rtc_mmss() has to be
@@ -248,6 +280,7 @@
else
last_rtc_update = xtime.tv_sec - 600; /* do it again in 60 s */
}
+#endif

#ifdef CONFIG_MCA
if( MCA_bus ) {
@@ -282,11 +315,15 @@
*/
write_seqlock(&xtime_lock);

+#ifndef CONFIG_NEWTOD
cur_timer->mark_offset();
+#endif

do_timer_interrupt(irq, NULL, regs);

write_sequnlock(&xtime_lock);
+
+ timeofday_interrupt_hook();
return IRQ_HANDLED;
}



2004-09-02 21:28:42

by john stultz

[permalink] [raw]
Subject: [RFC][PATCH] new timeofday core subsystem (v.A0)

All,
This patch implements the architecture independent portion of the time
of day subsystem. Included is timeofday.c (which includes all the time
of day management and accessor functions), ntp.c (which includes the ntp
scaling code, leap second processing, and ntp kernel state machine
code), interface definition .h files, the example jiffies timesource
(lowest common denominator time source, mainly for use as example code)
and minimal hooks into arch independent code.

The patch does not function without minimal architecture specific hooks
(i386 example to follow), and it can be applied to a tree without
affecting the code.

I look forward to your comments and feedback.

thanks
-john

linux-2.6.9-rc1_timeofday-core_A0.patch
=======================================
diff -Nru a/drivers/Makefile b/drivers/Makefile
--- a/drivers/Makefile 2004-09-02 13:27:00 -07:00
+++ b/drivers/Makefile 2004-09-02 13:27:00 -07:00
@@ -52,3 +52,4 @@
obj-$(CONFIG_CPU_FREQ) += cpufreq/
obj-$(CONFIG_MMC) += mmc/
obj-y += firmware/
+obj-$(CONFIG_NEWTOD) += timesource/
diff -Nru a/drivers/timesource/Makefile b/drivers/timesource/Makefile
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/Makefile 2004-09-02 13:27:00 -07:00
@@ -0,0 +1 @@
+obj-y += jiffies.o
diff -Nru a/drivers/timesource/jiffies.c b/drivers/timesource/jiffies.c
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/jiffies.c 2004-09-02 13:27:00 -07:00
@@ -0,0 +1,61 @@
+/*
+ * linux/drivers/timesource/jiffies.c
+ *
+ * Copyright (C) 2004 IBM
+ *
+ * This file contains the jiffies based time source.
+ *
+ */
+#include <linux/timesource.h>
+#include <linux/jiffies.h>
+#include <linux/init.h>
+
+/* The Jiffies based timesource is the lowest common
+ * denominator time source which should function on
+ * all systems. It has the same course resolution as
+ * the timer interrupt frequency HZ and it suffers
+ * inaccuracies caused by missed or lost timer
+ * interrupts and the inability for the timer
+ * interrupt hardware to accuratly tick at the
+ * requested HZ value. It is also not reccomended
+ * for "tick-less" systems.
+ */
+
+static cycle_t jiffies_read(void)
+{
+ cycle_t ret = get_jiffies_64();
+ return ret;
+}
+
+static cycle_t jiffies_delta(cycle_t now, cycle_t then)
+{
+ /* simple subtraction, no need to mask */
+ return now - then;
+}
+
+static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
+{
+
+ cyc *= NSEC_PER_SEC/HZ;
+
+ /* no remainder, so set it to zero */
+ if (remainder)
+ *remainder = 0;
+
+ return (nsec_t)cyc;
+}
+
+struct timesource_t timesource_jiffies = {
+ .name = "jiffies",
+ .priority = 0, /* lowest priority*/
+ .read = jiffies_read,
+ .delta = jiffies_delta,
+ .cyc2ns = jiffies_cyc2ns,
+};
+
+static int init_jiffies_timesource(void)
+{
+ register_timesource(&timesource_jiffies);
+ return 0;
+}
+module_init(init_jiffies_timesource);
diff -Nru a/include/linux/ntp.h b/include/linux/ntp.h
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/include/linux/ntp.h 2004-09-02 13:27:00 -07:00
@@ -0,0 +1,22 @@
+/* linux/include/linux/ntp.h
+ *
+ * Copyright (C) 2003, 2004 IBM, John Stultz ([email protected])
+ *
+ * This file contains time of day helper functions
+ */
+
+#ifndef _LINUX_NTP_H
+#define _LINUX_NTP_H
+#include <linux/types.h>
+#include <linux/time.h>
+#include <linux/timex.h>
+
+/* timeofday interfaces */
+nsec_t ntp_scale(nsec_t value);
+void ntp_advance(nsec_t value);
+int ntp_adjtimex(struct timex*);
+int ntp_leapsecond(struct timespec now);
+void ntp_clear(void);
+int get_ntp_status(void);
+
+#endif
diff -Nru a/include/linux/time.h b/include/linux/time.h
--- a/include/linux/time.h 2004-09-02 13:27:00 -07:00
+++ b/include/linux/time.h 2004-09-02 13:27:00 -07:00
@@ -24,6 +24,10 @@

#ifdef __KERNEL__

+/* timeofday base types */
+typedef u64 nsec_t;
+typedef u64 cycle_t;
+
#include <linux/spinlock.h>
#include <linux/seqlock.h>
#include <linux/timex.h>
diff -Nru a/include/linux/timeofday.h b/include/linux/timeofday.h
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/include/linux/timeofday.h 2004-09-02 13:27:00 -07:00
@@ -0,0 +1,61 @@
+/* linux/include/linux/timeofday.h
+ *
+ * Copyright (C) 2003, 2004 IBM, John Stultz ([email protected])
+ *
+ * This file contains the interface to the time of day subsystem
+ */
+#ifndef _LINUX_TIMEOFDAY_H
+#define _LINUX_TIMEOFDAY_H
+#include <linux/types.h>
+#include <linux/time.h>
+
+#ifdef CONFIG_NEWTOD
+nsec_t get_lowres_timestamp(void);
+nsec_t get_lowres_timeofday(void);
+nsec_t do_monotonic_clock(void);
+
+
+void do_gettimeofday(struct timeval *tv);
+int do_settimeofday(struct timespec *tv);
+int do_adjtimex(struct timex *tx);
+
+void timeofday_interrupt_hook(void);
+void timeofday_init(void);
+
+
+/* Helper functions */
+static inline struct timeval ns2timeval(nsec_t ns)
+{
+ struct timeval tv;
+ tv.tv_sec = div_long_long_rem(ns, NSEC_PER_SEC, &tv.tv_usec);
+ tv.tv_usec /= NSEC_PER_USEC;
+ return tv;
+}
+
+static inline struct timespec ns2timespec(nsec_t ns)
+{
+ struct timespec ts;
+ ts.tv_sec = div_long_long_rem(ns, NSEC_PER_SEC, &ts.tv_nsec);
+ return ts;
+}
+
+static inline u64 timespec2ns(struct timespec* ts)
+{
+ nsec_t ret;
+ ret = ((nsec_t)ts->tv_sec) * NSEC_PER_SEC;
+ ret += ts->tv_nsec;
+ return ret;
+}
+
+static inline nsec_t timeval2ns(struct timeval* tv)
+{
+ nsec_t ret;
+ ret = ((nsec_t)tv->tv_sec) * NSEC_PER_SEC;
+ ret += tv->tv_usec*NSEC_PER_USEC;
+ return ret;
+}
+#else /* CONFIG_NEWTOD */
+#define timeofday_interrupt_hook()
+#define timeofday_init()
+#endif /* CONFIG_NEWTOD */
+#endif /* _LINUX_TIMEOFDAY_H */
diff -Nru a/include/linux/timesource.h b/include/linux/timesource.h
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/include/linux/timesource.h 2004-09-02 13:27:00 -07:00
@@ -0,0 +1,36 @@
+/* linux/include/linux/timesource.h
+ *
+ * Copyright (C) 2003, 2004 IBM, John Stultz ([email protected])
+ *
+ * This file contains the structure definitions for timesources.
+ *
+ * If you are not a timesource, or the time of day code, you should
+ * not be including this file.
+ */
+#ifndef _LINUX_TIMESORUCE_H
+#define _LINUX_TIMESORUCE_H
+
+#include <linux/types.h>
+#include <linux/time.h>
+
+/* struct timesource_t:
+ * Provides mostly state-free accessors to the underlying
+ * hardware.
+ * name: ptr to timesource name
+ * priority:priority value (higher is better)
+ * @read: returns a cycle value
+ * @delta: calculates the difference between two cycle values
+ * @cyc2ns: converts a cycle value to ns (expected to be expensive)
+ */
+struct timesource_t {
+ char* name;
+ int priority;
+ cycle_t (*read)(void);
+ cycle_t (*delta)(cycle_t now, cycle_t then);
+ nsec_t (*cyc2ns)(cycle_t cycles, cycle_t* remainder);
+};
+
+/* used to install a new time source */
+void register_timesource(struct timesource_t*);
+
+#endif
diff -Nru a/init/main.c b/init/main.c
--- a/init/main.c 2004-09-02 13:27:00 -07:00
+++ b/init/main.c 2004-09-02 13:27:00 -07:00
@@ -45,6 +45,7 @@
#include <linux/unistd.h>
#include <linux/rmap.h>
#include <linux/mempolicy.h>
+#include <linux/timeofday.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -514,6 +515,7 @@
pidhash_init();
init_timers();
softirq_init();
+ timeofday_init();
time_init();

/*
diff -Nru a/kernel/Makefile b/kernel/Makefile
--- a/kernel/Makefile 2004-09-02 13:27:00 -07:00
+++ b/kernel/Makefile 2004-09-02 13:27:00 -07:00
@@ -9,6 +9,7 @@
rcupdate.o intermodule.o extable.o params.o posix-timers.o \
kthread.o

+obj-$(CONFIG_NEWTOD) += timeofday.o ntp.o
obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
obj-$(CONFIG_SMP) += cpu.o
diff -Nru a/kernel/ntp.c b/kernel/ntp.c
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/kernel/ntp.c 2004-09-02 13:27:00 -07:00
@@ -0,0 +1,554 @@
+/********************************************************************
+* linux/kernel/ntp.c
+*
+* NTP state machine and time scaling code.
+*
+* Copyright (C) 2004 IBM, John Stultz ([email protected])
+*
+* Portions rewritten from kernel/time.c and kernel/timer.c
+* Please see those files for original copyrights.
+*
+* Hopefully you should never have to understand or touch
+* any of the code below. but don't let that keep you from trying!
+*
+* This code is loosely based on David Mills' RFC 1589 and its
+* updates. Please see the following for more details:
+* http://www.eecis.udel.edu/~mills/database/rfc/rfc1589.txt
+* http://www.eecis.udel.edu/~mills/database/reports/kern/kernb.pdf
+*
+* NOTE: To simplify the code, we do not implement any of
+* the PPS code, as the code that uses it never was merged.
+* [email protected]
+*
+* Revision History:
+* 2004-09-02: A0
+* o First pass sent to lkml for review.
+*
+*
+* TODO List:
+* o More documentation
+* o More testing
+* o More optimization
+*********************************************************************/
+
+#include <linux/ntp.h>
+#include <linux/errno.h>
+#include <linux/sched.h> /* Needed for capable() */
+
+/* NTP scaling code
+ * Functions:
+ * ----------
+ * nsec_t ntp_scale(nsec_t value):
+ * Scales the nsec_t vale using ntp kernel state
+ * void ntp_advance(nsec_t interval):
+ * Increments the NTP state machine by interval time
+ * static int ntp_hardupdate(long offset, struct timeval tv)
+ * ntp_adjtimex helper function
+ * int ntp_adjtimex(struct timex* tx):
+ * Interface to adjust NTP state machine
+ * int ntp_leapsecond(struct timespec now)
+ * Does NTP leapsecond processing. Returns number of
+ * seconds current time should be adjusted by.
+ * void ntp_clear(void):
+ * Clears the ntp kernel state
+ * int get_ntp_status(void):
+ * returns ntp_status value
+ *
+ * Variables:
+ * ----------
+ * ntp kernel state variables:
+ * See below for full list.
+ * ntp_lock:
+ * Protects ntp kernel state variables
+ */
+
+
+
+/* Chapter 5: Kernel Variables [RFC 1589 pg. 28] */
+/* 5.1 Interface Variables */
+static int ntp_status = STA_UNSYNC; /* status */
+static long ntp_offset; /* usec */
+static long ntp_constant = 2; /* ntp magic? */
+static long ntp_maxerror = NTP_PHASE_LIMIT; /* usec */
+static long ntp_esterror = NTP_PHASE_LIMIT; /* usec */
+static const long ntp_tolerance = MAXFREQ; /* shifted ppm */
+static const long ntp_precision = 1; /* constant */
+
+/* 5.2 Phase-Lock Loop Variables */
+static long ntp_freq; /* shifted ppm */
+static long ntp_reftime; /* sec */
+
+/* Extra values */
+static int ntp_state = TIME_OK; /* leapsecond state */
+static long ntp_tick = USEC_PER_SEC/USER_HZ; /* tick length */
+
+static s64 ss_offset_len; /* SINGLESHOT offset adj interval (nsec)*/
+static long singleshot_adj; /* +/- MAX_SINGLESHOT_ADJ (ppm)*/
+static long tick_adj; /* tx->tick adjustment (ppm) */
+static long offset_adj; /* offset adjustment (ppm) */
+
+
+/* lock for the above variables */
+static seqlock_t ntp_lock = SEQLOCK_UNLOCKED;
+
+#define MILLION 1000000
+#define MAX_SINGLESHOT_ADJ 500 /* (ppm) */
+#define SEC_PER_DAY 86400
+
+/* Required to safely shift negative values */
+#define shiftR(x,s) (x < 0) ? (-((-x) >> (s))) : ((x) >> (s))
+
+/* nsec_t ntp_scale(nsec_t):
+ * Scale the raw time interval to an NTP scaled value
+ *
+ * This is done in three steps:
+ * 1. Tick adjustment.
+ * 2. Frequency adjustment.
+ * 3. Singleshot offset adjustment
+ *
+ * The first two are done over the entire raw interval. While
+ * the third is only done as long as the raw interval is less
+ * then the singleshot offset length.
+ *
+ * The basic equasion is:
+ * raw = raw + ppm scaling
+ *
+ * So while raw < ss_offset_len:
+ * raw = raw + (raw*(tick_adj+freq_adj+ss_offset_adj))/MILLION
+ *
+ * But when raw is > ss_offset_len:
+ * raw = raw
+ * + (ss_offset_len*(tick_adj+freq_adj+ss_offset_adj))/MILLION
+ * + ((raw - ss_offset_len)*(tick_adj+freq_adj))/MILLION
+ *
+ * XXX - TODO:
+ * o Get rid of nasty 64bit divides (try shifts)
+ * o more comments!
+ */
+nsec_t ntp_scale(nsec_t raw)
+{
+ u64 ret = (u64)raw;
+ unsigned long seq;
+ u64 offset_len;
+ long freq_ppm, tick_ppm, ss_ppm, offset_ppm;
+
+ /* save off current kernel state */
+ do {
+ seq = read_seqbegin(&ntp_lock);
+
+ tick_ppm = tick_adj;
+ offset_ppm = offset_adj;
+ freq_ppm = shiftR(ntp_freq,SHIFT_USEC);
+ ss_ppm = singleshot_adj;
+ offset_len = (u64)ss_offset_len;
+
+ } while (read_seqretry(&ntp_lock, seq));
+
+
+ /* Due to sign issues, this is a bit ugly */
+ if (raw <= offset_len) {
+ /* calcluate total ppm */
+ long ppm = tick_ppm + freq_ppm + offset_ppm + ss_ppm;
+
+ /* use abs(ppm) to avoid sign issues w/ do_div */
+ u64 tmp = raw * abs(ppm);
+
+ /* XXX - try to replace this w/ a 64bit shift */
+ do_div(tmp, MILLION);
+
+ /* reapply the sign lost by using abs(ppm) */
+ if(ppm < 0)
+ ret -= tmp;
+ else
+ ret += tmp;
+
+ } else {
+ /* Only apply MAX_SINGLESHOT_ADJ for the ss_offset_len */
+ s64 v1 = offset_len*(tick_ppm + freq_ppm + offset_ppm + ss_ppm);
+ s64 v2 = (s64)(raw - offset_len)*(tick_ppm + freq_ppm+ offset_ppm);
+
+ s64 tmp = v1+v2;
+
+ /* use abs(tmp) to avoid sign issues w/ do_div */
+ u64 tmp2 = abs(tmp);
+
+ /* XXX - try to replace this w/ a 64bit shift */
+ do_div(tmp2,MILLION);
+
+ /* reapply the sign lost by using abs(tmp2) */
+ if(tmp < 0)
+ ret -= tmp2;
+ else
+ ret += tmp2;
+ }
+
+ return (nsec_t)ret;
+}
+
+/* void ntp_advance(nsec_t interval):
+ * Periodic hook which increments NTP state machine by interval
+ * This is ntp_hardclock in the RFC.
+ */
+void ntp_advance(nsec_t interval)
+{
+ static u64 interval_sum=0;
+
+ /* inc interval sum */
+ interval_sum += interval;
+
+ write_seqlock_irq(&ntp_lock);
+
+ /* decrement singleshot offset interval */
+ ss_offset_len =- interval;
+ if(ss_offset_len < 0) /* make sure it doesn't go negative */
+ ss_offset_len=0;
+
+ /* Do second overflow code */
+ while (interval_sum > NSEC_PER_SEC) {
+ /* XXX - I'd prefer to smoothly apply this math
+ * at each call to ntp_advance() rather then each
+ * second.
+ */
+ long tmp;
+
+ /* Bump maxerror by ntp_tolerance */
+ ntp_maxerror += shiftR(ntp_tolerance, SHIFT_USEC);
+ if (ntp_maxerror > NTP_PHASE_LIMIT) {
+ ntp_maxerror = NTP_PHASE_LIMIT;
+ ntp_status |= STA_UNSYNC;
+ }
+
+ /* Calculate offset_adj for the next second */
+ tmp = ntp_offset;
+ if (!(ntp_status & STA_FLL))
+ tmp = shiftR(tmp, SHIFT_KG + ntp_constant);
+
+ /* bound the adjustment to MAXPHASE/MINSEC */
+ if (tmp > (MAXPHASE / MINSEC) << SHIFT_UPDATE)
+ tmp = (MAXPHASE / MINSEC) << SHIFT_UPDATE;
+ if (tmp < -(MAXPHASE / MINSEC) << SHIFT_UPDATE)
+ tmp = -(MAXPHASE / MINSEC) << SHIFT_UPDATE;
+
+ offset_adj = shiftR(tmp, SHIFT_UPDATE); /* (usec/sec) = ppm */
+ ntp_offset -= tmp;
+
+ interval_sum -= NSEC_PER_SEC;
+ }
+
+/* XXX - if still needed do equiv code to switching time_next_adjust */
+
+ write_sequnlock_irq(&ntp_lock);
+
+}
+
+
+/* called only by ntp_adjtimex while holding ntp_lock */
+static int ntp_hardupdate(long offset, struct timeval tv)
+{
+ int ret;
+ long tmp, interval;
+
+ ret = 0;
+ if (!(ntp_status & STA_PLL))
+ return ret;
+
+ tmp = offset;
+ /* Make sure offset is bounded by MAXPHASE */
+ if (tmp > MAXPHASE)
+ tmp = MAXPHASE;
+ if (tmp < -MAXPHASE)
+ tmp = -MAXPHASE;
+
+ ntp_offset = tmp << SHIFT_UPDATE;
+
+ if ((ntp_status & STA_FREQHOLD) || (ntp_reftime == 0))
+ ntp_reftime = tv.tv_sec;
+
+ /* calculate seconds since last call to hardupdate */
+ interval = tv.tv_sec - ntp_reftime;
+ ntp_reftime = tv.tv_sec;
+
+ if ((ntp_status & STA_FLL) && (interval >= MINSEC)) {
+ long damping;
+ tmp = (offset / interval); /* ppm (usec/sec)*/
+
+ /* convert to shifted ppm, then apply damping factor */
+
+ /* calculate damping factor - XXX bigger comment!*/
+ damping = SHIFT_KH - SHIFT_USEC;
+
+ /* apply damping factor */
+ ntp_freq += shiftR(tmp,damping);
+
+ printk("ntp->freq change: %ld\n",shiftR(tmp,damping));
+
+ } else if ((ntp_status & STA_PLL) && (interval < MAXSEC)) {
+ long damping;
+ tmp = offset * interval; /* ppm XXX - not quite*/
+
+ /* calculate damping factor - XXX bigger comment!*/
+ damping = (2 * ntp_constant) + SHIFT_KF - SHIFT_USEC;
+
+ /* apply damping factor */
+ ntp_freq += shiftR(tmp,damping);
+
+ printk("ntp->freq change: %ld\n", shiftR(tmp,damping));
+
+ } else { /* interval out of bounds */
+ printk("interval out of bounds: %ld\n", interval);
+ ret = -1; /* TIME_ERROR */
+ }
+
+ /* bound ntp_freq */
+ if (ntp_freq > ntp_tolerance)
+ ntp_freq = ntp_tolerance;
+ if (ntp_freq < -ntp_tolerance)
+ ntp_freq = -ntp_tolerance;
+
+ return ret;
+}
+
+/* int ntp_adjtimex(struct timex* tx)
+ * Interface to change NTP state machine
+ */
+int ntp_adjtimex(struct timex* tx)
+{
+ long save_offset;
+ int result;
+
+/*=[Sanity checking]===============================*/
+ /* Check capabilities if we're trying to modify something */
+ if (tx->modes && !capable(CAP_SYS_TIME))
+ return -EPERM;
+
+ /* frequency adjustment limited to +/- MAXFREQ */
+ if ((tx->modes & ADJ_FREQUENCY)
+ && (abs(tx->freq) > MAXFREQ))
+ return -EINVAL;
+
+ /* maxerror adjustment limited to NTP_PHASE_LIMIT */
+ if ((tx->modes & ADJ_MAXERROR)
+ && (tx->maxerror < 0
+ || tx->maxerror >= NTP_PHASE_LIMIT))
+ return -EINVAL;
+
+ /* esterror adjustment limited to NTP_PHASE_LIMIT */
+ if ((tx->modes & ADJ_ESTERROR)
+ && (tx->esterror < 0
+ || tx->esterror >= NTP_PHASE_LIMIT))
+ return -EINVAL;
+
+ /* constant adjustment must be positive */
+ if ((tx->modes & ADJ_TIMECONST)
+ && (tx->constant < 0))
+ return -EINVAL;
+
+ /* Single shot mode can only be used by itself */
+ if (((tx->modes & ADJ_OFFSET_SINGLESHOT) == ADJ_OFFSET_SINGLESHOT)
+ && (tx->modes != ADJ_OFFSET_SINGLESHOT))
+ return -EINVAL;
+
+ /* offset adjustment limited to +/- MAXPHASE */
+ if ((tx->modes != ADJ_OFFSET_SINGLESHOT)
+ && (tx->modes & ADJ_OFFSET)
+ && (abs(tx->offset)>= MAXPHASE))
+ return -EINVAL;
+
+ /* tick adjustment limited to 10% */
+ if ((tx->modes & ADJ_TICK)
+ && ((tx->tick < 900000/USER_HZ)
+ ||(tx->tick > 11000000/USER_HZ)))
+ return -EINVAL;
+
+ /* dbg output XXX - yank me! */
+ if(tx->modes) {
+ printk("adjtimex: tx->offset: %ld tx->freq: %ld\n",
+ tx->offset, tx->freq);
+ }
+
+/*=[Kernel input bits]==========================*/
+ write_seqlock_irq(&ntp_lock);
+
+ result = ntp_state;
+
+ /* For ADJ_OFFSET_SINGLESHOT we must return the old offset */
+ save_offset = shiftR(ntp_offset, SHIFT_UPDATE);
+
+ /* Process input parameters */
+ if (tx->modes & ADJ_STATUS) {
+ ntp_status &= STA_RONLY;
+ ntp_status |= tx->status & ~STA_RONLY;
+ }
+
+ if (tx->modes & ADJ_FREQUENCY)
+ ntp_freq = tx->freq;
+
+ if (tx->modes & ADJ_MAXERROR)
+ ntp_maxerror = tx->maxerror;
+
+ if (tx->modes & ADJ_ESTERROR)
+ ntp_esterror = tx->esterror;
+
+ if (tx->modes & ADJ_TIMECONST)
+ ntp_constant = tx->constant;
+
+ if (tx->modes & ADJ_OFFSET) {
+ if (tx->modes == ADJ_OFFSET_SINGLESHOT) {
+ if (tx->offset < 0)
+ singleshot_adj = -MAX_SINGLESHOT_ADJ;
+ else
+ singleshot_adj = MAX_SINGLESHOT_ADJ;
+ /* Calculate single shot offset interval:
+ *
+ * len = offset/(+/-)MAX_SINGESHOT_ADJ
+ * len = abs(offset)/MAX_SINGESHOT_ADJ
+ * len = (abs(offset)*NSEC_PER_USEC)
+ * /(MAX_SINGESHOT_ADJ /NSEC_PER_SEC)
+ * len = (abs(offset)*NSEC_PER_USEC*NSEC_PER_SEC)/MAX_SINGESHOT_ADJ
+ * len = abs(offset)*NSEC_PER_SEC*(NSEC_PER_USEC/MAX_SINGESHOT_ADJ)
+ */
+ ss_offset_len = abs(tx->offset);
+ ss_offset_len *= NSEC_PER_SEC * (NSEC_PER_USEC/MAX_SINGLESHOT_ADJ);
+ }
+ /* call hardupdate() */
+ if (ntp_hardupdate(tx->offset, tx->time))
+ result = TIME_ERROR;
+ }
+
+ if (tx->modes & ADJ_TICK) {
+ /* first calculate usec/user_tick offset */
+ tick_adj = (USEC_PER_SEC/USER_HZ) - tx->tick;
+ /* multiply by user_hz to get usec/sec => ppm */
+ tick_adj *= USER_HZ;
+ /* save tx->tick for future calls to adjtimex */
+ ntp_tick = tx->tick;
+ }
+
+ if ((ntp_status & (STA_UNSYNC|STA_CLOCKERR)) != 0 )
+ result = TIME_ERROR;
+
+/*=[Kernel output bits]================================*/
+ /* write kernel state to user timex values*/
+ if ((tx->modes & ADJ_OFFSET_SINGLESHOT) == ADJ_OFFSET_SINGLESHOT)
+ tx->offset = save_offset;
+ else
+ tx->offset = shiftR(ntp_offset, SHIFT_UPDATE);
+
+ tx->freq = ntp_freq;
+ tx->maxerror = ntp_maxerror;
+ tx->esterror = ntp_esterror;
+ tx->status = ntp_status;
+ tx->constant = ntp_constant;
+ tx->precision = ntp_precision;
+ tx->tolerance = ntp_tolerance;
+
+ /* PPS is not implemented, so these are zero */
+ tx->ppsfreq = /*XXX - Not Implemented!*/ 0;
+ tx->jitter = /*XXX - Not Implemented!*/ 0;
+ tx->shift = /*XXX - Not Implemented!*/ 0;
+ tx->stabil = /*XXX - Not Implemented!*/ 0;
+ tx->jitcnt = /*XXX - Not Implemented!*/ 0;
+ tx->calcnt = /*XXX - Not Implemented!*/ 0;
+ tx->errcnt = /*XXX - Not Implemented!*/ 0;
+ tx->stbcnt = /*XXX - Not Implemented!*/ 0;
+
+ write_sequnlock_irq(&ntp_lock);
+
+ return result;
+}
+
+
+/* void ntp_leapsecond(struct timespec now):
+ * NTP Leapsecnod processing code. Returns the number of
+ * seconds (-1, 0, or 1) that should be added to the current
+ * time to properly adjust for leapseconds.
+ */
+int ntp_leapsecond(struct timespec now)
+{
+ /*
+ * Leap second processing. If in leap-insert state at
+ * the end of the day, the system clock is set back one
+ * second; if in leap-delete state, the system clock is
+ * set ahead one second. The microtime() routine or
+ * external clock driver will insure that reported time
+ * is always monotonic. The ugly divides should be
+ * replaced.
+ */
+ static time_t leaptime = 0;
+
+ switch (ntp_state) {
+ case TIME_OK:
+ if (ntp_status & STA_INS) {
+ ntp_state = TIME_INS;
+ /* calculate end of today (23:59:59)*/
+ leaptime = now.tv_sec + SEC_PER_DAY - (now.tv_sec % SEC_PER_DAY) - 1;
+ }
+ else if (ntp_status & STA_DEL) {
+ ntp_state = TIME_DEL;
+ /* calculate end of today (23:59:59)*/
+ leaptime = now.tv_sec + SEC_PER_DAY - (now.tv_sec % SEC_PER_DAY) - 1;
+ }
+ break;
+
+ case TIME_INS:
+ /* Once we are at (or past) leaptime, insert the second */
+ if (now.tv_sec > leaptime) {
+ ntp_state = TIME_OOP;
+ printk(KERN_NOTICE "Clock: inserting leap second 23:59:60 UTC\n");
+
+ return -1;
+ }
+ break;
+
+ case TIME_DEL:
+ /* Once we are at (or past) leaptime, delete the second */
+ if (now.tv_sec >= leaptime) {
+ ntp_state = TIME_WAIT;
+ printk(KERN_NOTICE "Clock: deleting leap second 23:59:59 UTC\n");
+
+ return 1;
+ }
+ break;
+
+ case TIME_OOP:
+ /* Wait for the end of the leap second*/
+ if (now.tv_sec > (leaptime + 1))
+ ntp_state = TIME_WAIT;
+ break;
+
+ case TIME_WAIT:
+ if (!(ntp_status & (STA_INS | STA_DEL)))
+ ntp_state = TIME_OK;
+ }
+
+ return 0;
+}
+
+/* void ntp_clear(void):
+ * Clears the NTP state machine.
+ */
+void ntp_clear(void)
+{
+ write_seqlock_irq(&ntp_lock);
+
+ /* clear everything */
+ ntp_status |= STA_UNSYNC;
+ ntp_maxerror = NTP_PHASE_LIMIT;
+ ntp_esterror = NTP_PHASE_LIMIT;
+ ss_offset_len=0;
+ singleshot_adj=0;
+ tick_adj=0;
+ offset_adj =0;
+
+ write_sequnlock_irq(&ntp_lock);
+
+}
+
+/* int get_ntp_status(void):
+ * Returns the NTP status.
+ */
+int get_ntp_status(void)
+{
+ return ntp_status;
+}
+
diff -Nru a/kernel/time.c b/kernel/time.c
--- a/kernel/time.c 2004-09-02 13:27:00 -07:00
+++ b/kernel/time.c 2004-09-02 13:27:00 -07:00
@@ -33,6 +33,7 @@
#include <linux/smp_lock.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
+#include <linux/timeofday.h>

/*
* The timezone where the local system is located. Used as a default by some
@@ -210,6 +211,7 @@
/* adjtimex mainly allows reading (and writing, if superuser) of
* kernel time-keeping variables. used by xntpd.
*/
+#ifndef CONFIG_NEWTOD
int do_adjtimex(struct timex *txc)
{
long ltemp, mtemp, save_adjust;
@@ -392,6 +394,7 @@
do_gettimeofday(&txc->time);
return(result);
}
+#endif

asmlinkage long sys_adjtimex(struct timex __user *txc_p)
{
diff -Nru a/kernel/timeofday.c b/kernel/timeofday.c
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/kernel/timeofday.c 2004-09-02 13:27:00 -07:00
@@ -0,0 +1,330 @@
+/*********************************************************************
+* linux/kernel/timeofday.c
+*
+* Copyright (C) 2003, 2004 IBM, John Stultz ([email protected])
+*
+* This file contains the functions which access and manage
+* the system's time of day functionality.
+*
+* Revision History:
+* 2004-09-02: A0
+* o First pass sent to lkml for review.
+*
+* TODO List:
+* o More testing
+**********************************************************************/
+
+#include <linux/timeofday.h>
+#include <linux/timesource.h>
+#include <linux/ntp.h>
+#include <linux/timex.h>
+
+/*XXX - remove later */
+#define TIME_DBG 1
+#define TIME_DBG_FREQ 120000
+
+/*[Nanosecond based variables]----------------
+ * system_time:
+ * Monotonically increasing counter of the number of nanoseconds
+ * since boot.
+ * wall_time_offset:
+ * Offset added to system_time to provide accurate time-of-day
+ */
+static nsec_t system_time;
+static nsec_t wall_time_offset;
+
+
+/*[Cycle based variables]----------------
+ * offset_base:
+ * Value of the timesource at the last clock_interrupt_hook()
+ * (adjusted only minorly to account for rounded off cycles)
+ */
+static cycle_t offset_base;
+
+/*[Time source data]-------------------
+ * timesource
+ * current timesource pointer (initialized to timesource_jiffies)
+ * next_timesource:
+ * pointer to the timesource that will be installed at the next hook
+ */
+extern struct timesource_t timesource_jiffies;
+static struct timesource_t *timesource = &timesource_jiffies;
+static struct timesource_t *next_timesource;
+
+/*[Locks]----------------------------
+ * system_time_lock:
+ * generic lock for all locally scoped time values
+ */
+static seqlock_t system_time_lock = SEQLOCK_UNLOCKED;
+
+
+/* [XXX - Hacks]--------------------
+ * Makes stuff compile
+ */
+extern unsigned long get_cmos_time(void);
+extern void sync_persistant_clock(struct timespec ts);
+
+/* get_lowres_timestamp():
+ * Returns a low res timestamp.
+ * (ie: the value of system_time as calculated at
+ * the last invocation of clock_interrupt_hook() )
+ */
+nsec_t get_lowres_timestamp(void)
+{
+ nsec_t ret;
+ unsigned long seq;
+ do {
+ seq = read_seqbegin(&system_time_lock);
+
+ /* quickly grab system_time*/
+ ret = system_time;
+
+ } while (read_seqretry(&system_time_lock, seq));
+
+ return ret;
+}
+
+/* get_lowres_timeofday():
+ * Returns a low res time of day, as calculated at the
+ * last invocation of clock_interrupt_hook()
+ */
+nsec_t get_lowres_timeofday(void)
+{
+ nsec_t ret;
+ unsigned long seq;
+ do {
+ seq = read_seqbegin(&system_time_lock);
+
+ /* quickly calculate low-res time of day */
+ ret = system_time + wall_time_offset;
+
+ } while (read_seqretry(&system_time_lock, seq));
+
+ return ret;
+}
+
+
+/* __monotonic_clock():
+ * private function, must hold system_time_lock lock when being
+ * called. Returns the monotonically increasing number of
+ * nanoseconds since the system booted (adjusted by NTP scaling)
+ */
+static nsec_t __monotonic_clock(void)
+{
+ nsec_t ret, ns_offset;
+ cycle_t now, delta;
+
+ /* read timesource */
+ now = timesource->read();
+
+ /* calculate the delta since the last clock_interrupt */
+ delta = timesource->delta(now, offset_base);
+
+ /* convert to nanoseconds */
+ ns_offset = timesource->cyc2ns(delta, 0);
+
+ /* apply the NTP scaling */
+ ns_offset = ntp_scale(ns_offset);
+
+ /* add result to system time */
+ ret = system_time + ns_offset;
+
+ return ret;
+}
+
+
+/* do_monotonic_clock():
+ * Returns the monotonically increasing number of nanoseconds
+ * since the system booted via __monotonic_clock()
+ */
+nsec_t do_monotonic_clock(void)
+{
+ nsec_t ret;
+ unsigned long seq;
+
+ /* atomically read __monotonic_clock() */
+ do {
+ seq = read_seqbegin(&system_time_lock);
+
+ ret = __monotonic_clock();
+
+ } while (read_seqretry(&system_time_lock, seq));
+
+ return ret;
+}
+
+
+/* do_gettimeofday():
+ * Returns the time of day
+ */
+void do_gettimeofday(struct timeval *tv)
+{
+ nsec_t wall, sys;
+ unsigned long seq;
+
+ /* atomically read wall and sys time */
+ do {
+ seq = read_seqbegin(&system_time_lock);
+
+ wall = wall_time_offset;
+ sys = __monotonic_clock();
+
+ } while (read_seqretry(&system_time_lock, seq));
+
+ /* add them and convert to timeval */
+ *tv = ns2timeval(wall+sys);
+}
+
+
+/* do_settimeofday():
+ * Sets the time of day
+ */
+int do_settimeofday(struct timespec *tv)
+{
+ /* convert timespec to ns */
+ nsec_t newtime = timespec2ns(tv);
+
+ /* atomically adjust wall_time_offset to the desired value */
+ write_seqlock_irq(&system_time_lock);
+
+ wall_time_offset = newtime - __monotonic_clock();
+
+ /* clear NTP settings */
+ ntp_clear();
+
+ write_sequnlock_irq(&system_time_lock);
+
+ return 0;
+}
+
+/* do_adjtimex:
+ * Userspace NTP daemon's interface to the kernel NTP variables
+ */
+int do_adjtimex(struct timex *tx)
+{
+ do_gettimeofday(&tx->time); /* set timex->time*/
+ /* Note: We set tx->time first, */
+ /* because ntp_adjtimex uses it */
+
+ return ntp_adjtimex(tx); /* call out to NTP code */
+}
+
+
+/* timeofday_interrupt_hook:
+ * calculates the delta since the last interrupt,
+ * updates system time and clears the offset.
+ * likely called by timer_interrupt()
+ */
+void timeofday_interrupt_hook(void)
+{
+ cycle_t now, delta, remainder;
+ nsec_t ns, ntp_ns;
+ long leapsecond;
+
+ write_seqlock(&system_time_lock);
+
+ /* read time source */
+ now = timesource->read();
+
+ /* calculate cycle delta */
+ delta = timesource->delta(now, offset_base);
+
+ /* convert cycles to ns and save remainder */
+ ns = timesource->cyc2ns(delta, &remainder);
+
+ /* apply NTP scaling factor for this tick */
+ ntp_ns = ntp_scale(ns);
+
+#if TIME_DBG /* XXX - remove later*/
+{
+ static int dbg=0;
+ if(!(dbg++%TIME_DBG_FREQ)){
+ printk("now: %lluc - then: %lluc = delta: %lluc -> %llu ns + %llu cyc (ntp: %lluc)\n",
+ now, offset_base, delta, ns, remainder, ntp_ns);
+ }
+}
+#endif
+ /* update system_time */
+ system_time += ntp_ns;
+
+ /* reset the offset_base */
+ offset_base = now;
+
+ /* subtract remainder to account for rounded off cycles */
+ offset_base = timesource->delta(offset_base,remainder);
+
+ /* advance the ntp state machine by ns*/
+ ntp_advance(ns);
+
+ /* do ntp leap second processing*/
+ leapsecond = ntp_leapsecond(ns2timespec(system_time+wall_time_offset));
+ if (leapsecond == 1) /* XXX could this be cleaner? */
+ wall_time_offset += 1;
+ if (leapsecond == -1)
+ wall_time_offset -= 1;
+
+ /* sync the persistant clock */
+ if (!(get_ntp_status() & STA_UNSYNC))
+ sync_persistant_clock(ns2timespec(system_time + wall_time_offset));
+
+ /* if necessary, switch timesources */
+ if (next_timesource) {
+ /* immediately set new offset_base */
+ offset_base = next_timesource->read();
+ /* swap timesources */
+ timesource = next_timesource;
+ next_timesource = 0;
+
+ printk(KERN_INFO "Time: %s timesource has been installed\n",
+ timesource->name);
+ }
+
+
+ /* update legacy time values */
+ write_seqlock(&xtime_lock);
+ xtime = ns2timespec(system_time + wall_time_offset);
+ wall_to_monotonic = ns2timespec(wall_time_offset);
+ wall_to_monotonic.tv_sec = -wall_to_monotonic.tv_sec;
+ wall_to_monotonic.tv_nsec = -wall_to_monotonic.tv_nsec;
+ write_sequnlock(&xtime_lock);
+
+ write_sequnlock(&system_time_lock);
+}
+
+/* register_timesource():
+ * Used to install a new timesource
+ */
+void register_timesource(struct timesource_t* t)
+{
+ write_seqlock(&system_time_lock);
+
+ /* XXX - check override */
+
+ /* if next_timesource has been set, make sure we beat that one too */
+ if (next_timesource && (t->priority > next_timesource->priority))
+ next_timesource = t;
+ else if(t->priority > timesource->priority)
+ next_timesource = t;
+
+ write_sequnlock(&system_time_lock);
+}
+
+
+/* timeofday_init():
+ * Initializes time variables
+ */
+void timeofday_init(void)
+{
+ write_seqlock(&system_time_lock);
+
+ /* clear and initialize offsets*/
+ offset_base = timesource->read();
+ wall_time_offset = ((u64)get_cmos_time()) * NSEC_PER_SEC;
+
+ /* clear NTP scaling factor*/
+ ntp_clear();
+
+ write_sequnlock(&system_time_lock);
+
+ return;
+}
diff -Nru a/kernel/timer.c b/kernel/timer.c
--- a/kernel/timer.c 2004-09-02 13:27:00 -07:00
+++ b/kernel/timer.c 2004-09-02 13:27:00 -07:00
@@ -559,6 +559,7 @@
int tickadj = 500/HZ ? : 1; /* microsecs */


+#ifndef CONFIG_NEWTOD
/*
* phase-lock loop variables
*/
@@ -786,6 +787,9 @@
second_overflow();
}
}
+#else /* CONFIG_NEWTOD */
+#define update_wall_time(x)
+#endif /* CONFIG_NEWTOD */

static inline void do_process_times(struct task_struct *p,
unsigned long user, unsigned long system)


2004-09-02 21:32:10

by john stultz

[permalink] [raw]
Subject: [RFC][PATCH] new timeofday i386 timesources (v.A0)

All,
This patch implements most of the i386 time sources (tsc, pit, cyclone,
acpi-pm). It applies ontop of my linux-2.6.9-rc1_timeofday-i386_A0
patch. It provides real timesources (opposed to the example jiffies
timesource) that can be used for more realistic testing.

This patch is the shabbiest of the three. It needs to be broken up, and
cleaned. The i386_pit.c is broken and the cpufreq code needs to be added
to tsc.c. But it will get you going so you can test and play with the
core code.

thanks
-john

linux-2.6.9_rc1_timeofday-i386-timesources_A0.patch
===================================================
diff -Nru a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile
--- a/arch/i386/kernel/Makefile 2004-09-02 13:45:15 -07:00
+++ b/arch/i386/kernel/Makefile 2004-09-02 13:45:15 -07:00
@@ -7,7 +7,7 @@
obj-y := process.o semaphore.o signal.o entry.o traps.o irq.o vm86.o \
ptrace.o i8259.o ioport.o ldt.o setup.o time.o sys_i386.o \
pci-dma.o i386_ksyms.o i387.o dmi_scan.o bootflag.o \
- doublefault.o
+ doublefault.o tsc.o

obj-y += cpu/
obj-y += timers/
diff -Nru a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c
--- a/arch/i386/kernel/setup.c 2004-09-02 13:45:15 -07:00
+++ b/arch/i386/kernel/setup.c 2004-09-02 13:45:15 -07:00
@@ -48,6 +48,7 @@
#include <asm/io_apic.h>
#include <asm/ist.h>
#include <asm/io.h>
+#include <asm/tsc.h>
#include "setup_arch_pre.h"
#include <bios_ebda.h>

@@ -1413,6 +1414,7 @@
conswitchp = &dummy_con;
#endif
#endif
+ tsc_init();
}

#include "setup_arch_post.h"
diff -Nru a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
--- a/arch/i386/kernel/time.c 2004-09-02 13:45:15 -07:00
+++ b/arch/i386/kernel/time.c 2004-09-02 13:45:15 -07:00
@@ -412,6 +412,7 @@

void __init time_init(void)
{
+#ifndef CONFIG_NEWTOD
#ifdef CONFIG_HPET_TIMER
if (is_hpet_capable()) {
/*
@@ -429,6 +430,7 @@

cur_timer = select_timer();
printk(KERN_INFO "Using %s for high-res timesource\n",cur_timer->name);
+#endif

time_init_hook();
}
diff -Nru a/arch/i386/kernel/timers/Makefile b/arch/i386/kernel/timers/Makefile
--- a/arch/i386/kernel/timers/Makefile 2004-09-02 13:45:15 -07:00
+++ b/arch/i386/kernel/timers/Makefile 2004-09-02 13:45:15 -07:00
@@ -4,6 +4,6 @@

obj-y := timer.o timer_none.o timer_tsc.o timer_pit.o common.o

-obj-$(CONFIG_X86_CYCLONE_TIMER) += timer_cyclone.o
+#obj-$(CONFIG_X86_CYCLONE_TIMER) += timer_cyclone.o
obj-$(CONFIG_HPET_TIMER) += timer_hpet.o
-obj-$(CONFIG_X86_PM_TIMER) += timer_pm.o
+#obj-$(CONFIG_X86_PM_TIMER) += timer_pm.o
diff -Nru a/arch/i386/kernel/timers/common.c b/arch/i386/kernel/timers/common.c
--- a/arch/i386/kernel/timers/common.c 2004-09-02 13:45:15 -07:00
+++ b/arch/i386/kernel/timers/common.c 2004-09-02 13:45:15 -07:00
@@ -21,8 +21,6 @@
* device.
*/

-#define CALIBRATE_TIME (5 * 1000020/HZ)
-
unsigned long __init calibrate_tsc(void)
{
mach_prepare_counter();
diff -Nru a/arch/i386/kernel/timers/timer.c b/arch/i386/kernel/timers/timer.c
--- a/arch/i386/kernel/timers/timer.c 2004-09-02 13:45:15 -07:00
+++ b/arch/i386/kernel/timers/timer.c 2004-09-02 13:45:15 -07:00
@@ -14,13 +14,13 @@
/* list of timers, ordered by preference, NULL terminated */
static struct timer_opts* timers[] = {
#ifdef CONFIG_X86_CYCLONE_TIMER
- &timer_cyclone,
+// &timer_cyclone,
#endif
#ifdef CONFIG_HPET_TIMER
&timer_hpet,
#endif
#ifdef CONFIG_X86_PM_TIMER
- &timer_pmtmr,
+// &timer_pmtmr,
#endif
&timer_tsc,
&timer_pit,
diff -Nru a/arch/i386/kernel/tsc.c b/arch/i386/kernel/tsc.c
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/arch/i386/kernel/tsc.c 2004-09-02 13:45:15 -07:00
@@ -0,0 +1,39 @@
+#include <linux/init.h>
+#include <linux/timex.h>
+#include <asm/tsc.h>
+#include "mach_timer.h"
+
+unsigned long cpu_freq_khz;
+
+void tsc_init(void)
+{
+ unsigned long long start, end;
+ unsigned long count;
+ u64 delta64;
+ int i;
+
+ /* repeat 3 times to make sure the cache is warm */
+ for(i=0; i < 3; i++) {
+ mach_prepare_counter();
+ rdtscll(start);
+ mach_countup(&count);
+ rdtscll(end);
+ }
+ delta64 = end - start;
+
+ /* cpu freq too fast */
+ if(delta64 > (1ULL<<32))
+ return;
+ /* cpu freq too slow */
+ if (delta64 <= CALIBRATE_TIME)
+ return;
+
+ delta64 *= 1000;
+ do_div(delta64,CALIBRATE_TIME);
+ cpu_freq_khz = (unsigned long)delta64;
+
+ cpu_khz = cpu_freq_khz;
+
+ printk("Detected %lu.%03lu MHz processor.\n", cpu_khz / 1000, cpu_khz % 1000);
+
+}
diff -Nru a/drivers/timesource/Makefile b/drivers/timesource/Makefile
--- a/drivers/timesource/Makefile 2004-09-02 13:45:15 -07:00
+++ b/drivers/timesource/Makefile 2004-09-02 13:45:15 -07:00
@@ -1 +1,6 @@
obj-y += jiffies.o
+obj-$(CONFIG_X86) += i386_tsc.o
+#obj-$(CONFIG_X86) += i386_pit.o
+obj-$(CONFIG_X86_SUMMIT) += cyclone.o
+obj-$(CONFIG_X86_PM_TIMER) += acpi_pm.o
+
diff -Nru a/drivers/timesource/acpi_pm.c b/drivers/timesource/acpi_pm.c
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/acpi_pm.c 2004-09-02 13:45:15 -07:00
@@ -0,0 +1,130 @@
+#include <linux/timesource.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <asm/io.h>
+#include "mach_timer.h"
+
+/* Number of PMTMR ticks expected during calibration run */
+#define PMTMR_TICKS_PER_SEC 3579545
+#define PMTMR_EXPECTED_RATE \
+ ((CALIBRATE_LATCH * (PMTMR_TICKS_PER_SEC >> 10)) / (CLOCK_TICK_RATE>>10))
+
+
+/* The I/O port the PMTMR resides at.
+ * The location is detected during setup_arch(),
+ * in arch/i386/acpi/boot.c */
+u32 pmtmr_ioport = 0;
+
+#define ACPI_PM_MASK 0xFFFFFF /* limit it to 24 bits */
+
+static inline u32 read_pmtmr(void)
+{
+ u32 v1=0,v2=0,v3=0;
+ /* It has been reported that because of various broken
+ * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
+ * source is not latched, so you must read it multiple
+ * times to insure a safe value is read.
+ */
+ do {
+ v1 = inl(pmtmr_ioport);
+ v2 = inl(pmtmr_ioport);
+ v3 = inl(pmtmr_ioport);
+ } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
+ || (v3 > v1 && v3 < v2));
+
+ /* mask the output to 24 bits */
+ return v2 & ACPI_PM_MASK;
+}
+
+
+static cycle_t acpi_pm_read(void)
+{
+ return (cycle_t)read_pmtmr();
+}
+
+static cycle_t acpi_pm_delta(cycle_t now, cycle_t then)
+{
+ return (now - then) & ACPI_PM_MASK;
+}
+
+static nsec_t acpi_pm_cyc2ns(cycle_t cyc, cycle_t* remainder)
+{
+ /* Power management timer runs at 3.579545 cycles per
+ * microsecond. Thus it runs at 0.003579545 cyc/nsec which
+ * inverted is 279.36511 ns/cyc. Which is ~286070/1024
+ */
+ u64 cycles = cyc;
+ cycles *= 286070;
+ cycles >>= 10;
+ if(remainder)
+ *remainder = 0;
+ return (nsec_t)cycles;
+}
+
+struct timesource_t timesource_acpi_pm = {
+ .name = "acpi_pm",
+ .priority = 200,
+ .read = acpi_pm_read,
+ .delta = acpi_pm_delta,
+ .cyc2ns = acpi_pm_cyc2ns,
+};
+
+/*
+ * Some boards have the PMTMR running way too fast. We check
+ * the PMTMR rate against PIT channel 2 to catch these cases.
+ */
+static int verify_pmtmr_rate(void)
+{
+ u32 value1, value2;
+ unsigned long count, delta;
+
+ mach_prepare_counter();
+ value1 = read_pmtmr();
+ mach_countup(&count);
+ value2 = read_pmtmr();
+ delta = (value2 - value1) & ACPI_PM_MASK;
+
+ /* Check that the PMTMR delta is within 5% of what we expect */
+ if (delta < (PMTMR_EXPECTED_RATE * 19) / 20 ||
+ delta > (PMTMR_EXPECTED_RATE * 21) / 20) {
+ printk(KERN_INFO "PM-Timer running at invalid rate: %lu%% of normal - aborting.\n", 100UL * delta / PMTMR_EXPECTED_RATE);
+ return -1;
+ }
+
+ return 0;
+}
+
+
+static int init_acpi_pm_timesource(void)
+{
+ u32 value1, value2;
+ unsigned int i;
+
+ if (!pmtmr_ioport)
+ return -ENODEV;
+
+ /* "verify" this timing source */
+ value1 = read_pmtmr();
+ for (i = 0; i < 10000; i++) {
+ value2 = read_pmtmr();
+ if (value2 == value1)
+ continue;
+ if (value2 > value1)
+ goto pm_good;
+ if ((value2 < value1) && ((value2) < 0xFFF))
+ goto pm_good;
+ printk(KERN_INFO "PM-Timer had inconsistent results: 0x%#x, 0x%#x - aborting.\n", value1, value2);
+ return -EINVAL;
+ }
+ printk(KERN_INFO "PM-Timer had no reasonable result: 0x%#x - aborting.\n", value1);
+ return -ENODEV;
+
+pm_good:
+ if (verify_pmtmr_rate() != 0)
+ return -ENODEV;
+
+ register_timesource(&timesource_acpi_pm);
+ return 0;
+}
+
+module_init(init_acpi_pm_timesource);
diff -Nru a/drivers/timesource/cyclone.c b/drivers/timesource/cyclone.c
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/cyclone.c 2004-09-02 13:45:15 -07:00
@@ -0,0 +1,167 @@
+#include <linux/timesource.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/timex.h>
+#include <linux/init.h>
+
+#include <asm/io.h>
+#include <asm/pgtable.h>
+#include <asm/fixmap.h>
+#include "mach_timer.h"
+
+#define CYCLONE_CBAR_ADDR 0xFEB00CD0
+#define CYCLONE_PMCC_OFFSET 0x51A0
+#define CYCLONE_MPMC_OFFSET 0x51D0
+#define CYCLONE_MPCS_OFFSET 0x51A8
+#define CYCLONE_TIMER_FREQ 100000000
+#define CYCLONE_TIMER_MASK (((u64)1<<40)-1) /* 40 bit mask */
+
+unsigned long cyclone_freq_khz;
+
+int use_cyclone = 0;
+static u32* volatile cyclone_timer; /* Cyclone MPMC0 register */
+
+/* helper macro to atomically read both cyclone counter registers */
+#define read_cyclone_counter(low,high) \
+ do{ \
+ high = cyclone_timer[1]; low = cyclone_timer[0]; \
+ } while (high != cyclone_timer[1]);
+
+
+static cycle_t cyclone_read(void)
+{
+ u32 low, high;
+ u64 ret;
+
+ read_cyclone_counter(low,high);
+ ret = ((u64)high << 32)|low;
+
+ return (cycle_t)ret;
+}
+
+static cycle_t cyclone_delta(cycle_t now, cycle_t then)
+{
+ return (now - then)&CYCLONE_TIMER_MASK;
+}
+
+static nsec_t cyclone_cyc2ns(cycle_t cyc, cycle_t* remainder)
+{
+ u64 rem;
+ cyc *= 1000000;
+ rem = do_div(cyc, cyclone_freq_khz);
+ if (remainder)
+ *remainder = rem/1000000; /*XXX - do we still loose precision here? */
+ return (nsec_t)cyc;
+}
+
+struct timesource_t timesource_cyclone = {
+ .name = "cyclone",
+ .priority = 100,
+ .read = cyclone_read,
+ .delta = cyclone_delta,
+ .cyc2ns = cyclone_cyc2ns,
+};
+
+
+static void calibrate_cyclone(void)
+{
+ u32 startlow, starthigh, endlow, endhigh, delta32;
+ u64 start, end, delta64;
+ unsigned long i, count;
+ /* repeat 3 times to make sure the cache is warm */
+ for(i=0; i < 3; i++) {
+ mach_prepare_counter();
+ read_cyclone_counter(startlow,starthigh);
+ mach_countup(&count);
+ read_cyclone_counter(endlow,endhigh);
+ }
+ start = (u64)starthigh<<32|startlow;
+ end = (u64)endhigh<<32|endlow;
+
+ delta64 = end - start;
+ printk("cyclone delta: %llu\n", delta64);
+ delta64 *= (ACTHZ/1000)>>8;
+ printk("delta*hz = %llu\n", delta64);
+ delta32 = (u32)delta64;
+ cyclone_freq_khz = delta32/CALIBRATE_ITERATION;
+ printk("calculated cyclone_freq: %lu khz\n", cyclone_freq_khz);
+}
+
+static int init_cyclone_timesource(void)
+{
+ u32* reg;
+ u32 base; /* saved cyclone base address */
+ u32 pageaddr; /* page that contains cyclone_timer register */
+ u32 offset; /* offset from pageaddr to cyclone_timer register */
+ int i;
+
+ /*make sure we're on a summit box*/
+ if(!use_cyclone) return -ENODEV;
+
+ printk(KERN_INFO "Summit chipset: Starting Cyclone Counter.\n");
+
+ /* find base address */
+ pageaddr = (CYCLONE_CBAR_ADDR)&PAGE_MASK;
+ offset = (CYCLONE_CBAR_ADDR)&(~PAGE_MASK);
+ set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+ reg = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+ if(!reg){
+ printk(KERN_ERR "Summit chipset: Could not find valid CBAR register.\n");
+ return -ENODEV;
+ }
+ base = *reg;
+ if(!base){
+ printk(KERN_ERR "Summit chipset: Could not find valid CBAR value.\n");
+ return -ENODEV;
+ }
+
+ /* setup PMCC */
+ pageaddr = (base + CYCLONE_PMCC_OFFSET)&PAGE_MASK;
+ offset = (base + CYCLONE_PMCC_OFFSET)&(~PAGE_MASK);
+ set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+ reg = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+ if(!reg){
+ printk(KERN_ERR "Summit chipset: Could not find valid PMCC register.\n");
+ return -ENODEV;
+ }
+ reg[0] = 0x00000001;
+
+ /* setup MPCS */
+ pageaddr = (base + CYCLONE_MPCS_OFFSET)&PAGE_MASK;
+ offset = (base + CYCLONE_MPCS_OFFSET)&(~PAGE_MASK);
+ set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+ reg = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+ if(!reg){
+ printk(KERN_ERR "Summit chipset: Could not find valid MPCS register.\n");
+ return -ENODEV;
+ }
+ reg[0] = 0x00000001;
+
+ /* map in cyclone_timer */
+ pageaddr = (base + CYCLONE_MPMC_OFFSET)&PAGE_MASK;
+ offset = (base + CYCLONE_MPMC_OFFSET)&(~PAGE_MASK);
+ set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+ cyclone_timer = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+ if(!cyclone_timer){
+ printk(KERN_ERR "Summit chipset: Could not find valid MPMC register.\n");
+ return -ENODEV;
+ }
+
+ /*quick test to make sure its ticking*/
+ for(i=0; i<3; i++){
+ u32 old = cyclone_timer[0];
+ int stall = 100;
+ while(stall--) barrier();
+ if(cyclone_timer[0] == old){
+ printk(KERN_ERR "Summit chipset: Counter not counting! DISABLED\n");
+ cyclone_timer = 0;
+ return -ENODEV;
+ }
+ }
+ calibrate_cyclone();
+ register_timesource(&timesource_cyclone);
+
+ return 0;
+}
+
+module_init(init_cyclone_timesource);
diff -Nru a/drivers/timesource/i386_pit.c b/drivers/timesource/i386_pit.c
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/i386_pit.c 2004-09-02 13:45:15 -07:00
@@ -0,0 +1,99 @@
+/* pit timesource */
+
+#include <linux/timesource.h>
+#include <linux/timex.h>
+#include <linux/init.h>
+
+#include <asm/io.h>
+#include <asm/timer.h>
+#include "io_ports.h"
+#include "do_timer.h"
+
+extern u64 jiffies_64;
+extern long jiffies;
+extern spinlock_t i8253_lock;
+
+/* Since the PIT overflows every tick, its not very useful
+ * to just read by itself. So throw jiffies into the mix to
+ * and just return nanoseconds in pit_read().
+ */
+
+static cycle_t pit_read(void)
+{
+ unsigned long flags;
+ int count;
+ unsigned long jiffies_t;
+ static int count_p;
+ static unsigned long jiffies_p = 0;
+
+ spin_lock_irqsave(&i8253_lock, flags);
+
+ outb_p(0x00, PIT_MODE); /* latch the count ASAP */
+
+ count = inb_p(PIT_CH0); /* read the latched count */
+ jiffies_t = jiffies;
+ count |= inb_p(PIT_CH0) << 8;
+
+ /* VIA686a test code... reset the latch if count > max + 1 */
+ if (count > LATCH) {
+ outb_p(0x34, PIT_MODE);
+ outb_p(LATCH & 0xff, PIT_CH0);
+ outb(LATCH >> 8, PIT_CH0);
+ count = LATCH - 1;
+ }
+
+ /*
+ * avoiding timer inconsistencies (they are rare, but they happen)...
+ * there are two kinds of problems that must be avoided here:
+ * 1. the timer counter underflows
+ * 2. hardware problem with the timer, not giving us continuous time,
+ * the counter does small "jumps" upwards on some Pentium systems,
+ * (see c't 95/10 page 335 for Neptun bug.)
+ */
+
+ if( jiffies_t == jiffies_p ) {
+ if( count > count_p ) {
+ /* the nutcase */
+ count = do_timer_overflow(count);
+ }
+ } else
+ jiffies_p = jiffies_t;
+
+ count_p = count;
+
+ spin_unlock_irqrestore(&i8253_lock, flags);
+
+ count = ((LATCH-1) - count) * TICK_SIZE;
+ count = (count + LATCH/2) / LATCH;
+
+ count *= 1000; /* convert count from usec->nsec */
+
+ return (cycle_t)((jiffies_64 * TICK_NSEC) + count);
+}
+
+static cycle_t pit_delta(cycle_t now, cycle_t then)
+{
+ return now - then;
+}
+
+/* just return cyc, as its already in ns */
+static nsec_t pit_cyc2ns(cycle_t cyc, cycle_t* remainder)
+{
+ return (nsec_t)cyc;
+}
+
+static struct timesource_t timesource_pit = {
+ .name = "pit",
+ .priority = 0,
+ .read = pit_read,
+ .delta = pit_delta,
+ .cyc2ns = pit_cyc2ns,
+};
+
+static int init_pit_timesource(void)
+{
+ register_timesource(&timesource_pit);
+ return 0;
+}
+
+module_init(init_pit_timesource);
diff -Nru a/drivers/timesource/i386_tsc.c b/drivers/timesource/i386_tsc.c
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/i386_tsc.c 2004-09-02 13:45:15 -07:00
@@ -0,0 +1,53 @@
+/* TODO:
+ * o cpufreq code
+ * o better calibration
+ */
+
+#include <linux/timesource.h>
+#include "mach_timer.h"
+#include <linux/timex.h>
+#include <linux/init.h>
+#include <linux/cpufreq.h>
+#include <asm/tsc.h>
+
+static cycle_t tsc_read(void)
+{
+ u64 ret;
+ rdtscll(ret);
+ return (cycle_t)ret;
+}
+
+static cycle_t tsc_delta(cycle_t now, cycle_t then)
+{
+ return now - then;
+}
+
+static nsec_t tsc_cyc2ns(cycle_t cyc, cycle_t* remainder)
+{
+ u32 rem;
+ cyc *= 1000000;
+ rem = do_div(cyc, cpu_freq_khz);
+ if(remainder)
+ *remainder = rem/1000000; /*XXX - do we still loose precision here? */
+ return (nsec_t)cyc;
+}
+
+
+static struct timesource_t timesource_tsc = {
+ .name = "tsc",
+ .priority = 25,
+ .read = tsc_read,
+ .delta = tsc_delta,
+ .cyc2ns = tsc_cyc2ns,
+};
+
+
+static int init_tsc_timesource(void)
+{
+ /* All the initialization is done in arch/i386/kernel/tsc.c */
+ if (cpu_has_tsc && cpu_freq_khz)
+ register_timesource(&timesource_tsc);
+ return 0;
+}
+
+module_init(init_tsc_timesource);
diff -Nru a/include/asm-i386/mach-default/mach_timer.h b/include/asm-i386/mach-default/mach_timer.h
--- a/include/asm-i386/mach-default/mach_timer.h 2004-09-02 13:45:15 -07:00
+++ b/include/asm-i386/mach-default/mach_timer.h 2004-09-02 13:45:15 -07:00
@@ -14,8 +14,11 @@
*/
#ifndef _MACH_TIMER_H
#define _MACH_TIMER_H
+#include <asm/io.h>

-#define CALIBRATE_LATCH (5 * LATCH)
+#define CALIBRATE_ITERATION 50
+#define CALIBRATE_LATCH (CALIBRATE_ITERATION * LATCH)
+#define CALIBRATE_TIME (CALIBRATE_ITERATION * 1000020/HZ)

static inline void mach_prepare_counter(void)
{
diff -Nru a/include/asm-i386/timer.h b/include/asm-i386/timer.h
--- a/include/asm-i386/timer.h 2004-09-02 13:45:15 -07:00
+++ b/include/asm-i386/timer.h 2004-09-02 13:45:15 -07:00
@@ -42,7 +42,7 @@
extern struct timer_opts timer_pit;
extern struct timer_opts timer_tsc;
#ifdef CONFIG_X86_CYCLONE_TIMER
-extern struct timer_opts timer_cyclone;
+//extern struct timer_opts timer_cyclone;
#endif

extern unsigned long calibrate_tsc(void);
@@ -53,6 +53,6 @@
#endif

#ifdef CONFIG_X86_PM_TIMER
-extern struct timer_opts timer_pmtmr;
+//extern struct timer_opts timer_pmtmr;
#endif
#endif
diff -Nru a/include/asm-i386/tsc.h b/include/asm-i386/tsc.h
--- /dev/null Wed Dec 31 16:00:00 196900
+++ b/include/asm-i386/tsc.h 2004-09-02 13:45:15 -07:00
@@ -0,0 +1,6 @@
+#ifndef _ASM_I386_TSC_H
+#define _ASM_I386_TSC_H
+extern unsigned long cpu_freq_khz;
+void tsc_init(void);
+
+#endif


2004-09-02 22:31:58

by john stultz

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Thu, 2004-09-02 at 15:09, Christoph Lameter wrote:
> > timeofday_hook()
> > now = read(); /* read the timesource */
> > ns = cyc2ns(now - offset_base); /* calc nsecs since last call */
> > ntp_ns = ntp_scale(ns); /* apply ntp scaling */
> > system_time += ntp_ns; /* add scaled value to system_time */
> > ntp_advance(ns); /* advance ntp state machine by ns */
> > offset_base = now; /* set new offset_base */
>
> This would only work if the precision of the timer used is
> <=1ns and if you are actually able to caculate the nanoseconds that have
> passed. What do you do if the interval is lets say 100ns and the time the
> timeofday hook is being called can be anytime within this 100ns interval
> since the time source is not always precise?

Well, with the exception of the TSC, none of the current time sources
have <=1ns resolution, but I'm not sure I understand the problem you're
trying to point out. Could you clarify?

> I think its unavoidable to do some correction like provided by the time
> interpolator if the clock source does not provide ns.

Could you point to the specific correction you describe?

> > o What is the cost of throwing around 64bit values for everything?
> > - Do we need an arch specific time structure that varies size
> > accordingly?
>
> 64bit may be necessary at a minimum because with 4Ghz machine we may
> have counters with the frequency >2^32 cycles per second.
>
> I would think that 128bit may be necessary (at least
> as an intermediate result during the scaling of the timesource to
> nanoseconds) since we want to be accurate to the nanosecond.

I worry 128bit math might be a bit too rich for the majority of systems
at the moment. I am open to it, although I suspect we can use other
tricks to get the same accuracy within a constrained bitspace.

> > o Some arches (arm, for example) do not have high res timing hardware
> > - In this case we can have a "jiffies" timesource
> > - cyc2ns(x) = x*(NSEC_PER_SEC/HZ)
> > - doesn't work for tickless systems
>
> David M.s time interpolator logic is needed in those cases to insure that
> the clock works properly and that nanoseconds offset can be calculated in
> a consistent way although the exact timing of the increas / reading of the
> counter may vary within a certain time period.

Again, could you point me to the code (is this with your new patch, or
the older code)? I've looked at the time interpolator code, but I'm not
sure exactly what you mean.

Thanks for the feedback! I really appreciate it!
-john

2004-09-02 22:27:21

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2 Sep 2004, john stultz wrote:

> +struct timesource_t {
> + char* name;
> + int priority;
> + cycle_t (*read)(void);
> + cycle_t (*delta)(cycle_t now, cycle_t then);
> + nsec_t (*cyc2ns)(cycle_t cycles, cycle_t* remainder);
> +};

This is defining functions to be called for a timesource. Which means that
the functionality for clock_gettime cannot be implemented as a fastcall
anymore on IA64. The delta and the cyc2ns function are always the same.
The compensation for jitter etc is always the same and can also be
implemented generically.

Could we change this to avoid the function calls.

For example.

struct timesource_t {
char *name;
int priority;
int type;
void *address;
long long int frequency;
}

The functions are always

delta = (now-then)
cycle2ns = cycles * (NSEC_PER_SEC / frequency) + remainder

types of time sources

#define TIMESOURCE_CPU 0
#define TIMESOURCE_MMIO32 1
#define TIMESOURCE_MMIO64 2
#define TIMESOURCE_FUNCTION 3 /* To catch odd cases */

2004-09-02 22:39:05

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2004-09-02 at 15:19, Christoph Lameter wrote:
> On Thu, 2 Sep 2004, john stultz wrote:
>
> > +struct timesource_t {
> > + char* name;
> > + int priority;
> > + cycle_t (*read)(void);
> > + cycle_t (*delta)(cycle_t now, cycle_t then);
> > + nsec_t (*cyc2ns)(cycle_t cycles, cycle_t* remainder);
> > +};
>
> This is defining functions to be called for a timesource. Which means that
> the functionality for clock_gettime cannot be implemented as a fastcall
> anymore on IA64. The delta and the cyc2ns function are always the same.
> The compensation for jitter etc is always the same and can also be
> implemented generically.
>
> Could we change this to avoid the function calls.
>
> For example.
>
> struct timesource_t {
> char *name;
> int priority;
> int type;
> void *address;
> long long int frequency;
> }
>
> The functions are always
>
> delta = (now-then)
> cycle2ns = cycles * (NSEC_PER_SEC / frequency) + remainder
>
> types of time sources
>
> #define TIMESOURCE_CPU 0
> #define TIMESOURCE_MMIO32 1
> #define TIMESOURCE_MMIO64 2
> #define TIMESOURCE_FUNCTION 3 /* To catch odd cases */


What about my idea from yesterday of inverting the fastcall
relationship? Instead of creating a structure that exports values and
pointers the fastcall can use to create a time of day, why not use the
fast call to read the raw time and return it back to the time of day
code (which may be running in user context)? This avoids the duplication
of having to re-implement the timeofday/clock_gettime functions in
fastcall asm code.

thanks
-john

2004-09-02 22:55:03

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Thu, 2 Sep 2004, john stultz wrote:

> On Thu, 2004-09-02 at 15:09, Christoph Lameter wrote:
> > > timeofday_hook()
> > > now = read(); /* read the timesource */
> > > ns = cyc2ns(now - offset_base); /* calc nsecs since last call */
> > > ntp_ns = ntp_scale(ns); /* apply ntp scaling */
> > > system_time += ntp_ns; /* add scaled value to system_time */
> > > ntp_advance(ns); /* advance ntp state machine by ns */
> > > offset_base = now; /* set new offset_base */
> >
> > This would only work if the precision of the timer used is
> > <=1ns and if you are actually able to caculate the nanoseconds that have
> > passed. What do you do if the interval is lets say 100ns and the time the
> > timeofday hook is being called can be anytime within this 100ns interval
> > since the time source is not always precise?
>
> Well, with the exception of the TSC, none of the current time sources
> have <=1ns resolution, but I'm not sure I understand the problem you're
> trying to point out. Could you clarify?
>
> > I think its unavoidable to do some correction like provided by the time
> > interpolator if the clock source does not provide ns.
>
> Could you point to the specific correction you describe?

I drop my objections. I was thinking too much in terms of the old code....
This should work fine.

2004-09-02 22:51:56

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2 Sep 2004, john stultz wrote:

> What about my idea from yesterday of inverting the fastcall
> relationship? Instead of creating a structure that exports values and
> pointers the fastcall can use to create a time of day, why not use the
> fast call to read the raw time and return it back to the time of day
> code (which may be running in user context)? This avoids the duplication
> of having to re-implement the timeofday/clock_gettime functions in
> fastcall asm code.

"Read raw time"? How can you read the raw time in a fast call if the
fast call needs to do additional function calls (as defined in the
proposed time structure) in the kernel context in order to retrieve time?

A fast call cannot do any function calls in the kernel context or
otherwise.

The overhead of the function calls will reduce the performance of time
access significantly.


2004-09-02 23:21:33

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2004-09-02 at 15:42, Christoph Lameter wrote:
> On Thu, 2 Sep 2004, john stultz wrote:
>
> > What about my idea from yesterday of inverting the fastcall
> > relationship? Instead of creating a structure that exports values and
> > pointers the fastcall can use to create a time of day, why not use the
> > fast call to read the raw time and return it back to the time of day
> > code (which may be running in user context)? This avoids the duplication
> > of having to re-implement the timeofday/clock_gettime functions in
> > fastcall asm code.
>
> "Read raw time"? How can you read the raw time in a fast call if the
> fast call needs to do additional function calls (as defined in the
> proposed time structure) in the kernel context in order to retrieve time?
>
> A fast call cannot do any function calls in the kernel context or
> otherwise.
>
> The overhead of the function calls will reduce the performance of time
> access significantly.

Forgive me if I mis-understand the fastcall method, but this is the
concept. Instead of having a fastcall function that implmements the
gettimofday/clock_gettime + ntp scaling + etc all in asm we do the
following:

When creating a ia64 timesource (say the cyclone just for a specific
example) we do something like the following.

/* trivial delta (nothing new here) */
static cycle_t cyclone_fastcall_delta(cycle_t now, cycle_t then)
{
return (now - then);
}
/* trivial cyc2ns (nothing new here) */
static nsec_t cyclone_fastcall_cyc2ns(cycle_t cyc, cycle_t* remainder)
{
u64 rem;
cyc *= freq_multiplier;
if (remainder)
*remainder = 0;
return (nsec_t)cyc;
}

/* fastcall read, this is where it gets interesting */
static cycle_t cyclone_fasatcall_read(void)
{
u64 ret;

ret = fastcall_readcyclonecounter();

return (cycle_t)ret;
}

struct timesource_t timesource_cyclone_fastcall = {
.name = "cyclone_fastcall",
.priority = 100,
.read = cyclone_fastcall_read,
.delta = cyclone_fastcall_delta,
.cyc2ns = cyclone_fastcall_cyc2ns,
};

Then you implement a fastcall for fastcall_readcyclonecounter(), which
in crazy ia64 asm would do something like:

ENTRY(fastcall_readcyclonecounter)
blip // magic privledge escalation
blop // load cyclone counter into memory
bloop // copy cyclone counter back into return register
;;
END(fastcall_readcyclonecounter)


This avoids 150+ lines of asm needed to re-implement the gettimeofday
math.

However, I could be mistaken. Is something like this possible?

thanks
-john




2004-09-03 00:59:08

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2 Sep 2004, john stultz wrote:

> > Of course but its not a generic way of timer acccess and
> > requires a fastcall for each timer. You still have the problem of
> > exporting the frequency and the time offsets to user space (those also
> > need to be kept current in order for one to be able to calculate a timer
> > value!). The syscall/fastcall API then needs to be extended to allow for a
> > system call to access each of the individual timers supported.
>
> Indeed, it would require a fastcall accessor for each timesource, but
> maybe fastcall is the wrong way to describe it. Could it just be a
> function that uses the epc to escalate its privileges? As for freq and
> offsets, those would already be user-visible (see below for more
> details)

The only way curent way to enter the kernel from glibc with a fastcall is
the EPC.

> The plan at the moment is that the timeofday core gettimeofday code path
> as well as any timesource that supports it adds a _vsyscall linker
> attribute. Then the linker will place all the code on a special page or
> set of pages. Those pages would then be mapped in as user-readable. Then
> just like the x86-64's vsyscall gettimeofday, glibc would re-direct
> gettimeofday calls to the user-mapped kernel pages, where it would
> execute in user mode (with maybe the epc privilege escalation for ia64
> time sources that required it).

The EPC call already does do a *secure* transfer like this on IA64 and
will execute kernel code without user space mapping. This idea raises all sorts
of concerns....

> I had to do most of this for the i386 vsyscall-gettimeofday port, but I
> was unhappy with the duplication of code (and bugs), as well as the fact
> that I was then being pushed to re-do it for each architecture. While
> its not currently implemented (I was hoping to keep the discussion
> focused on the correctness of what's been implemented), I feel the plan
> for user-mode access won't be too complex. I'm still open for further
> discussion if you'd like, obviously performance is quite important, and
> I want to calm any fears you have, but I'm sure the new ntp code plenty
> of performance issues to look at before we start digging into usermode
> access, so maybe we can come back to this discussion later?

The functions in the timer source structure is a central problem for IA64. We
cannot take that out right now.

The full parameterization of timer access as I have suggested also allows
user page access to timers with simply mapping a page to access the timer.
No other gimmicks are needed since the timer source structure contains all
information to calculate a timer offset.

> Yes, but x86-64 has one way, and ia64 does it another, and i know ppc
> folks have talked about their own user mode time access. Chasing down a
> time bug across arches gets to be fairly hairy, so I'm trying to
> simplify that.

The simplest thins is to provide a data structure without any functions
attached that can simply be copied into userspace if wanted. If an arch
needs special time access then that is depending on the arch specific
methods available and such a data structure as I have proposed will
include all the info necessary to implement such user mode time access.

A requirement to call functions in the kernel to determine time makes
these solution impossible. And its getting extremely complex if one has to
invoke different functions for each timer supported.

2004-09-03 01:39:46

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2004-09-02 at 17:47, Christoph Lameter wrote:
> On Thu, 2 Sep 2004, john stultz wrote:
>
> > > Of course but its not a generic way of timer acccess and
> > > requires a fastcall for each timer. You still have the problem of
> > > exporting the frequency and the time offsets to user space (those also
> > > need to be kept current in order for one to be able to calculate a timer
> > > value!). The syscall/fastcall API then needs to be extended to allow for a
> > > system call to access each of the individual timers supported.
> >
> > Indeed, it would require a fastcall accessor for each timesource, but
> > maybe fastcall is the wrong way to describe it. Could it just be a
> > function that uses the epc to escalate its privileges? As for freq and
> > offsets, those would already be user-visible (see below for more
> > details)
>
> The only way curent way to enter the kernel from glibc with a fastcall is
> the EPC.

Hmm. I must be explaining myself poorly, or not understanding you. I
apologize for not understanding this EPC/fastcall business well enough.
I'd like to use EPC from a user-executable kernel page to escalate
privileges to access the hardware counter. I don't care if I have to use
the the current fastcall (fsys.S) interface or not. However you're
making sounds like this isn't possible, so I'll need to do some
research.

> > The plan at the moment is that the timeofday core gettimeofday code path
> > as well as any timesource that supports it adds a _vsyscall linker
> > attribute. Then the linker will place all the code on a special page or
> > set of pages. Those pages would then be mapped in as user-readable. Then
> > just like the x86-64's vsyscall gettimeofday, glibc would re-direct
> > gettimeofday calls to the user-mapped kernel pages, where it would
> > execute in user mode (with maybe the epc privilege escalation for ia64
> > time sources that required it).
>
> The EPC call already does do a *secure* transfer like this on IA64 and
> will execute kernel code without user space mapping. This idea raises all sorts
> of concerns....

Yes, but its not portable. Reducing duplicate code so timeofday
maintenance isn't a nightmare is the first goal here. It may not be
completely achievable, and when I hit that point I'll have to rework the
design, but at this point I'm not convinced that it cannot be done.

As for security concerns, all your mapping out to userspace are the time
variables and the functions needed to convert those to a accurate time
of day. This is almost what you suggest below, but with the additional
math from the kernel. The method I'm suggesting is very similar to
x86-64's arch/x86-64/kernel/vsyscall.c.


> > I had to do most of this for the i386 vsyscall-gettimeofday port, but I
> > was unhappy with the duplication of code (and bugs), as well as the fact
> > that I was then being pushed to re-do it for each architecture. While
> > its not currently implemented (I was hoping to keep the discussion
> > focused on the correctness of what's been implemented), I feel the plan
> > for user-mode access won't be too complex. I'm still open for further
> > discussion if you'd like, obviously performance is quite important, and
> > I want to calm any fears you have, but I'm sure the new ntp code plenty
> > of performance issues to look at before we start digging into usermode
> > access, so maybe we can come back to this discussion later?
>
> The functions in the timer source structure is a central problem for IA64. We
> cannot take that out right now.

Don't worry I'm not submitting this code just yet. :) I'll need all (or
at least most of the important) architecture maintainers to sign on
before I try to push this code in.

Right now I'm trying to shake out possible problems with the design and
the first pass implementation of the code. You're helping me do that,
and I thank you for it. Concessions will be made, but for now I'm going
to try to preserve the current design and work around the problems as
they arise.


> The full parameterization of timer access as I have suggested also allows
> user page access to timers with simply mapping a page to access the timer.
> No other gimmicks are needed since the timer source structure contains all
> information to calculate a timer offset.
>
> > Yes, but x86-64 has one way, and ia64 does it another, and i know ppc
> > folks have talked about their own user mode time access. Chasing down a
> > time bug across arches gets to be fairly hairy, so I'm trying to
> > simplify that.
>
> The simplest thins is to provide a data structure without any functions
> attached that can simply be copied into userspace if wanted. If an arch
> needs special time access then that is depending on the arch specific
> methods available and such a data structure as I have proposed will
> include all the info necessary to implement such user mode time access.

Ehhh.. I really don't like the idea of giving all the raw values to
userspace and letting user-code do the timeofday calculation. Fixing
bugs in each arches timeofday code is hard enough. Imagine if we have to
go through and fix userspace too! It would also make a user/kernel data
interface that we'd have to preserve. I'd like to avoid that and instead
use the vsyscall method to give us greater flexibility. Plus I doubt
anyone would want to implement the NTP adjustments in userspace? eek!

> A requirement to call functions in the kernel to determine time makes
> these solution impossible. And its getting extremely complex if one has to
> invoke different functions for each timer supported.

The struct timesource interface of read()/delta()/cyc2ns() was the best
generalization I could come up with. I feel they're necessary for the
following reasons:

cyc2ns(): In this conversion we can optimize the math depending on the
timesource. If the timesource freq is a power of 2, we can just use
shift! However if its a weird value and we have to be very precise, we
do a full 64bit divide. We're not stuck with one equation given a freq
value.

delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
24 bits, and the cyclone is 40. Thus to do proper twos complement
subtraction without overflow worries you need to mask the subtraction.
This can be done by exporting a mask value w/ the freq value, but was
cleaner when moved into the timesource.

read(): Rather then just giving the address of the register, the read
call allows for timesource specific logic. This lets us use jiffies as a
timesource, or in cases like the ACPI PM timesource, where the register
must be read 3 times in order to ensure a correct value is latched, we
can avoid having to include that logic into the generic code, so it does
not affect systems that do not use or have that timesource.

But I doubt I'll convince you with words, so let me work on it a bit and
see if I can code around your concerns and put you at ease. You've
brought up some good issues, and I'll definitely work to resolve them!

Thanks again!
-john


2004-09-03 01:48:38

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

john stultz wrote:
> All,
> This patch implements the architecture independent portion of the time
> of day subsystem. Included is timeofday.c (which includes all the time
> of day management and accessor functions), ntp.c (which includes the ntp
> scaling code, leap second processing, and ntp kernel state machine
> code), interface definition .h files, the example jiffies timesource
> (lowest common denominator time source, mainly for use as example code)
> and minimal hooks into arch independent code.
>
> The patch does not function without minimal architecture specific hooks
> (i386 example to follow), and it can be applied to a tree without
> affecting the code.
>
> I look forward to your comments and feedback.
>
> thanks
> -john
~

> +static cycle_t jiffies_read(void)
> +{
> + cycle_t ret = get_jiffies_64();
> + return ret;
> +}
> +
> +static cycle_t jiffies_delta(cycle_t now, cycle_t then)
> +{
> + /* simple subtraction, no need to mask */
> + return now - then;
> +}

This should be inline...

> +
> +static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
> +{
> +
> + cyc *= NSEC_PER_SEC/HZ;

Hm... This assumes that 1/HZ is what is needed here. Today this value is
999898. Not exactly reachable by NSEC_PER_SEC/HZ. Or did I miss something,
like the relationship of jiffie to 1/HZ and to real time.

> +
~

> +int ntp_leapsecond(struct timespec now)
> +{
> + /*
> + * Leap second processing. If in leap-insert state at
> + * the end of the day, the system clock is set back one
> + * second; if in leap-delete state, the system clock is
> + * set ahead one second. The microtime() routine or
> + * external clock driver will insure that reported time
> + * is always monotonic. The ugly divides should be
> + * replaced.

?? Is this really to be hidden? I rather though this was just the same as a
user driven clock setting. In any case, it is a slew of the wall clock and
should be presented to abs timer code so that the abs timers can be corrected.
This means a call to "clock_was_set()" AND an update of the mono_to_wall value.
> + */
> + static time_t leaptime = 0;
> +
> + switch (ntp_state) {
> + case TIME_OK:
> + if (ntp_status & STA_INS) {
> + ntp_state = TIME_INS;
> + /* calculate end of today (23:59:59)*/
> + leaptime = now.tv_sec + SEC_PER_DAY - (now.tv_sec % SEC_PER_DAY) - 1;
> + }
> + else if (ntp_status & STA_DEL) {
> + ntp_state = TIME_DEL;
> + /* calculate end of today (23:59:59)*/
> + leaptime = now.tv_sec + SEC_PER_DAY - (now.tv_sec % SEC_PER_DAY) - 1;
> + }
> + break;
> +
> + case TIME_INS:
> + /* Once we are at (or past) leaptime, insert the second */
> + if (now.tv_sec > leaptime) {
> + ntp_state = TIME_OOP;
> + printk(KERN_NOTICE "Clock: inserting leap second 23:59:60 UTC\n");
> +
> + return -1;
> + }
> + break;
> +
> + case TIME_DEL:
> + /* Once we are at (or past) leaptime, delete the second */
> + if (now.tv_sec >= leaptime) {
> + ntp_state = TIME_WAIT;
> + printk(KERN_NOTICE "Clock: deleting leap second 23:59:59 UTC\n");
> +
> + return 1;
> + }
~

> +/* do_gettimeofday():
> + * Returns the time of day
> + */
> +void do_gettimeofday(struct timeval *tv)
> +{
> + nsec_t wall, sys;
> + unsigned long seq;
> +
> + /* atomically read wall and sys time */
> + do {
> + seq = read_seqbegin(&system_time_lock);
> +
> + wall = wall_time_offset;
> + sys = __monotonic_clock();
> +
> + } while (read_seqretry(&system_time_lock, seq));
> +
> + /* add them and convert to timeval */
> + *tv = ns2timeval(wall+sys);
> +}
I am not sure you don't want to seperate the locking from the clock read. This
so one lock can be put around larger bits of code. For example, in
posix-times.c we need to get all three clocks under the same lock (that being
monotonic, wall_time_offset, and jiffies (and possibly as sub jiffie value)).

Something like:
void get_tod_parts(nsec_t *wall, nsec_t *mon)
{
*wall = wall_time_offset;
*mon = __monotonic_clock();
}

This could then be used in a larger function with out the double locking.

> +
> +
> +/* do_settimeofday():
> + * Sets the time of day
> + */
> +int do_settimeofday(struct timespec *tv)
> +{
> + /* convert timespec to ns */
> + nsec_t newtime = timespec2ns(tv);
> +
> + /* atomically adjust wall_time_offset to the desired value */
> + write_seqlock_irq(&system_time_lock);
> +
> + wall_time_offset = newtime - __monotonic_clock();
> +
> + /* clear NTP settings */
> + ntp_clear();
> +
> + write_sequnlock_irq(&system_time_lock);

Also need a clock_was_set() call here.
> +
~
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-03 01:52:00

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday i386 hooks (v.A0)

john stultz wrote:
> All,
> This patch implements the minimal i386 architecture hooks to enable the
> new time of day subsystem code. It applies on top of my
> linux-2.6.9-rc1_timeofday-core_A0 patch and with this patch applied, you
> can test the new time of day subsystem on i386. Basically it adds the
> call to timeofday_interrupt_hook() and cuts alot of code out of the
> build. The only new code is the sync_persistant_clock() function which
> is mostly ripped out of do_timer_interrupt(). Pretty un-interesting.
>
> I look forward to your comments and feedback.
>
> thanks
> -john
>
> linux-2.6.9-rc1_timeofday-i386_A0.patch
> =======================================
> diff -Nru a/arch/i386/Kconfig b/arch/i386/Kconfig
> --- a/arch/i386/Kconfig 2004-09-02 13:29:59 -07:00
> +++ b/arch/i386/Kconfig 2004-09-02 13:29:59 -07:00
> @@ -14,6 +14,10 @@
> 486, 586, Pentiums, and various instruction-set-compatible chips by
> AMD, Cyrix, and others.
>
> +config NEWTOD
> + bool
> + default y
> +
> config MMU
> bool
> default y
> diff -Nru a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
> --- a/arch/i386/kernel/time.c 2004-09-02 13:29:59 -07:00
> +++ b/arch/i386/kernel/time.c 2004-09-02 13:29:59 -07:00
> @@ -67,6 +67,8 @@
>
> #include "io_ports.h"
>
> +#include <linux/timeofday.h>
> +
> extern spinlock_t i8259A_lock;
> int pit_latch_buggy; /* extern */
>
> @@ -87,6 +89,7 @@
>
> struct timer_opts *cur_timer = &timer_none;
>
> +#ifndef CONFIG_NEWTOD
> /*
> * This version of gettimeofday has microsecond resolution
> * and better than microsecond precision on fast x86 machines with TSC.
> @@ -169,6 +172,7 @@
> }
>
> EXPORT_SYMBOL(do_settimeofday);
> +#endif
>
> static int set_rtc_mmss(unsigned long nowtime)
> {
> @@ -194,12 +198,39 @@
> * Note: This function is required to return accurate
> * time even in the absence of multiple timer ticks.
> */
> +#ifndef CONFIG_NEWTOD
> unsigned long long monotonic_clock(void)
> {
> return cur_timer->monotonic_clock();
> }
> EXPORT_SYMBOL(monotonic_clock);
> +#endif
>
> +void sync_persistant_clock(struct timespec ts)
> +{
> + /*
> + * If we have an externally synchronized Linux clock, then update
> + * CMOS clock accordingly every ~11 minutes. Set_rtc_mmss() has to be
> + * called as close as possible to 500 ms before the new second starts.
> + */
> + if (ts.tv_sec > last_rtc_update + 660 &&
> + (ts.tv_nsec / 1000)
> + >= USEC_AFTER - ((unsigned) TICK_SIZE) / 2 &&
> + (ts.tv_nsec / 1000)
> + <= USEC_BEFORE + ((unsigned) TICK_SIZE) / 2) {
> + /* horrible...FIXME */
> + if (efi_enabled) {
> + if (efi_set_rtc_mmss(ts.tv_sec) == 0)
> + last_rtc_update = ts.tv_sec;
> + else
> + last_rtc_update = ts.tv_sec - 600;
> + } else if (set_rtc_mmss(ts.tv_sec) == 0)
> + last_rtc_update = ts.tv_sec;
> + else
> + last_rtc_update = ts.tv_sec - 600; /* do it again in 60 s */
> + }
> +
I have wondered, and continue to do so, why this is not a timer driven function.
It just seems silly to check this every interrupt when we have low overhead
timers for just this sort of thing.

I wonder about the load calc in the same way...

Now, the question is how do you hook up the timer list. We MUST be able to
start a timer that will run for several min to hours and have it expire such
that the wall time difference is "really" close to what was requested. This
requires some "lock up" between the wall clock and the timer subsystem.

What are your thoughts here?

George


--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-03 02:08:02

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2004-09-02 at 18:39, George Anzinger wrote:
> john stultz wrote:
> > +static cycle_t jiffies_read(void)
> > +{
> > + cycle_t ret = get_jiffies_64();
> > + return ret;
> > +}
> > +
> > +static cycle_t jiffies_delta(cycle_t now, cycle_t then)
> > +{
> > + /* simple subtraction, no need to mask */
> > + return now - then;
> > +}
>
> This should be inline...

Unfortunately they cannot be, as they are used as function pointers
through the struct timesource interface.

> > +
> > +static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
> > +{
> > +
> > + cyc *= NSEC_PER_SEC/HZ;
>
> Hm... This assumes that 1/HZ is what is needed here. Today this value is
> 999898. Not exactly reachable by NSEC_PER_SEC/HZ. Or did I miss something,
> like the relationship of jiffie to 1/HZ and to real time.

You're right. You'd think with the recent discussions I would have
already fixed that, but it slipped by. I'll fix it to use ACTHZ in the
next release. Good catch.


> > +int ntp_leapsecond(struct timespec now)
> > +{
> > + /*
> > + * Leap second processing. If in leap-insert state at
> > + * the end of the day, the system clock is set back one
> > + * second; if in leap-delete state, the system clock is
> > + * set ahead one second. The microtime() routine or
> > + * external clock driver will insure that reported time
> > + * is always monotonic. The ugly divides should be
> > + * replaced.
>
> ?? Is this really to be hidden? I rather though this was just the same as a
> user driven clock setting. In any case, it is a slew of the wall clock and
> should be presented to abs timer code so that the abs timers can be corrected.
> This means a call to "clock_was_set()" AND an update of the mono_to_wall value.

Hmm. Looking at that comment, it seems a bit outdated. I'll see if I can
update it to be more clear.

And yes, clock_was_set() is on my list. I realized I dropped it, but I
wanted to make sure I was using it properly before I just threw it in.


> > +void do_gettimeofday(struct timeval *tv)
> > +{
> > + nsec_t wall, sys;
> > + unsigned long seq;
> > +
> > + /* atomically read wall and sys time */
> > + do {
> > + seq = read_seqbegin(&system_time_lock);
> > +
> > + wall = wall_time_offset;
> > + sys = __monotonic_clock();
> > +
> > + } while (read_seqretry(&system_time_lock, seq));
> > +
> > + /* add them and convert to timeval */
> > + *tv = ns2timeval(wall+sys);
> > +}
> I am not sure you don't want to seperate the locking from the clock read. This
> so one lock can be put around larger bits of code. For example, in
> posix-times.c we need to get all three clocks under the same lock (that being
> monotonic, wall_time_offset, and jiffies (and possibly as sub jiffie value)).

Well, all the values in timeofday.c are protected by the
system_time_lock, so I'm trying to minimize the locks. I do want to kill
off xtime and thus the xtime lock, so jiffies will be by itself and will
need a jiffies_lock of some sort. However part of this effort is to ween
folks off of using jiffies as a time value, thus in the future it should
only be used inside the timer subsystem as a interrupt counter.

Thinking about it more, the NTP code could probably drop the ntp_lock
and just use the system_time_lock, but I think I'll push that
optimization off for a bit.

> Something like:
> void get_tod_parts(nsec_t *wall, nsec_t *mon)
> {
> *wall = wall_time_offset;
> *mon = __monotonic_clock();
> }
>
> This could then be used in a larger function with out the double locking.

I understood your point above, but I'm not sure I see double locking.
__monotonic_clock() is lock free for exactly this reason.


> > +int do_settimeofday(struct timespec *tv)
> > +{
> > + /* convert timespec to ns */
> > + nsec_t newtime = timespec2ns(tv);
> > +
> > + /* atomically adjust wall_time_offset to the desired value */
> > + write_seqlock_irq(&system_time_lock);
> > +
> > + wall_time_offset = newtime - __monotonic_clock();
> > +
> > + /* clear NTP settings */
> > + ntp_clear();
> > +
> > + write_sequnlock_irq(&system_time_lock);
>
> Also need a clock_was_set() call here.

Yep!

Thanks for the feedback, George!
-john

2004-09-03 02:18:18

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday i386 hooks (v.A0)

On Thu, 2004-09-02 at 18:44, George Anzinger wrote:
> john stultz wrote:
> > +void sync_persistant_clock(struct timespec ts)
> > +{
> > + /*
> > + * If we have an externally synchronized Linux clock, then update
> > + * CMOS clock accordingly every ~11 minutes. Set_rtc_mmss() has to be
> > + * called as close as possible to 500 ms before the new second starts.
> > + */
> > + if (ts.tv_sec > last_rtc_update + 660 &&
> > + (ts.tv_nsec / 1000)
> > + >= USEC_AFTER - ((unsigned) TICK_SIZE) / 2 &&
> > + (ts.tv_nsec / 1000)
> > + <= USEC_BEFORE + ((unsigned) TICK_SIZE) / 2) {
> > + /* horrible...FIXME */
> > + if (efi_enabled) {
> > + if (efi_set_rtc_mmss(ts.tv_sec) == 0)
> > + last_rtc_update = ts.tv_sec;
> > + else
> > + last_rtc_update = ts.tv_sec - 600;
> > + } else if (set_rtc_mmss(ts.tv_sec) == 0)
> > + last_rtc_update = ts.tv_sec;
> > + else
> > + last_rtc_update = ts.tv_sec - 600; /* do it again in 60 s */
> > + }
> > +
> I have wondered, and continue to do so, why this is not a timer driven function.
> It just seems silly to check this every interrupt when we have low overhead
> timers for just this sort of thing.
>
> I wonder about the load calc in the same way...

The really cool thing is: with the new infrastructure separating the
timeofday code and the timer code, we can make timeofday_interrupt_hook
be called from a soft-timer! We don't have to do the accounting every
interrupt, but just frequently enough that the time sources don't
overflow!

> Now, the question is how do you hook up the timer list. We MUST be able to
> start a timer that will run for several min to hours and have it expire such
> that the wall time difference is "really" close to what was requested. This
> requires some "lock up" between the wall clock and the timer subsystem.
>
> What are your thoughts here?

Well, looking way down the road, once timeofday is independent from the
soft-timer code, we can invert the dependency so the soft-timer code
uses the high-res timeofday code instead of jiffies for determining if a
timer has expired. That would allow for more precise timer intervals to
be set, without worry of how it might affect timekeeping.

Easier said then done, I agree, but this change does allow for these
sorts of thing to be attempted.

My favorite benefit is that lost-ticks caused by bad drivers become a
timer/scheduler latency issue rather then a timeofday is broken issue :)

Yay!

Thanks again, George!
-john


2004-09-03 02:29:13

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2004-09-02 at 16:39, Christoph Lameter wrote:
> On Thu, 2 Sep 2004, john stultz wrote:
>
> > Then you implement a fastcall for fastcall_readcyclonecounter(), which
> > in crazy ia64 asm would do something like:
> >
> > ENTRY(fastcall_readcyclonecounter)
> > blip // magic privledge escalation
> > blop // load cyclone counter into memory
> > bloop // copy cyclone counter back into return register
> > ;;
> > END(fastcall_readcyclonecounter)
> >
> >
> > This avoids 150+ lines of asm needed to re-implement the gettimeofday
> > math.
> >
> > However, I could be mistaken. Is something like this possible?
>
> Of course but its not a generic way of timer acccess and
> requires a fastcall for each timer. You still have the problem of
> exporting the frequency and the time offsets to user space (those also
> need to be kept current in order for one to be able to calculate a timer
> value!). The syscall/fastcall API then needs to be extended to allow for a
> system call to access each of the individual timers supported.

Indeed, it would require a fastcall accessor for each timesource, but
maybe fastcall is the wrong way to describe it. Could it just be a
function that uses the epc to escalate its privileges? As for freq and
offsets, those would already be user-visible (see below for more
details)

> And how would this be supported from user space? We link in special
> libraries to support for cyclone and any other supported timers into
> each program? I thought a kernel would provide hardware abstraction?

The plan at the moment is that the timeofday core gettimeofday code path
as well as any timesource that supports it adds a _vsyscall linker
attribute. Then the linker will place all the code on a special page or
set of pages. Those pages would then be mapped in as user-readable. Then
just like the x86-64's vsyscall gettimeofday, glibc would re-direct
gettimeofday calls to the user-mapped kernel pages, where it would
execute in user mode (with maybe the epc privilege escalation for ia64
time sources that required it).

I had to do most of this for the i386 vsyscall-gettimeofday port, but I
was unhappy with the duplication of code (and bugs), as well as the fact
that I was then being pushed to re-do it for each architecture. While
its not currently implemented (I was hoping to keep the discussion
focused on the correctness of what's been implemented), I feel the plan
for user-mode access won't be too complex. I'm still open for further
discussion if you'd like, obviously performance is quite important, and
I want to calm any fears you have, but I'm sure the new ntp code plenty
of performance issues to look at before we start digging into usermode
access, so maybe we can come back to this discussion later?

> The current IA64 fastcall timer interface is generic and is bound into the
> clock_gettime interface of glibc.

Yes, but x86-64 has one way, and ia64 does it another, and i know ppc
folks have talked about their own user mode time access. Chasing down a
time bug across arches gets to be fairly hairy, so I'm trying to
simplify that.

Again, I really appreciate your thoughts and feedback. Let me see if in
the next release I can have more code to illustrate the user-mode access
plan.

thanks!
-john

2004-09-02 23:48:08

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2 Sep 2004, john stultz wrote:

> Then you implement a fastcall for fastcall_readcyclonecounter(), which
> in crazy ia64 asm would do something like:
>
> ENTRY(fastcall_readcyclonecounter)
> blip // magic privledge escalation
> blop // load cyclone counter into memory
> bloop // copy cyclone counter back into return register
> ;;
> END(fastcall_readcyclonecounter)
>
>
> This avoids 150+ lines of asm needed to re-implement the gettimeofday
> math.
>
> However, I could be mistaken. Is something like this possible?

Of course but its not a generic way of timer acccess and
requires a fastcall for each timer. You still have the problem of
exporting the frequency and the time offsets to user space (those also
need to be kept current in order for one to be able to calculate a timer
value!). The syscall/fastcall API then needs to be extended to allow for a
system call to access each of the individual timers supported.

And how would this be supported from user space? We link in special
libraries to support for cyclone and any other supported timers into
each program? I thought a kernel would provide hardware abstraction?

The current IA64 fastcall timer interface is generic and is bound into the
clock_gettime interface of glibc.

2004-09-02 22:17:05

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

> timeofday_hook()
> now = read(); /* read the timesource */
> ns = cyc2ns(now - offset_base); /* calc nsecs since last call */
> ntp_ns = ntp_scale(ns); /* apply ntp scaling */
> system_time += ntp_ns; /* add scaled value to system_time */
> ntp_advance(ns); /* advance ntp state machine by ns */
> offset_base = now; /* set new offset_base */

This would only work if the precision of the timer used is
<=1ns and if you are actually able to caculate the nanoseconds that have
passed. What do you do if the interval is lets say 100ns and the time the
timeofday hook is being called can be anytime within this 100ns interval
since the time source is not always precise?

I think its unavoidable to do some correction like provided by the time
interpolator if the clock source does not provide ns.

> monotonic_clock()
> now = read(); /* read the timesource */
> ns = cyc2ns(now - offset_base); /* calculate nsecs since last hook */
> ntp_ns = ntp_scale(ns); /* apply ntp scaling */
> return system_time + ntp_ns; /* return system_time and scaled value

This works analogous to the time interpolator today. The scaling of the
timesource is omitted tough. Looks like we will be able to keep the
fastcalls for clock_gettime and gettime ofday.

> o What is the cost of throwing around 64bit values for everything?
> - Do we need an arch specific time structure that varies size
> accordingly?

64bit may be necessary at a minimum because with 4Ghz machine we may
have counters with the frequency >2^32 cycles per second.

I would think that 128bit may be necessary (at least
as an intermediate result during the scaling of the timesource to
nanoseconds) since we want to be accurate to the nanosecond.

> o Some arches (arm, for example) do not have high res timing hardware
> - In this case we can have a "jiffies" timesource
> - cyc2ns(x) = x*(NSEC_PER_SEC/HZ)
> - doesn't work for tickless systems

David M.s time interpolator logic is needed in those cases to insure that
the clock works properly and that nanoseconds offset can be calculated in
a consistent way although the exact timing of the increas / reading of the
counter may vary within a certain time period.

2004-09-03 06:49:26

by Albert Cahalan

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2004-09-02 at 21:39, George Anzinger wrote:
> john stultz wrote:

> > +
> > +static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
> > +{
> > +
> > + cyc *= NSEC_PER_SEC/HZ;
>
> Hm... This assumes that 1/HZ is what is needed here. Today this value is
> 999898. Not exactly reachable by NSEC_PER_SEC/HZ. Or did I miss something,
> like the relationship of jiffie to 1/HZ and to real time.

HZ not being HZ is the source of many foul problems.

NTP should be able to correct for the error. For systems
not running NTP, provide a fake NTP to make corrections
based on the expected frequency error.

Based on that, skip or double-up on the ticks to make
them be exactly HZ over long periods of time.

> > +int ntp_leapsecond(struct timespec now)
> > +{
> > + /*
> > + * Leap second processing. If in leap-insert state at
> > + * the end of the day, the system clock is set back one
> > + * second; if in leap-delete state, the system clock is
> > + * set ahead one second. The microtime() routine or
> > + * external clock driver will insure that reported time
> > + * is always monotonic. The ugly divides should be
> > + * replaced.

Don't optimize until the patch is in and stable.
The divides can be removed much later. Wait months,
if not forever, before making the code less readable.

The same goes for arch-specific non-syscall hacks.


2004-09-03 07:28:17

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

Albert Cahalan wrote:
> On Thu, 2004-09-02 at 21:39, George Anzinger wrote:
>
>>john stultz wrote:
>
>
>>>+
>>>+static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
>>>+{
>>>+
>>>+ cyc *= NSEC_PER_SEC/HZ;
>>
>>Hm... This assumes that 1/HZ is what is needed here. Today this value is
>>999898. Not exactly reachable by NSEC_PER_SEC/HZ. Or did I miss something,
>>like the relationship of jiffie to 1/HZ and to real time.
>
>
> HZ not being HZ is the source of many foul problems.
>
> NTP should be able to correct for the error. For systems
> not running NTP, provide a fake NTP to make corrections
> based on the expected frequency error.
>
> Based on that, skip or double-up on the ticks to make
> them be exactly HZ over long periods of time.

There are several problems here. First, to make this possible, you will have to
outlaw several values for HZ (1024 comes to mind). Second, like it or not, 1/HZ
or something close to it, is the timer resolution. I think we need to try and
keep this a power of 10, mostly because there are a lot of folks who just don't
understand it if it is not. And third, you need to get real close to it with
the hardware timer. If you introduce NTP or some such to fix things, well,
things just break another place. For example, we started 2.6 with HZ=1000 and
had problems like:
> time sleep 10
sleeping for 9.9 seconds. This will just not do. Any corrections made to the
wall clock really need to be made to the timer system as well. As it is we
assume that, by correctly choosing the tick value, the wall clock will be
correct on average, even under NTP. I.e. that the NTP correction will be, over
time, very small. We really do want code that is much more accurate than "time
sleep 10" to be right, i.e. if we sleep for x nanoseconds, the wall clock will
have changed by x nanoseconds while we did so.
>
>

~

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-03 07:47:20

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

john stultz wrote:
> On Thu, 2004-09-02 at 17:47, Christoph Lameter wrote:
>
>>On Thu, 2 Sep 2004, john stultz wrote:
>>
>>
>>>>Of course but its not a generic way of timer acccess and
>>>>requires a fastcall for each timer. You still have the problem of
>>>>exporting the frequency and the time offsets to user space (those also
>>>>need to be kept current in order for one to be able to calculate a timer
>>>>value!). The syscall/fastcall API then needs to be extended to allow for a
>>>>system call to access each of the individual timers supported.
>>>
>>>Indeed, it would require a fastcall accessor for each timesource, but
>>>maybe fastcall is the wrong way to describe it. Could it just be a
>>>function that uses the epc to escalate its privileges? As for freq and
>>>offsets, those would already be user-visible (see below for more
>>>details)
>>
>>The only way curent way to enter the kernel from glibc with a fastcall is
>>the EPC.
>
>
> Hmm. I must be explaining myself poorly, or not understanding you. I
> apologize for not understanding this EPC/fastcall business well enough.
> I'd like to use EPC from a user-executable kernel page to escalate
> privileges to access the hardware counter. I don't care if I have to use
> the the current fastcall (fsys.S) interface or not. However you're
> making sounds like this isn't possible, so I'll need to do some
> research.
>
>
>>>The plan at the moment is that the timeofday core gettimeofday code path
>>>as well as any timesource that supports it adds a _vsyscall linker
>>>attribute. Then the linker will place all the code on a special page or
>>>set of pages. Those pages would then be mapped in as user-readable. Then
>>>just like the x86-64's vsyscall gettimeofday, glibc would re-direct
>>>gettimeofday calls to the user-mapped kernel pages, where it would
>>>execute in user mode (with maybe the epc privilege escalation for ia64
>>>time sources that required it).
>>
>>The EPC call already does do a *secure* transfer like this on IA64 and
>>will execute kernel code without user space mapping. This idea raises all sorts
>>of concerns....
>
>
> Yes, but its not portable. Reducing duplicate code so timeofday
> maintenance isn't a nightmare is the first goal here. It may not be
> completely achievable, and when I hit that point I'll have to rework the
> design, but at this point I'm not convinced that it cannot be done.
>
> As for security concerns, all your mapping out to userspace are the time
> variables and the functions needed to convert those to a accurate time
> of day. This is almost what you suggest below, but with the additional
> math from the kernel. The method I'm suggesting is very similar to
> x86-64's arch/x86-64/kernel/vsyscall.c.
>
>
>
>>>I had to do most of this for the i386 vsyscall-gettimeofday port, but I
>>>was unhappy with the duplication of code (and bugs), as well as the fact
>>>that I was then being pushed to re-do it for each architecture. While
>>>its not currently implemented (I was hoping to keep the discussion
>>>focused on the correctness of what's been implemented), I feel the plan
>>>for user-mode access won't be too complex. I'm still open for further
>>>discussion if you'd like, obviously performance is quite important, and
>>>I want to calm any fears you have, but I'm sure the new ntp code plenty
>>>of performance issues to look at before we start digging into usermode
>>>access, so maybe we can come back to this discussion later?
>>
>>The functions in the timer source structure is a central problem for IA64. We
>>cannot take that out right now.
>
>
> Don't worry I'm not submitting this code just yet. :) I'll need all (or
> at least most of the important) architecture maintainers to sign on
> before I try to push this code in.
>
> Right now I'm trying to shake out possible problems with the design and
> the first pass implementation of the code. You're helping me do that,
> and I thank you for it. Concessions will be made, but for now I'm going
> to try to preserve the current design and work around the problems as
> they arise.
>
>
>
>>The full parameterization of timer access as I have suggested also allows
>>user page access to timers with simply mapping a page to access the timer.
>>No other gimmicks are needed since the timer source structure contains all
>>information to calculate a timer offset.
>>
>>
>>>Yes, but x86-64 has one way, and ia64 does it another, and i know ppc
>>>folks have talked about their own user mode time access. Chasing down a
>>>time bug across arches gets to be fairly hairy, so I'm trying to
>>>simplify that.
>>
>>The simplest thins is to provide a data structure without any functions
>>attached that can simply be copied into userspace if wanted. If an arch
>>needs special time access then that is depending on the arch specific
>>methods available and such a data structure as I have proposed will
>>include all the info necessary to implement such user mode time access.
>
>
> Ehhh.. I really don't like the idea of giving all the raw values to
> userspace and letting user-code do the timeofday calculation. Fixing
> bugs in each arches timeofday code is hard enough. Imagine if we have to
> go through and fix userspace too! It would also make a user/kernel data
> interface that we'd have to preserve. I'd like to avoid that and instead
> use the vsyscall method to give us greater flexibility. Plus I doubt
> anyone would want to implement the NTP adjustments in userspace? eek!
>
>
>>A requirement to call functions in the kernel to determine time makes
>>these solution impossible. And its getting extremely complex if one has to
>>invoke different functions for each timer supported.
>
>
> The struct timesource interface of read()/delta()/cyc2ns() was the best
> generalization I could come up with. I feel they're necessary for the
> following reasons:
>
> cyc2ns(): In this conversion we can optimize the math depending on the
> timesource. If the timesource freq is a power of 2, we can just use
> shift! However if its a weird value and we have to be very precise, we
> do a full 64bit divide. We're not stuck ith one equation given a freq
> value.
>
> delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
> 24 bits, and the cyclone is 40. Thus to do proper twos complement
> subtraction without overflow worries you need to mask the subtraction.
> This can be done by exporting a mask value w/ the freq value, but was
> cleaner when moved into the timesource.
>
> read(): Rather then just giving the address of the register, the read
> call allows for timesource specific logic. This lets us use jiffies as a
> timesource, or in cases like the ACPI PM timesource, where the register
> must be read 3 times in order to ensure a correct value is latched, we
> can avoid having to include that logic into the generic code, so it does
> not affect systems that do not use or have that timesource.

I am not convince that 3 reads are in fact needed. In fact, I am amost certain
that two is more than enough. In fact, it takes so long to read it that I just
use one read and a sanity check in the HRT code. Here is the code I use:

unsigned long
quick_get_cpuctr(void)
{
static unsigned long last_read = 0;
static int qgc_max = 0;
int i;

unsigned long rd_delta, rd_ans, rd = inl(acpi_pm_tmr_address);

/*
* This will be REALLY big if ever we move backward in time...
*/
rd_delta = (rd - last_read) & SIZE_MASK;
last_read = rd;

rd_ans = (rd - last_update) & SIZE_MASK;

if (likely((rd_ans < (arch_cycles_per_jiffy << 1)) &&
(rd_delta < (arch_cycles_per_jiffy << 1))))
return rd_ans;

for (i = 0; i < 10; i++) {
rd = inl(acpi_pm_tmr_address);
rd_delta = (rd - last_read) & SIZE_MASK;
last_read = rd;
if (unlikely(i > qgc_max))
qgc_max = i;
/*
* On my test machine (800MHZ dual PIII) this is always
* seven. Seems long, but we will give it some slack...
* We note that rd_delta (and all the vars) unsigned so
* a backward movement will show as a really big number.
*/
if (likely(rd_delta < 20))
return (rd - last_update) & SIZE_MASK;
}
return (rd - last_update) & SIZE_MASK;
}


--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-03 08:17:47

by Ulrich Windl

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday i386 hooks (v.A0)

On 2 Sep 2004 at 18:44, George Anzinger wrote:

> john stultz wrote:
> > All,
> > This patch implements the minimal i386 architecture hooks to enable the
> > new time of day subsystem code. It applies on top of my
> > linux-2.6.9-rc1_timeofday-core_A0 patch and with this patch applied, you
> > can test the new time of day subsystem on i386. Basically it adds the
> > call to timeofday_interrupt_hook() and cuts alot of code out of the
> > build. The only new code is the sync_persistant_clock() function which
> > is mostly ripped out of do_timer_interrupt(). Pretty un-interesting.
> >
> > I look forward to your comments and feedback.
> >
> > thanks
> > -john
> >
> > linux-2.6.9-rc1_timeofday-i386_A0.patch
> > =======================================
> > diff -Nru a/arch/i386/Kconfig b/arch/i386/Kconfig
> > --- a/arch/i386/Kconfig 2004-09-02 13:29:59 -07:00
> > +++ b/arch/i386/Kconfig 2004-09-02 13:29:59 -07:00
> > @@ -14,6 +14,10 @@
> > 486, 586, Pentiums, and various instruction-set-compatible chips by
> > AMD, Cyrix, and others.
> >
> > +config NEWTOD
> > + bool
> > + default y
> > +
> > config MMU
> > bool
> > default y
> > diff -Nru a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
> > --- a/arch/i386/kernel/time.c 2004-09-02 13:29:59 -07:00
> > +++ b/arch/i386/kernel/time.c 2004-09-02 13:29:59 -07:00
> > @@ -67,6 +67,8 @@
> >
> > #include "io_ports.h"
> >
> > +#include <linux/timeofday.h>
> > +
> > extern spinlock_t i8259A_lock;
> > int pit_latch_buggy; /* extern */
> >
> > @@ -87,6 +89,7 @@
> >
> > struct timer_opts *cur_timer = &timer_none;
> >
> > +#ifndef CONFIG_NEWTOD
> > /*
> > * This version of gettimeofday has microsecond resolution
> > * and better than microsecond precision on fast x86 machines with TSC.
> > @@ -169,6 +172,7 @@
> > }
> >
> > EXPORT_SYMBOL(do_settimeofday);
> > +#endif
> >
> > static int set_rtc_mmss(unsigned long nowtime)
> > {
> > @@ -194,12 +198,39 @@
> > * Note: This function is required to return accurate
> > * time even in the absence of multiple timer ticks.
> > */
> > +#ifndef CONFIG_NEWTOD
> > unsigned long long monotonic_clock(void)
> > {
> > return cur_timer->monotonic_clock();
> > }
> > EXPORT_SYMBOL(monotonic_clock);
> > +#endif
> >
> > +void sync_persistant_clock(struct timespec ts)
> > +{
> > + /*
> > + * If we have an externally synchronized Linux clock, then update
> > + * CMOS clock accordingly every ~11 minutes. Set_rtc_mmss() has to be
> > + * called as close as possible to 500 ms before the new second starts.
> > + */
> > + if (ts.tv_sec > last_rtc_update + 660 &&
> > + (ts.tv_nsec / 1000)
> > + >= USEC_AFTER - ((unsigned) TICK_SIZE) / 2 &&
> > + (ts.tv_nsec / 1000)
> > + <= USEC_BEFORE + ((unsigned) TICK_SIZE) / 2) {
> > + /* horrible...FIXME */
> > + if (efi_enabled) {
> > + if (efi_set_rtc_mmss(ts.tv_sec) == 0)
> > + last_rtc_update = ts.tv_sec;
> > + else
> > + last_rtc_update = ts.tv_sec - 600;
> > + } else if (set_rtc_mmss(ts.tv_sec) == 0)
> > + last_rtc_update = ts.tv_sec;
> > + else
> > + last_rtc_update = ts.tv_sec - 600; /* do it again in 60 s */
> > + }
> > +
> I have wondered, and continue to do so, why this is not a timer driven function.

I think it depends on how reliable timers are regarding in-time triggering. This
code has to be executed on-time to make sense. Really.

> It just seems silly to check this every interrupt when we have low overhead
> timers for just this sort of thing.
>
> I wonder about the load calc in the same way...

That's completely different.

...

Regards,
Ulrich


2004-09-03 10:08:48

by Dominik Brodowski

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

Hi John,

Thanks for this exciting code. A few questions:

- Where do you intend to put the "delay" code to? Generalize it as well?

- cpufreq hooks to tsc.c and i386_tsc.c[*] can easily be added. For them to
work _better_ than current code: can timeofday_hook() be called (with IRQs
disabled) _anywhen_ from kernel context?
[*] actually, only one of them needs the notifier, AFAICS...

- what about keeping lower-priority timesources still "active" in some sort to
a) enable loading _and_ unloading timsources (even modularizing them
becoms possible which should make testing easier...),
b) call them every couple of seconds to verify the currently used
timesource is still sane (and if not, call cpufreq_delayed_get() for
example or disable the timesource). This would mean that e.g. pmtmr
and pit can be used to "verify" and "backup" tsc, or otherwise...
The "clock=tsc" override would only affect the priority of the
timesource then, making it so large that no other timesource can
"preempt" it, but doesn't avoid making other timesources available
for backup and verification purposes.

Thanks,

Dominik

2004-09-03 15:22:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Thu, Sep 02, 2004 at 02:07:19PM -0700, john stultz wrote:
> All,
> Here again is my updated time of day proposal. Also to follow is
> working code for i386. I wanted to get this out so folks could actually
> play around with the core architecture independent code. The i386
> specific bits function, but are unfinished. Thus they have been broken
> out and should mostly be ignored for this release.
>
> I'd really appreciate any feedback or concerns about this proposal and
> the following code.

Thanks for working on this. Some cleanup in this area was long overdue.

> Functions:
> ----------
> timeofday_hook()
> now = read(); /* read the timesource */
> ns = cyc2ns(now - offset_base); /* calc nsecs since last call */
> ntp_ns = ntp_scale(ns); /* apply ntp scaling */
> system_time += ntp_ns; /* add scaled value to system_time */
> ntp_advance(ns); /* advance ntp state machine by ns */
> offset_base = now; /* set new offset_base */
>
> monotonic_clock()
> now = read(); /* read the timesource */
> ns = cyc2ns(now - offset_base); /* calculate nsecs since last hook */
> ntp_ns = ntp_scale(ns); /* apply ntp scaling */
> return system_time + ntp_ns; /* return system_time and scaled value


I am not sure why you have different hooks for timeofday and monotic.
It may be better to read the timeonly once and then convert to monotonic
or TOD.

> */
>
> settimeofday(desired)
> wall_offset = desired - monotonic_clock(); /* set wall offset */
>
> gettimeofday()
> return wall_offset + monotonic_clock(); /* return current timeofday */


Hmm, I am missing something here, but how do you handle
the case where the timer interrupt uses HPET, but the offset
from last timer interrupt is determined using TSC. But
some machines don't have a reliable TSC, so it may need
to use a different "offset" source.

I guess you would need two different drivers here?

> o vsyscalls/userspace gettimeofday()
> - Mark functions and data w/ __vsyscall attribute
> - Use linker to put all __vsyscall data in the same set of pages

That won't work because the kernel needs to write to
these variables (e.g. to update the wall time from the interrupt
handler). So in general you need two different mappings, one
read only for user space and another writable one for the kernel.

Regarding your patch, putting the delta in a virtual call
seems a bit of overkill. Is that really needed? That was
just from a quick grep, i haven't read it in detail.

-Andi

2004-09-03 16:21:03

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2 Sep 2004, john stultz wrote:

> > The only way curent way to enter the kernel from glibc with a fastcall is
> > the EPC.
>
> Hmm. I must be explaining myself poorly, or not understanding you. I
> apologize for not understanding this EPC/fastcall business well enough.
> I'd like to use EPC from a user-executable kernel page to escalate
> privileges to access the hardware counter. I don't care if I have to use
> the the current fastcall (fsys.S) interface or not. However you're
> making sounds like this isn't possible, so I'll need to do some
> research.

The mechanism you envision could probably be created on IA64. Its
certainly not available on all platforms and potentially cannot be available.

Also if you want to pursue the approach with functions you need to enter
the kernel at least 3 times(!) in order to call the 3 different functions
you defined in order to determine time. You need to retrieve time, then do
the time diff call then the nanosecond conversion.

> > The EPC call already does do a *secure* transfer like this on IA64 and
> > will execute kernel code without user space mapping. This idea raises all sorts
> > of concerns....
>
> Yes, but its not portable. Reducing duplicate code so timeofday
> maintenance isn't a nightmare is the first goal here. It may not be
> completely achievable, and when I hit that point I'll have to rework the
> design, but at this point I'm not convinced that it cannot be done.

It does not need to be portable since it is an architecture specific
optimization which could be accomplished in variety of ways on other
platforms. The time source information structure could be platform
independent. The architectures can optimize the way they interpret the
timer source information structure without additional calls to kernel
functions.

I would expect the logic for time retrieval not to be subject to much
change. The current ASM for IA64 f.e. would probalby work without change
for your new approach if the time source information would not require
function calls.

However, the proposal makes these optimizations impossible.

> > The simplest thins is to provide a data structure without any functions
> > attached that can simply be copied into userspace if wanted. If an arch
> > needs special time access then that is depending on the arch specific
> > methods available and such a data structure as I have proposed will
> > include all the info necessary to implement such user mode time access.
>
> Ehhh.. I really don't like the idea of giving all the raw values to
> userspace and letting user-code do the timeofday calculation. Fixing
> bugs in each arches timeofday code is hard enough. Imagine if we have to
> go through and fix userspace too! It would also make a user/kernel data
> interface that we'd have to preserve. I'd like to avoid that and instead
> use the vsyscall method to give us greater flexibility. Plus I doubt
> anyone would want to implement the NTP adjustments in userspace? eek!

The raw values in user space can be used for specialized purposes
(somewhat like a generalized form of HPET) by applicationbs. This is not
intended for real system time.

> cyc2ns(): In this conversion we can optimize the math depending on the
> timesource. If the timesource freq is a power of 2, we can just use
> shift! However if its a weird value and we have to be very precise, we
> do a full 64bit divide. We're not stuck with one equation given a freq
> value.

I have never seeen a timesource freq with the power of 2. Division is not
necessary since one can realize this by multiplying with a certain factor
and then shifting the result right instead of dividing.

> delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
> 24 bits, and the cyclone is 40. Thus to do proper twos complement
> subtraction without overflow worries you need to mask the subtraction.
> This can be done by exporting a mask value w/ the freq value, but was
> cleaner when moved into the timesource.

I would suggest it is better to add this mask and avoid another call to
a tiny function that does simply mask and subtract. Note that compilers
are more efficient if they get a sufficiently large chunk of code. This is
in particular necessary on IA64 and other processors given the inherent
parallelism in their internal CPU. A function call is typically much
slower than a subtract and and operation.

> read(): Rather then just giving the address of the register, the read
> call allows for timesource specific logic. This lets us use jiffies as a
> timesource, or in cases like the ACPI PM timesource, where the register
> must be read 3 times in order to ensure a correct value is latched, we
> can avoid having to include that logic into the generic code, so it does
> not affect systems that do not use or have that timesource.

I think it is essential to have the capability to use a function. But
nevertheless it is quite inefficient to have yet another small function
that simply reads a value from a memory location. What I proposed is to be
able to specify a memory location in the time source structure. That way
this function call can be avoided for most timer source. Any specialized
time source will be able to use a function call but will then not be as
fast as the time sources that can simply setup a memory address.

> But I doubt I'll convince you with words, so let me work on it a bit and
> see if I can code around your concerns and put you at ease. You've
> brought up some good issues, and I'll definitely work to resolve them!

Oh. I think this is good in terms of clarifying the issues. I did not
realize that these capabilities that you proposed to use existed. My main
concern here is efficient and scalable access to time. The ideal solution
is one routine that can run straight through with minimal locking and
simply do it all for most cases. The seqlock approach takes care of the
locking.

Also with the ability to specify parameter instead of functions, one could
easily setup a timer with a single function call like f.e.


setup_timer(TIME_SOURCE_CPU,NULL, 1500000)

or

setup_timer(TIME_SOURCE_MMIO64, &timer, 4000000)

Of course I am skimming by some additional detail like the mask. But this
would cut down significantly on the code piece to be maintained. From what
I can see my approach saves a lot of duplicated code. The duplication that
may exist comes about because of architecture specific time access
optimizations.

2004-09-03 18:17:25

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday i386 hooks (v.A0)

Ulrich Windl wrote:
> On 2 Sep 2004 at 18:44, George Anzinger wrote:
>
>
~

>>>+#endif
>>>
>>>+void sync_persistant_clock(struct timespec ts)
>>>+{
>>>+ /*
>>>+ * If we have an externally synchronized Linux clock, then update
>>>+ * CMOS clock accordingly every ~11 minutes. Set_rtc_mmss() has to be
>>>+ * called as close as possible to 500 ms before the new second starts.
>>>+ */
>>>+ if (ts.tv_sec > last_rtc_update + 660 &&
>>>+ (ts.tv_nsec / 1000)
>>>+ >= USEC_AFTER - ((unsigned) TICK_SIZE) / 2 &&
>>>+ (ts.tv_nsec / 1000)
>>>+ <= USEC_BEFORE + ((unsigned) TICK_SIZE) / 2) {
>>>+ /* horrible...FIXME */
>>>+ if (efi_enabled) {
>>>+ if (efi_set_rtc_mmss(ts.tv_sec) == 0)
>>>+ last_rtc_update = ts.tv_sec;
>>>+ else
>>>+ last_rtc_update = ts.tv_sec - 600;
>>>+ } else if (set_rtc_mmss(ts.tv_sec) == 0)
>>>+ last_rtc_update = ts.tv_sec;
>>>+ else
>>>+ last_rtc_update = ts.tv_sec - 600; /* do it again in 60 s */
>>>+ }
>>>+
>>
>>I have wondered, and continue to do so, why this is not a timer driven function.
>
>
> I think it depends on how reliable timers are regarding in-time triggering. This
> code has to be executed on-time to make sense. Really.

Granted, but if we are late we can easily skip to the next second. The 60s
thing is rather arbitrary. If we are always late, well, there are lots of user
aps that rely on timers being on time most of the time. So we need to get that
right.

-g
>
>
>> It just seems silly to check this every interrupt when we have low overhead
>>timers for just this sort of thing.
>>
>>I wonder about the load calc in the same way...
>
>
> That's completely different.
>
> ...
>
> Regards,
> Ulrich
>
>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-03 19:23:52

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Thu, 2004-09-02 at 23:42, Albert Cahalan wrote:
> > > +int ntp_leapsecond(struct timespec now)
> > > +{
> > > + /*
> > > + * Leap second processing. If in leap-insert state at
> > > + * the end of the day, the system clock is set back one
> > > + * second; if in leap-delete state, the system clock is
> > > + * set ahead one second. The microtime() routine or
> > > + * external clock driver will insure that reported time
> > > + * is always monotonic. The ugly divides should be
> > > + * replaced.
>
> Don't optimize until the patch is in and stable.
> The divides can be removed much later. Wait months,
> if not forever, before making the code less readable.
>
> The same goes for arch-specific non-syscall hacks.

Yep. Code readability is crucial, although performance is also a concern
that *has* to be addressed.

As much as I'm probably digging myself a hole in doing all of this, I
really don't want to work on time for the rest of my life, so making the
code clear and readable is my only hope for passing this on. :)

-john


2004-09-03 19:33:57

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Fri, 2004-09-03 at 00:24, George Anzinger wrote:
> Albert Cahalan wrote:
> > On Thu, 2004-09-02 at 21:39, George Anzinger wrote:
> >
> >>john stultz wrote:
> >
> >
> >>>+
> >>>+static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
> >>>+{
> >>>+
> >>>+ cyc *= NSEC_PER_SEC/HZ;
> >>
> >>Hm... This assumes that 1/HZ is what is needed here. Today this value is
> >>999898. Not exactly reachable by NSEC_PER_SEC/HZ. Or did I miss something,
> >>like the relationship of jiffie to 1/HZ and to real time.
> >
> >
> > HZ not being HZ is the source of many foul problems.
> >
> > NTP should be able to correct for the error. For systems
> > not running NTP, provide a fake NTP to make corrections
> > based on the expected frequency error.
> >
> > Based on that, skip or double-up on the ticks to make
> > them be exactly HZ over long periods of time.
>
> There are several problems here. First, to make this possible, you will have to
> outlaw several values for HZ (1024 comes to mind). Second, like it or not, 1/HZ
> or something close to it, is the timer resolution. I think we need to try and
> keep this a power of 10, mostly because there are a lot of folks who just don't
> understand it if it is not. And third, you need to get real close to it with
> the hardware timer. If you introduce NTP or some such to fix things, well,
> things just break another place. For example, we started 2.6 with HZ=1000 and
> had problems like:
> > time sleep 10
> sleeping for 9.9 seconds. This will just not do. Any corrections made to the
> wall clock really need to be made to the timer system as well. As it is we
> assume that, by correctly choosing the tick value, the wall clock will be
> correct on average, even under NTP. I.e. that the NTP correction will be, over
> time, very small. We really do want code that is much more accurate than "time
> sleep 10" to be right, i.e. if we sleep for x nanoseconds, the wall clock will
> have changed by x nanoseconds while we did so.

I feel trying to keep two notions of time, one in the timeofday code and
one in the timer code is the real issue. Trying to better keep them
synced will just lead to madness. Instead the timer subsystem needs to
move to using monotonic_clock(), or get_lowres_timestamp() instead of
using jiffies for its notion of time. Jiffies is just an interrupt
counter.

This somewhat ignores the fact that we're discussing a "jiffies
timesource", which I only implemented as a lowest common denominator /
if you really don't have anything better / thanks the academy for its'
worst time source ever award/ simple example of the timesource code.

More and more I think I need a t-shirt that reads "jiffies ain't time".

thanks
-john

2004-09-03 19:50:04

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Fri, 2004-09-03 at 00:43, George Anzinger wrote:
> john stultz wrote:
> > On Thu, 2004-09-02 at 17:47, Christoph Lameter wrote:
> >
> >>On Thu, 2 Sep 2004, john stultz wrote:
> > read(): Rather then just giving the address of the register, the read
> > call allows for timesource specific logic. This lets us use jiffies as a
> > timesource, or in cases like the ACPI PM timesource, where the register
> > must be read 3 times in order to ensure a correct value is latched, we
> > can avoid having to include that logic into the generic code, so it does
> > not affect systems that do not use or have that timesource.
>
> I am not convince that 3 reads are in fact needed. In fact, I am amost certain
> that two is more than enough. In fact, it takes so long to read it that I just
> use one read and a sanity check in the HRT code. Here is the code I use:

[snip]

I'd have to defer to Dominik (acpi-pm author) for that. But the case
remains, to get a "good" value, you have to do more then just give
access to the raw register.

thanks
-john

2004-09-03 20:14:02

by john stultz

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Fri, 2004-09-03 at 02:54, Dominik Brodowski wrote:
> Hi John,
>
> Thanks for this exciting code. A few questions:
>
> - Where do you intend to put the "delay" code to? Generalize it as well?

I'm putting that off. Some of the timesources aren't granular enough to
be used, so I don't see an easy way to cleanly use time source delays
for some cases and loop delays for others. So I just plan to leave it as
an arch specific implementation for now.

> - cpufreq hooks to tsc.c and i386_tsc.c[*] can easily be added. For them to
> work _better_ than current code: can timeofday_hook() be called (with IRQs
> disabled) _anywhen_ from kernel context?
> [*] actually, only one of them needs the notifier, AFAICS...

Yep, that's on my list. I'm trying to keep to just cleaning up one thing
at a time, but since I've got to re-work the timer_tsc.c code anyway, I
figured I'd try to organize all the TSC related functions
(synchronization, cpufreq, get_cycles(), tsc_delay maybe?) into tsc.c.
This will simplify things if we ever get around to correctly fixing the
SMP systems w/ different speed cpus issue.

> - what about keeping lower-priority timesources still "active" in some sort to
> a) enable loading _and_ unloading timsources (even modularizing them
> becoms possible which should make testing easier...),
> b) call them every couple of seconds to verify the currently used
> timesource is still sane (and if not, call cpufreq_delayed_get() for
> example or disable the timesource). This would mean that e.g. pmtmr
> and pit can be used to "verify" and "backup" tsc, or otherwise...
> The "clock=tsc" override would only affect the priority of the
> timesource then, making it so large that no other timesource can
> "preempt" it, but doesn't avoid making other timesources available
> for backup and verification purposes.

That's totally the plan, although I want to put the control into sysfs.
Load a module, and echo "acpi-pm" or whatever into the sysfs file. I
just left it out for the first pass so folks would focus on the core
timeofday and NTP code (I can keep wishing, right? :)

thanks
-john

2004-09-03 20:22:01

by john stultz

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Fri, 2004-09-03 at 08:17, Andi Kleen wrote:
> On Thu, Sep 02, 2004 at 02:07:19PM -0700, john stultz wrote:
> > timeofday_hook()
> > now = read(); /* read the timesource */
> > ns = cyc2ns(now - offset_base); /* calc nsecs since last call */
> > ntp_ns = ntp_scale(ns); /* apply ntp scaling */
> > system_time += ntp_ns; /* add scaled value to system_time */
> > ntp_advance(ns); /* advance ntp state machine by ns */
> > offset_base = now; /* set new offset_base */
> >
> > monotonic_clock()
> > now = read(); /* read the timesource */
> > ns = cyc2ns(now - offset_base); /* calculate nsecs since last hook */
> > ntp_ns = ntp_scale(ns); /* apply ntp scaling */
> > return system_time + ntp_ns; /* return system_time and scaled value
>
>
> I am not sure why you have different hooks for timeofday and monotic.
> It may be better to read the timeonly once and then convert to monotonic
> or TOD.

You might be a bit confused:
The timeofday_hook (should be timeofday_interrupt_hook, my bad) is
called by the semi-periodic-irregular-interval(also known as "timer")
interrupt. Its what does the housekeeping for all the timeofday code so
we don't run into a counter overflow.

monotonic_clock() is an accessor that returns the amount of time that
has been accumulated since boot in nanoseconds.

do_timeofday() is also an accessor which uses monotonic_clock +
wall_time_offset to calculate wall time.


> > */
> >
> > settimeofday(desired)
> > wall_offset = desired - monotonic_clock(); /* set wall offset */
> >
> > gettimeofday()
> > return wall_offset + monotonic_clock(); /* return current timeofday */
>
>
> Hmm, I am missing something here, but how do you handle
> the case where the timer interrupt uses HPET, but the offset
> from last timer interrupt is determined using TSC. But
> some machines don't have a reliable TSC, so it may need
> to use a different "offset" source.
>
> I guess you would need two different drivers here?

Not really, this is very doable. Time of day is now completely isolated
from the timer interrupt code, so it doesn't care who calls the
timeofday_interrupt_hook (HPET's interrupt or the i8253's interrupt
could do it). Also it doesn't have to be regular or exact, just frequent
enough that the timesource doesn't overflow.

> > o vsyscalls/userspace gettimeofday()
> > - Mark functions and data w/ __vsyscall attribute
> > - Use linker to put all __vsyscall data in the same set of pages
>
> That won't work because the kernel needs to write to
> these variables (e.g. to update the wall time from the interrupt
> handler). So in general you need two different mappings, one
> read only for user space and another writable one for the kernel.

Well, I was thinking about this and I don't see why the kernel and
userspace can't share the same data and functions for read only access?
Why exactly does it need to be mapped twice? We only write to the data
from kernel mode using different functions, so the same symbols should
be usable.

I could be missing something, and my userspace implementation plans
weigh fairly heavily on this, so do please correct me.

> Regarding your patch, putting the delta in a virtual call
> seems a bit of overkill. Is that really needed? That was
> just from a quick grep, i haven't read it in detail.

Yep. Christoph point is starting to get to me. I'll probably swap out
delta() for a mask variable do the masking in the generic code. However,
I do feel its a micro-optimization which hurts readability, so I'll put
off that change until later when the code has been thoroughly reviewed.

I came up with the read/delta/cyc2ns interface after trying to think of
the simplest way to use ANY time source, no matter how strange.
cycle_t's are basically magic cookies given by the timesource that can
be manipulated generically and converted into nanoseconds. It is likely
an over-virtualization, but I'd prefer to keep things general and
flexible until more arch maintainers tell me its unnecessary.

thanks
-john

2004-09-03 20:53:57

by Dominik Brodowski

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Fri, Sep 03, 2004 at 12:41:22PM -0700, john stultz wrote:
> > - cpufreq hooks to tsc.c and i386_tsc.c[*] can easily be added. For them to
> > work _better_ than current code: can timeofday_hook() be called (with IRQs
> > disabled) _anywhen_ from kernel context?
> > [*] actually, only one of them needs the notifier, AFAICS...
>
> Yep, that's on my list. I'm trying to keep to just cleaning up one thing
> at a time, but since I've got to re-work the timer_tsc.c code anyway, I
> figured I'd try to organize all the TSC related functions
> (synchronization, cpufreq, get_cycles(), tsc_delay maybe?) into tsc.c.
> This will simplify things if we ever get around to correctly fixing the
> SMP systems w/ different speed cpus issue.

What about removing cpu_freq_khz you have in your patch, adding a per-CPU
value, and just use the value of the boot CPU for the other CPUs if
!CONFIG_DIFFERENT_CPU_SPEEDS or something like that?

> > - what about keeping lower-priority timesources still "active" in some sort to
> > a) enable loading _and_ unloading timsources (even modularizing them
> > becoms possible which should make testing easier...),
> > b) call them every couple of seconds to verify the currently used
> > timesource is still sane (and if not, call cpufreq_delayed_get() for
> > example or disable the timesource). This would mean that e.g. pmtmr
> > and pit can be used to "verify" and "backup" tsc, or otherwise...
> > The "clock=tsc" override would only affect the priority of the
> > timesource then, making it so large that no other timesource can
> > "preempt" it, but doesn't avoid making other timesources available
> > for backup and verification purposes.
>
> That's totally the plan, although I want to put the control into sysfs.

Even better. Thought about that too, but was worried you'd dislike the sysfs
overhead.

Thanks,
Dominik

2004-09-03 21:06:49

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Fri, 2004-09-03 at 09:18, Christoph Lameter wrote:
> On Thu, 2 Sep 2004, john stultz wrote:
> > > The simplest thins is to provide a data structure without any functions
> > > attached that can simply be copied into userspace if wanted. If an arch
> > > needs special time access then that is depending on the arch specific
> > > methods available and such a data structure as I have proposed will
> > > include all the info necessary to implement such user mode time access.
> >
> > Ehhh.. I really don't like the idea of giving all the raw values to
> > userspace and letting user-code do the timeofday calculation. Fixing
> > bugs in each arches timeofday code is hard enough. Imagine if we have to
> > go through and fix userspace too! It would also make a user/kernel data
> > interface that we'd have to preserve. I'd like to avoid that and instead
> > use the vsyscall method to give us greater flexibility. Plus I doubt
> > anyone would want to implement the NTP adjustments in userspace? eek!
>
> The raw values in user space can be used for specialized purposes
> (somewhat like a generalized form of HPET) by applicationbs. This is not
> intended for real system time.

I worry these specialized interfaces would become troubling to maintain.
The kernel (via regular syscall, fsyscall, or vsyscall) should provide
accurate time and should be the final authority for that.

If specialized applications want to read the cycle counter (TSC, ITC),
/dev/HPET or /dev/mem or whatever, and try to infer time that's fine but
its at their own risk. When those specialized assumptions break (TSC
being fixed freq, for example) its the applications' problem. The kernel
shouldn't codify or encourage those sorts of risky interfaces by
exporting frequency information in a standardized interface.


> > cyc2ns(): In this conversion we can optimize the math depending on the
> > timesource. If the timesource freq is a power of 2, we can just use
> > shift! However if its a weird value and we have to be very precise, we
> > do a full 64bit divide. We're not stuck with one equation given a freq
> > value.
>
> I have never seeen a timesource freq with the power of 2. Division is not
> necessary since one can realize this by multiplying with a certain factor
> and then shifting the result right instead of dividing.

Well, it was a theoretical example, but I agree with your point. The
math tricks you suggest are usable depending on the frequency of the
timesource. Attaching that sort of logic to the timesource
implementation gives us more flexibility. It could be that the
flexibility is unnecessary, and if so I'm fine w/ changing the design,
but I'm not confident of that just yet.

> > delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
> > 24 bits, and the cyclone is 40. Thus to do proper twos complement
> > subtraction without overflow worries you need to mask the subtraction.
> > This can be done by exporting a mask value w/ the freq value, but was
> > cleaner when moved into the timesource.
>
> I would suggest it is better to add this mask and avoid another call to
> a tiny function that does simply mask and subtract. Note that compilers
> are more efficient if they get a sufficiently large chunk of code. This is
> in particular necessary on IA64 and other processors given the inherent
> parallelism in their internal CPU. A function call is typically much
> slower than a subtract and and operation.

Yep. As I said before, you've got very good points. The delta() function
is the one I'm least attached to, but for now I'm going to hang on to it
for readability reasons. It will be simple to change to a mask value
later after everyone has reviewed the code and is confident of its
correctness.

> > read(): Rather then just giving the address of the register, the read
> > call allows for timesource specific logic. This lets us use jiffies as a
> > timesource, or in cases like the ACPI PM timesource, where the register
> > must be read 3 times in order to ensure a correct value is latched, we
> > can avoid having to include that logic into the generic code, so it does
> > not affect systems that do not use or have that timesource.
>
> I think it is essential to have the capability to use a function. But
> nevertheless it is quite inefficient to have yet another small function
> that simply reads a value from a memory location. What I proposed is to be
> able to specify a memory location in the time source structure. That way
> this function call can be avoided for most timer source. Any specialized
> time source will be able to use a function call but will then not be as
> fast as the time sources that can simply setup a memory address.

I really do like the time_interpolator changes you've submitted in
earlier threads. The core generalizing of timesources changes are very
similar between our two implementations, so your comments are very
important to me. I am open to adding these direct memory accessors to
the time counters, but again its a performance optimization, and right
now clarity is king.

> > But I doubt I'll convince you with words, so let me work on it a bit and
> > see if I can code around your concerns and put you at ease. You've
> > brought up some good issues, and I'll definitely work to resolve them!
>
> Oh. I think this is good in terms of clarifying the issues. I did not
> realize that these capabilities that you proposed to use existed. My main
> concern here is efficient and scalable access to time. The ideal solution
> is one routine that can run straight through with minimal locking and
> simply do it all for most cases. The seqlock approach takes care of the
> locking.

Yep, performance and scalability is *very* important. However
correctness is even more important, and right now we have enough trouble
with just that. All of the arch specific performance hacks have
complicated the current code such that minor changes in other subsystems
(sometimes not so) subtly break things. Untangling these quirks have
been a large part of what I and many others have worked on for the last
few years. So for correctness sake we do have to balance clarity and
performance. Minimizing trade-offs is my goal right now.

I'm starting to feel like I'm just being repetitious, so I'll try to
move on now. Sorry for the long-winded emails.

> Also with the ability to specify parameter instead of functions, one could
> easily setup a timer with a single function call like f.e.
>
>
> setup_timer(TIME_SOURCE_CPU,NULL, 1500000)
>
> or
>
> setup_timer(TIME_SOURCE_MMIO64, &timer, 4000000)
>
> Of course I am skimming by some additional detail like the mask. But this
> would cut down significantly on the code piece to be maintained. From what
> I can see my approach saves a lot of duplicated code. The duplication that
> may exist comes about because of architecture specific time access
> optimizations.

Well, I realize this is just a simple sketch, but how do you then handle
cpu freq changes? Suddenly generic code has to deal with that as well?
And then there's those buggy time sources that have to be read multiple
times, or drifty ITCs that (may or maynot) need cmpxchgs to ensure they
do not go backwards. Making the generic code have to deal with all of
these cases will be ugly (and may even hurt performance).

Clearly, you've said that function calls should be available in those
cases, so I don't want to miss-characterize you comments. However I do
want to point it out so that others keep in mind that you can't always
just give a pointer and a freq value.

Once again, I thank you and everyone else for the great comments!
Hopefully I'll get more arch hooks in soon so you and others can
actually start playing with (and hopefully improving) the code.

thanks
-john

2004-09-03 21:09:54

by john stultz

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Fri, 2004-09-03 at 13:26, Dominik Brodowski wrote:
> On Fri, Sep 03, 2004 at 12:41:22PM -0700, john stultz wrote:
> > > - cpufreq hooks to tsc.c and i386_tsc.c[*] can easily be added. For them to
> > > work _better_ than current code: can timeofday_hook() be called (with IRQs
> > > disabled) _anywhen_ from kernel context?
> > > [*] actually, only one of them needs the notifier, AFAICS...
> >
> > Yep, that's on my list. I'm trying to keep to just cleaning up one thing
> > at a time, but since I've got to re-work the timer_tsc.c code anyway, I
> > figured I'd try to organize all the TSC related functions
> > (synchronization, cpufreq, get_cycles(), tsc_delay maybe?) into tsc.c.
> > This will simplify things if we ever get around to correctly fixing the
> > SMP systems w/ different speed cpus issue.
>
> What about removing cpu_freq_khz you have in your patch, adding a per-CPU
> value, and just use the value of the boot CPU for the other CPUs if
> !CONFIG_DIFFERENT_CPU_SPEEDS or something like that?

Well, for now I just want to re-implement what we already provide. If I
can fix something simple while I'm doing I will, but I don't want to
change behavior too much in order to simplify testing (if vanilla works
fine, and my patch breaks then I want it to be a core bug rather then a
"I changed the semantics of something slightly" issue).

Once I'm having to manage less code, I'll be very much interested in
fixing those outstanding issues. One thing at a time, and all.

> > > - what about keeping lower-priority timesources still "active" in some sort to
> > > a) enable loading _and_ unloading timsources (even modularizing them
> > > becoms possible which should make testing easier...),
> > > b) call them every couple of seconds to verify the currently used
> > > timesource is still sane (and if not, call cpufreq_delayed_get() for
> > > example or disable the timesource). This would mean that e.g. pmtmr
> > > and pit can be used to "verify" and "backup" tsc, or otherwise...
> > > The "clock=tsc" override would only affect the priority of the
> > > timesource then, making it so large that no other timesource can
> > > "preempt" it, but doesn't avoid making other timesources available
> > > for backup and verification purposes.
> >
> > That's totally the plan, although I want to put the control into sysfs.
>
> Even better. Thought about that too, but was worried you'd dislike the sysfs
> overhead.

Overhead? Not sure I see where that'd be a problem. The available
timesource management code needs to be split off into a timesource.c,
but I'll get to that later.

thanks
-john

2004-09-03 22:08:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Fri, 3 Sep 2004, john stultz wrote:

> > The raw values in user space can be used for specialized purposes
> > (somewhat like a generalized form of HPET) by applicationbs. This is not
> > intended for real system time.
>
> I worry these specialized interfaces would become troubling to maintain.
> The kernel (via regular syscall, fsyscall, or vsyscall) should provide
> accurate time and should be the final authority for that.
>
> If specialized applications want to read the cycle counter (TSC, ITC),
> /dev/HPET or /dev/mem or whatever, and try to infer time that's fine but
> its at their own risk. When those specialized assumptions break (TSC
> being fixed freq, for example) its the applications' problem. The kernel
> shouldn't codify or encourage those sorts of risky interfaces by
> exporting frequency information in a standardized interface.

How can a timer have a variable frequency and being used for
timing purposes. Maybe I just do not understand the point on TSC having a
fixed frequency (I certainly hope so but is that true in the power save
modes?).

True it is the applications problem if it breaks but we have applications
for SGI systems that demand access to a time source with the least
operating system interference possible.

> > > cyc2ns(): In this conversion we can optimize the math depending on the
> > > timesource. If the timesource freq is a power of 2, we can just use
> > > shift! However if its a weird value and we have to be very precise, we
> > > do a full 64bit divide. We're not stuck with one equation given a freq
> > > value.
> >
> > I have never seeen a timesource freq with the power of 2. Division is not
> > necessary since one can realize this by multiplying with a certain factor
> > and then shifting the result right instead of dividing.
>
> Well, it was a theoretical example, but I agree with your point. The
> math tricks you suggest are usable depending on the frequency of the
> timesource. Attaching that sort of logic to the timesource
> implementation gives us more flexibility. It could be that the
> flexibility is unnecessary, and if so I'm fine w/ changing the design,
> but I'm not confident of that just yet.

s/depending on/independent of/ ???

Are there any examples where the flexibility is needed?

> > > delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
> > > 24 bits, and the cyclone is 40. Thus to do proper twos complement
> > > subtraction without overflow worries you need to mask the subtraction.
> > > This can be done by exporting a mask value w/ the freq value, but was
> > > cleaner when moved into the timesource.
> >
> > I would suggest it is better to add this mask and avoid another call to
> > a tiny function that does simply mask and subtract. Note that compilers
> > are more efficient if they get a sufficiently large chunk of code. This is
> > in particular necessary on IA64 and other processors given the inherent
> > parallelism in their internal CPU. A function call is typically much
> > slower than a subtract and and operation.
>
> Yep. As I said before, you've got very good points. The delta() function
> is the one I'm least attached to, but for now I'm going to hang on to it
> for readability reasons. It will be simple to change to a mask value
> later after everyone has reviewed the code and is confident of its
> correctness.

Are you sure about that readability argument? A single
statement to calculate the delta is much more readable and verifyable
to be correct than a gazillion of small functions that (one hopes and
prays) all do the same correctly.

> > > read(): Rather then just giving the address of the register, the read
> > > call allows for timesource specific logic. This lets us use jiffies as a
> > > timesource, or in cases like the ACPI PM timesource, where the register
> > > must be read 3 times in order to ensure a correct value is latched, we
> > > can avoid having to include that logic into the generic code, so it does
> > > not affect systems that do not use or have that timesource.
> >
> > I think it is essential to have the capability to use a function. But
> > nevertheless it is quite inefficient to have yet another small function
> > that simply reads a value from a memory location. What I proposed is to be
> > able to specify a memory location in the time source structure. That way
> > this function call can be avoided for most timer source. Any specialized
> > time source will be able to use a function call but will then not be as
> > fast as the time sources that can simply setup a memory address.
>
> I really do like the time_interpolator changes you've submitted in
> earlier threads. The core generalizing of timesources changes are very
> similar between our two implementations, so your comments are very
> important to me. I am open to adding these direct memory accessors to
> the time counters, but again its a performance optimization, and right
> now clarity is king.

Clarity is gained by centralizing these things. The design results in
numerous functions that are defined and which all mostly do the same. For
clarities sake please change the spec....

> Yep, performance and scalability is *very* important. However
> correctness is even more important, and right now we have enough trouble
> with just that. All of the arch specific performance hacks have
> complicated the current code such that minor changes in other subsystems
> (sometimes not so) subtly break things. Untangling these quirks have
> been a large part of what I and many others have worked on for the last
> few years. So for correctness sake we do have to balance clarity and
> performance. Minimizing trade-offs is my goal right now.

Performance as well as clarify as well as arch independence really
demand that these small functions be given up.

> > Of course I am skimming by some additional detail like the mask. But this
> > would cut down significantly on the code piece to be maintained. From what
> > I can see my approach saves a lot of duplicated code. The duplication that
> > may exist comes about because of architecture specific time access
> > optimizations.
>
> Well, I realize this is just a simple sketch, but how do you then handle
> cpu freq changes? Suddenly generic code has to deal with that as well?
> And then there's those buggy time sources that have to be read multiple
> times, or drifty ITCs that (may or maynot) need cmpxchgs to ensure they
> do not go backwards. Making the generic code have to deal with all of
> these cases will be ugly (and may even hurt performance).

The reading from memory only works for well behaved time sources. If
something unusual comes along then a function needs to be called and thus
a performance hit is taken. If its important enough (like the drifty
logic on IA64) then one may consider putting that logic into the
generic code. I did not not say not to have functions at all. I think a
function call must be possible at least to retrieve the timer value for
special cases.

If the CPU freq cases is to be handled by the generic code or a special
function is a design decision that you probably have to make. Over time
other logic will develop that may first show up as new time retrieval
function but that is then recognized to be of universal importance so that
it will then be moved into the generic code.

> Clearly, you've said that function calls should be available in those
> cases, so I don't want to miss-characterize you comments. However I do
> want to point it out so that others keep in mind that you can't always
> just give a pointer and a freq value.

Exactly.

2004-09-03 22:15:31

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

john stultz wrote:
> On Fri, 2004-09-03 at 00:24, George Anzinger wrote:
>
>>Albert Cahalan wrote:
>>
>>>On Thu, 2004-09-02 at 21:39, George Anzinger wrote:
>>>
>>>
>>>>john stultz wrote:
>>>
>>>
>>>>>+
>>>>>+static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
>>>>>+{
>>>>>+
>>>>>+ cyc *= NSEC_PER_SEC/HZ;
>>>>
>>>>Hm... This assumes that 1/HZ is what is needed here. Today this value is
>>>>999898. Not exactly reachable by NSEC_PER_SEC/HZ. Or did I miss something,
>>>>like the relationship of jiffie to 1/HZ and to real time.
>>>
>>>
>>>HZ not being HZ is the source of many foul problems.
>>>
>>>NTP should be able to correct for the error. For systems
>>>not running NTP, provide a fake NTP to make corrections
>>>based on the expected frequency error.
>>>
>>>Based on that, skip or double-up on the ticks to make
>>>them be exactly HZ over long periods of time.
>>
>>There are several problems here. First, to make this possible, you will have to
>>outlaw several values for HZ (1024 comes to mind). Second, like it or not, 1/HZ
>>or something close to it, is the timer resolution. I think we need to try and
>>keep this a power of 10, mostly because there are a lot of folks who just don't
>>understand it if it is not. And third, you need to get real close to it with
>>the hardware timer. If you introduce NTP or some such to fix things, well,
>>things just break another place. For example, we started 2.6 with HZ=1000 and
>>had problems like:
>> > time sleep 10
>>sleeping for 9.9 seconds. This will just not do. Any corrections made to the
>>wall clock really need to be made to the timer system as well. As it is we
>>assume that, by correctly choosing the tick value, the wall clock will be
>>correct on average, even under NTP. I.e. that the NTP correction will be, over
>>time, very small. We really do want code that is much more accurate than "time
>>sleep 10" to be right, i.e. if we sleep for x nanoseconds, the wall clock will
>>have changed by x nanoseconds while we did so.
>
>
> I feel trying to keep two notions of time, one in the timeofday code and
> one in the timer code is the real issue. Trying to better keep them
> synced will just lead to madness. Instead the timer subsystem needs to
> move to using monotonic_clock(), or get_lowres_timestamp() instead of
> using jiffies for its notion of time. Jiffies is just an interrupt
> counter.

Well, there may be a better way. Suppose we change all the accounting code to
work with nanoseconds. That way when we do an accounting pass we would just add
what has elapsed since the last pass. Still need some way to do user timers
that are tightly pegged to the clock.

A thought I had along these lines was to program a timer for each tick. The PIT
is much too slow for this, but the APIC timers seem to be rather easy to
program. I am not sure how fast the access is but it can not be anything like
the PIT. Under this scheme we would use the monotonic clock to figure out just
how far out the next tick should be and program that.

This could be modified to use repeating hardware if it is available. (What does
the HPET provide? Does it interrupt?) Since the APIC clock is not the
reference clock (the PIT & pm timer clock is) we would have to correct these
from time to time but that is rather easy (I do it today in HRT code).

This brings up another issue. We know what the PIT clock frequency is but not
what the TSC clock frequency. Currently we do a calibration run at boot to
figure this but I can easily show that this run consistently gives the wrong
answer (I am sure this has to do with the I/O access delays). If we are going
to use the TSC (or any other "clock" that is not derived from the PITs
14.3181818MHZ ital) we need both a way to get the correct value AND a way to
adjust it for drift over time. (Possibly this is the same thing.) Of course
this is all just x86 stuff. Other archs will have their own issues.

For what its worth, the current 2.6 HRT patch uses the PIT as the 1/HZ time
interrupt. It uses the APIC, if present, for the high-res interrupt source.



--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-03 23:04:42

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Fri, 2004-09-03 at 15:04, Christoph Lameter wrote:
> On Fri, 3 Sep 2004, john stultz wrote:
> > > The raw values in user space can be used for specialized purposes
> > > (somewhat like a generalized form of HPET) by applicationbs. This is not
> > > intended for real system time.
> >
> > I worry these specialized interfaces would become troubling to maintain.
> > The kernel (via regular syscall, fsyscall, or vsyscall) should provide
> > accurate time and should be the final authority for that.
> >
> > If specialized applications want to read the cycle counter (TSC, ITC),
> > /dev/HPET or /dev/mem or whatever, and try to infer time that's fine but
> > its at their own risk. When those specialized assumptions break (TSC
> > being fixed freq, for example) its the applications' problem. The kernel
> > shouldn't codify or encourage those sorts of risky interfaces by
> > exporting frequency information in a standardized interface.
>
> How can a timer have a variable frequency and being used for
> timing purposes. Maybe I just do not understand the point on TSC having a
> fixed frequency (I certainly hope so but is that true in the power save
> modes?).

TSC frequency changes are a reality i386 has been living with for awhile
now. In some cases, there isn't anything else that is performant enough
to use as a timesource, so people want to be able to generate time as
accurately as possible from a variable frequency counter. Me, I think
its somewhat crazy, but I've got to make it work.

> True it is the applications problem if it breaks but we have applications
> for SGI systems that demand access to a time source with the least
> operating system interference possible.

Yep, totally fine with that. If application developers will take the
risk that's fine. Personally, I'd like the kernel's timeofday function
to be fast enough that developers don't feel the need to attempt these
special case hacks, but there will always be folks who need to live on
the edge.


> > > > cyc2ns(): In this conversion we can optimize the math depending on the
> > > > timesource. If the timesource freq is a power of 2, we can just use
> > > > shift! However if its a weird value and we have to be very precise, we
> > > > do a full 64bit divide. We're not stuck with one equation given a freq
> > > > value.
> > >
> > > I have never seeen a timesource freq with the power of 2. Division is not
> > > necessary since one can realize this by multiplying with a certain factor
> > > and then shifting the result right instead of dividing.
> >
> > Well, it was a theoretical example, but I agree with your point. The
> > math tricks you suggest are usable depending on the frequency of the
> > timesource. Attaching that sort of logic to the timesource
> > implementation gives us more flexibility. It could be that the
> > flexibility is unnecessary, and if so I'm fine w/ changing the design,
> > but I'm not confident of that just yet.
>
> s/depending on/independent of/ ???
>
> Are there any examples where the flexibility is needed?

I'm a fan of Barbie's law: "Math is hard, let's go shopping!", so
forgive my inability to immediately see through the details. In dealing
with both high and low res timesources (some with microsecond
frequencies, some run in picosecond frequencies), the math can be tuned
for each case. Low res time sources are mainly a multiply, and possibly
a divide only if it has a non-integer nanosecond per cycle value. For
high-res timesources, we have to divide, so we would want to aproximate
the frequency and use the multiply and shift as you suggested earlier.

Maybe this is completely able to be generalized and I'm just not sharp
enough to see it just yet, but it seems to me that having a timesource
specific cyc2ns allows for further optimization then without.

> > > > delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
> > > > 24 bits, and the cyclone is 40. Thus to do proper twos complement
> > > > subtraction without overflow worries you need to mask the subtraction.
> > > > This can be done by exporting a mask value w/ the freq value, but was
> > > > cleaner when moved into the timesource.
> > >
> > > I would suggest it is better to add this mask and avoid another call to
> > > a tiny function that does simply mask and subtract. Note that compilers
> > > are more efficient if they get a sufficiently large chunk of code. This is
> > > in particular necessary on IA64 and other processors given the inherent
> > > parallelism in their internal CPU. A function call is typically much
> > > slower than a subtract and and operation.
> >
> > Yep. As I said before, you've got very good points. The delta() function
> > is the one I'm least attached to, but for now I'm going to hang on to it
> > for readability reasons. It will be simple to change to a mask value
> > later after everyone has reviewed the code and is confident of its
> > correctness.
>
> Are you sure about that readability argument? A single
> statement to calculate the delta is much more readable and verifyable
> to be correct than a gazillion of small functions that (one hopes and
> prays) all do the same correctly.

Your point is valid, and in moving the arch specific code into arch
independent code, I'm largely trying to reduce the tangle of functions.
Having 3 hardware specific functions isn't all that crazy.

> > > > read(): Rather then just giving the address of the register, the read
> > > > call allows for timesource specific logic. This lets us use jiffies as a
> > > > timesource, or in cases like the ACPI PM timesource, where the register
> > > > must be read 3 times in order to ensure a correct value is latched, we
> > > > can avoid having to include that logic into the generic code, so it does
> > > > not affect systems that do not use or have that timesource.
> > >
> > > I think it is essential to have the capability to use a function. But
> > > nevertheless it is quite inefficient to have yet another small function
> > > that simply reads a value from a memory location. What I proposed is to be
> > > able to specify a memory location in the time source structure. That way
> > > this function call can be avoided for most timer source. Any specialized
> > > time source will be able to use a function call but will then not be as
> > > fast as the time sources that can simply setup a memory address.
> >
> > I really do like the time_interpolator changes you've submitted in
> > earlier threads. The core generalizing of timesources changes are very
> > similar between our two implementations, so your comments are very
> > important to me. I am open to adding these direct memory accessors to
> > the time counters, but again its a performance optimization, and right
> > now clarity is king.
>
> Clarity is gained by centralizing these things. The design results in
> numerous functions that are defined and which all mostly do the same. For
> clarities sake please change the spec....

I disagree, but that doesn't mean I'm right. If more people complain as
the patch develops, I'll rectify the code so it is less confusing.
There's no reason for such a hard line at this point. The code is in no
way close to finished, or entering the kernel, so give me a chance to
roll with it for a bit. I'm just wary of premature optimization traps.

Worse case if we cannot come to some agreement on this, there's nothing
stopping architectures from keeping their own timeofday subsystem. I'm
just tired of implementing the same feature or fixing the same bug over
and over and over for each arch.

thanks
-john

2004-09-03 23:36:56

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

Really quick: I'm off on vacation until Weds. Hopefully I've addressed
in some way everyone's comments and my email box won't be stuffed when I
return. I might peek in every once in awhile, though.

Just one last response...

On Fri, 2004-09-03 at 15:10, George Anzinger wrote:
> john stultz wrote:
> > I feel trying to keep two notions of time, one in the timeofday code and
> > one in the timer code is the real issue. Trying to better keep them
> > synced will just lead to madness. Instead the timer subsystem needs to
> > move to using monotonic_clock(), or get_lowres_timestamp() instead of
> > using jiffies for its notion of time. Jiffies is just an interrupt
> > counter.
>
> Well, there may be a better way. Suppose we change all the accounting code to
> work with nanoseconds. That way when we do an accounting pass we would just add
> what has elapsed since the last pass.

Yep, that'd work too.

> Still need some way to do user timers that are tightly pegged to the clock.

I see that as just a problem of programming the timer interrupt
generator. If you have a nanosecond (or as high as the best timesource
on your system provides) resolution notion of time, then there is
nothing to peg or tie the timer with. You simply just program it to go
off every X nanoseconds. Should it be so inaccurate that it does not go
off right at X nanoseconds, then you've hit a limit of the hardware.
Pick a more accurate interval length, dynamically change your interval,
or live with it. If the soft-timers use monotonic-clock() to determine
when they expire, then you won't get accumulating drift (as
monotonic-clock()is NTP frequency adjusted), only the minor jitter
caused latency caused by the interrupt programming.

> A thought I had along these lines was to program a timer for each tick. The PIT
> is much too slow for this, but the APIC timers seem to be rather easy to
> program. I am not sure how fast the access is but it can not be anything like
> the PIT. Under this scheme we would use the monotonic clock to figure out just
> how far out the next tick should be and program that.

Yep, tickless systems also start being possible. We just have interrupts
for scheduled events.

> This could be modified to use repeating hardware if it is available. (What does
> the HPET provide? Does it interrupt?) Since the APIC clock is not the
> reference clock (the PIT & pm timer clock is) we would have to correct these
> from time to time but that is rather easy (I do it today in HRT code).

Don't know the details of interrupt generation, so I can't tell ya. I'm
not sure I followed the correction bit?

> This brings up another issue. We know what the PIT clock frequency is but not
> what the TSC clock frequency. Currently we do a calibration run at boot to
> figure this but I can easily show that this run consistently gives the wrong
> answer (I am sure this has to do with the I/O access delays). If we are going
> to use the TSC (or any other "clock" that is not derived from the PITs
> 14.3181818MHZ ital) we need both a way to get the correct value AND a way to
> adjust it for drift over time. (Possibly this is the same thing.) Of course
> this is all just x86 stuff. Other archs will have their own issues.

Again, monotonic_clock() and friends are NTP adjusted, so drift caused
by inaccurate calibration shouldn't be a problem the interval timer code
should need to worry about (outside of maybe adjusting its interval time
if its always arriving late/early). If possible the timesource
calibration code should be improved, but that's icing on the cake and
isn't critical.

Again, thanks to everyone for the great feedback and discussion!
-john

2004-09-04 00:08:28

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

john stultz wrote:
> Really quick: I'm off on vacation until Weds. Hopefully I've addressed
> in some way everyone's comments and my email box won't be stuffed when I
> return. I might peek in every once in awhile, though.
>
> Just one last response...
>
> On Fri, 2004-09-03 at 15:10, George Anzinger wrote:
>
>>john stultz wrote:
>>
>>>I feel trying to keep two notions of time, one in the timeofday code and
>>>one in the timer code is the real issue. Trying to better keep them
>>>synced will just lead to madness. Instead the timer subsystem needs to
>>>move to using monotonic_clock(), or get_lowres_timestamp() instead of
>>>using jiffies for its notion of time. Jiffies is just an interrupt
>>>counter.
>>
>>Well, there may be a better way. Suppose we change all the accounting code to
>>work with nanoseconds. That way when we do an accounting pass we would just add
>>what has elapsed since the last pass.
>
>
> Yep, that'd work too.
>
>
>>Still need some way to do user timers that are tightly pegged to the clock.
>
>
> I see that as just a problem of programming the timer interrupt
> generator. If you have a nanosecond (or as high as the best timesource
> on your system provides) resolution notion of time, then there is
> nothing to peg or tie the timer with. You simply just program it to go
> off every X nanoseconds. Should it be so inaccurate that it does not go
> off right at X nanoseconds, then you've hit a limit of the hardware.
> Pick a more accurate interval length, dynamically change your interval,
> or live with it. If the soft-timers use monotonic-clock() to determine
> when they expire, then you won't get accumulating drift (as
> monotonic-clock()is NTP frequency adjusted), only the minor jitter
> caused latency caused by the interrupt programming.
>
>
>>A thought I had along these lines was to program a timer for each tick. The PIT
>>is much too slow for this, but the APIC timers seem to be rather easy to
>>program. I am not sure how fast the access is but it can not be anything like
>>the PIT. Under this scheme we would use the monotonic clock to figure out just
>>how far out the next tick should be and program that.
>
>
> Yep, tickless systems also start being possible. We just have interrupts
> for scheduled events.
>
>
>>This could be modified to use repeating hardware if it is available. (What does
>>the HPET provide? Does it interrupt?) Since the APIC clock is not the
>>reference clock (the PIT & pm timer clock is) we would have to correct these
>>from time to time but that is rather easy (I do it today in HRT code).
>
>
> Don't know the details of interrupt generation, so I can't tell ya. I'm
> not sure I followed the correction bit?
>
>
>>This brings up another issue. We know what the PIT clock frequency is but not
>>what the TSC clock frequency. Currently we do a calibration run at boot to
>>figure this but I can easily show that this run consistently gives the wrong
>>answer (I am sure this has to do with the I/O access delays). If we are going
>>to use the TSC (or any other "clock" that is not derived from the PITs
>>14.3181818MHZ ital) we need both a way to get the correct value AND a way to
>>adjust it for drift over time. (Possibly this is the same thing.) Of course
>>this is all just x86 stuff. Other archs will have their own issues.
>
>
> Again, monotonic_clock() and friends are NTP adjusted, so drift caused
> by inaccurate calibration shouldn't be a problem the interval timer code
> should need to worry about (outside of maybe adjusting its interval time
> if its always arriving late/early). If possible the timesource
> calibration code should be improved, but that's icing on the cake and
> isn't critical.
>
Are you providing a way to predict what clock count provide a given time offset
INCLUDING ntp? If so, cool. If not we need to get this conversion right. We
will go into this more on your return.

Have fun.

> Again, thanks to everyone for the great feedback and discussion!
> -john
>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-04 00:13:46

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Fri, 3 Sep 2004, john stultz wrote:

> > True it is the applications problem if it breaks but we have applications
> > for SGI systems that demand access to a time source with the least
> > operating system interference possible.
>
> Yep, totally fine with that. If application developers will take the
> risk that's fine. Personally, I'd like the kernel's timeofday function
> to be fast enough that developers don't feel the need to attempt these
> special case hacks, but there will always be folks who need to live on
> the edge.

Yeah. I tuned the gettimeofday function on IA64 to be faster than the
glibc's use of the ITC but there are still some voices that demand a memory mapped
timer.

> > Are there any examples where the flexibility is needed?
>
> I'm a fan of Barbie's law: "Math is hard, let's go shopping!", so
> forgive my inability to immediately see through the details. In dealing
> with both high and low res timesources (some with microsecond
> frequencies, some run in picosecond frequencies), the math can be tuned
> for each case. Low res time sources are mainly a multiply, and possibly
> a divide only if it has a non-integer nanosecond per cycle value. For
> high-res timesources, we have to divide, so we would want to aproximate
> the frequency and use the multiply and shift as you suggested earlier.

A division can always be avoided. This is the equation

nanoseconds = x / y * timer_value

where x/y = scaling factor

and one can always set Y = 2^py to substitute a shift instead of
doing division. Just scale x also appropriately by 2^px. This works to an
arbitrary accuracy if one has a long enough integer type. That is why I
suggested at least the intermediate use of 128bit. 64bit is fine for the
current cases but I would expect a need for 128bit to arise in the
future.

Math is sometimes simple and may simplify a lot of things.

> Maybe this is completely able to be generalized and I'm just not sharp
> enough to see it just yet, but it seems to me that having a timesource
> specific cyc2ns allows for further optimization then without.

I have not encountered a case that could not be handled by the above
mentioned transformation. Mathematically I would think I could say it is
impossible to find such a case.

> > Are you sure about that readability argument? A single
> > statement to calculate the delta is much more readable and verifyable
> > to be correct than a gazillion of small functions that (one hopes and
> > prays) all do the same correctly.
>
> Your point is valid, and in moving the arch specific code into arch
> independent code, I'm largely trying to reduce the tangle of functions.
> Having 3 hardware specific functions isn't all that crazy.

I would rather have only one (the read timer function) and that function
would be optional for non-well-behaved timer sources.

> Worse case if we cannot come to some agreement on this, there's nothing
> stopping architectures from keeping their own timeofday subsystem. I'm
> just tired of implementing the same feature or fixing the same bug over
> and over and over for each arch.

These are all good reasons for centralizing the functionality instead of
dispersing it in many functions.

2004-09-04 13:03:28

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Fri, Sep 03, 2004 at 01:11:58PM -0700, john stultz wrote:
> The timeofday_hook (should be timeofday_interrupt_hook, my bad) is
> called by the semi-periodic-irregular-interval(also known as "timer")
> interrupt. Its what does the housekeeping for all the timeofday code so
> we don't run into a counter overflow.
>
> monotonic_clock() is an accessor that returns the amount of time that
> has been accumulated since boot in nanoseconds.

Ok, but you need different low level drivers for those. The TSC is not
stable enough as a long term time source, but it is best&fastest for
the offset calculation between timer interrupts.

Or you would need to make the single driver handle both TSC
and HPET/PIT, which would probably defeat the extensibility of
the new architecture. Even in that case you would need different
read() calls.


> >
> > Hmm, I am missing something here, but how do you handle
> > the case where the timer interrupt uses HPET, but the offset
> > from last timer interrupt is determined using TSC. But
> > some machines don't have a reliable TSC, so it may need
> > to use a different "offset" source.
> >
> > I guess you would need two different drivers here?
>
> Not really, this is very doable. Time of day is now completely isolated
> from the timer interrupt code, so it doesn't care who calls the
> timeofday_interrupt_hook (HPET's interrupt or the i8253's interrupt
> could do it). Also it doesn't have to be regular or exact, just frequent
> enough that the timesource doesn't overflow.

I am not talking about the interrupt source here, just from which
time source xtime is refreshed.

>
> > > o vsyscalls/userspace gettimeofday()
> > > - Mark functions and data w/ __vsyscall attribute
> > > - Use linker to put all __vsyscall data in the same set of pages
> >
> > That won't work because the kernel needs to write to
> > these variables (e.g. to update the wall time from the interrupt
> > handler). So in general you need two different mappings, one
> > read only for user space and another writable one for the kernel.
>
> Well, I was thinking about this and I don't see why the kernel and
> userspace can't share the same data and functions for read only access?

For reading only kernel & user could probably share, although
you may run into some interesting problems on architecturs
that use truly different "segments" for user and kernel space
(like sparc64 or m68k or possibly s390)

> Why exactly does it need to be mapped twice? We only write to the data
> from kernel mode using different functions, so the same symbols should
> be usable.

The kernel needs to write too. And at least on x86 there
is no way to make a page read only for user space and writable
for kernel space on the same mapping.

> the simplest way to use ANY time source, no matter how strange.
> cycle_t's are basically magic cookies given by the timesource that can
> be manipulated generically and converted into nanoseconds. It is likely
> an over-virtualization, but I'd prefer to keep things general and
> flexible until more arch maintainers tell me its unnecessary.

Sounds like a redundancy with the cputime type the s390 people did.
I don't think we should have two ADTs who do such similar things.

-Andi

2004-09-06 06:10:26

by Ulrich Windl

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On 3 Sep 2004 at 11:07, Albert Cahalan wrote:

> On Fri, 2004-09-03 at 05:08, Ulrich Windl wrote:
> > On 3 Sep 2004 at 2:42, Albert Cahalan wrote:
>
> > > HZ not being HZ is the source of many foul problems.
> > >
> > > NTP should be able to correct for the error. For systems
> > > not running NTP, provide a fake NTP to make corrections
> > > based on the expected frequency error.
> > >
> > > Based on that, skip or double-up on the ticks to make
> > > them be exactly HZ over long periods of time.
> >
> > I think nobody wants a sawtooth-like timing. Time should
> > proceed as smoothly as possible.
>
> Of course, and all hardware should have ideal clocks.
> Now, back to the real world...

Albert,

I am serious.

>
> The kernel is broken if:
>
> a. HZ is not really HZ

HZ is an integer however.

> b. Timekeeping is via 2 unrelated clocks. (jiffies+offset)

What do you do if one clock lacks resolution, and the other clock lacks digits?
The interrupt clock may suck, but the TSC sucks even more.

> c. HZ is not an integer

HZ is HZ, but the true interrupt frequency is something completely different.

>
> So on box using only clock ticks, steer jiffies
> toward HZ using NTP (or default frequency error value).

Are you saying you are happy with a lcok that less than 1 HZ off? That would be --
in a extreme case -- about 86000 seconds off per day.

> On a box with high-res time, use that instead, and make
> jiffies follow it to satisfy various kernel-internal
> uses of jiffies.

Make jiffies follow variable CPU clock? Are you serious?

>
> Look, if HZ won't be HZ then you can just remove
> the "HZ" define from the kernel. It's useless.

It's not useless, it's a historic standard.

>
> I think we're all sick of the recent time-related
> bugs. I could go for ripping out all the fancy and

Yes.

> broken stuff that was added recently, replacing it
> with the simple Linux 2.4.xx or 2.2.xx code. Swiping

It depends on what you want. The kernel really needs a working framework for
nanoseconds; at least regarding the variables' precision.

> code from DragonflyBSD would be worth investigating.
>
>
>

Regards,
Ulrich



2004-09-06 06:29:57

by Ulrich Windl

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On 3 Sep 2004 at 22:26, Dominik Brodowski wrote:

...
> What about removing cpu_freq_khz you have in your patch, adding a per-CPU
> value, and just use the value of the boot CPU for the other CPUs if
> !CONFIG_DIFFERENT_CPU_SPEEDS or something like that?
...

I would not mention SMP at that stage, but most of the existing code on IA32
suumes that all CPUs run with the same frequency. I only heared that at least on
Alphas this is not true. Probably you'll need a per-CPU state regarding time. Most
likely resulting in the "local CPU's time" and a global concept of time that
should not be in contradiction with any CPU's time. That might mean that any
process always has to query the global time, involving extra overhead.

You mentioned it...

Regards,
Ulrich


2004-09-06 13:09:14

by Alan

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Llu, 2004-09-06 at 07:26, Ulrich Windl wrote:
> I would not mention SMP at that stage, but most of the existing code on IA32
> suumes that all CPUs run with the same frequency. I only heared that at least on

Except for the tsc functionality Linux 2.4.x/2.6.x does perfectly well
on split clock x86. Several people ran mixed 300/450Mhz systems with
dual celerons

2004-09-07 16:16:45

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Sat, 4 Sep 2004, Andi Kleen wrote:

> On Fri, Sep 03, 2004 at 01:11:58PM -0700, john stultz wrote:
> > The timeofday_hook (should be timeofday_interrupt_hook, my bad) is
> > called by the semi-periodic-irregular-interval(also known as "timer")
> > interrupt. Its what does the housekeeping for all the timeofday code so
> > we don't run into a counter overflow.
> >
> > monotonic_clock() is an accessor that returns the amount of time that
> > has been accumulated since boot in nanoseconds.
>
> Ok, but you need different low level drivers for those. The TSC is not
> stable enough as a long term time source, but it is best&fastest for
> the offset calculation between timer interrupts.

I thought the NTP daemon etc would even that out? ITC (TSC on IA64) is
used by default on IA64 for all time keeping purposes. The CPU has on chip
support for timer interrupt generation.

2004-09-07 16:20:54

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Mon, 6 Sep 2004, Ulrich Windl wrote:

> On 3 Sep 2004 at 22:26, Dominik Brodowski wrote:
>
> ...
> > What about removing cpu_freq_khz you have in your patch, adding a per-CPU
> > value, and just use the value of the boot CPU for the other CPUs if
> > !CONFIG_DIFFERENT_CPU_SPEEDS or something like that?
> ...
>
> I would not mention SMP at that stage, but most of the existing code on IA32
> suumes that all CPUs run with the same frequency. I only heared that at least on
> Alphas this is not true. Probably you'll need a per-CPU state regarding time. Most
> likely resulting in the "local CPU's time" and a global concept of time that
> should not be in contradiction with any CPU's time. That might mean that any
> process always has to query the global time, involving extra overhead.

CPUs in SGI Altix systems may also run at different frequencies.
Typically 2 CPUs have a common clock source. For example in a system of
16 CPUs, there may exist 8 clock sources that slightly diverge from one
another.

2004-09-07 18:36:46

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

Christoph Lameter wrote:
> On Sat, 4 Sep 2004, Andi Kleen wrote:
>
>
>>On Fri, Sep 03, 2004 at 01:11:58PM -0700, john stultz wrote:
>>
>>>The timeofday_hook (should be timeofday_interrupt_hook, my bad) is
>>>called by the semi-periodic-irregular-interval(also known as "timer")
>>>interrupt. Its what does the housekeeping for all the timeofday code so
>>>we don't run into a counter overflow.
>>>
>>>monotonic_clock() is an accessor that returns the amount of time that
>>>has been accumulated since boot in nanoseconds.
>>
>>Ok, but you need different low level drivers for those. The TSC is not
>>stable enough as a long term time source, but it is best&fastest for
>>the offset calculation between timer interrupts.
>
>
> I thought the NTP daemon etc would even that out? ITC (TSC on IA64) is
> used by default on IA64 for all time keeping purposes. The CPU has on chip
> support for timer interrupt generation.

Yes, and it is designed (read the "rock" is carefully choosen) for time keeping.
On the x86 the "rock" drives the pm timer and the PIT, but a somewhat less
stable "rock" drives the TSC.

Also, we don't "know" what rate the TSC is actully clocking so we must
"discover" it at boot time. This process either is inaccurate or slow (I think
we use ~ 50 ms these days which gives an error of ~10 TSC cycles on a 800MHZ
box). FWIW the problem here is the sync up with the I/O backplane to find the
start and ending of the measured time.

I suspect that the IA64 "tells" you what its clock rate is. Right?
>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-07 21:00:08

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Tue, 7 Sep 2004, George Anzinger wrote:

> Also, we don't "know" what rate the TSC is actully clocking so we must
> "discover" it at boot time. This process either is inaccurate or slow (I think
> we use ~ 50 ms these days which gives an error of ~10 TSC cycles on a 800MHZ
> box). FWIW the problem here is the sync up with the I/O backplane to find the
> start and ending of the measured time.
>
> I suspect that the IA64 "tells" you what its clock rate is. Right?

Not the CPU itself. There is a special hardware I/O interface called the
PAL/SAL that allows one to retrive that information. Doesnt the BIOS on
i386 allow you to get to that information?

2004-09-07 21:48:06

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

Christoph Lameter wrote:
> On Tue, 7 Sep 2004, George Anzinger wrote:
>
>
>>Also, we don't "know" what rate the TSC is actually clocking so we must
>>"discover" it at boot time. This process either is inaccurate or slow (I think
>>we use ~ 50 ms these days which gives an error of ~10 TSC cycles on a 800MHZ
>>box). FWIW the problem here is the sync up with the I/O backplane to find the
>>start and ending of the measured time.
>>
>>I suspect that the IA64 "tells" you what its clock rate is. Right?
>
>
> Not the CPU itself. There is a special hardware I/O interface called the
> PAL/SAL that allows one to retrieve that information. Doesn't the BIOS on
> i386 allow you to get to that information?

Not as far as I know. If it did we would kick out the calibrate loop and boot
some 50ms faster (always a good thing). Could be somebody else has better info
here...

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-08 06:29:22

by Ulrich Windl

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On 7 Sep 2004 at 11:24, George Anzinger wrote:

> Christoph Lameter wrote:
...
> > I thought the NTP daemon etc would even that out? ITC (TSC on IA64) is
> > used by default on IA64 for all time keeping purposes. The CPU has on chip
> > support for timer interrupt generation.
>
> Yes, and it is designed (read the "rock" is carefully choosen) for time keeping.

Believe me: NTP cannot repair a broken clock; it can only "tune" a clock that is
systematically off.

...

Regards,
Ulrich


2004-09-08 18:12:33

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Fri, 2004-09-03 at 17:02, George Anzinger wrote:
> > Again, monotonic_clock() and friends are NTP adjusted, so drift caused
> > by inaccurate calibration shouldn't be a problem the interval timer code
> > should need to worry about (outside of maybe adjusting its interval time
> > if its always arriving late/early). If possible the timesource
> > calibration code should be improved, but that's icing on the cake and
> > isn't critical.
> >
> Are you providing a way to predict what clock count provide a given time offset
> INCLUDING ntp? If so, cool. If not we need to get this conversion right. We
> will go into this more on your return.

Sorry, I'm not sure what you mean. Mind expanding on the idea while my
brain warms back up?

thanks
-john


2004-09-08 18:34:29

by john stultz

[permalink] [raw]
Subject: Re: [RFC] New Time of day proposal (updated 9/2/04)

On Sat, 2004-09-04 at 06:00, Andi Kleen wrote:
> On Fri, Sep 03, 2004 at 01:11:58PM -0700, john stultz wrote:
> > The timeofday_hook (should be timeofday_interrupt_hook, my bad) is
> > called by the semi-periodic-irregular-interval(also known as "timer")
> > interrupt. Its what does the housekeeping for all the timeofday code so
> > we don't run into a counter overflow.
> >
> > monotonic_clock() is an accessor that returns the amount of time that
> > has been accumulated since boot in nanoseconds.
>
> Ok, but you need different low level drivers for those. The TSC is not
> stable enough as a long term time source, but it is best&fastest for
> the offset calculation between timer interrupts.
>
> Or you would need to make the single driver handle both TSC
> and HPET/PIT, which would probably defeat the extensibility of
> the new architecture. Even in that case you would need different
> read() calls.

Well, actually with this design the TSC, when used, would be the long
term timesource. Interpolating between two timesources (timer ticks and
TSC) is what causes a great deal of the time trouble on i386. You get
into the issue of: "which one do you trust?" (some laptops change freq
w/o notification, while other desktops have buggy interval timers) which
takes us then into the problems lost-tick detection brought up.

Instead, I'm suggesting we make our best guess, and then select a
timesource to trust. Make it trivial for folks to work around bad
selections (say a buggy HPET shows up, have the user echo "acpi-pm" >
/sysfs/something/timesource, or use the "clock=" boot option), and make
it easy to add blacklist/sanity checking code to the timesource init
functions (so it does the right thing automatically in the next
release).

So basically, rather then confuse the core code, my plan is to leave it
up to the timesource code to ensure they provide good time.


> > > Hmm, I am missing something here, but how do you handle
> > > the case where the timer interrupt uses HPET, but the offset
> > > from last timer interrupt is determined using TSC. But
> > > some machines don't have a reliable TSC, so it may need
> > > to use a different "offset" source.
> > >
> > > I guess you would need two different drivers here?
> >
> > Not really, this is very doable. Time of day is now completely isolated
> > from the timer interrupt code, so it doesn't care who calls the
> > timeofday_interrupt_hook (HPET's interrupt or the i8253's interrupt
> > could do it). Also it doesn't have to be regular or exact, just frequent
> > enough that the timesource doesn't overflow.
>
> I am not talking about the interrupt source here, just from which
> time source xtime is refreshed.

Well, at the end of the day, the hope is xtime is removed completely and
replaced w/ get_lowres_timestamp()/get_lowres_timeofday(). However,
right now instead of making a patch with thousands of single line
replacements, I'm just updating xtime in timer_interrupt_hook().

I'm not sure if that answers your question, so let me know if I need to
clarify myself further.

> > > > o vsyscalls/userspace gettimeofday()
> > > > - Mark functions and data w/ __vsyscall attribute
> > > > - Use linker to put all __vsyscall data in the same set of pages
> > >
> > > That won't work because the kernel needs to write to
> > > these variables (e.g. to update the wall time from the interrupt
> > > handler). So in general you need two different mappings, one
> > > read only for user space and another writable one for the kernel.
> >
> > Well, I was thinking about this and I don't see why the kernel and
> > userspace can't share the same data and functions for read only access?
>
> For reading only kernel & user could probably share, although
> you may run into some interesting problems on architecturs
> that use truly different "segments" for user and kernel space
> (like sparc64 or m68k or possibly s390)
>
> > Why exactly does it need to be mapped twice? We only write to the data
> > from kernel mode using different functions, so the same symbols should
> > be usable.
>
> The kernel needs to write too. And at least on x86 there
> is no way to make a page read only for user space and writable
> for kernel space on the same mapping.

Ah! You're right. I was confusing page permission semantics w/ file
permissions (ie: permissions don't affect kernel mode, just user mode).

That does throw a wrench in the simplicity of my design. I *really*
would like to avoid the extended linker trickery that x86-64 has, but it
may be necessary in the end.

> > the simplest way to use ANY time source, no matter how strange.
> > cycle_t's are basically magic cookies given by the timesource that can
> > be manipulated generically and converted into nanoseconds. It is likely
> > an over-virtualization, but I'd prefer to keep things general and
> > flexible until more arch maintainers tell me its unnecessary.
>
> Sounds like a redundancy with the cputime type the s390 people did.
> I don't think we should have two ADTs who do such similar things.

Hmm. I should look into that. Thanks for the heads up.

And once again, thanks for the feedback.
-john

2004-09-09 00:17:32

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

john stultz wrote:
> On Fri, 2004-09-03 at 17:02, George Anzinger wrote:
>
>>>Again, monotonic_clock() and friends are NTP adjusted, so drift caused
>>>by inaccurate calibration shouldn't be a problem the interval timer code
>>>should need to worry about (outside of maybe adjusting its interval time
>>>if its always arriving late/early). If possible the timesource
>>>calibration code should be improved, but that's icing on the cake and
>>>isn't critical.
>>>
>>
>>Are you providing a way to predict what clock count provide a given time offset
>>INCLUDING ntp? If so, cool. If not we need to get this conversion right. We
>>will go into this more on your return.
>
>
> Sorry, I'm not sure what you mean. Mind expanding on the idea while my
> brain warms back up?

The issue is this: A user wants a timer to fire at exactly time B which is
several hours later than now (time A). We need to know how to measure this time
with the timer resources (not the clock as you are talking about it). Currently
we do something like delta_jiffies = timespec_to_jiffies(B - A) and set up a
jiffies timer to expire in delta_jiffies. At this time we "assume" in
timespec_to_jiffies() that we _know_ how long a jiffie is in terms of wall clock
nanoseconds. We also assume (possibly incorrectly) that this number is "good
enough" even with ntp messing things up. I think this means that we assume
that, on the average, we have the right conversion and that any drift will be a)
small and b) on the average 0 (or real close to it).

To my mind this is a few too many assumptions, but this is what we do today. If
you are proposing that ntp corrections take care of systemic constant drift
(which by the way, linux use to do WRT HZ=1024 division errors), we will need a
way of working this back into the timesepec_to_jiffies() code or we will not be
able to do timers with any real accuracy.

An alternative, and one that I prefer, but, I think, runs counter to your
proposal, is to derive the timer interrupts from the clock after the ntp
correction. Some hardware will have a problem with this (x86) but other
hardware is rather constrained to do things this way. The PPC counter interrupt
generator, for example, counts the same pulses as the clock counter.
>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-09 00:56:25

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wed, 2004-09-08 at 17:08, George Anzinger wrote:
> john stultz wrote:
> > On Fri, 2004-09-03 at 17:02, George Anzinger wrote:
> >
> >>>Again, monotonic_clock() and friends are NTP adjusted, so drift caused
> >>>by inaccurate calibration shouldn't be a problem the interval timer code
> >>>should need to worry about (outside of maybe adjusting its interval time
> >>>if its always arriving late/early). If possible the timesource
> >>>calibration code should be improved, but that's icing on the cake and
> >>>isn't critical.
> >>>
> >>
> >>Are you providing a way to predict what clock count provide a given time offset
> >>INCLUDING ntp? If so, cool. If not we need to get this conversion right. We
> >>will go into this more on your return.
> >
> >
> > Sorry, I'm not sure what you mean. Mind expanding on the idea while my
> > brain warms back up?
>
> The issue is this: A user wants a timer to fire at exactly time B which is
> several hours later than now (time A). We need to know how to measure this time
> with the timer resources (not the clock as you are talking about it). Currently
> we do something like delta_jiffies = timespec_to_jiffies(B - A) and set up a
> jiffies timer to expire in delta_jiffies. At this time we "assume" in
> timespec_to_jiffies() that we _know_ how long a jiffie is in terms of wall clock
> nanoseconds. We also assume (possibly incorrectly) that this number is "good
> enough" even with ntp messing things up. I think this means that we assume
> that, on the average, we have the right conversion and that any drift will be a)
> small and b) on the average 0 (or real close to it).

Why must we use jiffies to tell when a timer expires? Honestly I'd like
to see xtime and jiffies both disappear, but I'm not very familiar w/
the soft-timer code, so forgive me if I'm misunderstanding.

So instead of calculating delta_jiffies, just mark the timer to expire
at B. Then each interrupt, you use get_fast_timestamp() to decide if now
is greater then B. If so, expire it.

Then we can look at being able to program timer interrupts to occur as
close as possible to the next soft-timer's expiration time.

-john

2004-09-09 03:18:16

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wed, 8 Sep 2004, john stultz wrote:

> Why must we use jiffies to tell when a timer expires? Honestly I'd like
> to see xtime and jiffies both disappear, but I'm not very familiar w/
> the soft-timer code, so forgive me if I'm misunderstanding.
>
> So instead of calculating delta_jiffies, just mark the timer to expire
> at B. Then each interrupt, you use get_fast_timestamp() to decide if now
> is greater then B. If so, expire it.
>
> Then we can look at being able to program timer interrupts to occur as
> close as possible to the next soft-timer's expiration time.

Would it not be best to have some means to determine the time in
nanoseconds since the epoch and then use that for long waits? That cuts
out the timer dependencies. Could we have a generic interface for kernel
time that simply provides nanoseconds for everything and hides all the
details of time scaling?

Maybe as simple as

u64 now(void);

?

One can then calculate the wait time in nanoseconds which may then be
passed to another timer routine which may take the appropriate action
depending on the time frame involved. I.e. for a few hundred nsecs do busy
wait. If longer reschedule and if even longer queue the task on some
event queue that is handled by the timer tick or something else.

2004-09-09 03:37:19

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wed, 2004-09-08 at 20:14, Christoph Lameter wrote:
> On Wed, 8 Sep 2004, john stultz wrote:
>
> > Why must we use jiffies to tell when a timer expires? Honestly I'd like
> > to see xtime and jiffies both disappear, but I'm not very familiar w/
> > the soft-timer code, so forgive me if I'm misunderstanding.
> >
> > So instead of calculating delta_jiffies, just mark the timer to expire
> > at B. Then each interrupt, you use get_fast_timestamp() to decide if now
> > is greater then B. If so, expire it.
> >
> > Then we can look at being able to program timer interrupts to occur as
> > close as possible to the next soft-timer's expiration time.
>
> Would it not be best to have some means to determine the time in
> nanoseconds since the epoch and then use that for long waits?

The proposal has get_lowres_timeofday() which does just that, although
for timer stuff, I would guess monotonic_clock() or
get_lowres_timestamp(), which returns the number of (ntp adjusted)
nanoseconds the system has been running, would be better.


> One can then calculate the wait time in nanoseconds which may then be
> passed to another timer routine which may take the appropriate action
> depending on the time frame involved. I.e. for a few hundred nsecs do busy
> wait. If longer reschedule and if even longer queue the task on some
> event queue that is handled by the timer tick or something else.

I'm not sure about the busy wait bit, but yes, at some point I'd like to
see the timer subsystem use the timeofday subsystem instead of jiffies
for its timekeeping.

thanks
-john

2004-09-09 04:37:45

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

john stultz wrote:
> On Wed, 2004-09-08 at 20:14, Christoph Lameter wrote:
>
>>On Wed, 8 Sep 2004, john stultz wrote:
>>
>>
>>>Why must we use jiffies to tell when a timer expires? Honestly I'd like
>>>to see xtime and jiffies both disappear, but I'm not very familiar w/
>>>the soft-timer code, so forgive me if I'm misunderstanding.
>>>
>>>So instead of calculating delta_jiffies, just mark the timer to expire
>>>at B. Then each interrupt, you use get_fast_timestamp() to decide if now
>>>is greater then B. If so, expire it.
>>>
>>>Then we can look at being able to program timer interrupts to occur as
>>>close as possible to the next soft-timer's expiration time.
>>
>>Would it not be best to have some means to determine the time in
>>nanoseconds since the epoch and then use that for long waits?
>
>
> The proposal has get_lowres_timeofday() which does just that, although
> for timer stuff, I would guess monotonic_clock() or
> get_lowres_timestamp(), which returns the number of (ntp adjusted)
> nanoseconds the system has been running, would be better.
>
>
>
>>One can then calculate the wait time in nanoseconds which may then be
>>passed to another timer routine which may take the appropriate action
>>depending on the time frame involved. I.e. for a few hundred nsecs do busy
>>wait. If longer reschedule and if even longer queue the task on some
>>event queue that is handled by the timer tick or something else.
>
>
> I'm not sure about the busy wait bit, but yes, at some point I'd like to
> see the timer subsystem use the timeofday subsystem instead of jiffies
> for its timekeeping.
>
Yes, I think this is the way we want to go. Here are the "rubs" I see:

a.) resolution. If you don't put a limit on this you will invite timer storms.
Currently, by useing 1/HZ resolution, all timer "line up" on ticks and reduce
the interrupt overhead that would occure if we actually tried to give "exactly"
what was asked for. This is a matter of math and can be handled (assuming we
resist the urge to go shopping :))

b.) For those platforms with repeating timers that can not "hit" our desired
resolution (i.e. 1/HZ) there is an additional overhead to program the timer each
interrupt. In principle we do this in the HRT patch, but there we only do it
for high resolution timers, which we assume are rather rare. It is good to have
a low res timer that is also accurate. Even better if we can keep the overhead
low by not having to reprogram a timer each tick.

In the ideal world we would have a hardware repeating timer that is reasonably
accurate (we might want to correct it every second or so) to generate the low
res timing interrupts and a high res timer that we can program quickly for high
resolution interrupts.

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-09 06:38:32

by Jesse Barnes

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wednesday, September 8, 2004 9:31 pm, George Anzinger wrote:
> a.) resolution. If you don't put a limit on this you will invite timer
> storms. Currently, by useing 1/HZ resolution, all timer "line up" on ticks
> and reduce the interrupt overhead that would occure if we actually tried to
> give "exactly" what was asked for. This is a matter of math and can be
> handled (assuming we resist the urge to go shopping :))

This can be bad though if lots of CPUs hit it at the same time or nearly so if
they're all trying to write the same cacheline or two.

Jesse

2004-09-09 08:13:37

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

Jesse Barnes wrote:
> On Wednesday, September 8, 2004 9:31 pm, George Anzinger wrote:
>
>>a.) resolution. If you don't put a limit on this you will invite timer
>>storms. Currently, by useing 1/HZ resolution, all timer "line up" on ticks
>>and reduce the interrupt overhead that would occure if we actually tried to
>>give "exactly" what was asked for. This is a matter of math and can be
>>handled (assuming we resist the urge to go shopping :))
>
>
> This can be bad though if lots of CPUs hit it at the same time or nearly so if
> they're all trying to write the same cacheline or two.

I think that most of the SMP issues your talking about have gone away with 2.6
where we have seperate timer lists for each cpu.


--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-09 19:16:00

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wed, 2004-09-08 at 21:31, George Anzinger wrote:
> john stultz wrote:
> > I'm not sure about the busy wait bit, but yes, at some point I'd like to
> > see the timer subsystem use the timeofday subsystem instead of jiffies
> > for its timekeeping.
> >
> Yes, I think this is the way we want to go. Here are the "rubs" I see:
>
> a.) resolution. If you don't put a limit on this you will invite timer storms.
> Currently, by useing 1/HZ resolution, all timer "line up" on ticks and reduce
> the interrupt overhead that would occure if we actually tried to give "exactly"
> what was asked for. This is a matter of math and can be handled (assuming we
> resist the urge to go shopping :))

Well, basically we have the same resolution as we do today. Right now
you specify the time in jiffies which has different resolutions
depending on configuration, so I'm not sure I see the problem. I mean, I
guess someone could be off put when they ask for a 1 ns expiration time
and it returns in 1 ms, but I thought timers followed sleep() semantics,
so it shouldn't be that much of an issue.

> b.) For those platforms with repeating timers that can not "hit" our desired
> resolution (i.e. 1/HZ) there is an additional overhead to program the timer each
> interrupt. In principle we do this in the HRT patch, but there we only do it
> for high resolution timers, which we assume are rather rare. It is good to have
> a low res timer that is also accurate. Even better if we can keep the overhead
> low by not having to reprogram a timer each tick.
>
> In the ideal world we would have a hardware repeating timer that is reasonably
> accurate (we might want to correct it every second or so) to generate the low
> res timing interrupts and a high res timer that we can program quickly for high
> resolution interrupts.

Even in this case, inaccurate hardware timers can only cause a single
tick jitter. The current problem we're seeing is that the timer
inaccuracy accumulates, so if we're 1% off, a sleep(600) returns 6
seconds late. Using the time source instead of jiffies for expiration
would avoid the accumulation, so we'd be at most a single tick late.

Or is the problem more subtle?

thanks
-john


2004-09-09 20:54:24

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

john stultz wrote:
> On Wed, 2004-09-08 at 21:31, George Anzinger wrote:
>
>>john stultz wrote:
>>
>>>I'm not sure about the busy wait bit, but yes, at some point I'd like to
>>>see the timer subsystem use the timeofday subsystem instead of jiffies
>>>for its timekeeping.
>>>
>>
>>Yes, I think this is the way we want to go. Here are the "rubs" I see:
>>
>>a.) resolution. If you don't put a limit on this you will invite timer storms.
>> Currently, by using 1/HZ resolution, all timer "line up" on ticks and reduce
>>the interrupt overhead that would occur if we actually tried to give "exactly"
>>what was asked for. This is a matter of math and can be handled (assuming we
>>resist the urge to go shopping :))
>
>
> Well, basically we have the same resolution as we do today. Right now
> you specify the time in jiffies which has different resolutions
> depending on configuration, so I'm not sure I see the problem. I mean, I
> guess someone could be off put when they ask for a 1 ns expiration time
> and it returns in 1 ms, but I thought timers followed sleep() semantics,
> so it shouldn't be that much of an issue.

What we get down to here is setting up a tick source that depends on the clock.
This issue of resolution, as I said, is easy to solve.
>
>
>>b.) For those platforms with repeating timers that can not "hit" our desired
>>resolution (i.e. 1/HZ) there is an additional overhead to program the timer each
>>interrupt. In principle we do this in the HRT patch, but there we only do it
>>for high resolution timers, which we assume are rather rare. It is good to have
>>a low res timer that is also accurate. Even better if we can keep the overhead
>>low by not having to reprogram a timer each tick.
>>
>>In the ideal world we would have a hardware repeating timer that is reasonably
>>accurate (we might want to correct it every second or so) to generate the low
>>res timing interrupts and a high res timer that we can program quickly for high
>>resolution interrupts.
>
>
> Even in this case, inaccurate hardware timers can only cause a single
> tick jitter. The current problem we're seeing is that the timer
> inaccuracy accumulates, so if we're 1% off, a sleep(600) returns 6
> seconds late. Using the time source instead of jiffies for expiration
> would avoid the accumulation, so we'd be at most a single tick late.
>
> Or is the problem more subtle?

No, I think you have it. The x86 problem, in this regard, is that the "clock"
xtal is only connected to the pm timer and the PIT, both of which are slow to
access (not to mention all the hardware errors). The x86 TSC is driven by
another xtal which, to the best of my knowledge, is not a "clock" grade xtal.
We also don't know what the actual TSC frequency is, thus the calibration loop.

What we get down to, and what others have said, is that the TSC is great for
interpolation but bad as a long term clock.

I suspect that a time solution that required us to reprogram the PIT each tick
would not fly, just because of the access overhead. One that required
reprogramming the PIT every second or so _might_ be ok.

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-12 17:16:04

by Albert Cahalan

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Mon, 2004-09-06 at 02:08, Ulrich Windl wrote:
> On 3 Sep 2004 at 11:07, Albert Cahalan wrote:

> > The kernel is broken if:
> >
> > a. HZ is not really HZ
>
> HZ is an integer however.

Yes, and this is the rate at which jiffies must tick.
At no time should jiffies be more than one tick away
from the value it would have according to HZ.

(where "tick" is the slower of the two)

> > b. Timekeeping is via 2 unrelated clocks. (jiffies+offset)
>
> What do you do if one clock lacks resolution, and the other clock lacks digits?
> The interrupt clock may suck, but the TSC sucks even more.

You choose one. If you have sucky hardware, that's the
best that can be done.

If by "lacks digits" you mean something like the 24-bit
cycle counter on the Alpha, faking a 64-bit counter is
not terribly hard. I've seen it done for the SHARC DSP.
You'll just need an interrupt before roll-over.

> > c. HZ is not an integer
>
> HZ is HZ, but the true interrupt frequency is something completely different.

I could run the interrupts at 4152.0625 Hz and still
produce a jiffies tick at 1000.0000 Hz over the long
term. At any given point in time, jiffies won't be
off by more than 1 tick.

I could run the interrupts at 666 Hz too, and not be
off by more than 2 jiffies ticks.

Having the interrupt frequency be _almost_ equal to
the HZ value is kind of bad, because it tempts people to
equate them. When the frequencies are really different,
the errors and solutions become more obvious.

> > So on box using only clock ticks, steer jiffies
> > toward HZ using NTP (or default frequency error value).
>
> Are you saying you are happy with a lcok that less than 1 HZ off? That would be --
> in a extreme case -- about 86000 seconds off per day.

Less than 1/HZ off, yes. That's 1 millisecond max error
over any time period from nanoseconds to years.

> > On a box with high-res time, use that instead, and make
> > jiffies follow it to satisfy various kernel-internal
> > uses of jiffies.
>
> Make jiffies follow variable CPU clock? Are you serious?

You can not use a variable clock. End of story.

Well, if you do, you pay the price. (time goes crazy every
now and then, until a very-forgiving ntpd adjusts things)
This would not be a sane default.

If you want high-res time on a PC, your choices:

a. read the PIT registers (slow)
b. have a PC with the HPET
c. have a PC with a stable TSC
d. some ACPI PM thing?

If your hardware isn't suited for high-res time,
nothing will save you. Linux got along OK for a
decade without high-res time at all; you'll live.


2004-09-13 21:35:31

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

Hmm... I am still thinking that making xtime an u64 could help simplify a
lot of things. Together with some other parameters it would allow a
tickless system where there is no need to update xtime regularly. It would
also simplify gettimeofday and jiffies calculation.

In order to calculate the current time, one then has to use a counter that
started running at a certain point. The counter can be used to calculate
an offset to xtime after scaling it. It is then only necessary to update
xtime if the value calculated by xtime and the counter deviates
significantly from the real value. Then time keeping may be corrected
(after xtime has been updated, we might have handed out time before
this event and we cannot change time keeping without a new starting point
for the calculation of the offset!) by

1. Adding a number to xtime resulting in a time jump forward. This
needs to be a small number.
2. Slowing down the clock by reducing the scaling (time cannot
jump backward. This is the only way to correct the clock if its
running too fast).
3. Speeding up the clock by increasing the scaling. This would presumably
be done if the time jumps become too large.

All of this is based a lot on the work done for the time interpolator.

/****************************************************************
* A concept for a new timer module
*
***************************************************************/

struct time_source_t {
u8 type;
u8 shift;
u32 multiply;
void *address;
};

#define TIME_SOURCE_CPU 0
#define TIME_SOURCE_MMIO32 1
#define TIME_SOURCE_MMIO64 2
#define TIME_SOURCE_FUNCTION 15

struct seqlock_t xtime_lock; /* Protects access to xtime time_source and time_source_last, may be used by
time retrieval routines if 64 bit atomic operations are not available */
u64 xtime;
struct time_source_t *time_source;
u64 time_source_last;

u64 inline time_source_get(struct time_source_t *s) {
switch (s->type) {
case TIME_SOURCE_CPU : return cycles();
case TIME_SOURCE_MMIO32 : return readl(s->address);
case TIME_SOURCE_MMIO64 : return readq(s->address);
case TIME_SOURCE_FUNCTION: {
u64 (*x)(void);
x = s->address;
return x();
}
default : return 0;
}
}

u64 time_source_to_ns(u64 x) {
return ((x-time_source_last)*time_source->multiply) >> time_source->shift;
}

inline u64 now(void) {
u64 t;
unsigned long seq;

do {
seq = read_seqbegin(&xtime_lock);
t = xtime + time_source_to_ns(time_source_get(time_source));
} while (unlikely(seq_retry(&xtime_lock,seq)));
return t;
}

/* Time adjustment (only possible forward, to go backwards adjust time_source->multiply and wait ...) */
void time_adjust_skip(u64 ns) {
u64 c;

write_seqlock(&xtime_lock);
c = time_source_get(time_source);
xtime += ns + time_source_to_ns(c);
time_source_last = c;
write_sequnlock(&xtime_lock);
}

void time_adjust_slower(void) {
u64 c;

write_seqlock(&xtime_lock);
c = time_source_get(time_source);
xtime += time_source_to_ns(c);
time_source_last = c;
time_source->multiply--;
write_sequnlock(&xtime_lock);
}

void time_adjust_faster(void) {
u64 c;

write_seqlock(&xtime_lock);
c = time_source_get(time_source);
xtime += time_source_to_ns(c);
time_source_last = c;
time_source->multiply++;
write_sequnlock(&xtime_unlock);
}

/* Switch to another time source */
void new_time_source(struct time_source_t *s) {
u64 c_old;
u64 c_new;

write_seqlock(&xtime_lock);
c_old = time_source_get(time_source);
c_new = time_source_get(s);

xtime += time_source_to_ns(c_old);

time_source_last = c_new;
time_source = s;

write_sequnlock(&xtime_lock);
}

struct time_source_t *make_time_source(u64 freq, int t,void *a, int bits) {
struct time_source_t *s = kmalloc(sizeof(struct time_source_t));

s->shift = 64 - bits;
s->multiply = (NSEC_PER_SEC << s->shift) / freq;
s->address = a;
s->type = t;
}

/* Values in use in the kernel and how they may be derived from xtime */

#define jiffies (now()/1000000)
#define seconds(x) ((x)/1000000000)
#define nanoseconds_within_second(x) ((x)%1000000000)
#define microseconds_within_second(x) (nanoseconds_within_second(x) / 1000)

u32 time(void) {
return seconds(now());
}

void gettimeofday(struct timeval *p) {
u64 t=now();

p->tv_sec = seconds(t);
p->tv_usec = microseconds_within_second(t);
}

void clock_gettime(int clock, struct timespec *p) {
u64 t=now();

p->tv_sec = seconds(t);
p->tv_nsec = nanoseconds_within_second(t);
}

2004-09-13 22:37:08

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Mon, 2004-09-13 at 14:29, Christoph Lameter wrote:
> Hmm... I am still thinking that making xtime an u64 could help simplify a
> lot of things. Together with some other parameters it would allow a
> tickless system where there is no need to update xtime regularly. It would
> also simplify gettimeofday and jiffies calculation.

Looks interesting! Regardless of our earlier butting heads, both this
and my earlier proposal are indeed very similar.

My first pass comments:
o How would settimeofday work? Would it be forced to use
time_adjust_XYZ? Do we still need the wall_to_monotonic uglyness to
manage posix CLOCK_MONOTONIC timers?

My suggestion: split xtime into two values (system_time,
wall_time_offset) like in my proposal. This allows you to keep a true
monotonically increasing time base (system_time) and wall_time_offset
can then be used to adjust to wall time. settimeofday() would only
change wall_time_offset.

o time_source_to_ns() needs some method of masking the subtraction in
order to handle timesources that overflow under 64 bits.

o How would your time_adjust_XYZ interfaces merge w/ adjtimex and the
NTP wall_time_update()/second_overflow() code? I spent quite a bit of
time on my ntp.c implementation and I'm not 100% confident in it. Could
you explain in further detail how yours would work?

o My only other nit is that you use a different name then xtime. If
you're changing the type, you might as well use a meaningful name.

Unfortunately I'm off working on other things for the next two weeks,
but once that is over I look forward to trying to integrate some of your
design ideas into my own. Keeping my timeofday and ntp core code, but
using your timesource interface looks to be quite promising.

thanks
-john

2004-09-13 22:52:13

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Mon, 13 Sep 2004, john stultz wrote:
> My first pass comments:
> o How would settimeofday work? Would it be forced to use
> time_adjust_XYZ? Do we still need the wall_to_monotonic uglyness to
> manage posix CLOCK_MONOTONIC timers?

You could simply set xtime and time_source_last = time_source_get(time_source);

I think the existing wall_to_monotonic stuff is okay. Monotonic time is
rarely used and thus its best to keep xtime on CLOCK_REALTIME to simplify
things for general time access.

> My suggestion: split xtime into two values (system_time,
> wall_time_offset) like in my proposal. This allows you to keep a true
> monotonically increasing time base (system_time) and wall_time_offset
> can then be used to adjust to wall time. settimeofday() would only
> change wall_time_offset.

xtime is also truly monotonic increasing in my proposal. Setting the time
is an extraordinary event which I would not count against monotoneity. The
clock corrections that I proposed all preserve the monotoneity of xtime.

> o time_source_to_ns() needs some method of masking the subtraction in
> order to handle timesources that overflow under 64 bits.

Yes. My interpolator logic limits begin to show....

> o How would your time_adjust_XYZ interfaces merge w/ adjtimex and the
> NTP wall_time_update()/second_overflow() code? I spent quite a bit of
> time on my ntp.c implementation and I'm not 100% confident in it. Could
> you explain in further detail how yours would work?

I have no idea how adjtimex works. Would have to look at that in detail
first. The timer interpolator stuff also does time adjustments and I am
sure that something like that is in my mind. The time interpolator only
ever corrects the clock forward. Otherwise it adjusts the scaling if the
clock is running too fast. Thats also what I provided in the proposal.

> o My only other nit is that you use a different name then xtime. If
> you're changing the type, you might as well use a meaningful name.

xtime is the traditional name. Maybe renaming it to intentionally break
old code would be good but I thought it would be good to understand
the approach.

> Unfortunately I'm off working on other things for the next two weeks,
> but once that is over I look forward to trying to integrate some of your
> design ideas into my own. Keeping my timeofday and ntp core code, but
> using your timesource interface looks to be quite promising.

That sounds very promising. I may also have to make time for
different projects but I will try to do as much as I can.

2004-09-14 06:56:55

by Ulrich Windl

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On 13 Sep 2004 at 15:45, Christoph Lameter wrote:

> > o My only other nit is that you use a different name then xtime. If
> > you're changing the type, you might as well use a meaningful name.
>
> xtime is the traditional name. Maybe renaming it to intentionally break
> old code would be good but I thought it would be good to understand
> the approach.
>

I think direct access to xtime should vanish. Provide an inlinable function to get
the current time coarsely and quickly (that's what reading xtime is).

Ulrich


2004-09-14 18:06:11

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

Here is a rev on the concept of a time source subsystem taking into
account some of the feedback:

1. add mask
2. rename time source so that its different from xtime
3. provide set_time functionality
4. add monotonic time support
5. Give some example of time sources including a function time source.

note that the original concept already includes the ability to switch time
sources. I would think that this would be necessary especially if the CPU
time sources frequency switches or if the system goes into sleep mode.

/****************************************************************
* A concept for a new timer module
* V2
*
* Christoph Lameter, September 14, 2003
***************************************************************/

struct time_source_t {
u8 type;
u8 shift;
u32 multiply;
void *address;
u64 mask;
};

#define TIME_SOURCE_CPU 0
#define TIME_SOURCE_MMIO32 1
#define TIME_SOURCE_MMIO64 2
#define TIME_SOURCE_FUNCTION 15

struct seqlock_t time_lock; /* Protects access to time_base time_source time_mono_offset and time_source_at_base */
u64 time_base;
u64 time_mono_offset;
struct time_source_t *time_source;
u64 time_source_at_base; /* time source value at time time_base */

u64 inline time_source_get(struct time_source_t *s) {
switch (s->type) {
case TIME_SOURCE_CPU : return cycles();
case TIME_SOURCE_MMIO32 : return readl(s->address);
case TIME_SOURCE_MMIO64 : return readq(s->address);
case TIME_SOURCE_FUNCTION: {
u64 (*x)(void);
x = s->address;
return x();
}
}

u64 time_source_to_ns(u64 x) {
return (((x-time_source_at_base) & time_source->mask)*time_source->multiply) >> time_source->shift;
}

inline u64 now(void) {
u64 t;
unsigned long seq;

do {
seq = read_seqbegin(&time_lock);
t = time_base + time_source_to_ns(time_source_get(time_source));
} while (unlikely(seq_retry(&time_lock,seq)));
return t;
}

#define now_mono() (now()+time_mono_offset)

/* Time adjustment (only possible forward, to go backwards adjust time_source->multiply and wait ...) */
void time_adjust_skip(u64 ns) {
u64 c;

write_seqlock(&time_lock);
c = time_source_get(time_source);
time_base += ns + time_source_to_ns(c);
time_source_at_base = c;
write_sequnlock(&time_lock);
}

void time_adjust_slower(void) {
u64 c;

write_seqlock(&time_lock);
c = time_source_get(time_source);
time_base += time_source_to_ns(c);
time_source_at_base = c;
time_source->multiply--;
write_sequnlock(&time_lock);
}

void time_adjust_faster(void) {
u64 c;

write_seqlock(&time_lock);
c = time_source_get(time_source);
time_base += time_source_to_ns(c);
time_source_at_base = c;
time_source->multiply++;
write_sequnlock(&time_unlock);
}

/* Switch to another time source */
void new_time_source(struct time_source_t *s) {
u64 c_old;
u64 c_new;

write_seqlock(&time_lock);
c_old = time_source_get(time_source);
c_new = time_source_get(s);

time_base += time_source_to_ns(c_old);

time_source_at_base = c_new;
time_source = s;

write_sequnlock(&time_lock);
}

struct time_source_t *make_time_source(u64 freq, int t,void *a, int bits) {
struct time_source_t *s = kmalloc(sizeof(struct time_source_t));

s->shift = 64 - bits;
s->multiply = (NSEC_PER_SEC << s->shift) / freq;
s->address = a;
s->type = t;
s->mask = 1 << bits -1;
}

void time_set(u64 ns) {
u64 c;

write_seqlock(&time_lock);
c = time_source_get(time_source);
/* Adjust monotonic time base */
time_mono_offset += time_base + time_source_to_ns(c) - ns;
/* Setup new time base */
time_base = ns;
time_source_at_base = c;
write_sequnlock(&time_lock);
}

void time_init(struct time_source_t *s) {
{
write_seqlock(&time_lock);
time_base = 0;
time_mono_offset = 0;
time_source_at_base = time_source_get(s);
time_source = s;
write_sequnlock(&time_lock);
}

/* Values in use in the kernel and how they may be derived from xtime */
#define jiffies (now()/1000000)
#define seconds(x) ((x)/1000000000)
#define nanoseconds_within_second(x) ((x)%1000000000)
#define microseconds_within_second(x) (nanoseconds_within_second(x) / 1000)

u32 time(void) {
return seconds(now());
}

void gettimeofday(struct timeval *p) {
u64 t=now();

p->tv_sec = seconds(t);
p->tv_usec = microseconds_within_second(t);
}

void clock_gettime(int clock, struct timespec *p) {
u64 t=now();

p->tv_sec = seconds(t);
p->tv_nsec = nanoseconds_within_second(t);
}

/* Exampe of a CPU time time source */

make_time_source(1200000000, TIME_SOURCE_CPU, NULL, 44);

/* A memory based time source at 20Mhz 55 bits wide */

make_time_source(20000000, TIME_SOURCE_MMIO64, &my_timer_address, 55);

/* Example of handling a difficult time source. In SMP systems the CPU time sources are notoriously difficult to
* sync. If we have such a problem then insure at least sure that the time source never goes backward.
*/

u64 time_source_last;

u64 get_cpu_time_filtered() {
u64 x;
u64 l;

do {
l = time_source_last;
x = cycles();
if (x<l) return l;
/* the cmpxchg is going to hurt in terms of scalability ! */
} while (cmpxchg(&time_source_last, l, x) != l);
return x;
}

Generate the time_source with

make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44);

2004-09-15 01:01:58

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

Christoph Lameter wrote:
> Here is a rev on the concept of a time source subsystem taking into
> account some of the feedback:
>
> 1. add mask
> 2. rename time source so that its different from xtime
> 3. provide set_time functionality
> 4. add monotonic time support
> 5. Give some example of time sources including a function time source.
>
> note that the original concept already includes the ability to switch time
> sources. I would think that this would be necessary especially if the CPU
> time sources frequency switches or if the system goes into sleep mode.
>
> /****************************************************************
> * A concept for a new timer module
> * V2
> *
> * Christoph Lameter, September 14, 2003
> ***************************************************************/
>
> struct time_source_t {
> u8 type;
> u8 shift;
> u32 multiply;
> void *address;
> u64 mask;
> };
>
> #define TIME_SOURCE_CPU 0
> #define TIME_SOURCE_MMIO32 1
> #define TIME_SOURCE_MMIO64 2
> #define TIME_SOURCE_FUNCTION 15
>
> struct seqlock_t time_lock; /* Protects access to time_base time_source time_mono_offset and time_source_at_base */
> u64 time_base;
> u64 time_mono_offset;
> struct time_source_t *time_source;
> u64 time_source_at_base; /* time source value at time time_base */
>
> u64 inline time_source_get(struct time_source_t *s) {
> switch (s->type) {
> case TIME_SOURCE_CPU : return cycles();
> case TIME_SOURCE_MMIO32 : return readl(s->address);
> case TIME_SOURCE_MMIO64 : return readq(s->address);
> case TIME_SOURCE_FUNCTION: {
> u64 (*x)(void);
> x = s->address;
> return x();
> }
> }
>
> u64 time_source_to_ns(u64 x) {
> return (((x-time_source_at_base) & time_source->mask)*time_source->multiply) >> time_source->shift;
> }

This seems to assume that the time souce is incrementing. On some archs, I
think, it decrements...
>
> inline u64 now(void) {
> u64 t;
> unsigned long seq;
>
> do {
> seq = read_seqbegin(&time_lock);
> t = time_base + time_source_to_ns(time_source_get(time_source));
> } while (unlikely(seq_retry(&time_lock,seq)));
> return t;
> }
>
> #define now_mono() (now()+time_mono_offset)
>
> /* Time adjustment (only possible forward, to go backwards adjust time_source->multiply and wait ...) */
> void time_adjust_skip(u64 ns) {
> u64 c;
>
> write_seqlock(&time_lock);
> c = time_source_get(time_source);
> time_base += ns + time_source_to_ns(c);
> time_source_at_base = c;
> write_sequnlock(&time_lock);
> }

So we would do "time_adjust_skip(0);" to update time_source_at_base?
>
> void time_adjust_slower(void) {
> u64 c;
>
> write_seqlock(&time_lock);
> c = time_source_get(time_source);
> time_base += time_source_to_ns(c);
> time_source_at_base = c;
> time_source->multiply--;
> write_sequnlock(&time_lock);
> }
>
If we do a "good" job of choosing <multiply> and <shift> this will be a "very"
small change. Might be better to pass in a "delta" to change it by. Then you
would only need one function.

> void time_adjust_faster(void) {
> u64 c;
>
> write_seqlock(&time_lock);
> c = time_source_get(time_source);
> time_base += time_source_to_ns(c);
> time_source_at_base = c;
> time_source->multiply++;
> write_sequnlock(&time_unlock);
> }
>
> /* Switch to another time source */
> void new_time_source(struct time_source_t *s) {
> u64 c_old;
> u64 c_new;
>
> write_seqlock(&time_lock);
> c_old = time_source_get(time_source);
> c_new = time_source_get(s);
>
> time_base += time_source_to_ns(c_old);
>
> time_source_at_base = c_new;
> time_source = s;
>
> write_sequnlock(&time_lock);
> }
>
> struct time_source_t *make_time_source(u64 freq, int t,void *a, int bits) {
> struct time_source_t *s = kmalloc(sizeof(struct time_source_t));
>
> s->shift = 64 - bits;
> s->multiply = (NSEC_PER_SEC << s->shift) / freq;
> s->address = a;
> s->type = t;
> s->mask = 1 << bits -1;
> }

The mask and the shift value are not really related. The mask is a function of
the number of bits the hardware provides. The shift is related to the value of
freq. Me thinks they should not be tied together here.
>
> void time_set(u64 ns) {
> u64 c;
>
> write_seqlock(&time_lock);
> c = time_source_get(time_source);
> /* Adjust monotonic time base */
> time_mono_offset += time_base + time_source_to_ns(c) - ns;
> /* Setup new time base */
> time_base = ns;
> time_source_at_base = c;
> write_sequnlock(&time_lock);
> }
>
> void time_init(struct time_source_t *s) {
> {
> write_seqlock(&time_lock);
> time_base = 0;
> time_mono_offset = 0;
> time_source_at_base = time_source_get(s);
> time_source = s;
> write_sequnlock(&time_lock);
> }
>
> /* Values in use in the kernel and how they may be derived from xtime */
> #define jiffies (now()/1000000)

This assumes HZ=1000. (Assuming there is an HZ any more, that is.) Not all
archs will want this value. Possibly:
#define jiffies ((now() * HZ) / 1000000000)

> #define seconds(x) ((x)/1000000000)
> #define nanoseconds_within_second(x) ((x)%1000000000)
> #define microseconds_within_second(x) (nanoseconds_within_second(x) / 1000
>
> u32 time(void) {
> return seconds(now());
> }
>
> void gettimeofday(struct timeval *p) {
> u64 t=now();
>
> p->tv_sec = seconds(t);
> p->tv_usec = microseconds_within_second(t);
> }
>
> void clock_gettime(int clock, struct timespec *p) {
> u64 t=now();
>
> p->tv_sec = seconds(t);
> p->tv_nsec = nanoseconds_within_second(t);
> }
>
> /* Exampe of a CPU time time source */
>
> make_time_source(1200000000, TIME_SOURCE_CPU, NULL, 44);
>
> /* A memory based time source at 20Mhz 55 bits wide */
>
> make_time_source(20000000, TIME_SOURCE_MMIO64, &my_timer_address, 55);
>
> /* Example of handling a difficult time source. In SMP systems the CPU time sources are notoriously difficult to
> * sync. If we have such a problem then insure at least sure that the time source never goes backward.
> */
>
> u64 time_source_last;
>
> u64 get_cpu_time_filtered() {
> u64 x;
> u64 l;
This will need to be "static";
>
> do {
> l = time_source_last;
> x = cycles();
> if (x<l) return l;
> /* the cmpxchg is going to hurt in terms of scalability ! */
> } while (cmpxchg(&time_source_last, l, x) != l);
> return x;
> }
>
> Generate the time_source with
>
> make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44);
> -

Ok, so now lets hook this up with interval timers:

#define ns_per_jiffie (NSEC_PER_SEC / HZ)
#define jiffies_to_ns(jiff) (jiff * ns_per_jiffie)

This function is a request to interrupt at the next jiffie after the passed
reference jiffie. If that time is passed return true, else false.

int schedule_next_jiffie(unsigned long ref_jiffie)
{
u64 remains;
u64 now_frozen = now();
u32 jiffies_frozen = ((now_frozen * HZ) / 1000000000);

if (jiffies_frozen > ref_jiffies)
return 1;

remains = now_frozen - jiffies_to_ns(ref_jiffie + 1);
/*
* set_timer_to_interrupt(u64 ns) is provided by the arch layer
*/
set_timer_to_interrupt_in(remains);
return 0;
}

On interrupt:

time_adjust_skip(0);
do_timer();

Or, possibly do_timer() does the time_adjust_skip(0);

do_timer() will set up the softirq to run the timer list. The timer list could
could call schedule_next_jiffie, but there are SMP issues here.






--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-15 03:36:40

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Tue, 14 Sep 2004, George Anzinger wrote:

> > u64 time_source_to_ns(u64 x) {
> > return (((x-time_source_at_base) & time_source->mask)*time_source->multiply) >> time_source->shift;
> > }
>
> This seems to assume that the time souce is incrementing. On some archs, I
> think, it decrements...

This could be handled by a function that transforms the value read from
the counter into an incrementing value. I.e.

u64 get_rev_timerval(void) {
return 1<< 55 - readq(TIMER_PORT);
}

> So we would do "time_adjust_skip(0);" to update time_source_at_base?

There is no reason to update time_source_at_base unless adjustments need
to be done or a danger exists of the counter wrapping around (16 bit
counter?)

> If we do a "good" job of choosing <multiply> and <shift> this will be a "very"
> small change. Might be better to pass in a "delta" to change it by. Then you
> would only need one function.

These are the raw routines. Higher level function could translate a delta
into the appropriate adjustments.

> The mask and the shift value are not really related. The mask is a function of
> the number of bits the hardware provides. The shift is related to the value of
> freq. Me thinks they should not be tied together here.

They are related because the maximum shift for a 64 bit value without
loosing bits is 64 - number of significant bits. This basically insures
maximum precision when scaling the counter.


> > /* Values in use in the kernel and how they may be derived from xtime */
> > #define jiffies (now()/1000000)
>
> This assumes HZ=1000. (Assuming there is an HZ any more, that is.) Not all
> archs will want this value. Possibly:
> #define jiffies ((now() * HZ) / 1000000000)

Right. I just thought of the standard case HZ=1000.

> > u64 get_cpu_time_filtered() {
> > u64 x;
> > u64 l;
> This will need to be "static";

Nope. time_source_last is the global. l is just a copy of
time_source_last.

> Ok, so now lets hook this up with interval timers:
>
> #define ns_per_jiffie (NSEC_PER_SEC / HZ)
> #define jiffies_to_ns(jiff) (jiff * ns_per_jiffie)
>
> This function is a request to interrupt at the next jiffie after the passed
> reference jiffie. If that time is passed return true, else false.

One could do this but we want to have a tickless system. The tick is only
necessary if the time needs to be adjusted.

But you are right there is the need for timer event scheduling that is
not included yet. This should be a method of the time source.

2004-09-15 06:53:13

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

V3 of the concept for a time subsystem.
* add a function to the time_source_t to be able to trigger the timer
interrupt at a certain counter value instead of at regular intervals
* function to map back from nanoseconds to counter units
* add primitive scheduler that wakes up for timer events and
figures out when to schedule the next event.
* Some info at the end on how maybe have the NTP code interact with
this stuff.

The deferral of the time_base updates actually is beneficial since it
allows a more accurate scaling of the counter and thus does not loose
remaining nanoseconds as often.

/****************************************************************
* A concept for a new timer module
* V3
*
* Christoph Lameter, September 14, 2003
***************************************************************/

struct time_source_t {
u8 type;
u8 shift;
u32 multiply;
void *address;
u64 mask;
int (*interrupt_at)(u64 counter_value);
};

#define TIME_SOURCE_CPU 0
#define TIME_SOURCE_MMIO32 1
#define TIME_SOURCE_MMIO64 2
#define TIME_SOURCE_FUNCTION 15

struct seqlock_t time_lock; /* Protects access to time_base time_source time_mono_offset and time_source_at_base */
u64 time_base;
u64 time_mono_offset;
struct time_source_t *time_source;
u64 time_source_at_base; /* time source value at time time_base */

u64 inline time_source_get(struct time_source_t *s) {
switch (s->type) {
case TIME_SOURCE_CPU : return cycles();
case TIME_SOURCE_MMIO32 : return readl(s->address);
case TIME_SOURCE_MMIO64 : return readq(s->address);
case TIME_SOURCE_FUNCTION: {
u64 (*x)(void);
x = s->address;
return x();
}
}

/* Convert from time source units to ns */
u64 time_source_to_ns(u64 x) {
return ((x & time_source->mask)*time_source->multiply) >> time_source->shift;
}

/* Convert from ns time to time source units */
u64 ns_to_time_source(u64 x) {
return (x << time_source->shift ) / time_source->multiply;
}
inline u64 now(void) {
u64 t;
unsigned long seq;

do {
seq = read_seqbegin(&time_lock);
t = time_base + time_source_to_ns(time_source_get(time_source) - time_source_at_base);
} while (unlikely(seq_retry(&time_lock,seq)));
return t;
}

#define now_mono() (now()+time_mono_offset)

/* Time adjustment (only possible forward, to go backwards adjust time_source->multiply and wait ...) */
void time_adjust_skip(u64 ns) {
u64 c;

write_seqlock(&time_lock);
c = time_source_get(time_source);
time_base += ns + time_source_to_ns(c - time_source_at_base);
time_source_at_base = c;
write_sequnlock(&time_lock);
}

void time_adjust_slower(void) {
u64 c;

write_seqlock(&time_lock);
c = time_source_get(time_source);
time_base += time_source_to_ns(c - time_source_at_base);
time_source_at_base = c;
time_source->multiply--;
write_sequnlock(&time_lock);
}

void time_adjust_faster(void) {
u64 c;

write_seqlock(&time_lock);
c = time_source_get(time_source);
time_base += time_source_to_ns(c - time_source_at_base);
time_source_at_base = c;
time_source->multiply++;
write_sequnlock(&time_unlock);
}

/* Switch to another time source */
void new_time_source(struct time_source_t *s) {
u64 c_old;
u64 c_new;

write_seqlock(&time_lock);
c_old = time_source_get(time_source);
c_new = time_source_get(s);

time_base += time_source_to_ns(c_old - time_source_at_base);

time_source_at_base = c_new;
time_source = s;

write_sequnlock(&time_lock);
}

struct time_source_t *make_time_source(u64 freq, int t,void *a, int bits, void *i) {
struct time_source_t *s = kmalloc(sizeof(struct time_source_t));

s->shift = 64 - bits;
s->multiply = (NSEC_PER_SEC << s->shift) / freq;
s->address = a;
s->type = t;
s->mask = 1 << bits -1;
s->interrupt_at = i;
}

void time_set(u64 ns) {
u64 c;

write_seqlock(&time_lock);
c = time_source_get(time_source);
/* Adjust monotonic time base */
time_mono_offset += time_base + time_source_to_ns(c - time_source_at_base) - ns;
/* Setup new time base */
time_base = ns;
time_source_at_base = c;
write_sequnlock(&time_lock);
}

void time_init(struct time_source_t *s) {
{
write_seqlock(&time_lock);
time_base = 0;
time_mono_offset = 0;
time_source_at_base = time_source_get(s);
time_source = s;
write_sequnlock(&time_lock);
}

/* Values in use in the kernel and how they may be derived from xtime */
#define jiffies (now() * HZ /1000000000)
#define seconds(x) ((x)/1000000000)
#define nanoseconds_within_second(x) ((x)%1000000000)
#define microseconds_within_second(x) (nanoseconds_within_second(x) / 1000)

u32 time(void) {
return seconds(now());
}

void gettimeofday(struct timeval *p) {
u64 t=now();

p->tv_sec = seconds(t);
p->tv_usec = microseconds_within_second(t);
}

void clock_gettime(int clock, struct timespec *p) {
u64 t=now();

p->tv_sec = seconds(t);
p->tv_nsec = nanoseconds_within_second(t);
}

/* Exampe of a CPU time time source */
int itm_setup(u64 counter) {
set_itm(counter);
return 1; /* Success */
}

make_time_source(1200000000, TIME_SOURCE_CPU, NULL, 44,itm_setup);

/* A memory based time source at 20Mhz 55 bits wide, no ability to fire an interrupt */

make_time_source(20000000, TIME_SOURCE_MMIO64, &my_timer_address, 55, NULL);

/* Example of handling a difficult time source. In SMP systems the CPU time sources are notoriously difficult to
* sync. If we have such a problem then insure at least sure that the time source never goes backward.
*/

u64 time_source_last;

u64 get_cpu_time_filtered() {
u64 x;
u64 l;

do {
l = time_source_last;
x = cycles();
if (x<l) return l;
/* the cmpxchg is going to hurt in terms of scalability ! */
} while (cmpxchg(&time_source_last, l, x) != l);
return x;
}

Generate the time_source with

make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44, itm_setup);


/* event scheduler to be able to run events at specific times */

struct event {
u64 when;
u64 param;
void (*run)(u64);
struct event next;
}

struct event *event_queue;

void event_run(void) {
u64 t;

spin_lock(event_handler);
t = now();
redo:
/* Run all events scheduled before now */
while (event_queue && event_queue->when <= t) {
struct event *e=event_queue;

e->run(e->param);
event_queue = e->next;
kfree(e);
}
if (event_queue = NULL) {
spin_unlock(event_handler);
return;
}
t = now();
d = t - event_queue->when;
if (d<= 0) goto retry;

time_source->interrupt_at(time_source_at_base + ns_to_time_source(event_queue->when - time_base));
spin_unlock(event_handler);
}

void event_new(u64 w,void (*r)(u64), u64 p) {
struct event *e=kmalloc(sizeof(struct event));

e->when =w;
e->param = p;
e->run = r;
spin_lock(event_handler);
if (event_queue) {
/* Insert time event at the appropriate time */
struct event *p = event_queue;
while (p->next && p->next->w < w) p = p->next;
e->next = p->next;
p->next = e;
} else
event_queue=e;
spin_unlock(event_handler);
event_run();
}

void scheduled_timer_interrupt(void) {
event_run();
}

/* simulation of the old tick behavior */
tick(u64 when) {
/* do tick stuff */

/* time base update .... */
time_source_adjust(0);
/* Schedule next timer tick */
event_new(when + NSEC_PER_SEC / HZ , tick, when + NSEC_PER_SEC / HZ);
}

/* New tick would be scheduled by the ntp logic when a correction is needed.
* ntp logic needs to decide when to skip a few nanosecond or slow down the clock or
* make the clock run faster.
* One way to do this is to accumulate a time difference to real time.
* if this time difference is small and positive then skip time forward a bit.
* if the time difference is negative then slow down the clock.
* if the time difference is way too high then accelerate the clock
*/


2004-09-15 08:06:56

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

Christoph Lameter wrote:
> On Tue, 14 Sep 2004, George Anzinger wrote:
>
>
>>>u64 time_source_to_ns(u64 x) {
>>> return (((x-time_source_at_base) & time_source->mask)*time_source->multiply) >> time_source->shift;
>>>}
>>
>>This seems to assume that the time souce is incrementing. On some archs, I
>>think, it decrements...
>
>
> This could be handled by a function that transforms the value read from
> the counter into an incrementing value. I.e.
>
> u64 get_rev_timerval(void) {
> return 1<< 55 - readq(TIMER_PORT);
> }
>
>
>>So we would do "time_adjust_skip(0);" to update time_source_at_base?
>
>
> There is no reason to update time_source_at_base unless adjustments need
> to be done or a danger exists of the counter wrapping around (16 bit
> counter?)

Yes, for example the pm counter is 24 bits. A lot of platforms have 32 bit
counters...
>
>
>>If we do a "good" job of choosing <multiply> and <shift> this will be a "very"
>>small change. Might be better to pass in a "delta" to change it by. Then you
>>would only need one function.
>
>
> These are the raw routines. Higher level function could translate a delta
> into the appropriate adjustments.
>
>
>>The mask and the shift value are not really related. The mask is a function of
>>the number of bits the hardware provides. The shift is related to the value of
>>freq. Me thinks they should not be tied together here.
>
>
> They are related because the maximum shift for a 64 bit value without
> loosing bits is 64 - number of significant bits. This basically insures
> maximum precision when scaling the counter.

Lets assume the pm counter which has 24 bits. This means your shift is 40 bits.
In "s->multiply = (NSEC_PER_SEC << s->shift) / freq;" you will have an overflow.
Here you need to keep (NSEC_PER_SEC << s->shift) in 64 bits AND multiply must
also be 32 bits or less. I really don't think you can choose the scale so easily.
>
>
>
>>>/* Values in use in the kernel and how they may be derived from xtime */
>>>#define jiffies (now()/1000000)
>>
>>This assumes HZ=1000. (Assuming there is an HZ any more, that is.) Not all
>>archs will want this value. Possibly:
>>#define jiffies ((now() * HZ) / 1000000000)
>
>
> Right. I just thought of the standard case HZ=1000.
>
>
>>>u64 get_cpu_time_filtered() {
>>> u64 x;
>>> u64 l;
>>
>>This will need to be "static";
>
>
> Nope. time_source_last is the global. l is just a copy of
> time_source_last.

Right, I miss read the function. cycles() should be now() if I am reading this
right.
>
>
>>Ok, so now lets hook this up with interval timers:
>>
>>#define ns_per_jiffie (NSEC_PER_SEC / HZ)
>>#define jiffies_to_ns(jiff) (jiff * ns_per_jiffie)
>>
>>This function is a request to interrupt at the next jiffie after the passed
>>reference jiffie. If that time is passed return true, else false.
>
>
> One could do this but we want to have a tickless system. The tick is only
> necessary if the time needs to be adjusted.

I really think a tickless system, for other than UML systems, is a loosing
thing. The accounting overhead on context switch (which increases as the number
of switchs per second) will cause more overhead than a periodic accounting tick
once a respectable load appears. The periodic accounting tick has a flat
overhead that does not depend on load.
>
> But you are right there is the need for timer event scheduling that is
> not included yet. This should be a method of the time source.
>
I am not sure that is the right thing to do here. For example, on SMP systems
today we have a timer event interrupt per cpu. This is much more scaleable and
not so easy to do if we tie it to the time source. All we need is a reasonably
accurate short term counter.
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-15 08:55:38

by Dominik Brodowski

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wed, Sep 15, 2004 at 01:04:04AM -0700, George Anzinger wrote:
> >One could do this but we want to have a tickless system. The tick is only
> >necessary if the time needs to be adjusted.
>
> I really think a tickless system, for other than UML systems, is a loosing
> thing. The accounting overhead on context switch (which increases as the
> number of switchs per second) will cause more overhead than a periodic
> accounting tick once a respectable load appears.
^^^^^^^^^^^^^^^^

On a largely idle system (like notebooks on battery power in typical use)
the accounting overhead will be less a problem. However, the CPU being
woken up each millisecond will cause an increased battery usage. So if
the load is less than a certain threshold, tickless systems do make much
sense.

Dominik

2004-09-15 09:14:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

> I really think a tickless system, for other than UML systems, is a loosing
> thing. The accounting overhead on context switch (which increases as the

You can run the accounting independently at much lower frequency (10ms is
perfectly fine as 2.4 has proven - i suspect even lower would be ok too)
IMHO it should be a sysctl so that users can do a trade off between
power consumption and accounting accuracy.

And there is one important special that doesn't need any accounting
for user space at all: the idle loop.
It still needs some way to account interrupts, but that could be done
in do_IRQ or also do with rather low frequency.

-Andi

2004-09-15 15:53:41

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wed, 15 Sep 2004, George Anzinger wrote:

> Lets assume the pm counter which has 24 bits. This means your shift is 40 bits.
> In "s->multiply = (NSEC_PER_SEC << s->shift) / freq;" you will have an overflow.
> Here you need to keep (NSEC_PER_SEC << s->shift) in 64 bits AND multiply must
> also be 32 bits or less. I really don't think you can choose the scale so easily.

Thanks for pointing that out. Will fix that. I sure wish one could
have a 128 bit intermediate result.

> > Nope. time_source_last is the global. l is just a copy of
> > time_source_last.
>
> Right, I miss read the function. cycles() should be now() if I am reading this
> right.

Sortof. now() is the time in nanoseconds whereas cycles() is a counter
value of the counter in the cpu.

> > One could do this but we want to have a tickless system. The tick is only
> > necessary if the time needs to be adjusted.
>
> I really think a tickless system, for other than UML systems, is a loosing
> thing. The accounting overhead on context switch (which increases as the number
> of switchs per second) will cause more overhead than a periodic accounting tick
> once a respectable load appears. The periodic accounting tick has a flat
> overhead that does not depend on load.

I am not following you here. Why does the context switch overhead
increase? Because there are multiple interrupts for different tasks done
in the tick?

2004-09-15 16:42:03

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Tue, 2004-09-14 at 23:46, Christoph Lameter wrote:
> V3 of the concept for a time subsystem.
> * add a function to the time_source_t to be able to trigger the timer
> interrupt at a certain counter value instead of at regular intervals
> * function to map back from nanoseconds to counter units
> * add primitive scheduler that wakes up for timer events and
> figures out when to schedule the next event.
> * Some info at the end on how maybe have the NTP code interact with
> this stuff.
>
> The deferral of the time_base updates actually is beneficial since it
> allows a more accurate scaling of the counter and thus does not loose
> remaining nanoseconds as often.
>
> /****************************************************************
> * A concept for a new timer module
> * V3
> *
> * Christoph Lameter, September 14, 2003
> ***************************************************************/
>
> struct time_source_t {
> u8 type;
> u8 shift;
> u32 multiply;
> void *address;
> u64 mask;
> int (*interrupt_at)(u64 counter_value);
> };

Could you explain the interrupt_at function further?



> /* Exampe of a CPU time time source */
> int itm_setup(u64 counter) {
> set_itm(counter);
> return 1; /* Success */
> }
>
> make_time_source(1200000000, TIME_SOURCE_CPU, NULL, 44,itm_setup);
>
> /* A memory based time source at 20Mhz 55 bits wide, no ability to fire an interrupt */
>
> make_time_source(20000000, TIME_SOURCE_MMIO64, &my_timer_address, 55, NULL);
>
> /* Example of handling a difficult time source. In SMP systems the CPU time sources are notoriously difficult to
> * sync. If we have such a problem then insure at least sure that the time source never goes backward.
> */
>
> u64 time_source_last;
>
> u64 get_cpu_time_filtered() {
> u64 x;
> u64 l;
>
> do {
> l = time_source_last;
> x = cycles();
> if (x<l) return l;
> /* the cmpxchg is going to hurt in terms of scalability ! */
> } while (cmpxchg(&time_source_last, l, x) != l);
> return x;
> }
>
> Generate the time_source with
>
> make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44, itm_setup);

Out of curiosity, what happens in your fsyscall code in these cases
where TIME_SOURCE_FUNCTION is used?


> /* event scheduler to be able to run events at specific times */
>
> struct event {
> u64 when;
> u64 param;
> void (*run)(u64);
> struct event next;
> }
>
> struct event *event_queue;
>
> void event_run(void) {
> u64 t;
>
> spin_lock(event_handler);
> t = now();
> redo:
> /* Run all events scheduled before now */
> while (event_queue && event_queue->when <= t) {
> struct event *e=event_queue;
>
> e->run(e->param);
> event_queue = e->next;
> kfree(e);
> }
> if (event_queue = NULL) {
> spin_unlock(event_handler);
> return;
> }
> t = now();
> d = t - event_queue->when;
> if (d<= 0) goto retry;
>
> time_source->interrupt_at(time_source_at_base + ns_to_time_source(event_queue->when - time_base));
> spin_unlock(event_handler);
> }
>
> void event_new(u64 w,void (*r)(u64), u64 p) {
> struct event *e=kmalloc(sizeof(struct event));
>
> e->when =w;
> e->param = p;
> e->run = r;
> spin_lock(event_handler);
> if (event_queue) {
> /* Insert time event at the appropriate time */
> struct event *p = event_queue;
> while (p->next && p->next->w < w) p = p->next;
> e->next = p->next;
> p->next = e;
> } else
> event_queue=e;
> spin_unlock(event_handler);
> event_run();
> }
>
> void scheduled_timer_interrupt(void) {
> event_run();
> }
>
> /* simulation of the old tick behavior */
> tick(u64 when) {
> /* do tick stuff */
>
> /* time base update .... */
> time_source_adjust(0);
> /* Schedule next timer tick */
> event_new(when + NSEC_PER_SEC / HZ , tick, when + NSEC_PER_SEC / HZ);
> }

Hmm. This is getting somewhat tangled in my mind. Who is calling tick()
originally? How is event_run being called?

> /* New tick would be scheduled by the ntp logic when a correction is needed.
> * ntp logic needs to decide when to skip a few nanosecond or slow down the clock or
> * make the clock run faster.
> * One way to do this is to accumulate a time difference to real time.
> * if this time difference is small and positive then skip time forward a bit.
> * if the time difference is negative then slow down the clock.
> * if the time difference is way too high then accelerate the clock
> */

Well, this is still a bit vague. do_adjtimex gives us 4 values we need
to use in adjusting time. The parts-per-million (ppm - all of which are
signed) tick adjustment value, the ppm frequency adjustment, the ppm
offset adjustment, and the singleshot offset adjustment length (# of
nsecs in which we apply the maximum ppm singleshot offset adjustment).

How do these 4 values get translated to adjusting the time source?

You might want to just swipe my timeofday-core patch and re-work the
timesource.h interface to your liking (make it like time_source_t). That
way you get all the NTP details as well as the interrupt separation for
free.


2004-09-15 16:51:10

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wed, 15 Sep 2004, john stultz wrote:

> > struct time_source_t {
> > u8 type;
> > u8 shift;
> > u32 multiply;
> > void *address;
> > u64 mask;
> > int (*interrupt_at)(u64 counter_value);
> > };
>
> Could you explain the interrupt_at function further?

Generates a timer interrupt when a certain counter value has been reached.
This is a common feature of most clock chips that I am aware of.

> > make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44, itm_setup);
>
> Out of curiosity, what happens in your fsyscall code in these cases
> where TIME_SOURCE_FUNCTION is used?

It backs out of the fast call track and does a regular call instead.

> > /* simulation of the old tick behavior */
> > tick(u64 when) {
> > /* do tick stuff */
> >
> > /* time base update .... */
> > time_source_adjust(0);
> > /* Schedule next timer tick */
> > event_new(when + NSEC_PER_SEC / HZ , tick, when + NSEC_PER_SEC / HZ);
> > }
>
> Hmm. This is getting somewhat tangled in my mind. Who is calling tick()
> originally? How is event_run being called?

This is just some toying around with the concept. Maybe tick could be
called by the regular timer interrupt?

>
> > /* New tick would be scheduled by the ntp logic when a correction is needed.
> > * ntp logic needs to decide when to skip a few nanosecond or slow down the clock or
> > * make the clock run faster.
> > * One way to do this is to accumulate a time difference to real time.
> > * if this time difference is small and positive then skip time forward a bit.
> > * if the time difference is negative then slow down the clock.
> > * if the time difference is way too high then accelerate the clock
> > */
>
> Well, this is still a bit vague. do_adjtimex gives us 4 values we need
> to use in adjusting time. The parts-per-million (ppm - all of which are
> signed) tick adjustment value, the ppm frequency adjustment, the ppm
> offset adjustment, and the singleshot offset adjustment length (# of
> nsecs in which we apply the maximum ppm singleshot offset adjustment).
>
> How do these 4 values get translated to adjusting the time source?

I agree this is vague and I have no clue here. I hoped that you
could come up with some may to use these functions for adjtimex etc? I
would need some time to figure out how the adjusting works in order to
come up with a solution. I thought we agreed you do the NTP and time
adjustment things ;-) I have never touched that code....

> You might want to just swipe my timeofday-core patch and re-work the
> timesource.h interface to your liking (make it like time_source_t). That
> way you get all the NTP details as well as the interrupt separation for
> free.

Will have a look at it and put it together when I have some time.

2004-09-15 17:23:50

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wed, 2004-09-15 at 09:46, Christoph Lameter wrote:
> On Wed, 15 Sep 2004, john stultz wrote:
>
> > > struct time_source_t {
> > > u8 type;
> > > u8 shift;
> > > u32 multiply;
> > > void *address;
> > > u64 mask;
> > > int (*interrupt_at)(u64 counter_value);
> > > };
> >
> > Could you explain the interrupt_at function further?
>
> Generates a timer interrupt when a certain counter value has been reached.
> This is a common feature of most clock chips that I am aware of.

Well, not all time sources have that feature. Both TSC, and cyclone
don't. You'd have to do something else for those. This is why my
proposal has absolutely nothing to do with interrupt generation. It has
a single hook that needs to be called only every once in a while, which
can be done however any architecture wants.

Interrupt generation has more to do with soft-timers and scheduling then
time of day anyway.

> > > make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44, itm_setup);
> >
> > Out of curiosity, what happens in your fsyscall code in these cases
> > where TIME_SOURCE_FUNCTION is used?
>
> It backs out of the fast call track and does a regular call instead.

Ok, just curious.

> > > /* simulation of the old tick behavior */
> > > tick(u64 when) {
> > > /* do tick stuff */
> > >
> > > /* time base update .... */
> > > time_source_adjust(0);
> > > /* Schedule next timer tick */
> > > event_new(when + NSEC_PER_SEC / HZ , tick, when + NSEC_PER_SEC / HZ);
> > > }
> >
> > Hmm. This is getting somewhat tangled in my mind. Who is calling tick()
> > originally? How is event_run being called?
>
> This is just some toying around with the concept. Maybe tick could be
> called by the regular timer interrupt?

Ok, no worries. Its just one of the big reasons I started my earlier
proposal is to lessen the dependencies and cleanup the interactions
between the interrupt generation, soft-timer and the time of day code.


> > > /* New tick would be scheduled by the ntp logic when a correction is needed.
> > > * ntp logic needs to decide when to skip a few nanosecond or slow down the clock or
> > > * make the clock run faster.
> > > * One way to do this is to accumulate a time difference to real time.
> > > * if this time difference is small and positive then skip time forward a bit.
> > > * if the time difference is negative then slow down the clock.
> > > * if the time difference is way too high then accelerate the clock
> > > */
> >
> > Well, this is still a bit vague. do_adjtimex gives us 4 values we need
> > to use in adjusting time. The parts-per-million (ppm - all of which are
> > signed) tick adjustment value, the ppm frequency adjustment, the ppm
> > offset adjustment, and the singleshot offset adjustment length (# of
> > nsecs in which we apply the maximum ppm singleshot offset adjustment).
> >
> > How do these 4 values get translated to adjusting the time source?
>
> I agree this is vague and I have no clue here. I hoped that you
> could come up with some may to use these functions for adjtimex etc? I
> would need some time to figure out how the adjusting works in order to
> come up with a solution. I thought we agreed you do the NTP and time
> adjustment things ;-) I have never touched that code....

Heh. Well, I don't think anyone wants to do the NTP work, but I did have
to go through it to implement my proposal. Unfortunately it isn't
something that can easily be tacked on to any design, so you're going to
have to understand some aspects of it if you want to do your own
design. That said, if you have any questions, hopefully I can explain
my interpretation of the existing code as well as my own implementation.

> > You might want to just swipe my timeofday-core patch and re-work the
> > timesource.h interface to your liking (make it like time_source_t). That
> > way you get all the NTP details as well as the interrupt separation for
> > free.
>
> Will have a look at it and put it together when I have some time.

Sure thing, and I'll try to take a stab at it myself after next week.


thanks
-john

2004-09-15 17:36:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wed, 15 Sep 2004, john stultz wrote:

> On Wed, 2004-09-15 at 09:46, Christoph Lameter wrote:
> > On Wed, 15 Sep 2004, john stultz wrote:
> >
> > > > struct time_source_t {
> > > > u8 type;
> > > > u8 shift;
> > > > u32 multiply;
> > > > void *address;
> > > > u64 mask;
> > > > int (*interrupt_at)(u64 counter_value);
> > > > };
> > >
> > > Could you explain the interrupt_at function further?
> >
> > Generates a timer interrupt when a certain counter value has been reached.
> > This is a common feature of most clock chips that I am aware of.
>
> Well, not all time sources have that feature. Both TSC, and cyclone
> don't. You'd have to do something else for those. This is why my
> proposal has absolutely nothing to do with interrupt generation. It has
> a single hook that needs to be called only every once in a while, which
> can be done however any architecture wants.
>
> Interrupt generation has more to do with soft-timers and scheduling then
> time of day anyway.

Hmm... I think this is a serious issue. The ability to exactly time an
interrupt and therefore specific actions is important. Maybe we can have a
fall back to soft interrupts if interrupt_at == NULL?

If the scheduler could assign dynamic time slices to processes then new
more effective designs of process scheduling become possible. F.e. on an
SMP system if a single process is running and no other contention exists
then the scheduler can simply let the process run without interruption.

On the other hand the system may reduce the time slices assigned to a
process that causes significant load on the system in order to insure the
responsiveness of other applications despite the load.

If one can schedule a tick dynamically then also NTP time correction can
be "scheduled" when it is likely that the clock has sufficiently deviated
from standard time. As soon as the time source has been sufficiently tamed
by scaling it correctly then the periods of checkup on the time source
could be gradually expanded.

Real time features such as posix-timer's also depend on the ability to
deliver a signal at an exact point in time. Soft timers can only give a
very rough approximation in these cases.

So I think this feature is essential.

2004-09-15 17:59:39

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

Dominik Brodowski wrote:
> On Wed, Sep 15, 2004 at 01:04:04AM -0700, George Anzinger wrote:
>
>>>One could do this but we want to have a tickless system. The tick is only
>>>necessary if the time needs to be adjusted.
>>
>>I really think a tickless system, for other than UML systems, is a loosing
>>thing. The accounting overhead on context switch (which increases as the
>>number of switchs per second) will cause more overhead than a periodic
>>accounting tick once a respectable load appears.
>
> ^^^^^^^^^^^^^^^^
>
> On a largely idle system (like notebooks on battery power in typical use)
> the accounting overhead will be less a problem. However, the CPU being
> woken up each millisecond will cause an increased battery usage. So if
> the load is less than a certain threshold, tickless systems do make much
> sense.

At MontaVista I have been working on a thing we call VST which looks ahead in
the timer list and, finding nothing for N ticks, turns off the ticker until that
time. It is not tickless, unless the system is idle, but then it can go
tickless for as long as the max value that can be programmed on the wakeup
timer. Interrupts prior to that time will, of course, also wake the system.

Seems like the best of both worlds to me.

An early version of this is on the HRT sourceforge site.


--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-15 18:05:17

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

Christoph Lameter wrote:
~

>>>One could do this but we want to have a tickless system. The tick is only
>>>necessary if the time needs to be adjusted.
>>
>>I really think a tickless system, for other than UML systems, is a loosing
>>thing. The accounting overhead on context switch (which increases as the number
>>of switchs per second) will cause more overhead than a periodic accounting tick
>>once a respectable load appears. The periodic accounting tick has a flat
>>overhead that does not depend on load.
>
>
> I am not following you here. Why does the context switch overhead
> increase? Because there are multiple interrupts for different tasks done
> in the tick?
>
Each task has several timers, i.e. time slice, time limit, and possibly itimer
profile. Granted only one of these needs to be sent to the timer code, but that
takes a bit of time, not much, but enough to increase the context switch
overhead such that a system with a modest amount of context switching will incur
more timer management overhead than the periodic tick generates.

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-15 18:30:14

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)


On Wed, 15 Sep 2004, George Anzinger wrote:
> > I am not following you here. Why does the context switch overhead
> > increase? Because there are multiple interrupts for different tasks done
> > in the tick?
> >
> Each task has several timers, i.e. time slice, time limit, and possibly itimer
> profile. Granted only one of these needs to be sent to the timer code, but that
> takes a bit of time, not much, but enough to increase the context switch
> overhead such that a system with a modest amount of context switching will incur
> more timer management overhead than the periodic tick generates.

I thought that the timer handling could be separate from the time slice
handling of the scheduler. With the ability schedule events as needed the
scheduler could generate its own independent timing from the necessary
clock adjustments. The tick would be disassembled into its various
components.

2004-09-15 18:53:20

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wed, 2004-09-15 at 10:30, Christoph Lameter wrote:
> On Wed, 15 Sep 2004, john stultz wrote:
>
> > Well, not all time sources have that feature. Both TSC, and cyclone
> > don't. You'd have to do something else for those. This is why my
> > proposal has absolutely nothing to do with interrupt generation. It has
> > a single hook that needs to be called only every once in a while, which
> > can be done however any architecture wants.
> >
> > Interrupt generation has more to do with soft-timers and scheduling then
> > time of day anyway.
>
> Hmm... I think this is a serious issue. The ability to exactly time an
> interrupt and therefore specific actions is important. Maybe we can have a
> fall back to soft interrupts if interrupt_at == NULL?

Oh, its an issue, but one the time of day subsystem shouldn't solve. The
code that manages interval (be it timer or whatever) interrupts should
use the time of day subsystem to ensure they are triggering accurately,
and manipulate the intervals as needed.

> If the scheduler could assign dynamic time slices to processes then new
> more effective designs of process scheduling become possible. F.e. on an
> SMP system if a single process is running and no other contention exists
> then the scheduler can simply let the process run without interruption.
>
> On the other hand the system may reduce the time slices assigned to a
> process that causes significant load on the system in order to insure the
> responsiveness of other applications despite the load.

Yep, all of this is outside what the time of day subsystem should do.
Interval interrupt scheduling should be done by other code (I suggest
the timer code do it).


> If one can schedule a tick dynamically then also NTP time correction can
> be "scheduled" when it is likely that the clock has sufficiently deviated
> from standard time. As soon as the time source has been sufficiently tamed
> by scaling it correctly then the periods of checkup on the time source
> could be gradually expanded.

NTP adjustments need to be applied smoothly and consistently as time
progresses. Adjustment to the NTP parameters occur at do_adjtimex and at
interval boundaries (ie: the interrupt_hook in my code, or look at
update_wall_time and second_overflow in the existing code). So in my
understanding its not really a schedulable function.


> Real time features such as posix-timer's also depend on the ability to
> deliver a signal at an exact point in time. Soft timers can only give a
> very rough approximation in these cases.
>
> So I think this feature is essential.

I think the functionality is essential, but that it doesn't belong in the time of day code.

Basically we have two things we're trying to do:

1. Keep accurate time
2. Generate hardware interrupts accurately

While frequently the same hardware can do both, not all hardware is
usable for both functions. Thus I believe we should cleanly split these
two subsystems. My proposal only provided the keep accurate time part,
however one could using that functionality, to then manipulate hardware
interrupts to ensure accuracy in the timer subsystem.

thanks
-john

2004-09-15 20:02:44

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

john stultz wrote:
~

>
>
>>Real time features such as posix-timer's also depend on the ability to
>>deliver a signal at an exact point in time. Soft timers can only give a
>>very rough approximation in these cases.
>>
>>So I think this feature is essential.
>
>
> I think the functionality is essential, but that it doesn't belong in the time of day code.
>
> Basically we have two things we're trying to do:
>
> 1. Keep accurate time
> 2. Generate hardware interrupts accurately
>
> While frequently the same hardware can do both, not all hardware is
> usable for both functions. Thus I believe we should cleanly split these
> two subsystems. My proposal only provided the keep accurate time part,
> however one could using that functionality, to then manipulate hardware
> interrupts to ensure accuracy in the timer subsystem.
>
The thing I think is missing in all this is that, in some platforms, the
hardware to provide the interrupts is more accurate. We have, IMHO, three cases
here:
a) The interrupts are accurate but the clock info (e.g. TSC) is not.
b) The clock in accurate but the timer is not, and
c) The clock and interrupt come from the same accurate hardware source.

The X86 is in class a) in that the PIT is accurate and the TSC is not. The
muddy part here is the pm-timer which is accurate but takes a _long_ time to access.

PPCs are in class c) as are some MIPS, ARM, and PARISC. I am not sure about the
rest and can not lay my hands on one of class b).

For the class a) machines, I think the best approach is to use a clock that has
reasonable short term accuracy and to use the timer to, from time to time,
correct it. This should be on the order of an internal NTP sort of correction.

For the class a) and b) machines, I think it would be wise to not seperated the
timer and the clock so much as to make it "hard" to use one to correct the other.
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

2004-09-15 20:28:43

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On Wed, 15 Sep 2004, john stultz wrote:

> On Wed, 2004-09-15 at 10:30, Christoph Lameter wrote:
> > On Wed, 15 Sep 2004, john stultz wrote:
> >
> > > Well, not all time sources have that feature. Both TSC, and cyclone
> > > don't. You'd have to do something else for those. This is why my
> > > proposal has absolutely nothing to do with interrupt generation. It has
> > > a single hook that needs to be called only every once in a while, which
> > > can be done however any architecture wants.
> > >
> > > Interrupt generation has more to do with soft-timers and scheduling then
> > > time of day anyway.
> >
> > Hmm... I think this is a serious issue. The ability to exactly time an
> > interrupt and therefore specific actions is important. Maybe we can have a
> > fall back to soft interrupts if interrupt_at == NULL?
>
> Oh, its an issue, but one the time of day subsystem shouldn't solve. The
> code that manages interval (be it timer or whatever) interrupts should
> use the time of day subsystem to ensure they are triggering accurately,
> and manipulate the intervals as needed.

Huh? These are time related events. They cannot be managed effectively
outside of the time subsystem. You need at least a regularly interrupting
time source even for the current tick based approach.

> I think the functionality is essential, but that it doesn't belong in the time of day code.

Hmm. I was thinking all of this was about managing
the various time sources not only the time of day code.

> Basically we have two things we're trying to do:
>
> 1. Keep accurate time
> 2. Generate hardware interrupts accurately
>
> While frequently the same hardware can do both, not all hardware is
> usable for both functions. Thus I believe we should cleanly split these
> two subsystems. My proposal only provided the keep accurate time part,
> however one could using that functionality, to then manipulate hardware
> interrupts to ensure accuracy in the timer subsystem.

These two are intrisincally related and all modern hardware that I know
of does both. In a sense the regularly occuring timer tick is also a
yet another form of time source but not does provide a counter (unless
simulated by jiffies). The timer subsystem needs to manage all
the available time sources of the system effectively and provide time
services to other subsystems.

2004-09-16 07:14:24

by Ulrich Windl

[permalink] [raw]
Subject: Re: [RFC][PATCH] new timeofday core subsystem (v.A0)

On 15 Sep 2004 at 11:48, john stultz wrote:

> NTP adjustments need to be applied smoothly and consistently as time
> progresses. Adjustment to the NTP parameters occur at do_adjtimex and at

Actually I think (constant) NTP adjustments need to be applied any time. That's
why I even tried to correct the TSC interpolation between ticks (If you visit the
PPSkit patch, that's the "nanoscale" thing). The "constant adjustment values" are
updated regularly also (per tick, per second, per NTP update).

The basic thought was: If the time between two ticks is not 1/HZ the TSC
interpolation (that is calibrated against the timer ticks) also will not
interpolate between 1/HZ ticks. I assumed the TSC interpolations is wrong by the
same percentage the timer interrupts are. Thus I tried to reversely apply the
frequency error learnt through NTP even to the TSC interpolation (in addition to
the time increments every tick). My code however has the problem that I ran out of
significant bits during several muliplies and divides, while supporting that
maximum imaginable frequency range for TSC. Even with the best oscillator, it
won't reach single nanosecond precision as I don't have enough bits for them.

Regards,
Ulrich