2005-09-28 20:43:55

by Thomas Gleixner

Subject: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

This is an updated version which contains the following changes:

- Selectable time storage format: union/struct based, scalar (64bit)
- Fixed an endless loop in forward_posix_timer (George Anzinger)
- Fixed a wrong sizeof(x) (George Anzinger)
- Fixed build problems for non x86 architectures

Roman pointed out that the penalty for some architectures
would be quite big when using the nsec_t (64bit) scalar time
storage format. After a long discussion and some more detailed
tests, especially on ARM, it turned out that the scalar format
is unfortunately not suitable everywhere. The tradeoff between
performance and cleanliness seems too big for some architectures.

After several rounds of functional conversions and
cleanups an acceptable compromise between cleanliness and
storage format flexibility was found.

For 64bit architectures the scalar representation is definitely
a win and therefore enabled unconditionally. The code defaults to
the union/struct based implementation on 32bit archs, but can be
switched to the scalar storage format by setting
CONFIG_KTIME_SCALAR=y if there is a benefit for the particular
architecture. The union/struct magic has an advantage over the
struct timespec based format which I considered using first: it
produces better and denser code on most architectures and does no
harm anywhere else. This might change as compilers improve, but
then it only requires replacing the related macros / inlines.
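
For illustration, a condensed sketch of the two storage formats (the
authoritative definition, including the big endian variant, is in
include/linux/ktimer.h below):

	/* 32bit default: union of a 64bit scalar and a sec/nsec pair.
	 * Comparisons can use tv64, while set/get and the normalization
	 * after add/sub only touch the 32bit fields.
	 */
	typedef union {
		s64 tv64;
		struct {
			s32 nsec, sec;	/* little endian layout */
		} tv;
	} ktime_t;

	/* 64bit architectures and CONFIG_KTIME_SCALAR=y:
	 * a plain nanosecond scalar.
	 */
	typedef s64 ktime_t;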

The code is not harder to understand than the previous
open coded scalar storage based implementation.

The correctness was verified with the posix timer tests from
the HRT project on the forward ported ktimers based high
resolution proof of concept implementation.
For those interested in this topic, the patch series is available
at http://www.tglx.de/private/tglx/ktimers/patch-2.6.14-rc2-kt5.patches.tar.bz2


Thanks for review and feedback.

tglx


ktimers separate the "timer API" from the "timeout API".
ktimers are used for:
- nanosleep
- posixtimers
- itimers


The patch contains the base implementation of ktimers and the
conversion of nanosleep, posixtimers and itimers to ktimer users.

The patch does not require other changes to the Linux time(r) core
system.

The implementation was done with the following constraints in mind:

- Not bound to jiffies
- Multiple time sources
- Per CPU timer queues
- Simplification of absolute CLOCK_REALTIME posix timers
- High resolution timer aware
- Allows the timeout API to reschedule the next event
(for tickless systems)

Ktimers enqueue timers into a time-sorted list, implemented with an
rbtree, which is efficient and already used in other performance
critical parts of the kernel. This is a bit slower than the timer
wheel, but since the vast majority of these timers actually expires,
that cost has to be weighed against the timer wheel's cascading penalty.
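
The insertion boils down to the following (simplified from
enqueue_ktimer() in kernel/ktimers.c below; the expiry mode switch and
the maintenance of the sorted pending list are left out):

	struct rb_node **link = &base->active.rb_node;
	struct rb_node *parent = NULL;
	struct ktimer *entry;

	while (*link) {
		parent = *link;
		entry = rb_entry(parent, struct ktimer, node);
		/* Nodes with the same expiry time stay together */
		if (ktimer_before(timer, entry))
			link = &(*link)->rb_left;
		else
			link = &(*link)->rb_right;
	}
	rb_link_node(&timer->node, parent, link);
	rb_insert_color(&timer->node, &base->active);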

The code supports multiple time sources. Currently implemented are
CLOCK_REALTIME and CLOCK_MONOTONIC. They provide separate timer queues
and support functions.
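
A minimal usage sketch of the exported API (the callback, the timer
variable and the expiry value are made up for illustration; the
functions and the KTIMER_REL mode are the ones provided by this patch):

	/* hypothetical example handler, runs from the timer softirq */
	static void my_callback(void *data)
	{
	}

	static struct ktimer my_timer;

	static void example(void)
	{
		ktime_t expires = ktime_set(1, 0);	/* 1 sec, 0 nsec */

		init_ktimer_mono(&my_timer);		/* CLOCK_MONOTONIC base */
		my_timer.function = my_callback;
		my_timer.data = NULL;
		start_ktimer(&my_timer, &expires, KTIMER_REL);
	}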

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

---
Index: linux-2.6.14-rc2-rt4/include/linux/calc64.h
===================================================================
--- /dev/null
+++ linux-2.6.14-rc2-rt4/include/linux/calc64.h
@@ -0,0 +1,31 @@
+#ifndef _linux_CALC64_H
+#define _linux_CALC64_H
+
+#include <linux/types.h>
+#include <asm/div64.h>
+
+#ifndef div_long_long_rem
+#define div_long_long_rem(dividend,divisor,remainder) \
+({ \
+ u64 result = dividend; \
+ *remainder = do_div(result,divisor); \
+ result; \
+})
+#endif
+
+static inline long div_long_long_rem_signed(long long dividend,
+ long divisor,
+ long *remainder)
+{
+ long res;
+
+ if (unlikely(dividend < 0)) {
+ res = -div_long_long_rem(-dividend, divisor, remainder);
+ *remainder = -(*remainder);
+ } else {
+ res = div_long_long_rem(dividend, divisor, remainder);
+ }
+ return res;
+}
+
+#endif
Index: linux-2.6.14-rc2-rt4/include/linux/jiffies.h
===================================================================
--- linux-2.6.14-rc2-rt4.orig/include/linux/jiffies.h
+++ linux-2.6.14-rc2-rt4/include/linux/jiffies.h
@@ -1,21 +1,12 @@
#ifndef _LINUX_JIFFIES_H
#define _LINUX_JIFFIES_H

+#include <linux/calc64.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/time.h>
#include <linux/timex.h>
#include <asm/param.h> /* for HZ */
-#include <asm/div64.h>
-
-#ifndef div_long_long_rem
-#define div_long_long_rem(dividend,divisor,remainder) \
-({ \
- u64 result = dividend; \
- *remainder = do_div(result,divisor); \
- result; \
-})
-#endif

/*
* The following defines establish the engineering parameters of the PLL
Index: linux-2.6.14-rc2-rt4/fs/exec.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/fs/exec.c
+++ linux-2.6.14-rc2-rt4/fs/exec.c
@@ -645,9 +645,10 @@ static inline int de_thread(struct task_
* synchronize with any firing (by calling del_timer_sync)
* before we can safely let the old group leader die.
*/
- sig->real_timer.data = (unsigned long)current;
- if (del_timer_sync(&sig->real_timer))
- add_timer(&sig->real_timer);
+ sig->real_timer.data = current;
+ if (stop_ktimer(&sig->real_timer))
+ start_ktimer(&sig->real_timer, NULL,
+ KTIMER_RESTART|KTIMER_NOCHECK);
}
while (atomic_read(&sig->count) > count) {
sig->group_exit_task = current;
@@ -659,7 +660,7 @@ static inline int de_thread(struct task_
}
sig->group_exit_task = NULL;
sig->notify_count = 0;
- sig->real_timer.data = (unsigned long)current;
+ sig->real_timer.data = current;
spin_unlock_irq(lock);

/*
Index: linux-2.6.14-rc2-rt4/fs/proc/array.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/fs/proc/array.c
+++ linux-2.6.14-rc2-rt4/fs/proc/array.c
@@ -330,7 +330,7 @@ static int do_task_stat(struct task_stru
unsigned long min_flt = 0, maj_flt = 0;
cputime_t cutime, cstime, utime, stime;
unsigned long rsslim = 0;
- unsigned long it_real_value = 0;
+ DEFINE_KTIME(it_real_value);
struct task_struct *t;
char tcomm[sizeof(task->comm)];

@@ -386,7 +386,7 @@ static int do_task_stat(struct task_stru
utime = cputime_add(utime, task->signal->utime);
stime = cputime_add(stime, task->signal->stime);
}
- it_real_value = task->signal->it_real_value;
+ it_real_value = task->signal->real_timer.expires;
}
ppid = pid_alive(task) ? task->group_leader->real_parent->tgid : 0;
read_unlock(&tasklist_lock);
@@ -435,7 +435,7 @@ static int do_task_stat(struct task_stru
priority,
nice,
num_threads,
- jiffies_to_clock_t(it_real_value),
+ (clock_t) ktime_to_clock_t(it_real_value),
start_time,
vsize,
mm ? get_mm_counter(mm, rss) : 0, /* you might want to shift this left 3 */
Index: linux-2.6.14-rc2-rt4/include/linux/ktimer.h
===================================================================
--- /dev/null
+++ linux-2.6.14-rc2-rt4/include/linux/ktimer.h
@@ -0,0 +1,335 @@
+#ifndef _LINUX_KTIMER_H
+#define _LINUX_KTIMER_H
+
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/time.h>
+#include <linux/wait.h>
+
+/* Timer API */
+
+/*
+ * Select the ktime_t data type
+ */
+#if defined(CONFIG_KTIME_SCALAR) || (BITS_PER_LONG == 64)
+ #define KTIME_IS_SCALAR
+#endif
+
+#ifndef KTIME_IS_SCALAR
+typedef union {
+ s64 tv64;
+ struct {
+#ifdef __BIG_ENDIAN
+ s32 sec, nsec;
+#else
+ s32 nsec, sec;
+#endif
+ } tv;
+} ktime_t;
+
+#else
+
+typedef s64 ktime_t;
+
+#endif
+
+struct ktimer_base;
+
+/*
+ * Timer structure must be initialized by init_ktimer_xxx !
+ */
+struct ktimer {
+ struct rb_node node;
+ struct list_head list;
+ ktime_t expires;
+ ktime_t expired;
+ ktime_t interval;
+ int overrun;
+ unsigned long status;
+ void (*function)(void *);
+ void *data;
+ struct ktimer_base *base;
+};
+
+/*
+ * Timer base struct
+ */
+struct ktimer_base {
+ int index;
+ char *name;
+ spinlock_t lock;
+ struct rb_root active;
+ struct list_head pending;
+ int count;
+ unsigned long resolution;
+ ktime_t (*get_time)(void);
+ struct ktimer *running_timer;
+ wait_queue_head_t wait_for_running_timer;
+};
+
+/*
+ * Values for the mode argument of xxx_ktimer functions
+ */
+enum
+{
+ KTIMER_NOREARM, /* Internal value */
+ KTIMER_ABS, /* Time value is absolute */
+ KTIMER_REL, /* Time value is relative to now */
+ KTIMER_INCR, /* Time value is relative to previous expiry time */
+ KTIMER_FORWARD, /* Timer is rearmed with value. Overruns are accounted */
+ KTIMER_REARM, /* Timer is rearmed with interval. Overruns are accounted */
+ KTIMER_RESTART /* Timer is restarted with the stored expiry value */
+};
+
+/* The timer states */
+enum
+{
+ KTIMER_INACTIVE,
+ KTIMER_PENDING,
+ KTIMER_EXPIRED,
+ KTIMER_EXPIRED_NOQUEUE,
+};
+
+/* Expiry must not be checked when the timer is started */
+#define KTIMER_NOCHECK 0x10000
+
+#define KTIMER_POISON ((void *) 0x00100101)
+
+#define KTIME_ZERO 0LL
+
+#define ktimer_active(t) ((t)->status != KTIMER_INACTIVE)
+#define ktimer_before(t1, t2) (ktime_cmp((t1)->expires, <, (t2)->expires))
+
+#ifndef KTIME_IS_SCALAR
+/*
+ * Helper macros/inlines to get the math with ktime_t right. Uurgh, that's
+ * ugly as hell, but for performance sake we have to use this. The
+ * nsec_t based code was nice and simple. :(
+ *
+ * Be careful when using this stuff. It blows up on you if you don't
+ * get the weirdness right.
+ *
+ * Be especially aware, that negative values are represented in the
+ * form:
+ * tv.sec < 0 and 0 <= tv.nsec < NSEC_PER_SEC
+ *
+ */
+#define DEFINE_KTIME(k) ktime_t k = {.tv64 = 0LL }
+
+#define ktime_cmp(a,op,b) ((a).tv64 op (b).tv64)
+#define ktime_cmp_val(a, op, b) ((a).tv64 op b)
+
+#define ktime_set(s,n) \
+({ \
+ ktime_t __kt; \
+ __kt.tv.sec = s; \
+ __kt.tv.nsec = n; \
+ __kt; \
+})
+
+#define ktime_set_zero(k) k.tv64 = 0LL
+
+#define ktime_set_low_high(l,h) ktime_set(h,l)
+
+#define ktime_get_low(t) (t).tv.nsec
+#define ktime_get_high(t) (t).tv.sec
+
+static inline ktime_t ktime_set_normalized(long sec, long nsec)
+{
+ ktime_t res;
+
+ while (nsec < 0) {
+ nsec += NSEC_PER_SEC;
+ sec--;
+ }
+ while (nsec >= NSEC_PER_SEC) {
+ nsec -= NSEC_PER_SEC;
+ sec++;
+ }
+
+ res.tv.sec = sec;
+ res.tv.nsec = nsec;
+ return res;
+}
+
+static inline ktime_t ktime_sub(ktime_t a, ktime_t b)
+{
+ ktime_t res;
+
+ res.tv64 = a.tv64 - b.tv64;
+ if (res.tv.nsec < 0)
+ res.tv.nsec += NSEC_PER_SEC;
+
+ return res;
+}
+
+static inline ktime_t ktime_add(ktime_t a, ktime_t b)
+{
+ ktime_t res;
+
+ res.tv64 = a.tv64 + b.tv64;
+ if (res.tv.nsec >= NSEC_PER_SEC) {
+ res.tv.nsec -= NSEC_PER_SEC;
+ res.tv.sec++;
+ }
+ return res;
+}
+
+static inline ktime_t ktime_add_ns(ktime_t a, u64 nsec)
+{
+ ktime_t tmp;
+
+ if (likely(nsec < NSEC_PER_SEC)) {
+ tmp.tv64 = nsec;
+ } else {
+ unsigned long rem;
+ rem = do_div(nsec, NSEC_PER_SEC);
+ tmp = ktime_set((long)nsec, rem);
+ }
+ return ktime_add(a,tmp);
+}
+
+#define timespec_to_ktime(ts) \
+({ \
+ ktime_t __kt; \
+ struct timespec __ts = (ts); \
+ __kt.tv.sec = (s32)__ts.tv_sec; \
+ __kt.tv.nsec = (s32)__ts.tv_nsec; \
+ __kt; \
+})
+
+#define ktime_to_timespec(kt) \
+({ \
+ struct timespec __ts; \
+ ktime_t __kt = (kt); \
+ __ts.tv_sec = (time_t)__kt.tv.sec; \
+ __ts.tv_nsec = (long)__kt.tv.nsec; \
+ __ts; \
+})
+
+#define ktime_to_timeval(kt) \
+({ \
+ struct timeval __tv; \
+ ktime_t __kt = (kt); \
+ __tv.tv_sec = (time_t)__kt.tv.sec; \
+ __tv.tv_usec = (long)(__kt.tv.nsec / NSEC_PER_USEC); \
+ __tv; \
+})
+
+#define ktime_to_clock_t(kt) \
+({ \
+ ktime_t __kt = (kt); \
+ u64 nsecs = (u64) __kt.tv.sec * NSEC_PER_SEC; \
+ nsec_to_clock_t(nsecs + (u64) __kt.tv.nsec); \
+})
+
+#define ktime_to_ns(kt) \
+({ \
+ ktime_t __kt = (kt); \
+ (((u64)__kt.tv.sec * NSEC_PER_SEC) + (u64)__kt.tv.nsec);\
+})
+
+#else
+
+/* ktime_t macros when using a 64bit variable */
+
+#define DEFINE_KTIME(kt) ktime_t kt = 0LL
+
+#define ktime_cmp(a,op,b) ((a) op (b))
+#define ktime_cmp_val(a,op,b) ((a) op b)
+
+#define ktime_set(s,n) (((s64) s * NSEC_PER_SEC) + (s64)n)
+#define ktime_set_zero(kt) kt = 0LL
+
+#define ktime_set_low_high(l,h) ((s64)((u64)l) | (((s64) h) << 32))
+
+#define ktime_get_low(t) ((t) & 0xFFFFFFFFLL)
+#define ktime_get_high(t) ((t) >> 32)
+
+#define ktime_sub(a,b) ((a) - (b))
+#define ktime_add(a,b) ((a) + (b))
+#define ktime_add_ns(a,b) ((a) + (b))
+
+#define timespec_to_ktime(ts) ktime_set(ts.tv_sec, ts.tv_nsec)
+
+#define ktime_to_timespec(kt) ns_to_timespec(kt)
+#define ktime_to_timeval(kt) ns_to_timeval(kt)
+
+#define ktime_to_clock_t(kt) nsec_to_clock_t(kt)
+
+#define ktime_to_ns(kt) (kt)
+
+#define ktime_set_normalized(s,n) ktime_set(s,n)
+
+#endif
+
+/* Exported functions */
+extern void fastcall init_ktimer_real(struct ktimer *timer);
+extern void fastcall init_ktimer_mono(struct ktimer *timer);
+extern int modify_ktimer(struct ktimer *timer, ktime_t *tim, int mode);
+extern int start_ktimer(struct ktimer *timer, ktime_t *tim, int mode);
+extern int try_to_stop_ktimer(struct ktimer *timer);
+extern int stop_ktimer(struct ktimer *timer);
+extern ktime_t get_remtime_ktimer(struct ktimer *timer, long fake);
+extern ktime_t get_expiry_ktimer(struct ktimer *timer, ktime_t *now);
+extern void __init init_ktimers(void);
+
+/* Conversion functions with rounding based on resolution */
+extern ktime_t ktimer_convert_timeval(struct ktimer *timer, struct timeval *tv);
+extern ktime_t ktimer_convert_timespec(struct ktimer *timer, struct timespec *ts);
+
+/* Posix timers current quirks */
+extern int get_ktimer_mono_res(clockid_t which_clock, struct timespec *tp);
+extern int get_ktimer_real_res(clockid_t which_clock, struct timespec *tp);
+
+/* nanosleep functions */
+long ktimer_nanosleep_mono(struct timespec *rqtp, struct timespec __user *rmtp, int mode);
+long ktimer_nanosleep_real(struct timespec *rqtp, struct timespec __user *rmtp, int mode);
+
+#if defined(CONFIG_SMP)
+extern void wait_for_ktimer(struct ktimer *timer);
+#else
+#define wait_for_ktimer(t) do {} while (0)
+#endif
+
+#define KTIME_REALTIME_RES (NSEC_PER_SEC/HZ)
+#define KTIME_MONOTONIC_RES (NSEC_PER_SEC/HZ)
+
+static inline void get_ktime_mono_ts(struct timespec *ts)
+{
+ unsigned long seq;
+ struct timespec tomono;
+ do {
+ seq = read_seqbegin(&xtime_lock);
+ getnstimeofday(ts);
+ tomono = wall_to_monotonic;
+ } while (read_seqretry(&xtime_lock, seq));
+
+
+ set_normalized_timespec(ts, ts->tv_sec + tomono.tv_sec,
+ ts->tv_nsec + tomono.tv_nsec);
+
+}
+
+static inline ktime_t do_get_ktime_mono(void)
+{
+ struct timespec now;
+
+ get_ktime_mono_ts(&now);
+ return timespec_to_ktime(now);
+}
+
+#define get_ktime_real_ts(ts) getnstimeofday(ts)
+static inline ktime_t do_get_ktime_real(void)
+{
+ struct timespec now;
+
+ getnstimeofday(&now);
+ return timespec_to_ktime(now);
+}
+
+#define clock_was_set() do { } while (0)
+extern void run_ktimer_queues(void);
+
+#endif
Index: linux-2.6.14-rc2-rt4/include/linux/posix-timers.h
===================================================================
--- linux-2.6.14-rc2-rt4.orig/include/linux/posix-timers.h
+++ linux-2.6.14-rc2-rt4/include/linux/posix-timers.h
@@ -51,10 +51,9 @@ struct k_itimer {
struct sigqueue *sigq; /* signal queue entry. */
union {
struct {
- struct timer_list timer;
- struct list_head abs_timer_entry; /* clock abs_timer_list */
- struct timespec wall_to_prev; /* wall_to_monotonic used when set */
- unsigned long incr; /* interval in jiffies */
+ struct ktimer timer;
+ ktime_t incr;
+ int overrun;
} real;
struct cpu_timer_list cpu;
struct {
@@ -66,10 +65,6 @@ struct k_itimer {
} it;
};

-struct k_clock_abs {
- struct list_head list;
- spinlock_t lock;
-};
struct k_clock {
int res; /* in nano seconds */
int (*clock_getres) (clockid_t which_clock, struct timespec *tp);
@@ -77,7 +72,7 @@ struct k_clock {
int (*clock_set) (clockid_t which_clock, struct timespec * tp);
int (*clock_get) (clockid_t which_clock, struct timespec * tp);
int (*timer_create) (struct k_itimer *timer);
- int (*nsleep) (clockid_t which_clock, int flags, struct timespec *);
+ int (*nsleep) (clockid_t which_clock, int flags, struct timespec *, struct timespec __user *);
int (*timer_set) (struct k_itimer * timr, int flags,
struct itimerspec * new_setting,
struct itimerspec * old_setting);
@@ -91,37 +86,104 @@ void register_posix_clock(clockid_t cloc

/* Error handlers for timer_create, nanosleep and settime */
int do_posix_clock_notimer_create(struct k_itimer *timer);
-int do_posix_clock_nonanosleep(clockid_t, int flags, struct timespec *);
+int do_posix_clock_nonanosleep(clockid_t, int flags, struct timespec *, struct timespec __user *);
int do_posix_clock_nosettime(clockid_t, struct timespec *tp);

/* function to call to trigger timer event */
int posix_timer_event(struct k_itimer *timr, int si_private);

-struct now_struct {
- unsigned long jiffies;
-};
-
-#define posix_get_now(now) (now)->jiffies = jiffies;
-#define posix_time_before(timer, now) \
- time_before((timer)->expires, (now)->jiffies)
-
-#define posix_bump_timer(timr, now) \
- do { \
- long delta, orun; \
- delta = now.jiffies - (timr)->it.real.timer.expires; \
- if (delta >= 0) { \
- orun = 1 + (delta / (timr)->it.real.incr); \
- (timr)->it.real.timer.expires += \
- orun * (timr)->it.real.incr; \
- (timr)->it_overrun += orun; \
- } \
- }while (0)
+#if (BITS_PER_LONG < 64)
+static inline ktime_t forward_posix_timer(struct k_itimer *t, ktime_t now)
+{
+ ktime_t delta = ktime_sub(now, t->it.real.timer.expires);
+ unsigned long orun = 1;
+
+ if (ktime_cmp_val(delta, <, KTIME_ZERO))
+ goto out;
+
+ if (unlikely(ktime_cmp(delta, >, t->it.real.incr))) {
+
+ int sft = 0;
+ u64 div, dclc, inc, dns;
+
+ dclc = dns = ktime_to_ns(delta);
+ div = inc = ktime_to_ns(t->it.real.incr);
+ /* Make sure the divisor is less than 2^32 */
+ while(div >> 32) {
+ sft++;
+ div >>= 1;
+ }
+ dclc >>= sft;
+ do_div(dclc, (unsigned long) div);
+ orun = (unsigned long) dclc;
+ if (likely(!(inc >> 32)))
+ dclc *= (unsigned long) inc;
+ else
+ dclc *= inc;
+ t->it.real.timer.expires = ktime_add_ns(t->it.real.timer.expires,
+ dclc);
+ } else {
+ t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
+ t->it.real.incr);
+ }
+ /*
+ * Here is the correction for exact. Also covers delta == incr
+ * which is the else clause above.
+ */
+ if (ktime_cmp(t->it.real.timer.expires, <=, now)) {
+ t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
+ t->it.real.incr);
+ orun++;
+ }
+ t->it_overrun += orun;
+
+ out:
+ return ktime_sub(t->it.real.timer.expires, now);
+}
+#else
+static inline ktime_t forward_posix_timer(struct k_itimer *t, ktime_t now)
+{
+ ktime_t delta = ktime_sub(now, t->it.real.timer.expires);
+ unsigned long orun = 1;
+
+ if (ktime_cmp_val(delta, <, KTIME_ZERO))
+ goto out;
+
+ if (unlikely(ktime_cmp(delta, >, t->it.real.incr))) {
+
+ u64 dns, inc;
+
+ dns = ktime_to_ns(delta);
+ inc = ktime_to_ns(t->it.real.incr);
+
+ orun = dns / inc;
+ t->it.real.timer.expires = ktime_add_ns(t->it.real.timer.expires,
+ orun * inc);
+ } else {
+ t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
+ t->it.real.incr);
+ }
+ /*
+ * Here is the correction for exact. Also covers delta == incr
+ * which is the else clause above.
+ */
+ if (ktime_cmp(t->it.real.timer.expires, <=, now)) {
+ t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
+ t->it.real.incr);
+ orun++;
+ }
+ t->it_overrun += orun;
+ out:
+ return ktime_sub(t->it.real.timer.expires, now);
+}
+#endif

int posix_cpu_clock_getres(clockid_t which_clock, struct timespec *);
int posix_cpu_clock_get(clockid_t which_clock, struct timespec *);
int posix_cpu_clock_set(clockid_t which_clock, const struct timespec *tp);
int posix_cpu_timer_create(struct k_itimer *);
-int posix_cpu_nsleep(clockid_t, int, struct timespec *);
+int posix_cpu_nsleep(clockid_t, int, struct timespec *,
+ struct timespec __user *);
int posix_cpu_timer_set(struct k_itimer *, int,
struct itimerspec *, struct itimerspec *);
int posix_cpu_timer_del(struct k_itimer *);
Index: linux-2.6.14-rc2-rt4/include/linux/sched.h
===================================================================
--- linux-2.6.14-rc2-rt4.orig/include/linux/sched.h
+++ linux-2.6.14-rc2-rt4/include/linux/sched.h
@@ -104,6 +104,7 @@ extern unsigned long nr_iowait(void);
#include <linux/param.h>
#include <linux/resource.h>
#include <linux/timer.h>
+#include <linux/ktimer.h>

#include <asm/processor.h>

@@ -346,8 +347,7 @@ struct signal_struct {
struct list_head posix_timers;

/* ITIMER_REAL timer for the process */
- struct timer_list real_timer;
- unsigned long it_real_value, it_real_incr;
+ struct ktimer real_timer;

/* ITIMER_PROF and ITIMER_VIRTUAL timers for the process */
cputime_t it_prof_expires, it_virt_expires;
Index: linux-2.6.14-rc2-rt4/include/linux/timer.h
===================================================================
--- linux-2.6.14-rc2-rt4.orig/include/linux/timer.h
+++ linux-2.6.14-rc2-rt4/include/linux/timer.h
@@ -91,6 +91,6 @@ static inline void add_timer(struct time

extern void init_timers(void);
extern void run_local_timers(void);
-extern void it_real_fn(unsigned long);
+extern void it_real_fn(void *);

#endif
Index: linux-2.6.14-rc2-rt4/init/main.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/init/main.c
+++ linux-2.6.14-rc2-rt4/init/main.c
@@ -485,6 +485,7 @@ asmlinkage void __init start_kernel(void
init_IRQ();
pidhash_init();
init_timers();
+ init_ktimers();
softirq_init();
time_init();

Index: linux-2.6.14-rc2-rt4/kernel/Makefile
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/Makefile
+++ linux-2.6.14-rc2-rt4/kernel/Makefile
@@ -7,7 +7,8 @@ obj-y = sched.o fork.o exec_domain.o
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o intermodule.o extable.o params.o posix-timers.o \
- kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o
+ kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
+ ktimers.o

obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
Index: linux-2.6.14-rc2-rt4/kernel/exit.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/exit.c
+++ linux-2.6.14-rc2-rt4/kernel/exit.c
@@ -842,7 +842,7 @@ fastcall NORET_TYPE void do_exit(long co
update_mem_hiwater(tsk);
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
- del_timer_sync(&tsk->signal->real_timer);
+ stop_ktimer(&tsk->signal->real_timer);
acct_process(code);
}
exit_mm(tsk);
Index: linux-2.6.14-rc2-rt4/kernel/fork.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/fork.c
+++ linux-2.6.14-rc2-rt4/kernel/fork.c
@@ -804,10 +804,9 @@ static inline int copy_signal(unsigned l
init_sigpending(&sig->shared_pending);
INIT_LIST_HEAD(&sig->posix_timers);

- sig->it_real_value = sig->it_real_incr = 0;
+ init_ktimer_mono(&sig->real_timer);
sig->real_timer.function = it_real_fn;
- sig->real_timer.data = (unsigned long) tsk;
- init_timer(&sig->real_timer);
+ sig->real_timer.data = tsk;

sig->it_virt_expires = cputime_zero;
sig->it_virt_incr = cputime_zero;
Index: linux-2.6.14-rc2-rt4/kernel/itimer.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/itimer.c
+++ linux-2.6.14-rc2-rt4/kernel/itimer.c
@@ -12,36 +12,22 @@
#include <linux/syscalls.h>
#include <linux/time.h>
#include <linux/posix-timers.h>
+#include <linux/ktimer.h>

#include <asm/uaccess.h>

-static unsigned long it_real_value(struct signal_struct *sig)
-{
- unsigned long val = 0;
- if (timer_pending(&sig->real_timer)) {
- val = sig->real_timer.expires - jiffies;
-
- /* look out for negative/zero itimer.. */
- if ((long) val <= 0)
- val = 1;
- }
- return val;
-}
-
int do_getitimer(int which, struct itimerval *value)
{
struct task_struct *tsk = current;
- unsigned long interval, val;
+ ktime_t interval, val;
cputime_t cinterval, cval;

switch (which) {
case ITIMER_REAL:
- spin_lock_irq(&tsk->sighand->siglock);
- interval = tsk->signal->it_real_incr;
- val = it_real_value(tsk->signal);
- spin_unlock_irq(&tsk->sighand->siglock);
- jiffies_to_timeval(val, &value->it_value);
- jiffies_to_timeval(interval, &value->it_interval);
+ interval = tsk->signal->real_timer.interval;
+ val = get_remtime_ktimer(&tsk->signal->real_timer, NSEC_PER_USEC);
+ value->it_value = ktime_to_timeval(val);
+ value->it_interval = ktime_to_timeval(interval);
break;
case ITIMER_VIRTUAL:
read_lock(&tasklist_lock);
@@ -113,59 +99,35 @@ asmlinkage long sys_getitimer(int which,
}


-void it_real_fn(unsigned long __data)
+/*
+ * The timer is automagically restarted, when interval != 0
+ */
+void it_real_fn(void *data)
{
- struct task_struct * p = (struct task_struct *) __data;
- unsigned long inc = p->signal->it_real_incr;
-
- send_group_sig_info(SIGALRM, SEND_SIG_PRIV, p);
-
- /*
- * Now restart the timer if necessary. We don't need any locking
- * here because do_setitimer makes sure we have finished running
- * before it touches anything.
- * Note, we KNOW we are (or should be) at a jiffie edge here so
- * we don't need the +1 stuff. Also, we want to use the prior
- * expire value so as to not "slip" a jiffie if we are late.
- * Deal with requesting a time prior to "now" here rather than
- * in add_timer.
- */
- if (!inc)
- return;
- while (time_before_eq(p->signal->real_timer.expires, jiffies))
- p->signal->real_timer.expires += inc;
- add_timer(&p->signal->real_timer);
+ send_group_sig_info(SIGALRM, SEND_SIG_PRIV, data);
}

int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue)
{
struct task_struct *tsk = current;
- unsigned long val, interval, expires;
+ struct ktimer *timer;
+ ktime_t expires;
cputime_t cval, cinterval, nval, ninterval;

switch (which) {
case ITIMER_REAL:
-again:
- spin_lock_irq(&tsk->sighand->siglock);
- interval = tsk->signal->it_real_incr;
- val = it_real_value(tsk->signal);
- /* We are sharing ->siglock with it_real_fn() */
- if (try_to_del_timer_sync(&tsk->signal->real_timer) < 0) {
- spin_unlock_irq(&tsk->sighand->siglock);
- goto again;
- }
- tsk->signal->it_real_incr =
- timeval_to_jiffies(&value->it_interval);
- expires = timeval_to_jiffies(&value->it_value);
- if (expires)
- mod_timer(&tsk->signal->real_timer,
- jiffies + 1 + expires);
- spin_unlock_irq(&tsk->sighand->siglock);
+ timer = &tsk->signal->real_timer;
+ stop_ktimer(timer);
if (ovalue) {
- jiffies_to_timeval(val, &ovalue->it_value);
- jiffies_to_timeval(interval,
- &ovalue->it_interval);
- }
+ ovalue->it_value = ktime_to_timeval(
+ get_remtime_ktimer(timer, NSEC_PER_USEC));
+ ovalue->it_interval = ktime_to_timeval(timer->interval);
+ }
+ timer->interval = ktimer_convert_timeval(timer, &value->it_interval);
+ expires = ktimer_convert_timeval(timer, &value->it_value);
+ if (ktime_cmp_val(expires, != , KTIME_ZERO))
+ modify_ktimer(timer, &expires, KTIMER_REL | KTIMER_NOCHECK);
+
break;
case ITIMER_VIRTUAL:
nval = timeval_to_cputime(&value->it_value);
Index: linux-2.6.14-rc2-rt4/kernel/ktimers.c
===================================================================
--- /dev/null
+++ linux-2.6.14-rc2-rt4/kernel/ktimers.c
@@ -0,0 +1,824 @@
+/*
+ * linux/kernel/ktimers.c
+ *
+ * Copyright(C) 2005 Thomas Gleixner <[email protected]>
+ *
+ * Kudos to Ingo Molnar for review, criticism, ideas
+ *
+ * Credits:
+ * Lot of ideas and implementation details taken from
+ * timer.c and related code
+ *
+ * Kernel timers
+ *
+ * In contrast to the timeout related API found in kernel/timer.c,
+ * ktimers provide finer resolution and accuracy depending on system
+ * configuration and capabilities.
+ *
+ * These timers are used for
+ * - itimers
+ * - posixtimers
+ * - nanosleep
+ * - precise in kernel timing
+ *
+ * Please do not abuse this API for simple timeouts.
+ *
+ * For licencing details see kernel-base/COPYING
+ *
+ */
+
+#include <linux/cpu.h>
+#include <linux/interrupt.h>
+#include <linux/ktimer.h>
+#include <linux/module.h>
+#include <linux/notifier.h>
+#include <linux/percpu.h>
+#include <linux/syscalls.h>
+
+#include <asm/uaccess.h>
+
+static ktime_t get_ktime_mono(void);
+static ktime_t get_ktime_real(void);
+
+/* The time bases */
+#define MAX_KTIMER_BASES 2
+static DEFINE_PER_CPU(struct ktimer_base, ktimer_bases[MAX_KTIMER_BASES]) =
+{
+ {
+ .index = CLOCK_REALTIME,
+ .name = "Realtime",
+ .get_time = &get_ktime_real,
+ .resolution = KTIME_REALTIME_RES,
+ },
+ {
+ .index = CLOCK_MONOTONIC,
+ .name = "Monotonic",
+ .get_time = &get_ktime_mono,
+ .resolution = KTIME_MONOTONIC_RES,
+ },
+};
+
+/*
+ * The SMP/UP kludge goes here
+ */
+#if defined(CONFIG_SMP)
+
+#define set_running_timer(b,t) b->running_timer = t
+#define wake_up_timer_waiters(b) wake_up(&b->wait_for_running_timer)
+#define ktimer_base_can_change (1)
+/*
+ * Wait for a running timer
+ */
+void wait_for_ktimer(struct ktimer *timer)
+{
+ struct ktimer_base *base = timer->base;
+
+ if (base && base->running_timer == timer)
+ wait_event(base->wait_for_running_timer,
+ base->running_timer != timer);
+}
+
+/*
+ * We are using hashed locking: holding per_cpu(ktimer_bases)[n].lock
+ * means that all timers which are tied to this base via timer->base are
+ * locked, and the base itself is locked too.
+ *
+ * So __run_timers/migrate_timers can safely modify all timers which could
+ * be found on the lists/queues.
+ *
+ * When the timer's base is locked, and the timer removed from list, it is
+ * possible to set timer->base = NULL and drop the lock: the timer remains
+ * locked.
+ */
+static inline struct ktimer_base *lock_ktimer_base(struct ktimer *timer,
+ unsigned long *flags)
+{
+ struct ktimer_base *base;
+
+ for (;;) {
+ base = timer->base;
+ if (likely(base != NULL)) {
+ spin_lock_irqsave(&base->lock, *flags);
+ if (likely(base == timer->base))
+ return base;
+ /* The timer has migrated to another CPU */
+ spin_unlock_irqrestore(&base->lock, *flags);
+ }
+ cpu_relax();
+ }
+}
+
+static inline struct ktimer_base *switch_ktimer_base(struct ktimer *timer,
+ struct ktimer_base *base)
+{
+ int ktidx = base->index;
+ struct ktimer_base *new_base = &__get_cpu_var(ktimer_bases[ktidx]);
+
+ if (base != new_base) {
+ /*
+ * We are trying to schedule the timer on the local CPU.
+ * However we can't change timer's base while it is running,
+ * so we keep it on the same CPU. No hassle vs. reprogramming
+ * the event source in the high resolution case. The softirq
+ * code will take care of this when the timer function has
+ * completed. There is no conflict as we hold the lock until
+ * the timer is enqueued.
+ */
+ if (unlikely(base->running_timer == timer)) {
+ return base;
+ } else {
+ /* See the comment in lock_timer_base() */
+ timer->base = NULL;
+ spin_unlock(&base->lock);
+ spin_lock(&new_base->lock);
+ timer->base = new_base;
+ }
+ }
+ return new_base;
+}
+
+/*
+ * Get the timer base unlocked
+ *
+ * Take care of timer->base = NULL in switch_ktimer_base !
+ */
+static inline struct ktimer_base *get_ktimer_base_unlocked(struct ktimer *timer)
+{
+ struct ktimer_base *base;
+ while (!(base = timer->base));
+ return base;
+}
+#else
+
+#define set_running_timer(b,t) do {} while (0)
+#define wake_up_timer_waiters(b) do {} while (0)
+
+static inline struct ktimer_base *lock_ktimer_base(struct ktimer *timer,
+ unsigned long *flags)
+{
+ struct ktimer_base *base;
+
+ base = timer->base;
+ spin_lock_irqsave(&base->lock, *flags);
+ return base;
+}
+
+#define switch_ktimer_base(t, b) b
+
+#define get_ktimer_base_unlocked(t) (t)->base
+#define ktimer_base_can_change (0)
+
+#endif /* !CONFIG_SMP */
+
+/*
+ * Convert timespec to ktime_t with resolution adjustment
+ *
+ * Note: We can access base without locking here, as ktimers can
+ * migrate between CPUs but can not be moved from one clock source to
+ * another. The clock source binding is set at init_ktimer_XXX.
+ */
+ktime_t ktimer_convert_timespec(struct ktimer *timer, struct timespec *ts)
+{
+ struct ktimer_base *base = get_ktimer_base_unlocked(timer);
+ ktime_t t;
+ long rem = ts->tv_nsec % base->resolution;
+
+ t = ktime_set(ts->tv_sec, ts->tv_nsec);
+
+ /* Check, if the value has to be rounded */
+ if (rem)
+ t = ktime_add_ns(t, base->resolution - rem);
+ return t;
+}
+
+/*
+ * Convert timeval to ktime_t with resolution adjustment
+ */
+ktime_t ktimer_convert_timeval(struct ktimer *timer, struct timeval *tv)
+{
+ struct timespec ts;
+
+ ts.tv_sec = tv->tv_sec;
+ ts.tv_nsec = tv->tv_usec * NSEC_PER_USEC;
+
+ return ktimer_convert_timespec(timer, &ts);
+}
+
+/*
+ * Internal function to add (re)start a timer
+ *
+ * The timer is inserted in expiry order.
+ * Insertion into the red black tree is O(log(n))
+ *
+ */
+static int enqueue_ktimer(struct ktimer *timer, struct ktimer_base *base,
+ ktime_t *tim, int mode)
+{
+ struct rb_node **link = &base->active.rb_node;
+ struct rb_node *parent = NULL;
+ struct ktimer *entry;
+ struct list_head *prev = &base->pending;
+ ktime_t now;
+
+ /* Get current time */
+ now = base->get_time();
+
+ /* Timer expiry mode */
+ switch (mode & ~KTIMER_NOCHECK) {
+ case KTIMER_ABS:
+ timer->expires = *tim;
+ break;
+ case KTIMER_REL:
+ timer->expires = ktime_add(now, *tim);
+ break;
+ case KTIMER_INCR:
+ timer->expires = ktime_add(timer->expires, *tim);
+ break;
+ case KTIMER_FORWARD:
+ while ktime_cmp(timer->expires, <= , now) {
+ timer->expires = ktime_add(timer->expires, *tim);
+ timer->overrun++;
+ }
+ goto nocheck;
+ case KTIMER_REARM:
+ while ktime_cmp(timer->expires, <= , now) {
+ timer->expires = ktime_add(timer->expires, *tim);
+ timer->overrun++;
+ }
+ goto nocheck;
+ case KTIMER_RESTART:
+ break;
+ default:
+ BUG();
+ }
+
+ /* Already expired.*/
+ if ktime_cmp(timer->expires, <=, now) {
+ timer->expired = now;
+ /* The caller takes care of expiry */
+ if (!(mode & KTIMER_NOCHECK))
+ return -1;
+ }
+ nocheck:
+
+ while (*link) {
+ parent = *link;
+ entry = rb_entry(parent, struct ktimer, node);
+ /*
+ * We dont care about collisions. Nodes with
+ * the same expiry time stay together.
+ */
+ if (ktimer_before(timer, entry))
+ link = &(*link)->rb_left;
+ else {
+ link = &(*link)->rb_right;
+ prev = &entry->list;
+ }
+ }
+
+ rb_link_node(&timer->node, parent, link);
+ rb_insert_color(&timer->node, &base->active);
+ list_add(&timer->list, prev);
+ timer->status = KTIMER_PENDING;
+ base->count++;
+ return 0;
+}
+
+/*
+ * Internal helper to remove a timer
+ *
+ * The function allows automatic rearming for interval
+ * timers.
+ *
+ */
+static inline void do_remove_ktimer(struct ktimer *timer,
+ struct ktimer_base *base, int rearm)
+{
+ list_del(&timer->list);
+ rb_erase(&timer->node, &base->active);
+ timer->node.rb_parent = KTIMER_POISON;
+ timer->status = KTIMER_INACTIVE;
+ base->count--;
+ BUG_ON(base->count < 0);
+ /* Auto rearm the timer ? */
+ if (rearm && ktime_cmp_val(timer->interval, !=, KTIME_ZERO))
+ enqueue_ktimer(timer, base, NULL, KTIMER_REARM);
+}
+
+/*
+ * Called with base lock held
+ */
+static inline int remove_ktimer(struct ktimer *timer, struct ktimer_base *base)
+{
+ if (ktimer_active(timer)) {
+ do_remove_ktimer(timer, base, KTIMER_NOREARM);
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * Internal function to (re)start a timer.
+ */
+static int internal_restart_ktimer(struct ktimer *timer, ktime_t *tim,
+ int mode)
+{
+ struct ktimer_base *base, *new_base;
+ unsigned long flags;
+ int ret;
+
+ BUG_ON(!timer->function);
+
+ base = lock_ktimer_base(timer, &flags);
+
+ /* Remove an active timer from the queue */
+ ret = remove_ktimer(timer, base);
+
+ /* Switch the timer base, if necessary */
+ new_base = switch_ktimer_base(timer, base);
+
+ /*
+ * When the new timer setting is already expired,
+ * let the calling code deal with it.
+ */
+ if (enqueue_ktimer(timer, new_base, tim, mode))
+ ret = -1;
+
+ spin_unlock_irqrestore(&new_base->lock, flags);
+ return ret;
+}
+
+/***
+ * modify_ktimer - modify a running timer
+ * @timer: the timer to be modified
+ * @tim: expiry time (required)
+ * @mode: timer setup mode
+ *
+ */
+int modify_ktimer(struct ktimer *timer, ktime_t *tim, int mode)
+{
+ BUG_ON(!tim || !timer->function);
+ return internal_restart_ktimer(timer, tim, mode);
+}
+
+/***
+ * start_ktimer - start a timer on current CPU
+ * @timer: the timer to be added
+ * @tim: expiry time (optional, if not set in the timer)
+ * @mode: timer setup mode
+ */
+int start_ktimer(struct ktimer *timer, ktime_t *tim, int mode)
+{
+ BUG_ON(ktimer_active(timer) || !timer->function);
+
+ return internal_restart_ktimer(timer, tim, mode);
+}
+
+/***
+ * try_to_stop_ktimer - try to deactivate a timer
+ */
+int try_to_stop_ktimer(struct ktimer *timer)
+{
+ struct ktimer_base *base;
+ unsigned long flags;
+ int ret = -1;
+
+ base = lock_ktimer_base(timer, &flags);
+
+ if (base->running_timer != timer) {
+ ret = remove_ktimer(timer, base);
+ if (ret)
+ timer->expired = base->get_time();
+ }
+
+ spin_unlock_irqrestore(&base->lock, flags);
+
+ return ret;
+
+}
+
+/***
+ * stop_ktimer - deactivate a timer and wait for the handler to finish.
+ * @timer: the timer to be deactivated
+ *
+ */
+int stop_ktimer(struct ktimer *timer)
+{
+ for (;;) {
+ int ret = try_to_stop_ktimer(timer);
+ if (ret >= 0)
+ return ret;
+ wait_for_ktimer(timer);
+ }
+}
+
+/***
+ * get_remtime_ktimer - get remaining time for the timer
+ * @timer: the timer to read
+ * @fake: when fake > 0 a pending, but expired timer
+ * returns fake (itimers need this, uurg)
+ */
+ktime_t get_remtime_ktimer(struct ktimer *timer, long fake)
+{
+ struct ktimer_base *base;
+ unsigned long flags;
+ ktime_t rem;
+
+ base = lock_ktimer_base(timer, &flags);
+ if (ktimer_active(timer)) {
+ rem = ktime_sub(timer->expires,base->get_time());
+ if (fake && ktime_cmp_val(rem, <=, KTIME_ZERO))
+ rem = ktime_set(0, fake);
+ } else {
+ if (!fake)
+ rem = ktime_sub(timer->expires,base->get_time());
+ else
+ ktime_set_zero(rem);
+ }
+ spin_unlock_irqrestore(&base->lock, flags);
+ return rem;
+}
+
+/***
+ * get_expiry_ktimer - get expiry time for the timer
+ * @timer: the timer to read
+ * @now: if != NULL store current base->time
+ */
+ktime_t get_expiry_ktimer(struct ktimer *timer, ktime_t *now)
+{
+ struct ktimer_base *base;
+ unsigned long flags;
+ ktime_t expiry;
+
+ base = lock_ktimer_base(timer, &flags);
+ expiry = timer->expires;
+ if (now)
+ *now = base->get_time();
+ spin_unlock_irqrestore(&base->lock, flags);
+ return expiry;
+}
+
+/*
+ * Functions related to clock sources
+ */
+
+static inline void ktimer_common_init(struct ktimer *timer)
+{
+ memset(timer, 0, sizeof(struct ktimer));
+ timer->node.rb_parent = KTIMER_POISON;
+}
+
+/*
+ * Get monotonic time
+ */
+static ktime_t get_ktime_mono(void)
+{
+ return do_get_ktime_mono();
+}
+
+/***
+ * init_ktimer_mono - initialize a timer on monotonic time
+ * @timer: the timer to be initialized
+ *
+ */
+void fastcall init_ktimer_mono(struct ktimer *timer)
+{
+ ktimer_common_init(timer);
+ timer->base =
+ &per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_MONOTONIC];
+}
+
+/***
+ * get_ktimer_mono_res - get the monotonic timer resolution
+ *
+ */
+int get_ktimer_mono_res(clockid_t which_clock, struct timespec *tp)
+{
+ tp->tv_sec = 0;
+ tp->tv_nsec =
+ per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_MONOTONIC].resolution;
+ return 0;
+}
+
+/*
+ * Get real time
+ */
+static ktime_t get_ktime_real(void)
+{
+ return do_get_ktime_real();
+}
+
+/***
+ * init_ktimer_real - initialize a timer on real time
+ * @timer: the timer to be initialized
+ *
+ */
+void fastcall init_ktimer_real(struct ktimer *timer)
+{
+ ktimer_common_init(timer);
+ timer->base =
+ &per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_REALTIME];
+}
+
+/***
+ * get_ktimer_real_res - get the real timer resolution
+ *
+ */
+int get_ktimer_real_res(clockid_t which_clock, struct timespec *tp)
+{
+ tp->tv_sec = 0;
+ tp->tv_nsec =
+ per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_REALTIME].resolution;
+ return 0;
+}
+
+/*
+ * The per base runqueue
+ */
+static inline void run_ktimer_queue(struct ktimer_base *base)
+{
+ ktime_t now = base->get_time();
+
+ spin_lock_irq(&base->lock);
+ while (!list_empty(&base->pending)) {
+ void (*fn)(void *);
+ void *data;
+ struct ktimer *timer = list_entry(base->pending.next,
+ struct ktimer, list);
+ if ktime_cmp(now, <=, timer->expires)
+ break;
+ timer->expired = now;
+ fn = timer->function;
+ data = timer->data;
+ set_running_timer(base, timer);
+ do_remove_ktimer(timer, base, KTIMER_REARM);
+ spin_unlock_irq(&base->lock);
+ fn(data);
+ spin_lock_irq(&base->lock);
+ set_running_timer(base, NULL);
+ }
+ spin_unlock_irq(&base->lock);
+ wake_up_timer_waiters(base);
+}
+
+/*
+ * Called from timer softirq every jiffy
+ */
+void run_ktimer_queues(void)
+{
+ struct ktimer_base *base = __get_cpu_var(ktimer_bases);
+ int i;
+
+ for (i = 0; i < MAX_KTIMER_BASES; i++)
+ run_ktimer_queue(&base[i]);
+}
+
+/*
+ * Functions related to initialization
+ */
+static void __devinit init_ktimers_cpu(int cpu)
+{
+ struct ktimer_base *base = per_cpu(ktimer_bases, cpu);
+ int i;
+
+ for (i = 0; i < MAX_KTIMER_BASES; i++) {
+ spin_lock_init(&base->lock);
+ INIT_LIST_HEAD(&base->pending);
+ init_waitqueue_head(&base->wait_for_running_timer);
+ base++;
+ }
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void migrate_ktimer_list(struct ktimer_base *old_base,
+ struct ktimer_base *new_base)
+{
+ struct ktimer *timer;
+ struct rb_node *node;
+
+ while ((node = rb_first(&old_base->active))) {
+ timer = rb_entry(node, struct ktimer, node);
+ remove_ktimer(timer, old_base);
+ timer->base = new_base;
+ enqueue_ktimer(timer, new_base, NULL, KTIMER_RESTART);
+ }
+}
+
+static void __devinit migrate_ktimers(int cpu)
+{
+ struct ktimer_base *old_base;
+ struct ktimer_base *new_base;
+ int i;
+
+ BUG_ON(cpu_online(cpu));
+ old_base = per_cpu(ktimer_bases, cpu);
+ new_base = get_cpu_var(ktimer_bases);
+
+ local_irq_disable();
+
+ for (i = 0; i < MAX_KTIMER_BASES; i++) {
+
+ spin_lock(&new_base->lock);
+ spin_lock(&old_base->lock);
+
+ if (old_base->running_timer)
+ BUG();
+
+ migrate_ktimer_list(old_base, new_base);
+
+ spin_unlock(&old_base->lock);
+ spin_unlock(&new_base->lock);
+ old_base++;
+ new_base++;
+ }
+
+ local_irq_enable();
+ put_cpu_var(ktimer_bases);
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
+static int __devinit ktimer_cpu_notify(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ long cpu = (long)hcpu;
+ switch(action) {
+ case CPU_UP_PREPARE:
+ init_ktimers_cpu(cpu);
+ break;
+#ifdef CONFIG_HOTPLUG_CPU
+ case CPU_DEAD:
+ migrate_ktimers(cpu);
+ break;
+#endif
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __devinitdata ktimers_nb = {
+ .notifier_call = ktimer_cpu_notify,
+};
+
+void __init init_ktimers(void)
+{
+ ktimer_cpu_notify(&ktimers_nb, (unsigned long)CPU_UP_PREPARE,
+ (void *)(long)smp_processor_id());
+ register_cpu_notifier(&ktimers_nb);
+}
+
+/*
+ * system interface related functions
+ */
+static void process_ktimer(void *data)
+{
+ wake_up_process(data);
+}
+
+/**
+ * schedule_ktimer - sleep until timeout
+ * @timeout: timeout value
+ * @state: state to use for sleep
+ * @rel: timeout value is abs/rel
+ *
+ * Make the current task sleep until @timeout is
+ * elapsed.
+ *
+ * You can set the task state as follows -
+ *
+ * %TASK_UNINTERRUPTIBLE - at least @timeout is guaranteed to
+ * pass before the routine returns. The routine will return 0
+ *
+ * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
+ * delivered to the current task. In this case the remaining time
+ * will be returned
+ *
+ * The current task state is guaranteed to be TASK_RUNNING when this
+ * routine returns.
+ *
+ */
+static fastcall ktime_t __sched schedule_ktimer(struct ktimer *timer,
+ ktime_t *t, int state, int mode)
+{
+ timer->data = current;
+ timer->function = process_ktimer;
+
+ current->state = state;
+ if (start_ktimer(timer, t, mode)) {
+ current->state = TASK_RUNNING;
+ goto out;
+ }
+ if (current->state != TASK_RUNNING)
+ schedule();
+ stop_ktimer(timer);
+ out:
+ /* Store the absolute expiry time */
+ *t = timer->expires;
+ /* Return the remaining time */
+ return ktime_sub(timer->expires, timer->expired);
+}
+
+static long __sched nanosleep_restart(struct ktimer *timer,
+ struct restart_block *restart)
+{
+ struct timespec tu;
+ ktime_t t, rem;
+ void *rfn = restart->fn;
+ struct timespec __user *rmtp = (struct timespec __user *) restart->arg2;
+
+ restart->fn = do_no_restart_syscall;
+
+ t = ktime_set_low_high(restart->arg0, restart->arg1);
+
+ rem = schedule_ktimer(timer, &t, TASK_INTERRUPTIBLE, KTIMER_ABS);
+
+ if (ktime_cmp_val(rem, <=, KTIME_ZERO))
+ return 0;
+
+ tu = ktime_to_timespec(rem);
+ if (rmtp && copy_to_user(rmtp, &tu, sizeof(tu)))
+ return -EFAULT;
+
+ restart->fn = rfn;
+ /* The other values in restart are already filled in */
+ return -ERESTART_RESTARTBLOCK;
+}
+
+static long __sched nanosleep_restart_mono(struct restart_block *restart)
+{
+ struct ktimer timer;
+
+ init_ktimer_mono(&timer);
+ return nanosleep_restart(&timer, restart);
+}
+
+static long __sched nanosleep_restart_real(struct restart_block *restart)
+{
+ struct ktimer timer;
+
+ init_ktimer_real(&timer);
+ return nanosleep_restart(&timer, restart);
+}
+
+static long ktimer_nanosleep(struct ktimer *timer, struct timespec *rqtp,
+ struct timespec __user *rmtp, int mode,
+ long (*rfn)(struct restart_block *))
+{
+ struct timespec tu;
+ ktime_t rem, t;
+ struct restart_block *restart;
+
+ t = ktimer_convert_timespec(timer, rqtp);
+
+ /* t is updated to absolute expiry time ! */
+ rem = schedule_ktimer(timer, &t, TASK_INTERRUPTIBLE, mode);
+
+ if (ktime_cmp_val(rem, <=, KTIME_ZERO))
+ return 0;
+
+ tu = ktime_to_timespec(rem);
+
+ if (rmtp && copy_to_user(rmtp, &tu, sizeof(tu)))
+ return -EFAULT;
+
+ restart = &current_thread_info()->restart_block;
+ restart->fn = rfn;
+ restart->arg0 = ktime_get_low(t);
+ restart->arg1 = ktime_get_high(t);
+ restart->arg2 = (unsigned long) rmtp;
+ return -ERESTART_RESTARTBLOCK;
+
+}
+
+long ktimer_nanosleep_mono(struct timespec *rqtp,
+ struct timespec __user *rmtp, int mode)
+{
+ struct ktimer timer;
+
+ init_ktimer_mono(&timer);
+ return ktimer_nanosleep(&timer, rqtp, rmtp, mode, nanosleep_restart_mono);
+}
+
+long ktimer_nanosleep_real(struct timespec *rqtp,
+ struct timespec __user *rmtp, int mode)
+{
+ struct ktimer timer;
+
+ init_ktimer_real(&timer);
+ return ktimer_nanosleep(&timer, rqtp, rmtp, mode, nanosleep_restart_real);
+}
+
+asmlinkage long sys_nanosleep(struct timespec __user *rqtp,
+ struct timespec __user *rmtp)
+{
+ struct timespec tu;
+
+ if (copy_from_user(&tu, rqtp, sizeof(tu)))
+ return -EFAULT;
+
+ if (!timespec_valid(&tu))
+ return -EINVAL;
+
+ return ktimer_nanosleep_mono(&tu, rmtp, KTIMER_REL);
+}
+
Index: linux-2.6.14-rc2-rt4/kernel/posix-cpu-timers.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/posix-cpu-timers.c
+++ linux-2.6.14-rc2-rt4/kernel/posix-cpu-timers.c
@@ -1394,7 +1394,7 @@ void set_process_cpu_timer(struct task_s
static long posix_cpu_clock_nanosleep_restart(struct restart_block *);

int posix_cpu_nsleep(clockid_t which_clock, int flags,
- struct timespec *rqtp)
+ struct timespec *rqtp, struct timespec __user *rmtp)
{
struct restart_block *restart_block =
&current_thread_info()->restart_block;
@@ -1419,7 +1419,6 @@ int posix_cpu_nsleep(clockid_t which_clo
error = posix_cpu_timer_create(&timer);
timer.it_process = current;
if (!error) {
- struct timespec __user *rmtp;
static struct itimerspec zero_it;
struct itimerspec it = { .it_value = *rqtp,
.it_interval = {} };
@@ -1466,7 +1465,6 @@ int posix_cpu_nsleep(clockid_t which_clo
/*
* Report back to the user the time still remaining.
*/
- rmtp = (struct timespec __user *) restart_block->arg1;
if (rmtp != NULL && !(flags & TIMER_ABSTIME) &&
copy_to_user(rmtp, &it.it_value, sizeof *rmtp))
return -EFAULT;
@@ -1474,6 +1472,7 @@ int posix_cpu_nsleep(clockid_t which_clo
restart_block->fn = posix_cpu_clock_nanosleep_restart;
/* Caller already set restart_block->arg1 */
restart_block->arg0 = which_clock;
+ restart_block->arg1 = (unsigned long) rmtp;
restart_block->arg2 = rqtp->tv_sec;
restart_block->arg3 = rqtp->tv_nsec;

@@ -1487,10 +1486,15 @@ static long
posix_cpu_clock_nanosleep_restart(struct restart_block *restart_block)
{
clockid_t which_clock = restart_block->arg0;
- struct timespec t = { .tv_sec = restart_block->arg2,
- .tv_nsec = restart_block->arg3 };
+ struct timespec __user *rmtp;
+ struct timespec t;
+
+ rmtp = (struct timespec __user *) restart_block->arg1;
+ t.tv_sec = restart_block->arg2;
+ t.tv_nsec = restart_block->arg3;
+
restart_block->fn = do_no_restart_syscall;
- return posix_cpu_nsleep(which_clock, TIMER_ABSTIME, &t);
+ return posix_cpu_nsleep(which_clock, TIMER_ABSTIME, &t, rmtp);
}


@@ -1511,9 +1515,10 @@ static int process_cpu_timer_create(stru
return posix_cpu_timer_create(timer);
}
static int process_cpu_nsleep(clockid_t which_clock, int flags,
- struct timespec *rqtp)
+ struct timespec *rqtp,
+ struct timespec __user *rmtp)
{
- return posix_cpu_nsleep(PROCESS_CLOCK, flags, rqtp);
+ return posix_cpu_nsleep(PROCESS_CLOCK, flags, rqtp, rmtp);
}
static int thread_cpu_clock_getres(clockid_t which_clock, struct timespec *tp)
{
@@ -1529,7 +1534,7 @@ static int thread_cpu_timer_create(struc
return posix_cpu_timer_create(timer);
}
static int thread_cpu_nsleep(clockid_t which_clock, int flags,
- struct timespec *rqtp)
+ struct timespec *rqtp, struct timespec __user *rmtp)
{
return -EINVAL;
}
Index: linux-2.6.14-rc2-rt4/kernel/posix-timers.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/posix-timers.c
+++ linux-2.6.14-rc2-rt4/kernel/posix-timers.c
@@ -48,21 +48,6 @@
#include <linux/workqueue.h>
#include <linux/module.h>

-#ifndef div_long_long_rem
-#include <asm/div64.h>
-
-#define div_long_long_rem(dividend,divisor,remainder) ({ \
- u64 result = dividend; \
- *remainder = do_div(result,divisor); \
- result; })
-
-#endif
-#define CLOCK_REALTIME_RES TICK_NSEC /* In nano seconds. */
-
-static inline u64 mpy_l_X_l_ll(unsigned long mpy1,unsigned long mpy2)
-{
- return (u64)mpy1 * mpy2;
-}
/*
* Management arrays for POSIX timers. Timers are kept in slab memory
* Timer ids are allocated by an external routine that keeps track of the
@@ -148,18 +133,18 @@ static DEFINE_SPINLOCK(idr_lock);
*/

static struct k_clock posix_clocks[MAX_CLOCKS];
+
/*
- * We only have one real clock that can be set so we need only one abs list,
- * even if we should want to have several clocks with differing resolutions.
+ * These ones are defined below.
*/
-static struct k_clock_abs abs_list = {.list = LIST_HEAD_INIT(abs_list.list),
- .lock = SPIN_LOCK_UNLOCKED};
+static int common_nsleep(clockid_t, int flags, struct timespec *t,
+ struct timespec __user *rmtp);
+static void common_timer_get(struct k_itimer *, struct itimerspec *);
+static int common_timer_set(struct k_itimer *, int,
+ struct itimerspec *, struct itimerspec *);
+static int common_timer_del(struct k_itimer *timer);

-static void posix_timer_fn(unsigned long);
-static u64 do_posix_clock_monotonic_gettime_parts(
- struct timespec *tp, struct timespec *mo);
-int do_posix_clock_monotonic_gettime(struct timespec *tp);
-static int do_posix_clock_monotonic_get(clockid_t, struct timespec *tp);
+static void posix_timer_fn(void *data);

static struct k_itimer *lock_timer(timer_t timer_id, unsigned long *flags);

@@ -205,21 +190,25 @@ static inline int common_clock_set(clock

static inline int common_timer_create(struct k_itimer *new_timer)
{
- INIT_LIST_HEAD(&new_timer->it.real.abs_timer_entry);
- init_timer(&new_timer->it.real.timer);
- new_timer->it.real.timer.data = (unsigned long) new_timer;
+ return -EINVAL;
+}
+
+static int timer_create_mono(struct k_itimer *new_timer)
+{
+ init_ktimer_mono(&new_timer->it.real.timer);
+ new_timer->it.real.timer.data = new_timer;
+ new_timer->it.real.timer.function = posix_timer_fn;
+ return 0;
+}
+
+static int timer_create_real(struct k_itimer *new_timer)
+{
+ init_ktimer_real(&new_timer->it.real.timer);
+ new_timer->it.real.timer.data = new_timer;
new_timer->it.real.timer.function = posix_timer_fn;
return 0;
}

-/*
- * These ones are defined below.
- */
-static int common_nsleep(clockid_t, int flags, struct timespec *t);
-static void common_timer_get(struct k_itimer *, struct itimerspec *);
-static int common_timer_set(struct k_itimer *, int,
- struct itimerspec *, struct itimerspec *);
-static int common_timer_del(struct k_itimer *timer);

/*
* Return nonzero iff we know a priori this clockid_t value is bogus.
@@ -239,19 +228,44 @@ static inline int invalid_clockid(clocki
return 1;
}

+/*
+ * Get real time for posix timers
+ */
+static int posix_get_ktime_real_ts(clockid_t which_clock, struct timespec *tp)
+{
+ get_ktime_real_ts(tp);
+ return 0;
+}
+
+/*
+ * Get monotonic time for posix timers
+ */
+static int posix_get_ktime_mono_ts(clockid_t which_clock, struct timespec *tp)
+{
+ get_ktime_mono_ts(tp);
+ return 0;
+}
+
+void do_posix_clock_monotonic_gettime(struct timespec *ts)
+{
+ get_ktime_mono_ts(ts);
+}

/*
* Initialize everything, well, just everything in Posix clocks/timers ;)
*/
static __init int init_posix_timers(void)
{
- struct k_clock clock_realtime = {.res = CLOCK_REALTIME_RES,
- .abs_struct = &abs_list
+ struct k_clock clock_realtime = {
+ .clock_getres = get_ktimer_real_res,
+ .clock_get = posix_get_ktime_real_ts,
+ .timer_create = timer_create_real,
};
- struct k_clock clock_monotonic = {.res = CLOCK_REALTIME_RES,
- .abs_struct = NULL,
- .clock_get = do_posix_clock_monotonic_get,
- .clock_set = do_posix_clock_nosettime
+ struct k_clock clock_monotonic = {
+ .clock_getres = get_ktimer_mono_res,
+ .clock_get = posix_get_ktime_mono_ts,
+ .clock_set = do_posix_clock_nosettime,
+ .timer_create = timer_create_mono,
};

register_posix_clock(CLOCK_REALTIME, &clock_realtime);
@@ -265,117 +279,17 @@ static __init int init_posix_timers(void

__initcall(init_posix_timers);

-static void tstojiffie(struct timespec *tp, int res, u64 *jiff)
-{
- long sec = tp->tv_sec;
- long nsec = tp->tv_nsec + res - 1;
-
- if (nsec > NSEC_PER_SEC) {
- sec++;
- nsec -= NSEC_PER_SEC;
- }
-
- /*
- * The scaling constants are defined in <linux/time.h>
- * The difference between there and here is that we do the
- * res rounding and compute a 64-bit result (well so does that
- * but it then throws away the high bits).
- */
- *jiff = (mpy_l_X_l_ll(sec, SEC_CONVERSION) +
- (mpy_l_X_l_ll(nsec, NSEC_CONVERSION) >>
- (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
-}
-
-/*
- * This function adjusts the timer as needed as a result of the clock
- * being set. It should only be called for absolute timers, and then
- * under the abs_list lock. It computes the time difference and sets
- * the new jiffies value in the timer. It also updates the timers
- * reference wall_to_monotonic value. It is complicated by the fact
- * that tstojiffies() only handles positive times and it needs to work
- * with both positive and negative times. Also, for negative offsets,
- * we need to defeat the res round up.
- *
- * Return is true if there is a new time, else false.
- */
-static long add_clockset_delta(struct k_itimer *timr,
- struct timespec *new_wall_to)
-{
- struct timespec delta;
- int sign = 0;
- u64 exp;
-
- set_normalized_timespec(&delta,
- new_wall_to->tv_sec -
- timr->it.real.wall_to_prev.tv_sec,
- new_wall_to->tv_nsec -
- timr->it.real.wall_to_prev.tv_nsec);
- if (likely(!(delta.tv_sec | delta.tv_nsec)))
- return 0;
- if (delta.tv_sec < 0) {
- set_normalized_timespec(&delta,
- -delta.tv_sec,
- 1 - delta.tv_nsec -
- posix_clocks[timr->it_clock].res);
- sign++;
- }
- tstojiffie(&delta, posix_clocks[timr->it_clock].res, &exp);
- timr->it.real.wall_to_prev = *new_wall_to;
- timr->it.real.timer.expires += (sign ? -exp : exp);
- return 1;
-}
-
-static void remove_from_abslist(struct k_itimer *timr)
-{
- if (!list_empty(&timr->it.real.abs_timer_entry)) {
- spin_lock(&abs_list.lock);
- list_del_init(&timr->it.real.abs_timer_entry);
- spin_unlock(&abs_list.lock);
- }
-}

static void schedule_next_timer(struct k_itimer *timr)
{
- struct timespec new_wall_to;
- struct now_struct now;
- unsigned long seq;
-
- /*
- * Set up the timer for the next interval (if there is one).
- * Note: this code uses the abs_timer_lock to protect
- * it.real.wall_to_prev and must hold it until exp is set, not exactly
- * obvious...
-
- * This function is used for CLOCK_REALTIME* and
- * CLOCK_MONOTONIC* timers. If we ever want to handle other
- * CLOCKs, the calling code (do_schedule_next_timer) would need
- * to pull the "clock" info from the timer and dispatch the
- * "other" CLOCKs "next timer" code (which, I suppose should
- * also be added to the k_clock structure).
- */
- if (!timr->it.real.incr)
+ if (ktime_cmp_val(timr->it.real.incr, ==, KTIME_ZERO))
return;

- do {
- seq = read_seqbegin(&xtime_lock);
- new_wall_to = wall_to_monotonic;
- posix_get_now(&now);
- } while (read_seqretry(&xtime_lock, seq));
-
- if (!list_empty(&timr->it.real.abs_timer_entry)) {
- spin_lock(&abs_list.lock);
- add_clockset_delta(timr, &new_wall_to);
-
- posix_bump_timer(timr, now);
-
- spin_unlock(&abs_list.lock);
- } else {
- posix_bump_timer(timr, now);
- }
- timr->it_overrun_last = timr->it_overrun;
- timr->it_overrun = -1;
+ timr->it_overrun_last = timr->it.real.overrun;
+ timr->it.real.overrun = timr->it.real.timer.overrun = -1;
++timr->it_requeue_pending;
- add_timer(&timr->it.real.timer);
+ start_ktimer(&timr->it.real.timer, &timr->it.real.incr, KTIMER_FORWARD);
+ timr->it.real.overrun = timr->it.real.timer.overrun;
}

/*
@@ -413,14 +327,7 @@ int posix_timer_event(struct k_itimer *t
{
memset(&timr->sigq->info, 0, sizeof(siginfo_t));
timr->sigq->info.si_sys_private = si_private;
- /*
- * Send signal to the process that owns this timer.
-
- * This code assumes that all the possible abs_lists share the
- * same lock (there is only one list at this time). If this is
- * not the case, the CLOCK info would need to be used to find
- * the proper abs list lock.
- */
+ /* Send signal to the process that owns this timer.*/

timr->sigq->info.si_signo = timr->it_sigev_signo;
timr->sigq->info.si_errno = 0;
@@ -454,65 +361,28 @@ EXPORT_SYMBOL_GPL(posix_timer_event);

* This code is for CLOCK_REALTIME* and CLOCK_MONOTONIC* timers.
*/
-static void posix_timer_fn(unsigned long __data)
+static void posix_timer_fn(void *data)
{
- struct k_itimer *timr = (struct k_itimer *) __data;
+ struct k_itimer *timr = data;
unsigned long flags;
- unsigned long seq;
- struct timespec delta, new_wall_to;
- u64 exp = 0;
- int do_notify = 1;
+ int si_private = 0;

spin_lock_irqsave(&timr->it_lock, flags);
- if (!list_empty(&timr->it.real.abs_timer_entry)) {
- spin_lock(&abs_list.lock);
- do {
- seq = read_seqbegin(&xtime_lock);
- new_wall_to = wall_to_monotonic;
- } while (read_seqretry(&xtime_lock, seq));
- set_normalized_timespec(&delta,
- new_wall_to.tv_sec -
- timr->it.real.wall_to_prev.tv_sec,
- new_wall_to.tv_nsec -
- timr->it.real.wall_to_prev.tv_nsec);
- if (likely((delta.tv_sec | delta.tv_nsec ) == 0)) {
- /* do nothing, timer is on time */
- } else if (delta.tv_sec < 0) {
- /* do nothing, timer is already late */
- } else {
- /* timer is early due to a clock set */
- tstojiffie(&delta,
- posix_clocks[timr->it_clock].res,
- &exp);
- timr->it.real.wall_to_prev = new_wall_to;
- timr->it.real.timer.expires += exp;
- add_timer(&timr->it.real.timer);
- do_notify = 0;
- }
- spin_unlock(&abs_list.lock);

- }
- if (do_notify) {
- int si_private=0;
+ if (ktime_cmp_val(timr->it.real.incr, !=, KTIME_ZERO))
+ si_private = ++timr->it_requeue_pending;

- if (timr->it.real.incr)
- si_private = ++timr->it_requeue_pending;
- else {
- remove_from_abslist(timr);
- }
+ if (posix_timer_event(timr, si_private))
+ /*
+ * signal was not sent because of sig_ignor
+ * we will not get a call back to restart it AND
+ * it should be restarted.
+ */
+ schedule_next_timer(timr);

- if (posix_timer_event(timr, si_private))
- /*
- * signal was not sent because of sig_ignor
- * we will not get a call back to restart it AND
- * it should be restarted.
- */
- schedule_next_timer(timr);
- }
unlock_timer(timr, flags); /* hold thru abs lock to keep irq off */
}

-
static inline struct task_struct * good_sigevent(sigevent_t * event)
{
struct task_struct *rtn = current->group_leader;
@@ -776,39 +646,40 @@ static struct k_itimer * lock_timer(time
static void
common_timer_get(struct k_itimer *timr, struct itimerspec *cur_setting)
{
- unsigned long expires;
- struct now_struct now;
+ ktime_t expires, now, remaining;
+ struct ktimer *timer = &timr->it.real.timer;

- do
- expires = timr->it.real.timer.expires;
- while ((volatile long) (timr->it.real.timer.expires) != expires);
-
- posix_get_now(&now);
-
- if (expires &&
- ((timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) &&
- !timr->it.real.incr &&
- posix_time_before(&timr->it.real.timer, &now))
- timr->it.real.timer.expires = expires = 0;
- if (expires) {
- if (timr->it_requeue_pending & REQUEUE_PENDING ||
- (timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) {
- posix_bump_timer(timr, now);
- expires = timr->it.real.timer.expires;
- }
- else
- if (!timer_pending(&timr->it.real.timer))
- expires = 0;
- if (expires)
- expires -= now.jiffies;
- }
- jiffies_to_timespec(expires, &cur_setting->it_value);
- jiffies_to_timespec(timr->it.real.incr, &cur_setting->it_interval);
-
- if (cur_setting->it_value.tv_sec < 0) {
+ memset(cur_setting, 0, sizeof(struct itimerspec));
+ expires = get_expiry_ktimer(timer, &now);
+ remaining = ktime_sub(expires, now);
+
+ /* Time left ? or timer pending */
+ if (ktime_cmp_val(remaining, >, KTIME_ZERO) || ktimer_active(timer))
+ goto calci;
+ /* interval timer ? */
+ if (ktime_cmp_val(timr->it.real.incr, ==, 0))
+ return;
+ /*
+ * When a requeue is pending or this is a SIGEV_NONE timer
+ * move the expiry time forward by intervals, so expiry is >
+ * now.
+ * The active (non SIGEV_NONE) rearm should be done
+ * automatically by the ktimer REARM mode. That's the next
+ * iteration. The REQUEUE_PENDING part will go away!
+ */
+ if (timr->it_requeue_pending & REQUEUE_PENDING ||
+ (timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) {
+ remaining = forward_posix_timer(timr, now);
+ }
+ calci:
+ /* interval timer ? */
+ if (ktime_cmp_val(timr->it.real.incr, !=, KTIME_ZERO))
+ cur_setting->it_interval = ktime_to_timespec(timr->it.real.incr);
+ /* Return 0 only when the timer is expired and not pending */
+ if (ktime_cmp_val(remaining, <=, KTIME_ZERO))
cur_setting->it_value.tv_nsec = 1;
- cur_setting->it_value.tv_sec = 0;
- }
+ else
+ cur_setting->it_value = ktime_to_timespec(remaining);
}

/* Get the time remaining on a POSIX.1b interval timer. */
@@ -832,6 +703,7 @@ sys_timer_gettime(timer_t timer_id, stru

return 0;
}
+
/*
* Get the number of overruns of a POSIX.1b interval timer. This is to
* be the overrun of the timer last delivered. At the same time we are
@@ -858,84 +730,6 @@ sys_timer_getoverrun(timer_t timer_id)

return overrun;
}
-/*
- * Adjust for absolute time
- *
- * If absolute time is given and it is not CLOCK_MONOTONIC, we need to
- * adjust for the offset between the timer clock (CLOCK_MONOTONIC) and
- * what ever clock he is using.
- *
- * If it is relative time, we need to add the current (CLOCK_MONOTONIC)
- * time to it to get the proper time for the timer.
- */
-static int adjust_abs_time(struct k_clock *clock, struct timespec *tp,
- int abs, u64 *exp, struct timespec *wall_to)
-{
- struct timespec now;
- struct timespec oc = *tp;
- u64 jiffies_64_f;
- int rtn =0;
-
- if (abs) {
- /*
- * The mask pick up the 4 basic clocks
- */
- if (!((clock - &posix_clocks[0]) & ~CLOCKS_MASK)) {
- jiffies_64_f = do_posix_clock_monotonic_gettime_parts(
- &now, wall_to);
- /*
- * If we are doing a MONOTONIC clock
- */
- if((clock - &posix_clocks[0]) & CLOCKS_MONO){
- now.tv_sec += wall_to->tv_sec;
- now.tv_nsec += wall_to->tv_nsec;
- }
- } else {
- /*
- * Not one of the basic clocks
- */
- clock->clock_get(clock - posix_clocks, &now);
- jiffies_64_f = get_jiffies_64();
- }
- /*
- * Take away now to get delta and normalize
- */
- set_normalized_timespec(&oc, oc.tv_sec - now.tv_sec,
- oc.tv_nsec - now.tv_nsec);
- }else{
- jiffies_64_f = get_jiffies_64();
- }
- /*
- * Check if the requested time is prior to now (if so set now)
- */
- if (oc.tv_sec < 0)
- oc.tv_sec = oc.tv_nsec = 0;
-
- if (oc.tv_sec | oc.tv_nsec)
- set_normalized_timespec(&oc, oc.tv_sec,
- oc.tv_nsec + clock->res);
- tstojiffie(&oc, clock->res, exp);
-
- /*
- * Check if the requested time is more than the timer code
- * can handle (if so we error out but return the value too).
- */
- if (*exp > ((u64)MAX_JIFFY_OFFSET))
- /*
- * This is a considered response, not exactly in
- * line with the standard (in fact it is silent on
- * possible overflows). We assume such a large
- * value is ALMOST always a programming error and
- * try not to compound it by setting a really dumb
- * value.
- */
- rtn = -EINVAL;
- /*
- * return the actual jiffies expire time, full 64 bits
- */
- *exp += jiffies_64_f;
- return rtn;
-}

/* Set a POSIX.1b interval timer. */
/* timr->it_lock is taken. */
@@ -943,68 +737,52 @@ static inline int
common_timer_set(struct k_itimer *timr, int flags,
struct itimerspec *new_setting, struct itimerspec *old_setting)
{
- struct k_clock *clock = &posix_clocks[timr->it_clock];
- u64 expire_64;
+ ktime_t expires;
+ int mode;

if (old_setting)
common_timer_get(timr, old_setting);

/* disable the timer */
- timr->it.real.incr = 0;
+ ktime_set_zero(timr->it.real.incr);
/*
* careful here. If smp we could be in the "fire" routine which will
* be spinning as we hold the lock. But this is ONLY an SMP issue.
*/
- if (try_to_del_timer_sync(&timr->it.real.timer) < 0) {
-#ifdef CONFIG_SMP
- /*
- * It can only be active if on an other cpu. Since
- * we have cleared the interval stuff above, it should
- * clear once we release the spin lock. Of course once
- * we do that anything could happen, including the
- * complete melt down of the timer. So return with
- * a "retry" exit status.
- */
+ if (try_to_stop_ktimer(&timr->it.real.timer) < 0)
return TIMER_RETRY;
-#endif
- }
-
- remove_from_abslist(timr);

timr->it_requeue_pending = (timr->it_requeue_pending + 2) &
~REQUEUE_PENDING;
timr->it_overrun_last = 0;
timr->it_overrun = -1;
- /*
- *switch off the timer when it_value is zero
- */
- if (!new_setting->it_value.tv_sec && !new_setting->it_value.tv_nsec) {
- timr->it.real.timer.expires = 0;
+
+ /* switch off the timer when it_value is zero */
+ if (!new_setting->it_value.tv_sec && !new_setting->it_value.tv_nsec)
return 0;
- }

- if (adjust_abs_time(clock,
- &new_setting->it_value, flags & TIMER_ABSTIME,
- &expire_64, &(timr->it.real.wall_to_prev))) {
- return -EINVAL;
- }
- timr->it.real.timer.expires = (unsigned long)expire_64;
- tstojiffie(&new_setting->it_interval, clock->res, &expire_64);
- timr->it.real.incr = (unsigned long)expire_64;
+ mode = flags & TIMER_ABSTIME ? KTIMER_ABS : KTIMER_REL;

- /*
- * We do not even queue SIGEV_NONE timers! But we do put them
- * in the abs list so we can do that right.
+ /* Posix madness. Only absolute CLOCK_REALTIME timers
+ * are affected by clock sets. So we must reinitialize
+ * the timer.
*/
+ if (timr->it_clock == CLOCK_REALTIME && mode == KTIMER_ABS)
+ timer_create_real(timr);
+ else
+ timer_create_mono(timr);
+
+ expires = ktimer_convert_timespec(&timr->it.real.timer,
+ &new_setting->it_value);
+ /* This should be moved to the auto rearm code */
+ timr->it.real.incr = ktimer_convert_timespec(&timr->it.real.timer,
+ &new_setting->it_interval);
+
+ /* SIGEV_NONE timers are not queued! See common_timer_get */
if (((timr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE))
- add_timer(&timr->it.real.timer);
+ start_ktimer(&timr->it.real.timer, &expires,
+ mode | KTIMER_NOCHECK);

- if (flags & TIMER_ABSTIME && clock->abs_struct) {
- spin_lock(&clock->abs_struct->lock);
- list_add_tail(&(timr->it.real.abs_timer_entry),
- &(clock->abs_struct->list));
- spin_unlock(&clock->abs_struct->lock);
- }
return 0;
}

@@ -1039,6 +817,7 @@ retry:

unlock_timer(timr, flag);
if (error == TIMER_RETRY) {
+ wait_for_ktimer(&timr->it.real.timer);
rtn = NULL; // We already got the old time...
goto retry;
}
@@ -1052,24 +831,10 @@ retry:

static inline int common_timer_del(struct k_itimer *timer)
{
- timer->it.real.incr = 0;
+ ktime_set_zero(timer->it.real.incr);

- if (try_to_del_timer_sync(&timer->it.real.timer) < 0) {
-#ifdef CONFIG_SMP
- /*
- * It can only be active if on an other cpu. Since
- * we have cleared the interval stuff above, it should
- * clear once we release the spin lock. Of course once
- * we do that anything could happen, including the
- * complete melt down of the timer. So return with
- * a "retry" exit status.
- */
+ if (try_to_stop_ktimer(&timer->it.real.timer) < 0)
return TIMER_RETRY;
-#endif
- }
-
- remove_from_abslist(timer);
-
return 0;
}

@@ -1085,24 +850,17 @@ sys_timer_delete(timer_t timer_id)
struct k_itimer *timer;
long flags;

-#ifdef CONFIG_SMP
- int error;
retry_delete:
-#endif
timer = lock_timer(timer_id, &flags);
if (!timer)
return -EINVAL;

-#ifdef CONFIG_SMP
- error = timer_delete_hook(timer);
-
- if (error == TIMER_RETRY) {
+ if (timer_delete_hook(timer) == TIMER_RETRY) {
unlock_timer(timer, flags);
+ wait_for_ktimer(&timer->it.real.timer);
goto retry_delete;
}
-#else
- timer_delete_hook(timer);
-#endif
+
spin_lock(&current->sighand->siglock);
list_del(&timer->list);
spin_unlock(&current->sighand->siglock);
@@ -1119,6 +877,7 @@ retry_delete:
release_posix_timer(timer, IT_ID_SET);
return 0;
}
+
/*
* return timer owned by the process, used by exit_itimers
*/
@@ -1126,22 +885,14 @@ static inline void itimer_delete(struct
{
unsigned long flags;

-#ifdef CONFIG_SMP
- int error;
retry_delete:
-#endif
spin_lock_irqsave(&timer->it_lock, flags);

-#ifdef CONFIG_SMP
- error = timer_delete_hook(timer);
-
- if (error == TIMER_RETRY) {
+ if (timer_delete_hook(timer) == TIMER_RETRY) {
unlock_timer(timer, flags);
+ wait_for_ktimer(&timer->it.real.timer);
goto retry_delete;
}
-#else
- timer_delete_hook(timer);
-#endif
list_del(&timer->list);
/*
* This keeps any tasks waiting on the spin lock from thinking
@@ -1170,60 +921,7 @@ void exit_itimers(struct signal_struct *
}
}

-/*
- * And now for the "clock" calls
- *
- * These functions are called both from timer functions (with the timer
- * spin_lock_irq() held and from clock calls with no locking. They must
- * use the save flags versions of locks.
- */
-
-/*
- * We do ticks here to avoid the irq lock ( they take sooo long).
- * The seqlock is great here. Since we a reader, we don't really care
- * if we are interrupted since we don't take lock that will stall us or
- * any other cpu. Voila, no irq lock is needed.
- *
- */
-
-static u64 do_posix_clock_monotonic_gettime_parts(
- struct timespec *tp, struct timespec *mo)
-{
- u64 jiff;
- unsigned int seq;
-
- do {
- seq = read_seqbegin(&xtime_lock);
- getnstimeofday(tp);
- *mo = wall_to_monotonic;
- jiff = jiffies_64;
-
- } while(read_seqretry(&xtime_lock, seq));
-
- return jiff;
-}
-
-static int do_posix_clock_monotonic_get(clockid_t clock, struct timespec *tp)
-{
- struct timespec wall_to_mono;
-
- do_posix_clock_monotonic_gettime_parts(tp, &wall_to_mono);
-
- tp->tv_sec += wall_to_mono.tv_sec;
- tp->tv_nsec += wall_to_mono.tv_nsec;
-
- if ((tp->tv_nsec - NSEC_PER_SEC) > 0) {
- tp->tv_nsec -= NSEC_PER_SEC;
- tp->tv_sec++;
- }
- return 0;
-}
-
-int do_posix_clock_monotonic_gettime(struct timespec *tp)
-{
- return do_posix_clock_monotonic_get(CLOCK_MONOTONIC, tp);
-}
-
+/* Not available / possible... functions */
int do_posix_clock_nosettime(clockid_t clockid, struct timespec *tp)
{
return -EINVAL;
@@ -1236,7 +934,8 @@ int do_posix_clock_notimer_create(struct
}
EXPORT_SYMBOL_GPL(do_posix_clock_notimer_create);

-int do_posix_clock_nonanosleep(clockid_t clock, int flags, struct timespec *t)
+int do_posix_clock_nonanosleep(clockid_t clock, int flags, struct timespec *t,
+ struct timespec __user *r)
{
#ifndef ENOTSUP
return -EOPNOTSUPP; /* aka ENOTSUP in userland for POSIX */
@@ -1295,125 +994,34 @@ sys_clock_getres(clockid_t which_clock,
return error;
}

-static void nanosleep_wake_up(unsigned long __data)
-{
- struct task_struct *p = (struct task_struct *) __data;
-
- wake_up_process(p);
-}
-
/*
- * The standard says that an absolute nanosleep call MUST wake up at
- * the requested time in spite of clock settings. Here is what we do:
- * For each nanosleep call that needs it (only absolute and not on
- * CLOCK_MONOTONIC* (as it can not be set)) we thread a little structure
- * into the "nanosleep_abs_list". All we need is the task_struct pointer.
- * When ever the clock is set we just wake up all those tasks. The rest
- * is done by the while loop in clock_nanosleep().
- *
- * On locking, clock_was_set() is called from update_wall_clock which
- * holds (or has held for it) a write_lock_irq( xtime_lock) and is
- * called from the timer bh code. Thus we need the irq save locks.
- *
- * Also, on the call from update_wall_clock, that is done as part of a
- * softirq thing. We don't want to delay the system that much (possibly
- * long list of timers to fix), so we defer that work to keventd.
+ * nanosleep for monotonic and realtime clocks
*/
-
-static DECLARE_WAIT_QUEUE_HEAD(nanosleep_abs_wqueue);
-static DECLARE_WORK(clock_was_set_work, (void(*)(void*))clock_was_set, NULL);
-
-static DECLARE_MUTEX(clock_was_set_lock);
-
-void clock_was_set(void)
+static int common_nsleep(clockid_t which_clock, int flags,
+ struct timespec *tsave, struct timespec __user *rmtp)
{
- struct k_itimer *timr;
- struct timespec new_wall_to;
- LIST_HEAD(cws_list);
- unsigned long seq;
-
+ int mode = flags & TIMER_ABSTIME ? KTIMER_ABS : KTIMER_REL;

- if (unlikely(in_interrupt())) {
- schedule_work(&clock_was_set_work);
- return;
+ switch (which_clock) {
+ case CLOCK_REALTIME:
+ /* Posix madness. Only absolute timers on clock realtime
+ are affected by clock set. */
+ if (mode == KTIMER_ABS)
+ return ktimer_nanosleep_real(tsave, rmtp, mode);
+ case CLOCK_MONOTONIC:
+ return ktimer_nanosleep_mono(tsave, rmtp, mode);
+ default:
+ break;
}
- wake_up_all(&nanosleep_abs_wqueue);
-
- /*
- * Check if there exist TIMER_ABSTIME timers to correct.
- *
- * Notes on locking: This code is run in task context with irq
- * on. We CAN be interrupted! All other usage of the abs list
- * lock is under the timer lock which holds the irq lock as
- * well. We REALLY don't want to scan the whole list with the
- * interrupt system off, AND we would like a sequence lock on
- * this code as well. Since we assume that the clock will not
- * be set often, it seems ok to take and release the irq lock
- * for each timer. In fact add_timer will do this, so this is
- * not an issue. So we know when we are done, we will move the
- * whole list to a new location. Then as we process each entry,
- * we will move it to the actual list again. This way, when our
- * copy is empty, we are done. We are not all that concerned
- * about preemption so we will use a semaphore lock to protect
- * aginst reentry. This way we will not stall another
- * processor. It is possible that this may delay some timers
- * that should have expired, given the new clock, but even this
- * will be minimal as we will always update to the current time,
- * even if it was set by a task that is waiting for entry to
- * this code. Timers that expire too early will be caught by
- * the expire code and restarted.
-
- * Absolute timers that repeat are left in the abs list while
- * waiting for the task to pick up the signal. This means we
- * may find timers that are not in the "add_timer" list, but are
- * in the abs list. We do the same thing for these, save
- * putting them back in the "add_timer" list. (Note, these are
- * left in the abs list mainly to indicate that they are
- * ABSOLUTE timers, a fact that is used by the re-arm code, and
- * for which we have no other flag.)
-
- */
-
- down(&clock_was_set_lock);
- spin_lock_irq(&abs_list.lock);
- list_splice_init(&abs_list.list, &cws_list);
- spin_unlock_irq(&abs_list.lock);
- do {
- do {
- seq = read_seqbegin(&xtime_lock);
- new_wall_to = wall_to_monotonic;
- } while (read_seqretry(&xtime_lock, seq));
-
- spin_lock_irq(&abs_list.lock);
- if (list_empty(&cws_list)) {
- spin_unlock_irq(&abs_list.lock);
- break;
- }
- timr = list_entry(cws_list.next, struct k_itimer,
- it.real.abs_timer_entry);
-
- list_del_init(&timr->it.real.abs_timer_entry);
- if (add_clockset_delta(timr, &new_wall_to) &&
- del_timer(&timr->it.real.timer)) /* timer run yet? */
- add_timer(&timr->it.real.timer);
- list_add(&timr->it.real.abs_timer_entry, &abs_list.list);
- spin_unlock_irq(&abs_list.lock);
- } while (1);
-
- up(&clock_was_set_lock);
+ return -EINVAL;
}

-long clock_nanosleep_restart(struct restart_block *restart_block);
-
asmlinkage long
sys_clock_nanosleep(clockid_t which_clock, int flags,
const struct timespec __user *rqtp,
struct timespec __user *rmtp)
{
struct timespec t;
- struct restart_block *restart_block =
- &(current_thread_info()->restart_block);
- int ret;

if (invalid_clockid(which_clock))
return -EINVAL;
@@ -1421,135 +1029,8 @@ sys_clock_nanosleep(clockid_t which_cloc
if (copy_from_user(&t, rqtp, sizeof (struct timespec)))
return -EFAULT;

- if ((unsigned) t.tv_nsec >= NSEC_PER_SEC || t.tv_sec < 0)
+ if (!timespec_valid(&t))
return -EINVAL;

- /*
- * Do this here as nsleep function does not have the real address.
- */
- restart_block->arg1 = (unsigned long)rmtp;
-
- ret = CLOCK_DISPATCH(which_clock, nsleep, (which_clock, flags, &t));
-
- if ((ret == -ERESTART_RESTARTBLOCK) && rmtp &&
- copy_to_user(rmtp, &t, sizeof (t)))
- return -EFAULT;
- return ret;
-}
-
-
-static int common_nsleep(clockid_t which_clock,
- int flags, struct timespec *tsave)
-{
- struct timespec t, dum;
- struct timer_list new_timer;
- DECLARE_WAITQUEUE(abs_wqueue, current);
- u64 rq_time = (u64)0;
- s64 left;
- int abs;
- struct restart_block *restart_block =
- &current_thread_info()->restart_block;
-
- abs_wqueue.flags = 0;
- init_timer(&new_timer);
- new_timer.expires = 0;
- new_timer.data = (unsigned long) current;
- new_timer.function = nanosleep_wake_up;
- abs = flags & TIMER_ABSTIME;
-
- if (restart_block->fn == clock_nanosleep_restart) {
- /*
- * Interrupted by a non-delivered signal, pick up remaining
- * time and continue. Remaining time is in arg2 & 3.
- */
- restart_block->fn = do_no_restart_syscall;
-
- rq_time = restart_block->arg3;
- rq_time = (rq_time << 32) + restart_block->arg2;
- if (!rq_time)
- return -EINTR;
- left = rq_time - get_jiffies_64();
- if (left <= (s64)0)
- return 0; /* Already passed */
- }
-
- if (abs && (posix_clocks[which_clock].clock_get !=
- posix_clocks[CLOCK_MONOTONIC].clock_get))
- add_wait_queue(&nanosleep_abs_wqueue, &abs_wqueue);
-
- do {
- t = *tsave;
- if (abs || !rq_time) {
- adjust_abs_time(&posix_clocks[which_clock], &t, abs,
- &rq_time, &dum);
- }
-
- left = rq_time - get_jiffies_64();
- if (left >= (s64)MAX_JIFFY_OFFSET)
- left = (s64)MAX_JIFFY_OFFSET;
- if (left < (s64)0)
- break;
-
- new_timer.expires = jiffies + left;
- __set_current_state(TASK_INTERRUPTIBLE);
- add_timer(&new_timer);
-
- schedule();
-
- del_timer_sync(&new_timer);
- left = rq_time - get_jiffies_64();
- } while (left > (s64)0 && !test_thread_flag(TIF_SIGPENDING));
-
- if (abs_wqueue.task_list.next)
- finish_wait(&nanosleep_abs_wqueue, &abs_wqueue);
-
- if (left > (s64)0) {
-
- /*
- * Always restart abs calls from scratch to pick up any
- * clock shifting that happened while we are away.
- */
- if (abs)
- return -ERESTARTNOHAND;
-
- left *= TICK_NSEC;
- tsave->tv_sec = div_long_long_rem(left,
- NSEC_PER_SEC,
- &tsave->tv_nsec);
- /*
- * Restart works by saving the time remaing in
- * arg2 & 3 (it is 64-bits of jiffies). The other
- * info we need is the clock_id (saved in arg0).
- * The sys_call interface needs the users
- * timespec return address which _it_ saves in arg1.
- * Since we have cast the nanosleep call to a clock_nanosleep
- * both can be restarted with the same code.
- */
- restart_block->fn = clock_nanosleep_restart;
- restart_block->arg0 = which_clock;
- /*
- * Caller sets arg1
- */
- restart_block->arg2 = rq_time & 0xffffffffLL;
- restart_block->arg3 = rq_time >> 32;
-
- return -ERESTART_RESTARTBLOCK;
- }
-
- return 0;
-}
-/*
- * This will restart clock_nanosleep.
- */
-long
-clock_nanosleep_restart(struct restart_block *restart_block)
-{
- struct timespec t;
- int ret = common_nsleep(restart_block->arg0, 0, &t);
-
- if ((ret == -ERESTART_RESTARTBLOCK) && restart_block->arg1 &&
- copy_to_user((struct timespec __user *)(restart_block->arg1), &t,
- sizeof (t)))
- return -EFAULT;
- return ret;
+ return CLOCK_DISPATCH(which_clock, nsleep, (which_clock, flags, &t, rmtp));
}
Index: linux-2.6.14-rc2-rt4/kernel/timer.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/timer.c
+++ linux-2.6.14-rc2-rt4/kernel/timer.c
@@ -912,6 +912,7 @@ static void run_timer_softirq(struct sof
{
tvec_base_t *base = &__get_cpu_var(tvec_bases);

+ run_ktimer_queues();
if (time_after_eq(jiffies, base->timer_jiffies))
__run_timers(base);
}
@@ -1177,62 +1178,6 @@ asmlinkage long sys_gettid(void)
return current->pid;
}

-static long __sched nanosleep_restart(struct restart_block *restart)
-{
- unsigned long expire = restart->arg0, now = jiffies;
- struct timespec __user *rmtp = (struct timespec __user *) restart->arg1;
- long ret;
-
- /* Did it expire while we handled signals? */
- if (!time_after(expire, now))
- return 0;
-
- expire = schedule_timeout_interruptible(expire - now);
-
- ret = 0;
- if (expire) {
- struct timespec t;
- jiffies_to_timespec(expire, &t);
-
- ret = -ERESTART_RESTARTBLOCK;
- if (rmtp && copy_to_user(rmtp, &t, sizeof(t)))
- ret = -EFAULT;
- /* The 'restart' block is already filled in */
- }
- return ret;
-}
-
-asmlinkage long sys_nanosleep(struct timespec __user *rqtp, struct timespec __user *rmtp)
-{
- struct timespec t;
- unsigned long expire;
- long ret;
-
- if (copy_from_user(&t, rqtp, sizeof(t)))
- return -EFAULT;
-
- if ((t.tv_nsec >= 1000000000L) || (t.tv_nsec < 0) || (t.tv_sec < 0))
- return -EINVAL;
-
- expire = timespec_to_jiffies(&t) + (t.tv_sec || t.tv_nsec);
- expire = schedule_timeout_interruptible(expire);
-
- ret = 0;
- if (expire) {
- struct restart_block *restart;
- jiffies_to_timespec(expire, &t);
- if (rmtp && copy_to_user(rmtp, &t, sizeof(t)))
- return -EFAULT;
-
- restart = &current_thread_info()->restart_block;
- restart->fn = nanosleep_restart;
- restart->arg0 = jiffies + expire;
- restart->arg1 = (unsigned long) rmtp;
- ret = -ERESTART_RESTARTBLOCK;
- }
- return ret;
-}
-
/*
* sys_sysinfo - fill in sysinfo struct
*/
Index: linux-2.6.14-rc2-rt4/include/linux/time.h
===================================================================
--- linux-2.6.14-rc2-rt4.orig/include/linux/time.h
+++ linux-2.6.14-rc2-rt4/include/linux/time.h
@@ -4,6 +4,7 @@
#include <linux/types.h>

#ifdef __KERNEL__
+#include <linux/calc64.h>
#include <linux/seqlock.h>
#endif

@@ -38,6 +39,11 @@ static __inline__ int timespec_equal(str
return (a->tv_sec == b->tv_sec) && (a->tv_nsec == b->tv_nsec);
}

+#define timespec_valid(ts) \
+(((ts)->tv_sec >= 0) && (((unsigned) (ts)->tv_nsec) < NSEC_PER_SEC))
+
+typedef s64 nsec_t;
+
/* Converts Gregorian date to seconds since 1970-01-01 00:00:00.
* Assumes input in normal date format, i.e. 1980-12-31 23:59:59
* => year=1980, mon=12, day=31, hour=23, min=59, sec=59.
@@ -88,8 +94,7 @@ struct timespec current_kernel_time(void
extern void do_gettimeofday(struct timeval *tv);
extern int do_settimeofday(struct timespec *tv);
extern int do_sys_settimeofday(struct timespec *tv, struct timezone *tz);
-extern void clock_was_set(void); // call when ever the clock is set
-extern int do_posix_clock_monotonic_gettime(struct timespec *tp);
+extern void do_posix_clock_monotonic_gettime(struct timespec *ts);
extern long do_utimes(char __user * filename, struct timeval * times);
struct itimerval;
extern int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue);
@@ -113,6 +118,40 @@ set_normalized_timespec (struct timespec
ts->tv_nsec = nsec;
}

+static __inline__ nsec_t timespec_to_ns(struct timespec *s)
+{
+ nsec_t res = (nsec_t) s->tv_sec * NSEC_PER_SEC;
+ return res + (nsec_t) s->tv_nsec;
+}
+
+static __inline__ struct timespec ns_to_timespec(nsec_t n)
+{
+ struct timespec ts;
+
+ if (n)
+ ts.tv_sec = div_long_long_rem_signed(n, NSEC_PER_SEC, &ts.tv_nsec);
+ else
+ ts.tv_sec = ts.tv_nsec = 0;
+ return ts;
+}
+
+static __inline__ nsec_t timeval_to_ns(struct timeval *s)
+{
+ nsec_t res = (nsec_t) s->tv_sec * NSEC_PER_SEC;
+ return res + (nsec_t) s->tv_usec * NSEC_PER_USEC;
+}
+
+static __inline__ struct timeval ns_to_timeval(nsec_t n)
+{
+ struct timeval tv;
+ if (n) {
+ tv.tv_sec = div_long_long_rem_signed(n, NSEC_PER_SEC, &tv.tv_usec);
+ tv.tv_usec /= 1000;
+ } else
+ tv.tv_sec = tv.tv_usec = 0;
+ return tv;
+}
+
#endif /* __KERNEL__ */

#define NFDBITS __NFDBITS
@@ -145,23 +184,18 @@ struct itimerval {
/*
* The IDs of the various system clocks (for POSIX.1b interval timers).
*/
-#define CLOCK_REALTIME 0
-#define CLOCK_MONOTONIC 1
+#define CLOCK_REALTIME 0
+#define CLOCK_MONOTONIC 1
#define CLOCK_PROCESS_CPUTIME_ID 2
#define CLOCK_THREAD_CPUTIME_ID 3
-#define CLOCK_REALTIME_HR 4
-#define CLOCK_MONOTONIC_HR 5

/*
* The IDs of various hardware clocks
*/
-
-
#define CLOCK_SGI_CYCLE 10
#define MAX_CLOCKS 16
-#define CLOCKS_MASK (CLOCK_REALTIME | CLOCK_MONOTONIC | \
- CLOCK_REALTIME_HR | CLOCK_MONOTONIC_HR)
-#define CLOCKS_MONO (CLOCK_MONOTONIC & CLOCK_MONOTONIC_HR)
+#define CLOCKS_MASK (CLOCK_REALTIME | CLOCK_MONOTONIC)
+#define CLOCKS_MONO (CLOCK_MONOTONIC)

/*
* The various flags for setting POSIX.1b interval timers.


2005-09-29 00:00:36

by Frank Sorenson

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

[email protected] wrote:
> This is an updated version which contains following changes:
<snip>
> Thanks for review and feedback.
>
> tglx

I get this kernel panic on boot (serial capture) with the latest
git tree (2.6.14-rc2++) plus this version of ktimers:

[4294709.646000] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[4294709.646000] printing eip:
[4294709.646000] c0137578
[4294709.646000] *pde = 00000000
[4294709.646000] Oops: 0000 [#1]
[4294709.646000] PREEMPT
[4294709.646000] Modules linked in: ipw2200 ieee80211 ieee80211_crypt
[4294709.646000] CPU: 0
[4294709.646000] EIP: 0060:[<c0137578>] Not tainted VLI
[4294709.646000] EFLAGS: 00010087 (2.6.14-rc2-fs2)
[4294709.646000] EIP is at enqueue_ktimer+0x168/0x280
[4294709.646000] eax: 00000000 ebx: c051c1b4 ecx: 0000002a edx: 14712508
[4294709.646000] esi: 00000000 edi: f7f42240 ebp: c051c1b8 esp: c05e4f58
[4294709.646000] ds: 007b es: 007b ss: 0068
[4294709.646000] Process swapper (pid: 0, threadinfo=c05e4000 task=c0515bc0)
[4294709.646000] Stack: c051c1b4 00000000 c051c1ac 147a7f90 0000002a 147a7f90 0000002a f7f42240
[4294709.646000] 147a6ff0 0000002a c051c1ac c0137d72 00000005 c05e4000 c051c1b8 c051c1b4
[4294709.646000] c05e4000 c0124380 f7f69a90 147a6ff0 0000002a 00000001 c06136c8 0000000a
[4<0>Kernel panic - not syncing: Fatal exception in interrupt
[4294709.680000]



Frank
- --
Frank Sorenson - KD7TZK
Systems Manager, Computer Science Department
Brigham Young University
[email protected]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFDOy5haI0dwg4A47wRAnXcAJ996Yrw2nkjuNThfLCep2GRZ0VjzgCcDIWl
IvIgmrrHG3qB8LNszTPITX8=
=TLMU
-----END PGP SIGNATURE-----

2005-09-29 00:51:03

by Frank Sorenson

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Frank Sorenson wrote:
> I get this kernel panic on boot (serial capture) with the latest
> git tree (2.6.14-rc2++) plus this version of ktimers:

Here's a little more information. I've narrowed the panic down to ntpd
startup. Without ntpd, the system seems to run okay, but panics the
moment I start up ntpd.

Hope this helps,

Frank
- --
Frank Sorenson - KD7TZK
Systems Manager, Computer Science Department
Brigham Young University
[email protected]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFDOzpSaI0dwg4A47wRAipFAJ0c6/2tif49xVEhDZCH2drgpJXQmACgoY+G
tT9LkOWmS67SyX5Vekrl024=
=f/qY
-----END PGP SIGNATURE-----

2005-09-29 00:56:43

by john stultz

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

On Wed, 2005-09-28 at 18:50 -0600, Frank Sorenson wrote:
> Frank Sorenson wrote:
> > I get this kernel panic on boot (serial capture) with the latest
> > git tree (2.6.14-rc2++) plus this version of ktimers:
>
> Here's a little more information. I've narrowed the panic down to ntpd
> startup. Without ntpd, the system seems to run okay, but panics the
> moment I startup ntpd.

Are you just testing the ktimers patch or the full set of patches Thomas
is working with (including my code)?

thanks
-john

2005-09-29 01:06:22

by Frank Sorenson

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

john stultz wrote:
> On Wed, 2005-09-28 at 18:50 -0600, Frank Sorenson wrote:
>
>>Frank Sorenson wrote:
>>
>>>I get this kernel panic on boot (serial capture) with the latest
>>>git tree (2.6.14-rc2++) plus this version of ktimers:
>>
>>Here's a little more information. I've narrowed the panic down to ntpd
>>startup. Without ntpd, the system seems to run okay, but panics the
>>moment I startup ntpd.
>
>
> Are you just testing the ktimers patch or the full set of patches Thomas
> is working with (including my code)?
>
> thanks
> -john

After first testing with other patches, I verified that the panic occurs
without any other patches involved.

So, I am just testing this particular ktimers patch, without any others.

Am I correct in my understanding that this patch is standalone?

Frank
- --
Frank Sorenson - KD7TZK
Systems Manager, Computer Science Department
Brigham Young University
[email protected]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFDOz3faI0dwg4A47wRAn+/AKDsu/lRzUhbln8pNoRpfZ2V45D0NgCfQLHF
lK6+uXzWFQQhp8SvqBxPw1M=
=B9oy
-----END PGP SIGNATURE-----

2005-09-29 01:10:06

by john stultz

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

On Wed, 2005-09-28 at 22:43 +0200, [email protected] wrote:
> +static int enqueue_ktimer(struct ktimer *timer, struct ktimer_base *base,
> + ktime_t *tim, int mode)
> +{
> + struct rb_node **link = &base->active.rb_node;
> + struct rb_node *parent = NULL;
> + struct ktimer *entry;
> + struct list_head *prev = &base->pending;
> + ktime_t now;
> +
> + /* Get current time */
> + now = base->get_time();
> +
> + /* Timer expiry mode */
> + switch (mode & ~KTIMER_NOCHECK) {
> + case KTIMER_ABS:
> + timer->expires = *tim;
> + break;
> + case KTIMER_REL:
> + timer->expires = ktime_add(now, *tim);
> + break;
> + case KTIMER_INCR:
> + timer->expires = ktime_add(timer->expires, *tim);
> + break;

...



> +static inline void do_remove_ktimer(struct ktimer *timer,
> + struct ktimer_base *base, int rearm)
> +{
> + list_del(&timer->list);
> + rb_erase(&timer->node, &base->active);
> + timer->node.rb_parent = KTIMER_POISON;
> + timer->status = KTIMER_INACTIVE;
> + base->count--;
> + BUG_ON(base->count < 0);
> + /* Auto rearm the timer ? */
> + if (rearm && ktime_cmp_val(timer->interval, !=, KTIME_ZERO))
> + enqueue_ktimer(timer, base, NULL, KTIMER_REARM);
> +}


There are a couple of places like this where you pass NULL as the ktime_t
pointer tim to enqueue_ktimer(). However, in enqueue_ktimer you
dereference tim in a few spots without checking for NULL.

I'm guessing this is what Frank is seeing.
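
Roughly, that path boils down to something like the sketch below (simplified,
not the actual code; it assumes the scalar ktime_t variant, and rearm_ktimer()
is just a stand-in name for the KTIMER_REARM case in enqueue_ktimer()):

typedef long long ktime_t;

struct ktimer {
	ktime_t expires;
	ktime_t interval;
	int overrun;
};

/* Reached via enqueue_ktimer(timer, base, NULL, KTIMER_REARM),
 * so tim is NULL here and must not be dereferenced. */
static void rearm_ktimer(struct ktimer *timer, ktime_t *tim, ktime_t now)
{
	(void)tim;			/* NULL on the auto-rearm path */

	while (timer->expires <= now) {
		/* broken: timer->expires += *tim;  <- NULL dereference */
		timer->expires += timer->interval;
		timer->overrun++;
	}
}

Either that case needs to use timer->interval, or the callers need to pass a
real pointer.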

thanks
-john


2005-09-29 06:52:33

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

On Wed, 2005-09-28 at 18:10 -0700, john stultz wrote:

> > + /* Auto rearm the timer ? */
> > + if (rearm && ktime_cmp_val(timer->interval, !=, KTIME_ZERO))
> > + enqueue_ktimer(timer, base, NULL, KTIMER_REARM);
> > +}
>
>
> There's a couple of places like this where you pass NULL as the ktime_t
> pointer tim to enqueue_ktimer(). However in enqueue_ktimer, you
> dereference tim in a few spots w/o checking for NULL.
>

The KTIMER_REARM case is the broken spot. I fixed this already, as it was
oopsing here too, but somehow I messed up with quilt.

tglx

Index: linux-2.6.14-rc2-rt4/kernel/ktimers.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/ktimers.c
+++ linux-2.6.14-rc2-rt4/kernel/ktimers.c
@@ -242,7 +242,7 @@ static int enqueue_ktimer(struct ktimer
goto nocheck;
case KTIMER_REARM:
while ktime_cmp(timer->expires, <= , now) {
- timer->expires = ktime_add(timer->expires, *tim);
+ timer->expires = ktime_add(timer->expires, timer->interval);
timer->overrun++;
}
goto nocheck;
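
With that change the REARM case never touches *tim at all: it just steps the
expiry forward by the stored interval and counts one overrun per missed
period. As a plain C sketch of what the fixed loop computes (scalar ktime_t
assumed, values made up for illustration):

#include <stdio.h>

typedef long long ktime_t;

int main(void)
{
	ktime_t expires  = 1000;	/* last expiry */
	ktime_t interval = 250;		/* timer->interval */
	ktime_t now      = 1600;	/* current time, past the expiry */
	int overrun = 0;

	/* same shape as the fixed KTIMER_REARM loop */
	while (expires <= now) {
		expires += interval;	/* was "+= *tim" before the fix */
		overrun++;
	}

	/* prints: expires=1750 overrun=3 */
	printf("expires=%lld overrun=%d\n", expires, overrun);
	return 0;
}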


2005-09-29 20:00:04

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Am I the only one finding "=20\n" and other corruption in this patch?

George
--

[email protected] wrote:
> This is an updated version which contains following changes:
>
> - Selectable time storage format: union/struct based, scalar (64bit)
> - Fixed an endless loop in forward_posix_timer (George Anzinger)
> - Fixed a wrong sizeof(x) (George Anzinger)
> - Fixed build problems for non x86 architectures
>
> Roman pointed out that the penalty for some architectures
> would be quite big when using the nsec_t (64bit) scalar time
> storage format. After a long discussion and some more detailed
> tests especially on ARM it turned out that the scalar format
> is unfortunately not suitable everywhere. The tradeoff between
> performance and cleanliness seems too big for some architectures.
>
> After several rounds of functional conversions and
> cleanups an acceptable compromise between cleanliness and
> storage format flexibility was found.
>
> For 64bit architectures the scalar representation is definitely
> a win and therefor enabled unconditionally. The code defaults to
> the union/struct based implementation on 32bit archs, but can be
> switched to the scalar storage format by setting
> CONFIG_KTIME_SCALAR=y if there is a benefit for the particular
> architecture. The union/struct magic has an advantage over the
> struct timespec based format which I considered to use first. It
> produces better and denser code for most architecures and does no
> harm anywhere else. This might change with improvements of
> compilers, but then it requires just a replacement of the related
> macros / inlines.
>
> The code is not harder to understand than the previous
> open coded scalar storage based implementation.
>
> The correctness was verified with the posix timer tests from
> the HRT project on the forward ported ktimers based high
> resolution proof of concept implementation.
> For those interested in this topic the patchseries is available
> at http://www.tglx.de/private/tglx/ktimers/patch-2.6.14-rc2-kt5.patches.tar.bz2
>
>
> Thanks for review and feedback.
>
> tglx
>
>
> ktimers seperate the "timer API" from the "timeout API".
> ktimers are used for:
> - nanosleep
> - posixtimers
> - itimers
>
>
> The patch contains the base implementation of ktimers and the
> conversion of nanosleep, posixtimers and itimers to ktimer users.
>
> The patch does not require other changes to the Linux time(r) core
> system.
>
> The implementation was done with following constraints in mind:
>
> - Not bound to jiffies
> - Multiple time sources
> - Per CPU timer queues
> - Simplification of absolute CLOCK_REALTIME posix timers
> - High resolution timer aware
> - Allows the timeout API to reschedule the next event
> (for tickless systems)
>
> Ktimers enqueue the timers into a time sorted list, which is implemented
> with a rbtree, which is effiecient and already used in other performance
> critical parts of the kernel. This is a bit slower than the timer wheel,
> but due to the fact that the vast majority of timers is actually
> expiring it has to be waged versus the cascading penalty.
>
> The code supports multiple time sources. Currently implemented are
> CLOCK_REALTIME and CLOCK_MONOTONIC. They provide seperate timer queues
> and support functions.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
>
> ---
> Index: linux-2.6.14-rc2-rt4/include/linux/calc64.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.14-rc2-rt4/include/linux/calc64.h
> @@ -0,0 +1,31 @@
> +#ifndef _linux_CALC64_H
> +#define _linux_CALC64_H
> +
> +#include <linux/types.h>
> +#include <asm/div64.h>
> +
> +#ifndef div_long_long_rem
> +#define div_long_long_rem(dividend,divisor,remainder) \
> +({ \
> + u64 result = dividend; \
> + *remainder = do_div(result,divisor); \
> + result; \
> +})
> +#endif
> +
> +static inline long div_long_long_rem_signed(long long dividend,
> + long divisor,
> + long *remainder)
> +{
> + long res;
> +
> + if (unlikely(dividend < 0)) {
> + res = -div_long_long_rem(-dividend, divisor, remainder);
> + *remainder = -(*remainder);
> + } else {
> + res = div_long_long_rem(dividend, divisor, remainder);
> + }
> + return res;
> +}
> +
> +#endif
> Index: linux-2.6.14-rc2-rt4/include/linux/jiffies.h
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/include/linux/jiffies.h
> +++ linux-2.6.14-rc2-rt4/include/linux/jiffies.h
> @@ -1,21 +1,12 @@
> #ifndef _LINUX_JIFFIES_H
> #define _LINUX_JIFFIES_H
>
> +#include <linux/calc64.h>
> #include <linux/kernel.h>
> #include <linux/types.h>
> #include <linux/time.h>
> #include <linux/timex.h>
> #include <asm/param.h> /* for HZ */
> -#include <asm/div64.h>
> -
> -#ifndef div_long_long_rem
> -#define div_long_long_rem(dividend,divisor,remainder) \
> -({ \
> - u64 result = dividend; \
> - *remainder = do_div(result,divisor); \
> - result; \
> -})
> -#endif
>
> /*
> * The following defines establish the engineering parameters of the PLL
> Index: linux-2.6.14-rc2-rt4/fs/exec.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/fs/exec.c
> +++ linux-2.6.14-rc2-rt4/fs/exec.c
> @@ -645,9 +645,10 @@ static inline int de_thread(struct task_
> * synchronize with any firing (by calling del_timer_sync)
> * before we can safely let the old group leader die.
> */
> - sig->real_timer.data = (unsigned long)current;
> - if (del_timer_sync(&sig->real_timer))
> - add_timer(&sig->real_timer);
> + sig->real_timer.data = current;
> + if (stop_ktimer(&sig->real_timer))
> + start_ktimer(&sig->real_timer, NULL,
> + KTIMER_RESTART|KTIMER_NOCHECK);
> }
> while (atomic_read(&sig->count) > count) {
> sig->group_exit_task = current;
> @@ -659,7 +660,7 @@ static inline int de_thread(struct task_
> }
> sig->group_exit_task = NULL;
> sig->notify_count = 0;
> - sig->real_timer.data = (unsigned long)current;
> + sig->real_timer.data = current;
> spin_unlock_irq(lock);
>
> /*
> Index: linux-2.6.14-rc2-rt4/fs/proc/array.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/fs/proc/array.c
> +++ linux-2.6.14-rc2-rt4/fs/proc/array.c
> @@ -330,7 +330,7 @@ static int do_task_stat(struct task_stru
> unsigned long min_flt = 0, maj_flt = 0;
> cputime_t cutime, cstime, utime, stime;
> unsigned long rsslim = 0;
> - unsigned long it_real_value = 0;
> + DEFINE_KTIME(it_real_value);
> struct task_struct *t;
> char tcomm[sizeof(task->comm)];
>
> @@ -386,7 +386,7 @@ static int do_task_stat(struct task_stru
> utime = cputime_add(utime, task->signal->utime);
> stime = cputime_add(stime, task->signal->stime);
> }
> - it_real_value = task->signal->it_real_value;
> + it_real_value = task->signal->real_timer.expires;
> }
> ppid = pid_alive(task) ? task->group_leader->real_parent->tgid : 0;
> read_unlock(&tasklist_lock);
> @@ -435,7 +435,7 @@ static int do_task_stat(struct task_stru
> priority,
> nice,
> num_threads,
> - jiffies_to_clock_t(it_real_value),
> + (clock_t) ktime_to_clock_t(it_real_value),
> start_time,
> vsize,
> mm ? get_mm_counter(mm, rss) : 0, /* you might want to shift this left 3 */
> Index: linux-2.6.14-rc2-rt4/include/linux/ktimer.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.14-rc2-rt4/include/linux/ktimer.h
> @@ -0,0 +1,335 @@
> +#ifndef _LINUX_KTIMER_H
> +#define _LINUX_KTIMER_H
> +
> +#include <linux/init.h>
> +#include <linux/list.h>
> +#include <linux/rbtree.h>
> +#include <linux/time.h>
> +#include <linux/wait.h>
> +
> +/* Timer API */
> +
> +/*
> + * Select the ktime_t data type
> + */
> +#if defined(CONFIG_KTIME_SCALAR) || (BITS_PER_LONG == 64)
> + #define KTIME_IS_SCALAR
> +#endif
> +
> +#ifndef KTIME_IS_SCALAR
> +typedef union {
> + s64 tv64;
> + struct {
> +#ifdef __BIG_ENDIAN
> + s32 sec, nsec;
> +#else
> + s32 nsec, sec;
> +#endif
> + } tv;
> +} ktime_t;
> +
> +#else
> +
> +typedef s64 ktime_t;
> +
> +#endif
> +
> +struct ktimer_base;
> +
> +/*
> + * Timer structure must be initialized by init_ktimer_xxx !
> + */
> +struct ktimer {
> + struct rb_node node;
> + struct list_head list;
> + ktime_t expires;
> + ktime_t expired;
> + ktime_t interval;
> + int overrun;
> + unsigned long status;
> + void (*function)(void *);
> + void *data;
> + struct ktimer_base *base;
> +};
> +
> +/*
> + * Timer base struct
> + */
> +struct ktimer_base {
> + int index;
> + char *name;
> + spinlock_t lock;
> + struct rb_root active;
> + struct list_head pending;
> + int count;
> + unsigned long resolution;
> + ktime_t (*get_time)(void);
> + struct ktimer *running_timer;
> + wait_queue_head_t wait_for_running_timer;
> +};
> +
> +/*
> + * Values for the mode argument of xxx_ktimer functions
> + */
> +enum
> +{
> + KTIMER_NOREARM, /* Internal value */
> + KTIMER_ABS, /* Time value is absolute */
> + KTIMER_REL, /* Time value is relativ to now */
> + KTIMER_INCR, /* Time value is relativ to previous expiry time */
> + KTIMER_FORWARD, /* Timer is rearmed with value. Overruns are accounted */
> + KTIMER_REARM, /* Timer is rearmed with interval. Overruns are accounted */
> + KTIMER_RESTART /* Timer is restarted with the stored expiry value */
> +};
> +
> +/* The timer states */
> +enum
> +{
> + KTIMER_INACTIVE,
> + KTIMER_PENDING,
> + KTIMER_EXPIRED,
> + KTIMER_EXPIRED_NOQUEUE,
> +};
> +
> +/* Expiry must not be checked when the timer is started */
> +#define KTIMER_NOCHECK 0x10000
> +
> +#define KTIMER_POISON ((void *) 0x00100101)
> +
> +#define KTIME_ZERO 0LL
> +
> +#define ktimer_active(t) ((t)->status != KTIMER_INACTIVE)
> +#define ktimer_before(t1, t2) (ktime_cmp((t1)->expires, <, (t2)->expires))
> +
> +#ifndef KTIME_IS_SCALAR
> +/*
> + * Helper macros/inlines to get the math with ktime_t right. Uurgh, that's
> + * ugly as hell, but for performance sake we have to use this. The
> + * nsec_t based code was nice and simple. :(
> + *
> + * Be careful when using this stuff. It blows up on you if you don't
> + * get the weirdness right.
> + *
> + * Be especially aware, that negative values are represented in the
> + * form:
> + * tv.sec < 0 and 0 >= tv.nsec < NSEC_PER_SEC
> + *
> + */
> +#define DEFINE_KTIME(k) ktime_t k = {.tv64 = 0LL }
> +
> +#define ktime_cmp(a,op,b) ((a).tv64 op (b).tv64)
> +#define ktime_cmp_val(a, op, b) ((a).tv64 op b)
> +
> +#define ktime_set(s,n) \
> +({ \
> + ktime_t __kt; \
> + __kt.tv.sec = s; \
> + __kt.tv.nsec = n; \
> + __kt; \
> +})
> +
> +#define ktime_set_zero(k) k.tv64 = 0LL
> +
> +#define ktime_set_low_high(l,h) ktime_set(h,l)
> +
> +#define ktime_get_low(t) (t).tv.nsec
> +#define ktime_get_high(t) (t).tv.sec
> +
> +static inline ktime_t ktime_set_normalized(long sec, long nsec)
> +{
> + ktime_t res;
> +
> + while (nsec < 0) {
> + nsec += NSEC_PER_SEC;
> + sec--;
> + }
> + while (nsec >= NSEC_PER_SEC) {
> + nsec -= NSEC_PER_SEC;
> + sec++;
> + }
> +
> + res.tv.sec = sec;
> + res.tv.nsec = nsec;
> + return res;
> +}
> +
> +static inline ktime_t ktime_sub(ktime_t a, ktime_t b)
> +{
> + ktime_t res;
> +
> + res.tv64 = a.tv64 - b.tv64;
> + if (res.tv.nsec < 0)
> + res.tv.nsec += NSEC_PER_SEC;
> +
> + return res;
> +}
> +
> +static inline ktime_t ktime_add(ktime_t a, ktime_t b)
> +{
> + ktime_t res;
> +
> + res.tv64 = a.tv64 + b.tv64;
> + if (res.tv.nsec >= NSEC_PER_SEC) {
> + res.tv.nsec -= NSEC_PER_SEC;
> + res.tv.sec++;
> + }
> + return res;
> +}
> +
> +static inline ktime_t ktime_add_ns(ktime_t a, u64 nsec)
> +{
> + ktime_t tmp;
> +
> + if (likely(nsec < NSEC_PER_SEC)) {
> + tmp.tv64 = nsec;
> + } else {
> + unsigned long rem;
> + rem = do_div(nsec, NSEC_PER_SEC);
> + tmp = ktime_set((long)nsec, rem);
> + }
> + return ktime_add(a,tmp);
> +}
> +
> +#define timespec_to_ktime(ts) \
> +({ \
> + ktime_t __kt; \
> + struct timespec __ts = (ts); \
> + __kt.tv.sec = (s32)__ts.tv_sec; \
> + __kt.tv.nsec = (s32)__ts.tv_nsec; \
> + __kt; \
> +})
> +
> +#define ktime_to_timespec(kt) \
> +({ \
> + struct timespec __ts; \
> + ktime_t __kt = (kt); \
> + __ts.tv_sec = (time_t)__kt.tv.sec; \
> + __ts.tv_nsec = (long)__kt.tv.nsec; \
> + __ts; \
> +})
> +
> +#define ktime_to_timeval(kt) \
> +({ \
> + struct timeval __tv; \
> + ktime_t __kt = (kt); \
> + __tv.tv_sec = (time_t)__kt.tv.sec; \
> + __tv.tv_usec = (long)(__kt.tv.nsec / NSEC_PER_USEC); \
> + __tv; \
> +})
> +
> +#define ktime_to_clock_t(kt) \
> +({ \
> + ktime_t __kt = (kt); \
> + u64 nsecs = (u64) __kt.tv.sec * NSEC_PER_SEC; \
> + nsec_to_clock_t(nsecs + (u64) __kt.tv.nsec); \
> +})
> +
> +#define ktime_to_ns(kt) \
> +({ \
> + ktime_t __kt = (kt); \
> + (((u64)__kt.tv.sec * NSEC_PER_SEC) + (u64)__kt.tv.nsec);\
> +})
> +
> +#else
> +
> +/* ktime_t macros when using a 64bit variable */
> +
> +#define DEFINE_KTIME(kt) ktime_t kt = 0LL
> +
> +#define ktime_cmp(a,op,b) ((a) op (b))
> +#define ktime_cmp_val(a,op,b) ((a) op b)
> +
> +#define ktime_set(s,n) (((s64) s * NSEC_PER_SEC) + (s64)n)
> +#define ktime_set_zero(kt) kt = 0LL
> +
> +#define ktime_set_low_high(l,h) ((s64)((u64)l) | (((s64) h) << 32))
> +
> +#define ktime_get_low(t) ((t) & 0xFFFFFFFFLL)
> +#define ktime_get_high(t) ((t) >> 32)
> +
> +#define ktime_sub(a,b) ((a) - (b))
> +#define ktime_add(a,b) ((a) + (b))
> +#define ktime_add_ns(a,b) ((a) + (b))
> +
> +#define timespec_to_ktime(ts) ktime_set(ts.tv_sec, ts.tv_nsec)
> +
> +#define ktime_to_timespec(kt) ns_to_timespec(kt)
> +#define ktime_to_timeval(kt) ns_to_timeval(kt)
> +
> +#define ktime_to_clock_t(kt) nsec_to_clock_t(kt)
> +
> +#define ktime_to_ns(kt) (kt)
> +
> +#define ktime_set_normalized(s,n) ktime_set(s,n)
> +
> +#endif
> +
> +/* Exported functions */
> +extern void fastcall init_ktimer_real(struct ktimer *timer);
> +extern void fastcall init_ktimer_mono(struct ktimer *timer);
> +extern int modify_ktimer(struct ktimer *timer, ktime_t *tim, int mode);
> +extern int start_ktimer(struct ktimer *timer, ktime_t *tim, int mode);
> +extern int try_to_stop_ktimer(struct ktimer *timer);
> +extern int stop_ktimer(struct ktimer *timer);
> +extern ktime_t get_remtime_ktimer(struct ktimer *timer, long fake);
> +extern ktime_t get_expiry_ktimer(struct ktimer *timer, ktime_t *now);
> +extern void __init init_ktimers(void);
> +
> +/* Conversion functions with rounding based on resolution */
> +extern ktime_t ktimer_convert_timeval(struct ktimer *timer, struct timeval *tv);
> +extern ktime_t ktimer_convert_timespec(struct ktimer *timer, struct timespec *ts);
> +
> +/* Posix timers current quirks */
> +extern int get_ktimer_mono_res(clockid_t which_clock, struct timespec *tp);
> +extern int get_ktimer_real_res(clockid_t which_clock, struct timespec *tp);
> +
> +/* nanosleep functions */
> +long ktimer_nanosleep_mono(struct timespec *rqtp, struct timespec __user *rmtp, int mode);
> +long ktimer_nanosleep_real(struct timespec *rqtp, struct timespec __user *rmtp, int mode);
> +
> +#if defined(CONFIG_SMP)
> +extern void wait_for_ktimer(struct ktimer *timer);
> +#else
> +#define wait_for_ktimer(t) do {} while (0)
> +#endif
> +
> +#define KTIME_REALTIME_RES (NSEC_PER_SEC/HZ)
> +#define KTIME_MONOTONIC_RES (NSEC_PER_SEC/HZ)
> +
> +static inline void get_ktime_mono_ts(struct timespec *ts)
> +{
> + unsigned long seq;
> + struct timespec tomono;
> + do {
> + seq = read_seqbegin(&xtime_lock);
> + getnstimeofday(ts);
> + tomono = wall_to_monotonic;
> + } while (read_seqretry(&xtime_lock, seq));
> +
> +
> + set_normalized_timespec(ts, ts->tv_sec + tomono.tv_sec,
> + ts->tv_nsec + tomono.tv_nsec);
> +
> +}
> +
> +static inline ktime_t do_get_ktime_mono(void)
> +{
> + struct timespec now;
> +
> + get_ktime_mono_ts(&now);
> + return timespec_to_ktime(now);
> +}
> +
> +#define get_ktime_real_ts(ts) getnstimeofday(ts)
> +static inline ktime_t do_get_ktime_real(void)
> +{
> + struct timespec now;
> +
> + getnstimeofday(&now);
> + return timespec_to_ktime(now);
> +}
> +
> +#define clock_was_set() do { } while (0)
> +extern void run_ktimer_queues(void);
> +
> +#endif
> Index: linux-2.6.14-rc2-rt4/include/linux/posix-timers.h
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/include/linux/posix-timers.h
> +++ linux-2.6.14-rc2-rt4/include/linux/posix-timers.h
> @@ -51,10 +51,9 @@ struct k_itimer {
> struct sigqueue *sigq; /* signal queue entry. */
> union {
> struct {
> - struct timer_list timer;
> - struct list_head abs_timer_entry; /* clock abs_timer_list */
> - struct timespec wall_to_prev; /* wall_to_monotonic used when set */
> - unsigned long incr; /* interval in jiffies */
> + struct ktimer timer;
> + ktime_t incr;
> + int overrun;
> } real;
> struct cpu_timer_list cpu;
> struct {
> @@ -66,10 +65,6 @@ struct k_itimer {
> } it;
> };
>
> -struct k_clock_abs {
> - struct list_head list;
> - spinlock_t lock;
> -};
> struct k_clock {
> int res; /* in nano seconds */
> int (*clock_getres) (clockid_t which_clock, struct timespec *tp);
> @@ -77,7 +72,7 @@ struct k_clock {
> int (*clock_set) (clockid_t which_clock, struct timespec * tp);
> int (*clock_get) (clockid_t which_clock, struct timespec * tp);
> int (*timer_create) (struct k_itimer *timer);
> - int (*nsleep) (clockid_t which_clock, int flags, struct timespec *);
> + int (*nsleep) (clockid_t which_clock, int flags, struct timespec *, struct timespec __user *);
> int (*timer_set) (struct k_itimer * timr, int flags,
> struct itimerspec * new_setting,
> struct itimerspec * old_setting);
> @@ -91,37 +86,104 @@ void register_posix_clock(clockid_t cloc
>
> /* Error handlers for timer_create, nanosleep and settime */
> int do_posix_clock_notimer_create(struct k_itimer *timer);
> -int do_posix_clock_nonanosleep(clockid_t, int flags, struct timespec *);
> +int do_posix_clock_nonanosleep(clockid_t, int flags, struct timespec *, struct timespec __user *);
> int do_posix_clock_nosettime(clockid_t, struct timespec *tp);
>
> /* function to call to trigger timer event */
> int posix_timer_event(struct k_itimer *timr, int si_private);
>
> -struct now_struct {
> - unsigned long jiffies;
> -};
> -
> -#define posix_get_now(now) (now)->jiffies = jiffies;
> -#define posix_time_before(timer, now) \
> - time_before((timer)->expires, (now)->jiffies)
> -
> -#define posix_bump_timer(timr, now) \
> - do { \
> - long delta, orun; \
> - delta = now.jiffies - (timr)->it.real.timer.expires; \
> - if (delta >= 0) { \
> - orun = 1 + (delta / (timr)->it.real.incr); \
> - (timr)->it.real.timer.expires += \
> - orun * (timr)->it.real.incr; \
> - (timr)->it_overrun += orun; \
> - } \
> - }while (0)
> +#if (BITS_PER_LONG < 64)
> +static inline ktime_t forward_posix_timer(struct k_itimer *t, ktime_t now)
> +{
> + ktime_t delta = ktime_sub(now, t->it.real.timer.expires);
> + unsigned long orun = 1;
> +
> + if (ktime_cmp_val(delta, <, KTIME_ZERO))
> + goto out;
> +
> + if (unlikely(ktime_cmp(delta, >, t->it.real.incr))) {
> +
> + int sft = 0;
> + u64 div, dclc, inc, dns;
> +
> + dclc = dns = ktime_to_ns(delta);
> + div = inc = ktime_to_ns(t->it.real.incr);
> + /* Make sure the divisor is less than 2^32 */
> + while(div >> 32) {
> + sft++;
> + div >>= 1;
> + }
> + dclc >>= sft;
> + do_div(dclc, (unsigned long) div);
> + orun = (unsigned long) dclc;
> + if (likely(!(inc >> 32)))
> + dclc *= (unsigned long) inc;
> + else
> + dclc *= inc;
> + t->it.real.timer.expires = ktime_add_ns(t->it.real.timer.expires,
> + dclc);
> + } else {
> + t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
> + t->it.real.incr);
> + }
> + /*
> + * Here is the correction for exact. Also covers delta == incr
> + * which is the else clause above.
> + */
> + if (ktime_cmp(t->it.real.timer.expires, <=, now)) {
> + t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
> + t->it.real.incr);
> + orun++;
> + }
> + t->it_overrun += orun;
> +
> + out:
> + return ktime_sub(t->it.real.timer.expires, now);
> +}
> +#else
> +static inline ktime_t forward_posix_timer(struct k_itimer *t, ktime_t now)
> +{
> + ktime_t delta = ktime_sub(now, t->it.real.timer.expires);
> + unsigned long orun = 1;
> +
> + if (ktime_cmp_val(delta, <, KTIME_ZERO))
> + goto out;
> +
> + if (unlikely(ktime_cmp(delta, >, t->it.real.incr))) {
> +
> + u64 dns, inc;
> +
> + dns = ktime_to_ns(delta);
> + inc = ktime_to_ns(t->it.real.incr);
> +
> + orun = dns / inc;
> + t->it.real.timer.expires = ktime_add_ns(t->it.real.timer.expires,
> + orun * inc);
> + } else {
> + t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
> + t->it.real.incr);
> + }
> + /*
> + * Correct for the case where the new expiry is still <= now,
> + * e.g. delta == incr, which went through the else clause above.
> + */
> + if (ktime_cmp(t->it.real.timer.expires, <=, now)) {
> + t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
> + t->it.real.incr);
> + orun++;
> + }
> + t->it_overrun += orun;
> + out:
> + return ktime_sub(t->it.real.timer.expires, now);
> +}
> +#endif
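
The 32bit variant above cannot divide a 64bit delta by a 64bit interval directly, so it shifts the divisor below 2^32 and hands the scaled dividend to do_div(). A small user-space sketch of the same trick (overrun_div() and the sample numbers are purely illustrative, not part of the patch):

	/* Build and run as a normal user-space program; overrun_div()
	 * mirrors the dclc/div/sft dance in forward_posix_timer(). */
	#include <stdint.h>
	#include <stdio.h>

	static uint64_t overrun_div(uint64_t delta_ns, uint64_t incr_ns)
	{
		uint64_t div = incr_ns, dclc = delta_ns;
		int sft = 0;

		/* Make sure the divisor fits in 32 bits, as do_div() requires */
		while (div >> 32) {
			sft++;
			div >>= 1;
		}
		/* Scale the dividend by the same amount before dividing */
		dclc >>= sft;
		return dclc / div;	/* stands in for do_div(dclc, div) */
	}

	int main(void)
	{
		uint64_t delta = 10000000000ULL;	/* timer is 10s late */
		uint64_t incr  = 1500000ULL;		/* 1.5ms interval    */

		printf("approx orun %llu, exact %llu\n",
		       (unsigned long long)overrun_div(delta, incr),
		       (unsigned long long)(delta / incr));
		return 0;
	}

The "expires <= now" check that follows in forward_posix_timer() catches the boundary case where the truncated quotient leaves the expiry at or before now.
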
>
> int posix_cpu_clock_getres(clockid_t which_clock, struct timespec *);
> int posix_cpu_clock_get(clockid_t which_clock, struct timespec *);
> int posix_cpu_clock_set(clockid_t which_clock, const struct timespec *tp);
> int posix_cpu_timer_create(struct k_itimer *);
> -int posix_cpu_nsleep(clockid_t, int, struct timespec *);
> +int posix_cpu_nsleep(clockid_t, int, struct timespec *,
> + struct timespec __user *);
> int posix_cpu_timer_set(struct k_itimer *, int,
> struct itimerspec *, struct itimerspec *);
> int posix_cpu_timer_del(struct k_itimer *);
> Index: linux-2.6.14-rc2-rt4/include/linux/sched.h
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/include/linux/sched.h
> +++ linux-2.6.14-rc2-rt4/include/linux/sched.h
> @@ -104,6 +104,7 @@ extern unsigned long nr_iowait(void);
> #include <linux/param.h>
> #include <linux/resource.h>
> #include <linux/timer.h>
> +#include <linux/ktimer.h>
>
> #include <asm/processor.h>
>
> @@ -346,8 +347,7 @@ struct signal_struct {
> struct list_head posix_timers;
>
> /* ITIMER_REAL timer for the process */
> - struct timer_list real_timer;
> - unsigned long it_real_value, it_real_incr;
> + struct ktimer real_timer;
>
> /* ITIMER_PROF and ITIMER_VIRTUAL timers for the process */
> cputime_t it_prof_expires, it_virt_expires;
> Index: linux-2.6.14-rc2-rt4/include/linux/timer.h
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/include/linux/timer.h
> +++ linux-2.6.14-rc2-rt4/include/linux/timer.h
> @@ -91,6 +91,6 @@ static inline void add_timer(struct time
>
> extern void init_timers(void);
> extern void run_local_timers(void);
> -extern void it_real_fn(unsigned long);
> +extern void it_real_fn(void *);
>
> #endif
> Index: linux-2.6.14-rc2-rt4/init/main.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/init/main.c
> +++ linux-2.6.14-rc2-rt4/init/main.c
> @@ -485,6 +485,7 @@ asmlinkage void __init start_kernel(void
> init_IRQ();
> pidhash_init();
> init_timers();
> + init_ktimers();
> softirq_init();
> time_init();
>
> Index: linux-2.6.14-rc2-rt4/kernel/Makefile
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/Makefile
> +++ linux-2.6.14-rc2-rt4/kernel/Makefile
> @@ -7,7 +7,8 @@ obj-y = sched.o fork.o exec_domain.o
> sysctl.o capability.o ptrace.o timer.o user.o \
> signal.o sys.o kmod.o workqueue.o pid.o \
> rcupdate.o intermodule.o extable.o params.o posix-timers.o \
> - kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o
> + kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
> + ktimers.o
>
> obj-$(CONFIG_FUTEX) += futex.o
> obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
> Index: linux-2.6.14-rc2-rt4/kernel/exit.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/exit.c
> +++ linux-2.6.14-rc2-rt4/kernel/exit.c
> @@ -842,7 +842,7 @@ fastcall NORET_TYPE void do_exit(long co
> update_mem_hiwater(tsk);
> group_dead = atomic_dec_and_test(&tsk->signal->live);
> if (group_dead) {
> - del_timer_sync(&tsk->signal->real_timer);
> + stop_ktimer(&tsk->signal->real_timer);
> acct_process(code);
> }
> exit_mm(tsk);
> Index: linux-2.6.14-rc2-rt4/kernel/fork.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/fork.c
> +++ linux-2.6.14-rc2-rt4/kernel/fork.c
> @@ -804,10 +804,9 @@ static inline int copy_signal(unsigned l
> init_sigpending(&sig->shared_pending);
> INIT_LIST_HEAD(&sig->posix_timers);
>
> - sig->it_real_value = sig->it_real_incr = 0;
> + init_ktimer_mono(&sig->real_timer);
> sig->real_timer.function = it_real_fn;
> - sig->real_timer.data = (unsigned long) tsk;
> - init_timer(&sig->real_timer);
> + sig->real_timer.data = tsk;
>
> sig->it_virt_expires = cputime_zero;
> sig->it_virt_incr = cputime_zero;
> Index: linux-2.6.14-rc2-rt4/kernel/itimer.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/itimer.c
> +++ linux-2.6.14-rc2-rt4/kernel/itimer.c
> @@ -12,36 +12,22 @@
> #include <linux/syscalls.h>
> #include <linux/time.h>
> #include <linux/posix-timers.h>
> +#include <linux/ktimer.h>
>
> #include <asm/uaccess.h>
>
> -static unsigned long it_real_value(struct signal_struct *sig)
> -{
> - unsigned long val = 0;
> - if (timer_pending(&sig->real_timer)) {
> - val = sig->real_timer.expires - jiffies;
> -
> - /* look out for negative/zero itimer.. */
> - if ((long) val <= 0)
> - val = 1;
> - }
> - return val;
> -}
> -
> int do_getitimer(int which, struct itimerval *value)
> {
> struct task_struct *tsk = current;
> - unsigned long interval, val;
> + ktime_t interval, val;
> cputime_t cinterval, cval;
>
> switch (which) {
> case ITIMER_REAL:
> - spin_lock_irq(&tsk->sighand->siglock);
> - interval = tsk->signal->it_real_incr;
> - val = it_real_value(tsk->signal);
> - spin_unlock_irq(&tsk->sighand->siglock);
> - jiffies_to_timeval(val, &value->it_value);
> - jiffies_to_timeval(interval, &value->it_interval);
> + interval = tsk->signal->real_timer.interval;
> + val = get_remtime_ktimer(&tsk->signal->real_timer, NSEC_PER_USEC);
> + value->it_value = ktime_to_timeval(val);
> + value->it_interval = ktime_to_timeval(interval);
> break;
> case ITIMER_VIRTUAL:
> read_lock(&tasklist_lock);
> @@ -113,59 +99,35 @@ asmlinkage long sys_getitimer(int which,
> }
>
>
> -void it_real_fn(unsigned long __data)
> +/*
> + * The timer is automagically restarted when interval != 0
> + */
> +void it_real_fn(void *data)
> {
> - struct task_struct * p = (struct task_struct *) __data;
> - unsigned long inc = p->signal->it_real_incr;
> -
> - send_group_sig_info(SIGALRM, SEND_SIG_PRIV, p);
> -
> - /*
> - * Now restart the timer if necessary. We don't need any locking
> - * here because do_setitimer makes sure we have finished running
> - * before it touches anything.
> - * Note, we KNOW we are (or should be) at a jiffie edge here so
> - * we don't need the +1 stuff. Also, we want to use the prior
> - * expire value so as to not "slip" a jiffie if we are late.
> - * Deal with requesting a time prior to "now" here rather than
> - * in add_timer.
> - */
> - if (!inc)
> - return;
> - while (time_before_eq(p->signal->real_timer.expires, jiffies))
> - p->signal->real_timer.expires += inc;
> - add_timer(&p->signal->real_timer);
> + send_group_sig_info(SIGALRM, SEND_SIG_PRIV, data);
> }
>
> int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue)
> {
> struct task_struct *tsk = current;
> - unsigned long val, interval, expires;
> + struct ktimer *timer;
> + ktime_t expires;
> cputime_t cval, cinterval, nval, ninterval;
>
> switch (which) {
> case ITIMER_REAL:
> -again:
> - spin_lock_irq(&tsk->sighand->siglock);
> - interval = tsk->signal->it_real_incr;
> - val = it_real_value(tsk->signal);
> - /* We are sharing ->siglock with it_real_fn() */
> - if (try_to_del_timer_sync(&tsk->signal->real_timer) < 0) {
> - spin_unlock_irq(&tsk->sighand->siglock);
> - goto again;
> - }
> - tsk->signal->it_real_incr =
> - timeval_to_jiffies(&value->it_interval);
> - expires = timeval_to_jiffies(&value->it_value);
> - if (expires)
> - mod_timer(&tsk->signal->real_timer,
> - jiffies + 1 + expires);
> - spin_unlock_irq(&tsk->sighand->siglock);
> + timer = &tsk->signal->real_timer;
> + stop_ktimer(timer);
> if (ovalue) {
> - jiffies_to_timeval(val, &ovalue->it_value);
> - jiffies_to_timeval(interval,
> - &ovalue->it_interval);
> - }
> + ovalue->it_value = ktime_to_timeval(
> + get_remtime_ktimer(timer, NSEC_PER_USEC));
> + ovalue->it_interval = ktime_to_timeval(timer->interval);
> + }
> + timer->interval = ktimer_convert_timeval(timer, &value->it_interval);
> + expires = ktimer_convert_timeval(timer, &value->it_value);
> + if (ktime_cmp_val(expires, != , KTIME_ZERO))
> + modify_ktimer(timer, &expires, KTIMER_REL | KTIMER_NOCHECK);
> +
> break;
> case ITIMER_VIRTUAL:
> nval = timeval_to_cputime(&value->it_value);
> Index: linux-2.6.14-rc2-rt4/kernel/ktimers.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.14-rc2-rt4/kernel/ktimers.c
> @@ -0,0 +1,824 @@
> +/*
> + * linux/kernel/ktimers.c
> + *
> + * Copyright(C) 2005 Thomas Gleixner <[email protected]>
> + *
> + * Kudos to Ingo Molnar for review, criticism, ideas
> + *
> + * Credits:
> + * Lot of ideas and implementation details taken from
> + * timer.c and related code
> + *
> + * Kernel timers
> + *
> + * In contrast to the timeout related API found in kernel/timer.c,
> + * ktimers provide finer resolution and accuracy depending on system
> + * configuration and capabilities.
> + *
> + * These timers are used for
> + * - itimers
> + * - posixtimers
> + * - nanosleep
> + * - precise in kernel timing
> + *
> + * Please do not abuse this API for simple timeouts.
> + *
> + * For licencing details see kernel-base/COPYING
> + *
> + */
> +
> +#include <linux/cpu.h>
> +#include <linux/interrupt.h>
> +#include <linux/ktimer.h>
> +#include <linux/module.h>
> +#include <linux/notifier.h>
> +#include <linux/percpu.h>
> +#include <linux/syscalls.h>
> +
> +#include <asm/uaccess.h>
> +
> +static ktime_t get_ktime_mono(void);
> +static ktime_t get_ktime_real(void);
> +
> +/* The time bases */
> +#define MAX_KTIMER_BASES 2
> +static DEFINE_PER_CPU(struct ktimer_base, ktimer_bases[MAX_KTIMER_BASES]) =
> +{
> + {
> + .index = CLOCK_REALTIME,
> + .name = "Realtime",
> + .get_time = &get_ktime_real,
> + .resolution = KTIME_REALTIME_RES,
> + },
> + {
> + .index = CLOCK_MONOTONIC,
> + .name = "Monotonic",
> + .get_time = &get_ktime_mono,
> + .resolution = KTIME_MONOTONIC_RES,
> + },
> +};
> +
> +/*
> + * The SMP/UP kludge goes here
> + */
> +#if defined(CONFIG_SMP)
> +
> +#define set_running_timer(b,t) b->running_timer = t
> +#define wake_up_timer_waiters(b) wake_up(&b->wait_for_running_timer)
> +#define ktimer_base_can_change (1)
> +/*
> + * Wait for a running timer
> + */
> +void wait_for_ktimer(struct ktimer *timer)
> +{
> + struct ktimer_base *base = timer->base;
> +
> + if (base && base->running_timer == timer)
> + wait_event(base->wait_for_running_timer,
> + base->running_timer != timer);
> +}
> +
> +/*
> + * We are using hashed locking: holding per_cpu(ktimer_bases)[n].lock
> + * means that all timers which are tied to this base via timer->base are
> + * locked, and the base itself is locked too.
> + *
> + * So __run_timers/migrate_timers can safely modify all timers which could
> + * be found on the lists/queues.
> + *
> + * When the timer's base is locked, and the timer removed from list, it is
> + * possible to set timer->base = NULL and drop the lock: the timer remains
> + * locked.
> + */
> +static inline struct ktimer_base *lock_ktimer_base(struct ktimer *timer,
> + unsigned long *flags)
> +{
> + struct ktimer_base *base;
> +
> + for (;;) {
> + base = timer->base;
> + if (likely(base != NULL)) {
> + spin_lock_irqsave(&base->lock, *flags);
> + if (likely(base == timer->base))
> + return base;
> + /* The timer has migrated to another CPU */
> + spin_unlock_irqrestore(&base->lock, *flags);
> + }
> + cpu_relax();
> + }
> +}
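
The retry loop in lock_ktimer_base() is the classic pattern for locking an object that can migrate between containers: snapshot the base pointer, take that base's lock, then re-check that the pointer did not change underneath us. A user-space analogue with pthreads, ignoring the memory-ordering details that the kernel handles differently (build with -lpthread; all names are illustrative):

	#include <pthread.h>
	#include <sched.h>

	struct base  { pthread_mutex_t lock; };
	struct timer { struct base *base; };

	/* Lock whatever base the timer currently points at, then verify
	 * it did not migrate while we were taking the lock. */
	static struct base *lock_base(struct timer *t)
	{
		struct base *b;

		for (;;) {
			b = t->base;			/* snapshot */
			if (b) {
				pthread_mutex_lock(&b->lock);
				if (b == t->base)	/* still ours ? */
					return b;
				/* migrated meanwhile, drop and retry */
				pthread_mutex_unlock(&b->lock);
			}
			sched_yield();			/* stand-in for cpu_relax() */
		}
	}

	int main(void)
	{
		struct base b;
		struct timer t;

		pthread_mutex_init(&b.lock, NULL);
		t.base = &b;
		pthread_mutex_unlock(lock_base(&t));	/* lock, verify, unlock */
		return 0;
	}
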
> +
> +static inline struct ktimer_base *switch_ktimer_base(struct ktimer *timer,
> + struct ktimer_base *base)
> +{
> + int ktidx = base->index;
> + struct ktimer_base *new_base = &__get_cpu_var(ktimer_bases[ktidx]);
> +
> + if (base != new_base) {
> + /*
> + * We are trying to schedule the timer on the local CPU.
> + * However we can't change timer's base while it is running,
> + * so we keep it on the same CPU. No hassle vs. reprogramming
> + * the event source in the high resolution case. The softirq
> + * code will take care of this when the timer function has
> + * completed. There is no conflict as we hold the lock until
> + * the timer is enqueued.
> + */
> + if (unlikely(base->running_timer == timer)) {
> + return base;
> + } else {
> + /* See the comment in lock_timer_base() */
> + timer->base = NULL;
> + spin_unlock(&base->lock);
> + spin_lock(&new_base->lock);
> + timer->base = new_base;
> + }
> + }
> + return new_base;
> +}
> +
> +/*
> + * Get the timer base unlocked
> + *
> + * Take care of timer->base = NULL in switch_ktimer_base !
> + */
> +static inline struct ktimer_base *get_ktimer_base_unlocked(struct ktimer *timer)
> +{
> + struct ktimer_base *base;
> + while (!(base = timer->base));
> + return base;
> +}
> +#else
> +
> +#define set_running_timer(b,t) do {} while (0)
> +#define wake_up_timer_waiters(b) do {} while (0)
> +
> +static inline struct ktimer_base *lock_ktimer_base(struct ktimer *timer,
> + unsigned long *flags)
> +{
> + struct ktimer_base *base;
> +
> + base = timer->base;
> + spin_lock_irqsave(&base->lock, *flags);
> + return base;
> +}
> +
> +#define switch_ktimer_base(t, b) b
> +
> +#define get_ktimer_base_unlocked(t) (t)->base
> +#define ktimer_base_can_change (0)
> +
> +#endif /* !CONFIG_SMP */
> +
> +/*
> + * Convert timespec to ktime_t with resolution adjustment
> + *
> + * Note: We can access base without locking here, as ktimers can
> + * migrate between CPUs but can not be moved from one clock source to
> + * another. The clock source binding is set at init_ktimer_XXX.
> + */
> +ktime_t ktimer_convert_timespec(struct ktimer *timer, struct timespec *ts)
> +{
> + struct ktimer_base *base = get_ktimer_base_unlocked(timer);
> + ktime_t t;
> + long rem = ts->tv_nsec % base->resolution;
> +
> + t = ktime_set(ts->tv_sec, ts->tv_nsec);
> +
> + /* Check, if the value has to be rounded */
> + if (rem)
> + t = ktime_add_ns(t, base->resolution - rem);
> + return t;
> +}
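
ktimer_convert_timespec() rounds the requested time up to the next multiple of the base resolution, so a timer never fires earlier than asked for. The rounding step in isolation, as a plain C sketch (round_to_res() is an illustrative name; the kernel version does the addition on the full ktime value via ktime_add_ns()):

	#include <assert.h>

	/* Round nsec up to the next multiple of res, like the rem/resolution
	 * adjustment in ktimer_convert_timespec(). */
	static long round_to_res(long nsec, long res)
	{
		long rem = nsec % res;

		return rem ? nsec + (res - rem) : nsec;
	}

	int main(void)
	{
		/* With a 1ms (1000000ns) resolution, 1.2ms is pushed to 2ms;
		 * an exact multiple stays untouched. */
		assert(round_to_res(1200000, 1000000) == 2000000);
		assert(round_to_res(2000000, 1000000) == 2000000);
		return 0;
	}
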
> +
> +/*
> + * Convert timeval to ktime_t with resolution adjustment
> + */
> +ktime_t ktimer_convert_timeval(struct ktimer *timer, struct timeval *tv)
> +{
> + struct timespec ts;
> +
> + ts.tv_sec = tv->tv_sec;
> + ts.tv_nsec = tv->tv_usec * NSEC_PER_USEC;
> +
> + return ktimer_convert_timespec(timer, &ts);
> +}
> +
> +/*
> + * Internal function to enqueue / (re)start a timer
> + *
> + * The timer is inserted in expiry order.
> + * Insertion into the red black tree is O(log(n))
> + *
> + */
> +static int enqueue_ktimer(struct ktimer *timer, struct ktimer_base *base,
> + ktime_t *tim, int mode)
> +{
> + struct rb_node **link = &base->active.rb_node;
> + struct rb_node *parent = NULL;
> + struct ktimer *entry;
> + struct list_head *prev = &base->pending;
> + ktime_t now;
> +
> + /* Get current time */
> + now = base->get_time();
> +
> + /* Timer expiry mode */
> + switch (mode & ~KTIMER_NOCHECK) {
> + case KTIMER_ABS:
> + timer->expires = *tim;
> + break;
> + case KTIMER_REL:
> + timer->expires = ktime_add(now, *tim);
> + break;
> + case KTIMER_INCR:
> + timer->expires = ktime_add(timer->expires, *tim);
> + break;
> + case KTIMER_FORWARD:
> + while ktime_cmp(timer->expires, <= , now) {
> + timer->expires = ktime_add(timer->expires, *tim);
> + timer->overrun++;
> + }
> + goto nocheck;
> + case KTIMER_REARM:
> + while ktime_cmp(timer->expires, <= , now) {
> + timer->expires = ktime_add(timer->expires, timer->interval);
> + timer->overrun++;
> + }
> + goto nocheck;
> + case KTIMER_RESTART:
> + break;
> + default:
> + BUG();
> + }
> +
> + /* Already expired.*/
> + if ktime_cmp(timer->expires, <=, now) {
> + timer->expired = now;
> + /* The caller takes care of expiry */
> + if (!(mode & KTIMER_NOCHECK))
> + return -1;
> + }
> + nocheck:
> +
> + while (*link) {
> + parent = *link;
> + entry = rb_entry(parent, struct ktimer, node);
> + /*
> + * We don't care about collisions. Nodes with
> + * the same expiry time stay together.
> + */
> + if (ktimer_before(timer, entry))
> + link = &(*link)->rb_left;
> + else {
> + link = &(*link)->rb_right;
> + prev = &entry->list;
> + }
> + }
> +
> + rb_link_node(&timer->node, parent, link);
> + rb_insert_color(&timer->node, &base->active);
> + list_add(&timer->list, prev);
> + timer->status = KTIMER_PENDING;
> + base->count++;
> + return 0;
> +}
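
The mode switch at the top of enqueue_ktimer() is the complete expiry policy: KTIMER_ABS takes the value as given, KTIMER_REL adds it to the current base time, KTIMER_INCR adds it to the previous expiry, and KTIMER_FORWARD/KTIMER_REARM push the expiry past now in interval steps while counting overruns. The same arithmetic in scalar nanoseconds, as a rough user-space sketch (the names and the plain int64 representation are illustrative only, not the patch's ktime_t accessors):

	#include <stdint.h>

	enum mode { ABS, REL, INCR, FORWARD };

	/* Compute the new expiry in plain nanoseconds; *overrun is bumped
	 * by the number of skipped periods in FORWARD mode. */
	static int64_t new_expiry(enum mode m, int64_t expires, int64_t now,
				  int64_t arg, unsigned long *overrun)
	{
		switch (m) {
		case ABS:			/* absolute time */
			return arg;
		case REL:			/* relative to now */
			return now + arg;
		case INCR:			/* relative to old expiry */
			return expires + arg;
		case FORWARD:			/* catch up past now */
			while (expires <= now) {
				expires += arg;	/* arg = interval */
				(*overrun)++;
			}
			return expires;
		}
		return expires;
	}

	int main(void)
	{
		unsigned long orun = 0;
		/* due at t=100, it is now t=450, interval 100 -> 4 periods missed */
		int64_t exp = new_expiry(FORWARD, 100, 450, 100, &orun);

		return !(exp == 500 && orun == 4);
	}
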
> +
> +/*
> + * Internal helper to remove a timer
> + *
> + * The function allows automatic rearming for interval
> + * timers.
> + *
> + */
> +static inline void do_remove_ktimer(struct ktimer *timer,
> + struct ktimer_base *base, int rearm)
> +{
> + list_del(&timer->list);
> + rb_erase(&timer->node, &base->active);
> + timer->node.rb_parent = KTIMER_POISON;
> + timer->status = KTIMER_INACTIVE;
> + base->count--;
> + BUG_ON(base->count < 0);
> + /* Auto rearm the timer ? */
> + if (rearm && ktime_cmp_val(timer->interval, !=, KTIME_ZERO))
> + enqueue_ktimer(timer, base, NULL, KTIMER_REARM);
> +}
> +
> +/*
> + * Called with base lock held
> + */
> +static inline int remove_ktimer(struct ktimer *timer, struct ktimer_base *base)
> +{
> + if (ktimer_active(timer)) {
> + do_remove_ktimer(timer, base, KTIMER_NOREARM);
> + return 1;
> + }
> + return 0;
> +}
> +
> +/*
> + * Internal function to (re)start a timer.
> + */
> +static int internal_restart_ktimer(struct ktimer *timer, ktime_t *tim,
> + int mode)
> +{
> + struct ktimer_base *base, *new_base;
> + unsigned long flags;
> + int ret;
> +
> + BUG_ON(!timer->function);
> +
> + base = lock_ktimer_base(timer, &flags);
> +
> + /* Remove an active timer from the queue */
> + ret = remove_ktimer(timer, base);
> +
> + /* Switch the timer base, if necessary */
> + new_base = switch_ktimer_base(timer, base);
> +
> + /*
> + * When the new timer setting is already expired,
> + * let the calling code deal with it.
> + */
> + if (enqueue_ktimer(timer, new_base, tim, mode))
> + ret = -1;
> +
> + spin_unlock_irqrestore(&new_base->lock, flags);
> + return ret;
> +}
> +
> +/***
> + * modify_ktimer - modify a running timer
> + * @timer: the timer to be modified
> + * @tim: expiry time (required)
> + * @mode: timer setup mode
> + *
> + */
> +int modify_ktimer(struct ktimer *timer, ktime_t *tim, int mode)
> +{
> + BUG_ON(!tim || !timer->function);
> + return internal_restart_ktimer(timer, tim, mode);
> +}
> +
> +/***
> + * start_ktimer - start a timer on current CPU
> + * @timer: the timer to be added
> + * @tim: expiry time (optional, if not set in the timer)
> + * @mode: timer setup mode
> + */
> +int start_ktimer(struct ktimer *timer, ktime_t *tim, int mode)
> +{
> + BUG_ON(ktimer_active(timer) || !timer->function);
> +
> + return internal_restart_ktimer(timer, tim, mode);
> +}
> +
> +/***
> + * try_to_stop_ktimer - try to deactivate a timer
> + */
> +int try_to_stop_ktimer(struct ktimer *timer)
> +{
> + struct ktimer_base *base;
> + unsigned long flags;
> + int ret = -1;
> +
> + base = lock_ktimer_base(timer, &flags);
> +
> + if (base->running_timer != timer) {
> + ret = remove_ktimer(timer, base);
> + if (ret)
> + timer->expired = base->get_time();
> + }
> +
> + spin_unlock_irqrestore(&base->lock, flags);
> +
> + return ret;
> +
> +}
> +
> +/***
> + * stop_timer_sync - deactivate a timer and wait for the handler to finish.
> + * @timer: the timer to be deactivated
> + *
> + */
> +int stop_ktimer(struct ktimer *timer)
> +{
> + for (;;) {
> + int ret = try_to_stop_ktimer(timer);
> + if (ret >= 0)
> + return ret;
> + wait_for_ktimer(timer);
> + }
> +}
> +
> +/***
> + * get_remtime_ktimer - get remaining time for the timer
> + * @timer: the timer to read
> + * @fake: when fake > 0, a pending but already expired timer
> + * reports fake nanoseconds instead of 0 (itimers need this)
> + */
> +ktime_t get_remtime_ktimer(struct ktimer *timer, long fake)
> +{
> + struct ktimer_base *base;
> + unsigned long flags;
> + ktime_t rem;
> +
> + base = lock_ktimer_base(timer, &flags);
> + if (ktimer_active(timer)) {
> + rem = ktime_sub(timer->expires, base->get_time());
> + if (fake && ktime_cmp_val(rem, <=, KTIME_ZERO))
> + rem = ktime_set(0, fake);
> + } else {
> + if (!fake)
> + rem = ktime_sub(timer->expires, base->get_time());
> + else
> + ktime_set_zero(rem);
> + }
> + spin_unlock_irqrestore(&base->lock, flags);
> + return rem;
> +}
> +
> +/***
> + * get_expiry_ktimer - get expiry time for the timer
> + * @timer: the timer to read
> + * @now: if != NULL store current base->time
> + */
> +ktime_t get_expiry_ktimer(struct ktimer *timer, ktime_t *now)
> +{
> + struct ktimer_base *base;
> + unsigned long flags;
> + ktime_t expiry;
> +
> + base = lock_ktimer_base(timer, &flags);
> + expiry = timer->expires;
> + if (now)
> + *now = base->get_time();
> + spin_unlock_irqrestore(&base->lock, flags);
> + return expiry;
> +}
> +
> +/*
> + * Functions related to clock sources
> + */
> +
> +static inline void ktimer_common_init(struct ktimer *timer)
> +{
> + memset(timer, 0, sizeof(struct ktimer));
> + timer->node.rb_parent = KTIMER_POISON;
> +}
> +
> +/*
> + * Get monotonic time
> + */
> +static ktime_t get_ktime_mono(void)
> +{
> + return do_get_ktime_mono();
> +}
> +
> +/***
> + * init_ktimer_mono - initialize a timer on monotonic time
> + * @timer: the timer to be initialized
> + *
> + */
> +void fastcall init_ktimer_mono(struct ktimer *timer)
> +{
> + ktimer_common_init(timer);
> + timer->base =
> + &per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_MONOTONIC];
> +}
> +
> +/***
> + * get_ktimer_mono_res - get the monotonic timer resolution
> + *
> + */
> +int get_ktimer_mono_res(clockid_t which_clock, struct timespec *tp)
> +{
> + tp->tv_sec = 0;
> + tp->tv_nsec =
> + per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_MONOTONIC].resolution;
> + return 0;
> +}
> +
> +/*
> + * Get real time
> + */
> +static ktime_t get_ktime_real(void)
> +{
> + return do_get_ktime_real();
> +}
> +
> +/***
> + * init_ktimer_real - initialize a timer on real time
> + * @timer: the timer to be initialized
> + *
> + */
> +void fastcall init_ktimer_real(struct ktimer *timer)
> +{
> + ktimer_common_init(timer);
> + timer->base =
> + &per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_REALTIME];
> +}
> +
> +/***
> + * get_ktimer_real_res - get the real timer resolution
> + *
> + */
> +int get_ktimer_real_res(clockid_t which_clock, struct timespec *tp)
> +{
> + tp->tv_sec = 0;
> + tp->tv_nsec =
> + per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_REALTIME].resolution;
> + return 0;
> +}
> +
> +/*
> + * The per base runqueue
> + */
> +static inline void run_ktimer_queue(struct ktimer_base *base)
> +{
> + ktime_t now = base->get_time();
> +
> + spin_lock_irq(&base->lock);
> + while (!list_empty(&base->pending)) {
> + void (*fn)(void *);
> + void *data;
> + struct ktimer *timer = list_entry(base->pending.next,
> + struct ktimer, list);
> + if ktime_cmp(now, <=, timer->expires)
> + break;
> + timer->expired = now;
> + fn = timer->function;
> + data = timer->data;
> + set_running_timer(base, timer);
> + do_remove_ktimer(timer, base, KTIMER_REARM);
> + spin_unlock_irq(&base->lock);
> + fn(data);
> + spin_lock_irq(&base->lock);
> + set_running_timer(base, NULL);
> + }
> + spin_unlock_irq(&base->lock);
> + wake_up_timer_waiters(base);
> +}
> +
> +/*
> + * Called from timer softirq every jiffy
> + */
> +void run_ktimer_queues(void)
> +{
> + struct ktimer_base *base = __get_cpu_var(ktimer_bases);
> + int i;
> +
> + for (i = 0; i < MAX_KTIMER_BASES; i++)
> + run_ktimer_queue(&base[i]);
> +}
> +
> +/*
> + * Functions related to initialization
> + */
> +static void __devinit init_ktimers_cpu(int cpu)
> +{
> + struct ktimer_base *base = per_cpu(ktimer_bases, cpu);
> + int i;
> +
> + for (i = 0; i < MAX_KTIMER_BASES; i++) {
> + spin_lock_init(&base->lock);
> + INIT_LIST_HEAD(&base->pending);
> + init_waitqueue_head(&base->wait_for_running_timer);
> + base++;
> + }
> +}
> +
> +#ifdef CONFIG_HOTPLUG_CPU
> +static void migrate_ktimer_list(struct ktimer_base *old_base,
> + struct ktimer_base *new_base)
> +{
> + struct ktimer *timer;
> + struct rb_node *node;
> +
> + while ((node = rb_first(&old_base->active))) {
> + timer = rb_entry(node, struct ktimer, node);
> + remove_ktimer(timer, old_base);
> + timer->base = new_base;
> + enqueue_ktimer(timer, new_base, NULL, KTIMER_RESTART);
> + }
> +}
> +
> +static void __devinit migrate_ktimers(int cpu)
> +{
> + struct ktimer_base *old_base;
> + struct ktimer_base *new_base;
> + int i;
> +
> + BUG_ON(cpu_online(cpu));
> + old_base = per_cpu(ktimer_bases, cpu);
> + new_base = get_cpu_var(ktimer_bases);
> +
> + local_irq_disable();
> +
> + for (i = 0; i < MAX_KTIMER_BASES; i++) {
> +
> + spin_lock(&new_base->lock);
> + spin_lock(&old_base->lock);
> +
> + if (old_base->running_timer)
> + BUG();
> +
> + migrate_ktimer_list(old_base, new_base);
> +
> + spin_unlock(&old_base->lock);
> + spin_unlock(&new_base->lock);
> + old_base++;
> + new_base++;
> + }
> +
> + local_irq_enable();
> + &put_cpu_var(ktimer_bases);
> +}
> +#endif /* CONFIG_HOTPLUG_CPU */
> +
> +static int __devinit ktimer_cpu_notify(struct notifier_block *self,
> + unsigned long action, void *hcpu)
> +{
> + long cpu = (long)hcpu;
> + switch(action) {
> + case CPU_UP_PREPARE:
> + init_ktimers_cpu(cpu);
> + break;
> +#ifdef CONFIG_HOTPLUG_CPU
> + case CPU_DEAD:
> + migrate_ktimers(cpu);
> + break;
> +#endif
> + default:
> + break;
> + }
> + return NOTIFY_OK;
> +}
> +
> +static struct notifier_block __devinitdata ktimers_nb = {
> + .notifier_call = ktimer_cpu_notify,
> +};
> +
> +void __init init_ktimers(void)
> +{
> + ktimer_cpu_notify(&ktimers_nb, (unsigned long)CPU_UP_PREPARE,
> + (void *)(long)smp_processor_id());
> + register_cpu_notifier(&ktimers_nb);
> +}
> +
> +/*
> + * system interface related functions
> + */
> +static void process_ktimer(void *data)
> +{
> + wake_up_process(data);
> +}
> +
> +/**
> + * schedule_ktimer - sleep until timeout
> + * @timer: the timer to use for the sleep
> + * @t: expiry time, updated to the absolute expiry time on return
> + * @state: task state to use for the sleep
> + * @mode: timer mode (KTIMER_ABS/KTIMER_REL)
> + *
> + * Make the current task sleep until @t has elapsed.
> + *
> + * You can set the task state as follows -
> + *
> + * %TASK_UNINTERRUPTIBLE - at least @t is guaranteed to
> + * pass before the routine returns. The routine will return 0.
> + *
> + * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
> + * delivered to the current task. In this case the remaining time
> + * will be returned.
> + *
> + * The current task state is guaranteed to be TASK_RUNNING when this
> + * routine returns.
> + *
> + */
> +static fastcall ktime_t __sched schedule_ktimer(struct ktimer *timer,
> + ktime_t *t, int state, int mode)
> +{
> + timer->data = current;
> + timer->function = process_ktimer;
> +
> + current->state = state;
> + if (start_ktimer(timer, t, mode)) {
> + current->state = TASK_RUNNING;
> + goto out;
> + }
> + if (current->state != TASK_RUNNING)
> + schedule();
> + stop_ktimer(timer);
> + out:
> + /* Store the absolute expiry time */
> + *t = timer->expires;
> + /* Return the remaining time */
> + return ktime_sub(timer->expires, timer->expired);
> +}
> +
> +static long __sched nanosleep_restart(struct ktimer *timer,
> + struct restart_block *restart)
> +{
> + struct timespec tu;
> + ktime_t t, rem;
> + void *rfn = restart->fn;
> + struct timespec __user *rmtp = (struct timespec __user *) restart->arg2;
> +
> + restart->fn = do_no_restart_syscall;
> +
> + t = ktime_set_low_high(restart->arg0, restart->arg1);
> +
> + rem = schedule_ktimer(timer, &t, TASK_INTERRUPTIBLE, KTIMER_ABS);
> +
> + if (ktime_cmp_val(rem, <=, KTIME_ZERO))
> + return 0;
> +
> + tu = ktime_to_timespec(rem);
> + if (rmtp && copy_to_user(rmtp, &tu, sizeof(tu)))
> + return -EFAULT;
> +
> + restart->fn = rfn;
> + /* The other values in restart are already filled in */
> + return -ERESTART_RESTARTBLOCK;
> +}
> +
> +static long __sched nanosleep_restart_mono(struct restart_block *restart)
> +{
> + struct ktimer timer;
> +
> + init_ktimer_mono(&timer);
> + return nanosleep_restart(&timer, restart);
> +}
> +
> +static long __sched nanosleep_restart_real(struct restart_block *restart)
> +{
> + struct ktimer timer;
> +
> + init_ktimer_real(&timer);
> + return nanosleep_restart(&timer, restart);
> +}
> +
> +static long ktimer_nanosleep(struct ktimer *timer, struct timespec *rqtp,
> + struct timespec __user *rmtp, int mode,
> + long (*rfn)(struct restart_block *))
> +{
> + struct timespec tu;
> + ktime_t rem, t;
> + struct restart_block *restart;
> +
> + t = ktimer_convert_timespec(timer, rqtp);
> +
> + /* t is updated to absolute expiry time ! */
> + rem = schedule_ktimer(timer, &t, TASK_INTERRUPTIBLE, mode);
> +
> + if (ktime_cmp_val(rem, <=, KTIME_ZERO))
> + return 0;
> +
> + tu = ktime_to_timespec(rem);
> +
> + if (rmtp && copy_to_user(rmtp, &tu, sizeof(tu)))
> + return -EFAULT;
> +
> + restart = &current_thread_info()->restart_block;
> + restart->fn = rfn;
> + restart->arg0 = ktime_get_low(t);
> + restart->arg1 = ktime_get_high(t);
> + restart->arg2 = (unsigned long) rmtp;
> + return -ERESTART_RESTARTBLOCK;
> +
> +}
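
The absolute expiry has to be parked in the restart block, which only offers unsigned long slots, so it is split into a low and a high word (ktime_get_low()/ktime_get_high()) and reassembled by ktime_set_low_high() in nanosleep_restart(). The round trip in plain C, assuming a 32bit unsigned long as in the interesting case (the helper names here are illustrative):

	#include <assert.h>
	#include <stdint.h>

	/* Split a 64-bit nanosecond value into two 32-bit words and back,
	 * the way the expiry is stored in restart_block arg0/arg1. */
	static uint32_t get_low(uint64_t t)  { return (uint32_t)t; }
	static uint32_t get_high(uint64_t t) { return (uint32_t)(t >> 32); }

	static uint64_t set_low_high(uint32_t lo, uint32_t hi)
	{
		return ((uint64_t)hi << 32) | lo;
	}

	int main(void)
	{
		uint64_t expiry = 123456789012345ULL;	/* some absolute time in ns */

		assert(set_low_high(get_low(expiry), get_high(expiry)) == expiry);
		return 0;
	}
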
> +
> +long ktimer_nanosleep_mono(struct timespec *rqtp,
> + struct timespec __user *rmtp, int mode)
> +{
> + struct ktimer timer;
> +
> + init_ktimer_mono(&timer);
> + return ktimer_nanosleep(&timer, rqtp, rmtp, mode, nanosleep_restart_mono);
> +}
> +
> +long ktimer_nanosleep_real(struct timespec *rqtp,
> + struct timespec __user *rmtp, int mode)
> +{
> + struct ktimer timer;
> +
> + init_ktimer_real(&timer);
> + return ktimer_nanosleep(&timer, rqtp, rmtp, mode, nanosleep_restart_real);
> +}
> +
> +asmlinkage long sys_nanosleep(struct timespec __user *rqtp,
> + struct timespec __user *rmtp)
> +{
> + struct timespec tu;
> +
> + if (copy_from_user(&tu, rqtp, sizeof(tu)))
> + return -EFAULT;
> +
> + if (!timespec_valid(&tu))
> + return -EINVAL;
> +
> + return ktimer_nanosleep_mono(&tu, rmtp, KTIMER_REL);
> +}
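
Seen from user space, the rmtp handling above is what makes an interrupted sleep resumable: on EINTR the kernel writes back the remaining time and the call can simply be reissued with it. A conventional usage loop (standard POSIX, nothing patch specific):

	#include <errno.h>
	#include <time.h>

	/* Sleep for the full requested time even if signals interrupt us. */
	static int sleep_fully(struct timespec req)
	{
		struct timespec rem;

		while (nanosleep(&req, &rem) == -1) {
			if (errno != EINTR)
				return -1;
			req = rem;	/* continue with what is left */
		}
		return 0;
	}

	int main(void)
	{
		struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };

		return sleep_fully(ts);	/* ~100ms, restarted on signals */
	}
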
> +
> Index: linux-2.6.14-rc2-rt4/kernel/posix-cpu-timers.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/posix-cpu-timers.c
> +++ linux-2.6.14-rc2-rt4/kernel/posix-cpu-timers.c
> @@ -1394,7 +1394,7 @@ void set_process_cpu_timer(struct task_s
> static long posix_cpu_clock_nanosleep_restart(struct restart_block *);
>
> int posix_cpu_nsleep(clockid_t which_clock, int flags,
> - struct timespec *rqtp)
> + struct timespec *rqtp, struct timespec __user *rmtp)
> {
> struct restart_block *restart_block =
> &current_thread_info()->restart_block;
> @@ -1419,7 +1419,6 @@ int posix_cpu_nsleep(clockid_t which_clo
> error = posix_cpu_timer_create(&timer);
> timer.it_process = current;
> if (!error) {
> - struct timespec __user *rmtp;
> static struct itimerspec zero_it;
> struct itimerspec it = { .it_value = *rqtp,
> .it_interval = {} };
> @@ -1466,7 +1465,6 @@ int posix_cpu_nsleep(clockid_t which_clo
> /*
> * Report back to the user the time still remaining.
> */
> - rmtp = (struct timespec __user *) restart_block->arg1;
> if (rmtp != NULL && !(flags & TIMER_ABSTIME) &&
> copy_to_user(rmtp, &it.it_value, sizeof *rmtp))
> return -EFAULT;
> @@ -1474,6 +1472,7 @@ int posix_cpu_nsleep(clockid_t which_clo
> restart_block->fn = posix_cpu_clock_nanosleep_restart;
> /* Caller already set restart_block->arg1 */
> restart_block->arg0 = which_clock;
> + restart_block->arg1 = (unsigned long) rmtp;
> restart_block->arg2 = rqtp->tv_sec;
> restart_block->arg3 = rqtp->tv_nsec;
>
> @@ -1487,10 +1486,15 @@ static long
> posix_cpu_clock_nanosleep_restart(struct restart_block *restart_block)
> {
> clockid_t which_clock = restart_block->arg0;
> - struct timespec t = { .tv_sec = restart_block->arg2,
> - .tv_nsec = restart_block->arg3 };
> + struct timespec __user *rmtp;
> + struct timespec t;
> +
> + rmtp = (struct timespec __user *) restart_block->arg1;
> + t.tv_sec = restart_block->arg2;
> + t.tv_nsec = restart_block->arg3;
> +
> restart_block->fn = do_no_restart_syscall;
> - return posix_cpu_nsleep(which_clock, TIMER_ABSTIME, &t);
> + return posix_cpu_nsleep(which_clock, TIMER_ABSTIME, &t, rmtp);
> }
>
>
> @@ -1511,9 +1515,10 @@ static int process_cpu_timer_create(stru
> return posix_cpu_timer_create(timer);
> }
> static int process_cpu_nsleep(clockid_t which_clock, int flags,
> - struct timespec *rqtp)
> + struct timespec *rqtp,
> + struct timespec __user *rmtp)
> {
> - return posix_cpu_nsleep(PROCESS_CLOCK, flags, rqtp);
> + return posix_cpu_nsleep(PROCESS_CLOCK, flags, rqtp, rmtp);
> }
> static int thread_cpu_clock_getres(clockid_t which_clock, struct timespec *tp)
> {
> @@ -1529,7 +1534,7 @@ static int thread_cpu_timer_create(struc
> return posix_cpu_timer_create(timer);
> }
> static int thread_cpu_nsleep(clockid_t which_clock, int flags,
> - struct timespec *rqtp)
> + struct timespec *rqtp, struct timespec __user *rmtp)
> {
> return -EINVAL;
> }
> Index: linux-2.6.14-rc2-rt4/kernel/posix-timers.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/posix-timers.c
> +++ linux-2.6.14-rc2-rt4/kernel/posix-timers.c
> @@ -48,21 +48,6 @@
> #include <linux/workqueue.h>
> #include <linux/module.h>
>
> -#ifndef div_long_long_rem
> -#include <asm/div64.h>
> -
> -#define div_long_long_rem(dividend,divisor,remainder) ({ \
> - u64 result = dividend; \
> - *remainder = do_div(result,divisor); \
> - result; })
> -
> -#endif
> -#define CLOCK_REALTIME_RES TICK_NSEC /* In nano seconds. */
> -
> -static inline u64 mpy_l_X_l_ll(unsigned long mpy1,unsigned long mpy2)
> -{
> - return (u64)mpy1 * mpy2;
> -}
> /*
> * Management arrays for POSIX timers. Timers are kept in slab memory
> * Timer ids are allocated by an external routine that keeps track of the
> @@ -148,18 +133,18 @@ static DEFINE_SPINLOCK(idr_lock);
> */
>
> static struct k_clock posix_clocks[MAX_CLOCKS];
> +
> /*
> - * We only have one real clock that can be set so we need only one abs list,
> - * even if we should want to have several clocks with differing resolutions.
> + * These are defined below.
> */
> -static struct k_clock_abs abs_list = {.list = LIST_HEAD_INIT(abs_list.list),
> - .lock = SPIN_LOCK_UNLOCKED};
> +static int common_nsleep(clockid_t, int flags, struct timespec *t,
> + struct timespec __user *rmtp);
> +static void common_timer_get(struct k_itimer *, struct itimerspec *);
> +static int common_timer_set(struct k_itimer *, int,
> + struct itimerspec *, struct itimerspec *);
> +static int common_timer_del(struct k_itimer *timer);
>
> -static void posix_timer_fn(unsigned long);
> -static u64 do_posix_clock_monotonic_gettime_parts(
> - struct timespec *tp, struct timespec *mo);
> -int do_posix_clock_monotonic_gettime(struct timespec *tp);
> -static int do_posix_clock_monotonic_get(clockid_t, struct timespec *tp);
> +static void posix_timer_fn(void *data);
>
> static struct k_itimer *lock_timer(timer_t timer_id, unsigned long *flags);
>
> @@ -205,21 +190,25 @@ static inline int common_clock_set(clock
>
> static inline int common_timer_create(struct k_itimer *new_timer)
> {
> - INIT_LIST_HEAD(&new_timer->it.real.abs_timer_entry);
> - init_timer(&new_timer->it.real.timer);
> - new_timer->it.real.timer.data = (unsigned long) new_timer;
> + return -EINVAL;
> +}
> +
> +static int timer_create_mono(struct k_itimer *new_timer)
> +{
> + init_ktimer_mono(&new_timer->it.real.timer);
> + new_timer->it.real.timer.data = new_timer;
> + new_timer->it.real.timer.function = posix_timer_fn;
> + return 0;
> +}
> +
> +static int timer_create_real(struct k_itimer *new_timer)
> +{
> + init_ktimer_real(&new_timer->it.real.timer);
> + new_timer->it.real.timer.data = new_timer;
> new_timer->it.real.timer.function = posix_timer_fn;
> return 0;
> }
>
> -/*
> - * These ones are defined below.
> - */
> -static int common_nsleep(clockid_t, int flags, struct timespec *t);
> -static void common_timer_get(struct k_itimer *, struct itimerspec *);
> -static int common_timer_set(struct k_itimer *, int,
> - struct itimerspec *, struct itimerspec *);
> -static int common_timer_del(struct k_itimer *timer);
>
> /*
> * Return nonzero iff we know a priori this clockid_t value is bogus.
> @@ -239,19 +228,44 @@ static inline int invalid_clockid(clocki
> return 1;
> }
>
> +/*
> + * Get real time for posix timers
> + */
> +static int posix_get_ktime_real_ts(clockid_t which_clock, struct timespec *tp)
> +{
> + get_ktime_real_ts(tp);
> + return 0;
> +}
> +
> +/*
> + * Get monotonic time for posix timers
> + */
> +static int posix_get_ktime_mono_ts(clockid_t which_clock, struct timespec *tp)
> +{
> + get_ktime_mono_ts(tp);
> + return 0;
> +}
> +
> +void do_posix_clock_monotonic_gettime(struct timespec *ts)
> +{
> + get_ktime_mono_ts(ts);
> +}
>
> /*
> * Initialize everything, well, just everything in Posix clocks/timers ;)
> */
> static __init int init_posix_timers(void)
> {
> - struct k_clock clock_realtime = {.res = CLOCK_REALTIME_RES,
> - .abs_struct = &abs_list
> + struct k_clock clock_realtime = {
> + .clock_getres = get_ktimer_real_res,
> + .clock_get = posix_get_ktime_real_ts,
> + .timer_create = timer_create_real,
> };
> - struct k_clock clock_monotonic = {.res = CLOCK_REALTIME_RES,
> - .abs_struct = NULL,
> - .clock_get = do_posix_clock_monotonic_get,
> - .clock_set = do_posix_clock_nosettime
> + struct k_clock clock_monotonic = {
> + .clock_getres = get_ktimer_mono_res,
> + .clock_get = posix_get_ktime_mono_ts,
> + .clock_set = do_posix_clock_nosettime,
> + .timer_create = timer_create_mono,
> };
>
> register_posix_clock(CLOCK_REALTIME, &clock_realtime);
> @@ -265,117 +279,17 @@ static __init int init_posix_timers(void
>
> __initcall(init_posix_timers);
>
> -static void tstojiffie(struct timespec *tp, int res, u64 *jiff)
> -{
> - long sec = tp->tv_sec;
> - long nsec = tp->tv_nsec + res - 1;
> -
> - if (nsec > NSEC_PER_SEC) {
> - sec++;
> - nsec -= NSEC_PER_SEC;
> - }
> -
> - /*
> - * The scaling constants are defined in <linux/time.h>
> - * The difference between there and here is that we do the
> - * res rounding and compute a 64-bit result (well so does that
> - * but it then throws away the high bits).
> - */
> - *jiff = (mpy_l_X_l_ll(sec, SEC_CONVERSION) +
> - (mpy_l_X_l_ll(nsec, NSEC_CONVERSION) >>
> - (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
> -}
> -
> -/*
> - * This function adjusts the timer as needed as a result of the clock
> - * being set. It should only be called for absolute timers, and then
> - * under the abs_list lock. It computes the time difference and sets
> - * the new jiffies value in the timer. It also updates the timers
> - * reference wall_to_monotonic value. It is complicated by the fact
> - * that tstojiffies() only handles positive times and it needs to work
> - * with both positive and negative times. Also, for negative offsets,
> - * we need to defeat the res round up.
> - *
> - * Return is true if there is a new time, else false.
> - */
> -static long add_clockset_delta(struct k_itimer *timr,
> - struct timespec *new_wall_to)
> -{
> - struct timespec delta;
> - int sign = 0;
> - u64 exp;
> -
> - set_normalized_timespec(&delta,
> - new_wall_to->tv_sec -
> - timr->it.real.wall_to_prev.tv_sec,
> - new_wall_to->tv_nsec -
> - timr->it.real.wall_to_prev.tv_nsec);
> - if (likely(!(delta.tv_sec | delta.tv_nsec)))
> - return 0;
> - if (delta.tv_sec < 0) {
> - set_normalized_timespec(&delta,
> - -delta.tv_sec,
> - 1 - delta.tv_nsec -
> - posix_clocks[timr->it_clock].res);
> - sign++;
> - }
> - tstojiffie(&delta, posix_clocks[timr->it_clock].res, &exp);
> - timr->it.real.wall_to_prev = *new_wall_to;
> - timr->it.real.timer.expires += (sign ? -exp : exp);
> - return 1;
> -}
> -
> -static void remove_from_abslist(struct k_itimer *timr)
> -{
> - if (!list_empty(&timr->it.real.abs_timer_entry)) {
> - spin_lock(&abs_list.lock);
> - list_del_init(&timr->it.real.abs_timer_entry);
> - spin_unlock(&abs_list.lock);
> - }
> -}
>
> static void schedule_next_timer(struct k_itimer *timr)
> {
> - struct timespec new_wall_to;
> - struct now_struct now;
> - unsigned long seq;
> -
> - /*
> - * Set up the timer for the next interval (if there is one).
> - * Note: this code uses the abs_timer_lock to protect
> - * it.real.wall_to_prev and must hold it until exp is set, not exactly
> - * obvious...
> -
> - * This function is used for CLOCK_REALTIME* and
> - * CLOCK_MONOTONIC* timers. If we ever want to handle other
> - * CLOCKs, the calling code (do_schedule_next_timer) would need
> - * to pull the "clock" info from the timer and dispatch the
> - * "other" CLOCKs "next timer" code (which, I suppose should
> - * also be added to the k_clock structure).
> - */
> - if (!timr->it.real.incr)
> + if (ktime_cmp_val(timr->it.real.incr, ==, KTIME_ZERO))
> return;
>
> - do {
> - seq = read_seqbegin(&xtime_lock);
> - new_wall_to = wall_to_monotonic;
> - posix_get_now(&now);
> - } while (read_seqretry(&xtime_lock, seq));
> -
> - if (!list_empty(&timr->it.real.abs_timer_entry)) {
> - spin_lock(&abs_list.lock);
> - add_clockset_delta(timr, &new_wall_to);
> -
> - posix_bump_timer(timr, now);
> -
> - spin_unlock(&abs_list.lock);
> - } else {
> - posix_bump_timer(timr, now);
> - }
> - timr->it_overrun_last = timr->it_overrun;
> - timr->it_overrun = -1;
> + timr->it_overrun_last = timr->it.real.overrun;
> + timr->it.real.overrun = timr->it.real.timer.overrun = -1;
> ++timr->it_requeue_pending;
> - add_timer(&timr->it.real.timer);
> + start_ktimer(&timr->it.real.timer, &timr->it.real.incr, KTIMER_FORWARD);
> + timr->it.real.overrun = timr->it.real.timer.overrun;
> }
>
> /*
> @@ -413,14 +327,7 @@ int posix_timer_event(struct k_itimer *t
> {
> memset(&timr->sigq->info, 0, sizeof(siginfo_t));
> timr->sigq->info.si_sys_private = si_private;
> - /*
> - * Send signal to the process that owns this timer.
> -
> - * This code assumes that all the possible abs_lists share the
> - * same lock (there is only one list at this time). If this is
> - * not the case, the CLOCK info would need to be used to find
> - * the proper abs list lock.
> - */
> + /* Send signal to the process that owns this timer.*/
>
> timr->sigq->info.si_signo = timr->it_sigev_signo;
> timr->sigq->info.si_errno = 0;
> @@ -454,65 +361,28 @@ EXPORT_SYMBOL_GPL(posix_timer_event);
>
> * This code is for CLOCK_REALTIME* and CLOCK_MONOTONIC* timers.
> */
> -static void posix_timer_fn(unsigned long __data)
> +static void posix_timer_fn(void *data)
> {
> - struct k_itimer *timr = (struct k_itimer *) __data;
> + struct k_itimer *timr = data;
> unsigned long flags;
> - unsigned long seq;
> - struct timespec delta, new_wall_to;
> - u64 exp = 0;
> - int do_notify = 1;
> + int si_private = 0;
>
> spin_lock_irqsave(&timr->it_lock, flags);
> - if (!list_empty(&timr->it.real.abs_timer_entry)) {
> - spin_lock(&abs_list.lock);
> - do {
> - seq = read_seqbegin(&xtime_lock);
> - new_wall_to = wall_to_monotonic;
> - } while (read_seqretry(&xtime_lock, seq));
> - set_normalized_timespec(&delta,
> - new_wall_to.tv_sec -
> - timr->it.real.wall_to_prev.tv_sec,
> - new_wall_to.tv_nsec -
> - timr->it.real.wall_to_prev.tv_nsec);
> - if (likely((delta.tv_sec | delta.tv_nsec ) == 0)) {
> - /* do nothing, timer is on time */
> - } else if (delta.tv_sec < 0) {
> - /* do nothing, timer is already late */
> - } else {
> - /* timer is early due to a clock set */
> - tstojiffie(&delta,
> - posix_clocks[timr->it_clock].res,
> - &exp);
> - timr->it.real.wall_to_prev = new_wall_to;
> - timr->it.real.timer.expires += exp;
> - add_timer(&timr->it.real.timer);
> - do_notify = 0;
> - }
> - spin_unlock(&abs_list.lock);
>
> - }
> - if (do_notify) {
> - int si_private=0;
> + if (ktime_cmp_val(timr->it.real.incr, !=, KTIME_ZERO))
> + si_private = ++timr->it_requeue_pending;
>
> - if (timr->it.real.incr)
> - si_private = ++timr->it_requeue_pending;
> - else {
> - remove_from_abslist(timr);
> - }
> + if (posix_timer_event(timr, si_private))
> + /*
> + * signal was not sent because of sig_ignor
> + * we will not get a call back to restart it AND
> + * it should be restarted.
> + */
> + schedule_next_timer(timr);
>
> - if (posix_timer_event(timr, si_private))
> - /*
> - * signal was not sent because of sig_ignor
> - * we will not get a call back to restart it AND
> - * it should be restarted.
> - */
> - schedule_next_timer(timr);
> - }
> unlock_timer(timr, flags); /* hold thru abs lock to keep irq off */
> }
>
> -
> static inline struct task_struct * good_sigevent(sigevent_t * event)
> {
> struct task_struct *rtn = current->group_leader;
> @@ -776,39 +646,40 @@ static struct k_itimer * lock_timer(time
> static void
> common_timer_get(struct k_itimer *timr, struct itimerspec *cur_setting)
> {
> - unsigned long expires;
> - struct now_struct now;
> + ktime_t expires, now, remaining;
> + struct ktimer *timer = &timr->it.real.timer;
>
> - do
> - expires = timr->it.real.timer.expires;
> - while ((volatile long) (timr->it.real.timer.expires) != expires);
> -
> - posix_get_now(&now);
> -
> - if (expires &&
> - ((timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) &&
> - !timr->it.real.incr &&
> - posix_time_before(&timr->it.real.timer, &now))
> - timr->it.real.timer.expires = expires = 0;
> - if (expires) {
> - if (timr->it_requeue_pending & REQUEUE_PENDING ||
> - (timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) {
> - posix_bump_timer(timr, now);
> - expires = timr->it.real.timer.expires;
> - }
> - else
> - if (!timer_pending(&timr->it.real.timer))
> - expires = 0;
> - if (expires)
> - expires -= now.jiffies;
> - }
> - jiffies_to_timespec(expires, &cur_setting->it_value);
> - jiffies_to_timespec(timr->it.real.incr, &cur_setting->it_interval);
> -
> - if (cur_setting->it_value.tv_sec < 0) {
> + memset(cur_setting, 0, sizeof(struct itimerspec));
> + expires = get_expiry_ktimer(timer, &now);
> + remaining = ktime_sub(expires, now);
> +
> + /* Time left ? or timer pending */
> + if (ktime_cmp_val(remaining, >, KTIME_ZERO) || ktimer_active(timer))
> + goto calci;
> + /* interval timer ? */
> + if (ktime_cmp_val(timr->it.real.incr, ==, KTIME_ZERO))
> + return;
> + /*
> + * When a requeue is pending or this is a SIGEV_NONE timer,
> + * move the expiry time forward by intervals, so that the
> + * expiry is > now.
> + * The active (non SIGEV_NONE) rearm should be done
> + * automatically by the ktimer REARM mode. That's the next
> + * iteration. The REQUEUE_PENDING part will go away!
> + */
> + if (timr->it_requeue_pending & REQUEUE_PENDING ||
> + (timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) {
> + remaining = forward_posix_timer(timr, now);
> + }
> + calci:
> + /* interval timer ? */
> + if (ktime_cmp_val(timr->it.real.incr, !=, KTIME_ZERO))
> + cur_setting->it_interval = ktime_to_timespec(timr->it.real.incr);
> + /* A pending but expired timer reports 1 nsec; 0 would mean disarmed */
> + if (ktime_cmp_val(remaining, <=, KTIME_ZERO))
> cur_setting->it_value.tv_nsec = 1;
> - cur_setting->it_value.tv_sec = 0;
> - }
> + else
> + cur_setting->it_value = ktime_to_timespec(remaining);
> }
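
The 1 nsec special case above preserves the user-visible contract of timer_gettime(): an it_value of zero means the timer is disarmed, so a pending timer must always report something non-zero. For reference, the user-space view of that contract (standard POSIX, link with -lrt):

	#include <signal.h>
	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		timer_t tid;
		struct sigevent sev = { .sigev_notify = SIGEV_NONE };
		struct itimerspec its = {
			.it_value    = { .tv_sec = 1, .tv_nsec = 0 },
			.it_interval = { .tv_sec = 0, .tv_nsec = 0 },
		};
		struct itimerspec cur;

		if (timer_create(CLOCK_MONOTONIC, &sev, &tid))
			return 1;
		timer_settime(tid, 0, &its, NULL);
		timer_gettime(tid, &cur);
		/* A non-zero it_value means the timer is still armed */
		printf("remaining %ld.%09ld\n",
		       (long)cur.it_value.tv_sec, cur.it_value.tv_nsec);
		timer_delete(tid);
		return 0;
	}
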
>
> /* Get the time remaining on a POSIX.1b interval timer. */
> @@ -832,6 +703,7 @@ sys_timer_gettime(timer_t timer_id, stru
>
> return 0;
> }
> +
> /*
> * Get the number of overruns of a POSIX.1b interval timer. This is to
> * be the overrun of the timer last delivered. At the same time we are
> @@ -858,84 +730,6 @@ sys_timer_getoverrun(timer_t timer_id)
>
> return overrun;
> }
> -/*
> - * Adjust for absolute time
> - *
> - * If absolute time is given and it is not CLOCK_MONOTONIC, we need to
> - * adjust for the offset between the timer clock (CLOCK_MONOTONIC) and
> - * what ever clock he is using.
> - *
> - * If it is relative time, we need to add the current (CLOCK_MONOTONIC)
> - * time to it to get the proper time for the timer.
> - */
> -static int adjust_abs_time(struct k_clock *clock, struct timespec *tp,
> - int abs, u64 *exp, struct timespec *wall_to)
> -{
> - struct timespec now;
> - struct timespec oc = *tp;
> - u64 jiffies_64_f;
> - int rtn =0;
> -
> - if (abs) {
> - /*
> - * The mask pick up the 4 basic clocks
> - */
> - if (!((clock - &posix_clocks[0]) & ~CLOCKS_MASK)) {
> - jiffies_64_f = do_posix_clock_monotonic_gettime_parts(
> - &now, wall_to);
> - /*
> - * If we are doing a MONOTONIC clock
> - */
> - if((clock - &posix_clocks[0]) & CLOCKS_MONO){
> - now.tv_sec += wall_to->tv_sec;
> - now.tv_nsec += wall_to->tv_nsec;
> - }
> - } else {
> - /*
> - * Not one of the basic clocks
> - */
> - clock->clock_get(clock - posix_clocks, &now);
> - jiffies_64_f = get_jiffies_64();
> - }
> - /*
> - * Take away now to get delta and normalize
> - */
> - set_normalized_timespec(&oc, oc.tv_sec - now.tv_sec,
> - oc.tv_nsec - now.tv_nsec);
> - }else{
> - jiffies_64_f = get_jiffies_64();
> - }
> - /*
> - * Check if the requested time is prior to now (if so set now)
> - */
> - if (oc.tv_sec < 0)
> - oc.tv_sec = oc.tv_nsec = 0;
> -
> - if (oc.tv_sec | oc.tv_nsec)
> - set_normalized_timespec(&oc, oc.tv_sec,
> - oc.tv_nsec + clock->res);
> - tstojiffie(&oc, clock->res, exp);
> -
> - /*
> - * Check if the requested time is more than the timer code
> - * can handle (if so we error out but return the value too).
> - */
> - if (*exp > ((u64)MAX_JIFFY_OFFSET))
> - /*
> - * This is a considered response, not exactly in
> - * line with the standard (in fact it is silent on
> - * possible overflows). We assume such a large
> - * value is ALMOST always a programming error and
> - * try not to compound it by setting a really dumb
> - * value.
> - */
> - rtn = -EINVAL;
> - /*
> - * return the actual jiffies expire time, full 64 bits
> - */
> - *exp += jiffies_64_f;
> - return rtn;
> -}
>
> /* Set a POSIX.1b interval timer. */
> /* timr->it_lock is taken. */
> @@ -943,68 +737,52 @@ static inline int
> common_timer_set(struct k_itimer *timr, int flags,
> struct itimerspec *new_setting, struct itimerspec *old_setting)
> {
> - struct k_clock *clock = &posix_clocks[timr->it_clock];
> - u64 expire_64;
> + ktime_t expires;
> + int mode;
>
> if (old_setting)
> common_timer_get(timr, old_setting);
>
> /* disable the timer */
> - timr->it.real.incr = 0;
> + ktime_set_zero(timr->it.real.incr);
> /*
> * careful here. If smp we could be in the "fire" routine which will
> * be spinning as we hold the lock. But this is ONLY an SMP issue.
> */
> - if (try_to_del_timer_sync(&timr->it.real.timer) < 0) {
> -#ifdef CONFIG_SMP
> - /*
> - * It can only be active if on an other cpu. Since
> - * we have cleared the interval stuff above, it should
> - * clear once we release the spin lock. Of course once
> - * we do that anything could happen, including the
> - * complete melt down of the timer. So return with
> - * a "retry" exit status.
> - */
> + if (try_to_stop_ktimer(&timr->it.real.timer) < 0)
> return TIMER_RETRY;
> -#endif
> - }
> -
> - remove_from_abslist(timr);
>
> timr->it_requeue_pending = (timr->it_requeue_pending + 2) &
> ~REQUEUE_PENDING;
> timr->it_overrun_last = 0;
> timr->it_overrun = -1;
> - /*
> - *switch off the timer when it_value is zero
> - */
> - if (!new_setting->it_value.tv_sec && !new_setting->it_value.tv_nsec) {
> - timr->it.real.timer.expires = 0;
> +
> + /* switch off the timer when it_value is zero */
> + if (!new_setting->it_value.tv_sec && !new_setting->it_value.tv_nsec)
> return 0;
> - }
>
> - if (adjust_abs_time(clock,
> - &new_setting->it_value, flags & TIMER_ABSTIME,
> - &expire_64, &(timr->it.real.wall_to_prev))) {
> - return -EINVAL;
> - }
> - timr->it.real.timer.expires = (unsigned long)expire_64;
> - tstojiffie(&new_setting->it_interval, clock->res, &expire_64);
> - timr->it.real.incr = (unsigned long)expire_64;
> + mode = flags & TIMER_ABSTIME ? KTIMER_ABS : KTIMER_REL;
>
> - /*
> - * We do not even queue SIGEV_NONE timers! But we do put them
> - * in the abs list so we can do that right.
> + /* Posix madness. Only absolute CLOCK_REALTIME timers
> + * are affected by clock sets. So we must reinitialize
> + * the timer.
> */
> + if (timr->it_clock == CLOCK_REALTIME && mode == KTIMER_ABS)
> + timer_create_real(timr);
> + else
> + timer_create_mono(timr);
> +
> + expires = ktimer_convert_timespec(&timr->it.real.timer,
> + &new_setting->it_value);
> + /* This should be moved to the auto rearm code */
> + timr->it.real.incr = ktimer_convert_timespec(&timr->it.real.timer,
> + &new_setting->it_interval);
> +
> + /* SIGEV_NONE timers are not queued ! See common_timer_get */
> if (((timr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE))
> - add_timer(&timr->it.real.timer);
> + start_ktimer(&timr->it.real.timer, &expires,
> + mode | KTIMER_NOCHECK);
>
> - if (flags & TIMER_ABSTIME && clock->abs_struct) {
> - spin_lock(&clock->abs_struct->lock);
> - list_add_tail(&(timr->it.real.abs_timer_entry),
> - &(clock->abs_struct->list));
> - spin_unlock(&clock->abs_struct->lock);
> - }
> return 0;
> }
>
> @@ -1039,6 +817,7 @@ retry:
>
> unlock_timer(timr, flag);
> if (error == TIMER_RETRY) {
> + wait_for_ktimer(&timr->it.real.timer);
> rtn = NULL; // We already got the old time...
> goto retry;
> }
> @@ -1052,24 +831,10 @@ retry:
>
> static inline int common_timer_del(struct k_itimer *timer)
> {
> - timer->it.real.incr = 0;
> + ktime_set_zero(timer->it.real.incr);
>
> - if (try_to_del_timer_sync(&timer->it.real.timer) < 0) {
> -#ifdef CONFIG_SMP
> - /*
> - * It can only be active if on an other cpu. Since
> - * we have cleared the interval stuff above, it should
> - * clear once we release the spin lock. Of course once
> - * we do that anything could happen, including the
> - * complete melt down of the timer. So return with
> - * a "retry" exit status.
> - */
> + if (try_to_stop_ktimer(&timer->it.real.timer) < 0)
> return TIMER_RETRY;
> -#endif
> - }
> -
> - remove_from_abslist(timer);
> -
> return 0;
> }
>
> @@ -1085,24 +850,17 @@ sys_timer_delete(timer_t timer_id)
> struct k_itimer *timer;
> long flags;
>
> -#ifdef CONFIG_SMP
> - int error;
> retry_delete:
> -#endif
> timer = lock_timer(timer_id, &flags);
> if (!timer)
> return -EINVAL;
>
> -#ifdef CONFIG_SMP
> - error = timer_delete_hook(timer);
> -
> - if (error == TIMER_RETRY) {
> + if (timer_delete_hook(timer) == TIMER_RETRY) {
> unlock_timer(timer, flags);
> + wait_for_ktimer(&timer->it.real.timer);
> goto retry_delete;
> }
> -#else
> - timer_delete_hook(timer);
> -#endif
> +
> spin_lock(&current->sighand->siglock);
> list_del(&timer->list);
> spin_unlock(&current->sighand->siglock);
> @@ -1119,6 +877,7 @@ retry_delete:
> release_posix_timer(timer, IT_ID_SET);
> return 0;
> }
> +
> /*
> * return timer owned by the process, used by exit_itimers
> */
> @@ -1126,22 +885,14 @@ static inline void itimer_delete(struct
> {
> unsigned long flags;
>
> -#ifdef CONFIG_SMP
> - int error;
> retry_delete:
> -#endif
> spin_lock_irqsave(&timer->it_lock, flags);
>
> -#ifdef CONFIG_SMP
> - error = timer_delete_hook(timer);
> -
> - if (error == TIMER_RETRY) {
> + if (timer_delete_hook(timer) == TIMER_RETRY) {
> unlock_timer(timer, flags);
> + wait_for_ktimer(&timer->it.real.timer);
> goto retry_delete;
> }
> -#else
> - timer_delete_hook(timer);
> -#endif
> list_del(&timer->list);
> /*
> * This keeps any tasks waiting on the spin lock from thinking
> @@ -1170,60 +921,7 @@ void exit_itimers(struct signal_struct *
> }
> }
>
> -/*
> - * And now for the "clock" calls
> - *
> - * These functions are called both from timer functions (with the timer
> - * spin_lock_irq() held and from clock calls with no locking. They must
> - * use the save flags versions of locks.
> - */
> -
> -/*
> - * We do ticks here to avoid the irq lock ( they take sooo long).
> - * The seqlock is great here. Since we a reader, we don't really care
> - * if we are interrupted since we don't take lock that will stall us or
> - * any other cpu. Voila, no irq lock is needed.
> - *
> - */
> -
> -static u64 do_posix_clock_monotonic_gettime_parts(
> - struct timespec *tp, struct timespec *mo)
> -{
> - u64 jiff;
> - unsigned int seq;
> -
> - do {
> - seq = read_seqbegin(&xtime_lock);
> - getnstimeofday(tp);
> - *mo = wall_to_monotonic;
> - jiff = jiffies_64;
> -
> - } while(read_seqretry(&xtime_lock, seq));
> -
> - return jiff;
> -}
> -
> -static int do_posix_clock_monotonic_get(clockid_t clock, struct timespec *tp)
> -{
> - struct timespec wall_to_mono;
> -
> - do_posix_clock_monotonic_gettime_parts(tp, &wall_to_mono);
> -
> - tp->tv_sec += wall_to_mono.tv_sec;
> - tp->tv_nsec += wall_to_mono.tv_nsec;
> -
> - if ((tp->tv_nsec - NSEC_PER_SEC) > 0) {
> - tp->tv_nsec -= NSEC_PER_SEC;
> - tp->tv_sec++;
> - }
> - return 0;
> -}
> -
> -int do_posix_clock_monotonic_gettime(struct timespec *tp)
> -{
> - return do_posix_clock_monotonic_get(CLOCK_MONOTONIC, tp);
> -}
> -
> +/* Error handlers for clocks which do not support the operation */
> int do_posix_clock_nosettime(clockid_t clockid, struct timespec *tp)
> {
> return -EINVAL;
> @@ -1236,7 +934,8 @@ int do_posix_clock_notimer_create(struct
> }
> EXPORT_SYMBOL_GPL(do_posix_clock_notimer_create);
>
> -int do_posix_clock_nonanosleep(clockid_t clock, int flags, struct timespec *t)
> +int do_posix_clock_nonanosleep(clockid_t clock, int flags, struct timespec *t,
> + struct timespec __user *r)
> {
> #ifndef ENOTSUP
> return -EOPNOTSUPP; /* aka ENOTSUP in userland for POSIX */
> @@ -1295,125 +994,34 @@ sys_clock_getres(clockid_t which_clock,
> return error;
> }
>
> -static void nanosleep_wake_up(unsigned long __data)
> -{
> - struct task_struct *p = (struct task_struct *) __data;
> -
> - wake_up_process(p);
> -}
> -
> /*
> - * The standard says that an absolute nanosleep call MUST wake up at
> - * the requested time in spite of clock settings. Here is what we do:
> - * For each nanosleep call that needs it (only absolute and not on
> - * CLOCK_MONOTONIC* (as it can not be set)) we thread a little structure
> - * into the "nanosleep_abs_list". All we need is the task_struct pointer.
> - * When ever the clock is set we just wake up all those tasks. The rest
> - * is done by the while loop in clock_nanosleep().
> - *
> - * On locking, clock_was_set() is called from update_wall_clock which
> - * holds (or has held for it) a write_lock_irq( xtime_lock) and is
> - * called from the timer bh code. Thus we need the irq save locks.
> - *
> - * Also, on the call from update_wall_clock, that is done as part of a
> - * softirq thing. We don't want to delay the system that much (possibly
> - * long list of timers to fix), so we defer that work to keventd.
> + * nanosleep for monotonic and realtime clocks
> */
> -
> -static DECLARE_WAIT_QUEUE_HEAD(nanosleep_abs_wqueue);
> -static DECLARE_WORK(clock_was_set_work, (void(*)(void*))clock_was_set, NULL);
> -
> -static DECLARE_MUTEX(clock_was_set_lock);
> -
> -void clock_was_set(void)
> +static int common_nsleep(clockid_t which_clock, int flags,
> + struct timespec *tsave, struct timespec __user *rmtp)
> {
> - struct k_itimer *timr;
> - struct timespec new_wall_to;
> - LIST_HEAD(cws_list);
> - unsigned long seq;
> -
> + int mode = flags & TIMER_ABSTIME ? KTIMER_ABS : KTIMER_REL;
>
> - if (unlikely(in_interrupt())) {
> - schedule_work(&clock_was_set_work);
> - return;
> + switch (which_clock) {
> + case CLOCK_REALTIME:
> + /* Posix madness. Only absolute timers on clock realtime
> + are affected by clock set. */
> + if (mode == KTIMER_ABS)
> + return ktimer_nanosleep_real(tsave, rmtp, mode);
> + case CLOCK_MONOTONIC:
> + return ktimer_nanosleep_mono(tsave, rmtp, mode);
> + default:
> + break;
> }
> - wake_up_all(&nanosleep_abs_wqueue);
> -
> - /*
> - * Check if there exist TIMER_ABSTIME timers to correct.
> - *
> - * Notes on locking: This code is run in task context with irq
> - * on. We CAN be interrupted! All other usage of the abs list
> - * lock is under the timer lock which holds the irq lock as
> - * well. We REALLY don't want to scan the whole list with the
> - * interrupt system off, AND we would like a sequence lock on
> - * this code as well. Since we assume that the clock will not
> - * be set often, it seems ok to take and release the irq lock
> - * for each timer. In fact add_timer will do this, so this is
> - * not an issue. So we know when we are done, we will move the
> - * whole list to a new location. Then as we process each entry,
> - * we will move it to the actual list again. This way, when our
> - * copy is empty, we are done. We are not all that concerned
> - * about preemption so we will use a semaphore lock to protect
> - * aginst reentry. This way we will not stall another
> - * processor. It is possible that this may delay some timers
> - * that should have expired, given the new clock, but even this
> - * will be minimal as we will always update to the current time,
> - * even if it was set by a task that is waiting for entry to
> - * this code. Timers that expire too early will be caught by
> - * the expire code and restarted.
> -
> - * Absolute timers that repeat are left in the abs list while
> - * waiting for the task to pick up the signal. This means we
> - * may find timers that are not in the "add_timer" list, but are
> - * in the abs list. We do the same thing for these, save
> - * putting them back in the "add_timer" list. (Note, these are
> - * left in the abs list mainly to indicate that they are
> - * ABSOLUTE timers, a fact that is used by the re-arm code, and
> - * for which we have no other flag.)
> -
> - */
> -
> - down(&clock_was_set_lock);
> - spin_lock_irq(&abs_list.lock);
> - list_splice_init(&abs_list.list, &cws_list);
> - spin_unlock_irq(&abs_list.lock);
> - do {
> - do {
> - seq = read_seqbegin(&xtime_lock);
> - new_wall_to = wall_to_monotonic;
> - } while (read_seqretry(&xtime_lock, seq));
> -
> - spin_lock_irq(&abs_list.lock);
> - if (list_empty(&cws_list)) {
> - spin_unlock_irq(&abs_list.lock);
> - break;
> - }
> - timr = list_entry(cws_list.next, struct k_itimer,
> - it.real.abs_timer_entry);
> -
> - list_del_init(&timr->it.real.abs_timer_entry);
> - if (add_clockset_delta(timr, &new_wall_to) &&
> - del_timer(&timr->it.real.timer)) /* timer run yet? */
> - add_timer(&timr->it.real.timer);
> - list_add(&timr->it.real.abs_timer_entry, &abs_list.list);
> - spin_unlock_irq(&abs_list.lock);
> - } while (1);
> -
> - up(&clock_was_set_lock);
> + return -EINVAL;
> }
>
> -long clock_nanosleep_restart(struct restart_block *restart_block);
> -
> asmlinkage long
> sys_clock_nanosleep(clockid_t which_clock, int flags,
> const struct timespec __user *rqtp,
> struct timespec __user *rmtp)
> {
> struct timespec t;
> - struct restart_block *restart_block =
> - &(current_thread_info()->restart_block);
> - int ret;
>
> if (invalid_clockid(which_clock))
> return -EINVAL;
> @@ -1421,135 +1029,8 @@ sys_clock_nanosleep(clockid_t which_cloc
> if (copy_from_user(&t, rqtp, sizeof (struct timespec)))
> return -EFAULT;
>
> - if ((unsigned) t.tv_nsec >= NSEC_PER_SEC || t.tv_sec < 0)
> + if (!timespec_valid(&t))
> return -EINVAL;
>
> - /*
> - * Do this here as nsleep function does not have the real address.
> - */
> - restart_block->arg1 = (unsigned long)rmtp;
> -
> - ret = CLOCK_DISPATCH(which_clock, nsleep, (which_clock, flags, &t));
> -
> - if ((ret == -ERESTART_RESTARTBLOCK) && rmtp &&
> - copy_to_user(rmtp, &t, sizeof (t)))
> - return -EFAULT;
> - return ret;
> -}
> -
> -
> -static int common_nsleep(clockid_t which_clock,
> - int flags, struct timespec *tsave)
> -{
> - struct timespec t, dum;
> - struct timer_list new_timer;
> - DECLARE_WAITQUEUE(abs_wqueue, current);
> - u64 rq_time = (u64)0;
> - s64 left;
> - int abs;
> - struct restart_block *restart_block =
> - &current_thread_info()->restart_block;
> -
> - abs_wqueue.flags = 0;
> - init_timer(&new_timer);
> - new_timer.expires = 0;
> - new_timer.data = (unsigned long) current;
> - new_timer.function = nanosleep_wake_up;
> - abs = flags & TIMER_ABSTIME;
> -
> - if (restart_block->fn == clock_nanosleep_restart) {
> - /*
> - * Interrupted by a non-delivered signal, pick up remaining
> - * time and continue. Remaining time is in arg2 & 3.
> - */
> - restart_block->fn = do_no_restart_syscall;
> -
> - rq_time = restart_block->arg3;
> - rq_time = (rq_time << 32) + restart_block->arg2;
> - if (!rq_time)
> - return -EINTR;
> - left = rq_time - get_jiffies_64();
> - if (left <= (s64)0)
> - return 0; /* Already passed */
> - }
> -
> - if (abs && (posix_clocks[which_clock].clock_get !=
> - posix_clocks[CLOCK_MONOTONIC].clock_get))
> - add_wait_queue(&nanosleep_abs_wqueue, &abs_wqueue);
> -
> - do {
> - t = *tsave;
> - if (abs || !rq_time) {
> - adjust_abs_time(&posix_clocks[which_clock], &t, abs,
> - &rq_time, &dum);
> - }
> -
> - left = rq_time - get_jiffies_64();
> - if (left >= (s64)MAX_JIFFY_OFFSET)
> - left = (s64)MAX_JIFFY_OFFSET;
> - if (left < (s64)0)
> - break;
> -
> - new_timer.expires = jiffies + left;
> - __set_current_state(TASK_INTERRUPTIBLE);
> - add_timer(&new_timer);
> -
> - schedule();
> -
> - del_timer_sync(&new_timer);
> - left = rq_time - get_jiffies_64();
> - } while (left > (s64)0 && !test_thread_flag(TIF_SIGPENDING));
> -
> - if (abs_wqueue.task_list.next)
> - finish_wait(&nanosleep_abs_wqueue, &abs_wqueue);
> -
> - if (left > (s64)0) {
> -
> - /*
> - * Always restart abs calls from scratch to pick up any
> - * clock shifting that happened while we are away.
> - */
> - if (abs)
> - return -ERESTARTNOHAND;
> -
> - left *= TICK_NSEC;
> - tsave->tv_sec = div_long_long_rem(left,
> - NSEC_PER_SEC,
> - &tsave->tv_nsec);
> - /*
> - * Restart works by saving the time remaing in
> - * arg2 & 3 (it is 64-bits of jiffies). The other
> - * info we need is the clock_id (saved in arg0).
> - * The sys_call interface needs the users
> - * timespec return address which _it_ saves in arg1.
> - * Since we have cast the nanosleep call to a clock_nanosleep
> - * both can be restarted with the same code.
> - */
> - restart_block->fn = clock_nanosleep_restart;
> - restart_block->arg0 = which_clock;
> - /*
> - * Caller sets arg1
> - */
> - restart_block->arg2 = rq_time & 0xffffffffLL;
> - restart_block->arg3 = rq_time >> 32;
> -
> - return -ERESTART_RESTARTBLOCK;
> - }
> -
> - return 0;
> -}
> -/*
> - * This will restart clock_nanosleep.
> - */
> -long
> -clock_nanosleep_restart(struct restart_block *restart_block)
> -{
> - struct timespec t;
> - int ret = common_nsleep(restart_block->arg0, 0, &t);
> -
> - if ((ret == -ERESTART_RESTARTBLOCK) && restart_block->arg1 &&
> - copy_to_user((struct timespec __user *)(restart_block->arg1), &t,
> - sizeof (t)))
> - return -EFAULT;
> - return ret;
> + return CLOCK_DISPATCH(which_clock, nsleep, (which_clock, flags, &t, rmtp));
> }
> Index: linux-2.6.14-rc2-rt4/kernel/timer.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/timer.c
> +++ linux-2.6.14-rc2-rt4/kernel/timer.c
> @@ -912,6 +912,7 @@ static void run_timer_softirq(struct sof
> {
> tvec_base_t *base = &__get_cpu_var(tvec_bases);
>
> + run_ktimer_queues();
> if (time_after_eq(jiffies, base->timer_jiffies))
> __run_timers(base);
> }
> @@ -1177,62 +1178,6 @@ asmlinkage long sys_gettid(void)
> return current->pid;
> }
>
> -static long __sched nanosleep_restart(struct restart_block *restart)
> -{
> - unsigned long expire = restart->arg0, now = jiffies;
> - struct timespec __user *rmtp = (struct timespec __user *) restart->arg1;
> - long ret;
> -
> - /* Did it expire while we handled signals? */
> - if (!time_after(expire, now))
> - return 0;
> -
> - expire = schedule_timeout_interruptible(expire - now);
> -
> - ret = 0;
> - if (expire) {
> - struct timespec t;
> - jiffies_to_timespec(expire, &t);
> -
> - ret = -ERESTART_RESTARTBLOCK;
> - if (rmtp && copy_to_user(rmtp, &t, sizeof(t)))
> - ret = -EFAULT;
> - /* The 'restart' block is already filled in */
> - }
> - return ret;
> -}
> -
> -asmlinkage long sys_nanosleep(struct timespec __user *rqtp, struct timespec __user *rmtp)
> -{
> - struct timespec t;
> - unsigned long expire;
> - long ret;
> -
> - if (copy_from_user(&t, rqtp, sizeof(t)))
> - return -EFAULT;
> -
> - if ((t.tv_nsec >= 1000000000L) || (t.tv_nsec < 0) || (t.tv_sec < 0))
> - return -EINVAL;
> -
> - expire = timespec_to_jiffies(&t) + (t.tv_sec || t.tv_nsec);
> - expire = schedule_timeout_interruptible(expire);
> -
> - ret = 0;
> - if (expire) {
> - struct restart_block *restart;
> - jiffies_to_timespec(expire, &t);
> - if (rmtp && copy_to_user(rmtp, &t, sizeof(t)))
> - return -EFAULT;
> -
> - restart = &current_thread_info()->restart_block;
> - restart->fn = nanosleep_restart;
> - restart->arg0 = jiffies + expire;
> - restart->arg1 = (unsigned long) rmtp;
> - ret = -ERESTART_RESTARTBLOCK;
> - }
> - return ret;
> -}
> -
> /*
> * sys_sysinfo - fill in sysinfo struct
> */
> Index: linux-2.6.14-rc2-rt4/include/linux/time.h
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/include/linux/time.h
> +++ linux-2.6.14-rc2-rt4/include/linux/time.h
> @@ -4,6 +4,7 @@
> #include <linux/types.h>
>
> #ifdef __KERNEL__
> +#include <linux/calc64.h>
> #include <linux/seqlock.h>
> #endif
>
> @@ -38,6 +39,11 @@ static __inline__ int timespec_equal(str
> return (a->tv_sec == b->tv_sec) && (a->tv_nsec == b->tv_nsec);
> }
>
> +#define timespec_valid(ts) \
> +(((ts)->tv_sec >= 0) && (((unsigned) (ts)->tv_nsec) < NSEC_PER_SEC))
> +
> +typedef s64 nsec_t;
> +
> /* Converts Gregorian date to seconds since 1970-01-01 00:00:00.
> * Assumes input in normal date format, i.e. 1980-12-31 23:59:59
> * => year=1980, mon=12, day=31, hour=23, min=59, sec=59.
> @@ -88,8 +94,7 @@ struct timespec current_kernel_time(void
> extern void do_gettimeofday(struct timeval *tv);
> extern int do_settimeofday(struct timespec *tv);
> extern int do_sys_settimeofday(struct timespec *tv, struct timezone *tz);
> -extern void clock_was_set(void); // call when ever the clock is set
> -extern int do_posix_clock_monotonic_gettime(struct timespec *tp);
> +extern void do_posix_clock_monotonic_gettime(struct timespec *ts);
> extern long do_utimes(char __user * filename, struct timeval * times);
> struct itimerval;
> extern int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue);
> @@ -113,6 +118,40 @@ set_normalized_timespec (struct timespec
> ts->tv_nsec = nsec;
> }
>
> +static __inline__ nsec_t timespec_to_ns(struct timespec *s)
> +{
> + nsec_t res = (nsec_t) s->tv_sec * NSEC_PER_SEC;
> + return res + (nsec_t) s->tv_nsec;
> +}
> +
> +static __inline__ struct timespec ns_to_timespec(nsec_t n)
> +{
> + struct timespec ts;
> +
> + if (n)
> + ts.tv_sec = div_long_long_rem_signed(n, NSEC_PER_SEC, &ts.tv_nsec);
> + else
> + ts.tv_sec = ts.tv_nsec = 0;
> + return ts;
> +}
> +
> +static __inline__ nsec_t timeval_to_ns(struct timeval *s)
> +{
> + nsec_t res = (nsec_t) s->tv_sec * NSEC_PER_SEC;
> + return res + (nsec_t) s->tv_usec * NSEC_PER_USEC;
> +}
> +
> +static __inline__ struct timeval ns_to_timeval(nsec_t n)
> +{
> + struct timeval tv;
> + if (n) {
> + tv.tv_sec = div_long_long_rem_signed(n, NSEC_PER_SEC, &tv.tv_usec);
> + tv.tv_usec /= 1000;
> + } else
> + tv.tv_sec = tv.tv_usec = 0;
> + return tv;
> +}
> +
> #endif /* __KERNEL__ */
>
> #define NFDBITS __NFDBITS
> @@ -145,23 +184,18 @@ struct itimerval {
> /*
> * The IDs of the various system clocks (for POSIX.1b interval timers).
> */
> -#define CLOCK_REALTIME 0
> -#define CLOCK_MONOTONIC 1
> +#define CLOCK_REALTIME 0
> +#define CLOCK_MONOTONIC 1
> #define CLOCK_PROCESS_CPUTIME_ID 2
> #define CLOCK_THREAD_CPUTIME_ID 3
> -#define CLOCK_REALTIME_HR 4
> -#define CLOCK_MONOTONIC_HR 5
>
> /*
> * The IDs of various hardware clocks
> */
> -
> -
> #define CLOCK_SGI_CYCLE 10
> #define MAX_CLOCKS 16
> -#define CLOCKS_MASK (CLOCK_REALTIME | CLOCK_MONOTONIC | \
> - CLOCK_REALTIME_HR | CLOCK_MONOTONIC_HR)
> -#define CLOCKS_MONO (CLOCK_MONOTONIC & CLOCK_MONOTONIC_HR)
> +#define CLOCKS_MASK (CLOCK_REALTIME | CLOCK_MONOTONIC)
> +#define CLOCKS_MONO (CLOCK_MONOTONIC)
>
> /*
> * The various flags for setting POSIX.1b interval timers.
> -

--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-09-30 15:59:10

by Frank Sorenson

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


Thomas Gleixner wrote:
> The KTIMER_REARM case is the broken spot. I fixed this already as it was
> oopsing here to, but somehow I messed up with quilt.
>
> tglx

This does indeed fix the panic. Thanks.

Frank
--
Frank Sorenson - KD7TZK
Systems Manager, Computer Science Department
Brigham Young University
[email protected]

2005-10-01 01:03:53

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Wed, 28 Sep 2005 [email protected] wrote:

Your patch introduces some whitespace damage, search for "^\+ " in your
patch.

> ktimers seperate the "timer API" from the "timeout API".

I'm not really happy with these names; timeouts are what timers do, so
these names don't tell at all what the difference is.
Calling them "process timer" and "kernel timer" would include their main
usage, although that also means ptimer would be the more correct abbreviation.

> +#ifndef KTIME_IS_SCALAR
> +typedef union {
> + s64 tv64;
> + struct {
> +#ifdef __BIG_ENDIAN
> + s32 sec, nsec;
> +#else
> + s32 nsec, sec;
> +#endif
> + } tv;
> +} ktime_t;
> +
> +#else
> +
> +typedef s64 ktime_t;
> +
> +#endif

Making the union unconditional would make tv64 always available and a lot
of macros unnecessary.
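
As a rough sketch of that suggestion (illustration only, reusing the field
names from the posted patch, not a replacement hunk):

typedef union {
        s64     tv64;
        struct {
#ifdef __BIG_ENDIAN
                s32     sec, nsec;
#else
                s32     nsec, sec;
#endif
        } tv;
} ktime_t;

/* with tv64 always present, comparisons need no helper macro: */
/*      if (a.tv64 <= b.tv64) ...                               */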

> +struct ktimer {
> + struct rb_node node;
> + struct list_head list;
> + ktime_t expires;
> + ktime_t expired;
> + ktime_t interval;
> + int overrun;
> + unsigned long status;
> + void (*function)(void *);
> + void *data;
> + struct ktimer_base *base;
> +};

This structure is rather large and I think a lot can be avoided (a rough
sketch of the trimmed structure follows this list).
- list: AFAICT it's only used by run_ktimer_queue() to get the first
pending entry. This can also be done by keeping track of the first entry
in the base structure (useful in other places as well).
- expired: can be replaced by base->last_expired (may also be useful in
other places)
- status: the only user is ktimer_active(); the same test can be done by
testing node.rb_parent.
- interval/overrun: this is only needed by itimers and I think it's
possible to leave it there. The main change would be to let 'function' return
a value indicating whether to rearm the timer or not (this includes
updating expires).
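
A rough sketch of the trimmed structure described above (purely illustrative;
the int return type of 'function' is my notation for the rearm suggestion):

struct ktimer {
        struct rb_node          node;   /* rb_parent doubles as the "active" test */
        ktime_t                 expires;
        int                     (*function)(void *);    /* nonzero return: rearm */
        void                    *data;
        struct ktimer_base      *base;
};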

> +#define DEFINE_KTIME(k) ktime_t k = {.tv64 = 0LL }
> +
> +#define ktime_cmp(a,op,b) ((a).tv64 op (b).tv64)
> +#define ktime_cmp_val(a, op, b) ((a).tv64 op b)

A union ktime would especially avoid this.

> +static inline ktime_t ktime_sub(ktime_t a, ktime_t b)
> +{
> + ktime_t res;
> +
> + res.tv64 = a.tv64 - b.tv64;
> + if (res.tv.nsec < 0)
> + res.tv.nsec += NSEC_PER_SEC;
> +
> + return res;
> +}
> +
> +static inline ktime_t ktime_add(ktime_t a, ktime_t b)
> +{
> + ktime_t res;
> +
> + res.tv64 = a.tv64 + b.tv64;
> + if (res.tv.nsec >= NSEC_PER_SEC) {
> + res.tv.nsec -= NSEC_PER_SEC;
> + res.tv.sec++;
> + }
> + return res;
> +}

Not using 64bit math here allows gcc to generate better code: e.g. gcc
has to add another test for "nsec < 0" because the condition code is
already used up by the overflow; doing the fixup with a "sec--" instead is
IMO faster (i.e. the branch is less likely to be taken).
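
For illustration, the 32bit variant of the subtract could look like this
(a sketch, not taken from the patch; it assumes both inputs are normalized
to 0 <= nsec < NSEC_PER_SEC):

static inline ktime_t ktime_sub(ktime_t a, ktime_t b)
{
        ktime_t res;

        res.tv.sec  = a.tv.sec  - b.tv.sec;
        res.tv.nsec = a.tv.nsec - b.tv.nsec;
        if (res.tv.nsec < 0) {
                res.tv.nsec += NSEC_PER_SEC;
                res.tv.sec--;           /* borrow a second instead of retesting */
        }
        return res;
}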

> +/* The time bases */
> +#define MAX_KTIMER_BASES 2
> +static DEFINE_PER_CPU(struct ktimer_base, ktimer_bases[MAX_KTIMER_BASES]) =

Do you have any numbers (besides maybe microbenchmarks) that show a real
advantage by using per cpu data? What kind of usage do you expect here?

The other thing is that this assumes that all time sources are
programmable per cpu; otherwise it will be more complicated for a time
source to run the timers for every cpu, and I don't know how safe that
assumption is.
Changing the array of structures into an array of pointers to the
structures would allow switching between percpu bases and a single base.

> +ktime_t ktimer_convert_timespec(struct ktimer *timer, struct timespec *ts)
> +{
> + struct ktimer_base *base = get_ktimer_base_unlocked(timer);
> + ktime_t t;
> + long rem = ts->tv_nsec % base->resolution;
> +
> + t = ktime_set(ts->tv_sec, ts->tv_nsec);
> +
> + /* Check, if the value has to be rounded */
> + if (rem)
> + t = ktime_add_ns(t, base->resolution - rem);
> + return t;
> +}

Could you explain a little the resolution handling in your patch?
If I read SUS correctly, clock resolution and timer resolution don't have
to be the same; the first is returned by clock_getres() and the latter is
only documented somewhere (and AFAICT our implementation always returned
the wrong value).
IMO this also means we don't have to make the rounding that
complicated. Actually it could be done automatically by the timer, e.g.
interval timers are reprogrammed at (now + interval) and the timer
resolution will automatically round it up.

> +static int enqueue_ktimer(struct ktimer *timer, struct ktimer_base *base,
> + ktime_t *tim, int mode)
> +{
> + struct rb_node **link = &base->active.rb_node;
> + struct rb_node *parent = NULL;
> + struct ktimer *entry;
> + struct list_head *prev = &base->pending;
> + ktime_t now;
> +
> + /* Get current time */
> + now = base->get_time();

As get_time() is not necessarily cheap, it can be avoided for nonrelative
timers by comparing their expiry with the first pending timer. Maintaining a
pointer to the first timer here avoids the timer list and gives a simple
check of whether the time source needs any reprogramming later.
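
A sketch of that idea (base->first is a hypothetical pointer to the earliest
queued timer, and KTIMER_ABS is assumed to be testable as a mode bit; none of
this is code from the posted patch):

        /* Only read the clock when the timer is relative or when the new
         * absolute timer might become the earliest pending one. */
        if (!(mode & KTIMER_ABS) || !base->first ||
            ktime_cmp(timer->expires, <, base->first->expires))
                now = base->get_time();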

> + if ktime_cmp(timer->expires, <=, now) {
> + timer->expired = now;
> + /* The caller takes care of expiry */
> + if (!(mode & KTIMER_NOCHECK))
> + return -1;

I think KTIMER_NOFAIL would be a better name; for a while that had me
confused, as you actually do check the value, but you don't fail it and
enqueue it anyway.

bye, Roman

2005-10-01 11:22:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> > +/* The time bases */
> > +#define MAX_KTIMER_BASES 2
> > +static DEFINE_PER_CPU(struct ktimer_base, ktimer_bases[MAX_KTIMER_BASES]) =
>
> Do you have any numbers (besides maybe microbenchmarks) that show a
> real advantage by using per cpu data? What kind of usage do you expect
> here?

it has countless advantages, and these days we basically only design
per-CPU data structures within the kernel, unless some limitation (such
as API or hw property) forces us to do otherwise. So i turn around the
question: what would be your reason for _not_ doing this clean per-CPU
design for SMP systems?

> The other thing is that this assumes, that all time sources are
> programmable per cpu, otherwise it will be more complicated for a time
> source to run the timers for every cpu, I don't know how safe that
> assumption is. Changing the array of structures into an array of
> pointers to the structures would allow to switch between percpu bases
> and a single base.

yeah, and that's an assumption that simplifies things on SMP
significantly. PIT on SMP systems for HRT is so gross that it's not
funny. If anyone wants to revive that notion, please do a separate patch
and make the case convincing enough ...

Ingo

2005-10-01 12:04:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Roman,

On Sat, 2005-10-01 at 03:03 +0200, Roman Zippel wrote:

> Your patch introduces some whitespace damage, search for "^\+ " in your
> patch.

Ok.

> > ktimers seperate the "timer API" from the "timeout API".
> I'm not really happy with these names, timeouts are what timers do, so
> these names don't tell at all, what the difference is.

There is a clear distinction between timers and timeouts.

From the IT dictionary:

"Timeout is a specified period of time that will be allowed to elapse in
a system before a specified event is to take place, unless another
specified event occurs first; in either case, the period is terminated
when either event takes place."

"A timer is a specialized type of clock. A timer can be used to control
the sequence of an event or process."

> Calling them "process timer" and "kernel timer" would include their main
> usage, although that also means ptimer were the more correct abbreviation.

As said before, I think the distinction between timers and timeouts
makes perfect sense, and ktimers are _not_ restricted to process
timers.

> > +#ifndef KTIME_IS_SCALAR
> > +typedef union {
> > + s64 tv64;
> > + struct {
> > +#ifdef __BIG_ENDIAN
> > + s32 sec, nsec;
> > +#else
> > + s32 nsec, sec;
> > +#endif
> > + } tv;
> > +} ktime_t;
> > +
> > +#else
> > +
> > +typedef s64 ktime_t;
> > +
> > +#endif
>
> Making the union unconditional, would make tv64 always available and a lot
> of macros unnessary.

The sec/nsec storage format is essentially different from the scalar storage
format and has to be handled differently.

The above gives a clear distinction between scalar and sec/nsec based
cases. So you cannot mess up without notice.

I prefer having a lot more macros / inlines around rather than tracking
down _one_ single bug caused by a not clearly separated
implementation.

> > +struct ktimer {
> > + struct rb_node node;
> > + struct list_head list;
> > + ktime_t expires;
> > + ktime_t expired;
> > + ktime_t interval;
> > + int overrun;
> > + unsigned long status;
> > + void (*function)(void *);
> > + void *data;
> > + struct ktimer_base *base;
> > +};
>
> This structure is rather large and I think a lot can be avoided.
> - list: AFAICT it's only used by run_ktimer_queue() to get the first
> pending entry. This can also be done by keeping track of the first entry
> in the base structure (useful in other places as well).

You are right that the list is not necessary for the plain integration
into the current system, but it is necessary once you start to upgrade
to high resolution timers.

> - expired: can be replaced by base->last_expired (may also be useful in
> other places)

How does base->last_expired give per-timer expired information? And where
would it be useful?

> - status: only user is ktimer_active(), the same test can be done by
> testing node.rb_parent.

Uurg. Been there and discarded the idea, because it's ugly and clashes
with further extensibility requirements, e.g. high resolution timers,
where we have more than two states.

Having status information bound to arbitrary pointers is trading a
variable against flexibility, cleanliness and maintainability.

> - interval/overrun: this is only needed by itimers and I think it's
> possible to leave it there. Main change would be to let 'function' return
> a value indicating whether to rearm the timer or not (this includes
> expires is updated).

It is also used by the posix timer code and I plan to do another round
of simplification also there.


This implementation is chosen to be flexible and easily extensible for
use cases like high resolution timers.


I do not want to end up with the next round of discussion there about
either introducing tons of new ifdefs and macros or redesigning the code
base yet another time.

As others have stated too, we have to weigh the tradeoff between

simplicity, flexibility, maintainability
vs.
size and performance impacts

Performance is definitely an important issue and was accepted and
addressed.

The size tradeoff in question is not a valid argument to give up
a clear, flexible and maintainable design.

> > +#define DEFINE_KTIME(k) ktime_t k = {.tv64 = 0LL }
> > +
> > +#define ktime_cmp(a,op,b) ((a).tv64 op (b).tv64)
> > +#define ktime_cmp_val(a, op, b) ((a).tv64 op b)
>
> A union ktime would especially avoid this.

See above

> > +static inline ktime_t ktime_sub(ktime_t a, ktime_t b)
> > +{
> > + ktime_t res;
> > +
> > + res.tv64 = a.tv64 - b.tv64;
> > + if (res.tv.nsec < 0)
> > + res.tv.nsec += NSEC_PER_SEC;
> > +
> > + return res;
> > +}
> > +
> > +static inline ktime_t ktime_add(ktime_t a, ktime_t b)
> > +{
> > + ktime_t res;
> > +
> > + res.tv64 = a.tv64 + b.tv64;
> > + if (res.tv.nsec >= NSEC_PER_SEC) {
> > + res.tv.nsec -= NSEC_PER_SEC;
> > + res.tv.sec++;
> > + }
> > + return res;
> > +}
>
> Not using 64bit math here allows gcc to generate better code, e.g. gcc
> has to add another test for "nsec < 0" because the condition code is
> already used for the overflow, adding the "sec--" instead is IMO faster
> (i.e. less likely).

i686
DOADD32 00000048
DOADD64 0000002a
DOSUB32 00000060
DOSUB64 0000002f
arm
DOADD32 0000004c
DOADD64 0000004c
DOSUB32 00000040
DOSUB64 00000038
m68k
DOADD32 0000003c
DOADD64 0000002e
DOSUB32 00000036
DOSUB64 00000028
powerpc
DOADD32 00000040
DOADD64 00000044
DOSUB32 00000044
DOSUB64 00000044

Please do not tell me that size does not matter. :)

I attached the assembler dumps, so you can have a look yourself. I did
these tests during the implementation and decided on the results rather
than on assumptions about gcc.


> Could you explain a little the resolution handling behind in your patch?
> If I read SUS correctly clock resolution and timer resolution don't have
> to be the same, the first is returned by clock_getres() and the latter
> only documented somewhere (and AFAICT our implementation always returned
> the wrong value).

As far as I understand SUS timer resolution is equal to clock resolution
and the timer value/interval is rounded up to the resolution.

> IMO this also means we can don't have to make the rounding that
> complicated. Actually it could be done automatically by the timer, e.g.
> interval timer are reprogrammed at (now + interval) and the timer
> resolution will automatically round it up.

Reprogramming interval timers by now + interval is completely wrong.
Reprogramming has to be
timer->expires + interval and nothing else.

Doing real rounding in the reprogramming code would have a performance
impact.
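
In code terms (simplified from the KTIMER_REARM fragment quoted further down):

        while (ktime_cmp(timer->expires, <=, now))
                timer->expires = ktime_add(timer->expires, timer->interval);

        /* whereas now + interval would fold the wakeup latency of every
         * expiry into the period and let the timer drift */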

> > + /* Get current time */
> > + now = base->get_time();
>
> As get_time() is not necessarily cheap, it can be avoided for nonrelative
> timers by comparing it with the first pending timer. Maintaining a pointer
> to the first timer here, avoids the timer list and is a simple check
> whether the time source needs any reprogramming later.

Would you please care to read the complete related code to find out why
this does not work? This is totally unrelated to reprogramming of the
time event source in the HRT case.
...
case KTIMER_FORWARD:
while ktime_cmp(timer->expires, <= , now) {
...

case KTIMER_REARM:
while ktime_cmp(timer->expires, <= , now) {
timer->expires = ktime_add(timer->expires,

and of course the expiry check below.

> > + if ktime_cmp(timer->expires, <=, now) {
> > + timer->expired = now;
> > + /* The caller takes care of expiry */
> > + if (!(mode & KTIMER_NOCHECK))
> > + return -1;
>
> I think KTIMER_NOFAIL would be better name, for a while that had me
> confused, as you actually do check the value, but you don't fail it and
> enqueue it anyway.

It does not fail. It returns in the case that the timer is already
expired. The NOCHECK flag is used to skip the check.

tglx


Attachments:
testarchs.dmp (17.74 kB)

2005-10-04 01:56:53

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Roman Zippel wrote:
>
>
> Could you explain a little the resolution handling behind in your patch?
> If I read SUS correctly clock resolution and timer resolution don't have
> to be the same, the first is returned by clock_getres() and the latter
> only documented somewhere (and AFAICT our implementation always returned
> the wrong value).
> IMO this also means we can don't have to make the rounding that
> complicated. Actually it could be done automatically by the timer, e.g.
> interval timer are reprogrammed at (now + interval) and the timer
> resolution will automatically round it up.

As I understand it the resolution should apply to timers assigned to the given clock. I assume most
clock reads will return the best resolution possible, but we can only know what that is (in user
land) by looking at a series of clock reads and making an educated guess (if indeed we care).

For timers, on the other hand, resolution serves two purposes: a) it tells the user/ application
what to expect and allows him to take evasive action (such as asking for the timer to expire a "res"
amount sooner) to get what he wants/needs. b) for the kernel, it allows timers to be grouped such
that we can limit the number of interrupts we need to service to handle timers. Some of this might
be possible by relying on the hardware, but a lot of hardware may actually be able to handle
nanosecond resolution. At that point you end up grouping by latency and getting to the point where,
for no good reason, you have the possibility of timer storms. For no good reason, i.e. the user
really doesn't need or want that level of resolution, being happy with, for example 10 microseconds
or some such. This is why the HRT patch makes the resolution a configure-time choice; the same can be said of the new ability to set HZ at
configure time.
>
>

--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-10-04 02:00:33

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Ingo Molnar wrote:
> * Roman Zippel <[email protected]> wrote:
> >
>>The other thing is that this assumes, that all time sources are
>>programmable per cpu, otherwise it will be more complicated for a time
>>source to run the timers for every cpu, I don't know how safe that
>>assumption is. Changing the array of structures into an array of
>>pointers to the structures would allow to switch between percpu bases
>>and a single base.
>
>
> yeah, and that's an assumption that simplifies things on SMP
> significantly. PIT on SMP systems for HRT is so gross that it's not
> funny. If anyone wants to revive that notion, please do a separate patch
> and make the case convincing enough ...
>
Let's not talk about PIT, but a lot of SMP platforms do NOT have per cpu timers. For those, it
would seem having per cpu lists to handle the timers is not really reasonable.

--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-10-04 05:50:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* George Anzinger <[email protected]> wrote:

> > yeah, and that's an assumption that simplifies things on SMP
> > significantly. PIT on SMP systems for HRT is so gross that it's not
> > funny. If anyone wants to revive that notion, please do a separate
> > patch and make the case convincing enough ...
>
> Lets not talk about PIT, but, a lot of SMP platforms do NOT have per
> cpu timers. For those, it would seem having per cpu lists to handle
> the timer is not really reasonable.

frankly, such systems are rare, and are an afterthought at most. Think
about it: 8 CPUs and only one hres timer source? It cannot work nor
scale well.

i agree that they might eventually be handled (although i think we
shouldnt bother, all sane SMP designs have per-CPU timers), but we
definitely wont design for them. What such an architecture has to do is to
provide the proper do_hr_timer_int() and arch_hrtimer_reprogram()
semantics, via locking around that timer source (naturally), and via
cross-CPU calls - as if they were per-CPU timers.

Ingo

2005-10-10 12:43:06

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Sat, 1 Oct 2005, Ingo Molnar wrote:

> > Do you have any numbers (besides maybe microbenchmarks) that show a
> > real advantage by using per cpu data? What kind of usage do you expect
> > here?
>
> it has countless advantages, and these days we basically only design
> per-CPU data structures within the kernel, unless some limitation (such
> as API or hw property) forces us to do otherwise. So i turn around the
> question: what would be your reason for _not_ doing this clean per-CPU
> design for SMP systems?

Did I say I'm against it? No, I was just hoping someone put some more
thought into it than just "all the other kids are doing it".
I was just curious how well it really scales compared to the simple
version, e.g. what happens if most timers end up on a single cpu or what
happens if we want to start the timer on a different cpu. Is this so wrong
that you have to go into attack mode? :(

> > The other thing is that this assumes, that all time sources are
> > programmable per cpu, otherwise it will be more complicated for a time
> > source to run the timers for every cpu, I don't know how safe that
> > assumption is. Changing the array of structures into an array of
> > pointers to the structures would allow to switch between percpu bases
> > and a single base.
>
> yeah, and that's an assumption that simplifies things on SMP
> significantly. PIT on SMP systems for HRT is so gross that it's not
> funny. If anyone wants to revive that notion, please do a separate patch
> and make the case convincing enough ...

Why do use "PIT on SMP" as an extreme example to reject the general
concept completely? This doesn't explain, why first such a (simple) SMP
design shouldn't exist and why secondly my suggestion is such a big
problem.

bye, Roman

2005-10-10 14:04:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> > > Do you have any numbers (besides maybe microbenchmarks) that show a
> > > real advantage by using per cpu data? What kind of usage do you expect
> > > here?
> >
> > it has countless advantages, and these days we basically only design
> > per-CPU data structures within the kernel, unless some limitation (such
> > as API or hw property) forces us to do otherwise. So i turn around the
> > question: what would be your reason for _not_ doing this clean per-CPU
> > design for SMP systems?
>
> Did I say I'm against it? No, I was just hoping someone put some more
> thought into it than just "all the other kids are doing it". I was
> just curious how well it really scales compared to the simple version,
> e.g. what happens if most timer end up on a single cpu or what happens
> if we want to start the timer on a different cpu. Is this so wrong
> that you have to go into attack mode? :(

[ sorry, and i didnt go into 'attack mode'. I believe you'll distinctly
notice when i do that :-) ]

just think NUMA, and the generic advantages of PER_CPU become obvious.
(via PER_CPU the different data structures indexed can properly end up
on another domain's RAM, and can thus improve caching characteristics.)

Ingo

2005-10-10 17:23:12

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Sat, 1 Oct 2005, Thomas Gleixner wrote:

> > > ktimers seperate the "timer API" from the "timeout API".
> > I'm not really happy with these names, timeouts are what timers do, so
> > these names don't tell at all, what the difference is.
>
> There is a clear distinction between timers and timeouts.
>
> >From IT-dictonary:
>
> "Timeout is a specified period of time that will be allowed to elapse in
> a system before a specified event is to take place, unless another
> specified event occurs first; in either case, the period is terminated
> when either event takes place."
>
> "A timer is a specialized type of clock. A timer can be used to control
> the sequence of an event or process."

IOW a timer uses timeouts to control a sequence of events; it's still part
of the same thing, which makes "timer API" and "timeout API" very
confusing.

> > Calling them "process timer" and "kernel timer" would include their main
> > usage, although that also means ptimer were the more correct abbreviation.
>
> As said before I think the disctinction between timers and timeouts
> makes perfectly sense and ktimers are _not_ restricted to process
> timers.

"main usage" != "restricted to"

> > > +#ifndef KTIME_IS_SCALAR
> > > +typedef union {
> > > + s64 tv64;
> > > + struct {
> > > +#ifdef __BIG_ENDIAN
> > > + s32 sec, nsec;
> > > +#else
> > > + s32 nsec, sec;
> > > +#endif
> > > + } tv;
> > > +} ktime_t;
> > > +
> > > +#else
> > > +
> > > +typedef s64 ktime_t;
> > > +
> > > +#endif
> >
> > Making the union unconditional, would make tv64 always available and a lot
> > of macros unnessary.
>
> nsec,sec storage format is essentially different to the scalar storage
> format and has to be handled different.
>
> The above gives a clear distinction between scalar and sec/nsec based
> cases. So you cannot mess up without notice.

There are enough macros to do this anyway. There are a number of
operations which are identical. Separating them artificially makes
everything only more complicated.

> > > +struct ktimer {
> > > + struct rb_node node;
> > > + struct list_head list;
> > > + ktime_t expires;
> > > + ktime_t expired;
> > > + ktime_t interval;
> > > + int overrun;
> > > + unsigned long status;
> > > + void (*function)(void *);
> > > + void *data;
> > > + struct ktimer_base *base;
> > > +};
> >
> > This structure is rather large and I think a lot can be avoided.
> > - list: AFAICT it's only used by run_ktimer_queue() to get the first
> > pending entry. This can also be done by keeping track of the first entry
> > in the base structure (useful in other places as well).
>
> You are right that the list is not necessary for the plain integration
> into the current system, but it is necessary once you start to upgrade
> to high resolution timers.

Could you please specify these requirements?

> > - expired: can be replaced by base->last_expired (may also be useful in
> > other places)
>
> How gives base->last_expired a per timer expired information? And where
> would it be useful ?

If a callback needs that information, it can get it from there.

> > - status: only user is ktimer_active(), the same test can be done by
> > testing node.rb_parent.
>
> Uurg. Been there and discarded the idea, because its ugly and clashes
> with further extensibilty requirements e.g. high resolution timers,
> where we have more than two states.
>
> Having status information bound to arbitrary pointers is trading a
> variable against flexibility, cleanliness and maintainability.

If you want to introduce more states later, it requires changing _one_
macro, so I don't really see the problem.

> > - interval/overrun: this is only needed by itimers and I think it's
> > possible to leave it there. Main change would be to let 'function' return
> > a value indicating whether to rearm the timer or not (this includes
> > expires is updated).
>
> It is also used by the posix timer code and I plan to do another round
> of simplification also there.

Please explain.

> I do not want to end up with a next round of discussion there about
> either introducing tons of new ifdefs, macros or redesigning the code
> base another time.

I don't really see why this should be an excuse to introduce more
complex code now than really necessary. If that extra complexity can't stand
on its own, please introduce it as soon as it becomes necessary.
I like most of the patch, but I would prefer to do a simple
implementation/cleanup first and then build anything more complex on top
of it. If you need another complete redesign for this, then you are likely doing
something wrong already now.

> > Not using 64bit math here allows gcc to generate better code, e.g. gcc
> > has to add another test for "nsec < 0" because the condition code is
> > already used for the overflow, adding the "sec--" instead is IMO faster
> > (i.e. less likely).
>
> i686
> DOADD32 00000048
> DOADD64 0000002a
> DOSUB32 00000060
> DOSUB64 0000002f
> arm
> DOADD32 0000004c
> DOADD64 0000004c
> DOSUB32 00000040
> DOSUB64 00000038
> m68k
> DOADD32 0000003c
> DOADD64 0000002e
> DOSUB32 00000036
> DOSUB64 00000028
> powerpc
> DOADD32 00000040
> DOADD64 00000044
> DOSUB32 00000044
> DOSUB64 00000044
>
> Please do not tell me that size does not matter. :)
>
> I attached the assembler dumps, so you can have a look yourself. I did
> these tests during the implementation and decided on the results rather
> than on assumptions about gcc.

Did you look at the generated code? Most of it is function prologue/
epilogue, which is quite unimportant for inline functions. The other thing
I forgot to mention last time is that passing values by reference instead
of value also makes a difference.
For m68k I actually got smaller code this way (mostly because addx/subx
are limited in their addressing modes). In the other cases I'm actually
surprised gcc doesn't use the previous result from the sub and instead adds
another test. The remaining difference comes from how gcc deals with
structure vs. integral values, which could use some improvement;
especially the add case should have produced nearly identical results.

Anyway, this point wasn't that important; it's only a microoptimization, and
at least having the option to change it later (after more tests) is fine
with me.

> > Could you explain a little the resolution handling behind in your patch?
> > If I read SUS correctly clock resolution and timer resolution don't have
> > to be the same, the first is returned by clock_getres() and the latter
> > only documented somewhere (and AFAICT our implementation always returned
> > the wrong value).
>
> As far as I understand SUS timer resolution is equal to clock resolution
> and the timer value/interval is rounded up to the resolution.

Please check the rationale about clocks and timers. It talks about clocks
and timer services based on them and their resolutions can be different.

> > IMO this also means we can don't have to make the rounding that
> > complicated. Actually it could be done automatically by the timer, e.g.
> > interval timer are reprogrammed at (now + interval) and the timer
> > resolution will automatically round it up.
>
> Reprogramming interval timers by now + interval is completely wrong.
> Reprogramming has to be
> timer->expires + interval and nothing else.

Where do you get the requirement for an explicit rounding from?
The point is that the timer should not expire early, but there is more
than one way to do this, and it can be done implicitly using the timer
resolution.

> > > + /* Get current time */
> > > + now = base->get_time();
> >
> > As get_time() is not necessarily cheap, it can be avoided for nonrelative
> > timers by comparing it with the first pending timer. Maintaining a pointer
> > to the first timer here, avoids the timer list and is a simple check
> > whether the time source needs any reprogramming later.
>
> Would you please care to read the complete related code to find out why
> this does not work. This is totaly unrelated to reprogramming of the
> time event source in the HRT case.

You saw that I restricted this to "nonrelative timers"?

> > > + if ktime_cmp(timer->expires, <=, now) {
> > > + timer->expired = now;
> > > + /* The caller takes care of expiry */
> > > + if (!(mode & KTIMER_NOCHECK))
> > > + return -1;
> >
> > I think KTIMER_NOFAIL would be better name, for a while that had me
> > confused, as you actually do check the value, but you don't fail it and
> > enqueue it anyway.
>
> It does not fail. It returns in the case that the timer is already
> expired. The NOCHECK flag is used to skip the check.

It returns with a failure value!? The NOCHECK name is ambiguous about what
should not be checked; the NOFAIL name makes it clearer that the caller
doesn't need to check the return value, because the function won't fail.

bye, Roman

2005-10-11 07:40:52

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

On Mon, 2005-10-10 at 19:22 +0200, Roman Zippel wrote:
> > The above gives a clear distinction between scalar and sec/nsec based
> > cases. So you cannot mess up without notice.
>
> There are enough macros to do this anyway. There are a number of
> operations which are identical. Separating them artifically makes
> everything only more complicated.

I don't see a distinct set of macros around which provides all the
functionality.

> > As far as I understand SUS timer resolution is equal to clock resolution
> > and the timer value/interval is rounded up to the resolution.
>
> Please check the rationale about clocks and timers. It talks about clocks
> and timer services based on them and their resolutions can be different.

clock_settime():
... Time values that are between two consecutive non-negative integer
multiples of the resolution of the specified clock shall be truncated
down to the smaller multiple of the resolution.

timer_settime():
...Time values that are between two consecutive non-negative integer
multiples of the resolution of the specified timer shall be rounded up
to the larger multiple of the resolution. Quantization error shall not
cause the timer to expire earlier than the rounded time value.

> > Reprogramming interval timers by now + interval is completely wrong.
> > Reprogramming has to be
> > timer->expires + interval and nothing else.
>
> Where do get the requirement for an explicit rounding from?
> The point is that the timer should not expire early, but there is more
> than one way to do this and can be done implicitly using the timer
> resolution.

See above.

tglx


2005-10-12 22:37:18

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Tue, 11 Oct 2005, Thomas Gleixner wrote:

> > > As far as I understand SUS timer resolution is equal to clock resolution
> > > and the timer value/interval is rounded up to the resolution.
> >
> > Please check the rationale about clocks and timers. It talks about clocks
> > and timer services based on them and their resolutions can be different.
>
> clock_settime():
> ... Time values that are between two consecutive non-negative integer
> multiples of the resolution of the specified clock shall be truncated
> down to the smaller multiple of the resolution.
>
> timer_settime():
> ...Time values that are between two consecutive non-negative integer
> multiples of the resolution of the specified timer shall be rounded up
> to the larger multiple of the resolution. Quantization error shall not
> cause the timer to expire earlier than the rounded time value.

Where does it say anything about their resolutions being equal?

> > > Reprogramming interval timers by now + interval is completely wrong.
> > > Reprogramming has to be
> > > timer->expires + interval and nothing else.
> >
> > Where do get the requirement for an explicit rounding from?
> > The point is that the timer should not expire early, but there is more
> > than one way to do this and can be done implicitly using the timer
> > resolution.
>
> See above.

I know it, and the above is an _interface_ description, but what leads you to
the conclusion that your _implementation_ is the only correct one?

Thomas, are you even interested in discussing this? Do you just expect
that everyone accepts your patch and is happy? So far it's difficult
enough to get you to explain your design, but a serious discussion also
requires looking at the possible alternatives. It's quite possible I'm
wrong, but you have to try a little harder at explaining why.

bye, Roman

2005-10-12 23:48:06

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Roman Zippel wrote:
> Hi,
>
> On Tue, 11 Oct 2005, Thomas Gleixner wrote:
>
>
>>>>As far as I understand SUS timer resolution is equal to clock resolution
>>>>and the timer value/interval is rounded up to the resolution.
>>>
>>>Please check the rationale about clocks and timers. It talks about clocks
>>>and timer services based on them and their resolutions can be different.

Well, yes and no. Under timer_settime() it talks about ticks and resolution being the inverse of
the tick rate. AND it does imply that timers on a given CLOCK will have that clock's resolution as
returned by clock_getres(). This is fine as far as it goes. In practical systems we almost always
have a much higher resolution for clock_gettime() and gettimeofday() than the tick rate. What
the standard does not seem to want to do is to admit that a clock may have the ability to be read at
a better resolution than its tick rate.

For this reason, the usual practice is to return the "timer" resolution for clock_getres() and to
return clock values with as much resolution as possible. In no case should the actual clock
resolution be less than what clock_getres() returns.

>>
>>clock_settime():
>>... Time values that are between two consecutive non-negative integer
>>multiples of the resolution of the specified clock shall be truncated
>>down to the smaller multiple of the resolution.
>>
>>timer_settime():
>>...Time values that are between two consecutive non-negative integer
>>multiples of the resolution of the specified timer shall be rounded up
>>to the larger multiple of the resolution. Quantization error shall not
>>cause the timer to expire earlier than the rounded time value.

Here the standard uses "resolution of the specified timer" but the only way, in the standard, to
associate a resolution with a timer is via the CLOCK used.
>
>
> Where does it say anything about that their resolution is equal?

So the timer's resolution is the same as the CLOCK's resolution as returned by clock_getres(), but, as I
said above, the usual practice is to return clock values (via clock_gettime or gettimeofday) with
higher resolution.
>
>
>>>>Reprogramming interval timers by now + interval is completely wrong.
>>>>Reprogramming has to be
>>>>timer->expires + interval and nothing else.
>>>
>>>Where do get the requirement for an explicit rounding from?
>>>The point is that the timer should not expire early, but there is more
>>>than one way to do this and can be done implicitly using the timer
>>>resolution.
>>
>>See above.

The standard requires that timer expiry times and interval times be rounded up to the next
"resolution" value. For the first or initial time of a repeating timer we, usually, have to add an
additional "resolution" to account for starting the timer at some point between ticks. For the
interval on repeating timers, we know that the interval is starting at the last expiry time and thus
do not need to account for the between tick start time.
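
As a worked sketch of those two cases (roundup_to_res() is just shorthand for
"round up to the next multiple of the resolution", not an existing helper):

        first_expiry = roundup_to_res(now + it_value) + resolution;    /* armed between ticks */
        next_expiry  = prev_expiry + roundup_to_res(it_interval);      /* anchored at the expiry */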
>
~

--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-10-16 16:35:38

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Wed, 12 Oct 2005, George Anzinger wrote:

> > > > > As far as I understand SUS timer resolution is equal to clock
> > > > > resolution
> > > > > and the timer value/interval is rounded up to the resolution.
> > > >
> > > > Please check the rationale about clocks and timers. It talks about
> > > > clocks and timer services based on them and their resolutions can be
> > > > different.
>
> Well, yes and no. Under timer_settime() it talks about ticks and resolution
> being the inverse of the tick rate. AND it does imply that timers on a given
> CLOCK will have that clocks resolution as returned by clock_res(). This is
> fine as far as it goes. In practical systems we almost always have a much
> higher resolution for the clock_gettime() and gettimeofday() than the tick
> rate. What the standard does not seem to want to do is to admit that a clock
> may have the ability to be read at a better resolution than its tick rate.

The interesting question is what resolution CLOCK_REALTIME really has.
This paragraph in timer_settime() doesn't mention CLOCK_REALTIME, and
AFAICT historically the resolution of e.g. gettimeofday() was really in
the msec range.

IMO there is a far more interesting sentence under clock_getres(): "If
the time argument of clock_settime() is not a multiple of res, then the
value is truncated to a multiple of res."
This is relatively obvious for hardware clocks, e.g. we could define a
CLOCK_JIFFIES with a resolution of TICK_NSEC or CLOCK_PIT with a
resolution of 838 nsec. The conversion from the actual clock value to/from
timespec automatically takes care of any truncation/rounding.
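
(For reference, the 838 nsec figure is just the i8254 PIT cycle time: the PIT
counts at 1.193182 MHz, and 1 s / 1,193,182 ≈ 838 ns.)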

CLOCK_REALTIME is now a bit special as it doesn't map directly to a
hardware clock; it also includes adjustments, and these are done in nsec
resolution (actually even fractions of that in the NTP code). In 2.6 we
don't truncate the value anywhere and maintain it as a nsec value,
therefore the resolution of CLOCK_REALTIME should really be 1 nsec
(and 1 usec under 2.4).

OTOH the precision with which the clock can be read is a different matter
and depends on the hardware clock CLOCK_REALTIME is derived from. It would
really help if we could agree on what clock resolution really
means (especially for CLOCK_REALTIME). For hardware clocks the resolution
is defined by the conversion factor from clock cycles to timespec, but
CLOCK_REALTIME is a virtual clock, so is its resolution the precision with
which the clock can be read or written? clock_getres() specifically
mentions clock_settime()...

How we define what timer resolution means depends on this. Currently
we convert the timespec value from/into a jiffies value, so I guess the
resolution is really TICK_NSEC, as it's the resolution at which we
maintain the timer value. Thomas's patch now changes this and we keep a
nsec value, but doesn't that mean the resolution of the timer becomes 1
nsec? It's basically the same question as above: is the timer resolution
the precision at which we maintain the values, the precision with which
the timer can be read, or the precision with which the timer can be
programmed?

The spec is not really clear and Thomas' refusal to explain his design
decision is also not really helpful. :-(
He sets the timer resolution to (NSEC_PER_SEC/HZ) which matches no value
above and this way he basically creates another virtual timer, which has
only little to do with the real kernel timer tick.

I'm open to other interpretations and I think it's important to get to
some agreement, _before_ we start to change interfaces.

bye, Roman

2005-10-16 19:24:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

On Sun, 2005-10-16 at 18:34 +0200, Roman Zippel wrote:

> The spec is not really clear and Thomas' refusal to explain his design
> decision is also not really helpful. :-(

I did explain, why I did the rounding in the way it is implemented. If
you define the fact that I have a different interpretation of SUS than
you as refusal, then we can stop this thread right here.

> He sets the timer resolution to (NSEC_PER_SEC/HZ) which matches no value
> above and this way he basically creates another virtual timer, which has
> only little to do with the real kernel timer tick.

As George explained already we return the resolution of the timer as the
value which can be assumed to be the resolution of the event source,
which drives the timer, because that seems to be the only interesting
value for an application programmer. The theoretical resolution of a
jiffie based timer system is NSEC_PER_SEC/HZ.

So why is NSEC_PER_SEC/HZ creating a virtual timer? Because the ntp
adjusted resolution per tick is 1% off?

I really don't see any sense in returning changing resolution values
every 5 minutes due to NTP adjustments. I imagine the happiness of
application programmers which actually do calculations based on such a
resolution value.

And in the logical consequence you would have to save the original
userspace timespec value including the time when the timer is set up and
redo the rounding and calculation every time NTP changes the
NSEC_PER_TICK value for _all_ timers which are related to
CLOCK_MONOTONIC and CLOCK_REALTIME.

The code does not introduce a virtual timer at all. It uses the ntp
adjusted time reference and guarantees that the timer does not go off
early. Usually it expires with the next tick - of course system load can
delay it further.

tglx


2005-10-16 23:03:44

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Sun, 16 Oct 2005, Thomas Gleixner wrote:

> > The spec is not really clear and Thomas' refusal to explain his design
> > decision is also not really helpful. :-(
>
> I did explain, why I did the rounding in the way it is implemented. If
> you define the fact that I have a different interpretation of SUS than
> you as refusal, then we can stop this thread right here.

I have no problem with you having a different opinion, I have a problem
with your childish behaviour. :-(
You completely ignore the rest of my mail, trying to establish some base
definitions, which would help to figure out the options we have based on
the spec. You instead just insist on your interpretation without going
into any detail.

> > He sets the timer resolution to (NSEC_PER_SEC/HZ) which matches no value
> > above and this way he basically creates another virtual timer, which has
> > only little to do with the real kernel timer tick.
>
> As George explained already we return the resolution of the timer as the
> value which can be assumed to be the resolution of the event source,
> which drives the timer, because that seems to be the only interesting
> value for an application programmer. The theoretical resolution of a
> jiffie based timer system is NSEC_PER_SEC/HZ.

You still don't explain how you get to this conclusion based on the
spec. Instead you now redefine it to useful assumptions for application
programmers who can't read the spec...
You still completely leave unanswered the question of the possibility of
different resolutions. We can still discuss what resolution to return with
clock_getres(), but first we have to establish what kind of resolution
we're dealing with here.

> I really don't see any sense in returning changing resolution values
> every 5 minutes due to NTP adjustments. I imagine the happiness of
> application programmers which actually do calculations based on such a
> resolution value.

Why are they doing this kind of calculations based on this value?
We can discuss returning a reasonable value for these applications, but I
don't see how these assumptions should control how the kernel works.

> And in the logical consequence you would have to save the original
> userspace timespec value including the time when the timer is set up and
> redo the rounding and calculation every time NTP changes the
> NSEC_PER_TICK value for _all_ timers which are related to
> CLOCK_MONOTONIC and CLOCK_REALTIME.

The rounding is done based on your interpretation of the spec, which you
refuse to discuss. AFAICT the spec leaves enough room to avoid this
rounding completely.

bye, Roman

2005-10-17 07:59:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> > > The spec is not really clear and Thomas' refusal to explain his design
> > > decision is also not really helpful. :-(
> >
> > I did explain, why I did the rounding in the way it is implemented. If
> > you define the fact that I have a different interpretation of SUS than
> > you as refusal, then we can stop this thread right here.
>
> I have no problem with you having a different opinion, I have a problem
> with your childish behaviour. :-(

Roman, IMO Thomas has been more than reasonable in replying to you - i'd
have stopped replying to you after the first couple of mails, and we are
at mail round 10 now! Thomas is being very patient with you. You are
being difficult, and IMO you are wasting his and others' time.

the thing is that Thomas has advanced the whole issue of timeouts and
timekeeping by leaps and bounds and he has written thousands of lines of
new and excellent code for a kernel subsystem that has seen little
activity for many years, before John got involved. One of Thomas'
accomplishments is a timer/time design that allows the enabling of HRT
timers via an _18 lines_ architecture patch. (!)

on the other hand, i have yet to see a single line of code from you and
have yet to receive a single bugreport from you. (!)

so for me as a patch integrator and upstream maintainer the equation is
very simple, and i am not nearly as tolerant as Thomas: shut up Roman
already and show us the code!

really, start sending in patches. Testreports. Useful feedback. Those we
can judge by their merits. Talk is cheap. The time subsystem has been
dormant for years, and it has had more than enough talk already.

the moment you express yourself via patches we'll know that 1) you
understand what we have done so far 2) you have useful ideas of what
should be done differently 3) you have the coder capability to implement
and test those ideas. Patches wont be ignored, i can assure you. Get the
patches rolling!

Ingo

2005-10-17 08:27:54

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


Trivial "stupid" patch. MAKE Makefile HAVE -kt($num)!!!!

This will help with ketchup :-)

-- Steve

Index: linux-2.6.14-rc4-kt2/Makefile
===================================================================
--- linux-2.6.14-rc4-kt2.orig/Makefile 2005-10-17 10:14:26.000000000 +0200
+++ linux-2.6.14-rc4-kt2/Makefile 2005-10-17 10:15:12.000000000 +0200
@@ -1,7 +1,7 @@
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 14
-EXTRAVERSION =-rc4
+EXTRAVERSION =-rc4-kt2
NAME=Affluent Albatross

# *DOCUMENTATION*

2005-10-17 09:29:58

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Mon, 17 Oct 2005, Ingo Molnar wrote:

> the thing is that Thomas has advanced the whole issue of timeouts and
> timekeeping by leaps and bounds and he has written thousands of lines of
> new and excellent code for a kernel subsystem that has seen little
> activity for many years, before John got involved. One of Thomas'
> accomplishments is a timer/time design that allows the enabling of HRT
> timers via an _18 lines_ architecture patch. (!)

Did I say these patches were bad in general? All I'm asking for is an
explanation for a few design decisions to understand the patch and its
behaviour better and evaluate alternative solutions.
Neither of you have shown any real interest in this so far.

> the moment you express yourself via patches we'll know that 1) you
> understand what we have done so far 2) you have useful ideas of what
> should be done differently 3) you have the coder capability to implement
> and test those ideas. Patches wont be ignored, i can assure you. Get the
> patches rolling!

This "shut up and show code" attitude is sometimes quite funny, but it's
no real threat to me. I hoped to avoid this and solve this more civilized.
Of course I'll understand the issues better afterwards, but you could as
easily just tell me. It will waste my time, I could spend on other
projects and it will put Andrew in the unfortunate position to decide,
which patch to accept.
Is this really what you want?

bye, Roman

2005-10-17 09:41:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> > the moment you express yourself via patches we'll know that 1) you
> > understand what we have done so far 2) you have useful ideas of what
> > should be done differently 3) you have the coder capability to implement
> > and test those ideas. Patches wont be ignored, i can assure you. Get the
> > patches rolling!
>
> This "shut up and show code" attitude is sometimes quite funny, but
> it's no real threat to me. I hoped to avoid this and solve this more
> civilized. Of course I'll understand the issues better afterwards, but
> you could as easily just tell me. [...]

if a dozen mails werent enough then one more probably wont make a
difference, especially with your last mail calling Thomas's behavior
"childish" - when all he did was to try to explain his reasons to you as
patiently as possible! Thomas is not obliged to teach you or bear with
you - it is his own free choice. (But if you want to discuss this
personal angle any further please take the public lists (and other
people) off the Cc: list, it's getting very off-topic.)

Thomas's stuff is now fully integrated into the -rt tree and it works
excellently. I have measured a 12 usecs worst-case HR timer-delivery
latency (using cyclictest). _That_ is the thing i care about.

> [...] It will waste my time, I could spend on other projects and it
> will put Andrew in the unfortunate position to decide, which patch to
> accept. [...]

yes, please, put Andrew (and me too) into that unfortunate position!
Please, pretty please, get on with the patches!

Ingo

2005-10-17 09:55:59

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


On Mon, 17 Oct 2005, Roman Zippel wrote:

> Hi,
>
> On Mon, 17 Oct 2005, Ingo Molnar wrote:
>
> > the thing is that Thomas has advanced the whole issue of timeouts and
> > timekeeping by leaps and bounds and he has written thousands of lines of
> > new and excellent code for a kernel subsystem that has seen little
> > activity for many years, before John got involved. One of Thomas'
> > accomplishments is a timer/time design that allows the enabling of HRT
> > timers via an _18 lines_ architecture patch. (!)
>
> Did I say these patches were bad in general? All I'm asking for is an
> explanation for a few design decisions to understand the patch and its
> behaviour better and evaluate alternative solutions.
> Neither of you have shown any real interest in this so far.
>

Well, for me anyway, the best way I have of understanding someone's decisions
in their code design _is_ to start playing with the code. Try it the way
you want and you might realize things don't work so well, and then you
might understand why Thomas did it his way.

There have been several times where I thought I could write something better, and
after playing with it, the problems started to arise and I then became
"enlightened" by the decisions others had made.


> > the moment you express yourself via patches we'll know that 1) you
> > understand what we have done so far 2) you have useful ideas of what
> > should be done differently 3) you have the coder capability to implement
> > and test those ideas. Patches wont be ignored, i can assure you. Get the
> > patches rolling!
>
> This "shut up and show code" attitude is sometimes quite funny, but it's
> no real threat to me. I hoped to avoid this and solve this more civilized.
> Of course I'll understand the issues better afterwards, but you could as
> easily just tell me. It will waste my time, I could spend on other
> projects and it will put Andrew in the unfortunate position to decide,
> which patch to accept.
> Is this really what you want?
>

I think what Ingo is saying, is to modify Thomas' code and show where it
is failing, instead of just talking about it. You can ask "why" he did
something, but I think Thomas gave you enough in his answers. If you are
still not satisfied, then that is the time to start playing with the code
and find the problems, fix them and show us that "yes" your way is better.
Don't just ask why Thomas did it one way without a patch that changes it
to show us why he shouldn't have.

-- Steve

2005-10-17 09:58:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Ingo Molnar <[email protected]> wrote:
>
> > [...] It will waste my time, I could spend on other projects and it
> > will put Andrew in the unfortunate position to decide, which patch to
> > accept. [...]
>
> yes, please, put Andrew (and me too) into that unfortunate position!
> Please, pretty please, get on with the patches!

I'm with Roman on this one - the old "show me the code" trick which people
use to quash other people's objections is rather poor form - we should simply
address the objections as raised.

That being said, I'll confess that I've largely ignored this discussion in
the hope that things would get sorted out. Seems that this won't be
happening and as Roman's opinions carry weight I do intend to solicit a
(brief!) summary of his objections from him when the patch comes round
again. Sorry.

2005-10-17 11:00:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Andrew Morton <[email protected]> wrote:

> Ingo Molnar <[email protected]> wrote:
> >
> > > [...] It will waste my time, I could spend on other projects and it
> > > will put Andrew in the unfortunate position to decide, which patch to
> > > accept. [...]
> >
> > yes, please, put Andrew (and me too) into that unfortunate position!
> > Please, pretty please, get on with the patches!
>
> I'm with Roman on this one - the old "show me the code" trick which
> people use to quash other people's objections is rather poor form - we
> should simply address the objections as raised.
>
> That being said, I'll confess that I've largely ignored this
> discussion in the hope that things would get sorted out. Seems that
> this won't be happening and as Roman's opinions carry weight I do
> intend to solicit a (brief!) summary of his objections from him when
> the patch comes round again. Sorry.

Fine with me. A brief summary of technical objections (without any
personal attacks) is all we wanted to have to begin with. "Show me the
code" was my last-ditch attempt to move this seemingly unmovable
discussion from a communication channel where the chemistry doesnt seem
to work out to a more objective format.

Ingo

2005-10-17 16:26:12

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Mon, 17 Oct 2005, Andrew Morton wrote:

> That being said, I'll confess that I've largely ignored this discussion in
> the hope that things would get sorted out. Seems that this won't be
> happening and as Roman's opinions carry weight I do intend to solicit a
> (brief!) summary of his objections from him when the patch comes round
> again. Sorry.

It's rather simple:
- "timer API" vs "timeout API": I got absolutely no acknowlegement that
this might be a little confusing and in consequence "process timer" may be
a better name.
- I pointed out various (IMO) unnecessary complexities, which were rather
quickly brushed off e.g. with a need for further (not closer specified)
cleanups.
- resolution handling: at what resolution should/does the kernel work and
what do we report to user space. The spec allows multiple interpretations
and I have a hard time to get at least one coherent interpretation out of
Thomas.

Maybe I'm the only one who found Thomas' answers a little superficial, but
as this is a central kernel subsystem I think it deserves a closer look,
and every time I tried to poke a little deeper I got nothing.

bye, Roman

2005-10-17 16:33:47

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Mon, 17 Oct 2005, Ingo Molnar wrote:

> if a dozen mails werent enough then one more probably wont make a
> difference,

Just for the record: in this thread I got exactly three answers from
Thomas. I don't know where you got the other nine mails from, maybe you
could forward them to me, as they seem to contain the "patient
explanations" I'm missing.

bye, Roman

2005-10-17 16:40:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> Hi,
>
> On Mon, 17 Oct 2005, Ingo Molnar wrote:
>
> > if a dozen mails werent enough then one more probably wont make a
> > difference,
>
> Just for the record: in this thread I got exactly three answers from
> Thomas. I don't know where you got the other nine mails from, maybe
> you could forward them to me, as they seem to contain the "patient
> explanations" I'm missing.

here are all the replies from Thomas, regarding ktimers:

12359 * Sep 22 Thomas Gleixner ( 319) Re: [ANNOUNCE] ktimers subsystem
12362 * Sep 23 Thomas Gleixner ( 49) Re: [ANNOUNCE] ktimers subsystem
12363 * Sep 23 Thomas Gleixner ( 235) Re: [ANNOUNCE] ktimers subsystem
12367 * Sep 24 Thomas Gleixner ( 214) Re: [ANNOUNCE] ktimers subsystem
12368 * Sep 25 Thomas Gleixner ( 25) Re: [ANNOUNCE] ktimers subsystem
12369 * Sep 25 Thomas Gleixner ( 17) Re: [ANNOUNCE] ktimers subsystem
12370 * Sep 25 Thomas Gleixner ( 10) Re: [ANNOUNCE] ktimers subsystem
12387 * Oct 01 Thomas Gleixner ( 817) Re: [PATCH] ktimers subsystem 2.6.14-rc
12419 * Oct 11 Thomas Gleixner ( 41) Re: [PATCH] ktimers subsystem 2.6.14-rc
12434 * Oct 16 Thomas Gleixner ( 40) Re: [PATCH] ktimers subsystem 2.6.14-rc

some of them very large and detailed.

Ingo

2005-10-17 16:44:08

by Tim Bird

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Roman Zippel wrote:
> On Mon, 17 Oct 2005, Andrew Morton wrote:
>>That being said, I'll confess that I've largely ignored this discussion in
>>the hope that things would get sorted out. Seems that this won't be
>>happening and as Roman's opinions carry weight I do intend to solicit a
>>(brief!) summary of his objections from him when the patch comes round
>>again. Sorry.
>
>
> It's rather simple:
> - "timer API" vs "timeout API": I got absolutely no acknowlegement that
> this might be a little confusing and in consequence "process timer" may be
> a better name.

I agree with Thomas on this one. Maybe "timer" and "timeout" are too
close, but I think they are the most descriptive names.
- timeout is something used for a timeout. Timeouts only actually
expire infrequently, so they have a host of attributes associated
with that characteristic.
- timer is something used to time something. They almost always
expire as part of their normal behaviour. In the ktimer code they
have a host of attributes related to this characteristic.

Thomas answered the suggestion to use "process timer" as an alternative
name, but I didn't see a reply after that from Roman (I may have missed it.)

> - I pointed out various (IMO) unnecessary complexities, which were rather
> quickly brushed off e.g. with a need for further (not closer specified)
> cleanups.

This is rather vague. It is rather easy to raise hypothetical
issues. From what I've seen, Thomas has gone to great lengths to
address specific issues raised. For example, he actually compiled
code on 4 different platforms to get the REAL size of the assembly
fragments, in order to address your concern about CONJECTURED size
problems.

> - resolution handling: at what resolution should/does the kernel work and
> what do we report to user space. The spec allows multiple interpretations
> and I have a hard time to get at least one coherent interpretation out of
> Thomas.

Huh? I thought Thomas' last answer was pretty clear.

>
> Maybe I'm the only one who found Thomas' answers a little superficial, but
> as this is a central kernel subsystem I think it deserves a closer look,
> and every time I tried to poke a little deeper I got nothing.

No one minds poking deep. But frankly, I find hypothetical arguments
to be less useful than reality-backed ones. I would rather not hear
reasoning about a resolution issue - I'd like to see numbers, if possible,
about the degradation of performance, if that's the issue. If
it's confusion about the API, then maybe we just need clear statements
that "X API provides resolution at Y level" (from one of: hardware, tick,
something else).

Regards,
-- Tim

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

2005-10-17 16:55:37

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Mon, 17 Oct 2005, Ingo Molnar wrote:

> here are all the replies from Thomas, regarding ktimers:
>
> 12359 * Sep 22 Thomas Gleixner ( 319) Re: [ANNOUNCE] ktimers subsystem
> 12362 * Sep 23 Thomas Gleixner ( 49) Re: [ANNOUNCE] ktimers subsystem
> 12363 * Sep 23 Thomas Gleixner ( 235) Re: [ANNOUNCE] ktimers subsystem
> 12367 * Sep 24 Thomas Gleixner ( 214) Re: [ANNOUNCE] ktimers subsystem
> 12368 * Sep 25 Thomas Gleixner ( 25) Re: [ANNOUNCE] ktimers subsystem
> 12369 * Sep 25 Thomas Gleixner ( 17) Re: [ANNOUNCE] ktimers subsystem
> 12370 * Sep 25 Thomas Gleixner ( 10) Re: [ANNOUNCE] ktimers subsystem

Different thread and not directly related to issues with the patch.

> 12387 * Oct 01 Thomas Gleixner ( 817) Re: [PATCH] ktimers subsystem 2.6.14-rc
> 12419 * Oct 11 Thomas Gleixner ( 41) Re: [PATCH] ktimers subsystem 2.6.14-rc
> 12434 * Oct 16 Thomas Gleixner ( 40) Re: [PATCH] ktimers subsystem 2.6.14-rc

That's the only mails related to the patch.

bye, Roman

2005-10-17 17:26:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5



On Mon, 17 Oct 2005, Tim Bird wrote:

> >
> >
> > It's rather simple:
> > - "timer API" vs "timeout API": I got absolutely no acknowlegement that
> > this might be a little confusing and in consequence "process timer" may be
> > a better name.
>
> I agree with Thomas on this one. Maybe "timer" and "timeout" are too
> close, but I think they are the most descriptive names.
> - timeout is something used for a timeout. Timeouts only actually
> expire infrequently, so they have a host of attributes associated
> with that characteristic.
> - timer is something used to time something. They almost always
> expire as part of their normal behaviour. In the ktimer code they
> have a host of attributes related to this characteristic.
>
> Thomas answered the suggestion to use "process timer" as an alternative
> name, but I didn't see a reply after that from Roman (I may have missed it.)
>

I can add to this. After this was brought up, I did a little
non-scientific survey. I walked around and asked various engineers here at
my customer's site what it meant to them if I had two types of timer
APIs, one for "timers" and one for "timeouts". All 8 people that
I asked (not a lot, but still) had no confusion about what they meant. I
asked them to explain what these names mean to them, and every one said
basically that timeouts are for things that lasted too
long, and timers are for things where they want to be notified of an
event that takes place at some time.

They all agreed with me that timeouts were for exceptions and not expected
to be triggered, and timers were the other way around and should always be
triggered.

Not only that, I also asked if these timers would make sense if we called
them "kernel" timers and "process" timers. These names confused them
because they use both timers in their kernel modules.

That convinced me enough to think that Thomas' naming convention is not
confusing.

-- Steve

2005-10-17 17:35:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> > > Just for the record: in this thread I got exactly three answers
> > > from Thomas. I don't know where you got the other nine mails from,
> > > maybe you could forward them to me, as they seem to contain the
> > > "patient explanations" I'm missing.
> > >
> > here are all the replies from Thomas, regarding ktimers:
> >
> > 12359 * Sep 22 Thomas Gleixner ( 319) Re: [ANNOUNCE] ktimers subsystem
> > 12362 * Sep 23 Thomas Gleixner ( 49) Re: [ANNOUNCE] ktimers subsystem
> > 12363 * Sep 23 Thomas Gleixner ( 235) Re: [ANNOUNCE] ktimers subsystem
> > 12367 * Sep 24 Thomas Gleixner ( 214) Re: [ANNOUNCE] ktimers subsystem
> > 12368 * Sep 25 Thomas Gleixner ( 25) Re: [ANNOUNCE] ktimers subsystem
> > 12369 * Sep 25 Thomas Gleixner ( 17) Re: [ANNOUNCE] ktimers subsystem
> > 12370 * Sep 25 Thomas Gleixner ( 10) Re: [ANNOUNCE] ktimers subsystem
>
> Different thread and not directly related to issues with the patch.

ugh, what were they about then, poetry?

Ah i think i know what you mean: these were about a PREVIOUS VERSION of
the patch, and hence they fell off the face of the earth, regardless of
their content, right? What a tricky little definition of "Thomas replied
only 3 times" ...

> > 12387 * Oct 01 Thomas Gleixner ( 817) Re: [PATCH] ktimers subsystem 2.6.14-rc
> > 12419 * Oct 11 Thomas Gleixner ( 41) Re: [PATCH] ktimers subsystem 2.6.14-rc
> > 12434 * Oct 16 Thomas Gleixner ( 40) Re: [PATCH] ktimers subsystem 2.6.14-rc
>
> That's the only mails related to the patch.

your latest mail with the list of 'open' issues seems to contradict your
assertion that the above 3 mails from Thomas were "the only mails
related to the patch". E.g.:

' - "timer API" vs "timeout API": I got absolutely no acknowlegement
that this might be a little confusing and in consequence "process
timer" may be a better name. '

was raised and discussed in the first chunk of mails just as well.

Ingo

2005-10-17 18:38:46

by George Spelvin

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

> - "timer API" vs "timeout API": I got absolutely no acknowlegement that
> this might be a little confusing and in consequence "process timer" may be
> a better name.

I have to disagree. Once you grasp the desirability of having two kinds
of timers, one optimized for the case where it does expire, and one
optimized for the case where it is aborted or rescheduled before its
expiration time, the timer/timeout terminology seems quite intuitive
to me.

In particular, knowing that there are these two kinds of timers, and
they are called "timer" and "timeout", it's immediately obvious to me
which is which. I have no idea which one a "process timer" is.

The word "timeout" is already understood to refer to an error-recovery
mechanism. The common and desired case is that a timeout does not occur,
but rather the device responds, the packet is acknowledged, or what
have you.

Also, a common use case is that a timeout is continuously active, but its
expiration time keeps getting postponed. And great accuracy in timing is
not required; if the timeout expires 10% late, very little harm is done.

Finally, timeouts are always relative to some triggering event, not
absolute.

Given this, a specialized timer implementation that optimizes timer
scheduling at the expense of timer expiration makes all sorts of sense.


Note that the network stack can make good use of both kinds. Timeouts
for all the usual network timeouts, but high-resolution timers are very
desirable for transmission rate control.

2005-10-17 18:50:30

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Mon, 17 Oct 2005, Tim Bird wrote:

> > It's rather simple:
> > - "timer API" vs "timeout API": I got absolutely no acknowlegement that this
> > might be a little confusing and in consequence "process timer" may be a
> > better name.
>
> I agree with Thomas on this one. Maybe "timer" and "timeout" are too close,
> but I think they are the most descriptive names.
> - timeout is something used for a timeout. Timeouts only actually
> expire infrequently, so they have a host of attributes associated
> with that characteristic.
> - timer is something used to time something. They almost always
> expire as part of their normal behaviour. In the ktimer code they
> have a host of attributes related to this characteristic.

There is of course a difference, but is it big enough that they deserve
different APIs? Just look into <linux/timer.h> it doesn't mention timeout
once, but according to Thomas that's our "timeout API". Look at the
description of mod_timer() in timer.c: "modify a timer's timeout".
It seems I'm not only one who thinks that both are closely related.

> Thomas answered the suggestion to use "process timer" as an alternative name,
> but I didn't see a reply after that from Roman (I may have missed it.)

It was short and painless:

} > > Calling them "process timer" and "kernel timer" would include their main
} > > usage, although that also means ptimer were the more correct abbreviation.
} >
} > As said before I think the disctinction between timers and timeouts
} > makes perfectly sense and ktimers are _not_ restricted to process
} > timers.
}
} "main usage" != "restricted to"

IOW I didn't say that "process timer" are restricted to processes, but
it's their intended main usage. "kernel timer" are OTOH the first choice
for any internal kernel time issues (which are not just timeouts).

> > - I pointed out various (IMO) unnecessary complexities, which were rather
> > quickly brushed off e.g. with a need for further (not closer specified)
> > cleanups.
>
> This is rather vague. It is rather easy to raise hypothetical
> issues. From what I've seen, Thomas has gone to great lengths to
> address specific issues raised. For example, he actually compiled
> code on 4 different platforms to get the REAL size of the assembly
> fragments, in order to address your concern about CONJECTURED size
> problems.

This was the _only_ issue where he got into any detail, but I also
mentioned later that this is one of the minor issues.
The above was about the size of the ktimer structure and the interval timer.

> > - resolution handling: at what resolution should/does the kernel work and
> > what do we report to user space. The spec allows multiple interpretations
> > and I have a hard time to get at least one coherent interpretation out of
> > Thomas.
>
> Huh? I thought Thomas' last answer was pretty clear.

Then I must have missed something. Earlier he just quotes something from
SUS without any explanation. His last answer was just about user
expectations without any connection to the different resolutions at the
kernel side I described in the mail before.

bye, Roman

2005-10-17 19:05:05

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Mon, 17 Oct 2005 [email protected] wrote:

> > - "timer API" vs "timeout API": I got absolutely no acknowlegement that
> > this might be a little confusing and in consequence "process timer" may be
> > a better name.
>
> I have to disagree. Once you grasp the desirability of having two kinds
> of timers, one optimized for the case where it does expire, and one
> optimized for the case where it is aborted or rescheduled before its
> expiration time, the timer/timeout terminology seems quite intuitive
> to me.

Thank you, that's exactly the confusion I'd like to avoid.

The main difference is in their possible resolution: kernel timers are
low resolution, low overhead timers optimized for kernel needs, and process
timers have a higher resolution mainly for applications, but this also
implies a larger overhead (i.e. more expensive locking).

bye, Roman

2005-10-17 19:19:27

by Tim Bird

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Roman Zippel wrote:
> } > > Calling them "process timer" and "kernel timer" would include their main
> } > > usage, although that also means ptimer were the more correct abbreviation.
> } >
> } > As said before I think the disctinction between timers and timeouts
> } > makes perfectly sense and ktimers are _not_ restricted to process
> } > timers.
> }
> } "main usage" != "restricted to"
>
> IOW I didn't say that "process timer" are restricted to processes, but
> it's their intended main usage. "kernel timer" are OTOH the first choice
> for any internal kernel time issues (which are not just timeouts).

Maybe for a more experienced kernel person such as
yourself, this distinction make sense. But
"process timer" and "kernel timer" don't carry much
semantic value for me. They seem to convey an
arbitrary expectation of usage patterns. Maybe
they match the current usage patterns in the kernel,
but I'd prefer naming based on functionality or
behaviour of the API.


> There is of course a difference, but is it big enough that they deserve
> different APIs?

IMHO yes. I think having separate APIs will eventually be
beneficial to allow better handling of resolution
manipulation in the future.

For example, timeouts are likely to need less resolution,
and it may be valuable to adjust the resolution of timeouts
to support coalescing timeouts for better tickless operation.
(Driving towards better power management performance for
embedded devices.)

> Just look into <linux/timer.h> it doesn't mention timeout
> once, but according to Thomas that's our "timeout API". Look at the
> description of mod_timer() in timer.c: "modify a timer's timeout".
> It seems I'm not only one who thinks that both are closely related.

I'm not sure if you are arguing for renaming the
old API. I would be in favor of this (from an abstract
perspective, to clarify the usage in the kernel), but
it might be too big a change right now.

Regards,
-- Tim



=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

2005-10-17 19:48:30

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Mon, 17 Oct 2005, Tim Bird wrote:

> Maybe for a more experienced kernel person such as
> yourself, this distinction make sense. But
> "process timer" and "kernel timer" don't carry much
> semantic value for me. They seem to convey an
> arbitrary expectation of usage patterns. Maybe
> they match the current usage patterns in the kernel,
> but I'd prefer naming based on functionality or
> behaviour of the API.

Let's say you want to implement a watchdog timer for a driver, which runs
about every second to do something. Now if you have the choice between
"timer API" vs. "timeout API" and "kernel timer" vs. "process timer", what
would you choose based on the name?

bye, Roman

2005-10-17 20:09:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> > > It's rather simple:
> > > - "timer API" vs "timeout API": I got absolutely no acknowlegement that this
> > > might be a little confusing and in consequence "process timer" may be a
> > > better name.
> >
> > I agree with Thomas on this one. Maybe "timer" and "timeout" are too close,
> > but I think they are the most descriptive names.
> > - timeout is something used for a timeout. Timeouts only actually
> > expire infrequently, so they have a host of attributes associated
> > with that characteristic.
> > - timer is something used to time something. They almost always
> > expire as part of their normal behaviour. In the ktimer code they
> > have a host of attributes related to this characteristic.
>
> There is of course a difference, but is it big enough that they
> deserve different APIs? Just look into <linux/timer.h> it doesn't
> mention timeout once, but according to Thomas that's our "timeout
> API". Look at the description of mod_timer() in timer.c: "modify a
> timer's timeout". It seems I'm not only one who thinks that both are
> closely related.

this is one more area where there's no good substitute for 'walking the
walk', i.e. getting yourself dirty with actual code. I have been
involved with the following variants which were part of the -rt tree:

- we implemented both timeouts and timers with the same
timeout-optimized framework [i.e. with the 'wheel'] - it sucked.

- timers and timeouts with a timer-optimized framework [i.e. with a
binary tree] sucked too, due to the tree overhead.

- we in fact tried another variant too: a hybrid method where timers and
timeouts lived in the timer wheel and some time before (hr) timers
were about to time out they were put into a separate hr-list. This
hybrid solution sucked too.

so then we tried a separate API and subsystem for both of them, and
voila, many of the uglinesses went away, and things became robust.

Ingo

2005-10-17 20:13:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> > Maybe for a more experienced kernel person such as
> > yourself, this distinction make sense. But
> > "process timer" and "kernel timer" don't carry much
> > semantic value for me. They seem to convey an
> > arbitrary expectation of usage patterns. Maybe
> > they match the current usage patterns in the kernel,
> > but I'd prefer naming based on functionality or
> > behaviour of the API.
>
> Let's say you want to implement a watchdog timer for a driver, which
> runs about every second to do something. Now if you have the choice
> between "timer API" vs. "timeout API" and "kernel timer" vs. "process
> timer", what would you choose based on the name?

why you insist on ktimers being 'process timers'? They are totally
separate entities, not limited to any process notion. One of their first
practical use happens to be POSIX process timers (both itimers and
ptimers) via them, but no way are ktimers only 'process timers'. They
are very generic timers, usable for any kernel purpose.

so to answer your question: it is totally possible for a watchdog
mechanism to use ktimers. In fact it would be desirable from a
robustness POV too: e.g. we dont want a watchdog from being
overload-able via too many timeouts in the timer wheel ...

Ingo

2005-10-17 20:31:55

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Mon, 17 Oct 2005, Ingo Molnar wrote:

> why you insist on ktimers being 'process timers'?

Because they are optimized for process usage. OTOH kernel usage is more
than just "timeouts".

> so to answer your question: it is totally possible for a watchdog
> mechanism to use ktimers. In fact it would be desirable from a
> robustness POV too:

"possible" and "desirable" is still different from "preferable", as they
involve a higher cost.

> e.g. we dont want a watchdog from being
> overload-able via too many timeouts in the timer wheel ...

Please explain.

bye, Roman

2005-10-17 20:52:55

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

On Mon, 2005-10-17 at 18:25 +0200, Roman Zippel wrote:
> It's rather simple:
> - "timer API" vs "timeout API": I got absolutely no acknowlegement that
> this might be a little confusing and in consequence "process timer" may be
> a better name.

Not only me, a lot of other people also do _not_ find it confusing, and I
explained why it is a clear technical distinction. I also explained why
I think that process_timers is too restrictive IMO.

I accept that you find it confusing, but I understand neither what
kind of acknowledgement you want nor how you deduce my obligation to
provide one.

> - I pointed out various (IMO) unnecessary complexities, which were rather
> quickly brushed off e.g. with a need for further (not closer specified)
> cleanups.

The so-called complexities are not that numerous. You complained about
exactly 5 members of the ktimer structure:

- list, expired, status, interval, overrun

which are superfluous in your opinion.

Again, an explanation for each:

list:
allows fast access to the time sorted list without walking the rbtree
and is a preliminary for the extension to high resolution timers.

-----------

expired:
The field was added for simplification of some delta calculations in the
return path. e.g. nanosleep in the expired case to avoid the extra call
to get the current time. Also quite useful for debugging.

-----------

status:
A simple field, which stores at the moment 2 states and is necessary for
extensions to high resolution timers too, as we have more states there.
The suggested usage of the rbnode.parent pointer is wrong IMO as the
overloading of arbitrary pointers for status information is a kind of
pseudo optimization which in fact reduces maintainability and
clarity for the win of a 32bit variable.

-----------

interval, overrun:
Interval holds the converted interval value for itimers. The overrun
member is used by the rearm code so the caller can figure out the number
of missed events.

The cleanup I pointed out for the posix timer interval timers is pretty
obvious. It makes use of interval and overrun and removes two members of
the posix timer structure.

-----------
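
Putting the explanations above together, a rough sketch of the structure under discussion could
look like the following. This is an illustration only, not the definition from the patch: the
field names follow the discussion, while the types, the expires field and the callback members are
guesses, and the real patch stores times in its own ktime format.

#include <linux/list.h>
#include <linux/rbtree.h>
#include <linux/types.h>

/* stand-in for the union/scalar ktime storage format */
typedef s64 nsec_t;

struct ktimer_sketch {
        struct rb_node          node;       /* entry in the time-sorted rbtree */
        struct list_head        list;       /* fast access to the sorted list, groundwork for HRT */
        nsec_t                  expires;    /* absolute expiry time (assumed) */
        nsec_t                  expired;    /* time the timer actually expired, saves a clock read */
        int                     status;     /* two states today, more once HRT is added */
        nsec_t                  interval;   /* converted interval value for itimers */
        int                     overrun;    /* missed events, filled in by the rearm code */
        void                    (*function)(void *data);   /* callback (assumed) */
        void                    *data;
};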

The size of the ktimer structure is a matter of micro optimizations in
the same way as the macros/inlines are.

Calling the pure existence of some struct members complexity is an
exaggeration and contradicts your own request for a simple and clear
design.

The implementation was kept clear and simple from the very beginning, and
I really dont understand why preparing for further extensions from the
start is bad.

Doing a design with the final goal in mind is much cleaner than doing
micro optimizations in the first place and afterwards working around
them when you apply extensions.

> - resolution handling: at what resolution should/does the kernel work and
> what do we report to user space. The spec allows multiple interpretations
> and I have a hard time to get at least one coherent interpretation out of
> Thomas.

I interpret the spec in the way I do for following reasons:

1. It is _usual practice_ to return the "timer" resolution for
clock_res() and to return clock values with as much resolution as
possible. In no case should the actual clock resolution be less than
what clock_res() returns.
- George Anzinger in this thread. Similar opinions can be found via
Google. I came to the same conclusion and saw no reason to repeat
Georges statement. I thought a simple pointer would be sufficient.

2. The rounding to the resolution value is explicitly required by the
standard.

3. It makes a lot of sense to do what (1.) describes, due to the fact
that we actually want to restrict the timer resolution to avoid
interrupt and reprogramming floods in very short intervals. This in fact
is the default behaviour in a jiffy driven environment. Pretending a
real nsec resolution and doing no rounding at all is violating (2.).
From an application programmer's view it makes sense to return the timer
resolution so he actually can adjust the program behaviour on the
provided resolution and not rely on wild guess assumptions about what
might be there. Applications need to be able to verify whether the
system can handle the required intervals or not.
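
As an illustration of point (3.), an application can query the advertised resolution with the
standard clock_getres() call and check it against the interval it needs. A sketch only; the helper
name is made up:

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <time.h>

static int interval_supported(clockid_t clock, long wanted_ns)
{
        struct timespec res;

        if (clock_getres(clock, &res) != 0)
                return 0;

        long long res_ns = (long long)res.tv_sec * 1000000000LL + res.tv_nsec;

        /* A jiffy-driven kernel with HZ=250 would report 4000000 ns here;
         * an application needing 1 ms periods could then back off. */
        return wanted_ns >= res_ns;
}

int main(void)
{
        printf("1 ms interval on CLOCK_MONOTONIC supported: %d\n",
               interval_supported(CLOCK_MONOTONIC, 1000000L));
        return 0;
}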

tglx


2005-10-17 22:41:49

by George Spelvin

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

>> I have to disagree. Once you grasp the desirability of having two kinds
>> of timers, one optimized for the case where it does expire, and one
>> optimized for the case where it is aborted or rescheduled before its
>> expiration time, the timer/timeout terminology seems quite intuitive
>> to me.

> Thank you, that's exactly the confusion I'd like to avoid.

Er... *what* confusion? I wasn't in the slightest bit confused when
I wrote that, and re-reading it very carefully, I can't see how
it could be interpreted in a way that is in any way confusing.

There's no more confusion in that paragraph than there is lemon meringue.

The only thing confusing is your response.

2005-10-18 00:07:58

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Mon, 17 Oct 2005, Thomas Gleixner wrote:

> On Mon, 2005-10-17 at 18:25 +0200, Roman Zippel wrote:
> > It's rather simple:
> > - "timer API" vs "timeout API": I got absolutely no acknowlegement that
> > this might be a little confusing and in consequence "process timer" may be
> > a better name.
>
> Not only me, a lot of other people also do _not_ find it confusing, and I
> explained why it is a clear technical distinction. I also explained why
> I think that process_timers is too restrictive IMO.

People don't find it confusing exactly because it gives them the wrong
idea about it: neither "API" is restricted to just timeouts or timers.
I don't insist on the term "process timer", but I'd really like to find
something better than ktimer. We already have the kernel timer API, which
is the primary API for kernel usage (for both timers and timeouts).

> list:
> allows fast access to the time sorted list without walking the rbtree
> and is a preliminary for the extension to high resolution timers.

Only access to the first element is needed, which can be cached in the base.
Please explain the second part.

> expired:
> The field was added for simplification of some delta calculations in the
> return path. e.g. nanosleep in the expired case to avoid the extra call
> to get the current time. Also quite useful for debugging.

The return path can also get it from the base.

> status:
> A simple field, which stores at the moment 2 states and is necessary for
> extensions to high resolution timers too, as we have more states there.
> The suggested usage of the rbnode.parent pointer is wrong IMO as the
> overloading of arbitrary pointers for status information is a kind of
> pseudo optimization which in fact reduces maintainability and
> clarity for the win of a 32bit variable.

Testing a pointer is not "arbitrary"; we do it all the time in the kernel.

> interval, overrun:
> Interval holds the converted interval value for itimers. The overrun
> member is used by the rearm code so the caller can figure out the number
> of missed events.
>
> The cleanup I pointed out for the posix timer interval timers is pretty
> obvious. It makes use of interval and overrun and removes two members of
> the posix timer structure.

Where I think it's possible to separate the timer from the interval
functionality to get a simpler timer base implementation.

> The size of the ktimer structure is a matter of micro optimizations in
> the same way as the macros/inlines are.

Not really, these fields have to be initialized and maintained, which
quickly goes beyond "micro optimizations".

> Calling the pure existence of some struct members complexity is an
> exaggeration and contradicts your own request for a simple and clear
> design.

That's not all I had in mind regarding complexity; I just started
with the simpler parts and never got to the more complex ones.

> Doing a design with the final goal in mind is much cleaner than doing
> micro optimizations in the first place and afterwards working around
> them when you apply extensions.

This is fine, but then you should explain them; I'm not a mind reader
who can guess what you're planning.

> > - resolution handling: at what resolution should/does the kernel work and
> > what do we report to user space. The spec allows multiple interpretations
> > and I have a hard time to get at least one coherent interpretation out of
> > Thomas.
>
> I interpret the spec in the way I do for following reasons:
>
> 1. It is _usual practice_ to return the "timer" resolution for
> clock_res() and to return clock values with as much resolution as
> possible. In no case should the actual clock resolution be less than
> what clock_res() returns.
> - George Anzinger in this thread. Similar opinions can be found via
> Google. I came to the same conclusion and saw no reason to repeat
> Georges statement. I thought a simple pointer would be sufficient.

In this case you don't interpret the spec, you ignore the spec. (I'll
leave it open whether that's a good or bad thing.)

> 2. The rounding to the resolution value is explicitly required by the
> standard.

It doesn't explicitly specify which resolution (see the previous mail).
It doesn't explicitly specify how this rounding has to be implemented.

> 3. It makes a lot of sense to do what (1.) describes, due to the fact
> that we actually want to restrict the timer resolution to avoid
> interrupt and reprogramming floods in very short intervals. This in fact
> is the default behaviour in a jiffy driven environment. Pretending a
> real nsec resolution and doing no rounding at all is violating (2.).
> From an application programmer's view it makes sense to return the timer
> resolution so he actually can adjust the program behaviour on the
> provided resolution and not rely on wild guess assumptions about what
> might be there. Applications need to be able to verify whether the
> system can handle the required intervals or not.

A portable application simply cannot make this assumption.

Anyway, it's rather confusing how you ignore the spec when "it makes a
lot of sense" and OTOH stick to it otherwise. I honestly don't
know how to argue on this basis, where the spec can be arbitrarily
redefined based on undocumented assumptions.

bye, Roman

2005-10-18 01:04:42

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Roman Zippel wrote:
> Hi,
>
> On Mon, 17 Oct 2005, Thomas Gleixner wrote:
>
>
>>On Mon, 2005-10-17 at 18:25 +0200, Roman Zippel wrote:
>>
~
>>interval, overrun:
>>Interval holds the converted interval value for itimers. The overrun
>>member is used by the rearm code so the caller can figure out the number
>>of missed events.
>>
>>The cleanup I pointed out for the posix timer interval timers is pretty
>>obvious. It makes use of interval and overrun and removes two members of
>>the posix timer structure.
>
>
> Where I think it's possible to separate the timer from the interval
> functionality to get a simpler timer base implementation.

They are required fields for the POSIX timer. I think you are saying that they should be there and
not in the ktime struct, which is part of the POSIX timer struct. Is that right?

Along this line, I have a bit of a problem with the ktimer code doing the timer repeat stuff. This
is NOT used by POSIX timers because we want to wait for the user to pick up the signal before
starting the next interval. This is key to avoiding timer storms and I would think that putting the
repeat stuff in ktimer code opens it to the possibility of other users starting a timer storm via
this. I think the itimer code should also use the signal call back to start the next interval, and
for the same reason.
>
>
~
>>>- resolution handling: at what resolution should/does the kernel work and
>>>what do we report to user space. The spec allows multiple interpretations
>>>and I have a hard time to get at least one coherent interpretation out of
>>>Thomas.
>>
>>I interpret the spec in the way I do for following reasons:
>>
>>1. It is _usual practice_ to return the "timer" resolution for
>>clock_res() and to return clock values with as much resolution as
>>possible. In no case should the actual clock resolution be less than
>>what clock_res() returns.
>>- George Anzinger in this thread. Similar opinions can be found via
>>Google. I came to the same conclusion and saw no reason to repeat
>>Georges statement. I thought a simple pointer would be sufficient.
>
>
> In this case you don't interpret the spec, you ignore the spec. (I'll
> leave it open whether that's a good or bad thing.)

Eh? Granted we don't truncate the time on settime, but how else is it ignored?
>
>
>>2. The rounding to the resolution value is explicitly required by the
>>standard.
>
>
> It doesn't explicitly specify which resolution (see the previous mail).
> It doesn't explicitly specify how this rounding has to be implemented.

In the timer_settime() call there is only one possible resolution referred to, that of the specified
clock. The standard says(http://www.opengroup.org/onlinepubs/009695399/functions/timer_settime.html):

Time values that are between two consecutive non-negative integer multiples of the resolution of the
specified timer shall be rounded up to the larger multiple of the resolution. Quantization error
shall not cause the timer to expire earlier than the rounded time value.

This says a) round to the next resolution, and b) don't allow the resulting timer to expire early.
The implication is that timers are to expire on resolution boundaries, so we need to round such that
the expiry happens _after_ the rounded time.

Am I missing something here?

The assumption, that I think you made, that we can let the hardware do the rounding runs contrary to
one of the main reasons for resolution, i.e. to group timers so that we can reduce the system
overhead. Just because we have timer hardware with microsecond resolution is not reason enough to
offer it to the user, as handling an interrupt every microsecond is way too much overhead, and, in
most cases, the user doesn't even want such a fine resolution.
>
>
>>3. It makes a lot of sense to do what (1.) describes, due to the fact
>>that we actually want to restrict the timer resolution to avoid
>>interrupt and reprogramming floods in very short intervals. This in fact
>>is the default behaviour in a jiffy driven environment. Pretending a
>>real nsec resolution and doing no rounding at all is violating (2.).
>>From an application programmer's view it makes sense to return the timer
>>resolution so he actually can adjust the program behaviour on the
>>provided resolution and not rely on wild guess assumptions about what
>>might be there. Applications need to be able to verify whether the
>>system can handle the required intervals or not.
>
>
> A portable application simply cannot make this assumption.

POSIX clocks and timers are part of the REAL TIME POSIX extension. Arguing that real time apps need
to be portable is, I think, rather beside the point. At the same time, if rounding follows the
rules, one can set up a timer_settime()/timer_gettime() sequence to get the resolution; even with
the itimer one can do this. So resolution is available to the user in one way or another. What he
does with it is up to him, but at least some RT apps set up a timer to expire early and, after expiry,
busy wait until the "appointed" time. Knowing the resolution helps to know how to set this up...
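
A sketch of the timer_settime()/timer_gettime() probe George mentions, assuming the implementation
rounds the stored values to the timer resolution, which is exactly the behaviour under discussion
(link with -lrt on older glibc):

#define _POSIX_C_SOURCE 200112L
#include <signal.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
        struct sigevent sev = { .sigev_notify = SIGEV_NONE };
        struct itimerspec set, got;
        timer_t t;

        if (timer_create(CLOCK_MONOTONIC, &sev, &t) != 0)
                return 1;

        /* Ask for a 1 ns period; if the kernel rounds the stored values to
         * its timer resolution, reading the timer back reveals it. */
        set.it_value.tv_sec = 0;
        set.it_value.tv_nsec = 1;
        set.it_interval = set.it_value;
        timer_settime(t, 0, &set, NULL);
        timer_gettime(t, &got);

        /* Assumes the resolution is below one second. */
        printf("effective resolution: %ld ns\n", got.it_interval.tv_nsec);

        timer_delete(t);
        return 0;
}
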
~
--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-10-18 08:46:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> On Mon, 17 Oct 2005, Ingo Molnar wrote:
>
> > why you insist on ktimers being 'process timers'?
>
> Because they are optimized for process usage. OTOH kernel usage is
> more than just "timeouts".

you have cut out the rest of what i write in the paragraph, which IMO
answers your question:

> > They are totally separate entities, not limited to any process
> > notion. One of their first practical use happens to be POSIX process
> > timers (both itimers and ptimers) via them, but no way are ktimers
> > only 'process timers'. They are very generic timers, usable for any
> > kernel purpose.

so i can only repeat that ktimers is a generic timer subsystem, with a
focus on _actually delivering a timer event_.

and no, ktimers are not "optimized for process usage" (or tied to
whatever other process notion, as i said before), they are optimized
for:

- the delivery of time related events

as contrasted to the timeout-API (a.k.a. "timer wheel") code in
kernel/timer.c that is optimized towards:

- the fast adding/removal of timers

without too much focus on robust and deterministic delivery of events.

these two concepts are conflicting, and i claim that a (sane) data
structure that maximally fulfills both sets of requirements does not
exist, mathematically. (to repeat, the requirements are: 'fast
add/remove' and 'fast+deterministic expiry')

at this point i'd really suggest for readers to lean back and think
about the mathematical foundations of timer data structures for a bit,
with a focus on the tradeoffs that the timer wheel data structure has,
vs. the tradeoffs of the rbtree data structure that ktimers has.

My claim is that if you _know_ that a timer will most likely expire, you
want to order it at insertion time - i.e. you want to have a tree
structure. If you _know_ that a timer will most likely _not_ expire,
then you can avoid the tree overhead by 'delaying' the decision of
sorting timers to the point in the future where we really are forced to
do so.

The result of this mathematical paradox is that we end up with two data
structures: one is the timer wheel (kernel/timer.c) for
timeout/exception related use; the other one is ktimers
(kernel/ktimers.c), for expiry oriented use.
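to illustrate the 'sort at insertion time' side, here is a minimal sketch using the kernel's rbtree
API (the struct name and fields are illustrative, not the actual ktimer definitions from the
patch):

#include <linux/rbtree.h>
#include <linux/types.h>

struct ktimer_sketch {
	struct rb_node	node;
	s64		expires;	/* absolute expiry time, nanoseconds */
};

/*
 * O(log N) insert, keyed on expiry time.  The leftmost node is always
 * the next timer to expire, so expiry itself stays cheap and
 * deterministic.
 */
static void enqueue_sketch(struct rb_root *root, struct ktimer_sketch *timer)
{
	struct rb_node **link = &root->rb_node, *parent = NULL;

	while (*link) {
		struct ktimer_sketch *entry;

		parent = *link;
		entry = rb_entry(parent, struct ktimer_sketch, node);
		if (timer->expires < entry->expires)
			link = &(*link)->rb_left;
		else
			link = &(*link)->rb_right;
	}
	rb_link_node(&timer->node, parent, link);
	rb_insert_color(&timer->node, root);
}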

> > so to answer your question: it is totally possible for a watchdog
> > mechanism to use ktimers. In fact it would be desirable from a
> > robustness POV too:
>
> "possible" and "desirable" is still different from "preferable", as
> they involve a higher cost.

[ in my answer above you are free to substitute "preferable" with
"desirable" - i do mean it as it reads in plain English. ]

> > e.g. we dont want a watchdog from being
> > overload-able via too many timeouts in the timer wheel ...
>
> Please explain.

e.g. on busy networked servers (i.e. ones that do have a need for
watchdogs) the timer wheel often includes large numbers of timeouts,
99.9% of which never expire. If they do expire en masse for whatever
reason, then we can get into overload mode: a million timers might have
to expire before we get to process the watchdog event and act upon it.
This can delay the watchdog event significantly, which delay might (or
might not) matter to the watchdog application.

in short: the timer wheel was not designed with determinism in mind (nor
should 'simple timeouts' care about determinism). Watchdogs are
preferably (and desirably) implemented via the most deterministic timer
mechanism that the kernel offers: ktimers in this particular case.

Ingo

2005-10-18 23:53:03

by Tim Bird

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Ingo Molnar wrote:
> My claim is that if you _know_ that a timer will expire most likely, you
> want it to order at insertion time - i.e. you want to have a tree
> structure. If you _know_ that a timer will most likely _not_ expire,
> then you can avoid the tree overhead by 'delaying' the decision of
> sorting timers, to the point in the future where we really are forced to
> do so.
>
> The result of this mathematical paradox is that we end up with two data
> structures: one is the timer wheel (kernel/timers.c) for
> timeout/exception related use; the other one is ktimers
> (kernel/ktimers.c), for expiry oriented use.

I'd like to make an observation on another
difference between the wheel and the rbtree. Note that
the wheel implementation inherently coalesces timeouts
that are near each other, due to its relatively
low resolution (at tick granularity - which is
still pretty low resolution on embedded hardware -
usually 10 milliseconds.)

One concern I have with the rbtree is that this
automatic coalescing is lost, and there may be
unanticipated overhead in the move to support
high resolution timers.

Whether some form of coalescing should be
preserved for timers, even when the system
supports higher resolution, will be a
function of the number of timers and their
intended use. I don't see any support for that
in the current patch, but maybe I'm missing
something.

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

2005-10-19 00:04:47

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Tim Bird wrote:
> Ingo Molnar wrote:
>
>>My claim is that if you _know_ that a timer will expire most likely, you
>>want it to order at insertion time - i.e. you want to have a tree
>>structure. If you _know_ that a timer will most likely _not_ expire,
>>then you can avoid the tree overhead by 'delaying' the decision of
>>sorting timers, to the point in the future where we really are forced to
>>do so.
>>
>>The result of this mathematical paradox is that we end up with two data
>>structures: one is the timer wheel (kernel/timers.c) for
>>timeout/exception related use; the other one is ktimers
>>(kernel/ktimers.c), for expiry oriented use.
>
>
> I'd like to make an observation on another
> difference between the wheel and the rbtree. Note that
> the wheel implementation inherently coalesces timeouts
> that are near each other, due to it's relatively
> low resolution (at tick granularity - which is
> still pretty low resolution on embedded hardware -
> usually 10 milliseconds.)
>
> One concern I have with the rbtree is that this
> automatic coalescing is lost, and there may be
> unanticipated overhead in the move to support
> high resolution timers.

I think the coalescing is really done by the resolution rounding. There will always be the list
removal overhead; short of a duplex tree (i.e. one entry per time value, with duplicate times linked
from the first (ugh)) you cannot avoid that. What you want to coalesce is the interrupt overhead, not
the list overhead, the former being MUCH larger. The difference here is that we don't see the
resolution reflected in the tree structure, but that, I think, is good.
>
> Whether some form of coalescing should be
> preserved for timers, even when the system
> supports higher resolution, will be a
> function of the number of timers and their
> intended use. I don't see any support for that
> in the current patch, but maybe I'm missing
> something.
>
> =============================
--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-10-19 01:27:14

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Mon, 17 Oct 2005, George Anzinger wrote:

> > Where I think it's possible to separate the timer from the interval
> > functionality to get a simpler timer base implementation.
>
> They are required fields for the POSIX timer. I think you are saying that
> they should be there and not in the ktime struct, which is part of the POSIX
> timer struct. Is that right?

Basically, yes. I would take some simpler steps in creating the new timer
system. Thomas' patch introduces multiple concepts at once, which are hard
to digest via a simple review. As it looks right now I have to take the
patch apart myself and split it into simpler patches.

> > > 2. The rounding to the resolution value is explicitly required by the
> > > standard.
> >
> >
> > It doesn't explicitly specify which resolution (see the previous mail).
> > It doesn't explicitly specify how this rounding has to be implemented.
>
> In the timer_settime() call there is only one possible resolution refered to,
> that of the specified clock. The standard
> says(http://www.opengroup.org/onlinepubs/009695399/functions/timer_settime.html):
>
> Time values that are between two consecutive non-negative integer multiples of
> the resolution of the specified timer shall be rounded up to the larger
> multiple of the resolution. Quantization error shall not cause the timer to
> expire earlier than the rounded time value.
>
> This says a) round to the next resolution, and b) don't allow the resulting
> timer to expire early. The implication is that timers are to expire on
> resolution boundries so we need to round such that the expire happens _after_
> the rounded time.
>
> Am I missing something here?

In short: rounding errors.

The above says, IOW, that if we have a clock with a frequency f and a resolution of
r=10^9/f, we have to round time t up so that it becomes an integer multiple
i of r, so that once the counter reaches the value i all timers with a
time value of up to i*r are expired.

If we now simply ignore the resolution fraction, we get a rounded value
which is quickly far away from the real value (with a worst case of r-1
nsec). This means an explicit rounding is likely only to make things
worse; any rounding is better done as part of the conversion from/to
timespec to/from the counter value according to the above rules, and even
this conversion should be avoided as much as possible to minimize rounding
errors.

> The assumption, that I think you made, that we can let the hardware do the
> rounding runs contrary to one of the main reasons for resolution, i.e. to
> group timers so that we can reduce the system overhead. Just because we have
> timer hardware with microsecond resolution is not reason enough to offer it to
> the user as handling an interrupt every micro second is way too much overhead,
> and, in most cases, the user doesn't even want to such a fine resolution.

This just means that we have two values to describe a timer: the clock
resolution, which describes the precision with which the timer can be
programmed, and the timer precision, which describes the maximum frequency
of timer expiry. I think both values are of interest to user applications,
so my current preference is to actually export them both properly instead
of overloading the clock_getres() interface.

The spec allows both resolutions:

"an implementation (is required) to document the resolution supported for
timers and nanosleep() if they differ from the supported clock resolution"

This means that unfortunately only one can be determined at runtime via
standard means, so if we are going to create nonportable interfaces, we
should do it at least properly.

bye, Roman

2005-10-19 01:58:33

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Tue, 18 Oct 2005, Ingo Molnar wrote:

> > Because they are optimized for process usage. OTOH kernel usage is
> > more than just "timeouts".
>
> you have cut out the rest of what i write in the paragraph, which IMO
> answers your question:
>
> > > They are totally separate entities, not limited to any process
> > > notion. One of their first practical use happens to be POSIX process
> > > timers (both itimers and ptimers) via them, but no way are ktimers
> > > only 'process timers'. They are very generic timers, usable for any
> > > kernel purpose.
>
> so i can only repeat that ktimers is a generic timer subsystem, with a
> focus on _actually delivering a timer event_.

It doesn't answer it at all. The new timer system is definitely not
"usable for any kernel purpose"; it has certain properties which make it
only applicable under certain conditions.

> and no, ktimers are not "optimized for process usage" (or tied to
> whatever other process notion, as i said before), they are optimized
> for:
>
> - the delivery of time related events
>
> as contrasted to the timeout-API (a'ka "timer wheel") code in
> kernel/timers.c that is optimized towards:
>
> - the fast adding/removal of timers
>
> without too much focus on robust and deterministic delivery of events.

You forgot the main property of high resolution, which implies a higher
maintenance cost.
Whether the timer event is delivered or not is completely unimportant, as
at some point the event has to be removed anyway, so that optimizing a
timer for (non)delivery is complete nonsense.

> these two concepts are conflicting, and i claim that a (sane) data
> structure that maximally fulfills both sets of requirements does not
> exist, mathematically. (to repeat, the requirements are: 'fast
> add/remove' and 'fast+deterministic expiry')

to repeat: low resolution/overhead vs high resolution.
Both are hopefully deterministic (only at different resolutions) or we
have a serious bug at hand.

> > > e.g. we dont want a watchdog from being
> > > overload-able via too many timeouts in the timer wheel ...
> >
> > Please explain.
>
> e.g. on busy networked servers (i.e. ones that do have a need for
> watchdogs) the timer wheel often includes large numbers of timeouts,
> 99.9% of which never expire. If they do expire en masse for whatever
> reason, then we can get into overload mode: a million timers might have
> to expire before we get to process the watchdog event and act upon it.
> This can delay the watchdog event significantly, which delay might (or
> might not) matter to the watchdog application.

I already mentioned earlier that it's possible to reduce the timer load by
using a watchdog timer to filter most of these events, so that you get
into the interesting situation that most kernel timers actually do expire
and suddenly you can easily have more "timers" than "timeouts".

bye, Roman

2005-10-19 02:53:49

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Roman Zippel wrote:
> Hi,
>
> On Mon, 17 Oct 2005, George Anzinger wrote:
>
~
>
>>>>2. The rounding to the resolution value is explicitly required by the
>>>>standard.
>>>
>>>
>>>It doesn't explicitly specify which resolution (see the previous mail).
>>>It doesn't explicitly specify how this rounding has to be implemented.
>>
>>In the timer_settime() call there is only one possible resolution referred to,
>>that of the specified clock. The standard
>>says(http://www.opengroup.org/onlinepubs/009695399/functions/timer_settime.html):
>>
>>Time values that are between two consecutive non-negative integer multiples of
>>the resolution of the specified timer shall be rounded up to the larger
>>multiple of the resolution. Quantization error shall not cause the timer to
>>expire earlier than the rounded time value.
>>
>>This says a) round to the next resolution, and b) don't allow the resulting
>>timer to expire early. The implication is that timers are to expire on
>>resolution boundaries so we need to round such that the expire happens _after_
>>the rounded time.
>>
>>Am I missing something here?
>
>
> In short: rounding errors.
>
> Above says IOW if we have a clock with a frequency f and a resolution with
> r=10^9/f, we have to round time t up so that it becomes a integer multiple
> i of r, so that once the counter reaches the value i all timer with up to a
> time value of i*r are expired.
>
> If we now simply ignore the resolution fraction, we get a rounded value
> which is quickly far away from the real value (with a worst case of r-1
> nsec). This means an explicit rounding is likely only to make things
> worse and any rounding is better done as part of the conversion from/to
> timespec to/from the counter value according to the above rules and even
> this conversion should be avoided as much as possible to minimize rounding
> errors.

I think the rounding errors you are talking about would require us to define the clock period in
something finer than nanoseconds. The usual practice is to work with a resolution specified in
nanoseconds (which is the same units the user hands us). We then only worry about the last
"resolution" or so of the elapsed time, rather than going back to the beginning of time. The math
becomes harder when converting to a particular timer with resolution in the nanosecond area, as, for
example, the TSC. Here we use what I call "scaled math" to both improve resolution and accuracy and
to avoid the evil div instruction. It is rather easy to get accuracy down to a few parts per
billion. I really don't think the math, however, is the issue here.
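To make the idea concrete, a minimal sketch of this kind of "scaled math" (the helper name and the
mult/shift pair are illustrative, not lifted from the HRT patch): convert counter ticks to
nanoseconds with one multiply and one shift, so no div is needed on the fast path.

#include <linux/time.h>		/* NSEC_PER_SEC */
#include <linux/types.h>

/* precomputed once at init time:  mult = (NSEC_PER_SEC << shift) / freq */
static u64 cycles_to_ns(u64 cycles, u32 mult, u32 shift)
{
	return (cycles * mult) >> shift;
}

The larger the shift, the more fractional precision survives, which is how accuracy of a few parts
per billion is reachable without a division per conversion.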

Rather, I think you would like to turn the hardware resolution into the resolution we use and send to
the user. This, I think, is not quite the right way to go. Suppose, for example, we have a timer
that will do microsecond resolution. To provide this to the user implies that he is free to ask
for timers that expire every microsecond. Today, this is not really a wise thing to do as we would
soon use all the cpu cycles doing interrupt overhead. So we define a resolution, say 100
microseconds, and set things up that way. This means we, at most, need to handle timer interrupts
once every 100 usecs (still not really wise, but possible with some of today's hardware).

Now, if the timer we use actually has a resolution of 1.33333 usec, do we want to use a multiple of
this as our resolution? Not really, folks would just get confused. We can just tell them it is
100usec and do the math. The errors introduced by this are, at most, 1.3333 usec, and they are NOT
cumulative, as long as we do the math for each expiry. (If we try to compute a LATCH to use to get
100 usec periods, we will accumulate errors, so why do that?) A jitter of 1.3333 usec is well under
the radar, being lost in the interrupt overhead.
>
>
>>The assumption, that I think you made, that we can let the hardware do the
>>rounding runs contrary to one of the main reasons for resolution, i.e. to
>>group timers so that we can reduce the system overhead. Just because we have
>>timer hardware with microsecond resolution is not reason enough to offer it to
>>the user as handling an interrupt every micro second is way too much overhead,
>>and, in most cases, the user doesn't even want to such a fine resolution.
>
>
> This just means that we have two values to describe a timer, the clock
> resolution describes the precision with which the timer can be programmed
> and the timer precision which describes the maximum frequency of timer
> expiry. I think both values are of interest to user applications, so my
> current preference is to actually export them both properly instead of
> overloading the clock_getres() interface.

But, as I say above, we don't want to export the hardware detail, but an abstraction we build on top
of it. Suppose we don't want to provide 100 usec timers except where really needed. We could
provide a different abstraction that has, say, 10 ms resolution. We could then set things up so that
the user gets this almost all the time, say by defining CLOCK_REALTIME with this resolution. We
then might define CLOCK_REALTIME_HR to have a resolution of 100 usec. The user who needs it will
realize that it has higher overhead (else why would we make it a bit harder to get to), and use it
only when he needs the resolution it provides.

There is no reason that both of these "clocks" can not use the same underlying code and hardware.
At the same time they do not have to.
>
> The spec allows both resolutions:
>
> "an implementation (is required) to document the resolution supported for
> timers and nanosleep() if they differ from the supported clock resolution"

What we want to do, and what is done by others, is to define different clocks which carry their
resolution to the timers used on them. This is a little orthogonal to the standard, but seems to be
a reasonable extension.
>
> This means that unfortunately only one can be determined at runtime via
> standard means, so if we are going to create non portable interfaces, we
> should do it at least properly.
>
> bye, Roman

--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-10-19 06:46:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> > and no, ktimers are not "optimized for process usage" (or tied to
> > whatever other process notion, as i said before), they are optimized
> > for:
> >
> > - the delivery of time related events
> >
> > as contrasted to the timeout-API (a'ka "timer wheel") code in
> > kernel/timers.c that is optimized towards:
> >
> > - the fast adding/removal of timers
> >
> > without too much focus on robust and deterministic delivery of events.
>
> You forgot the main property of high resolution, which implies a
> higher maintainance cost.

what did i forget? I did not mention "high resolution" anywhere. And
what precisely do you mean by "higher maintainance cost"?

Ingo

2005-10-19 10:50:04

by Ingo Molnar

[permalink] [raw]
Subject: kernel/timer.c design (was: Re: ktimers subsystem)

* Roman Zippel <[email protected]> wrote:

> Whether the timer event is delivered or not is completely unimportant,
> as at some point the event has to be removed anyway, so that
> optimizing a timer for (non)delivery is complete nonsense.

completely wrong! To explain this, let me first give you an introduction
to the design goals and implementation/optimization details of the
upstream kernel/timer.c code:

The current design has remained largely unchanged since Finn Arne
Gangstad implemented timer wheels in 1997.

The code implements 'struct timer_list' objects, which can be 'added'
via add_timer() to 'expire' in N jiffies, and can be 'removed' via
del_timer() before expiry. If timers are not removed before expiry then
they will expire, at which point the kernel has to call
timer->fn(timer->data). Time has a granularity of 1/HZ and timeouts are
32 bits.

[ sidenote: there are other details, like timer modification and other
API variants, SMP scalability and other issues - in that sense this
writeup is simplified, but the essence of the algorithms is still the
same. ]

since timers can be added in arbitrary time order (a timer that will
expire sooner can be added after a timer has been added that will expire
later, etc.), the kernel has to have timers sorted when they expire.
Note: there is no requirement to sort timers _before_ expiry!

the initial Linux timer implementation did not (have to) bother about
the 'millions of timers' workloads yet, so it went for the simplest
model: it put all timers into a doubly-linked list, and sorted
timers at insertion time, which made addition O(N). It also had an O(N)
removal function, only expiry was O(1).

[ the name 'struct timer_list' originates from this linked-list model,
and this name has survived 15 years. The reason for the O(N) removal
overhead of the original implementation was that it maintained a 'next
timer will expire in N jiffies' value for every timer on the list,
which the kernel could have used to implement dynamic timer ticks. We
never ended up using that particular aspect of the implementation, and
future timer implementations removed that property altogether. ]

one could implement an add:O(N)/del:O(1)/exp:O(1) algorithm for sorted
linked lists; the original implementation was suboptimal in doing an O(N)
del_timer().

one could also implement an add:O(1)/del:O(1)/exp:O(N) algorithm via an
unsorted linked list. In any case, if there's only a single list then
either insertion or expiry has to carry the O(N) linear sorting
overhead.

another canonical 'computer science' way of dealing with timers is to
put them into a binary tree that sorts by expiry-time: this means that
at add_timer() time we have to insert the timer into the binary tree
(O(log(N)) overhead), removal and expiry is O(1).

the fastest theoretical timer algorithm is to have a linear array of
lists [timer buckets] for every future jiffy (and a running index to
represent the current jiffy): then adding a timer is a simple add_list()
for the array entry indexed by the target timeout. Removing a timer is a
simple list_del(), and expiring the timer is a matter of advancing the
'current time' index by one and expiring all (if any) timers that are in
the next slot. Thus adding, removing and expiring a timer has constant
O(1) overhead, and the worst-case behavior is constant bounded too.

what makes this algorithm impossible in practice is its huge RAM
footprint: tens of gigabytes of RAM to represent all ~2^32 jiffies.
(Some OSs still do this, at the price of restricting either timer
granularity, or the maximum possible timeout)
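for illustration, a scaled-down (and therefore wrapping) toy version of this 'one bucket per
future jiffy' model; the real thing would need one list head per possible 32-bit timeout value,
which is exactly the RAM problem:

#include <linux/list.h>

#define SLOT_BITS	20			/* toy value - the ideal model needs 32 */
#define NR_SLOTS	(1UL << SLOT_BITS)

/* one bucket per future jiffy; every head must be INIT_LIST_HEAD()ed at init */
static struct list_head slot[NR_SLOTS];
static unsigned long current_jiffy;

/* O(1) add: index directly by absolute expiry */
static void o1_add(struct list_head *timer_entry, unsigned long expires)
{
	list_add_tail(timer_entry, &slot[expires & (NR_SLOTS - 1)]);
}

/* O(1) per tick: advance the index and expire whatever sits in that bucket */
static struct list_head *o1_advance(void)
{
	return &slot[++current_jiffy & (NR_SLOTS - 1)];
}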

it can be proven that under our assumptions this 'linear array of time'
approach is the best fully O(1) algorithm [with constant worst-case
behavior as well], so whatever other solution we choose to significantly
reduce the RAM footprint, it won't be fully O(1).

we've seen two practical approaches so far: the 'historical Linux
implementation' which was add:O(N)/del:O(N)/exp:O(1), and the 'timer
tree' solution which is add:O(log(N))/del:O(1)/exp:O(1).

but the current Linux kernel uses a third algorithm: the timer wheels.
This is a variant of the simple 'array of future jiffies' model, but
instead of representing every future jiffy in a bucket, it categorizes
future jiffies into a 'logarithmic array of arrays' where the arrays
represent buckets with larger and larger 'scope/granularity': the
further a jiffy is in the future, the more jiffies belong to the same
single bucket.

In practice it's done by categorizing all future jiffies into 5 groups:

1..256, 257..16384, 16385..1048576, 1048577..67108864, 67108865..4294967295

the first category consists of 256 buckets (each bucket representing a
single jiffy), the second category consists of 64 buckets equally
divided (each bucket represents 256 subsequent jiffies), the third
category consists of 64 buckets too (each bucket representing 256*64 ==
16384 jiffies), the fourth category consists of 64 buckets too (each
bucket representing 256*64*64 == 1048576 jiffies), the fifth category
consists of 64 buckets too (each bucket representing 67108864 jiffies).

the buckets of each category are put into a per-category fixed-size
array, called the "timer vector" - named tv1, tv2, tv3, tv4 and tv5.

as you can see, we only used 256+64+64+64+64 == 512 buckets, but we've
managed to map all 4294967295 future jiffies to these buckets! In other
words: we've split up the 32 bits of 'timeout' value into 8+6+6+6+6
bits.

[ you might ask: why don't we use an even splitup such as 8+8+8+8,
which would simplify the code? The reason is mostly RAM
footprint optimizations: an 8+8+8+8 splitup gives a total of
256+256+256+256 == 1024 buckets, which was considered a bit too high
back when this code was designed. In fact, in recent 2.6 kernels, if
CONFIG_BASE_SMALL is specified then we use a 6+4+4+4+4 splitup and
round down the remaining 10 bits, which gives an embedded-friendly RAM
footprint of 128 buckets. The 'splitup' is under constant revision and
we might switch to the simpler (and slightly faster) 8+8+8+8 model in
the future, for servers. ]

how do we insert timers? In add_timer() we can calculate their "target
category" in constant overhead (with at most 5 comparisons), and put the
timer into that bucket. Note: unless it's in the first category, timers
with different timeout values can end up in the same bucket. E.g. timers
expiring at jiffy 260 and 265 will be both put into the first bucket of
category 2. This means that timers in these buckets are 'partially
sorted': they are only sorted in their highest bits, initially. So
add_timer() is O(1) overhead.
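a condensed sketch of that bucket selection (simplified; the real internal_add_timer() in
kernel/timer.c also handles timers that are already in the past, and the index wrap-around):

#include <linux/list.h>

#define TVR_BITS	8
#define TVN_BITS	6
#define TVR_SIZE	(1 << TVR_BITS)		/* 256 buckets: tv1 */
#define TVN_SIZE	(1 << TVN_BITS)		/*  64 buckets: tv2..tv5 */

static struct list_head tv1[TVR_SIZE], tv2[TVN_SIZE], tv3[TVN_SIZE],
			tv4[TVN_SIZE], tv5[TVN_SIZE];

static struct list_head *pick_bucket(unsigned long expires, unsigned long base)
{
	unsigned long idx = expires - base;	/* jiffies until expiry */

	if (idx < TVR_SIZE)
		return tv1 + (expires & (TVR_SIZE - 1));
	if (idx < 1UL << (TVR_BITS + TVN_BITS))
		return tv2 + ((expires >> TVR_BITS) & (TVN_SIZE - 1));
	if (idx < 1UL << (TVR_BITS + 2 * TVN_BITS))
		return tv3 + ((expires >> (TVR_BITS + TVN_BITS)) & (TVN_SIZE - 1));
	if (idx < 1UL << (TVR_BITS + 3 * TVN_BITS))
		return tv4 + ((expires >> (TVR_BITS + 2 * TVN_BITS)) & (TVN_SIZE - 1));
	return tv5 + ((expires >> (TVR_BITS + 3 * TVN_BITS)) & (TVN_SIZE - 1));
}

the bucket index is just a different bit field of the expiry value for each level, which is why
adding costs a handful of comparisons plus one list_add().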

removal is simple: we remove the timer from the bucket, which is a
list_del(), so O(1) overhead too.

we knew that there's no free lunch, right? The main complication is how
we do expiry. The first 256 jiffies are not a problem, because they are
represented by the first array of buckets, so the expiry code only has
to check whether there are any timers to be expired in that bucket.
Expiry overhead is O(1) for these steps. But at jiffy 257 we do
something special: the expiry code 'cascades' the first bucket of the
second array 'down into' the first 256 buckets. It does it the hard way:
walks the list of timers in that bucket (if any), and removes them from
that list and inserts them into one of the first 256 buckets (depending
on what the timeout value of that timer is). Then the expiry code goes
back to bucket 1, and expires the timers there (if any). The expiry code
keeps a persistent running index for every category, and if that index
overflows back to 1, it increments the next category's index by one and
'cascades down' timers from that bucket into the previous category.
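a sketch of one cascade step, loosely modelled on cascade() in kernel/timer.c and reusing
pick_bucket() from the sketch above (entry/expires are the 2.6 timer_list field names):

#include <linux/timer.h>

static void cascade_bucket(struct list_head *bucket, unsigned long base)
{
	struct timer_list *t, *tmp;

	/* every timer in this coarse bucket now fits into a finer-grained level */
	list_for_each_entry_safe(t, tmp, bucket, entry) {
		list_del(&t->entry);
		list_add_tail(&t->entry, pick_bucket(t->expires, base));
	}
}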

in other words: what happens is that we sort timers piecemeal: first we
order by the highest bits of their timeout value, then we sort by the
lower bits too - in the end they are fully sorted. If all timers expire
and are never removed then we have still won relative to the
fully-sorted-list approach: all timers will end up fully sorted, and the
average per-timer expiry overhead is still O(1)! But the expiry worst
case is not constant-bounded; it is O(N).

One cost is the burstiness of processing: a single step of cascading can
require many timers to be processed (if they happen to be in that same
bucket), and no other timers can be expired while we do that processing.
The worst-case expiry behavior is O(N). (The average cost is still O(1),
because we process every timer at most 5 times.) Another cost is that we
touch (and dirty) the timers again and again during their lifetime,
bringing them into cache multiple times.

But there's a hidden win as well from this approach: if a timer is
removed before it expires, we've saved the remaining cascading steps!
This happens surprisingly often: on a busy networked server, the
majority of the timers never expire, and are removed before they have to
be cascaded even once.

in other words: we 'lazy sort' timers, and we push most of the sorting
overhead as far into the future as possible, in the hope that the problem
of having to sort them goes away, because they get removed before they
expire. (and even if we wanted to, we couldn't sort earlier in this model,
due to the RAM footprint limits)

with all these details in mind, let's go back to Roman's assertion:

> Whether the timer event is delivered or not is completely unimportant,
> as at some point the event has to be removed anyway, so that
> optimizing a timer for (non)delivery is complete nonsense.

it is very much crucial whether a timer event is delivered. Think about
the 'millions of network timers' case: most of them are removed before
being cascaded even once! By removing early we might not have to propagate
and sort the timer in any way: it is added to a bucket and soon removed
from the same bucket.

Ingo

2005-10-19 11:40:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> > so i can only repeat that ktimers is a generic timer subsystem, with a
> > focus on _actually delivering a timer event_.
>
> It doesn't answer it at all. The new timer system is definitively not
> "usable for any kernel purpose", it has certain properties, which
> makes it only applicable under certain conditions.

what "certain properties" and under what "certain conditions"? Please
provide specifics to prove your point. I repeat for the third time:
ktimers is a generic timer subsystem, with a focus on timer event
delivery.

Ingo

2005-10-19 11:58:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5


* Roman Zippel <[email protected]> wrote:

> > > > e.g. we dont want a watchdog from being
> > > > overload-able via too many timeouts in the timer wheel ...
> > >
> > > Please explain.
> >
> > e.g. on busy networked servers (i.e. ones that do have a need for
> > watchdogs) the timer wheel often includes large numbers of timeouts,
> > 99.9% of which never expire. If they do expire en masse for whatever
> > reason, then we can get into overload mode: a million timers might have
> > to expire before we get to process the watchdog event and act upon it.
> > This can delay the watchdog event significantly, which delay might (or
> > might not) matter to the watchdog application.
>
> I already mentioned earlier that it's possible to reduce the timer
> load by using a watchdog timer to filter most of these events, so that
> you get into the interesting situation that most kernel timer actually
> do expire and suddenly you easily can have more "timers" than
> "timeouts".

this sentence does not parse at all, for me. Here's the effort i made
trying to decipher it:

Firstly, you mention 'watchdog' without clarifying whether it's the
example watchdog we were talking about above, or whether it's some
other, new mechanism. The former makes no sense (what does the watchdog
timer in a random driver have to do with the millions of network timers
i was talking about, and how could it be used to filter anything?), the
latter you don't explain.

Secondly, the above sentence is the first time in the ktimer discussion
that you ever mentioned the word 'filter', and you never mentioned the
word 'watchdog' outside of the example we were discussing, so i'm
curious about the source of the above "I already mentioned earlier"
statement. When earlier? Which email? Frankly, the whole paragraph reads
as if from another planet: i see the words, but the content seems totally
out of context and makes no sense to me.

So i cannot even agree or disagree with anything you said in that
sentence, because the sentence simply does not parse. Please enlighten
me!

Ingo

2005-10-19 18:00:22

by Tim Bird

[permalink] [raw]
Subject: Re: kernel/timer.c design

Ingo,

Thanks for the excellent description of the timer wheel
implementation.

Ingo Molnar wrote:
> One cost is the burstiness of processing: a single step of cascading can
> take many timers to be processed (if they happen to be in that same
> bucket)...

> But there's a hidden win as well from this approach: if a timer is
> removed before it expires, we've saved the remaining cascading steps!
> This happens surprisingly often: on a busy networked server, the
> majority of the timers never expire, and are removed before they have to
> be cascaded even once.

Unfortunately, this means that the actual costs of the wheel
implementation vary depending on the relationship between HZ,
the average timeout duration, and the bucket mappings (which,
as you say, can be adjusted for size reasons.) This is one of
the downsides of the wheel implementation. It's very difficult
to tell in advance whether a particular timer load
will cascade or not, making the costs (although bounded)
unexpectedly variable.

One solution (even suggested by Linus) for high resolution
timers was to increase HZ and skip timer ticks. Unfortunately,
this has a dramatic effect on the cost of cascading, and on
the maximum duration available for timers. (By increasing
HZ, you push more timers to higher tiers in the wheel, which
means you potentially end up cascading them more often,
even when they are removed before expiry.) These types
of unexpected consequences are one good reason for avoiding
use of the wheel for high res timers.

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

2005-10-19 19:02:05

by Thomas Gleixner

[permalink] [raw]
Subject: Re: kernel/timer.c design

On Wed, 2005-10-19 at 11:00 -0700, Tim Bird wrote:
> > But there's a hidden win as well from this approach: if a timer is
> > removed before it expires, we've saved the remaining cascading steps!
> > This happens surprisingly often: on a busy networked server, the
> > majority of the timers never expire, and are removed before they have to
> > be cascaded even once.
>
> Unfortunately, this means that the actual costs of the wheel
> implementation vary depending on the relationship between HZ,
> the average timeout duration, and the bucket mappings (which,
> as you say, can be adjusted for size reasons.) This is one of
> the downsides of the wheel implementation. It's very difficult
> to tell in advance whether a particular timer load
> will cascade or not, making the costs (although bounded)
> unexpectedly variable.

That's exactly the problem we described earlier in the ktimer discussion:

Changing HZ from 100 to 1000 while keeping the primary wheel size
unchanged caused increased cascading load.

HZ      CONFIG_BASE_SMALL=n     CONFIG_BASE_SMALL=y
 100    2560 ms                  640 ms
 250    1024 ms                  256 ms
1000     256 ms                   64 ms
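The spans above are simply the number of first-level buckets divided by HZ; as a trivial
illustrative helper:

/* primary (tv1) wheel coverage in ms: 256 buckets normally, 64 with CONFIG_BASE_SMALL=y */
static unsigned int tv1_span_ms(unsigned int buckets, unsigned int hz)
{
	return buckets * 1000 / hz;	/* e.g. 256 * 1000 / 1000 = 256 ms */
}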

A lot of timeouts are in the range of 500ms. While the HZ=100 and HZ=250
settings keep them in the primary wheel either until expiry or early
removal, HZ=1000 and CONFIG_BASE_SMALL with HZ > 100 make cascading more
likely when the system load goes up.

That's hard to balance, for sure.

tglx


2005-10-19 22:13:05

by Roman Zippel

[permalink] [raw]
Subject: Re: kernel/timer.c design (was: Re: ktimers subsystem)

Hi,

On Wed, 19 Oct 2005, Ingo Molnar wrote:

> > Whether the timer event is delivered or not is completely unimportant,
> > as at some point the event has to be removed anyway, so that
> > optimizing a timer for (non)delivery is complete nonsense.
>
> completely wrong! To explain this, let me first give you an introduction
> to the design goals and implementation/optimization details of the
> upstream kernel/timer.c code:

I indeed made a mistake, thanks for pointing it out so elaborately.

I'd like to mention something else here. It's rather bad style to start
with "completely wrong!" and then continue to gloat with "let me give you
an introduction", unless you intentionally want to insult me. Usually I
would just ignore this, as it can happen to anyone, but I can find this
style too often in your mails lately with the most obvious example of your
"shut up or show code" comment. You're more busy trying to prove me wrong
than addressing the actual issue. It never was my intention to discuss the
kernel timer design (the one in timer.c you describe here); the original
issue was and still is that "timer API" is too generic a term, and you
actually proved my point by using the terms timer and timeout value
very consistently in your description.

It's possible I read this wrong, in which case I apologize in advance,
but please rethink the attitude you're showing, otherwise I'll reduce our
conversation to a minimum. You certainly have the more detailed knowledge
in this area, but you don't have to show it off like this.

bye, Roman

2005-10-19 22:25:04

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Wed, 19 Oct 2005, Ingo Molnar wrote:

> Secondly, the above sentence is the first time in the ktimer discussion
> that you ever mentioned the word 'filter', and you never mentioned the
> word 'watchdog' outside of the example we were discussing, so i'm
> curious about the source of the above "I already mentioned earlier"
> statement. When earlier? Which email?

http://marc.theaimsgroup.com/?l=linux-kernel&m=112752984710746&w=2

bye, Roman

2005-10-21 16:23:56

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Tue, 18 Oct 2005, George Anzinger wrote:

> > Above says IOW if we have a clock with a frequency f and a resolution with
> > r=10^9/f, we have to round time t up so that it becomes a integer multiple i
> > of r, so that once the counter reaches the value i all timer with up to a
> > time value of i*r are expired.

You don't specifically disagree, so I can assume you agree that this is a
valid interpretation of the spec?
(I'm asking because it's important for the design of the timer system.)

> > If we now simply ignore the resolution fraction, we get a rounded value
> > which is quickly far away from the real value (with a worst case of r-1
> > nsec). This means an explicit rounding is likely only to make things worse
> > and any rounding is better done as part of the conversion from/to timespec
> > to/from the counter value according to the above rules and even this
> > conversion should be avoided as much as possible to minimize rounding
> > errors.
>
> I think the rounding errors you are talking about would require us to define
> the clock period in something finer than nanoseconds.

No, you don't have to, all you have to do is to make sure that
"Quantization error shall not cause the timer to expire earlier than the
rounded time value." IOW at the time the timer expires, the expiry time
must not be greater than clock_gettime().

> Rather I think you would like to turn the hardware resolution into the
> resolution we use and send to the user. This, I think, is not quite the right
> way to go. Suppose, for example, we have a timer that will do micro second
> resolution. To provide this to the user implies that he is free to ask for
> timers that expire every micro second. Today, this is not really a wise thing
> to do as we would soon use all the cpu cycles doing interrupt overhead. So we
> define a resolution, say 100 micro seconds, and set things up that way. This
> means we, at most, need to handle timer interrupts once every 100 usecs (still
> not really wise, put possible with some of todays hardware).
>
> Now, if the timer we use actually has a resolution of 1.33333 usec, do we want
> to use a multiple of this as our resolution? Not really, folks would just get
> confused. We can just tell them it is 100usec and do the math. The errors
> introduced by this are, at most, 1.3333 usec, and they are NOT cumulative, as
> long as we do the math for each expiry. (If we try to compute a LATCH to use
> to get 100 usec periods, we will accumulate errors, so why do that?) A jitter
> of 1.3333 usec is well under the radar, being lost in the interrupt overhead.

No, the error is worse, although, since I specifically talk about the rounding
done in Thomas' patch, I'm not sure we're really talking about the same
thing. I didn't mean the error caused by jitter; in that case I'd
actually agree with you.

He sets the resolution right now to (NSEC_PER_SEC/HZ) and uses this value
to explicitly round the time values. For example a timer is set to the value
1.1ms and is rounded to 2ms. The timer tick actually happens at 1.2ms
and could expire the timer, but it is instead expired at 2.2ms and user
space sees an error of 1.1ms.
A similar error even exists with interval timers, e.g. an interval timer
is set to 0.9ms and rounded to 1ms. If the clock now expires a little
too early the timer will expire repeatedly one tick too late.

In general, due to this rounding and normal clock skew, an extra error is
added with an average value of half the timer resolution.

> But, as I say above, we don't want to export the hardware detail, but an
> abstraction we build on top of it. Suppose we don't want to provide 100 usec
> timers except where really needed. We could provide a different abstraction
> that has, say 10 ms resolution. We could then set things up so that the user
> gets this all most all the time, say by define CLOCK_REALTIME with this
> resolution. We then might define CLOCK_REALTIME_HR to have a resolution of
> 100 usec. The user who needs it will realize that it has higher overhead
> (else why would we make it a bit harder to get to), and use it only when he
> needs the resolution it provides.
>
> There is no reason that both of these "clocks" can not use the same underlying
> code and hardware. At the same time they do not have to.

I don't have a problem with this at all. I think it's fine to leave
clock_getres(CLOCK_REALTIME) at a safe value.

> > The spec allows both resolutions:
> >
> > "an implementation (is required) to document the resolution supported for
> > timers and nanosleep() if they differ from the supported clock resolution"
>
> What we want to do, and what is done by others, is to define different clocks
> which carry their resolution to the timers used on them. This is a little
> orthogonal to the standard, but seems to be a reasonable extension.

Could you please give me an example of "others"?
I don't think I have a problem with this. My point is to better define
"reasonable extension", or to be more specific about which user expectations
are reasonable. What is needed by applications, and where exactly do we draw
the line when it comes to extra complexity in the timer design?

I wouldn't a priori exclude the possibility of breaking some user
applications which have unreasonable expectations. We did this in the
past (e.g. sched_yield()); we simply fixed the applications and moved on,
but this requires more information about what applications expect from
high resolution timers.

George, many thanks for trying to understand me and helping me to get a
better understanding of the issues. I see more misunderstandings than
disagreements, so I'd be really grateful, if you can help me later to
translate this into something Thomas and Ingo can understand. :-)

bye, Roman

2005-10-23 18:17:51

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Roman Zippel wrote:
> Hi,
>
> On Tue, 18 Oct 2005, George Anzinger wrote:
>
>
>>>Above says IOW if we have a clock with a frequency f and a resolution with
>>>r=10^9/f, we have to round time t up so that it becomes a integer multiple i
>>>of r, so that once the counter reaches the value i all timer with up to a
>>>time value of i*r are expired.
>
>
> You don't specifically disagree, so I can assume you agree that this a
> valid interpretation of the spec?
> (I'm asking because it's important for the design of the timer system.)

I agree with the proviso that we can define such a clock as an abstraction of a clock with a better
resolution. I.e. we can provide clocks with lesser resolution than the physical clock has.
>
>
>>>If we now simply ignore the resolution fraction, we get a rounded value
>>>which is quickly far away from the real value (with a worst case of r-1
>>>nsec). This means an explicit rounding is likely only to make things worse
>>>and any rounding is better done as part of the conversion from/to timespec
>>>to/from the counter value according to the above rules and even this
>>>conversion should be avoided as much as possible to minimize rounding
>>>errors.
>>
>>I think the rounding errors you are talking about would require us to define
>>the clock period in something finer than nanoseconds.
>
>
> No, you don't have to, all you have to do is to make sure that
> "Quantization error shall not cause the timer to expire earlier than the
> rounded time value." IOW at the time the timer expires, the expiry time
> must not be greater than clock_gettime().

That should be "less than", but yes. The comment I was making is that the math is not that hard to
get right.
>
>
>>Rather I think you would like to turn the hardware resolution into the
>>resolution we use and send to the user. This, I think, is not quite the right
>>way to go. Suppose, for example, we have a timer that will do micro second
>>resolution. To provide this to the user implies that he is free to ask for
>>timers that expire every micro second. Today, this is not really a wise thing
>>to do as we would soon use all the cpu cycles doing interrupt overhead. So we
>>define a resolution, say 100 micro seconds, and set things up that way. This
>>means we, at most, need to handle timer interrupts once every 100 usecs (still
>>not really wise, put possible with some of todays hardware).
>>
>>Now, if the timer we use actually has a resolution of 1.33333 usec, do we want
>>to use a multiple of this as our resolution? Not really, folks would just get
>>confused. We can just tell them it is 100usec and do the math. The errors
>>introduced by this are, at most, 1.3333 usec, and they are NOT cumulative, as
>>long as we do the math for each expiry. (If we try to compute a LATCH to use
>>to get 100 usec periods, we will accumulate errors, so why do that?) A jitter
>>of 1.3333 usec is well under the radar, being lost in the interrupt overhead.
>
>
> No, the error is worse, although I specifically talk about the rounding
> done in Thomas' patch, I'm not sure we're really talking about the same
> thing. I didn't mean the error caused by jitter, in this case I'd
> actually agree with you.
>
> He sets the resolution right now to (NSEC_PER_SEC/HZ) and uses this value
> to explicit round the time values. For example a timer is set to the value
> 1.1ms and is rounded to 2ms. The timer tick now actually expires at 1.2ms
> and could expire the timer, but it's instead expired at 2.2ms and user
> space sees an error of 1.1ms.
> A similiar error even exists with interval timer, e.g. an interval timer
> is set to 0.9ms and rounded to 1ms. If the clock now expires a little
> too early the timer will expire repeatedly one tick too late.
>
> In general due to this rounding and normal clock skew an extra error is
> added with an average value of half the timer resolution.


I admit I have not looked, in detail, at this part of ktimers; however, assuming that the clock
ticks at HZ, the normal error to be expected is an average of 1/2 the resolution with a max of
1 resolution. This is AFTER the rounding to the next resolution, so we can expect the expiry to be
anywhere from 0 to 2*resolution-1 (up to resolution-1 from rounding, and up to one resolution
from clock skew). This is the way I and everyone I have worked with understand the standard.

In your example, consider a request for 0.1ms rounded to 1ms....
>
>
>>But, as I say above, we don't want to export the hardware detail, but an
>>abstraction we build on top of it. Suppose we don't want to provide 100 usec
>>timers except where really needed. We could provide a different abstraction
>>that has, say 10 ms resolution. We could then set things up so that the user
>>gets this all most all the time, say by define CLOCK_REALTIME with this
>>resolution. We then might define CLOCK_REALTIME_HR to have a resolution of
>>100 usec. The user who needs it will realize that it has higher overhead
>>(else why would we make it a bit harder to get to), and use it only when he
>>needs the resolution it provides.
>>
>>There is no reason that both of these "clocks" can not use the same underlying
>>code and hardware. At the same time they do not have to.
>
>
> I don't have a problem with this at all. I think it's fine to leave
> clock_getres(CLOCK_REALTIME) at a save value.
>
>
>>>The spec allows both resolutions:
>>>
>>>"an implementation (is required) to document the resolution supported for
>>>timers and nanosleep() if they differ from the supported clock resolution"
>>
>>What we want to do, and what is done by others, is to define different clocks
>>which carry their resolution to the timers used on them. This is a little
>>orthogonal to the standard, but seems to be a reasonable extension.
>
>
> Could you please give me an example for "others"?

Well, I know that HP in the HPRT system (likely long gone by now) did it this way. That was back
prior to 1997. The system was based on the HP PA-RISC arch, which has a system timer based on a
cycle counter (rather like the PPC, but different).

> I don't think I have a problem with this. My point is to better define
> "reasonable extension" or to be more specific what user expectations are
> reasonable. What is needed by applications and where exactly do we draw
> the line, when it comes to extra complexity in the timer design.
>
> I wouldn't a priori exclude the possibility to break some user
> applications, which have unreasonable expectations. We did this in the
> past (e.g. sched_yield()), we simply fixed the applications and moved on,
> but this requires more information about what applications expect about
> high resolution timer.

We have had good acceptance of the HRT patch in our customer base. As far as I know we have not
gotten any feedback on just what resolution they want or use. We allow them to define it (within
reason) at configure time. I have been recommending, for the x86, nothing better than about 100usec,
but this is based on my machine being able to handle an interrupt in about that amount of time.

We don't require alignment of the resolution with the actual hardware resolution as at these levels
the interrupt jitter smooths over any issues in this area. A comment here, in some of your math
examples you seem to be implying that we are going to use a particular resolution from the beginning
of time to compute expiry time. In fact, we start from "now" as defined by a, possibly corrected,
system clock. Once we have the rounded expiry time we use full resolution math to figure out how to
fit that into the timing services. So, in fact, the resolution comes into play only over 1 to 2
resolutions of the requested time. In other words, errors do not accumulate since we always mark to
the corrected clock.
>
> George, many thanks for trying to understand me and helping me to get a
> better understanding of the issues. I see more misunderstandings than
> disagreements, so I'd be really grateful, if you can help me later to
> translate this into something Thomas and Ingo can understand. :-)
>
No problem. Do be advised I will be out most of next week through the end of Oct.
--
George Anzinger [email protected]
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/

2005-10-27 20:24:41

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Sun, 23 Oct 2005, George Anzinger wrote:

> > > > Above says IOW if we have a clock with a frequency f and a resolution with
> > > > r=10^9/f, we have to round time t up so that it becomes a integer multiple i
> > > > of r, so that once the counter reaches the value i all timer with up to a
> > > > time value of i*r are expired.
> >
> >
> > You don't specifically disagree, so I can assume you agree that this a valid
> > interpretation of the spec?
> > (I'm asking because it's important for the design of the timer system.)
>
> I agree with the proviso that we can define such a clock as an abstraction of
> a clock with a better resolution. I.e. we can provide clocks with lesser
> resolution than the physical clock has.

I had a different aspect in mind: at what resolution are we doing the
calculations?
Let's say we have a clock with a frequency of 300Hz; now we could program
the timer like this:

tmp = time * 300;
clock = tmp / NSEC_PER_SEC + (tmp % NSEC_PER_SEC != 0);

This rounds the time up to the next clock count, and as soon as the clock
reaches this count the timer is expired.
OTOH we could also do this at the timer interrupt:

tmp = clock * NSEC_PER_SEC;
time = tmp / 300 + (tmp % 300 != 0);

and use time to expire all timers up to this time. In either case the
behaviour is exactly the same.

The problem is now that we can export the real resolution only as an integer
value. What consequences does this have for the kernel timer implementation?
Something like the above must be done anyway, so what's the point in doing an
extra rounding step?
For example, if we set a timer to expire at 999999990ns, the next
interrupt is at 1000000000ns, but rounding it to a multiple of 3333333ns means
the expiry time changes to 1003333233ns and the timer expires one clock tick
later.
Which application seriously expects this kind of behaviour?

> I admit I have not looked, in detail, at this part of ktimers; however,
> assuming that the clock ticks at HZ, the normal error to be expected is an
> average of 1/2 the resolution with a max of 1 resolution. This is AFTER the
> rounding to the next resolution, so we can expect the expiry to be anywhere
> from 0 to 2*resolution-1 (up to resolution-1 from rounding, and up to one
> resolution from clock skew). This is the way I and everyone I have worked
> with understand the standard.

For relative timers I agree that the error can be twice the resolution:
first the value read from the clock must be rounded up, and then we still
have to wait till the next clock tick.
OTOH for absolute timers we don't need the first step; we just have to wait
until the clock reaches this time. Why should we add an extra error to it,
if we can avoid it? The spec actually says "a timer expiration signal is
requested when the associated clock reaches or exceeds the specified
time." The clock resolution automatically causes the actual expiration time
to be a rounded value of the requested time.
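
A minimal sketch of that distinction, with hypothetical clock_read_ns() and
round_up_to_tick() helpers (illustration only, not the ktimers code):

#include <stdint.h>

/* hypothetical helpers for illustration */
extern uint64_t clock_read_ns(void);		/* current clock value in ns */
extern uint64_t round_up_to_tick(uint64_t ns);	/* next tick boundary >= ns */

/*
 * Relative timer: the expiry is computed from a clock value that itself
 * lags real time by up to one resolution, and the result is rounded up
 * to the next tick, so the total error can approach twice the resolution.
 */
static uint64_t relative_expiry(uint64_t interval_ns)
{
	return round_up_to_tick(clock_read_ns() + interval_ns);
}

/*
 * Absolute timer: no clock read is needed; we simply wait until the clock
 * reaches or exceeds the requested time, so the only error is the rounding
 * up to the next tick.
 */
static uint64_t absolute_expiry(uint64_t abs_ns)
{
	return round_up_to_tick(abs_ns);
}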

Next question would be what happens if timer and clock resolution differs?
For example if the clock has a resolution of 1us and the timer runs every
1ms. For relative timer this would mean we can keep the error within
1.001ms and for absolute timer within 1ms. Do we really have to force an
error larger than really necessary?

Interestingly, Thomas doesn't take the clock resolution into
account at all. Let's say clock and timer resolution are 1ms (or HZ=1000).
If we program a normal kernel timer, we do something like this:

timer->expires = jiffies + 1 + usecs_to_jiffies(timeout);

Thomas now basically does this:

timer->expires = jiffies * res + round(timeout, res);

IOW if the clock resolution is larger than the interrupt delay, the timer
may expire early.
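
An illustrative, self-contained version of the two computations (HZ, the
types and the round_up_ns() helper are made up for the example; this is not
the kernel code):

#include <stdint.h>

#define HZ		1000ULL			/* the example tick rate */
#define NSEC_PER_SEC	1000000000ULL
#define TICK_NSEC	(NSEC_PER_SEC / HZ)	/* 1ms clock/timer resolution */

extern uint64_t jiffies;	/* tick counter, for illustration */

/* round ns up to a multiple of res */
static uint64_t round_up_ns(uint64_t ns, uint64_t res)
{
	return (ns + res - 1) / res * res;
}

/*
 * Classic timer wheel style: the extra "+ 1" tick absorbs the unknown
 * position of "now" within the current tick, so the timer never fires
 * before the full timeout has elapsed.
 */
static uint64_t classic_expires_jiffies(uint64_t timeout_ns)
{
	return jiffies + 1 + round_up_ns(timeout_ns, TICK_NSEC) / TICK_NSEC;
}

/*
 * The computation described above: expiry in ns, rounded to the resolution
 * but without the extra tick, which is why the timer may expire early when
 * the clock resolution is coarser than the interrupt delay.
 */
static uint64_t ktimer_expires_ns(uint64_t timeout_ns, uint64_t res)
{
	return jiffies * res + round_up_ns(timeout_ns, res);
}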

> We have had good acceptance of the HRT patch in our customer base. As far as
> I know we have not gotten any feedback on just what resolution they want or
> use. We allow them to define it (within reason) at configure time. I have
> been recommending, for the x86, nothing better than about 100usec but this is
> based on my machine being able to handle an interrupt in about that amount of
> time.
>
> We don't require alignment of the resolution with the actual hardware
> resolution as at these levels the interrupt jitter smooths over any issues in
> this area.

I expected as much, so users who do care make sure the timer resolution is
good enough. In this case I would also expect that they are interested in
keeping the error as small as possible.

> A comment here, in some of your math examples you seem to be
> implying that we are going to use a particular resolution from the beginning of
> time to compute expiry time. In fact, we start from "now" as defined by a,
> possibly corrected, system clock. Once we have the rounded expiry time we use
> full resolution math to figure how to fit that into the timing services. So,
> in fact, the resolution comes into play only over 1 to 2 resolutions of the
> requested time. In other words, errors do not accumulate since we always mark
> to the corrected clock.

I didn't imply that; I tried to keep the focus on the model as described in
the spec. I think we should keep the focus on the behaviour this model
describes: no user cares how the kernel implements the spec, just that the
visible behaviour matches the spec.
The SUS rationale specifically says "The interfaces also allow flexibility in
the implementation of the functions. For example, ..." (in "Relationship
of Timers to Clocks"), i.e. there is not one true implementation, so I think
it's well worth exploring our options. Reducing the whole design to a single
number (the resolution returned by clock_getres()) would IMO be very
shortsighted. We could very well allow the user to define his own timer based
on various parameters, so he can adjust the timer to his needs.

bye, Roman

2005-10-28 04:52:54

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

On Thu, 2005-10-27 at 22:23 +0200, Roman Zippel wrote:

>
> Next question would be what happens if timer and clock resolution differ?
> For example if the clock has a resolution of 1us and the timer runs every
> 1ms. For relative timer this would mean we can keep the error within
> 1.001ms and for absolute timer within 1ms. Do we really have to force an
> error larger than really necessary?
>
> Interestingly, Thomas doesn't take the clock resolution into
> account at all. Let's say clock and timer resolution are 1ms (or HZ=1000).
> If we program a normal kernel timer, we do something like this:
>
> timer->expires = jiffies + 1 + usecs_to_jiffies(timeout);
>
> Thomas does now basically this:
>
> timer->expires = jiffies * res + round(timeout, res);
>
> IOW if the clock resolution is larger than the interrupt delay, the timer
> may expire early.

Roman, I think I know what you are trying to say here. Although it took
me several readings of what you wrote and then really just looking at
Thomas' code.

It's the old problem with:

1 2 3 4
+----------+----------+----------+---------->>
^ ^
| |
Start End

Say we ask for 2 ms (with both clock and res the same at 1ms). We start the
clock at 1 but it really is 1.7 and we get the interrupt and return at 3
but really 3.2, so instead of receiving a wait of 2ms, we return with
3.2 - 1.7 = 1.5ms

Currently, this is not a problem when the clock is at a higher
frequency (like the tsc). So the base->get_time works now since the
clock is at a higher frequency, but if the get_time returned jiffies,
this would fail. And the clock used is also much faster than the delay
it takes to get back to the calling process (which is much more than a
nanosecond today).

Is that what you were trying to say, Roman?

Interesting though, I tried to force this scenario by changing the
base->get_time to return jiffies. I have a jitter test and ran this
several times, and I could never get it to expire early. I even changed
HZ back to 100.

Then I looked at run_ktimer_queue. And here we have the compare:

timer = list_entry(base->pending.next, struct ktimer, list);
if (ktime_cmp(now, <=, timer->expires))
break;

So the timer does _not_ get processed if its expiry is after or _equal_ to
the current time. So although the timer may go off early, the expired queue
does not get executed, and the above example would not go off at 3.2 but
some time in the 4 range.

So the function will _not_ be executed early, although this could mean
that the timer could actually go off early (in the HRT case), but I
haven't taken a look there. That is to say the interrupt goes off
early, not the function being executed.
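
A tiny stand-alone trace of that check for the 2ms example above, assuming
get_time() only has jiffy resolution (the ktime_cmp() macro is replaced here
by a plain integer compare):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t expires = 3;	/* timer queued for tick 3 */
	uint64_t now = 3;	/* real time is 3.2, but a jiffy-resolution
				 * get_time() still returns 3 */

	/* mirrors the "if (ktime_cmp(now, <=, timer->expires)) break;" test:
	 * while now <= expires the queue run stops, so the callback is not
	 * executed at 3.2 but only on the next tick */
	printf("%s\n", now <= expires ? "not yet expired" : "expired");
	return 0;
}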

-- Steve


2005-10-28 16:07:08

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] ktimers subsystem 2.6.14-rc2-kt5

Hi,

On Fri, 28 Oct 2005, Steven Rostedt wrote:

> Roman, I think I know what you are trying to say here. Although it took
> me several readings of what you wrote and then really just looking at
> Thomas' code.

Thanks for the effort. :-) I know I'm sometimes a bit difficult to
understand, which makes it easier to simply flame me for a silly mistake.

> It's the old problem with:
>
> 1 2 3 4
> +----------+----------+----------+---------->>
> ^ ^
> | |
> Start End
>
> Say we ask for 2 ms (with both clock and res the same at 1ms). We start the
> clock at 1 but it really is 1.7 and we get the interrupt and return at 3
> but really 3.2, so instead of receiving a wait of 2ms, we return with
> 3.2 - 1.7 = 1.5ms
>
> Currently, this is not a problem when the clock is at a higher
> frequency (like the tsc). So the base->get_time works now since the
> clock is at a higher frequency, but if the get_time returned jiffies,
> this would fail. And the clock used is also much faster than the delay
> it takes to get back to the calling process (which is much more than a
> nanosecond today).
>
> Is that what you were trying to say, Roman?

Yes.

> Interesting though, I tried to force this scenario by changing the
> base->get_time to return jiffies. I have a jitter test and ran this
> several times, and I could never get it to expire early. I even changed
> HZ back to 100.
>
> Then I looked at run_ktimer_queue. And here we have the compare:
>
> timer = list_entry(base->pending.next, struct ktimer, list);
> if (ktime_cmp(now, <=, timer->expires))
> break;
>
> So the timer does _not_ get processed if its expiry is after or _equal_ to
> the current time. So although the timer may go off early, the expired queue
> does not get executed, and the above example would not go off at 3.2 but
> some time in the 4 range.
>
> So the function will _not_ be executed early, although this could mean
> that the timer could actually go off early (in the HRT case), but I
> haven't taken a look there. That is to say the interrupt goes off
> early, not the function being executed.

You're correct. I missed that comparison, so if clock resolution and timer
resolution are equal, this indeed works.
It still goes wrong if the resolutions are different. get_time() normally
wouldn't use jiffies but xtime. Thomas uses a fixed resolution, but xtime
updates are not constant. On my machine here with HZ=250 the resolution
would be 4000000ns, but xtime is updated by about 4000150ns every tick.
This means the timer value is rounded up to full 4ms steps, but that is not
enough to get it past the next tick.
In the other, more common case, where the clock resolution is smaller than
the timer resolution, the delay may be larger than really necessary.
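
One way to spell out those HZ=250 numbers (the 4000150ns per-tick step is
the approximate figure quoted above):

nominal resolution:       10^9 / 250     = 4000000 ns
xtime advance per tick:   ~4000150 ns
timer for up to 4ms, armed during a tick where xtime reads T,
timeout rounded up to one resolution:  expires <= T + 4000000 ns
xtime at the very next tick:           T + 4000150 ns  >  expires

so the "now <= expires" check from run_ktimer_queue no longer holds at the
next tick and the timer is expired there, even if it was armed only a moment
earlier, i.e. possibly well before the requested time has elapsed.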

bye, Roman