2011-03-09 14:37:57

by Alexander Shishkin

[permalink] [raw]
Subject: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

Changes since v3:
- changed timerfd_settime() semantics (see below)
Changes since v2:
- replaced sysfs interface with a syscall
- added sysctl/procfs handle to set a limit to the number of users
- fixed issues pointed out by Greg.
Changes since v1:
- updated against 2.6.36-rc1,
- added notification/filtering options,
- added Documentation/ABI/sysfs-kernel-time-notify interface description.

Certain userspace applications (like "clock" desktop applets or cron or
systemd) might want to be notified when some other application changes
the system time. There are several known to me reasons for this:
- avoiding periodic wakeups to poll time changes;
- rearming CLOCK_REALTIME timers when said changes happen;
- changing system timekeeping policy for system-wide time management
programs;
- keeping guest applications/operating systems running in emulators
up to date;
- recalculation of traffic signal cycles in Advanced Traffic Controllers
(ATC), which is part of ATC API requirements [1] as developed by ATC
working group of the U.S. Institute of Transportation Engineers (ITE).

The major change since the previous version is the new semantics of
timerfd_settime() when it's called on a time change notification
descriptor: it will set the system time to utmr.it_value if the time
change counter is zero, otherwise it will return EBUSY, this is required
to prevent a race between setting the time and reading the counter, when
the time controlling procees changes the time immediately after another
process in the system did the same (the counter is greater than one),
that process' time change will be lost. Thus, the time controlling
process should use timerfd_settime() instead of clock_settime() or
settimeofday() to ensure that other processes' time changes don't get
lost.

This is another attempt to approach notifying userspace about system
clock changes. The other one is using an eventfd and a syscall [1]. In
the course of discussing the necessity of a syscall for this kind of
notifications, it was suggested that this functionality can be achieved
via timers [2] (and timerfd in particular [3]). This idea got quite
some support [4], [5], [6] and some vague criticism [7], so I decided
to try and go a bit further with it.

[1] http://www.ite.org/standards/atcapi/version2.asp
[2] http://marc.info/?l=linux-kernel&m=128950389423614&w=2
[3] http://marc.info/?l=linux-kernel&m=128951020831573&w=2
[4] http://marc.info/?l=linux-kernel&m=128951588006157&w=2
[5] http://marc.info/?l=linux-kernel&m=128951503205111&w=2
[6] http://marc.info/?l=linux-kernel&m=128955890118477&w=2
[7] http://marc.info/?l=linux-kernel&m=129002967031104&w=2
[8] http://marc.info/?l=linux-kernel&m=129002672227263&w=2

Signed-off-by: Alexander Shishkin <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Alexander Viro <[email protected]>
CC: Greg Kroah-Hartman <[email protected]>
CC: Feng Tang <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Michael Tokarev <[email protected]>
CC: Marcelo Tosatti <[email protected]>
CC: John Stultz <[email protected]>
CC: Chris Friesen <[email protected]>
CC: Kay Sievers <[email protected]>
CC: Kirill A. Shutemov <[email protected]>
CC: Artem Bityutskiy <[email protected]>
CC: Davide Libenzi <[email protected]>
CC: [email protected]
CC: [email protected]
---
fs/timerfd.c | 94 ++++++++++++++++++++++++++++++++++++++++++-----
include/linux/hrtimer.h | 6 +++
include/linux/timerfd.h | 3 +-
kernel/compat.c | 5 ++-
kernel/hrtimer.c | 4 ++
kernel/time.c | 11 ++++-
6 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index 8c4fc14..6170f61 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -22,6 +22,7 @@
#include <linux/anon_inodes.h>
#include <linux/timerfd.h>
#include <linux/syscalls.h>
+#include <linux/security.h>

struct timerfd_ctx {
struct hrtimer tmr;
@@ -30,8 +31,13 @@ struct timerfd_ctx {
u64 ticks;
int expired;
int clockid;
+ struct list_head notifiers_list;
};

+/* TFD_NOTIFY_CLOCK_SET timers go here */
+static DEFINE_SPINLOCK(notifiers_lock);
+static LIST_HEAD(notifiers_list);
+
/*
* This gets called when the timer event triggers. We set the "expired"
* flag, but we do not re-arm the timer (in case it's necessary,
@@ -51,10 +57,31 @@ static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
return HRTIMER_NORESTART;
}

+void timerfd_clock_was_set(clockid_t clockid)
+{
+ struct timerfd_ctx *ctx;
+ unsigned long flags;
+
+ spin_lock(&notifiers_lock);
+ list_for_each_entry(ctx, &notifiers_list, notifiers_list) {
+ spin_lock_irqsave(&ctx->wqh.lock, flags);
+ if (ctx->tmr.base->index == clockid) {
+ ctx->ticks++;
+ wake_up_locked(&ctx->wqh);
+ }
+ spin_unlock_irqrestore(&ctx->wqh.lock, flags);
+ }
+ spin_unlock(&notifiers_lock);
+}
+
static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx)
{
ktime_t remaining;

+ /* for notification timers, return current time */
+ if (!list_empty(&ctx->notifiers_list))
+ return timespec_to_ktime(current_kernel_time());
+
remaining = hrtimer_expires_remaining(&ctx->tmr);
return remaining.tv64 < 0 ? ktime_set(0, 0): remaining;
}
@@ -72,6 +99,12 @@ static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
ctx->expired = 0;
ctx->ticks = 0;
ctx->tintv = timespec_to_ktime(ktmr->it_interval);
+
+ if (flags & TFD_NOTIFY_CLOCK_SET) {
+ list_add(&ctx->notifiers_list, &notifiers_list);
+ return;
+ }
+
hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
hrtimer_set_expires(&ctx->tmr, texp);
ctx->tmr.function = timerfd_tmrproc;
@@ -83,7 +116,12 @@ static int timerfd_release(struct inode *inode, struct file *file)
{
struct timerfd_ctx *ctx = file->private_data;

- hrtimer_cancel(&ctx->tmr);
+ if (!list_empty(&ctx->notifiers_list)) {
+ spin_lock(&notifiers_lock);
+ list_del(&ctx->notifiers_list);
+ spin_unlock(&notifiers_lock);
+ } else
+ hrtimer_cancel(&ctx->tmr);
kfree(ctx);
return 0;
}
@@ -113,6 +151,7 @@ static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,

if (count < sizeof(ticks))
return -EINVAL;
+
spin_lock_irq(&ctx->wqh.lock);
if (file->f_flags & O_NONBLOCK)
res = -EAGAIN;
@@ -120,7 +159,8 @@ static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
res = wait_event_interruptible_locked_irq(ctx->wqh, ctx->ticks);
if (ctx->ticks) {
ticks = ctx->ticks;
- if (ctx->expired && ctx->tintv.tv64) {
+ if (ctx->expired && ctx->tintv.tv64 &&
+ list_empty(&ctx->notifiers_list)) {
/*
* If tintv.tv64 != 0, this is a periodic timer that
* needs to be re-armed. We avoid doing it in the timer
@@ -184,6 +224,8 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
ctx->clockid = clockid;
hrtimer_init(&ctx->tmr, clockid, HRTIMER_MODE_ABS);

+ INIT_LIST_HEAD(&ctx->notifiers_list);
+
ufd = anon_inode_getfd("[timerfd]", &timerfd_fops, ctx,
O_RDWR | (flags & TFD_SHARED_FCNTL_FLAGS));
if (ufd < 0)
@@ -196,18 +238,24 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
const struct itimerspec __user *, utmr,
struct itimerspec __user *, otmr)
{
+ int ret = 0;
struct file *file;
struct timerfd_ctx *ctx;
struct itimerspec ktmr, kotmr;

- if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
- return -EFAULT;
-
- if ((flags & ~TFD_SETTIME_FLAGS) ||
- !timespec_valid(&ktmr.it_value) ||
- !timespec_valid(&ktmr.it_interval))
+ if (flags & ~TFD_SETTIME_FLAGS)
return -EINVAL;

+ /* utmr may be NULL for notification timerfd */
+ if (!(flags & TFD_NOTIFY_CLOCK_SET) || utmr) {
+ if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
+ return -EFAULT;
+
+ if (!timespec_valid(&ktmr.it_value) ||
+ !timespec_valid(&ktmr.it_interval))
+ return -EINVAL;
+ }
+
file = timerfd_fget(ufd);
if (IS_ERR(file))
return PTR_ERR(file);
@@ -218,10 +266,12 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
* it to the new values.
*/
for (;;) {
+ spin_lock(&notifiers_lock);
spin_lock_irq(&ctx->wqh.lock);
- if (hrtimer_try_to_cancel(&ctx->tmr) >= 0)
+ if (!list_empty(&notifiers_list) || hrtimer_try_to_cancel(&ctx->tmr) >= 0)
break;
spin_unlock_irq(&ctx->wqh.lock);
+ spin_unlock(&notifiers_lock);
cpu_relax();
}

@@ -238,16 +288,39 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
kotmr.it_interval = ktime_to_timespec(ctx->tintv);

/*
+ * for the notification timerfd, set current time to it_value
+ * if the timer hasn't expired; otherwise someone has changed
+ * the system time to the value that we don't know
+ */
+ if (!list_empty(&ctx->notifiers_list) && utmr) {
+ if (ctx->ticks) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ ret = security_settime(&ktmr.it_value, NULL);
+ if (ret)
+ goto out;
+
+ spin_unlock_irq(&ctx->wqh.lock);
+ ret = do_settimeofday(&ktmr.it_value);
+ goto out1;
+ }
+
+ /*
* Re-program the timer to the new value ...
*/
timerfd_setup(ctx, flags, &ktmr);

+out:
spin_unlock_irq(&ctx->wqh.lock);
+out1:
+ spin_unlock(&notifiers_lock);
fput(file);
if (otmr && copy_to_user(otmr, &kotmr, sizeof(kotmr)))
return -EFAULT;

- return 0;
+ return ret;
}

SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct itimerspec __user *, otmr)
@@ -273,6 +346,7 @@ SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct itimerspec __user *, otmr)
spin_unlock_irq(&ctx->wqh.lock);
fput(file);

+out:
return copy_to_user(otmr, &kotmr, sizeof(kotmr)) ? -EFAULT: 0;
}

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 6bc1804..991e8d9 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -248,6 +248,12 @@ static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
return ktime_sub(timer->node.expires, timer->base->get_time());
}

+#ifdef CONFIG_TIMERFD
+extern void timerfd_clock_was_set(clockid_t clockid);
+#else
+static inline void timerfd_clock_was_set(clockid_t clockid) {}
+#endif
+
#ifdef CONFIG_HIGH_RES_TIMERS
struct clock_event_device;

diff --git a/include/linux/timerfd.h b/include/linux/timerfd.h
index 2d07929..c3ddad9 100644
--- a/include/linux/timerfd.h
+++ b/include/linux/timerfd.h
@@ -19,6 +19,7 @@
* shared O_* flags.
*/
#define TFD_TIMER_ABSTIME (1 << 0)
+#define TFD_NOTIFY_CLOCK_SET (1 << 1)
#define TFD_CLOEXEC O_CLOEXEC
#define TFD_NONBLOCK O_NONBLOCK

@@ -26,6 +27,6 @@
/* Flags for timerfd_create. */
#define TFD_CREATE_FLAGS TFD_SHARED_FCNTL_FLAGS
/* Flags for timerfd_settime. */
-#define TFD_SETTIME_FLAGS TFD_TIMER_ABSTIME
+#define TFD_SETTIME_FLAGS (TFD_TIMER_ABSTIME | TFD_NOTIFY_CLOCK_SET)

#endif /* _LINUX_TIMERFD_H */
diff --git a/kernel/compat.c b/kernel/compat.c
index 38b1d2c..b1cf3e1 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -995,7 +995,10 @@ asmlinkage long compat_sys_stime(compat_time_t __user *tptr)
if (err)
return err;

- do_settimeofday(&tv);
+ err = do_settimeofday(&tv);
+ if (!err)
+ timerfd_clock_was_set(CLOCK_REALTIME);
+
return 0;
}

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 2c3d6e5..469eef6 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -663,6 +663,7 @@ void clock_was_set(void)
{
/* Retrigger the CPU local events everywhere */
on_each_cpu(retrigger_next_event, NULL, 1);
+
}

/*
@@ -675,6 +676,9 @@ void hres_timers_resume(void)
KERN_INFO "hres_timers_resume() called with IRQs enabled!");

retrigger_next_event(NULL);
+
+ /* Trigger timerfd notifiers */
+ timerfd_clock_was_set(CLOCK_MONOTONIC);
}

/*
diff --git a/kernel/time.c b/kernel/time.c
index 6430a75..b06f759 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -92,7 +92,10 @@ SYSCALL_DEFINE1(stime, time_t __user *, tptr)
if (err)
return err;

- do_settimeofday(&tv);
+ err = do_settimeofday(&tv);
+ if (!err)
+ timerfd_clock_was_set(CLOCK_REALTIME);
+
return 0;
}

@@ -177,7 +180,11 @@ int do_sys_settimeofday(const struct timespec *tv, const struct timezone *tz)
/* SMP safe, again the code in arch/foo/time.c should
* globally block out interrupts when it runs.
*/
- return do_settimeofday(tv);
+ error = do_settimeofday(tv);
+ if (!error)
+ timerfd_clock_was_set(CLOCK_REALTIME);
+
+ return error;
}
return 0;
}
--
1.7.2.1.45.gb66c2


2011-03-10 00:26:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wed, 9 Mar 2011 16:36:51 +0200
Alexander Shishkin <[email protected]> wrote:

> Changes since v3:
> - changed timerfd_settime() semantics (see below)
> Changes since v2:
> - replaced sysfs interface with a syscall
> - added sysctl/procfs handle to set a limit to the number of users
> - fixed issues pointed out by Greg.
> Changes since v1:
> - updated against 2.6.36-rc1,
> - added notification/filtering options,
> - added Documentation/ABI/sysfs-kernel-time-notify interface description.
>
> Certain userspace applications (like "clock" desktop applets or cron or
> systemd) might want to be notified when some other application changes
> the system time. There are several known to me reasons for this:
> - avoiding periodic wakeups to poll time changes;
> - rearming CLOCK_REALTIME timers when said changes happen;
> - changing system timekeeping policy for system-wide time management
> programs;
> - keeping guest applications/operating systems running in emulators
> up to date;
> - recalculation of traffic signal cycles in Advanced Traffic Controllers
> (ATC), which is part of ATC API requirements [1] as developed by ATC
> working group of the U.S. Institute of Transportation Engineers (ITE).
>
> The major change since the previous version is the new semantics of
> timerfd_settime() when it's called on a time change notification
> descriptor: it will set the system time to utmr.it_value if the time
> change counter is zero, otherwise it will return EBUSY, this is required
> to prevent a race between setting the time and reading the counter, when
> the time controlling procees changes the time immediately after another
> process in the system did the same (the counter is greater than one),
> that process' time change will be lost. Thus, the time controlling
> process should use timerfd_settime() instead of clock_settime() or
> settimeofday() to ensure that other processes' time changes don't get
> lost.
>
> This is another attempt to approach notifying userspace about system
> clock changes. The other one is using an eventfd and a syscall [1]. In
> the course of discussing the necessity of a syscall for this kind of
> notifications, it was suggested that this functionality can be achieved
> via timers [2] (and timerfd in particular [3]). This idea got quite
> some support [4], [5], [6] and some vague criticism [7], so I decided
> to try and go a bit further with it.

Looks sane to me. The changelog is a bit of a pickle - I stole a
useful-looking paragraph from your other please-ignore-this-email
email.

I also cc'ed Michael and linux-api: there isn't much point in adding
kernel features if we don't tell anyone about them.

It would be helpful to know if the identified users of this feature
actually find it useful and adequate. I guess the most common
application is the 1,001 desktop clock widgets. Do you have any
feedback from any of the owners of those?

Thanks.

2011-03-10 00:36:37

by Kay Sievers

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, Mar 10, 2011 at 01:25, Andrew Morton <[email protected]> wrote:
> On Wed,  9 Mar 2011 16:36:51 +0200
> Alexander Shishkin <[email protected]> wrote:
>
>> Changes since v3:
>>  - changed timerfd_settime() semantics (see below)
>> Changes since v2:
>>  - replaced sysfs interface with a syscall
>>  - added sysctl/procfs handle to set a limit to the number of users
>>  - fixed issues pointed out by Greg.
>> Changes since v1:
>>  - updated against 2.6.36-rc1,
>>  - added notification/filtering options,
>>  - added Documentation/ABI/sysfs-kernel-time-notify interface description.

> It would be helpful to know if the identified users of this feature
> actually find it useful and adequate.  I guess the most common
> application is the 1,001 desktop clock widgets.  Do you have any
> feedback from any of the owners of those?

We want it for systemd, to provide cron-like functionality but without
the need to stupidly wake up every minute and check the system time
for possible jumps.

It also sounds useful for a generic resume (after system suspend)
notification for applications, which isn't really possible today.

Kay

2011-03-10 02:01:12

by Scott James Remnant

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wed, Mar 9, 2011 at 4:25 PM, Andrew Morton <[email protected]> wrote:
> On Wed, ?9 Mar 2011 16:36:51 +0200
> Alexander Shishkin <[email protected]> wrote:
>
>> Changes since v3:
>> ?- changed timerfd_settime() semantics (see below)
>> Changes since v2:
>> ?- replaced sysfs interface with a syscall
>> ?- added sysctl/procfs handle to set a limit to the number of users
>> ?- fixed issues pointed out by Greg.
>> Changes since v1:
>> ?- updated against 2.6.36-rc1,
>> ?- added notification/filtering options,
>> ?- added Documentation/ABI/sysfs-kernel-time-notify interface description.
>>
>> Certain userspace applications (like "clock" desktop applets or cron or
>> systemd) might want to be notified when some other application changes
>> the system time. There are several known to me reasons for this:
>> ?- avoiding periodic wakeups to poll time changes;
>> ?- rearming CLOCK_REALTIME timers when said changes happen;
>> ?- changing system timekeeping policy for system-wide time management
>> ? ?programs;
>> ?- keeping guest applications/operating systems running in emulators
>> ? ?up to date;
>> ?- recalculation of traffic signal cycles in Advanced Traffic Controllers
>> ? ?(ATC), which is part of ATC API requirements [1] as developed by ATC
>> ? ?working group of the U.S. Institute of Transportation Engineers (ITE).
>>
>> The major change since the previous version is the new semantics of
>> timerfd_settime() when it's called on a time change notification
>> descriptor: it will set the system time to utmr.it_value if the time
>> change counter is zero, otherwise it will return EBUSY, this is required
>> to prevent a race between setting the time and reading the counter, when
>> the time controlling procees changes the time immediately after another
>> process in the system did the same (the counter is greater than one),
>> that process' time change will be lost. Thus, the time controlling
>> process should use timerfd_settime() instead of clock_settime() or
>> settimeofday() to ensure that other processes' time changes don't get
>> lost.
>>
>> This is another attempt to approach notifying userspace about system
>> clock changes. The other one is using an eventfd and a syscall [1]. In
>> the course of discussing the necessity of a syscall for this kind of
>> notifications, it was suggested that this functionality can be achieved
>> via timers [2] (and timerfd in particular [3]). This idea got quite
>> some support [4], [5], [6] and some vague criticism [7], so I decided
>> to try and go a bit further with it.
>
> Looks sane to me. ?The changelog is a bit of a pickle - I stole a
> useful-looking paragraph from your other please-ignore-this-email
> email.
>
> I also cc'ed Michael and linux-api: there isn't much point in adding
> kernel features if we don't tell anyone about them.
>
> It would be helpful to know if the identified users of this feature
> actually find it useful and adequate. ?I guess the most common
> application is the 1,001 desktop clock widgets. ?Do you have any
> feedback from any of the owners of those?
>
cron is another obvious one (or init systems attempting to replace
cron). Having to wakeup and check the time every minute can be
non-conducive to power savings, it would be better if we could just
sleep until the next alarm and be woken up if the time changes in
between.

(That being said, we also need to poll for and/or check for timezone
changes - but those are entirely userspace, so we can deal with that
separately)

Scott

2011-03-10 08:02:57

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wed, Mar 09, 2011 at 04:36:51PM +0200, Alexander Shishkin wrote:
> Changes since v3:
> - changed timerfd_settime() semantics (see below)
> Changes since v2:
> - replaced sysfs interface with a syscall
> - added sysctl/procfs handle to set a limit to the number of users
> - fixed issues pointed out by Greg.
> Changes since v1:
> - updated against 2.6.36-rc1,
> - added notification/filtering options,
> - added Documentation/ABI/sysfs-kernel-time-notify interface description.
>
> Certain userspace applications (like "clock" desktop applets or cron or
> systemd) might want to be notified when some other application changes
> the system time. There are several known to me reasons for this:
> - avoiding periodic wakeups to poll time changes;
> - rearming CLOCK_REALTIME timers when said changes happen;
> - changing system timekeeping policy for system-wide time management
> programs;
> - keeping guest applications/operating systems running in emulators
> up to date;
> - recalculation of traffic signal cycles in Advanced Traffic Controllers
> (ATC), which is part of ATC API requirements [1] as developed by ATC
> working group of the U.S. Institute of Transportation Engineers (ITE).
>
> The major change since the previous version is the new semantics of
> timerfd_settime() when it's called on a time change notification
> descriptor: it will set the system time to utmr.it_value if the time
> change counter is zero, otherwise it will return EBUSY, this is required
> to prevent a race between setting the time and reading the counter, when
> the time controlling procees changes the time immediately after another
> process in the system did the same (the counter is greater than one),
> that process' time change will be lost. Thus, the time controlling
> process should use timerfd_settime() instead of clock_settime() or
> settimeofday() to ensure that other processes' time changes don't get
> lost.
>
> This is another attempt to approach notifying userspace about system
> clock changes. The other one is using an eventfd and a syscall [1]. In
> the course of discussing the necessity of a syscall for this kind of
> notifications, it was suggested that this functionality can be achieved
> via timers [2] (and timerfd in particular [3]). This idea got quite
> some support [4], [5], [6] and some vague criticism [7], so I decided
> to try and go a bit further with it.
>
> [1] http://www.ite.org/standards/atcapi/version2.asp
> [2] http://marc.info/?l=linux-kernel&m=128950389423614&w=2
> [3] http://marc.info/?l=linux-kernel&m=128951020831573&w=2
> [4] http://marc.info/?l=linux-kernel&m=128951588006157&w=2
> [5] http://marc.info/?l=linux-kernel&m=128951503205111&w=2
> [6] http://marc.info/?l=linux-kernel&m=128955890118477&w=2
> [7] http://marc.info/?l=linux-kernel&m=129002967031104&w=2
> [8] http://marc.info/?l=linux-kernel&m=129002672227263&w=2
>
> Signed-off-by: Alexander Shishkin <[email protected]>
> CC: Thomas Gleixner <[email protected]>
> CC: Alexander Viro <[email protected]>
> CC: Greg Kroah-Hartman <[email protected]>
> CC: Feng Tang <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Michael Tokarev <[email protected]>
> CC: Marcelo Tosatti <[email protected]>
> CC: John Stultz <[email protected]>
> CC: Chris Friesen <[email protected]>
> CC: Kay Sievers <[email protected]>
> CC: Kirill A. Shutemov <[email protected]>
> CC: Artem Bityutskiy <[email protected]>
> CC: Davide Libenzi <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> fs/timerfd.c | 94 ++++++++++++++++++++++++++++++++++++++++++-----
> include/linux/hrtimer.h | 6 +++
> include/linux/timerfd.h | 3 +-
> kernel/compat.c | 5 ++-
> kernel/hrtimer.c | 4 ++
> kernel/time.c | 11 ++++-
> 6 files changed, 109 insertions(+), 14 deletions(-)
>
> diff --git a/fs/timerfd.c b/fs/timerfd.c
> index 8c4fc14..6170f61 100644
> --- a/fs/timerfd.c
> +++ b/fs/timerfd.c
> @@ -22,6 +22,7 @@
> #include <linux/anon_inodes.h>
> #include <linux/timerfd.h>
> #include <linux/syscalls.h>
> +#include <linux/security.h>
>
> struct timerfd_ctx {
> struct hrtimer tmr;
> @@ -30,8 +31,13 @@ struct timerfd_ctx {
> u64 ticks;
> int expired;
> int clockid;
> + struct list_head notifiers_list;
> };
>
> +/* TFD_NOTIFY_CLOCK_SET timers go here */
> +static DEFINE_SPINLOCK(notifiers_lock);
> +static LIST_HEAD(notifiers_list);
> +
> /*
> * This gets called when the timer event triggers. We set the "expired"
> * flag, but we do not re-arm the timer (in case it's necessary,
> @@ -51,10 +57,31 @@ static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
> return HRTIMER_NORESTART;
> }
>
> +void timerfd_clock_was_set(clockid_t clockid)
> +{
> + struct timerfd_ctx *ctx;
> + unsigned long flags;
> +
> + spin_lock(&notifiers_lock);
> + list_for_each_entry(ctx, &notifiers_list, notifiers_list) {
> + spin_lock_irqsave(&ctx->wqh.lock, flags);
> + if (ctx->tmr.base->index == clockid) {
> + ctx->ticks++;
> + wake_up_locked(&ctx->wqh);
> + }
> + spin_unlock_irqrestore(&ctx->wqh.lock, flags);
> + }
> + spin_unlock(&notifiers_lock);
> +}
> +
> static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx)
> {
> ktime_t remaining;
>
> + /* for notification timers, return current time */
> + if (!list_empty(&ctx->notifiers_list))
> + return timespec_to_ktime(current_kernel_time());
> +
> remaining = hrtimer_expires_remaining(&ctx->tmr);
> return remaining.tv64 < 0 ? ktime_set(0, 0): remaining;
> }
> @@ -72,6 +99,12 @@ static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
> ctx->expired = 0;
> ctx->ticks = 0;
> ctx->tintv = timespec_to_ktime(ktmr->it_interval);
> +
> + if (flags & TFD_NOTIFY_CLOCK_SET) {
> + list_add(&ctx->notifiers_list, &notifiers_list);
> + return;
> + }
> +
> hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
> hrtimer_set_expires(&ctx->tmr, texp);
> ctx->tmr.function = timerfd_tmrproc;
> @@ -83,7 +116,12 @@ static int timerfd_release(struct inode *inode, struct file *file)
> {
> struct timerfd_ctx *ctx = file->private_data;
>
> - hrtimer_cancel(&ctx->tmr);
> + if (!list_empty(&ctx->notifiers_list)) {
> + spin_lock(&notifiers_lock);
> + list_del(&ctx->notifiers_list);
> + spin_unlock(&notifiers_lock);
> + } else
> + hrtimer_cancel(&ctx->tmr);
> kfree(ctx);
> return 0;
> }
> @@ -113,6 +151,7 @@ static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
>
> if (count < sizeof(ticks))
> return -EINVAL;
> +
> spin_lock_irq(&ctx->wqh.lock);
> if (file->f_flags & O_NONBLOCK)
> res = -EAGAIN;

Whitespace changes?

> @@ -120,7 +159,8 @@ static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
> res = wait_event_interruptible_locked_irq(ctx->wqh, ctx->ticks);
> if (ctx->ticks) {
> ticks = ctx->ticks;
> - if (ctx->expired && ctx->tintv.tv64) {
> + if (ctx->expired && ctx->tintv.tv64 &&
> + list_empty(&ctx->notifiers_list)) {
> /*
> * If tintv.tv64 != 0, this is a periodic timer that
> * needs to be re-armed. We avoid doing it in the timer
> @@ -184,6 +224,8 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
> ctx->clockid = clockid;
> hrtimer_init(&ctx->tmr, clockid, HRTIMER_MODE_ABS);
>
> + INIT_LIST_HEAD(&ctx->notifiers_list);
> +
> ufd = anon_inode_getfd("[timerfd]", &timerfd_fops, ctx,
> O_RDWR | (flags & TFD_SHARED_FCNTL_FLAGS));
> if (ufd < 0)
> @@ -196,18 +238,24 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
> const struct itimerspec __user *, utmr,
> struct itimerspec __user *, otmr)
> {
> + int ret = 0;
> struct file *file;
> struct timerfd_ctx *ctx;
> struct itimerspec ktmr, kotmr;
>
> - if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
> - return -EFAULT;
> -
> - if ((flags & ~TFD_SETTIME_FLAGS) ||
> - !timespec_valid(&ktmr.it_value) ||
> - !timespec_valid(&ktmr.it_interval))
> + if (flags & ~TFD_SETTIME_FLAGS)
> return -EINVAL;
>
> + /* utmr may be NULL for notification timerfd */
> + if (!(flags & TFD_NOTIFY_CLOCK_SET) || utmr) {
> + if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
> + return -EFAULT;
> +
> + if (!timespec_valid(&ktmr.it_value) ||
> + !timespec_valid(&ktmr.it_interval))
> + return -EINVAL;
> + }
> +
> file = timerfd_fget(ufd);
> if (IS_ERR(file))
> return PTR_ERR(file);
> @@ -218,10 +266,12 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
> * it to the new values.
> */
> for (;;) {
> + spin_lock(&notifiers_lock);
> spin_lock_irq(&ctx->wqh.lock);
> - if (hrtimer_try_to_cancel(&ctx->tmr) >= 0)
> + if (!list_empty(&notifiers_list) || hrtimer_try_to_cancel(&ctx->tmr) >= 0)
> break;
> spin_unlock_irq(&ctx->wqh.lock);
> + spin_unlock(&notifiers_lock);
> cpu_relax();
> }
>
> @@ -238,16 +288,39 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
> kotmr.it_interval = ktime_to_timespec(ctx->tintv);
>
> /*
> + * for the notification timerfd, set current time to it_value
> + * if the timer hasn't expired; otherwise someone has changed
> + * the system time to the value that we don't know
> + */
> + if (!list_empty(&ctx->notifiers_list) && utmr) {
> + if (ctx->ticks) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> + ret = security_settime(&ktmr.it_value, NULL);
> + if (ret)
> + goto out;
> +
> + spin_unlock_irq(&ctx->wqh.lock);
> + ret = do_settimeofday(&ktmr.it_value);
> + goto out1;
> + }
> +
> + /*
> * Re-program the timer to the new value ...
> */
> timerfd_setup(ctx, flags, &ktmr);
>
> +out:
> spin_unlock_irq(&ctx->wqh.lock);
> +out1:
> + spin_unlock(&notifiers_lock);
> fput(file);
> if (otmr && copy_to_user(otmr, &kotmr, sizeof(kotmr)))
> return -EFAULT;

What's reason to do copy_to_user() in case of error?
Is it safe for error from security_settime()?

> - return 0;
> + return ret;
> }
>
> SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct itimerspec __user *, otmr)
> @@ -273,6 +346,7 @@ SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct itimerspec __user *, otmr)
> spin_unlock_irq(&ctx->wqh.lock);
> fput(file);
>
> +out:
> return copy_to_user(otmr, &kotmr, sizeof(kotmr)) ? -EFAULT: 0;
> }

This lable is not needed.

> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> index 6bc1804..991e8d9 100644
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -248,6 +248,12 @@ static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
> return ktime_sub(timer->node.expires, timer->base->get_time());
> }
>
> +#ifdef CONFIG_TIMERFD
> +extern void timerfd_clock_was_set(clockid_t clockid);
> +#else
> +static inline void timerfd_clock_was_set(clockid_t clockid) {}
> +#endif
> +
> #ifdef CONFIG_HIGH_RES_TIMERS
> struct clock_event_device;
>
> diff --git a/include/linux/timerfd.h b/include/linux/timerfd.h
> index 2d07929..c3ddad9 100644
> --- a/include/linux/timerfd.h
> +++ b/include/linux/timerfd.h
> @@ -19,6 +19,7 @@
> * shared O_* flags.
> */
> #define TFD_TIMER_ABSTIME (1 << 0)
> +#define TFD_NOTIFY_CLOCK_SET (1 << 1)
> #define TFD_CLOEXEC O_CLOEXEC
> #define TFD_NONBLOCK O_NONBLOCK
>
> @@ -26,6 +27,6 @@
> /* Flags for timerfd_create. */
> #define TFD_CREATE_FLAGS TFD_SHARED_FCNTL_FLAGS
> /* Flags for timerfd_settime. */
> -#define TFD_SETTIME_FLAGS TFD_TIMER_ABSTIME
> +#define TFD_SETTIME_FLAGS (TFD_TIMER_ABSTIME | TFD_NOTIFY_CLOCK_SET)
>
> #endif /* _LINUX_TIMERFD_H */
> diff --git a/kernel/compat.c b/kernel/compat.c
> index 38b1d2c..b1cf3e1 100644
> --- a/kernel/compat.c
> +++ b/kernel/compat.c
> @@ -995,7 +995,10 @@ asmlinkage long compat_sys_stime(compat_time_t __user *tptr)
> if (err)
> return err;
>
> - do_settimeofday(&tv);
> + err = do_settimeofday(&tv);
> + if (!err)
> + timerfd_clock_was_set(CLOCK_REALTIME);
> +
> return 0;
> }
>
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index 2c3d6e5..469eef6 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -663,6 +663,7 @@ void clock_was_set(void)
> {
> /* Retrigger the CPU local events everywhere */
> on_each_cpu(retrigger_next_event, NULL, 1);
> +
> }
>
> /*

Whitespace changes?

> @@ -675,6 +676,9 @@ void hres_timers_resume(void)
> KERN_INFO "hres_timers_resume() called with IRQs enabled!");
>
> retrigger_next_event(NULL);
> +
> + /* Trigger timerfd notifiers */
> + timerfd_clock_was_set(CLOCK_MONOTONIC);
> }
>
> /*
> diff --git a/kernel/time.c b/kernel/time.c
> index 6430a75..b06f759 100644
> --- a/kernel/time.c
> +++ b/kernel/time.c
> @@ -92,7 +92,10 @@ SYSCALL_DEFINE1(stime, time_t __user *, tptr)
> if (err)
> return err;
>
> - do_settimeofday(&tv);
> + err = do_settimeofday(&tv);
> + if (!err)
> + timerfd_clock_was_set(CLOCK_REALTIME);
> +
> return 0;
> }
>
> @@ -177,7 +180,11 @@ int do_sys_settimeofday(const struct timespec *tv, const struct timezone *tz)
> /* SMP safe, again the code in arch/foo/time.c should
> * globally block out interrupts when it runs.
> */
> - return do_settimeofday(tv);
> + error = do_settimeofday(tv);
> + if (!error)
> + timerfd_clock_was_set(CLOCK_REALTIME);
> +
> + return error;
> }
> return 0;
> }
> --
> 1.7.2.1.45.gb66c2
>

--
Kirill A. Shutemov

2011-03-10 08:10:48

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wed, Mar 09, 2011 at 04:25:13PM -0800, Andrew Morton wrote:
> On Wed, 9 Mar 2011 16:36:51 +0200
> Alexander Shishkin <[email protected]> wrote:
>
> > Changes since v3:
> > - changed timerfd_settime() semantics (see below)
> > Changes since v2:
> > - replaced sysfs interface with a syscall
> > - added sysctl/procfs handle to set a limit to the number of users
> > - fixed issues pointed out by Greg.
> > Changes since v1:
> > - updated against 2.6.36-rc1,
> > - added notification/filtering options,
> > - added Documentation/ABI/sysfs-kernel-time-notify interface description.
> >
> > Certain userspace applications (like "clock" desktop applets or cron or
> > systemd) might want to be notified when some other application changes
> > the system time. There are several known to me reasons for this:
> > - avoiding periodic wakeups to poll time changes;
> > - rearming CLOCK_REALTIME timers when said changes happen;
> > - changing system timekeeping policy for system-wide time management
> > programs;
> > - keeping guest applications/operating systems running in emulators
> > up to date;
> > - recalculation of traffic signal cycles in Advanced Traffic Controllers
> > (ATC), which is part of ATC API requirements [1] as developed by ATC
> > working group of the U.S. Institute of Transportation Engineers (ITE).
> >
> > The major change since the previous version is the new semantics of
> > timerfd_settime() when it's called on a time change notification
> > descriptor: it will set the system time to utmr.it_value if the time
> > change counter is zero, otherwise it will return EBUSY, this is required
> > to prevent a race between setting the time and reading the counter, when
> > the time controlling procees changes the time immediately after another
> > process in the system did the same (the counter is greater than one),
> > that process' time change will be lost. Thus, the time controlling
> > process should use timerfd_settime() instead of clock_settime() or
> > settimeofday() to ensure that other processes' time changes don't get
> > lost.
> >
> > This is another attempt to approach notifying userspace about system
> > clock changes. The other one is using an eventfd and a syscall [1]. In
> > the course of discussing the necessity of a syscall for this kind of
> > notifications, it was suggested that this functionality can be achieved
> > via timers [2] (and timerfd in particular [3]). This idea got quite
> > some support [4], [5], [6] and some vague criticism [7], so I decided
> > to try and go a bit further with it.
>
> Looks sane to me. The changelog is a bit of a pickle - I stole a
> useful-looking paragraph from your other please-ignore-this-email
> email.

I was trying to word it so that I wouldn't have to explain the same things
over and over again after each patch version. :) But your version looks
better.

> I also cc'ed Michael and linux-api: there isn't much point in adding
> kernel features if we don't tell anyone about them.

Yep, they somehow dropped out of my CC list in one of the previous
versions. I can provide a patch for man-pages, of course.

> It would be helpful to know if the identified users of this feature
> actually find it useful and adequate. I guess the most common
> application is the 1,001 desktop clock widgets. Do you have any
> feedback from any of the owners of those?

Shaun Reich (KDE) and Ken MacLeod (ITE) have contacted me recently asking
about my progress with this feature. They, as well as other potential
users for this are in CC. Their feedback was how I came up with the list
of usecases above.

Thanks,
--
Alex

2011-03-10 08:15:59

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, Mar 10, 2011 at 10:02:51AM +0200, Kirill A. Shutemov wrote:
> On Wed, Mar 09, 2011 at 04:36:51PM +0200, Alexander Shishkin wrote:
> > Changes since v3:
> > - changed timerfd_settime() semantics (see below)
> > Changes since v2:
> > - replaced sysfs interface with a syscall
> > - added sysctl/procfs handle to set a limit to the number of users
> > - fixed issues pointed out by Greg.
> > Changes since v1:
> > - updated against 2.6.36-rc1,
> > - added notification/filtering options,
> > - added Documentation/ABI/sysfs-kernel-time-notify interface description.
> >
> > Certain userspace applications (like "clock" desktop applets or cron or
> > systemd) might want to be notified when some other application changes
> > the system time. There are several known to me reasons for this:
> > - avoiding periodic wakeups to poll time changes;
> > - rearming CLOCK_REALTIME timers when said changes happen;
> > - changing system timekeeping policy for system-wide time management
> > programs;
> > - keeping guest applications/operating systems running in emulators
> > up to date;
> > - recalculation of traffic signal cycles in Advanced Traffic Controllers
> > (ATC), which is part of ATC API requirements [1] as developed by ATC
> > working group of the U.S. Institute of Transportation Engineers (ITE).
> >
> > The major change since the previous version is the new semantics of
> > timerfd_settime() when it's called on a time change notification
> > descriptor: it will set the system time to utmr.it_value if the time
> > change counter is zero, otherwise it will return EBUSY, this is required
> > to prevent a race between setting the time and reading the counter, when
> > the time controlling procees changes the time immediately after another
> > process in the system did the same (the counter is greater than one),
> > that process' time change will be lost. Thus, the time controlling
> > process should use timerfd_settime() instead of clock_settime() or
> > settimeofday() to ensure that other processes' time changes don't get
> > lost.
> >
> > This is another attempt to approach notifying userspace about system
> > clock changes. The other one is using an eventfd and a syscall [1]. In
> > the course of discussing the necessity of a syscall for this kind of
> > notifications, it was suggested that this functionality can be achieved
> > via timers [2] (and timerfd in particular [3]). This idea got quite
> > some support [4], [5], [6] and some vague criticism [7], so I decided
> > to try and go a bit further with it.
> >
> > [1] http://www.ite.org/standards/atcapi/version2.asp
> > [2] http://marc.info/?l=linux-kernel&m=128950389423614&w=2
> > [3] http://marc.info/?l=linux-kernel&m=128951020831573&w=2
> > [4] http://marc.info/?l=linux-kernel&m=128951588006157&w=2
> > [5] http://marc.info/?l=linux-kernel&m=128951503205111&w=2
> > [6] http://marc.info/?l=linux-kernel&m=128955890118477&w=2
> > [7] http://marc.info/?l=linux-kernel&m=129002967031104&w=2
> > [8] http://marc.info/?l=linux-kernel&m=129002672227263&w=2
> >
> > Signed-off-by: Alexander Shishkin <[email protected]>
> > CC: Thomas Gleixner <[email protected]>
> > CC: Alexander Viro <[email protected]>
> > CC: Greg Kroah-Hartman <[email protected]>
> > CC: Feng Tang <[email protected]>
> > CC: Andrew Morton <[email protected]>
> > CC: Michael Tokarev <[email protected]>
> > CC: Marcelo Tosatti <[email protected]>
> > CC: John Stultz <[email protected]>
> > CC: Chris Friesen <[email protected]>
> > CC: Kay Sievers <[email protected]>
> > CC: Kirill A. Shutemov <[email protected]>
> > CC: Artem Bityutskiy <[email protected]>
> > CC: Davide Libenzi <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > fs/timerfd.c | 94 ++++++++++++++++++++++++++++++++++++++++++-----
> > include/linux/hrtimer.h | 6 +++
> > include/linux/timerfd.h | 3 +-
> > kernel/compat.c | 5 ++-
> > kernel/hrtimer.c | 4 ++
> > kernel/time.c | 11 ++++-
> > 6 files changed, 109 insertions(+), 14 deletions(-)
> >
> > diff --git a/fs/timerfd.c b/fs/timerfd.c
> > index 8c4fc14..6170f61 100644
> > --- a/fs/timerfd.c
> > +++ b/fs/timerfd.c
> > @@ -22,6 +22,7 @@
> > #include <linux/anon_inodes.h>
> > #include <linux/timerfd.h>
> > #include <linux/syscalls.h>
> > +#include <linux/security.h>
> >
> > struct timerfd_ctx {
> > struct hrtimer tmr;
> > @@ -30,8 +31,13 @@ struct timerfd_ctx {
> > u64 ticks;
> > int expired;
> > int clockid;
> > + struct list_head notifiers_list;
> > };
> >
> > +/* TFD_NOTIFY_CLOCK_SET timers go here */
> > +static DEFINE_SPINLOCK(notifiers_lock);
> > +static LIST_HEAD(notifiers_list);
> > +
> > /*
> > * This gets called when the timer event triggers. We set the "expired"
> > * flag, but we do not re-arm the timer (in case it's necessary,
> > @@ -51,10 +57,31 @@ static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
> > return HRTIMER_NORESTART;
> > }
> >
> > +void timerfd_clock_was_set(clockid_t clockid)
> > +{
> > + struct timerfd_ctx *ctx;
> > + unsigned long flags;
> > +
> > + spin_lock(&notifiers_lock);
> > + list_for_each_entry(ctx, &notifiers_list, notifiers_list) {
> > + spin_lock_irqsave(&ctx->wqh.lock, flags);
> > + if (ctx->tmr.base->index == clockid) {
> > + ctx->ticks++;
> > + wake_up_locked(&ctx->wqh);
> > + }
> > + spin_unlock_irqrestore(&ctx->wqh.lock, flags);
> > + }
> > + spin_unlock(&notifiers_lock);
> > +}
> > +
> > static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx)
> > {
> > ktime_t remaining;
> >
> > + /* for notification timers, return current time */
> > + if (!list_empty(&ctx->notifiers_list))
> > + return timespec_to_ktime(current_kernel_time());
> > +
> > remaining = hrtimer_expires_remaining(&ctx->tmr);
> > return remaining.tv64 < 0 ? ktime_set(0, 0): remaining;
> > }
> > @@ -72,6 +99,12 @@ static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
> > ctx->expired = 0;
> > ctx->ticks = 0;
> > ctx->tintv = timespec_to_ktime(ktmr->it_interval);
> > +
> > + if (flags & TFD_NOTIFY_CLOCK_SET) {
> > + list_add(&ctx->notifiers_list, &notifiers_list);
> > + return;
> > + }
> > +
> > hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
> > hrtimer_set_expires(&ctx->tmr, texp);
> > ctx->tmr.function = timerfd_tmrproc;
> > @@ -83,7 +116,12 @@ static int timerfd_release(struct inode *inode, struct file *file)
> > {
> > struct timerfd_ctx *ctx = file->private_data;
> >
> > - hrtimer_cancel(&ctx->tmr);
> > + if (!list_empty(&ctx->notifiers_list)) {
> > + spin_lock(&notifiers_lock);
> > + list_del(&ctx->notifiers_list);
> > + spin_unlock(&notifiers_lock);
> > + } else
> > + hrtimer_cancel(&ctx->tmr);
> > kfree(ctx);
> > return 0;
> > }
> > @@ -113,6 +151,7 @@ static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
> >
> > if (count < sizeof(ticks))
> > return -EINVAL;
> > +
> > spin_lock_irq(&ctx->wqh.lock);
> > if (file->f_flags & O_NONBLOCK)
> > res = -EAGAIN;
>
> Whitespace changes?
>
> > @@ -120,7 +159,8 @@ static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
> > res = wait_event_interruptible_locked_irq(ctx->wqh, ctx->ticks);
> > if (ctx->ticks) {
> > ticks = ctx->ticks;
> > - if (ctx->expired && ctx->tintv.tv64) {
> > + if (ctx->expired && ctx->tintv.tv64 &&
> > + list_empty(&ctx->notifiers_list)) {
> > /*
> > * If tintv.tv64 != 0, this is a periodic timer that
> > * needs to be re-armed. We avoid doing it in the timer
> > @@ -184,6 +224,8 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
> > ctx->clockid = clockid;
> > hrtimer_init(&ctx->tmr, clockid, HRTIMER_MODE_ABS);
> >
> > + INIT_LIST_HEAD(&ctx->notifiers_list);
> > +
> > ufd = anon_inode_getfd("[timerfd]", &timerfd_fops, ctx,
> > O_RDWR | (flags & TFD_SHARED_FCNTL_FLAGS));
> > if (ufd < 0)
> > @@ -196,18 +238,24 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
> > const struct itimerspec __user *, utmr,
> > struct itimerspec __user *, otmr)
> > {
> > + int ret = 0;
> > struct file *file;
> > struct timerfd_ctx *ctx;
> > struct itimerspec ktmr, kotmr;
> >
> > - if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
> > - return -EFAULT;
> > -
> > - if ((flags & ~TFD_SETTIME_FLAGS) ||
> > - !timespec_valid(&ktmr.it_value) ||
> > - !timespec_valid(&ktmr.it_interval))
> > + if (flags & ~TFD_SETTIME_FLAGS)
> > return -EINVAL;
> >
> > + /* utmr may be NULL for notification timerfd */
> > + if (!(flags & TFD_NOTIFY_CLOCK_SET) || utmr) {
> > + if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
> > + return -EFAULT;
> > +
> > + if (!timespec_valid(&ktmr.it_value) ||
> > + !timespec_valid(&ktmr.it_interval))
> > + return -EINVAL;
> > + }
> > +
> > file = timerfd_fget(ufd);
> > if (IS_ERR(file))
> > return PTR_ERR(file);
> > @@ -218,10 +266,12 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
> > * it to the new values.
> > */
> > for (;;) {
> > + spin_lock(&notifiers_lock);
> > spin_lock_irq(&ctx->wqh.lock);
> > - if (hrtimer_try_to_cancel(&ctx->tmr) >= 0)
> > + if (!list_empty(&notifiers_list) || hrtimer_try_to_cancel(&ctx->tmr) >= 0)
> > break;
> > spin_unlock_irq(&ctx->wqh.lock);
> > + spin_unlock(&notifiers_lock);
> > cpu_relax();
> > }
> >
> > @@ -238,16 +288,39 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
> > kotmr.it_interval = ktime_to_timespec(ctx->tintv);
> >
> > /*
> > + * for the notification timerfd, set current time to it_value
> > + * if the timer hasn't expired; otherwise someone has changed
> > + * the system time to the value that we don't know
> > + */
> > + if (!list_empty(&ctx->notifiers_list) && utmr) {
> > + if (ctx->ticks) {
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > +
> > + ret = security_settime(&ktmr.it_value, NULL);
> > + if (ret)
> > + goto out;
> > +
> > + spin_unlock_irq(&ctx->wqh.lock);
> > + ret = do_settimeofday(&ktmr.it_value);
> > + goto out1;
> > + }
> > +
> > + /*
> > * Re-program the timer to the new value ...
> > */
> > timerfd_setup(ctx, flags, &ktmr);
> >
> > +out:
> > spin_unlock_irq(&ctx->wqh.lock);
> > +out1:
> > + spin_unlock(&notifiers_lock);
> > fput(file);
> > if (otmr && copy_to_user(otmr, &kotmr, sizeof(kotmr)))
> > return -EFAULT;
>
> What's reason to do copy_to_user() in case of error?

That's a separate issue. security_settime() checks for the capability for
_setting_ the time. This bit is returning the previously effective time
back to user if he/she asked for it.

> Is it safe for error from security_settime()?

The user can still obtain system time by other means.

>
> > - return 0;
> > + return ret;
> > }
> >
> > SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct itimerspec __user *, otmr)
> > @@ -273,6 +346,7 @@ SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct itimerspec __user *, otmr)
> > spin_unlock_irq(&ctx->wqh.lock);
> > fput(file);
> >
> > +out:
> > return copy_to_user(otmr, &kotmr, sizeof(kotmr)) ? -EFAULT: 0;
> > }
>
> This lable is not needed.
>
> > diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> > index 6bc1804..991e8d9 100644
> > --- a/include/linux/hrtimer.h
> > +++ b/include/linux/hrtimer.h
> > @@ -248,6 +248,12 @@ static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
> > return ktime_sub(timer->node.expires, timer->base->get_time());
> > }
> >
> > +#ifdef CONFIG_TIMERFD
> > +extern void timerfd_clock_was_set(clockid_t clockid);
> > +#else
> > +static inline void timerfd_clock_was_set(clockid_t clockid) {}
> > +#endif
> > +
> > #ifdef CONFIG_HIGH_RES_TIMERS
> > struct clock_event_device;
> >
> > diff --git a/include/linux/timerfd.h b/include/linux/timerfd.h
> > index 2d07929..c3ddad9 100644
> > --- a/include/linux/timerfd.h
> > +++ b/include/linux/timerfd.h
> > @@ -19,6 +19,7 @@
> > * shared O_* flags.
> > */
> > #define TFD_TIMER_ABSTIME (1 << 0)
> > +#define TFD_NOTIFY_CLOCK_SET (1 << 1)
> > #define TFD_CLOEXEC O_CLOEXEC
> > #define TFD_NONBLOCK O_NONBLOCK
> >
> > @@ -26,6 +27,6 @@
> > /* Flags for timerfd_create. */
> > #define TFD_CREATE_FLAGS TFD_SHARED_FCNTL_FLAGS
> > /* Flags for timerfd_settime. */
> > -#define TFD_SETTIME_FLAGS TFD_TIMER_ABSTIME
> > +#define TFD_SETTIME_FLAGS (TFD_TIMER_ABSTIME | TFD_NOTIFY_CLOCK_SET)
> >
> > #endif /* _LINUX_TIMERFD_H */
> > diff --git a/kernel/compat.c b/kernel/compat.c
> > index 38b1d2c..b1cf3e1 100644
> > --- a/kernel/compat.c
> > +++ b/kernel/compat.c
> > @@ -995,7 +995,10 @@ asmlinkage long compat_sys_stime(compat_time_t __user *tptr)
> > if (err)
> > return err;
> >
> > - do_settimeofday(&tv);
> > + err = do_settimeofday(&tv);
> > + if (!err)
> > + timerfd_clock_was_set(CLOCK_REALTIME);
> > +
> > return 0;
> > }
> >
> > diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> > index 2c3d6e5..469eef6 100644
> > --- a/kernel/hrtimer.c
> > +++ b/kernel/hrtimer.c
> > @@ -663,6 +663,7 @@ void clock_was_set(void)
> > {
> > /* Retrigger the CPU local events everywhere */
> > on_each_cpu(retrigger_next_event, NULL, 1);
> > +
> > }
> >
> > /*
>
> Whitespace changes?
>
> > @@ -675,6 +676,9 @@ void hres_timers_resume(void)
> > KERN_INFO "hres_timers_resume() called with IRQs enabled!");
> >
> > retrigger_next_event(NULL);
> > +
> > + /* Trigger timerfd notifiers */
> > + timerfd_clock_was_set(CLOCK_MONOTONIC);
> > }
> >
> > /*
> > diff --git a/kernel/time.c b/kernel/time.c
> > index 6430a75..b06f759 100644
> > --- a/kernel/time.c
> > +++ b/kernel/time.c
> > @@ -92,7 +92,10 @@ SYSCALL_DEFINE1(stime, time_t __user *, tptr)
> > if (err)
> > return err;
> >
> > - do_settimeofday(&tv);
> > + err = do_settimeofday(&tv);
> > + if (!err)
> > + timerfd_clock_was_set(CLOCK_REALTIME);
> > +
> > return 0;
> > }
> >
> > @@ -177,7 +180,11 @@ int do_sys_settimeofday(const struct timespec *tv, const struct timezone *tz)
> > /* SMP safe, again the code in arch/foo/time.c should
> > * globally block out interrupts when it runs.
> > */
> > - return do_settimeofday(tv);
> > + error = do_settimeofday(tv);
> > + if (!error)
> > + timerfd_clock_was_set(CLOCK_REALTIME);
> > +
> > + return error;
> > }
> > return 0;
> > }
> > --
> > 1.7.2.1.45.gb66c2
> >
>
> --
> Kirill A. Shutemov

2011-03-10 08:19:28

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, Mar 10, 2011 at 01:36:18AM +0100, Kay Sievers wrote:
> On Thu, Mar 10, 2011 at 01:25, Andrew Morton <[email protected]> wrote:
> > On Wed, ?9 Mar 2011 16:36:51 +0200
> > Alexander Shishkin <[email protected]> wrote:
> >
> >> Changes since v3:
> >> ?- changed timerfd_settime() semantics (see below)
> >> Changes since v2:
> >> ?- replaced sysfs interface with a syscall
> >> ?- added sysctl/procfs handle to set a limit to the number of users
> >> ?- fixed issues pointed out by Greg.
> >> Changes since v1:
> >> ?- updated against 2.6.36-rc1,
> >> ?- added notification/filtering options,
> >> ?- added Documentation/ABI/sysfs-kernel-time-notify interface description.
>
> > It would be helpful to know if the identified users of this feature
> > actually find it useful and adequate. ?I guess the most common
> > application is the 1,001 desktop clock widgets. ?Do you have any
> > feedback from any of the owners of those?
>
> We want it for systemd, to provide cron-like functionality but without
> the need to stupidly wake up every minute and check the system time
> for possible jumps.
>
> It also sounds useful for a generic resume (after system suspend)
> notification for applications, which isn't really possible today.

Yes, but with John's CLOCK_BOOTTIME patches (which are in the tip tree
currently) it will be.

Regards,
--
Alex

2011-03-10 08:27:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wed, 9 Mar 2011 18:01:09 -0800 Scott James Remnant <[email protected]> wrote:

> > It would be helpful to know if the identified users of this feature
> > actually find it useful and adequate. __I guess the most common
> > application is the 1,001 desktop clock widgets. __Do you have any
> > feedback from any of the owners of those?
> >
> cron is another obvious one (or init systems attempting to replace
> cron). Having to wakeup and check the time every minute can be
> non-conducive to power savings, it would be better if we could just
> sleep until the next alarm and be woken up if the time changes in
> between.
>
> (That being said, we also need to poll for and/or check for timezone
> changes - but those are entirely userspace, so we can deal with that
> separately)

Sure, there will be lots of applications.

But what I'm asking isn't "it is a good feature". I'm asking "is the
feature implemented well". Ideally someone would get down and modify
cron to use the interface in this patch.

That's a bit of work (although not a huge amount) and a compromise
would be for app owners to sit down for half an hour and work through
their code and the kernel interface and let us know whether they found
the interface to be good and complete.

Because it would be most regrettable if we were to roll this feature
out and then three months later its users come back saying "what a
shame you didn't do it *this* way".

Also... Alexander, I assume you have a userspace test app? Please
send it and we'll add it to the changelog for testers, as example code
for implementers and as an additional way of reviewing the proposed
interface.

Thanks.

2011-03-10 08:48:28

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wednesday 09 March 2011 15:36:51 Alexander Shishkin wrote:
> This is another attempt to approach notifying userspace about system
> clock changes. The other one is using an eventfd and a syscall [1]. In
> the course of discussing the necessity of a syscall for this kind of
> notifications, it was suggested that this functionality can be achieved
> via timers [2] (and timerfd in particular [3]). This idea got quite
> some support [4], [5], [6] and some vague criticism [7], so I decided
> to try and go a bit further with it.

I don't understand from your description or from the patch how user
space gets notified. From your description, I would have expected
to see a change to the timerfd_poll() function to check if the
clock has changed, possibly returning POLLPRI, but the only such
change I can see is in the timerfd_read() function. Don't you need
to change poll() so that a task knows when to call read()?

> +/* TFD_NOTIFY_CLOCK_SET timers go here */
> +static DEFINE_SPINLOCK(notifiers_lock);
> +static LIST_HEAD(notifiers_list);

Maybe I was the only one to be confused by this, but I think t
he naming is slightly misleading, because it's easy to confuse
with a struct notifier_block. You could of course use the
notifier API instead of building your own, but if you don't,
I'd recommend coming up with a different name.

I also think that a mutex would be better here than a spinlock.
It's unlikely to be under heavy contention, but if you have
a lot of threads, it could be held for a significant amount
of time.

Arnd

2011-03-10 09:09:05

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, 10 Mar 2011, Kay Sievers wrote:

> On Thu, Mar 10, 2011 at 01:25, Andrew Morton <[email protected]> wrote:
> > On Wed,  9 Mar 2011 16:36:51 +0200
> > Alexander Shishkin <[email protected]> wrote:
> >
> >> Changes since v3:
> >>  - changed timerfd_settime() semantics (see below)
> >> Changes since v2:
> >>  - replaced sysfs interface with a syscall
> >>  - added sysctl/procfs handle to set a limit to the number of users
> >>  - fixed issues pointed out by Greg.
> >> Changes since v1:
> >>  - updated against 2.6.36-rc1,
> >>  - added notification/filtering options,
> >>  - added Documentation/ABI/sysfs-kernel-time-notify interface description.
>
> > It would be helpful to know if the identified users of this feature
> > actually find it useful and adequate.  I guess the most common
> > application is the 1,001 desktop clock widgets.  Do you have any
> > feedback from any of the owners of those?
>
> We want it for systemd, to provide cron-like functionality but without
> the need to stupidly wake up every minute and check the system time
> for possible jumps.
>
> It also sounds useful for a generic resume (after system suspend)
> notification for applications, which isn't really possible today.

Note, that we have CLOCK_BOOTTIME pending for .39 which aims at the
same problem. It's basically CLOCK_MONOTONIC adjusted by the time we
were in suspend. So while CLOCK_MONOTONIC timers are not aware of the
time spent in suspend CLOCK_BOOTTIME timers are. The reason for
implementing CLOCK_BOOTTIME was basically the same problem.

Thanks,

tglx

2011-03-10 09:52:59

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wed, 9 Mar 2011, Alexander Shishkin wrote:
> The major change since the previous version is the new semantics of
> timerfd_settime() when it's called on a time change notification
> descriptor: it will set the system time to utmr.it_value if the time
> change counter is zero, otherwise it will return EBUSY, this is required
> to prevent a race between setting the time and reading the counter, when
> the time controlling procees changes the time immediately after another
> process in the system did the same (the counter is greater than one),
> that process' time change will be lost. Thus, the time controlling
> process should use timerfd_settime() instead of clock_settime() or
> settimeofday() to ensure that other processes' time changes don't get
> lost.

No, we really don't want to go there and invent another mechanism to
set the time.

> /*
> + * for the notification timerfd, set current time to it_value
> + * if the timer hasn't expired; otherwise someone has changed
> + * the system time to the value that we don't know
> + */
> + if (!list_empty(&ctx->notifiers_list) && utmr) {
> + if (ctx->ticks) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> + ret = security_settime(&ktmr.it_value, NULL);
> + if (ret)
> + goto out;
> +
> + spin_unlock_irq(&ctx->wqh.lock);
> + ret = do_settimeofday(&ktmr.it_value);
> + goto out1;
> + }

And how does that solve the problem of multiple processes using that
interface? Not at all. You moved the timer_fd_clock_was_set()
notification into the syscalls so you do not deadlock on the
notifier_lock when you call do_settimeofday() here. So if you have
multiple users of notification fd then they do not notice that you
changed the time here. That's a half thought hack, really.

And you start to overload timerfd in a way which is really horrible.
The proposed semantics of timerfd_settime() with utmr == NULL or utmr
!= NULL depending on the notification flag are so non obvious that Joe
user space programmer is doomed to fail.

The problem you want to solve is:

Wakeup CLOCK_REALTIME timers which are not yet expired, when the
time is set backward.

That's at least what you said you wanted to solve. I regret already
that I suggested adding that flag to timerfd, as it was only meant to
provide an interface which wakes a non expired timer whenever clock
was set.

The patch does something different. How is this related to the problem
you wanted to solve in the first place?

Can you please explain which problems you identified aside of the
initial one?

Thanks,

tglx

2011-03-10 11:39:28

by Jamie Lokier

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

Thomas Gleixner wrote:
> Note, that we have CLOCK_BOOTTIME pending for .39 which aims at the
> same problem. It's basically CLOCK_MONOTONIC adjusted by the time we
> were in suspend. So while CLOCK_MONOTONIC timers are not aware of the
> time spent in suspend CLOCK_BOOTTIME timers are. The reason for
> implementing CLOCK_BOOTTIME was basically the same problem.

I'm afraid for coherent distributed system problems,
I don't trust CLOCK_BOOTTIME.

What happens when the clock battery is flat? (Some systems have
separate battery for the clock, and it's never changed or recharged).

What about systems that just don't have a hardware clock while
suspended, or the clock doesn't remember the current year reliably, or
it's handled by userspace not the kernel?

(I have a system here where the clock battery will eventually run
down, and which has a userspace-only hwclock driver)

What happens if user does suspend to disk and resumes the disk image
after they used a different OS for a while, which has meanwhile also
altered the clock?

Or suspend to disk on a VM followed by moving to a different VM host.

In general I trust CLOCK_BOOTTIME to be a reasonable measure of
elapsed time most of the time - but not reliable enough for
distributed systems (such as coherent caches) that need stricter
guarantees whatever the client hardware, or need to know when those
guarantees aren't met.

Whereas I'd trust an "something happened so recalibate" event that is
always triggered - provided it's not sent too early or too late
relative to clock measurements and timer queue reads. I've yet to
check if these proposed timerfd events meet that criterion.

-- Jamie

2011-03-10 11:42:09

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, 10 Mar 2011, Jamie Lokier wrote:
> Thomas Gleixner wrote:
> > Note, that we have CLOCK_BOOTTIME pending for .39 which aims at the
> > same problem. It's basically CLOCK_MONOTONIC adjusted by the time we
> > were in suspend. So while CLOCK_MONOTONIC timers are not aware of the
> > time spent in suspend CLOCK_BOOTTIME timers are. The reason for
> > implementing CLOCK_BOOTTIME was basically the same problem.
>
> I'm afraid for coherent distributed system problems,
> I don't trust CLOCK_BOOTTIME.

That timerfd thing as proposed will not solve that either.

> What happens when the clock battery is flat? (Some systems have
> separate battery for the clock, and it's never changed or recharged).

Then your timekeeping is hosed anyway, so that's the least of your
worries.

> What about systems that just don't have a hardware clock while
> suspended, or the clock doesn't remember the current year reliably, or
> it's handled by userspace not the kernel?
>
> (I have a system here where the clock battery will eventually run
> down, and which has a userspace-only hwclock driver)

Ditto. And I really do not care about user space only drivers at all.

> What happens if user does suspend to disk and resumes the disk image
> after they used a different OS for a while, which has meanwhile also
> altered the clock?

Again, your timekeeping is busted.

> Or suspend to disk on a VM followed by moving to a different VM host.

If your hosts have a complete different notion of clock realtime or
provide complete different based RTC emulation, then you run into a
whole set of other problems.

> In general I trust CLOCK_BOOTTIME to be a reasonable measure of
> elapsed time most of the time - but not reliable enough for
> distributed systems (such as coherent caches) that need stricter
> guarantees whatever the client hardware, or need to know when those
> guarantees aren't met.
>
> Whereas I'd trust an "something happened so recalibate" event that is
> always triggered - provided it's not sent too early or too late
> relative to clock measurements and timer queue reads. I've yet to
> check if these proposed timerfd events meet that criterion.

Not at all. There is no guarantee, that the process waiting for that
timerfd notification will resume before any other process which might
be affected by such an event.

You try to square the circle, but that won't happen before pigs fly.

Thanks,

tglx

2011-03-10 14:12:48

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, Mar 10, 2011 at 10:52:18AM +0100, Thomas Gleixner wrote:
> On Wed, 9 Mar 2011, Alexander Shishkin wrote:
> > The major change since the previous version is the new semantics of
> > timerfd_settime() when it's called on a time change notification
> > descriptor: it will set the system time to utmr.it_value if the time
> > change counter is zero, otherwise it will return EBUSY, this is required
> > to prevent a race between setting the time and reading the counter, when
> > the time controlling procees changes the time immediately after another
> > process in the system did the same (the counter is greater than one),
> > that process' time change will be lost. Thus, the time controlling
> > process should use timerfd_settime() instead of clock_settime() or
> > settimeofday() to ensure that other processes' time changes don't get
> > lost.
>
> No, we really don't want to go there and invent another mechanism to
> set the time.
>
> > /*
> > + * for the notification timerfd, set current time to it_value
> > + * if the timer hasn't expired; otherwise someone has changed
> > + * the system time to the value that we don't know
> > + */
> > + if (!list_empty(&ctx->notifiers_list) && utmr) {
> > + if (ctx->ticks) {
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > +
> > + ret = security_settime(&ktmr.it_value, NULL);
> > + if (ret)
> > + goto out;
> > +
> > + spin_unlock_irq(&ctx->wqh.lock);
> > + ret = do_settimeofday(&ktmr.it_value);
> > + goto out1;
> > + }
>
> And how does that solve the problem of multiple processes using that
> interface? Not at all. You moved the timer_fd_clock_was_set()
> notification into the syscalls so you do not deadlock on the
> notifier_lock when you call do_settimeofday() here. So if you have
> multiple users of notification fd then they do not notice that you
> changed the time here. That's a half thought hack, really.

Indeed, you're right here.

> And you start to overload timerfd in a way which is really horrible.
> The proposed semantics of timerfd_settime() with utmr == NULL or utmr
> != NULL depending on the notification flag are so non obvious that Joe
> user space programmer is doomed to fail.
>
> The problem you want to solve is:
>
> Wakeup CLOCK_REALTIME timers which are not yet expired, when the
> time is set backward.

"...when the time is set", yes.

> That's at least what you said you wanted to solve. I regret already
> that I suggested adding that flag to timerfd, as it was only meant to
> provide an interface which wakes a non expired timer whenever clock
> was set.

Yes. Except for in our usecase here and other usecases listed in the patch
description, there doesn't necessarily have to be a timer set to expire in
future. In some cases programs simply want to be notified when the time
changes. However, systemd or crond wouldn't (or shouldn't, in any case)
really care about time changes unless they have scheduled tasks.

I'm not sure it's worth it to always start a timer in order to get these
notifications. On the other hand, it fits much better in the timer/timerfd
interface than what I currently have.

> The patch does something different. How is this related to the problem
> you wanted to solve in the first place?

Well, if you scratch the timerfd_settime() bit, it kind of addresses the
initial problem. The timerfd_settime() was indeed a mistake.

> Can you please explain which problems you identified aside of the
> initial one?

Sure. The time daemon that we have here has to stop automatic time updates
when some other program changes system time *and* keep that setting
effective. Currently, when "the other program" changes the system time
right before time daemon changes it, this time setting will be overwritten
and lost. I'm thinking that it could be solved with something like

clock_swaptime(clockid, new_timespec, old_timespec);

but something tells me that it will not be welcome either.

Thanks,
--
Alex

2011-03-10 14:19:47

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, Mar 10, 2011 at 09:48:03AM +0100, Arnd Bergmann wrote:
> On Wednesday 09 March 2011 15:36:51 Alexander Shishkin wrote:
> > This is another attempt to approach notifying userspace about system
> > clock changes. The other one is using an eventfd and a syscall [1]. In
> > the course of discussing the necessity of a syscall for this kind of
> > notifications, it was suggested that this functionality can be achieved
> > via timers [2] (and timerfd in particular [3]). This idea got quite
> > some support [4], [5], [6] and some vague criticism [7], so I decided
> > to try and go a bit further with it.
>
> I don't understand from your description or from the patch how user
> space gets notified. From your description, I would have expected
> to see a change to the timerfd_poll() function to check if the
> clock has changed, possibly returning POLLPRI, but the only such
> change I can see is in the timerfd_read() function. Don't you need
> to change poll() so that a task knows when to call read()?

Luckily, no changes for the _poll function were required, because the
notification code reuses the ctx->ticks counter of timerfds. IOW, poll
wakes up in the same way as before.

> > +/* TFD_NOTIFY_CLOCK_SET timers go here */
> > +static DEFINE_SPINLOCK(notifiers_lock);
> > +static LIST_HEAD(notifiers_list);
>
> Maybe I was the only one to be confused by this, but I think t
> he naming is slightly misleading, because it's easy to confuse
> with a struct notifier_block. You could of course use the
> notifier API instead of building your own, but if you don't,
> I'd recommend coming up with a different name.

Point taken.

> I also think that a mutex would be better here than a spinlock.
> It's unlikely to be under heavy contention, but if you have
> a lot of threads, it could be held for a significant amount
> of time.

Indeed, thanks!

Regards,
--
Alex

2011-03-10 14:55:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, 10 Mar 2011, Alexander Shishkin wrote:
> On Thu, Mar 10, 2011 at 10:52:18AM +0100, Thomas Gleixner wrote:
> > On Wed, 9 Mar 2011, Alexander Shishkin wrote:
> > The patch does something different. How is this related to the problem
> > you wanted to solve in the first place?
>
> Well, if you scratch the timerfd_settime() bit, it kind of addresses the
> initial problem. The timerfd_settime() was indeed a mistake.
>
> > Can you please explain which problems you identified aside of the
> > initial one?
>
> Sure. The time daemon that we have here has to stop automatic time updates
> when some other program changes system time *and* keep that setting
> effective. Currently, when "the other program" changes the system time
> right before time daemon changes it, this time setting will be overwritten
> and lost. I'm thinking that it could be solved with something like
>
> clock_swaptime(clockid, new_timespec, old_timespec);
>
> but something tells me that it will not be welcome either.

What's that time daemon doing? The semantics of updating system time,
but stopping to do so when something else sets the time sounds more
like a design problem than anything else.

Thanks,

tglx

2011-03-10 15:44:04

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, Mar 10, 2011 at 03:55:00PM +0100, Thomas Gleixner wrote:
> On Thu, 10 Mar 2011, Alexander Shishkin wrote:
> > On Thu, Mar 10, 2011 at 10:52:18AM +0100, Thomas Gleixner wrote:
> > > On Wed, 9 Mar 2011, Alexander Shishkin wrote:
> > > The patch does something different. How is this related to the problem
> > > you wanted to solve in the first place?
> >
> > Well, if you scratch the timerfd_settime() bit, it kind of addresses the
> > initial problem. The timerfd_settime() was indeed a mistake.
> >
> > > Can you please explain which problems you identified aside of the
> > > initial one?
> >
> > Sure. The time daemon that we have here has to stop automatic time updates
> > when some other program changes system time *and* keep that setting
> > effective. Currently, when "the other program" changes the system time
> > right before time daemon changes it, this time setting will be overwritten
> > and lost. I'm thinking that it could be solved with something like
> >
> > clock_swaptime(clockid, new_timespec, old_timespec);
> >
> > but something tells me that it will not be welcome either.
>
> What's that time daemon doing? The semantics of updating system time,
> but stopping to do so when something else sets the time sounds more
> like a design problem than anything else.

The daemon's synchronizing system time with various sources like GSM base
stations, time servers etc, but only until something else touches the time
in the system, which would basically mean that the user has installed a
3rd-party application that's controlling system time or just called `date`.
I don't know all the reasons for this requirement, but it seems that not
losing time changes to a race is not a bad idea. Of course, if anyone cares.

Regards,
--
Alex

2011-03-10 16:40:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, 10 Mar 2011, Alexander Shishkin wrote:
> On Thu, Mar 10, 2011 at 03:55:00PM +0100, Thomas Gleixner wrote:
> > On Thu, 10 Mar 2011, Alexander Shishkin wrote:
> > > On Thu, Mar 10, 2011 at 10:52:18AM +0100, Thomas Gleixner wrote:
> > > > On Wed, 9 Mar 2011, Alexander Shishkin wrote:
> > > > The patch does something different. How is this related to the problem
> > > > you wanted to solve in the first place?
> > >
> > > Well, if you scratch the timerfd_settime() bit, it kind of addresses the
> > > initial problem. The timerfd_settime() was indeed a mistake.
> > >
> > > > Can you please explain which problems you identified aside of the
> > > > initial one?
> > >
> > > Sure. The time daemon that we have here has to stop automatic time updates
> > > when some other program changes system time *and* keep that setting
> > > effective. Currently, when "the other program" changes the system time
> > > right before time daemon changes it, this time setting will be overwritten
> > > and lost. I'm thinking that it could be solved with something like
> > >
> > > clock_swaptime(clockid, new_timespec, old_timespec);
> > >
> > > but something tells me that it will not be welcome either.
> >
> > What's that time daemon doing? The semantics of updating system time,
> > but stopping to do so when something else sets the time sounds more
> > like a design problem than anything else.
>
> The daemon's synchronizing system time with various sources like GSM base
> stations, time servers etc, but only until something else touches the time
> in the system, which would basically mean that the user has installed a
> 3rd-party application that's controlling system time or just called `date`.

Well, having several different applications fiddling with
settimeofday() is not a good idea to begin with. If you have that,
then there is no way to avoid races or inconsistencies.

There is no restriction of issuing settimeofday() or clock_settime()
on any standard Linux system other than security_settime(). Which is
fine. When I have the permission to issue 'date -s' then I better know
that I'm screwing over whatever is responsible for maintaining time on
my machine. Same applies for installing applications which fiddle with
time. My package manager usually makes sure, that I don't install two
NTP daemons, but nothing prevents me to launch another one when I'm on
a root shell. If stuff explodes in my face, then I'm to blame nothing
else.

> I don't know all the reasons for this requirement, but it seems that not
> losing time changes to a race is not a bad idea. Of course, if anyone cares.

Unless I'm missing something then this requirement is based on the
wish to shorten the rope with which you can hang yourself, but fails
to make it short enough to prevent it. So what's the point?

Thanks,

tglx

2011-03-10 21:58:09

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, 10 Mar 2011, Alexander Shishkin wrote:
> On Thu, Mar 10, 2011 at 10:52:18AM +0100, Thomas Gleixner wrote:
> Sure. The time daemon that we have here has to stop automatic time updates
> when some other program changes system time *and* keep that setting
> effective. Currently, when "the other program" changes the system time
> right before time daemon changes it, this time setting will be overwritten
> and lost. I'm thinking that it could be solved with something like
>
> clock_swaptime(clockid, new_timespec, old_timespec);
>
> but something tells me that it will not be welcome either.

Aside of that it wont work. You don't have a reference what
old_timespec means.

The whole problem space is full of race conditions and always will be
a horrible hackery when we try to piggy pack on clock_was_set() as we
have no idea what and when it actually happened. clock_was_set() is
async. While we can somehow get an event on a counter which tells us
that the clock was set, any attempt to return useful information aside
of the fact that the counter changed is going to be inconsistent one
way or the other.

It really takes some more to make this consistent for all the use
cases which are interested in notifications and unconditional timer
cancellation when the underlying clock was set.

After twisting my brain around the corner cases for a while I think
the only feasible approach to avoid all the lurking races is to:

1) Provide a syscall which returns the current offset of
CLOCK_REALTIME vs. CLOCK_MONOTONIC. That offset is changed when
CLOCK_REALTIME is set.

2) Provide a mechanism to check consistently the CLOCK_REALTIME
vs. CLOCK_MONOTONIC offset and notify about changes.

3) Extend the clock_nanosleep() flags with TIMER_CANCEL_ON_CLOCK_SET

When the flag is set, then the rmtp pointer, which is currently
used to copy the remaining time to user space must contain a valid
pointer to the previously retrieved CLOCK_REALTIME offset.

clock_nanosleep() then checks that user space provided offset under
#2 and hooks the caller into the notification mechanism. If the
offset has changed before the timer is enqueued the syscall returns
immediately with an appropriate error code. If the offset changes
after the check, then an eventually enqueued timer will be
cancelled and an appropriate error code returned.

Note: This wont work for signal based timers as we have no sane way
to notify user space about a forced cancellation of the timer. Even
if we could think about some extra signal for this, it's not worth
the trouble and the mess it's going to create.

4) Extend timerfd_settime() as #3 if necessary

I'd prefer to avoid that, but I can see the charm of the poll
facility which is provided by timerfd.

Again we could reuse the omtr pointer of timerfd_settime() to
provide the offset as an incoming parameter when the corresponing
flag is set and basically do the same thing as clock_nanosleep() in
the setup path - check the offset consistently.

It needs some thought on the return values from poll and how to
handle read, but that's a solvable problem as we can reasonably
restrict this functionality to non self rearming timers.

That should solve the most urgent problem of cron alike battery
wasters. It also should be a reasonable notification mechanism for
others who are just interested in the fact that clock was set as those
can simply arm a timer which expires somewhere in the next decade. If
clock is not set within that time frame then battery life wont suffer
from that once in a decade regular timer expiry wakeup.

It's not going to solve the "stop updating time when something else
set the clock" requirement, but as I argued before there is no point
to even think about that at all.

Thanks,

tglx

2011-03-11 19:51:42

by Scott James Remnant

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, Mar 10, 2011 at 12:25 AM, Andrew Morton
<[email protected]> wrote:
> On Wed, 9 Mar 2011 18:01:09 -0800 Scott James Remnant <[email protected]> wrote:
>
>> > It would be helpful to know if the identified users of this feature
>> > actually find it useful and adequate. __I guess the most common
>> > application is the 1,001 desktop clock widgets. __Do you have any
>> > feedback from any of the owners of those?
>> >
>> cron is another obvious one (or init systems attempting to replace
>> cron). Having to wakeup and check the time every minute can be
>> non-conducive to power savings, it would be better if we could just
>> sleep until the next alarm and be woken up if the time changes in
>> between.
>>
>> (That being said, we also need to poll for and/or check for timezone
>> changes - but those are entirely userspace, so we can deal with that
>> separately)
>
> Sure, there will be lots of applications.
>
> But what I'm asking isn't "it is a good feature". ?I'm asking "is the
> feature implemented well". ?Ideally someone would get down and modify
> cron to use the interface in this patch.
>
So I've just been thinking today - and I'm actually not sure whether
this is needed at all for this case.

A good cron implementation is going to set timers according to
CLOCK_REALTIME; in the case where the clock changes forwards, those
timers will fire as part of the clock changing already no? And in the
case where the clock changes backwards, you don't want to re-run old
ones anyway.

Even the hourly/daily cases are actually at a fixed time, so would be
triggered - and a decent implementation wouldn't trigger a given
script more than once.

I'm going to test this in userspace shortly to see whether it's the case.

Scott

2011-03-11 19:57:55

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Fri, 11 Mar 2011, Scott James Remnant wrote:
> On Thu, Mar 10, 2011 at 12:25 AM, Andrew Morton
> <[email protected]> wrote:
> > On Wed, 9 Mar 2011 18:01:09 -0800 Scott James Remnant <[email protected]> wrote:
> >
> >> > It would be helpful to know if the identified users of this feature
> >> > actually find it useful and adequate. __I guess the most common
> >> > application is the 1,001 desktop clock widgets. __Do you have any
> >> > feedback from any of the owners of those?
> >> >
> >> cron is another obvious one (or init systems attempting to replace
> >> cron). Having to wakeup and check the time every minute can be
> >> non-conducive to power savings, it would be better if we could just
> >> sleep until the next alarm and be woken up if the time changes in
> >> between.
> >>
> >> (That being said, we also need to poll for and/or check for timezone
> >> changes - but those are entirely userspace, so we can deal with that
> >> separately)
> >
> > Sure, there will be lots of applications.
> >
> > But what I'm asking isn't "it is a good feature". ?I'm asking "is the
> > feature implemented well". ?Ideally someone would get down and modify
> > cron to use the interface in this patch.
> >
> So I've just been thinking today - and I'm actually not sure whether
> this is needed at all for this case.
>
> A good cron implementation is going to set timers according to
> CLOCK_REALTIME; in the case where the clock changes forwards, those
> timers will fire as part of the clock changing already no? And in the
> case where the clock changes backwards, you don't want to re-run old
> ones anyway.
>
> Even the hourly/daily cases are actually at a fixed time, so would be
> triggered - and a decent implementation wouldn't trigger a given
> script more than once.

Yeah, I was wondering about today as well. Though when you set back
your clock several days, stuff might be surprised if it's not woken up
for several days :)

Thanks,

tglx

2011-03-15 01:53:12

by Scott James Remnant

[permalink] [raw]
Subject: Re: [RFCv4] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Fri, Mar 11, 2011 at 11:56 AM, Thomas Gleixner <[email protected]> wrote:
> On Fri, 11 Mar 2011, Scott James Remnant wrote:
>> On Thu, Mar 10, 2011 at 12:25 AM, Andrew Morton
>> <[email protected]> wrote:
>> > On Wed, 9 Mar 2011 18:01:09 -0800 Scott James Remnant <[email protected]> wrote:
>> >
>> >> > It would be helpful to know if the identified users of this feature
>> >> > actually find it useful and adequate. __I guess the most common
>> >> > application is the 1,001 desktop clock widgets. __Do you have any
>> >> > feedback from any of the owners of those?
>> >> >
>> >> cron is another obvious one (or init systems attempting to replace
>> >> cron). Having to wakeup and check the time every minute can be
>> >> non-conducive to power savings, it would be better if we could just
>> >> sleep until the next alarm and be woken up if the time changes in
>> >> between.
>> >>
>> >> (That being said, we also need to poll for and/or check for timezone
>> >> changes - but those are entirely userspace, so we can deal with that
>> >> separately)
>> >
>> > Sure, there will be lots of applications.
>> >
>> > But what I'm asking isn't "it is a good feature". ?I'm asking "is the
>> > feature implemented well". ?Ideally someone would get down and modify
>> > cron to use the interface in this patch.
>> >
>> So I've just been thinking today - and I'm actually not sure whether
>> this is needed at all for this case.
>>
>> A good cron implementation is going to set timers according to
>> CLOCK_REALTIME; in the case where the clock changes forwards, those
>> timers will fire as part of the clock changing already no? And in the
>> case where the clock changes backwards, you don't want to re-run old
>> ones anyway.
>>
>> Even the hourly/daily cases are actually at a fixed time, so would be
>> triggered - and a decent implementation wouldn't trigger a given
>> script more than once.
>
> Yeah, I was wondering about today as well. Though when you set back
> your clock several days, stuff might be surprised if it's not woken up
> for several days :)
>
I've checked the code, and more importantly, tested the
setting-forward example - timers do indeed fire at the point the clock
is wound forwards. This means there doesn't appear to be a utility for
this patch in the cron case.

In the wound back case, I believe that even current cron goes to some
effort to avoid firing events that have already happened?

Scott

2011-04-27 10:44:59

by Alexander Shishkin

[permalink] [raw]
Subject: [RFC][PATCH 3/4] hrtimer: add nanosleep cancellation

This patch implements conditional cancellation for clock_nanosleep
system call. Users who want to be notified of the time changes while
they are sleeping should use TIMER_CANCEL_ON_CLOCK_SET flag and
provide the wall_to_monotonic value they have obtained earlier with
clock_rtoffset call, in rmtp argument. This is only supported for
TIMER_ABSTIME sleepers.

If the provided monotonic offset is still effective, the caller will
sleep until either the requested time comes or somebody changes the
system time, in which case the system call will return ECANCELED. If
the offset has changed by the time of calling clock_nanosleep(), it
will return ECANCELLED straight away.

Signed-off-by: Alexander Shishkin <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Andrew Morton <[email protected]>
CC: John Stultz <[email protected]>
CC: Chris Friesen <[email protected]>
CC: Kay Sievers <[email protected]>
CC: Kirill A. Shutemov <[email protected]>
CC: [email protected]
---
include/linux/hrtimer.h | 1 +
include/linux/time.h | 1 +
kernel/compat.c | 2 +-
kernel/hrtimer.c | 38 ++++++++++++++++++++++++++++++++++++--
kernel/posix-timers.c | 1 +
5 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index a34b8c6..bf0b8d3 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -444,6 +444,7 @@ static inline u64 hrtimer_forward_now(struct hrtimer *timer,
extern long hrtimer_nanosleep(struct timespec *rqtp,
struct timespec __user *rmtp,
const enum hrtimer_mode mode,
+ int cancel_on_clock_set,
const clockid_t clockid);
extern long hrtimer_nanosleep_restart(struct restart_block *restart_block);

diff --git a/include/linux/time.h b/include/linux/time.h
index 8994853..33ff9f3 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -310,5 +310,6 @@ struct itimerval {
* The various flags for setting POSIX.1b interval timers:
*/
#define TIMER_ABSTIME 0x01
+#define TIMER_CANCEL_ON_CLOCK_SET 0x02

#endif
diff --git a/kernel/compat.c b/kernel/compat.c
index 38b1d2c..863b3e1 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -199,7 +199,7 @@ asmlinkage long compat_sys_nanosleep(struct compat_timespec __user *rqtp,
set_fs(KERNEL_DS);
ret = hrtimer_nanosleep(&tu,
rmtp ? (struct timespec __user *)&rmt : NULL,
- HRTIMER_MODE_REL, CLOCK_MONOTONIC);
+ HRTIMER_MODE_REL, 0, CLOCK_MONOTONIC);
set_fs(oldfs);

if (ret) {
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 1463164..0a495c9 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1537,6 +1537,11 @@ static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
return HRTIMER_NORESTART;
}

+static void nanosleeper_cancel(struct hrtimer *timer)
+{
+ hrtimer_wakeup(timer);
+}
+
void hrtimer_init_sleeper(struct hrtimer_sleeper *sl, struct task_struct *task)
{
sl->timer.function = hrtimer_wakeup;
@@ -1583,6 +1588,20 @@ static int update_rmtp(struct hrtimer *timer, struct timespec __user *rmtp)
return 1;
}

+static int nanosleep_set_cancel_on_clock_set(struct hrtimer_sleeper *t,
+ struct timespec __user *rmtp)
+{
+ struct timespec offset;
+
+ if (!rmtp)
+ return -EINVAL;
+ if (copy_from_user(&offset, rmtp, sizeof(offset)))
+ return -EFAULT;
+
+ return hrtimer_set_cancel_on_clock_set(&t->timer, &offset,
+ nanosleeper_cancel);
+}
+
long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
{
struct hrtimer_sleeper t;
@@ -1611,19 +1630,30 @@ out:
}

long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
- const enum hrtimer_mode mode, const clockid_t clockid)
+ const enum hrtimer_mode mode, int cancel_on_clock_set,
+ const clockid_t clockid)
{
struct restart_block *restart;
struct hrtimer_sleeper t;
int ret = 0;
unsigned long slack;

+ if (cancel_on_clock_set && mode != HRTIMER_MODE_ABS)
+ return -EINVAL;
+
slack = current->timer_slack_ns;
if (rt_task(current))
slack = 0;

hrtimer_init_on_stack(&t.timer, clockid, mode);
hrtimer_set_expires_range_ns(&t.timer, timespec_to_ktime(*rqtp), slack);
+
+ if (cancel_on_clock_set) {
+ ret = nanosleep_set_cancel_on_clock_set(&t, rmtp);
+ if (ret)
+ goto out;
+ }
+
if (do_nanosleep(&t, mode))
goto out;

@@ -1647,6 +1677,9 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,

ret = -ERESTART_RESTARTBLOCK;
out:
+ if (t.timer.cancel.cancelled)
+ ret = -ECANCELED;
+
destroy_hrtimer_on_stack(&t.timer);
return ret;
}
@@ -1662,7 +1695,8 @@ SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
if (!timespec_valid(&tu))
return -EINVAL;

- return hrtimer_nanosleep(&tu, rmtp, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
+ return hrtimer_nanosleep(&tu, rmtp, HRTIMER_MODE_REL, 0,
+ CLOCK_MONOTONIC);
}

/*
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 2512708..4791c04 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -1040,6 +1040,7 @@ static int common_nsleep(const clockid_t which_clock, int flags,
{
return hrtimer_nanosleep(tsave, rmtp, flags & TIMER_ABSTIME ?
HRTIMER_MODE_ABS : HRTIMER_MODE_REL,
+ !!(flags & TIMER_CANCEL_ON_CLOCK_SET),
which_clock);
}

--
1.7.4.1

2011-04-27 10:44:52

by Alexander Shishkin

[permalink] [raw]
Subject: [RFC][PATCH 2/4] hrtimer: add cancellation when clock is set

This patch implements a mechanism for hrtimers to be cancelled in
the event of system time changing. This is needed for users who want
to keep track of system time changes.

Users must provide the offset of CLOCK_REALTIME vs CLOCK_MONOTONIC
that they have obtained with clock_rtoffset() call, then, if it
matches the current value of wall_to_monotonic, the timer will be put
on the cancellation list. When the system time is changed, all the
timers on this list will be cancelled.

Signed-off-by: Alexander Shishkin <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Andrew Morton <[email protected]>
CC: John Stultz <[email protected]>
CC: Chris Friesen <[email protected]>
CC: Kay Sievers <[email protected]>
CC: Kirill A. Shutemov <[email protected]>
CC: [email protected]
---
include/linux/hrtimer.h | 22 ++++++++++++++
include/linux/time.h | 3 ++
kernel/hrtimer.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
kernel/time/ntp.c | 6 ++++
kernel/time/timekeeping.c | 27 ++++++++++++++++++
5 files changed, 126 insertions(+), 0 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 62f500c..a34b8c6 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -46,6 +46,19 @@ enum hrtimer_restart {
HRTIMER_RESTART, /* Timer must be restarted */
};

+/**
+ * struct hrtimer_cancel_block - timer cancellation control structure
+ * @list: link to the list of timers to be cancelled when a certain
+ * event is triggered.
+ * @function: cancellation callback
+ * @cancelled: flag to indicate that the timer has been cancelled
+ */
+struct hrtimer_cancel_block {
+ struct list_head list;
+ void (*function)(struct hrtimer *);
+ unsigned int cancelled:1;
+};
+
/*
* Values to track state of the timer
*
@@ -96,6 +109,7 @@ enum hrtimer_restart {
* @function: timer expiry callback function
* @base: pointer to the timer base (per cpu and per clock)
* @state: state information (See bit values above)
+ * @cancel: cancellation-related fields
* @start_site: timer statistics field to store the site where the timer
* was started
* @start_comm: timer statistics field to store the name of the process which
@@ -111,6 +125,7 @@ struct hrtimer {
enum hrtimer_restart (*function)(struct hrtimer *);
struct hrtimer_clock_base *base;
unsigned long state;
+ struct hrtimer_cancel_block cancel;
#ifdef CONFIG_TIMER_STATS
int start_pid;
void *start_site;
@@ -260,6 +275,8 @@ extern void clock_was_set(void);
extern void hres_timers_resume(void);
extern void hrtimer_interrupt(struct clock_event_device *dev);

+extern void hrtimer_clock_was_set(void);
+
/*
* In high resolution mode the time reference must be read accurate
*/
@@ -301,6 +318,7 @@ static inline void hrtimer_peek_ahead_timers(void) { }

static inline void hres_timers_resume(void) { }

+static inline void hrtimer_clock_was_set(void) { }
/*
* In non high resolution mode the time reference is taken from
* the base softirq time variable.
@@ -330,6 +348,10 @@ DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
extern void hrtimer_init(struct hrtimer *timer, clockid_t which_clock,
enum hrtimer_mode mode);

+extern int hrtimer_set_cancel_on_clock_set(struct hrtimer *timer,
+ struct timespec *offset,
+ void (*function)(struct hrtimer *));
+
#ifdef CONFIG_DEBUG_OBJECTS_TIMERS
extern void hrtimer_init_on_stack(struct hrtimer *timer, clockid_t which_clock,
enum hrtimer_mode mode);
diff --git a/include/linux/time.h b/include/linux/time.h
index 454a262..8994853 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -120,6 +120,9 @@ extern int no_sync_cmos_clock __read_mostly;
void timekeeping_init(void);
extern int timekeeping_suspended;

+int call_if_monotonic_offset_is(struct timespec *offset,
+ void (*function)(void *), void *data);
+
unsigned long get_seconds(void);
struct timespec current_kernel_time(void);
struct timespec __current_kernel_time(void); /* does not take xtime_lock */
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 9017478..1463164 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -649,6 +649,65 @@ static void retrigger_next_event(void *arg)
}

/*
+ * Every time wall_to_monotonic changes, we cancel all the sleepers
+ * on this list.
+ */
+static LIST_HEAD(cancellation_list);
+static DEFINE_RAW_SPINLOCK(cancellation_lock);
+
+static void cancel_list_add(void *data)
+{
+ struct hrtimer *timer = data;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&cancellation_lock, flags);
+ list_add(&timer->cancel.list, &cancellation_list);
+ raw_spin_unlock_irqrestore(&cancellation_lock, flags);
+}
+
+/*
+ * Add hrtimer to the list of timers that should be cancelled
+ * on monotonic offset change
+ */
+int hrtimer_set_cancel_on_clock_set(struct hrtimer *timer,
+ struct timespec *offset,
+ void (*canceller)(struct hrtimer *))
+{
+ int ret = 0;
+
+ if (call_if_monotonic_offset_is(offset, cancel_list_add, timer))
+ ret = -ECANCELED;
+ else
+ timer->cancel.function = canceller;
+
+ return ret;
+}
+
+/*
+ * Cancel the timers on the cancellation list; call when
+ * wall_to_monotonic has been changed.
+ */
+void hrtimer_clock_was_set(void)
+{
+ struct hrtimer *timer, *iter;
+ unsigned long flags;
+ struct list_head new_head;
+
+ raw_spin_lock_irqsave(&cancellation_lock, flags);
+ list_replace_init(&cancellation_list, &new_head);
+ raw_spin_unlock_irqrestore(&cancellation_lock, flags);
+
+ list_for_each_entry_safe(timer, iter, &new_head, cancel.list) {
+ list_del_init(&timer->cancel.list);
+ hrtimer_cancel(timer);
+ timer->cancel.cancelled = 1;
+
+ BUG_ON(!timer->cancel.function);
+ timer->cancel.function(timer);
+ }
+}
+
+/*
* Clock realtime was set
*
* Change the offset of the realtime clock vs. the monotonic
@@ -876,6 +935,8 @@ static void __remove_hrtimer(struct hrtimer *timer,
struct hrtimer_clock_base *base,
unsigned long newstate, int reprogram)
{
+ unsigned long flags;
+
if (!(timer->state & HRTIMER_STATE_ENQUEUED))
goto out;

@@ -895,6 +956,11 @@ static void __remove_hrtimer(struct hrtimer *timer,
timerqueue_del(&base->active, &timer->node);
out:
timer->state = newstate;
+
+ raw_spin_lock_irqsave(&cancellation_lock, flags);
+ if (!list_empty(&timer->cancel.list))
+ list_del(&timer->cancel.list);
+ raw_spin_unlock_irqrestore(&cancellation_lock, flags);
}

/*
@@ -1140,6 +1206,8 @@ static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
timer->base = &cpu_base->clock_base[base];
timerqueue_init(&timer->node);

+ INIT_LIST_HEAD(&timer->cancel.list);
+
#ifdef CONFIG_TIMER_STATS
timer->start_site = NULL;
timer->start_pid = -1;
diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index f6117a4..0963061 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -357,6 +357,7 @@ void ntp_clear(void)
static enum hrtimer_restart ntp_leap_second(struct hrtimer *timer)
{
enum hrtimer_restart res = HRTIMER_NORESTART;
+ int wtm_changed = 0;

write_seqlock(&xtime_lock);

@@ -365,6 +366,7 @@ static enum hrtimer_restart ntp_leap_second(struct hrtimer *timer)
break;
case TIME_INS:
timekeeping_leap_insert(-1);
+ wtm_changed++;
time_state = TIME_OOP;
printk(KERN_NOTICE
"Clock: inserting leap second 23:59:60 UTC\n");
@@ -373,6 +375,7 @@ static enum hrtimer_restart ntp_leap_second(struct hrtimer *timer)
break;
case TIME_DEL:
timekeeping_leap_insert(1);
+ wtm_changed++;
time_tai--;
time_state = TIME_WAIT;
printk(KERN_NOTICE
@@ -390,6 +393,9 @@ static enum hrtimer_restart ntp_leap_second(struct hrtimer *timer)

write_sequnlock(&xtime_lock);

+ if (wtm_changed)
+ hrtimer_clock_was_set();
+
return res;
}

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 8ad5d57..130c19e 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -169,6 +169,24 @@ static struct timespec raw_time;
/* flag for if timekeeping is suspended */
int __read_mostly timekeeping_suspended;

+/*
+ * call a function if offset matches current value of wall_to_monotonic
+ */
+int call_if_monotonic_offset_is(struct timespec *offset,
+ void (*function)(void *), void *data)
+{
+ unsigned long flags;
+ int ret;
+
+ write_seqlock_irqsave(&xtime_lock, flags);
+ ret = timespec_compare(&wall_to_monotonic, offset);
+ if (!ret)
+ function(data);
+ write_sequnlock_irqrestore(&xtime_lock, flags);
+
+ return ret;
+}
+
/* must hold xtime_lock */
void timekeeping_leap_insert(int leapsecond)
{
@@ -379,6 +397,8 @@ int do_settimeofday(const struct timespec *tv)

write_sequnlock_irqrestore(&xtime_lock, flags);

+ hrtimer_clock_was_set();
+
/* signal hrtimers about time change */
clock_was_set();

@@ -416,6 +436,8 @@ int timekeeping_inject_offset(struct timespec *ts)

write_sequnlock_irqrestore(&xtime_lock, flags);

+ hrtimer_clock_was_set();
+
/* signal hrtimers about time change */
clock_was_set();

@@ -606,6 +628,7 @@ static void timekeeping_resume(void)
{
unsigned long flags;
struct timespec ts;
+ int wtm_changed = 0;

read_persistent_clock(&ts);

@@ -617,6 +640,7 @@ static void timekeeping_resume(void)
ts = timespec_sub(ts, timekeeping_suspend_time);
xtime = timespec_add(xtime, ts);
wall_to_monotonic = timespec_sub(wall_to_monotonic, ts);
+ wtm_changed++;
total_sleep_time = timespec_add(total_sleep_time, ts);
}
/* re-base the last cycle value */
@@ -625,6 +649,9 @@ static void timekeeping_resume(void)
timekeeping_suspended = 0;
write_sequnlock_irqrestore(&xtime_lock, flags);

+ if (wtm_changed)
+ hrtimer_clock_was_set();
+
touch_softlockup_watchdog();

clockevents_notify(CLOCK_EVT_NOTIFY_RESUME, NULL);
--
1.7.4.1

2011-04-27 10:45:44

by Alexander Shishkin

[permalink] [raw]
Subject: [RFC][PATCH 4/4] timerfd: add cancellation

This patch implements conditional cancellation for timerfd timers.
Similarly to clock_nanosleep, users who want to be woken up when
system time changes have to use TFD_CANCEL_ON_CLOCK_SET flag and
monotonic vs realtime offset, obtained from clock_rtoffset system
call, in otmr.it_interval to timerfd_settime() call. This is only
supported for absolute timers (TFD_TIMER_ABSTIME).

If the provided monotonic offset is still effective, poll on the
timerfd will sleep until either the requested time comes or somebody
changes the system time, in which case the system call will return
POLLHUP. If the monotonic offset has changed by the time of calling
timerfd_settime(), it will return ECANCELLED straight away. This
seems more straightforward than returning POLLIN from poll and only
returning error from read.

Signed-off-by: Alexander Shishkin <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Andrew Morton <[email protected]>
CC: John Stultz <[email protected]>
CC: Chris Friesen <[email protected]>
CC: Kay Sievers <[email protected]>
CC: Kirill A. Shutemov <[email protected]>
CC: [email protected]
---
fs/timerfd.c | 40 +++++++++++++++++++++++++++++++++++++++-
include/linux/timerfd.h | 3 ++-
2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index 8c4fc14..1955552 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -59,6 +59,16 @@ static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx)
return remaining.tv64 < 0 ? ktime_set(0, 0): remaining;
}

+static void timerfd_canceller(struct hrtimer *timer)
+{
+ struct timerfd_ctx *ctx = container_of(timer, struct timerfd_ctx, tmr);
+ unsigned long flags;
+
+ spin_lock_irqsave(&ctx->wqh.lock, flags);
+ wake_up_locked(&ctx->wqh);
+ spin_unlock_irqrestore(&ctx->wqh.lock, flags);
+}
+
static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
const struct itimerspec *ktmr)
{
@@ -99,6 +109,8 @@ static unsigned int timerfd_poll(struct file *file, poll_table *wait)
spin_lock_irqsave(&ctx->wqh.lock, flags);
if (ctx->ticks)
events |= POLLIN;
+ if (ctx->tmr.cancel.cancelled)
+ events |= POLLHUP;
spin_unlock_irqrestore(&ctx->wqh.lock, flags);

return events;
@@ -117,7 +129,15 @@ static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
if (file->f_flags & O_NONBLOCK)
res = -EAGAIN;
else
- res = wait_event_interruptible_locked_irq(ctx->wqh, ctx->ticks);
+ res = wait_event_interruptible_locked_irq(ctx->wqh, ctx->ticks
+ || ctx->tmr.cancel.cancelled);
+
+ /* if the timer is being cancelled, don't rearm it */
+ if (ctx->tmr.cancel.cancelled) {
+ spin_unlock_irq(&ctx->wqh.lock);
+ return -ECANCELED;
+ }
+
if (ctx->ticks) {
ticks = ctx->ticks;
if (ctx->expired && ctx->tintv.tv64) {
@@ -197,6 +217,7 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
struct itimerspec __user *, otmr)
{
struct file *file;
+ struct timespec offset;
struct timerfd_ctx *ctx;
struct itimerspec ktmr, kotmr;

@@ -208,6 +229,14 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
!timespec_valid(&ktmr.it_interval))
return -EINVAL;

+ if (flags & TFD_CANCEL_ON_CLOCK_SET) {
+ if (!otmr || !(flags & TFD_TIMER_ABSTIME))
+ return -EINVAL;
+ if (copy_from_user(&kotmr, otmr, sizeof(kotmr)))
+ return -EFAULT;
+ offset = kotmr.it_interval;
+ }
+
file = timerfd_fget(ufd);
if (IS_ERR(file))
return PTR_ERR(file);
@@ -242,6 +271,15 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
*/
timerfd_setup(ctx, flags, &ktmr);

+ if ((flags & TFD_CANCEL_ON_CLOCK_SET) &&
+ hrtimer_set_cancel_on_clock_set(&ctx->tmr, &offset,
+ timerfd_canceller)) {
+ hrtimer_cancel(&ctx->tmr);
+ spin_unlock_irq(&ctx->wqh.lock);
+ fput(file);
+ return -ECANCELED;
+ }
+
spin_unlock_irq(&ctx->wqh.lock);
fput(file);
if (otmr && copy_to_user(otmr, &kotmr, sizeof(kotmr)))
diff --git a/include/linux/timerfd.h b/include/linux/timerfd.h
index 2d07929..f4cd638 100644
--- a/include/linux/timerfd.h
+++ b/include/linux/timerfd.h
@@ -19,6 +19,7 @@
* shared O_* flags.
*/
#define TFD_TIMER_ABSTIME (1 << 0)
+#define TFD_CANCEL_ON_CLOCK_SET (1 << 1)
#define TFD_CLOEXEC O_CLOEXEC
#define TFD_NONBLOCK O_NONBLOCK

@@ -26,6 +27,6 @@
/* Flags for timerfd_create. */
#define TFD_CREATE_FLAGS TFD_SHARED_FCNTL_FLAGS
/* Flags for timerfd_settime. */
-#define TFD_SETTIME_FLAGS TFD_TIMER_ABSTIME
+#define TFD_SETTIME_FLAGS (TFD_TIMER_ABSTIME | TFD_CANCEL_ON_CLOCK_SET)

#endif /* _LINUX_TIMERFD_H */
--
1.7.4.1

2011-04-27 10:46:04

by Alexander Shishkin

[permalink] [raw]
Subject: [RFC][PATCH 1/4] clock_rtoffset: new syscall

In order to keep track of system time changes, we introduce a new
syscall which returns the offset of CLOCK_MONOTONIC clock against
CLOCK_REALTIME. The caller is to store this value and use it in
system calls (like clock_nanosleep or timerfd_settime) that will
compare it against the effective offset in order to ensure that
the caller's notion of the system time corresponds to the effective
system time at the moment of the action being carried out. If it
has changed, these system calls will return an error and the caller
will have to obtain this offset again.

Signed-off-by: Alexander Shishkin <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Andrew Morton <[email protected]>
CC: John Stultz <[email protected]>
CC: Chris Friesen <[email protected]>
CC: Kay Sievers <[email protected]>
CC: Kirill A. Shutemov <[email protected]>
CC: [email protected]
---
include/asm-generic/unistd.h | 4 +++-
kernel/posix-timers.c | 14 ++++++++++++++
kernel/sys_ni.c | 3 +++
3 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h
index 07c40d5..89edfc8 100644
--- a/include/asm-generic/unistd.h
+++ b/include/asm-generic/unistd.h
@@ -654,9 +654,11 @@ __SYSCALL(__NR_open_by_handle_at, sys_open_by_handle_at)
__SYSCALL(__NR_clock_adjtime, sys_clock_adjtime)
#define __NR_syncfs 267
__SYSCALL(__NR_syncfs, sys_syncfs)
+#define __NR_clock_rtoffset 268
+__SYSCALL(__NR_clock_rtoffset, sys_clock_rtoffset)

#undef __NR_syscalls
-#define __NR_syscalls 268
+#define __NR_syscalls 269

/*
* All syscalls below here should go away really,
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index e5498d7..2512708 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -1019,6 +1019,20 @@ SYSCALL_DEFINE2(clock_getres, const clockid_t, which_clock,
}

/*
+ * Retrieve CLOCK_REALTIME's offset against CLOCK_MONOTONIC
+ */
+SYSCALL_DEFINE1(clock_rtoffset, struct timespec __user, *offset)
+{
+ struct timespec time, mono, sleep;
+ int ret;
+
+ get_xtime_and_monotonic_and_sleep_offset(&time, &mono, &sleep);
+ ret = copy_to_user(offset, &mono, sizeof(mono));
+
+ return ret ? -EFAULT : 0;
+}
+
+/*
* nanosleep for monotonic and realtime clocks
*/
static int common_nsleep(const clockid_t which_clock, int flags,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 25cc41c..47791cc 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -191,3 +191,6 @@ cond_syscall(sys_fanotify_mark);
cond_syscall(sys_name_to_handle_at);
cond_syscall(sys_open_by_handle_at);
cond_syscall(compat_sys_open_by_handle_at);
+
+/* monotonic vs realtime clock offset */
+cond_syscall(sys_clock_rtoffset);
--
1.7.4.1

2011-04-27 14:03:19

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] clock_rtoffset: new syscall

On Wed, 27 Apr 2011, Alexander Shishkin wrote:

> In order to keep track of system time changes, we introduce a new
> syscall which returns the offset of CLOCK_MONOTONIC clock against
> CLOCK_REALTIME. The caller is to store this value and use it in
> system calls (like clock_nanosleep or timerfd_settime) that will
> compare it against the effective offset in order to ensure that
> the caller's notion of the system time corresponds to the effective
> system time at the moment of the action being carried out. If it
> has changed, these system calls will return an error and the caller
> will have to obtain this offset again.

No, we do not expose kernel internals like this to user space. The
kernel internal representation of time is subject to change.

Also abusing the reminder argument of clock_nanosleep for handing back
the offset is a horrible hack including the non standard -ECANCELED
return value. No, we don't change posix interfaces that way.

We can add something to timerfd, but definitly not with another
syscall and by bloating hrtimer and sprinkling cancellation calls all
over the place. clock_was_set() should be enough and it already calls
into the hrtimer code, the resume path calls into it as well, so
there is no need to introduce such a mess.

The completely untested patch below should solve the same problem in a
sane way. Restricted to timerfd, but that really should be sufficient.

Thanks,

tglx

------------->

Subject: timerfd: Allow timers to be cancelled when clock was set
From: Thomas Gleixner <[email protected]>
Date: Wed, 27 Apr 2011 14:16:42 +0200

<Add proper changelog here>

Signed-off-by: Thomas Gleixner <[email protected]>
---
fs/timerfd.c | 57 +++++++++++++++++++++++++++++++++++++++++-----
include/linux/hrtimer.h | 3 +-
include/linux/time.h | 5 ++++
include/linux/timerfd.h | 3 +-
kernel/hrtimer.c | 41 +++++++++++++++++++++++++++------
kernel/time/timekeeping.c | 15 ++++++++++++
6 files changed, 109 insertions(+), 15 deletions(-)

Index: linux-2.6-tip/fs/timerfd.c
===================================================================
--- linux-2.6-tip.orig/fs/timerfd.c
+++ linux-2.6-tip/fs/timerfd.c
@@ -26,10 +26,12 @@
struct timerfd_ctx {
struct hrtimer tmr;
ktime_t tintv;
+ ktime_t moffs;
wait_queue_head_t wqh;
u64 ticks;
int expired;
int clockid;
+ bool might_cancel;
};

/*
@@ -59,24 +61,52 @@ static ktime_t timerfd_get_remaining(str
return remaining.tv64 < 0 ? ktime_set(0, 0): remaining;
}

-static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
- const struct itimerspec *ktmr)
+static bool timerfd_cancelled(struct timerfd_ctx *ctx)
+{
+ ktime_t moffs;
+
+ if (!ctx->might_cancel)
+ return false;
+
+ moffs = get_monotonic_offset();
+
+ if (moffs.tv64 == ctx->moffs.tv64)
+ return false;
+
+ ctx->moffs = moffs;
+ return true;
+}
+
+static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
+ const struct itimerspec *ktmr)
{
enum hrtimer_mode htmode;
ktime_t texp;
+ int clockid = ctx->clockid;

htmode = (flags & TFD_TIMER_ABSTIME) ?
HRTIMER_MODE_ABS: HRTIMER_MODE_REL;

+ ctx->might_cancel = false;
+ if (htmode == HRTIMER_MODE_ABS && ctx->clockid == CLOCK_REALTIME &&
+ (flags & TFD_TIMER_CANCELON_SET)) {
+ clockid = CLOCK_REALTIME_COS;
+ ctx->might_cancel = true;
+ }
+
texp = timespec_to_ktime(ktmr->it_value);
ctx->expired = 0;
ctx->ticks = 0;
ctx->tintv = timespec_to_ktime(ktmr->it_interval);
- hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
+ hrtimer_init(&ctx->tmr, clockid, htmode);
hrtimer_set_expires(&ctx->tmr, texp);
ctx->tmr.function = timerfd_tmrproc;
- if (texp.tv64 != 0)
+ if (texp.tv64 != 0) {
hrtimer_start(&ctx->tmr, texp, htmode);
+ if (timerfd_cancelled(ctx))
+ return -ECANCELED;
+ }
+ return 0;
}

static int timerfd_release(struct inode *inode, struct file *file)
@@ -118,8 +148,21 @@ static ssize_t timerfd_read(struct file
res = -EAGAIN;
else
res = wait_event_interruptible_locked_irq(ctx->wqh, ctx->ticks);
+
if (ctx->ticks) {
ticks = ctx->ticks;
+
+ /*
+ * If clock has changed, we do not care about the
+ * ticks and we do not rearm the timer. Userspace must
+ * reevaluate anyway.
+ */
+ if (timerfd_cancelled(ctx)) {
+ ticks = 0;
+ ctx->expired = 0;
+ res = -ECANCELED;
+ }
+
if (ctx->expired && ctx->tintv.tv64) {
/*
* If tintv.tv64 != 0, this is a periodic timer that
@@ -183,6 +226,7 @@ SYSCALL_DEFINE2(timerfd_create, int, clo
init_waitqueue_head(&ctx->wqh);
ctx->clockid = clockid;
hrtimer_init(&ctx->tmr, clockid, HRTIMER_MODE_ABS);
+ ctx->moffs = get_monotonic_offset();

ufd = anon_inode_getfd("[timerfd]", &timerfd_fops, ctx,
O_RDWR | (flags & TFD_SHARED_FCNTL_FLAGS));
@@ -199,6 +243,7 @@ SYSCALL_DEFINE4(timerfd_settime, int, uf
struct file *file;
struct timerfd_ctx *ctx;
struct itimerspec ktmr, kotmr;
+ int ret;

if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
return -EFAULT;
@@ -240,14 +285,14 @@ SYSCALL_DEFINE4(timerfd_settime, int, uf
/*
* Re-program the timer to the new value ...
*/
- timerfd_setup(ctx, flags, &ktmr);
+ ret = timerfd_setup(ctx, flags, &ktmr);

spin_unlock_irq(&ctx->wqh.lock);
fput(file);
if (otmr && copy_to_user(otmr, &kotmr, sizeof(kotmr)))
return -EFAULT;

- return 0;
+ return ret;
}

SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct itimerspec __user *, otmr)
Index: linux-2.6-tip/include/linux/hrtimer.h
===================================================================
--- linux-2.6-tip.orig/include/linux/hrtimer.h
+++ linux-2.6-tip/include/linux/hrtimer.h
@@ -157,6 +157,7 @@ enum hrtimer_base_type {
HRTIMER_BASE_REALTIME,
HRTIMER_BASE_MONOTONIC,
HRTIMER_BASE_BOOTTIME,
+ HRTIMER_BASE_REALTIME_COS,
HRTIMER_MAX_CLOCK_BASES,
};

@@ -319,7 +320,7 @@ static inline int hrtimer_is_hres_active
extern ktime_t ktime_get(void);
extern ktime_t ktime_get_real(void);
extern ktime_t ktime_get_boottime(void);
-
+ktime_t get_monotonic_offset(void);

DECLARE_PER_CPU(struct tick_device, tick_cpu_device);

Index: linux-2.6-tip/include/linux/time.h
===================================================================
--- linux-2.6-tip.orig/include/linux/time.h
+++ linux-2.6-tip/include/linux/time.h
@@ -295,6 +295,11 @@ struct itimerval {
#define CLOCK_MONOTONIC_COARSE 6
#define CLOCK_BOOTTIME 7

+#ifdef __KERNEL__
+/* This clock is not exposed to user space */
+#define CLOCK_REALTIME_COS 8
+#endif
+
/*
* The IDs of various hardware clocks:
*/
Index: linux-2.6-tip/include/linux/timerfd.h
===================================================================
--- linux-2.6-tip.orig/include/linux/timerfd.h
+++ linux-2.6-tip/include/linux/timerfd.h
@@ -19,6 +19,7 @@
* shared O_* flags.
*/
#define TFD_TIMER_ABSTIME (1 << 0)
+#define TFD_TIMER_CANCELON_SET (1 << 1)
#define TFD_CLOEXEC O_CLOEXEC
#define TFD_NONBLOCK O_NONBLOCK

@@ -26,6 +27,6 @@
/* Flags for timerfd_create. */
#define TFD_CREATE_FLAGS TFD_SHARED_FCNTL_FLAGS
/* Flags for timerfd_settime. */
-#define TFD_SETTIME_FLAGS TFD_TIMER_ABSTIME
+#define TFD_SETTIME_FLAGS (TFD_TIMER_ABSTIME | TFD_TIMER_CANCELON_SET)

#endif /* _LINUX_TIMERFD_H */
Index: linux-2.6-tip/kernel/hrtimer.c
===================================================================
--- linux-2.6-tip.orig/kernel/hrtimer.c
+++ linux-2.6-tip/kernel/hrtimer.c
@@ -78,6 +78,11 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base,
.get_time = &ktime_get_boottime,
.resolution = KTIME_LOW_RES,
},
+ {
+ .index = CLOCK_REALTIME_COS,
+ .get_time = &ktime_get_real,
+ .resolution = KTIME_LOW_RES,
+ },
}
};

@@ -106,6 +111,7 @@ static void hrtimer_get_softirq_time(str
base->clock_base[HRTIMER_BASE_REALTIME].softirq_time = xtim;
base->clock_base[HRTIMER_BASE_MONOTONIC].softirq_time = mono;
base->clock_base[HRTIMER_BASE_BOOTTIME].softirq_time = boot;
+ base->clock_base[HRTIMER_BASE_REALTIME_COS].softirq_time = xtim;
}

/*
@@ -617,6 +623,8 @@ static int hrtimer_reprogram(struct hrti
return res;
}

+static void
+hrtimer_expire_cancelable(struct hrtimer_cpu_base *cpu_base, ktime_t now);

/*
* Retrigger next event is called after clock was set
@@ -626,13 +634,12 @@ static int hrtimer_reprogram(struct hrti
static void retrigger_next_event(void *arg)
{
struct hrtimer_cpu_base *base;
- struct timespec realtime_offset, wtm, sleep;
+ struct timespec realtime_offset, xtim, wtm, sleep;

if (!hrtimer_hres_active())
return;

- get_xtime_and_monotonic_and_sleep_offset(&realtime_offset, &wtm,
- &sleep);
+ get_xtime_and_monotonic_and_sleep_offset(&xtim, &wtm, &sleep);
set_normalized_timespec(&realtime_offset, -wtm.tv_sec, -wtm.tv_nsec);

base = &__get_cpu_var(hrtimer_bases);
@@ -643,6 +650,10 @@ static void retrigger_next_event(void *a
timespec_to_ktime(realtime_offset);
base->clock_base[HRTIMER_BASE_BOOTTIME].offset =
timespec_to_ktime(sleep);
+ base->clock_base[HRTIMER_BASE_REALTIME_COS].offset =
+ timespec_to_ktime(realtime_offset);
+
+ hrtimer_expire_cancelable(base, timespec_to_ktime(xtim));

hrtimer_force_reprogram(base, 0);
raw_spin_unlock(&base->lock);
@@ -715,7 +726,7 @@ static inline int hrtimer_enqueue_reprog
*/
static int hrtimer_switch_to_hres(void)
{
- int cpu = smp_processor_id();
+ int i, cpu = smp_processor_id();
struct hrtimer_cpu_base *base = &per_cpu(hrtimer_bases, cpu);
unsigned long flags;

@@ -731,9 +742,8 @@ static int hrtimer_switch_to_hres(void)
return 0;
}
base->hres_active = 1;
- base->clock_base[HRTIMER_BASE_REALTIME].resolution = KTIME_HIGH_RES;
- base->clock_base[HRTIMER_BASE_MONOTONIC].resolution = KTIME_HIGH_RES;
- base->clock_base[HRTIMER_BASE_BOOTTIME].resolution = KTIME_HIGH_RES;
+ for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
+ base->clock_base[i].resolution = KTIME_HIGH_RES;

tick_setup_sched_timer();

@@ -1221,6 +1231,22 @@ static void __run_hrtimer(struct hrtimer
timer->state &= ~HRTIMER_STATE_CALLBACK;
}

+static void
+hrtimer_expire_cancelable(struct hrtimer_cpu_base *cpu_base, ktime_t now)
+{
+ struct timerqueue_node *node;
+ struct hrtimer_clock_base *base;
+
+ base = cpu_base->clock_base + HRTIMER_BASE_REALTIME_COS;
+
+ while ((node = timerqueue_getnext(&base->active))) {
+ struct hrtimer *timer;
+
+ timer = container_of(node, struct hrtimer, node);
+ __run_hrtimer(timer, &now);
+ }
+}
+
#ifdef CONFIG_HIGH_RES_TIMERS

/*
@@ -1725,6 +1751,7 @@ void __init hrtimers_init(void)
hrtimer_clock_to_base_table[CLOCK_REALTIME] = HRTIMER_BASE_REALTIME;
hrtimer_clock_to_base_table[CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC;
hrtimer_clock_to_base_table[CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME;
+ hrtimer_clock_to_base_table[CLOCK_REALTIME_COS] = HRTIMER_BASE_REALTIME_COS;

hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
(void *)(long)smp_processor_id());
Index: linux-2.6-tip/kernel/time/timekeeping.c
===================================================================
--- linux-2.6-tip.orig/kernel/time/timekeeping.c
+++ linux-2.6-tip/kernel/time/timekeeping.c
@@ -1049,6 +1049,21 @@ void get_xtime_and_monotonic_and_sleep_o
}

/**
+ * get_monotonic_offset() - get wall_to_monotonic
+ */
+ktime_t get_monotonic_offset(void)
+{
+ unsigned long seq;
+ struct timespec wtom;
+
+ do {
+ seq = read_seqbegin(&xtime_lock);
+ wtom = wall_to_monotonic;
+ } while (read_seqretry(&xtime_lock, seq));
+ return timespec_to_ktime(wtom);
+}
+
+/**
* xtime_update() - advances the timekeeping infrastructure
* @ticks: number of ticks, that have elapsed since the last call.
*

2011-04-27 16:32:35

by Kay Sievers

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] clock_rtoffset: new syscall

On Wed, 2011-04-27 at 16:02 +0200, Thomas Gleixner wrote:
> On Wed, 27 Apr 2011, Alexander Shishkin wrote:
>
> > In order to keep track of system time changes, we introduce a new
> > syscall which returns the offset of CLOCK_MONOTONIC clock against
> > CLOCK_REALTIME. The caller is to store this value and use it in
> > system calls (like clock_nanosleep or timerfd_settime) that will
> > compare it against the effective offset in order to ensure that
> > the caller's notion of the system time corresponds to the effective
> > system time at the moment of the action being carried out. If it
> > has changed, these system calls will return an error and the caller
> > will have to obtain this offset again.
>
> No, we do not expose kernel internals like this to user space. The
> kernel internal representation of time is subject to change.

> The completely untested patch below should solve the same problem in a
> sane way. Restricted to timerfd, but that really should be sufficient.

> Subject: timerfd: Allow timers to be cancelled when clock was set
> From: Thomas Gleixner <[email protected]>
> Date: Wed, 27 Apr 2011 14:16:42 +0200

Seems to work fine for me here for the weird "cron timer" case, or the
simple "desktop clock" use-case -- where we want to rely on the timer to
let the process sleep as long as possible, but wake it up to recalculate
the next wakeup in case anything has changed in the meantime.

Tested-By: Kay Sievers <[email protected]>

The simple test program, that has two timerfds, and watches them is
below. Changing the system time from wakes up only the timerfd with
TFD_TIMER_CANCELON_SET set.

Thanks,
Kay


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <stdint.h>
#include <sys/poll.h>
#include <sys/timerfd.h>

#ifndef TFD_TIMER_CANCELON_SET
#define TFD_TIMER_CANCELON_SET (1 << 1)
#endif

int main(int argc, char *argv[])
{
struct itimerspec its;
struct pollfd pollfd[2];

pollfd[0].fd = timerfd_create(CLOCK_REALTIME, TFD_NONBLOCK|TFD_CLOEXEC);
if (pollfd[0].fd < 0) {
printf("create error 1: %m\n");
_exit(2);
}
pollfd[0].events = POLLIN;

pollfd[1].fd = timerfd_create(CLOCK_REALTIME, TFD_NONBLOCK|TFD_CLOEXEC);
if (pollfd[1].fd < 0) {
printf("create error 2: %m\n");
_exit(2);
}
pollfd[1].events = POLLIN;

for (;;) {
printf("\n");

memset(&its, 0, sizeof(struct itimerspec));
clock_gettime(CLOCK_REALTIME, &its.it_value);
its.it_value.tv_sec += 10;

if (timerfd_settime(pollfd[0].fd, TFD_TIMER_ABSTIME, &its, NULL) < 0) {
printf("settime error 1: %m\n");
_exit(3);
} else {
printf("settime +10s 1\n");
}

if (timerfd_settime(pollfd[1].fd, TFD_TIMER_ABSTIME|TFD_TIMER_CANCELON_SET, &its, NULL) < 0) {
printf("settime error: 2: %m\n");
_exit(3);
} else {
printf("settime +10s 2\n");
}

printf("poll\n");
if (poll(pollfd, 2, -1) < 0) {
if (errno == EAGAIN || errno == EINTR)
continue;
_exit(4);
}

if (pollfd[0].revents) {
uint64_t count;

if (read(pollfd[0].fd, &count, sizeof(uint64_t)) < 0)
printf("read error 1: %m\n");
else
printf("read 1: %u\n", (unsigned int)count);

printf("wakeup 1\n");
}

if (pollfd[1].revents) {
uint64_t count;

if (read(pollfd[1].fd, &count, sizeof(uint64_t)) < 0)
printf("read error 2: %m\n");
else
printf("read 2: %u\n", (unsigned int)count);

printf("wakeup 2\n");
}
}

return EXIT_SUCCESS;
}

2011-04-27 19:12:09

by john stultz

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] clock_rtoffset: new syscall

On Wed, 2011-04-27 at 16:02 +0200, Thomas Gleixner wrote:
> On Wed, 27 Apr 2011, Alexander Shishkin wrote:
>
> > In order to keep track of system time changes, we introduce a new
> > syscall which returns the offset of CLOCK_MONOTONIC clock against
> > CLOCK_REALTIME. The caller is to store this value and use it in
> > system calls (like clock_nanosleep or timerfd_settime) that will
> > compare it against the effective offset in order to ensure that
> > the caller's notion of the system time corresponds to the effective
> > system time at the moment of the action being carried out. If it
> > has changed, these system calls will return an error and the caller
> > will have to obtain this offset again.
>
> No, we do not expose kernel internals like this to user space. The
> kernel internal representation of time is subject to change.
>
> Also abusing the reminder argument of clock_nanosleep for handing back
> the offset is a horrible hack including the non standard -ECANCELED
> return value. No, we don't change posix interfaces that way.
>
> We can add something to timerfd, but definitly not with another
> syscall and by bloating hrtimer and sprinkling cancellation calls all
> over the place. clock_was_set() should be enough and it already calls
> into the hrtimer code, the resume path calls into it as well, so
> there is no need to introduce such a mess.
>
> The completely untested patch below should solve the same problem in a
> sane way. Restricted to timerfd, but that really should be sufficient.


Overall looks good. I flinched a little bit at adding an internal only
clockid but trying to avoid that would really make it messy.

Few minor nits below, but just my opinion, so you can ignore.

Otherwise:
Acked-by: John Stultz <[email protected]>


> Index: linux-2.6-tip/include/linux/time.h
> ===================================================================
> --- linux-2.6-tip.orig/include/linux/time.h
> +++ linux-2.6-tip/include/linux/time.h
> @@ -295,6 +295,11 @@ struct itimerval {
> #define CLOCK_MONOTONIC_COARSE 6
> #define CLOCK_BOOTTIME 7
>
> +#ifdef __KERNEL__
> +/* This clock is not exposed to user space */
> +#define CLOCK_REALTIME_COS 8
> +#endif

Would something like INTERNAL_CLOCK_REALTIME_COS be more explicit?



> @@ -1221,6 +1231,22 @@ static void __run_hrtimer(struct hrtimer
> timer->state &= ~HRTIMER_STATE_CALLBACK;
> }
>
> +static void
> +hrtimer_expire_cancelable(struct hrtimer_cpu_base *cpu_base, ktime_t now)
> +{
> + struct timerqueue_node *node;
> + struct hrtimer_clock_base *base;
> +
> + base = cpu_base->clock_base + HRTIMER_BASE_REALTIME_COS;

I know its the same thing, but for some reason the above makes me think
that the clock_base could be non-zero.

base = &cpu_base->clock_base[HRTIMER_BASE_REALTIME_COS];

seems more straight forward to me. But its not a huge deal.



2011-04-27 22:20:05

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] clock_rtoffset: new syscall

On Wed, 27 Apr 2011, john stultz wrote:
> On Wed, 2011-04-27 at 16:02 +0200, Thomas Gleixner wrote:
> > On Wed, 27 Apr 2011, Alexander Shishkin wrote:
> Overall looks good. I flinched a little bit at adding an internal only
> clockid but trying to avoid that would really make it messy.

Yes. It would. And we really don't want to mess with the POSIX
interfaces. Having this extension and the weird return code for
timerfd seems to be a reasonable limitation.

> > +#ifdef __KERNEL__
> > +/* This clock is not exposed to user space */
> > +#define CLOCK_REALTIME_COS 8
> > +#endif
>
> Would something like INTERNAL_CLOCK_REALTIME_COS be more explicit?

Yes. Good point.

> > + base = cpu_base->clock_base + HRTIMER_BASE_REALTIME_COS;
>
> I know its the same thing, but for some reason the above makes me think
> that the clock_base could be non-zero.
>
> base = &cpu_base->clock_base[HRTIMER_BASE_REALTIME_COS];
>
> seems more straight forward to me. But its not a huge deal.

I don't care either way :)

Thanks,

tglx

2011-04-28 07:15:36

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] clock_rtoffset: new syscall

On Wed, 27 Apr 2011 16:02:33 +0200 (CEST), Thomas Gleixner <[email protected]> wrote:
> On Wed, 27 Apr 2011, Alexander Shishkin wrote:
>
> > In order to keep track of system time changes, we introduce a new
> > syscall which returns the offset of CLOCK_MONOTONIC clock against
> > CLOCK_REALTIME. The caller is to store this value and use it in
> > system calls (like clock_nanosleep or timerfd_settime) that will
> > compare it against the effective offset in order to ensure that
> > the caller's notion of the system time corresponds to the effective
> > system time at the moment of the action being carried out. If it
> > has changed, these system calls will return an error and the caller
> > will have to obtain this offset again.
>
> No, we do not expose kernel internals like this to user space. The
> kernel internal representation of time is subject to change.
>
> Also abusing the reminder argument of clock_nanosleep for handing back
> the offset is a horrible hack including the non standard -ECANCELED
> return value. No, we don't change posix interfaces that way.

Hm, https://lkml.org/lkml/2011/3/10/471

> We can add something to timerfd, but definitly not with another
> syscall and by bloating hrtimer and sprinkling cancellation calls all
> over the place. clock_was_set() should be enough and it already calls
> into the hrtimer code, the resume path calls into it as well, so
> there is no need to introduce such a mess.

Agreed.

> The completely untested patch below should solve the same problem in a
> sane way. Restricted to timerfd, but that really should be sufficient.

Looks useful to me. FWIW,

Reviewed-by: Alexander Shishkin <[email protected]>

Regards,
--
Alex

>
> Thanks,
>
> tglx
>
> ------------->
>
> Subject: timerfd: Allow timers to be cancelled when clock was set
> From: Thomas Gleixner <[email protected]>
> Date: Wed, 27 Apr 2011 14:16:42 +0200
>
> <Add proper changelog here>
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> fs/timerfd.c | 57 +++++++++++++++++++++++++++++++++++++++++-----
> include/linux/hrtimer.h | 3 +-
> include/linux/time.h | 5 ++++
> include/linux/timerfd.h | 3 +-
> kernel/hrtimer.c | 41 +++++++++++++++++++++++++++------
> kernel/time/timekeeping.c | 15 ++++++++++++
> 6 files changed, 109 insertions(+), 15 deletions(-)
>
> Index: linux-2.6-tip/fs/timerfd.c
> ===================================================================
> --- linux-2.6-tip.orig/fs/timerfd.c
> +++ linux-2.6-tip/fs/timerfd.c
> @@ -26,10 +26,12 @@
> struct timerfd_ctx {
> struct hrtimer tmr;
> ktime_t tintv;
> + ktime_t moffs;
> wait_queue_head_t wqh;
> u64 ticks;
> int expired;
> int clockid;
> + bool might_cancel;
> };
>
> /*
> @@ -59,24 +61,52 @@ static ktime_t timerfd_get_remaining(str
> return remaining.tv64 < 0 ? ktime_set(0, 0): remaining;
> }
>
> -static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
> - const struct itimerspec *ktmr)
> +static bool timerfd_cancelled(struct timerfd_ctx *ctx)
> +{
> + ktime_t moffs;
> +
> + if (!ctx->might_cancel)
> + return false;
> +
> + moffs = get_monotonic_offset();
> +
> + if (moffs.tv64 == ctx->moffs.tv64)
> + return false;
> +
> + ctx->moffs = moffs;
> + return true;
> +}
> +
> +static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
> + const struct itimerspec *ktmr)
> {
> enum hrtimer_mode htmode;
> ktime_t texp;
> + int clockid = ctx->clockid;
>
> htmode = (flags & TFD_TIMER_ABSTIME) ?
> HRTIMER_MODE_ABS: HRTIMER_MODE_REL;
>
> + ctx->might_cancel = false;
> + if (htmode == HRTIMER_MODE_ABS && ctx->clockid == CLOCK_REALTIME &&
> + (flags & TFD_TIMER_CANCELON_SET)) {
> + clockid = CLOCK_REALTIME_COS;
> + ctx->might_cancel = true;
> + }
> +
> texp = timespec_to_ktime(ktmr->it_value);
> ctx->expired = 0;
> ctx->ticks = 0;
> ctx->tintv = timespec_to_ktime(ktmr->it_interval);
> - hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
> + hrtimer_init(&ctx->tmr, clockid, htmode);
> hrtimer_set_expires(&ctx->tmr, texp);
> ctx->tmr.function = timerfd_tmrproc;
> - if (texp.tv64 != 0)
> + if (texp.tv64 != 0) {
> hrtimer_start(&ctx->tmr, texp, htmode);
> + if (timerfd_cancelled(ctx))
> + return -ECANCELED;
> + }
> + return 0;
> }
>
> static int timerfd_release(struct inode *inode, struct file *file)
> @@ -118,8 +148,21 @@ static ssize_t timerfd_read(struct file
> res = -EAGAIN;
> else
> res = wait_event_interruptible_locked_irq(ctx->wqh, ctx->ticks);
> +
> if (ctx->ticks) {
> ticks = ctx->ticks;
> +
> + /*
> + * If clock has changed, we do not care about the
> + * ticks and we do not rearm the timer. Userspace must
> + * reevaluate anyway.
> + */
> + if (timerfd_cancelled(ctx)) {
> + ticks = 0;
> + ctx->expired = 0;
> + res = -ECANCELED;
> + }
> +
> if (ctx->expired && ctx->tintv.tv64) {
> /*
> * If tintv.tv64 != 0, this is a periodic timer that
> @@ -183,6 +226,7 @@ SYSCALL_DEFINE2(timerfd_create, int, clo
> init_waitqueue_head(&ctx->wqh);
> ctx->clockid = clockid;
> hrtimer_init(&ctx->tmr, clockid, HRTIMER_MODE_ABS);
> + ctx->moffs = get_monotonic_offset();
>
> ufd = anon_inode_getfd("[timerfd]", &timerfd_fops, ctx,
> O_RDWR | (flags & TFD_SHARED_FCNTL_FLAGS));
> @@ -199,6 +243,7 @@ SYSCALL_DEFINE4(timerfd_settime, int, uf
> struct file *file;
> struct timerfd_ctx *ctx;
> struct itimerspec ktmr, kotmr;
> + int ret;
>
> if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
> return -EFAULT;
> @@ -240,14 +285,14 @@ SYSCALL_DEFINE4(timerfd_settime, int, uf
> /*
> * Re-program the timer to the new value ...
> */
> - timerfd_setup(ctx, flags, &ktmr);
> + ret = timerfd_setup(ctx, flags, &ktmr);
>
> spin_unlock_irq(&ctx->wqh.lock);
> fput(file);
> if (otmr && copy_to_user(otmr, &kotmr, sizeof(kotmr)))
> return -EFAULT;
>
> - return 0;
> + return ret;
> }
>
> SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct itimerspec __user *, otmr)
> Index: linux-2.6-tip/include/linux/hrtimer.h
> ===================================================================
> --- linux-2.6-tip.orig/include/linux/hrtimer.h
> +++ linux-2.6-tip/include/linux/hrtimer.h
> @@ -157,6 +157,7 @@ enum hrtimer_base_type {
> HRTIMER_BASE_REALTIME,
> HRTIMER_BASE_MONOTONIC,
> HRTIMER_BASE_BOOTTIME,
> + HRTIMER_BASE_REALTIME_COS,
> HRTIMER_MAX_CLOCK_BASES,
> };
>
> @@ -319,7 +320,7 @@ static inline int hrtimer_is_hres_active
> extern ktime_t ktime_get(void);
> extern ktime_t ktime_get_real(void);
> extern ktime_t ktime_get_boottime(void);
> -
> +ktime_t get_monotonic_offset(void);
>
> DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
>
> Index: linux-2.6-tip/include/linux/time.h
> ===================================================================
> --- linux-2.6-tip.orig/include/linux/time.h
> +++ linux-2.6-tip/include/linux/time.h
> @@ -295,6 +295,11 @@ struct itimerval {
> #define CLOCK_MONOTONIC_COARSE 6
> #define CLOCK_BOOTTIME 7
>
> +#ifdef __KERNEL__
> +/* This clock is not exposed to user space */
> +#define CLOCK_REALTIME_COS 8
> +#endif
> +
> /*
> * The IDs of various hardware clocks:
> */
> Index: linux-2.6-tip/include/linux/timerfd.h
> ===================================================================
> --- linux-2.6-tip.orig/include/linux/timerfd.h
> +++ linux-2.6-tip/include/linux/timerfd.h
> @@ -19,6 +19,7 @@
> * shared O_* flags.
> */
> #define TFD_TIMER_ABSTIME (1 << 0)
> +#define TFD_TIMER_CANCELON_SET (1 << 1)
> #define TFD_CLOEXEC O_CLOEXEC
> #define TFD_NONBLOCK O_NONBLOCK
>
> @@ -26,6 +27,6 @@
> /* Flags for timerfd_create. */
> #define TFD_CREATE_FLAGS TFD_SHARED_FCNTL_FLAGS
> /* Flags for timerfd_settime. */
> -#define TFD_SETTIME_FLAGS TFD_TIMER_ABSTIME
> +#define TFD_SETTIME_FLAGS (TFD_TIMER_ABSTIME | TFD_TIMER_CANCELON_SET)
>
> #endif /* _LINUX_TIMERFD_H */
> Index: linux-2.6-tip/kernel/hrtimer.c
> ===================================================================
> --- linux-2.6-tip.orig/kernel/hrtimer.c
> +++ linux-2.6-tip/kernel/hrtimer.c
> @@ -78,6 +78,11 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base,
> .get_time = &ktime_get_boottime,
> .resolution = KTIME_LOW_RES,
> },
> + {
> + .index = CLOCK_REALTIME_COS,
> + .get_time = &ktime_get_real,
> + .resolution = KTIME_LOW_RES,
> + },
> }
> };
>
> @@ -106,6 +111,7 @@ static void hrtimer_get_softirq_time(str
> base->clock_base[HRTIMER_BASE_REALTIME].softirq_time = xtim;
> base->clock_base[HRTIMER_BASE_MONOTONIC].softirq_time = mono;
> base->clock_base[HRTIMER_BASE_BOOTTIME].softirq_time = boot;
> + base->clock_base[HRTIMER_BASE_REALTIME_COS].softirq_time = xtim;
> }
>
> /*
> @@ -617,6 +623,8 @@ static int hrtimer_reprogram(struct hrti
> return res;
> }
>
> +static void
> +hrtimer_expire_cancelable(struct hrtimer_cpu_base *cpu_base, ktime_t now);
>
> /*
> * Retrigger next event is called after clock was set
> @@ -626,13 +634,12 @@ static int hrtimer_reprogram(struct hrti
> static void retrigger_next_event(void *arg)
> {
> struct hrtimer_cpu_base *base;
> - struct timespec realtime_offset, wtm, sleep;
> + struct timespec realtime_offset, xtim, wtm, sleep;
>
> if (!hrtimer_hres_active())
> return;
>
> - get_xtime_and_monotonic_and_sleep_offset(&realtime_offset, &wtm,
> - &sleep);
> + get_xtime_and_monotonic_and_sleep_offset(&xtim, &wtm, &sleep);
> set_normalized_timespec(&realtime_offset, -wtm.tv_sec, -wtm.tv_nsec);
>
> base = &__get_cpu_var(hrtimer_bases);
> @@ -643,6 +650,10 @@ static void retrigger_next_event(void *a
> timespec_to_ktime(realtime_offset);
> base->clock_base[HRTIMER_BASE_BOOTTIME].offset =
> timespec_to_ktime(sleep);
> + base->clock_base[HRTIMER_BASE_REALTIME_COS].offset =
> + timespec_to_ktime(realtime_offset);
> +
> + hrtimer_expire_cancelable(base, timespec_to_ktime(xtim));
>
> hrtimer_force_reprogram(base, 0);
> raw_spin_unlock(&base->lock);
> @@ -715,7 +726,7 @@ static inline int hrtimer_enqueue_reprog
> */
> static int hrtimer_switch_to_hres(void)
> {
> - int cpu = smp_processor_id();
> + int i, cpu = smp_processor_id();
> struct hrtimer_cpu_base *base = &per_cpu(hrtimer_bases, cpu);
> unsigned long flags;
>
> @@ -731,9 +742,8 @@ static int hrtimer_switch_to_hres(void)
> return 0;
> }
> base->hres_active = 1;
> - base->clock_base[HRTIMER_BASE_REALTIME].resolution = KTIME_HIGH_RES;
> - base->clock_base[HRTIMER_BASE_MONOTONIC].resolution = KTIME_HIGH_RES;
> - base->clock_base[HRTIMER_BASE_BOOTTIME].resolution = KTIME_HIGH_RES;
> + for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
> + base->clock_base[i].resolution = KTIME_HIGH_RES;
>
> tick_setup_sched_timer();
>
> @@ -1221,6 +1231,22 @@ static void __run_hrtimer(struct hrtimer
> timer->state &= ~HRTIMER_STATE_CALLBACK;
> }
>
> +static void
> +hrtimer_expire_cancelable(struct hrtimer_cpu_base *cpu_base, ktime_t now)
> +{
> + struct timerqueue_node *node;
> + struct hrtimer_clock_base *base;
> +
> + base = cpu_base->clock_base + HRTIMER_BASE_REALTIME_COS;
> +
> + while ((node = timerqueue_getnext(&base->active))) {
> + struct hrtimer *timer;
> +
> + timer = container_of(node, struct hrtimer, node);
> + __run_hrtimer(timer, &now);
> + }
> +}
> +
> #ifdef CONFIG_HIGH_RES_TIMERS
>
> /*
> @@ -1725,6 +1751,7 @@ void __init hrtimers_init(void)
> hrtimer_clock_to_base_table[CLOCK_REALTIME] = HRTIMER_BASE_REALTIME;
> hrtimer_clock_to_base_table[CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC;
> hrtimer_clock_to_base_table[CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME;
> + hrtimer_clock_to_base_table[CLOCK_REALTIME_COS] = HRTIMER_BASE_REALTIME_COS;
>
> hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
> (void *)(long)smp_processor_id());
> Index: linux-2.6-tip/kernel/time/timekeeping.c
> ===================================================================
> --- linux-2.6-tip.orig/kernel/time/timekeeping.c
> +++ linux-2.6-tip/kernel/time/timekeeping.c
> @@ -1049,6 +1049,21 @@ void get_xtime_and_monotonic_and_sleep_o
> }
>
> /**
> + * get_monotonic_offset() - get wall_to_monotonic
> + */
> +ktime_t get_monotonic_offset(void)
> +{
> + unsigned long seq;
> + struct timespec wtom;
> +
> + do {
> + seq = read_seqbegin(&xtime_lock);
> + wtom = wall_to_monotonic;
> + } while (read_seqretry(&xtime_lock, seq));
> + return timespec_to_ktime(wtom);
> +}
> +
> +/**
> * xtime_update() - advances the timekeeping infrastructure
> * @ticks: number of ticks, that have elapsed since the last call.
> *

2011-04-29 17:32:40

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] clock_rtoffset: new syscall

Alexander,

On Wed, 27 Apr 2011, Kay Sievers wrote:
> On Wed, 2011-04-27 at 16:02 +0200, Thomas Gleixner wrote:
> > On Wed, 27 Apr 2011, Alexander Shishkin wrote:
> > The completely untested patch below should solve the same problem in a
> > sane way. Restricted to timerfd, but that really should be sufficient.
>
> > Subject: timerfd: Allow timers to be cancelled when clock was set
> > From: Thomas Gleixner <[email protected]>
> > Date: Wed, 27 Apr 2011 14:16:42 +0200
>
> Seems to work fine for me here for the weird "cron timer" case, or the
> simple "desktop clock" use-case -- where we want to rely on the timer to
> let the process sleep as long as possible, but wake it up to recalculate
> the next wakeup in case anything has changed in the meantime.

is that sufficient to solve your problems as well?

Thanks,

tglx

2011-05-02 08:11:06

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/4] clock_rtoffset: new syscall

On Fri, 29 Apr 2011 19:32:13 +0200 (CEST), Thomas Gleixner <[email protected]> wrote:
> Alexander,
>
> On Wed, 27 Apr 2011, Kay Sievers wrote:
> > On Wed, 2011-04-27 at 16:02 +0200, Thomas Gleixner wrote:
> > > On Wed, 27 Apr 2011, Alexander Shishkin wrote:
> > > The completely untested patch below should solve the same problem in a
> > > sane way. Restricted to timerfd, but that really should be sufficient.
> >
> > > Subject: timerfd: Allow timers to be cancelled when clock was set
> > > From: Thomas Gleixner <[email protected]>
> > > Date: Wed, 27 Apr 2011 14:16:42 +0200
> >
> > Seems to work fine for me here for the weird "cron timer" case, or the
> > simple "desktop clock" use-case -- where we want to rely on the timer to
> > let the process sleep as long as possible, but wake it up to recalculate
> > the next wakeup in case anything has changed in the meantime.
>
> is that sufficient to solve your problems as well?

Yes, it looks to be.

Thanks,
--
Alex