2010-11-23 17:26:06

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

Certain userspace applications (like "clock" desktop applets or cron or
systemd) might want to be notified when some other application changes
the system time. There are several known to me reasons for this:
- avoiding periodic wakeups to poll time changes;
- rearming CLOCK_REALTIME timers when said changes happen;
- changing system timekeeping policy for system-wide time management
programs;
- keeping guest applications/operating systems running in emulators
up to date.

This is another attempt to approach notifying userspace about system
clock changes. The other one is using an eventfd and a syscall [1]. In
the course of discussing the necessity of a syscall for this kind of
notifications, it was suggested that this functionality can be achieved
via timers [2] (and timerfd in particular [3]). This idea got quite
some support [4], [5], [6] and some vague criticism [7], so I decided
to try and go a bit further with it.

[1] http://marc.info/?l=linux-kernel&m=128950389423614&w=2
[2] http://marc.info/?l=linux-kernel&m=128951020831573&w=2
[3] http://marc.info/?l=linux-kernel&m=128951588006157&w=2
[4] http://marc.info/?l=linux-kernel&m=128951503205111&w=2
[5] http://marc.info/?l=linux-kernel&m=128955890118477&w=2
[6] http://marc.info/?l=linux-kernel&m=129002967031104&w=2
[7] http://marc.info/?l=linux-kernel&m=129002672227263&w=2

Signed-off-by: Alexander Shishkin <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Alexander Viro <[email protected]>
CC: Greg Kroah-Hartman <[email protected]>
CC: Feng Tang <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Michael Tokarev <[email protected]>
CC: Marcelo Tosatti <[email protected]>
CC: John Stultz <[email protected]>
CC: Chris Friesen <[email protected]>
CC: Kay Sievers <[email protected]>
CC: Kirill A. Shutemov <[email protected]>
CC: Artem Bityutskiy <[email protected]>
CC: Davide Libenzi <[email protected]>
CC: [email protected]
CC: [email protected]
---
fs/timerfd.c | 51 +++++++++++++++++++++++++++++++++++++++++++++-
include/linux/hrtimer.h | 6 +++++
include/linux/timerfd.h | 3 +-
kernel/hrtimer.c | 3 ++
4 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index 8c4fc14..7890815 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -30,8 +30,13 @@ struct timerfd_ctx {
u64 ticks;
int expired;
int clockid;
+ struct list_head notifiers_list;
};

+/* TFD_NOTIFY_CLOCK_SET timers go here */
+static DEFINE_SPINLOCK(notifiers_lock);
+static LIST_HEAD(notifiers_list);
+
/*
* This gets called when the timer event triggers. We set the "expired"
* flag, but we do not re-arm the timer (in case it's necessary,
@@ -51,6 +56,21 @@ static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
return HRTIMER_NORESTART;
}

+void timerfd_clock_was_set(void)
+{
+ struct timerfd_ctx *ctx;
+ unsigned long flags;
+
+ spin_lock(&notifiers_lock);
+ list_for_each_entry(ctx, &notifiers_list, notifiers_list) {
+ spin_lock_irqsave(&ctx->wqh.lock, flags);
+ ctx->ticks++;
+ wake_up_locked(&ctx->wqh);
+ spin_unlock_irqrestore(&ctx->wqh.lock, flags);
+ }
+ spin_unlock(&notifiers_lock);
+}
+
static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx)
{
ktime_t remaining;
@@ -72,6 +92,12 @@ static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
ctx->expired = 0;
ctx->ticks = 0;
ctx->tintv = timespec_to_ktime(ktmr->it_interval);
+
+ if (flags & TFD_NOTIFY_CLOCK_SET) {
+ list_add(&ctx->notifiers_list, &notifiers_list);
+ return;
+ }
+
hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
hrtimer_set_expires(&ctx->tmr, texp);
ctx->tmr.function = timerfd_tmrproc;
@@ -83,7 +109,12 @@ static int timerfd_release(struct inode *inode, struct file *file)
{
struct timerfd_ctx *ctx = file->private_data;

- hrtimer_cancel(&ctx->tmr);
+ if (!list_empty(&ctx->notifiers_list)) {
+ spin_lock(&notifiers_lock);
+ list_del(&ctx->notifiers_list);
+ spin_unlock(&notifiers_lock);
+ } else
+ hrtimer_cancel(&ctx->tmr);
kfree(ctx);
return 0;
}
@@ -113,6 +144,7 @@ static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,

if (count < sizeof(ticks))
return -EINVAL;
+
spin_lock_irq(&ctx->wqh.lock);
if (file->f_flags & O_NONBLOCK)
res = -EAGAIN;
@@ -120,7 +152,8 @@ static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
res = wait_event_interruptible_locked_irq(ctx->wqh, ctx->ticks);
if (ctx->ticks) {
ticks = ctx->ticks;
- if (ctx->expired && ctx->tintv.tv64) {
+ if (ctx->expired && ctx->tintv.tv64 &&
+ list_empty(&ctx->notifiers_list)) {
/*
* If tintv.tv64 != 0, this is a periodic timer that
* needs to be re-armed. We avoid doing it in the timer
@@ -218,13 +251,17 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
* it to the new values.
*/
for (;;) {
+ spin_lock(&notifiers_lock);
spin_lock_irq(&ctx->wqh.lock);
if (hrtimer_try_to_cancel(&ctx->tmr) >= 0)
break;
spin_unlock_irq(&ctx->wqh.lock);
+ spin_unlock(&notifiers_lock);
cpu_relax();
}

+ INIT_LIST_HEAD(&ctx->notifiers_list);
+
/*
* If the timer is expired and it's periodic, we need to advance it
* because the caller may want to know the previous expiration time.
@@ -243,6 +280,7 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,
timerfd_setup(ctx, flags, &ktmr);

spin_unlock_irq(&ctx->wqh.lock);
+ spin_unlock(&notifiers_lock);
fput(file);
if (otmr && copy_to_user(otmr, &kotmr, sizeof(kotmr)))
return -EFAULT;
@@ -262,6 +300,14 @@ SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct itimerspec __user *, otmr)
ctx = file->private_data;

spin_lock_irq(&ctx->wqh.lock);
+ if (!list_empty(&ctx->notifiers_list)) {
+ kotmr.it_value = current_kernel_time();
+ kotmr.it_interval.tv_sec = 0;
+ kotmr.it_interval.tv_nsec = 0;
+ spin_unlock_irq(&ctx->wqh.lock);
+ goto out;
+ }
+
if (ctx->expired && ctx->tintv.tv64) {
ctx->expired = 0;
ctx->ticks +=
@@ -273,6 +319,7 @@ SYSCALL_DEFINE2(timerfd_gettime, int, ufd, struct itimerspec __user *, otmr)
spin_unlock_irq(&ctx->wqh.lock);
fput(file);

+out:
return copy_to_user(otmr, &kotmr, sizeof(kotmr)) ? -EFAULT: 0;
}

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index fd0c1b8..eb9d331 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -247,6 +247,12 @@ static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
return ktime_sub(timer->_expires, timer->base->get_time());
}

+#ifdef CONFIG_TIMERFD
+extern void timerfd_clock_was_set(void);
+#else
+static inline void timerfd_clock_was_set(void) {}
+#endif
+
#ifdef CONFIG_HIGH_RES_TIMERS
struct clock_event_device;

diff --git a/include/linux/timerfd.h b/include/linux/timerfd.h
index 2d07929..c3ddad9 100644
--- a/include/linux/timerfd.h
+++ b/include/linux/timerfd.h
@@ -19,6 +19,7 @@
* shared O_* flags.
*/
#define TFD_TIMER_ABSTIME (1 << 0)
+#define TFD_NOTIFY_CLOCK_SET (1 << 1)
#define TFD_CLOEXEC O_CLOEXEC
#define TFD_NONBLOCK O_NONBLOCK

@@ -26,6 +27,6 @@
/* Flags for timerfd_create. */
#define TFD_CREATE_FLAGS TFD_SHARED_FCNTL_FLAGS
/* Flags for timerfd_settime. */
-#define TFD_SETTIME_FLAGS TFD_TIMER_ABSTIME
+#define TFD_SETTIME_FLAGS (TFD_TIMER_ABSTIME | TFD_NOTIFY_CLOCK_SET)

#endif /* _LINUX_TIMERFD_H */
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 72206cf..6f6403a 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -646,6 +646,9 @@ void clock_was_set(void)
{
/* Retrigger the CPU local events everywhere */
on_each_cpu(retrigger_next_event, NULL, 1);
+
+ /* Trigger timerfd notifiers */
+ timerfd_clock_was_set();
}

/*
--
1.7.2.1.45.gb66c2


2010-11-23 22:44:08

by Lennart Poettering

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Tue, 23.11.10 19:22, Alexander Shishkin ([email protected]) wrote:

> Certain userspace applications (like "clock" desktop applets or cron or
> systemd) might want to be notified when some other application changes
> the system time. There are several known to me reasons for this:
> - avoiding periodic wakeups to poll time changes;
> - rearming CLOCK_REALTIME timers when said changes happen;
> - changing system timekeeping policy for system-wide time management
> programs;
> - keeping guest applications/operating systems running in emulators
> up to date.
>
> This is another attempt to approach notifying userspace about system
> clock changes. The other one is using an eventfd and a syscall [1]. In
> the course of discussing the necessity of a syscall for this kind of
> notifications, it was suggested that this functionality can be achieved
> via timers [2] (and timerfd in particular [3]). This idea got quite
> some support [4], [5], [6] and some vague criticism [7], so I decided
> to try and go a bit further with it.

I agree with Kay, this is pretty much exactly what we want for
systemd. (Assuming that the time jump due to system suspend is
propagated to userspace like any other time jump with this path).

So yeah, I'd be very happy if this could be merged.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-24 08:06:09

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Tue, 2010-11-23 at 23:43 +0100, Lennart Poettering wrote:
> On Tue, 23.11.10 19:22, Alexander Shishkin ([email protected]) wrote:
>
> > Certain userspace applications (like "clock" desktop applets or cron or
> > systemd) might want to be notified when some other application changes
> > the system time. There are several known to me reasons for this:
> > - avoiding periodic wakeups to poll time changes;
> > - rearming CLOCK_REALTIME timers when said changes happen;
> > - changing system timekeeping policy for system-wide time management
> > programs;
> > - keeping guest applications/operating systems running in emulators
> > up to date.
> >
> > This is another attempt to approach notifying userspace about system
> > clock changes. The other one is using an eventfd and a syscall [1]. In
> > the course of discussing the necessity of a syscall for this kind of
> > notifications, it was suggested that this functionality can be achieved
> > via timers [2] (and timerfd in particular [3]). This idea got quite
> > some support [4], [5], [6] and some vague criticism [7], so I decided
> > to try and go a bit further with it.
>
> I agree with Kay, this is pretty much exactly what we want for
> systemd. (Assuming that the time jump due to system suspend is
> propagated to userspace like any other time jump with this path).

A "Tested-by: Lennart Poettering <[email protected]>" would be
helpful, I think.

> So yeah, I'd be very happy if this could be merged.

Hmm, and question about why exactly the timerfd interface is a bad way
to go was ignored.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2010-12-01 00:34:04

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Tue, Nov 23, 2010 at 11:43:46PM +0100, Lennart Poettering wrote:
> On Tue, 23.11.10 19:22, Alexander Shishkin ([email protected]) wrote:
>
> > Certain userspace applications (like "clock" desktop applets or cron or
> > systemd) might want to be notified when some other application changes
> > the system time. There are several known to me reasons for this:
> > - avoiding periodic wakeups to poll time changes;
> > - rearming CLOCK_REALTIME timers when said changes happen;
> > - changing system timekeeping policy for system-wide time management
> > programs;
> > - keeping guest applications/operating systems running in emulators
> > up to date.
> >
> > This is another attempt to approach notifying userspace about system
> > clock changes. The other one is using an eventfd and a syscall [1]. In
> > the course of discussing the necessity of a syscall for this kind of
> > notifications, it was suggested that this functionality can be achieved
> > via timers [2] (and timerfd in particular [3]). This idea got quite
> > some support [4], [5], [6] and some vague criticism [7], so I decided
> > to try and go a bit further with it.
>
> I agree with Kay, this is pretty much exactly what we want for
> systemd. (Assuming that the time jump due to system suspend is
> propagated to userspace like any other time jump with this path).

Good point, I'll add the notification to the resume point.

> So yeah, I'd be very happy if this could be merged.

Good to hear.

Regards,
--
Alex

2010-12-01 10:44:33

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

Lennart Poettering wrote:
> On Tue, 23.11.10 19:22, Alexander Shishkin ([email protected]) wrote:
>
> > Certain userspace applications (like "clock" desktop applets or cron or
> > systemd) might want to be notified when some other application changes
> > the system time. There are several known to me reasons for this:
> > - avoiding periodic wakeups to poll time changes;
> > - rearming CLOCK_REALTIME timers when said changes happen;
> > - changing system timekeeping policy for system-wide time management
> > programs;
> > - keeping guest applications/operating systems running in emulators
> > up to date.
> >
> > This is another attempt to approach notifying userspace about system
> > clock changes. The other one is using an eventfd and a syscall [1]. In
> > the course of discussing the necessity of a syscall for this kind of
> > notifications, it was suggested that this functionality can be achieved
> > via timers [2] (and timerfd in particular [3]). This idea got quite
> > some support [4], [5], [6] and some vague criticism [7], so I decided
> > to try and go a bit further with it.
>
> I agree with Kay, this is pretty much exactly what we want for
> systemd. (Assuming that the time jump due to system suspend is
> propagated to userspace like any other time jump with this path).

I hope the time jump due to suspend is *not* propagated in the same
way to userspace :-)

What I'd like to see:

1. Time jump due to the system clock being stepped: Notification.

This is *not* a change in real time. It means the clock was
corrected/changed. No physical time passed.

2. Time jump due to suspend/resume: Different notification.

This *is* a change in real time. Physical time passed.

3. Time drift corrections: As now, no notification, it's just
the clock being regulated.

To signal the difference between 1 and 2, there ought to be some way
for userspace to determine how much of the clock delta corresponds
with physical time, by reading some sort of "monotonic" clock :-)

CLOCK_MONOTONIC is unsuitable because it stops at suspend. Maybe it
should stay that way. But maybe not - programs using CLOCK_MONOTONIC
usually want to trigger timeouts etc. based on real elapsed time, and
after suspend/resume, it's quite reasonable to want to trigger all of
a program's short timeouts immediately. Indeed some network protocol
userspace may currently behave *incorrectly* over suspend/resume,
especially those using clock times to validate their caches,
*because* CLOCK_MONOTONIC doesn't count it.

So maybe CLOCK_MONOTONIC should be changed to include elapsed time
during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
for programs that want that?

That, plus this proposed patch, would signal the difference between 1
and 2 above nicely.

-- Jamie

2010-12-01 11:00:25

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wed, Dec 01, 2010 at 10:43:59AM +0000, Jamie Lokier wrote:
> Lennart Poettering wrote:
> > On Tue, 23.11.10 19:22, Alexander Shishkin ([email protected]) wrote:
> >
> > > Certain userspace applications (like "clock" desktop applets or cron or
> > > systemd) might want to be notified when some other application changes
> > > the system time. There are several known to me reasons for this:
> > > - avoiding periodic wakeups to poll time changes;
> > > - rearming CLOCK_REALTIME timers when said changes happen;
> > > - changing system timekeeping policy for system-wide time management
> > > programs;
> > > - keeping guest applications/operating systems running in emulators
> > > up to date.
> > >
> > > This is another attempt to approach notifying userspace about system
> > > clock changes. The other one is using an eventfd and a syscall [1]. In
> > > the course of discussing the necessity of a syscall for this kind of
> > > notifications, it was suggested that this functionality can be achieved
> > > via timers [2] (and timerfd in particular [3]). This idea got quite
> > > some support [4], [5], [6] and some vague criticism [7], so I decided
> > > to try and go a bit further with it.
> >
> > I agree with Kay, this is pretty much exactly what we want for
> > systemd. (Assuming that the time jump due to system suspend is
> > propagated to userspace like any other time jump with this path).
>
> I hope the time jump due to suspend is *not* propagated in the same
> way to userspace :-)

Can you please check if my new patch from earlier today [1] seems better?

[1] http://marc.info/?l=linux-kernel&m=129116762414291&w=2

> What I'd like to see:
>
> 1. Time jump due to the system clock being stepped: Notification.
>
> This is *not* a change in real time. It means the clock was
> corrected/changed. No physical time passed.
>
> 2. Time jump due to suspend/resume: Different notification.
>
> This *is* a change in real time. Physical time passed.
>
> 3. Time drift corrections: As now, no notification, it's just
> the clock being regulated.

Generally, I used CLOCK_MONOTONIC + TFD_NOTIFY_CLOCK_SET to signal
suspend/resume time changes and CLOCK_REALTIME + TFD_NOTIFY_CLOCK_SET
to signal wall clock changes. Without TFD_NOTIFY_CLOCK_SET it works
as it did before.

> To signal the difference between 1 and 2, there ought to be some way
> for userspace to determine how much of the clock delta corresponds
> with physical time, by reading some sort of "monotonic" clock :-)
>
> CLOCK_MONOTONIC is unsuitable because it stops at suspend. Maybe it
> should stay that way. But maybe not - programs using CLOCK_MONOTONIC
> usually want to trigger timeouts etc. based on real elapsed time, and
> after suspend/resume, it's quite reasonable to want to trigger all of
> a program's short timeouts immediately. Indeed some network protocol
> userspace may currently behave *incorrectly* over suspend/resume,
> especially those using clock times to validate their caches,
> *because* CLOCK_MONOTONIC doesn't count it.
>
> So maybe CLOCK_MONOTONIC should be changed to include elapsed time
> during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
> for programs that want that?
>
> That, plus this proposed patch, would signal the difference between 1
> and 2 above nicely.

Regards,
--
Alex

2010-12-01 22:47:06

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wed, 01 Dec 2010 10:43:59 GMT, Jamie Lokier said:

> So maybe CLOCK_MONOTONIC should be changed to include elapsed time
> during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
> for programs that want that?

Wouldn't that be an API break for programs that are expecting the current
behavior of CLOCK_MONOTONIC? Yes, there should be a way to request either of
them - but if there's only one way now, it should continue to act the current
way, and the added way is the second option.


Attachments:
(No filename) (227.00 B)

2010-12-01 23:46:57

by john stultz

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wed, 2010-12-01 at 17:46 -0500, [email protected] wrote:
> On Wed, 01 Dec 2010 10:43:59 GMT, Jamie Lokier said:
>
> > So maybe CLOCK_MONOTONIC should be changed to include elapsed time
> > during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
> > for programs that want that?
>
> Wouldn't that be an API break for programs that are expecting the current
> behavior of CLOCK_MONOTONIC? Yes, there should be a way to request either of
> them - but if there's only one way now, it should continue to act the current
> way, and the added way is the second option.

We do keep track of the amount of time in suspend (total_sleep_time), so
creating a new clockid to provide CLOCK_MONOTONIC + total_sleep_time
wouldn't be hard. We just haven't had a clear articulation of why it
would be useful to expose to userland (nor a clear name to describe
exactly what it represents).

thanks
-john

2010-12-02 00:11:12

by john stultz

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wed, 2010-12-01 at 10:43 +0000, Jamie Lokier wrote:
> Lennart Poettering wrote:
> > On Tue, 23.11.10 19:22, Alexander Shishkin ([email protected]) wrote:
> >
> > > Certain userspace applications (like "clock" desktop applets or cron or
> > > systemd) might want to be notified when some other application changes
> > > the system time. There are several known to me reasons for this:
> > > - avoiding periodic wakeups to poll time changes;
> > > - rearming CLOCK_REALTIME timers when said changes happen;
> > > - changing system timekeeping policy for system-wide time management
> > > programs;
> > > - keeping guest applications/operating systems running in emulators
> > > up to date.
> > >
> > > This is another attempt to approach notifying userspace about system
> > > clock changes. The other one is using an eventfd and a syscall [1]. In
> > > the course of discussing the necessity of a syscall for this kind of
> > > notifications, it was suggested that this functionality can be achieved
> > > via timers [2] (and timerfd in particular [3]). This idea got quite
> > > some support [4], [5], [6] and some vague criticism [7], so I decided
> > > to try and go a bit further with it.
> >
> > I agree with Kay, this is pretty much exactly what we want for
> > systemd. (Assuming that the time jump due to system suspend is
> > propagated to userspace like any other time jump with this path).
>
> I hope the time jump due to suspend is *not* propagated in the same
> way to userspace :-)

Sadly this behavior depends on architecture and rtc configuration.

For x86 and a number of other architectures, read_persisitent_clock()
functions and we inject the time in suspend into CLOCK_REALTIME on
resume. No notification would be seen.

For architectures where read_persistent_clock does not function (usually
due to RTC not being accessible with irqs are off), we rely on the RTC
code to set the time when it resumes and irqs are enabled. This happens
via do_settimeofday, so a notification would be seen.

A hook could be added so the non-read_persistent_clock supporting arches
can inject time into CLOCK_REALTIME without going through settimeofday()
and triggering the notification. But there may still be odd races around
other stuff running and getting the wrong time before the suspend time
is injected.

This ignores any userland resume scripts that may do something like call
ntpdate or whatever, which would call settimefoday().


> What I'd like to see:
>
> 1. Time jump due to the system clock being stepped: Notification.
>
> This is *not* a change in real time. It means the clock was
> corrected/changed. No physical time passed.

Right. That's settimeofday()/clock_settime().


> 2. Time jump due to suspend/resume: Different notification.
>
> This *is* a change in real time. Physical time passed.

This is the case for read_persistent_clock() supported architectures.

Why do you want a notification here? Or is the resume hook enough?


> 3. Time drift corrections: As now, no notification, it's just
> the clock being regulated.

Yep. adjtimex() handles this.


> To signal the difference between 1 and 2, there ought to be some way
> for userspace to determine how much of the clock delta corresponds
> with physical time, by reading some sort of "monotonic" clock :-)


Could you further expand on the needs for distinguishing between the
two?


> CLOCK_MONOTONIC is unsuitable because it stops at suspend. Maybe it
> should stay that way. But maybe not - programs using CLOCK_MONOTONIC
> usually want to trigger timeouts etc. based on real elapsed time, and
> after suspend/resume, it's quite reasonable to want to trigger all of
> a program's short timeouts immediately. Indeed some network protocol
> userspace may currently behave *incorrectly* over suspend/resume,
> especially those using clock times to validate their caches,
> *because* CLOCK_MONOTONIC doesn't count it.

Is there a specific example of this occurring that you have in mind?


> So maybe CLOCK_MONOTONIC should be changed to include elapsed time
> during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
> for programs that want that?

No. Lets not change it. CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW's
relationship is tightly coupled, and applications that are tracking the
amount of clock adjustment being done to the system require they keep
their semantics.

As I said earlier, adding a new clockid to represent the MONOTONIC
+SUSPEND time wouldn't be difficult, we just need to be clear about why
it should be exposed, and have it also be easy to describe to developers
which clockid would suit their needs best.

thanks
-john

2010-12-02 01:13:09

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

john stultz wrote:
> > CLOCK_MONOTONIC is unsuitable because it stops at suspend. Maybe it
> > should stay that way. But maybe not - programs using CLOCK_MONOTONIC
> > usually want to trigger timeouts etc. based on real elapsed time, and
> > after suspend/resume, it's quite reasonable to want to trigger all of
> > a program's short timeouts immediately. Indeed some network protocol
> > userspace may currently behave *incorrectly* over suspend/resume,
> > especially those using clock times to validate their caches,
> > *because* CLOCK_MONOTONIC doesn't count it.
>
> Is there a specific example of this occurring that you have in mind?

Yes, it's a correctness issue in network protocols using
lease/oplock/MESI-style cache coherency. (E.g. NFSv4, CIFS, whatever
you like in userspace.)

By this, I mean anything with this sort of pattern:

1. Receive message "you may cache thing X for up to 20 seconds *without
checking if it changed* during that time; afterwards, check".

(If the other end need to change X within the 20 second
interval, the other end will send a request to break the lease;
if the other end doesn't get a response, then it waits until the
20 second expires, and then it's safe to assume the lease expired.)

2. Local request for value of X.

=> If less than 20 seconds has passed, the local cache responds
with X *without any network confirmation*. I.e. it's instant.
=> If more than 20 seconds has passed, it has to talk to the
other end. I.e. a network round trip.

The algorithm is coherent even if the network is unreliable and goes
down sometimes. When that happens, local requests are stalled, rather
than returning values incoherent with other machines.

This algorithm breaks if the local application depends on
CLOCK_MONOTONIC to confirm that less than 20 seconds has passed
and CLOCK_MONOTONIC is lying.

CLOCK_MONOTONIC lies when you've done suspend+resume while this
program was running, so it's 20 seconds test gives the wrong result.

You can imagine there are quite a few applications that use this
technique because it's quite fundamental to efficient coherency
protocols. (Although I'm unable to name any off the top of my head!).

There are generalisations for more interesting distributed systems.
The thing they all have in common is the ability to locally
query "has time T elapsed, in terms that would be recognised as T by
the remote machines". In reality clocks have tolerances etc. so you
fudge by some percentage, and you are more careful about the order of
events than I have shown (it's more like "you may assume Y for up to
time T since you sent the request which initiated this response".)

clearly this wouldn't be expected to go wrong *often*, but it's good
if distributed systems correctness remains assured over
suspend/resume, so it's no different from slowing/descheduling.

For this sort of thing, it's enough to have a vague guaranteed
notification from the system: "There has been some kind of
CLOCK_MONOTONIC discrepancy", and algorithms should assume everything
might have elapsed.

> > So maybe CLOCK_MONOTONIC should be changed to include elapsed time
> > during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
> > for programs that want that?
>
> No. Lets not change it. CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW's
> relationship is tightly coupled, and applications that are tracking the
> amount of clock adjustment being done to the system require they keep
> their semantics.
>
> As I said earlier, adding a new clockid to represent the MONOTONIC
> +SUSPEND time wouldn't be difficult, we just need to be clear about why
> it should be exposed, and have it also be easy to describe to developers
> which clockid would suit their needs best.

What I've described above doesn't actually need a new clock. It's
enough if you guarantee some kind of notification when there's been an
unknown jump in CLOCK_MONOTONIC's relationship to real time.

That's just as well, as I doubt you could guarantee MONOTONIC+SUSPEND
accuracy on all hardware.

For correct behaviour, the notification must be guaranteed to be seen
by any program when it queries CLOCK_MONOTONIC or queries expiry of a
timer based on that. It's insufficient to queue a notification which
might take program execution time to be delivered (that includes
signals). In other words, the clock-jump flag must be flagged by
suspend/resume before the program execution itself is resumed (and
after it's suspended of course), and seen synchronously when the
program calls a system call to check the clock/timer.

-- Jamie

2010-12-02 01:19:05

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

[email protected] wrote:
> On Wed, 01 Dec 2010 10:43:59 GMT, Jamie Lokier said:
>
> > So maybe CLOCK_MONOTONIC should be changed to include elapsed time
> > during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
> > for programs that want that?
>
> Wouldn't that be an API break for programs that are expecting the current
> behavior of CLOCK_MONOTONIC? Yes, there should be a way to request either of
> them - but if there's only one way now, it should continue to act the current
> way, and the added way is the second option.

I don't know. Can you think of any program which would break if
suspend/resume's clocks behaved like ordinary task scheduling - when a
task doesn't run for a long time because of scheduling decisions?
Hmm, I guess some realtime apps might like to know.

Currently CLOCK_MONOTONIC jumps forwards by 4 seconds on
suspend/resume anyway (as seen by userspace), on my x86 laptop running
2.6.37-rc3. So it does already jump a bit...

But see my other reply; maybe there's no need to change it. A
reliable, immediate notification that CLOCK_MONOTONIC's relationship
to real time has been disrupted by an unknown amount would be
sufficient for the problems I have in mind.

-- Jamie

2010-12-02 01:56:12

by john stultz

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, 2010-12-02 at 01:18 +0000, Jamie Lokier wrote:
> [email protected] wrote:
> > On Wed, 01 Dec 2010 10:43:59 GMT, Jamie Lokier said:
> >
> > > So maybe CLOCK_MONOTONIC should be changed to include elapsed time
> > > during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
> > > for programs that want that?
> >
> > Wouldn't that be an API break for programs that are expecting the current
> > behavior of CLOCK_MONOTONIC? Yes, there should be a way to request either of
> > them - but if there's only one way now, it should continue to act the current
> > way, and the added way is the second option.
>
> I don't know. Can you think of any program which would break if
> suspend/resume's clocks behaved like ordinary task scheduling - when a
> task doesn't run for a long time because of scheduling decisions?
> Hmm, I guess some realtime apps might like to know.

Like I mentioned earlier, CLOCK_MONOTONIC_RAW and CLOCK_MONOTONIC are
tightly tied, so anything using CLOCK_MONOTONIC_RAW would break.

It might be possible to change both, but I still think such a change
would be bad.

> Currently CLOCK_MONOTONIC jumps forwards by 4 seconds on
> suspend/resume anyway (as seen by userspace), on my x86 laptop running
> 2.6.37-rc3. So it does already jump a bit...

So just to clarify here, by this do you mean that there's ~4 seconds
delay between the resume event and when userland apps start to run (or
possibly some of that accumulating between the app freeze and the
timekeeping suspend) ?

Or are you seeing CLOCK_MONOTONIC jump 4 seconds out of sync with
CLOCK_REALTIME?

It should be the delta between CLOCK_MONOTONIC and CLOCK_REALTIME prior
to suspend should be that same delta + suspend time after resume. If
that's not the case, something may be broken.

thanks
-john

2010-12-02 03:07:56

by john stultz

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Thu, 2010-12-02 at 01:12 +0000, Jamie Lokier wrote:
> john stultz wrote:
> > > CLOCK_MONOTONIC is unsuitable because it stops at suspend. Maybe it
> > > should stay that way. But maybe not - programs using CLOCK_MONOTONIC
> > > usually want to trigger timeouts etc. based on real elapsed time, and
> > > after suspend/resume, it's quite reasonable to want to trigger all of
> > > a program's short timeouts immediately. Indeed some network protocol
> > > userspace may currently behave *incorrectly* over suspend/resume,
> > > especially those using clock times to validate their caches,
> > > *because* CLOCK_MONOTONIC doesn't count it.
> >
> > Is there a specific example of this occurring that you have in mind?
>
> Yes, it's a correctness issue in network protocols using
> lease/oplock/MESI-style cache coherency. (E.g. NFSv4, CIFS, whatever
> you like in userspace.)

Ok. Just curious, as similar cases I was thinking about (like AFS)
require clients to have a reasonably synced CLOCK_REALTIME to the server
for such caching. I'll have to look at the NFSv4 and CIFS cases.

> By this, I mean anything with this sort of pattern:
>
> 1. Receive message "you may cache thing X for up to 20 seconds *without
> checking if it changed* during that time; afterwards, check".
>
> (If the other end need to change X within the 20 second
> interval, the other end will send a request to break the lease;
> if the other end doesn't get a response, then it waits until the
> 20 second expires, and then it's safe to assume the lease expired.)
>
> 2. Local request for value of X.
>
> => If less than 20 seconds has passed, the local cache responds
> with X *without any network confirmation*. I.e. it's instant.
> => If more than 20 seconds has passed, it has to talk to the
> other end. I.e. a network round trip.
>
> The algorithm is coherent even if the network is unreliable and goes
> down sometimes. When that happens, local requests are stalled, rather
> than returning values incoherent with other machines.
>
> This algorithm breaks if the local application depends on
> CLOCK_MONOTONIC to confirm that less than 20 seconds has passed
> and CLOCK_MONOTONIC is lying.
>
> CLOCK_MONOTONIC lies when you've done suspend+resume while this
> program was running, so it's 20 seconds test gives the wrong result.
>
> You can imagine there are quite a few applications that use this
> technique because it's quite fundamental to efficient coherency
> protocols. (Although I'm unable to name any off the top of my head!).

Yea, the case seems reasonable. I guess I'm just surprised they use
CLOCK_MONOTONIC and haven't complained earlier about it.

> > > So maybe CLOCK_MONOTONIC should be changed to include elapsed time
> > > during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
> > > for programs that want that?
> >
> > No. Lets not change it. CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW's
> > relationship is tightly coupled, and applications that are tracking the
> > amount of clock adjustment being done to the system require they keep
> > their semantics.
> >
> > As I said earlier, adding a new clockid to represent the MONOTONIC
> > +SUSPEND time wouldn't be difficult, we just need to be clear about why
> > it should be exposed, and have it also be easy to describe to developers
> > which clockid would suit their needs best.
>
> What I've described above doesn't actually need a new clock. It's
> enough if you guarantee some kind of notification when there's been an
> unknown jump in CLOCK_MONOTONIC's relationship to real time.

I'm not as familiar with the pm code, but if you just need
suspend/resume event notification, we should already have that via the
userland suspend/resume hooks.

It just seems to me that the notification you suggest is sufficient, but
is only minimally useful. So, an application gets a notification that we
suspended, and so CLOCK_MONOTONIC based timers may have been delayed,
but without knowing how much, its unclear what to do. For the cache
cases, sure, you can just drop everything, but I'm sure for other cases
we'd be pushing the userland app to keep its own sense of the
CLOCK_MONOTONIC/REALTIME delta and try to track those changes.

So providing a new CLOCK_BOOTTIME or something would seem pretty
reasonable to me, allowing things like timers to be set that would
expire immediately after a resume if they were to expire while the
system was suspended.

> That's just as well, as I doubt you could guarantee MONOTONIC+SUSPEND
> accuracy on all hardware.

Well, unless there is no persistent/RTC device to figure out the suspend
time from, I think we could do a decent job. There are limitations (ie:
RTC hardware only providing second resolution time), but the bar for
time accuracy over suspend has been fairly low so far.

> For correct behaviour, the notification must be guaranteed to be seen
> by any program when it queries CLOCK_MONOTONIC or queries expiry of a
> timer based on that. It's insufficient to queue a notification which
> might take program execution time to be delivered (that includes
> signals). In other words, the clock-jump flag must be flagged by
> suspend/resume before the program execution itself is resumed (and
> after it's suspended of course), and seen synchronously when the
> program calls a system call to check the clock/timer.

Maybe I'm missing something, but that seems like such a notification is
going to be difficult to provide with the current interfaces. And I'm
not sure it resolves any races you'd have with the suspend hitting you
right after the time read but before an action is taken.

For such strict semantics, it almost seems like some way to inhibit
suspend would be needed around the time checks and actions.

thanks
-john





2010-12-04 00:58:08

by john stultz

[permalink] [raw]
Subject: Re: [PATCH] [RFC] timerfd: add TFD_NOTIFY_CLOCK_SET to watch for clock changes

On Wed, 2010-12-01 at 17:55 -0800, john stultz wrote:
> On Thu, 2010-12-02 at 01:18 +0000, Jamie Lokier wrote:
> > [email protected] wrote:
> > > On Wed, 01 Dec 2010 10:43:59 GMT, Jamie Lokier said:
> > >
> > > > So maybe CLOCK_MONOTONIC should be changed to include elapsed time
> > > > during suspend/resume, and CLOCK_MONOTONIC_RAW could remain as it is,
> > > > for programs that want that?
> > >
> > > Wouldn't that be an API break for programs that are expecting the current
> > > behavior of CLOCK_MONOTONIC? Yes, there should be a way to request either of
> > > them - but if there's only one way now, it should continue to act the current
> > > way, and the added way is the second option.
> >
> > I don't know. Can you think of any program which would break if
> > suspend/resume's clocks behaved like ordinary task scheduling - when a
> > task doesn't run for a long time because of scheduling decisions?
> > Hmm, I guess some realtime apps might like to know.
>
> Like I mentioned earlier, CLOCK_MONOTONIC_RAW and CLOCK_MONOTONIC are
> tightly tied, so anything using CLOCK_MONOTONIC_RAW would break.
>
> It might be possible to change both, but I still think such a change
> would be bad.

So actually, as I think more about this, I'm starting to come around to
the side that maybe CLOCK_MONOTONIC should be changed to increment
during suspend (CLOCK_MONOTONIC_RAW could also be moved forward by the
same amount, which isn't really ideal, but maybe not problematic).

There are still quite a number of problems that might be caused by such
a change. So it may still be impractical to actually do, but more and
more it does seem like it might be the better approach.

I keep thinking about it.

thanks
-john