2014-01-13 10:44:55

by Alexey Perevalov

[permalink] [raw]
Subject: [PATCH v2 0/3] Deferrable timers support for timerfd API

Hello dear community.

This is reworked patch set of original Anton's Vorontsov
proposal regarding unified deferrable timers in the user space.
http://lwn.net/Articles/514707/


I decided to resubmit it due we found it usefull for us too.

timerfd was modified since Anton's commit, Alarm support was added.
This isn't only rebase. Anton's previous version used deferrable timer
in couple with hrtimer. This version uses only deferrable timer. It
mean the behaviour of overrun number is different.
e.g. if you don't poll one second timer for a 10 seconds - you'll get
10 overruns with hrtimer, but for deferrable timer it could be another value.

Anton Vorontsov (3):
kernel/time: Add new helpers to convert ktime to/from jiffies
timerfd: Factor out timer-type unspecific timerfd_expire()
timerfd: Add support for deferrable timers

fs/timerfd.c | 127 +++++++++++++++++++++++++++++++++++------------
include/linux/jiffies.h | 4 +-
include/linux/ktime.h | 3 +-
include/linux/timerfd.h | 4 +-
kernel/time.c | 23 +++++++++
5 files changed, 126 insertions(+), 35 deletions(-)

--
1.7.9.5


2014-01-13 10:45:23

by Alexey Perevalov

[permalink] [raw]
Subject: [PATCH v2 1/3] kernel/time: Add new helpers to convert ktime to/from jiffies

From: Anton Vorontsov <[email protected]>

Two new functions: jiffies_to_ktime() and ktime_to_jiffies(), we'll use
them for timerfd deferred timers handling.

We fully reuse the logic from timespec implementations, so the functions
are pretty straightforward.

The only tricky part is in headers: we have to include jiffies.h after
we defined ktime_t, this is because ktime.h needs some declarations from
jiffies.h (e.g. TICK_NSEC).

Signed-off-by: Anton Vorontsov <[email protected]>
Signed-off-by: Alexey Perevalov <[email protected]>
---
include/linux/jiffies.h | 4 +++-
include/linux/ktime.h | 3 ++-
kernel/time.c | 23 +++++++++++++++++++++++
3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index d235e88..1ba02ae 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -6,6 +6,7 @@
#include <linux/types.h>
#include <linux/time.h>
#include <linux/timex.h>
+#include <linux/ktime.h>
#include <asm/param.h> /* for HZ */

/*
@@ -302,7 +303,8 @@ extern void jiffies_to_timespec(const unsigned long jiffies,
extern unsigned long timeval_to_jiffies(const struct timeval *value);
extern void jiffies_to_timeval(const unsigned long jiffies,
struct timeval *value);
-
+extern unsigned long ktime_to_jiffies(ktime_t *value);
+extern void jiffies_to_ktime(const unsigned long jiffies, ktime_t *value);
extern clock_t jiffies_to_clock_t(unsigned long x);
static inline clock_t jiffies_delta_to_clock_t(long delta)
{
diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index 31c0cd1..e8ed619 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -22,7 +22,6 @@
#define _LINUX_KTIME_H

#include <linux/time.h>
-#include <linux/jiffies.h>

/*
* ktime_t:
@@ -58,6 +57,8 @@ union ktime {

typedef union ktime ktime_t; /* Kill this */

+#include <linux/jiffies.h>
+
/*
* ktime_t definitions when using the 64-bit scalar representation:
*/
diff --git a/kernel/time.c b/kernel/time.c
index 7c7964c..22580a0 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -29,6 +29,7 @@

#include <linux/export.h>
#include <linux/timex.h>
+#include <linux/ktime.h>
#include <linux/capability.h>
#include <linux/timekeeper_internal.h>
#include <linux/errno.h>
@@ -575,6 +576,28 @@ void jiffies_to_timeval(const unsigned long jiffies, struct timeval *value)
}
EXPORT_SYMBOL(jiffies_to_timeval);

+unsigned long ktime_to_jiffies(ktime_t *value)
+{
+ struct timespec ts = ktime_to_timespec(*value);
+
+ /*
+ * nsecs_to_jiffies(ktime_to_ns(*ktime)) is unsafe as nsecs_to_jiffies
+ * doesn't handle MAX_JIFFY_OFFSET. So we reuse the logic from the
+ * timespec to jiffies conversion function.
+ */
+ return timespec_to_jiffies(&ts);
+}
+EXPORT_SYMBOL(ktime_to_jiffies);
+
+void jiffies_to_ktime(const unsigned long jiffies, ktime_t *value)
+{
+ struct timespec ts;
+
+ jiffies_to_timespec(jiffies, &ts);
+ *value = timespec_to_ktime(ts);
+}
+EXPORT_SYMBOL(jiffies_to_ktime);
+
/*
* Convert jiffies/jiffies_64 to clock_t and back.
*/
--
1.7.9.5

2014-01-13 10:45:30

by Alexey Perevalov

[permalink] [raw]
Subject: [PATCH v2 2/3] timerfd: Factor out timer-type unspecific timerfd_expire()

From: Anton Vorontsov <[email protected]>

There is nothing hrtimer-specific inside the timerfd_tmrproc(), except
the function prototype. We're about to add other timer types, so factor
out generic timerfd_expire() helper from timerfd_tmrproc().

Signed-off-by: Anton Vorontsov <[email protected]>
Signed-off-by: Alexey Perevalov <[email protected]>
---
fs/timerfd.c | 40 +++++++++++++++++++---------------------
1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index 9293121..3561ce7 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -229,6 +229,23 @@ static unsigned int timerfd_poll(struct file *file, poll_table *wait)
return events;
}

+static u64 timerfd_rearm(struct timerfd_ctx *ctx)
+{
+ u64 orun;
+
+ if (isalarm(ctx)) {
+ orun += alarm_forward_now(
+ &ctx->t.alarm, ctx->tintv) - 1;
+ alarm_restart(&ctx->t.alarm);
+ } else {
+ orun += hrtimer_forward_now(&ctx->t.tmr,
+ ctx->tintv) - 1;
+ hrtimer_restart(&ctx->t.tmr);
+ }
+
+ return orun;
+}
+
static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
loff_t *ppos)
{
@@ -265,15 +282,7 @@ static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
* callback to avoid DoS attacks specifying a very
* short timer period.
*/
- if (isalarm(ctx)) {
- ticks += alarm_forward_now(
- &ctx->t.alarm, ctx->tintv) - 1;
- alarm_restart(&ctx->t.alarm);
- } else {
- ticks += hrtimer_forward_now(&ctx->t.tmr,
- ctx->tintv) - 1;
- hrtimer_restart(&ctx->t.tmr);
- }
+ ticks += timerfd_rearm(ctx);
}
ctx->expired = 0;
ctx->ticks = 0;
@@ -421,18 +430,7 @@ static int do_timerfd_gettime(int ufd, struct itimerspec *t)
spin_lock_irq(&ctx->wqh.lock);
if (ctx->expired && ctx->tintv.tv64) {
ctx->expired = 0;
-
- if (isalarm(ctx)) {
- ctx->ticks +=
- alarm_forward_now(
- &ctx->t.alarm, ctx->tintv) - 1;
- alarm_restart(&ctx->t.alarm);
- } else {
- ctx->ticks +=
- hrtimer_forward_now(&ctx->t.tmr, ctx->tintv)
- - 1;
- hrtimer_restart(&ctx->t.tmr);
- }
+ ctx->ticks += timerfd_rearm(ctx);
}
t->it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
t->it_interval = ktime_to_timespec(ctx->tintv);
--
1.7.9.5

2014-01-13 10:45:34

by Alexey Perevalov

[permalink] [raw]
Subject: [PATCH v2 3/3] timerfd: Add support for deferrable timers

From: Anton Vorontsov <[email protected]>

This patch implements a userland-side API for generic deferrable timers,
per linux/timer.h:

* A deferrable timer will work normally when the system is busy, but
* will not cause a CPU to come out of idle just to service it; instead,
* the timer will be serviced when the CPU eventually wakes up with a
* subsequent non-deferrable timer.

These timers are crucial for power saving, i.e. periodic tasks that want
to work in background when the system is under use, but don't want to
cause wakeups themselves.

The deferred timers are somewhat orthogonal to high-res external timers,
since the deferred timer is tied to the system load, not just to some
external decrementer source.

So, currently, the implementation has a HZ precision, and the maximum
interval is jiffies resolution (i.e. with HZ=1000, on 32 bit that would
be around max 49 days). Of course we can implement longer timeouts by
rearming the timer, although it probably wouldn't make much sense in
real world, so we keep it simple and just return E2BIG if we don't like
the interval.

Signed-off-by: Alexey Perevalov <[email protected]>
---
fs/timerfd.c | 89 ++++++++++++++++++++++++++++++++++++++++-------
include/linux/timerfd.h | 4 ++-
2 files changed, 80 insertions(+), 13 deletions(-)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index 3561ce7..9677a66 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -31,6 +31,8 @@ struct timerfd_ctx {
struct hrtimer tmr;
struct alarm alarm;
} t;
+ struct timer_list dtmr;
+ bool deferrable;
ktime_t tintv;
ktime_t moffs;
wait_queue_head_t wqh;
@@ -51,6 +53,11 @@ static inline bool isalarm(struct timerfd_ctx *ctx)
ctx->clockid == CLOCK_BOOTTIME_ALARM;
}

+static inline bool isdeferrable(struct timerfd_ctx *ctx)
+{
+ return ctx->deferrable;
+}
+
/*
* This gets called when the timer event triggers. We set the "expired"
* flag, but we do not re-arm the timer (in case it's necessary,
@@ -84,6 +91,11 @@ static enum alarmtimer_restart timerfd_alarmproc(struct alarm *alarm,
return ALARMTIMER_NORESTART;
}

+static void timerfd_dtmrproc(unsigned long data)
+{
+ timerfd_triggered((struct timerfd_ctx *)data);
+}
+
/*
* Called when the clock was set to cancel the timers in the cancel
* list. This will wake up processes waiting on these timers. The
@@ -151,12 +163,40 @@ static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx)

if (isalarm(ctx))
remaining = alarm_expires_remaining(&ctx->t.alarm);
- else
+ else if (isdeferrable(ctx)) {
+ ktime_t expires;
+ jiffies_to_ktime(ctx->dtmr.expires, &expires);
+ remaining = ktime_sub(expires, ktime_get());
+ } else
remaining = hrtimer_expires_remaining(&ctx->t.tmr);

return remaining.tv64 < 0 ? ktime_set(0, 0): remaining;
}

+static bool timerfd_deferrable_valid(ktime_t intv)
+{
+ ktime_t max;
+
+ jiffies_to_ktime(MAX_JIFFY_OFFSET, &max);
+ if (intv.tv64 > max.tv64)
+ return 0;
+ return 1;
+}
+
+static int timerfd_setup_deferrable(struct timerfd_ctx *ctx)
+{
+ ktime_t rem = timerfd_get_remaining(ctx);
+
+ if (ctx->clockid != CLOCK_MONOTONIC)
+ return -EINVAL;
+ if (!timerfd_deferrable_valid(ctx->tintv) ||
+ !timerfd_deferrable_valid(rem))
+ return -E2BIG;
+
+ mod_timer(&ctx->dtmr, jiffies + ktime_to_jiffies(&rem) + 1);
+ return 0;
+}
+
static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
const struct itimerspec *ktmr)
{
@@ -177,6 +217,9 @@ static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
ctx->clockid == CLOCK_REALTIME_ALARM ?
ALARM_REALTIME : ALARM_BOOTTIME,
timerfd_alarmproc);
+ } else if (isdeferrable(ctx)) {
+ ctx->dtmr.function = timerfd_dtmrproc;
+ ctx->dtmr.data = (unsigned long)ctx;
} else {
hrtimer_init(&ctx->t.tmr, clockid, htmode);
hrtimer_set_expires(&ctx->t.tmr, texp);
@@ -189,6 +232,13 @@ static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
alarm_start(&ctx->t.alarm, texp);
else
alarm_start_relative(&ctx->t.alarm, texp);
+
+ } else if (isdeferrable(ctx)) {
+ int ret;
+
+ ret = timerfd_setup_deferrable(ctx);
+ if (ret)
+ return ret;
} else {
hrtimer_start(&ctx->t.tmr, texp, htmode);
}
@@ -207,8 +257,11 @@ static int timerfd_release(struct inode *inode, struct file *file)

if (isalarm(ctx))
alarm_cancel(&ctx->t.alarm);
- else
+ else {
+ del_timer_sync(&ctx->dtmr);
hrtimer_cancel(&ctx->t.tmr);
+ }
+
kfree_rcu(ctx, rcu);
return 0;
}
@@ -231,12 +284,15 @@ static unsigned int timerfd_poll(struct file *file, poll_table *wait)

static u64 timerfd_rearm(struct timerfd_ctx *ctx)
{
- u64 orun;
+ u64 orun = 0;

if (isalarm(ctx)) {
orun += alarm_forward_now(
&ctx->t.alarm, ctx->tintv) - 1;
alarm_restart(&ctx->t.alarm);
+ } else if (isdeferrable(ctx)) {
+ mod_timer(&ctx->dtmr, jiffies +
+ ktime_to_jiffies(&ctx->tintv) + 1);
} else {
orun += hrtimer_forward_now(&ctx->t.tmr,
ctx->tintv) - 1;
@@ -341,8 +397,11 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
ctx->clockid == CLOCK_REALTIME_ALARM ?
ALARM_REALTIME : ALARM_BOOTTIME,
timerfd_alarmproc);
- else
+ else {
hrtimer_init(&ctx->t.tmr, clockid, HRTIMER_MODE_ABS);
+ /* Create deferrable timer in any none alarm case */
+ init_timer_deferrable(&ctx->dtmr);
+ }

ctx->moffs = ktime_get_monotonic_offset();

@@ -354,7 +413,7 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
return ufd;
}

-static int do_timerfd_settime(int ufd, int flags,
+static int do_timerfd_settime(int ufd, int flags,
const struct itimerspec *new,
struct itimerspec *old)
{
@@ -379,19 +438,25 @@ static int do_timerfd_settime(int ufd, int flags,
* it to the new values.
*/
for (;;) {
+ int canceled;
spin_lock_irq(&ctx->wqh.lock);

- if (isalarm(ctx)) {
- if (alarm_try_to_cancel(&ctx->t.alarm) >= 0)
- break;
- } else {
- if (hrtimer_try_to_cancel(&ctx->t.tmr) >= 0)
- break;
- }
+ if (isalarm(ctx))
+ canceled = alarm_try_to_cancel(&ctx->t.alarm);
+ else if (isdeferrable(ctx))
+ canceled = try_to_del_timer_sync(&ctx->dtmr);
+ else
+ canceled = hrtimer_try_to_cancel(&ctx->t.tmr);
+
+ if (canceled >= 0)
+ break;
spin_unlock_irq(&ctx->wqh.lock);
cpu_relax();
}

+ /* Must set a new value after we cancel the previous timer. */
+ ctx->deferrable = flags & TFD_TIMER_DEFERRABLE;
+
/*
* If the timer is expired and it's periodic, we need to advance it
* because the caller may want to know the previous expiration time.
diff --git a/include/linux/timerfd.h b/include/linux/timerfd.h
index d3b57fa..e053105 100644
--- a/include/linux/timerfd.h
+++ b/include/linux/timerfd.h
@@ -20,6 +20,7 @@
*/
#define TFD_TIMER_ABSTIME (1 << 0)
#define TFD_TIMER_CANCEL_ON_SET (1 << 1)
+#define TFD_TIMER_DEFERRABLE (1 << 2)
#define TFD_CLOEXEC O_CLOEXEC
#define TFD_NONBLOCK O_NONBLOCK

@@ -27,6 +28,7 @@
/* Flags for timerfd_create. */
#define TFD_CREATE_FLAGS TFD_SHARED_FCNTL_FLAGS
/* Flags for timerfd_settime. */
-#define TFD_SETTIME_FLAGS (TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET)
+#define TFD_SETTIME_FLAGS (TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET | \
+ TFD_TIMER_DEFERRABLE)

#endif /* _LINUX_TIMERFD_H */
--
1.7.9.5

2014-01-13 15:30:16

by Alexey Perevalov

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API


Hello all,

one remark - timerfd is not documented in linux Documentation directory
at all.
I think it's better to have such description.


On 01/13/2014 02:43 PM, Alexey Perevalov wrote:
> Hello dear community.
>
> This is reworked patch set of original Anton's Vorontsov
> proposal regarding unified deferrable timers in the user space.
> http://lwn.net/Articles/514707/
>
>
> I decided to resubmit it due we found it usefull for us too.
>
> timerfd was modified since Anton's commit, Alarm support was added.
> This isn't only rebase. Anton's previous version used deferrable timer
> in couple with hrtimer. This version uses only deferrable timer. It
> mean the behaviour of overrun number is different.
> e.g. if you don't poll one second timer for a 10 seconds - you'll get
> 10 overruns with hrtimer, but for deferrable timer it could be another value.
>
> Anton Vorontsov (3):
> kernel/time: Add new helpers to convert ktime to/from jiffies
> timerfd: Factor out timer-type unspecific timerfd_expire()
> timerfd: Add support for deferrable timers
>
> fs/timerfd.c | 127 +++++++++++++++++++++++++++++++++++------------
> include/linux/jiffies.h | 4 +-
> include/linux/ktime.h | 3 +-
> include/linux/timerfd.h | 4 +-
> kernel/time.c | 23 +++++++++
> 5 files changed, 126 insertions(+), 35 deletions(-)
>


--
Best regards,
Alexey Perevalov

2014-01-13 17:37:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

Alexey Perevalov <[email protected]> writes:

> Hello all,
>
> one remark - timerfd is not documented in linux Documentation
> directory at all.
> I think it's better to have such description.

The documentation is the manpage in man-pages.

Ideally you change would come with the changed manpage
too.

-Andi

--
[email protected] -- Speaking for myself only

2014-01-14 00:15:12

by Chanwoo Choi

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] kernel/time: Add new helpers to convert ktime to/from jiffies

On 01/13/2014 07:43 PM, Alexey Perevalov wrote:
> From: Anton Vorontsov <[email protected]>
>
> Two new functions: jiffies_to_ktime() and ktime_to_jiffies(), we'll use
> them for timerfd deferred timers handling.
>
> We fully reuse the logic from timespec implementations, so the functions
> are pretty straightforward.
>
> The only tricky part is in headers: we have to include jiffies.h after
> we defined ktime_t, this is because ktime.h needs some declarations from
> jiffies.h (e.g. TICK_NSEC).
>
> Signed-off-by: Anton Vorontsov <[email protected]>
> Signed-off-by: Alexey Perevalov <[email protected]>

Tested-by: Chanwoo Choi <[email protected]>

I tested this patchset about operation of deferrable timer on user-space.

Thanks,
Chanwoo Choi

2014-01-14 06:44:40

by Alexey Perevalov

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 01/13/2014 09:36 PM, Andi Kleen wrote:
> Alexey Perevalov <[email protected]> writes:
>
>> Hello all,
>>
>> one remark - timerfd is not documented in linux Documentation
>> directory at all.
>> I think it's better to have such description.
> The documentation is the manpage in man-pages.
>
> Ideally you change would come with the changed manpage
> too.
>
> -Andi
>
Hello Andi,
thank you,

I found ./man2/timerfd_create.2 from
http://git.kernel.org/pub/scm/docs/man-pages/man-pages is outdated.
TFD_TIMER_CANCEL_ON_SET isn't described, just mention "since linux 3.0"
Clockids CLOCK_REALTIME_ALARM, CLOCK_BOOTTIME_ALARM isn't mentioned at all.

I'm not technical writer, but I'll try to describe _ALARM clockids and
flags introduced by Anton's patchset.


--
Best regards,
Alexey Perevalov

2014-01-21 19:12:41

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 01/13/2014 02:43 AM, Alexey Perevalov wrote:
> Hello dear community.
>
> This is reworked patch set of original Anton's Vorontsov
> proposal regarding unified deferrable timers in the user space.
> http://lwn.net/Articles/514707/
>
>
> I decided to resubmit it due we found it usefull for us too.
>
> timerfd was modified since Anton's commit, Alarm support was added.
> This isn't only rebase. Anton's previous version used deferrable timer
> in couple with hrtimer. This version uses only deferrable timer. It
> mean the behaviour of overrun number is different.
> e.g. if you don't poll one second timer for a 10 seconds - you'll get
> 10 overruns with hrtimer, but for deferrable timer it could be another value.
>

Sorry, last week was a little crazy and I didn't get a chance to closely
review this. But looking at this my major conceptual objection with the
previous patchset (introducing the new clockid) is gone.

My remaining conceptual concern here is that the TIMER_DEFERRABLE flag
is a timerfd only construct here, and I worry we should make sure we
think this through well enough that the same functionality can be
supported via other timer interfaces (like clock_nanosleep, etc), which
may mean the functionality should be pushed more deeply into the hrtimer
subsystem.

So main suggestion here is to make sure you cc Thomas Gleixner on future
iterations, so he can provide some thoughts on what the best approach
might be here. I know he also has some plans that might collide with the
jiffies_to_ktime work.

Thomas: Any thought here? Should we be trying to unify the timerfd flags
and the posix timer flags (specifically things like TIMER_CANCEL_ON_SET,
which is currently timerfd-only)? Should a deferrable flag be added to
the hrtimer core or left to the timer wheel?

thanks
-john

2014-01-27 07:12:33

by Alexey Perevalov

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

Dear Thomas,
could you please comment John's question (see bellow) regarding flags.

On 01/21/2014 11:12 PM, John Stultz wrote:
> On 01/13/2014 02:43 AM, Alexey Perevalov wrote:
>> Hello dear community.
>>
>> This is reworked patch set of original Anton's Vorontsov
>> proposal regarding unified deferrable timers in the user space.
>> http://lwn.net/Articles/514707/
>>
>>
>> I decided to resubmit it due we found it usefull for us too.
>>
>> timerfd was modified since Anton's commit, Alarm support was added.
>> This isn't only rebase. Anton's previous version used deferrable timer
>> in couple with hrtimer. This version uses only deferrable timer. It
>> mean the behaviour of overrun number is different.
>> e.g. if you don't poll one second timer for a 10 seconds - you'll get
>> 10 overruns with hrtimer, but for deferrable timer it could be another value.
>>
> Sorry, last week was a little crazy and I didn't get a chance to closely
> review this. But looking at this my major conceptual objection with the
> previous patchset (introducing the new clockid) is gone.
>
> My remaining conceptual concern here is that the TIMER_DEFERRABLE flag
> is a timerfd only construct here, and I worry we should make sure we
> think this through well enough that the same functionality can be
> supported via other timer interfaces (like clock_nanosleep, etc), which
> may mean the functionality should be pushed more deeply into the hrtimer
> subsystem.
>
> So main suggestion here is to make sure you cc Thomas Gleixner on future
> iterations, so he can provide some thoughts on what the best approach
> might be here. I know he also has some plans that might collide with the
> jiffies_to_ktime work.
>
> Thomas: Any thought here? Should we be trying to unify the timerfd flags
> and the posix timer flags (specifically things like TIMER_CANCEL_ON_SET,
> which is currently timerfd-only)? Should a deferrable flag be added to
> the hrtimer core or left to the timer wheel?
>
> thanks
> -john
>


--
Best regards,
Alexey Perevalov

2014-02-03 06:55:04

by Alexey Perevalov

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

Dear John, hello

could we figure out without Thomas advice?
Maybe it worth to propose timerfd and posix timer flag unification patch?

On 01/21/2014 11:12 PM, John Stultz wrote:
> On 01/13/2014 02:43 AM, Alexey Perevalov wrote:
>> Hello dear community.
>>
>> This is reworked patch set of original Anton's Vorontsov
>> proposal regarding unified deferrable timers in the user space.
>> http://lwn.net/Articles/514707/
>>
>>
>> I decided to resubmit it due we found it usefull for us too.
>>
>> timerfd was modified since Anton's commit, Alarm support was added.
>> This isn't only rebase. Anton's previous version used deferrable timer
>> in couple with hrtimer. This version uses only deferrable timer. It
>> mean the behaviour of overrun number is different.
>> e.g. if you don't poll one second timer for a 10 seconds - you'll get
>> 10 overruns with hrtimer, but for deferrable timer it could be another value.
>>
> Sorry, last week was a little crazy and I didn't get a chance to closely
> review this. But looking at this my major conceptual objection with the
> previous patchset (introducing the new clockid) is gone.
>
> My remaining conceptual concern here is that the TIMER_DEFERRABLE flag
> is a timerfd only construct here, and I worry we should make sure we
> think this through well enough that the same functionality can be
> supported via other timer interfaces (like clock_nanosleep, etc), which
> may mean the functionality should be pushed more deeply into the hrtimer
> subsystem.
>
> So main suggestion here is to make sure you cc Thomas Gleixner on future
> iterations, so he can provide some thoughts on what the best approach
> might be here. I know he also has some plans that might collide with the
> jiffies_to_ktime work.
>
> Thomas: Any thought here? Should we be trying to unify the timerfd flags
> and the posix timer flags (specifically things like TIMER_CANCEL_ON_SET,
> which is currently timerfd-only)? Should a deferrable flag be added to
> the hrtimer core or left to the timer wheel?
>
> thanks
> -john
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>


--
Best regards,
Alexey Perevalov,
Leading software engineer,
phone: +7 (495) 797 25 00 ext 3969
e-mail: [email protected] <mailto:[email protected]>

Mobile group, Samsung R&D Institute Rus
12 Dvintsev street, building 1
127018, Moscow, Russian Federation

2014-02-03 23:58:50

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 02/02/2014 10:54 PM, Alexey Perevalov wrote:
> Dear John, hello
>
> could we figure out without Thomas advice?
> Maybe it worth to propose timerfd and posix timer flag unification patch?

That would likely get his attention (and possibly wrath)... so not a
bad idea! ;)


thanks
-john


2014-02-04 16:10:23

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On Mon, 27 Jan 2014, Alexey Perevalov wrote:
> On 01/21/2014 11:12 PM, John Stultz wrote:
> >
> > Thomas: Any thought here? Should we be trying to unify the timerfd flags
> > and the posix timer flags (specifically things like TIMER_CANCEL_ON_SET,
> > which is currently timerfd-only)? Should a deferrable flag be added to
> > the hrtimer core or left to the timer wheel?

The timer cancel on set was added only to timerfd because timerfd is a
non posix interface and we are halfways free to add stuff to
it. Adding extra flags to the real posix timer interfaces is a
different story.

What's the rationale for a deferrable flag for user space interfaces?

Thanks,

tglx

2014-02-05 06:43:57

by Alexey Perevalov

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 02/04/2014 08:10 PM, Thomas Gleixner wrote:
> On Mon, 27 Jan 2014, Alexey Perevalov wrote:
>> On 01/21/2014 11:12 PM, John Stultz wrote:
>>> Thomas: Any thought here? Should we be trying to unify the timerfd flags
>>> and the posix timer flags (specifically things like TIMER_CANCEL_ON_SET,
>>> which is currently timerfd-only)? Should a deferrable flag be added to
>>> the hrtimer core or left to the timer wheel?
> The timer cancel on set was added only to timerfd because timerfd is a
> non posix interface and we are halfways free to add stuff to
> it. Adding extra flags to the real posix timer interfaces is a
> different story.
And what about "deferrable" possibility for hrtimers, do you consider it
reasonable?

>
> What's the rationale for a deferrable flag for user space interfaces?
The main reason of this was do not call user space timers on system
idle, to safe power on embedded systems, especially in case of NOHZ.

>
> Thanks,
>
> tglx
>


--
Best regards,
Alexey Perevalov

2014-02-05 21:41:21

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On Wed, 5 Feb 2014, Alexey Perevalov wrote:
> On 02/04/2014 08:10 PM, Thomas Gleixner wrote:
> > On Mon, 27 Jan 2014, Alexey Perevalov wrote:
> > > On 01/21/2014 11:12 PM, John Stultz wrote:
> > > > Thomas: Any thought here? Should we be trying to unify the timerfd flags
> > > > and the posix timer flags (specifically things like TIMER_CANCEL_ON_SET,
> > > > which is currently timerfd-only)? Should a deferrable flag be added to
> > > > the hrtimer core or left to the timer wheel?
> > The timer cancel on set was added only to timerfd because timerfd is a
> > non posix interface and we are halfways free to add stuff to
> > it. Adding extra flags to the real posix timer interfaces is a
> > different story.
>
> And what about "deferrable" possibility for hrtimers, do you consider it
> reasonable?

In principle, I have no objections, but we need a proper technical
solution. Just adding a flag and keeping the timers in the same rbtree
like we do for the timer wheel timers is not going to happen.

The only feasible solution is to have separate clock ids,
e.g. CLOCK_*_DEFERRABLE, which would enable the deferrable
functionality for all user space interfaces. No need for magic flags
and complex search for non deferrable timers.

Though I really do not like the idea of adding more overhead to
hrtimer_interrupt, but we must die one sort of death and definitely
any solution which brings timer_list back into the game is the worst
of all choices.

We spent enough time to explain why the conversion to hrtimers is the
only sensible solution for user space interfaces. Even thinking about
exposing the timer wheel to user space again is a violation of good
taste.

It's not rocket science to avoid the overhead for hrtimer_interrupt
when there are no users of certain clock bases. We already have a flag
field which marks the active bases, we just do not make proper use of
it. Did you even think about that before hacking that timer_list based
crappola with completely unspecified and unextensible behaviour?

Find an untested (even uncompiled) POC patch below. It should work
somehow, but you can find the unresolved issues and the necessary
updates to other parts of hrtimer.c yourself. I might take the
properly split up result if you can prove that there is no downside on
this aside of a few bytes memory footprint when nothing is using the
new clocks.

Thanks,

tglx
----
Subject: hrtimer-deferrable-hack.patch
From: Thomas Gleixner <[email protected]>
Date: Wed, 05 Feb 2014 21:44:31 +0100

[ Insert proper changelog ]

Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/hrtimer.h | 3 ++
include/uapi/linux/time.h | 3 ++
kernel/hrtimer.c | 62 ++++++++++++++++++++++++++++++++++++----------
3 files changed, 55 insertions(+), 13 deletions(-)

Index: tip/include/linux/hrtimer.h
===================================================================
--- tip.orig/include/linux/hrtimer.h
+++ tip/include/linux/hrtimer.h
@@ -158,6 +158,9 @@ enum hrtimer_base_type {
HRTIMER_BASE_REALTIME,
HRTIMER_BASE_BOOTTIME,
HRTIMER_BASE_TAI,
+ HRTIMER_BASE_MONOTONIC_DEFERRABLE,
+ HRTIMER_BASE_REALTIME_DEFERRABLE,
+ HRTIMER_BASE_BOOTTIME_DEFERRABLE,
HRTIMER_MAX_CLOCK_BASES,
};

Index: tip/include/uapi/linux/time.h
===================================================================
--- tip.orig/include/uapi/linux/time.h
+++ tip/include/uapi/linux/time.h
@@ -56,6 +56,9 @@ struct itimerval {
#define CLOCK_BOOTTIME_ALARM 9
#define CLOCK_SGI_CYCLE 10 /* Hardware specific */
#define CLOCK_TAI 11
+#define CLOCK_REALTIME_DEFERRABLE 12
+#define CLOCK_MONOTONIC_DEFERRABLE 13
+#define CLOCK_BOOTTIME_DEFERRABLE 14

#define MAX_CLOCKS 16
#define CLOCKS_MASK (CLOCK_REALTIME | CLOCK_MONOTONIC)
Index: tip/kernel/hrtimer.c
===================================================================
--- tip.orig/kernel/hrtimer.c
+++ tip/kernel/hrtimer.c
@@ -91,14 +91,35 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base,
.get_time = &ktime_get_clocktai,
.resolution = KTIME_LOW_RES,
},
+ {
+ .index = HRTIMER_BASE_MONOTONIC_DEFERRABLE,
+ .clockid = CLOCK_MONOTONIC_DEFERRABLE,
+ .get_time = &ktime_get,
+ .resolution = KTIME_LOW_RES,
+ },
+ {
+ .index = HRTIMER_BASE_REALTIME_DEFERRABLE,
+ .clockid = CLOCK_REALTIME_DEFERRABLE,
+ .get_time = &ktime_get_real,
+ .resolution = KTIME_LOW_RES,
+ },
+ {
+ .index = HRTIMER_BASE_BOOTTIME_DEFERRABLE,
+ .clockid = CLOCK_BOOTTIME_DEFERRABLE,
+ .get_time = &ktime_get_boottime,
+ .resolution = KTIME_LOW_RES,
+ },
}
};

static const int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
- [CLOCK_REALTIME] = HRTIMER_BASE_REALTIME,
- [CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC,
- [CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME,
- [CLOCK_TAI] = HRTIMER_BASE_TAI,
+ [CLOCK_REALTIME] = HRTIMER_BASE_REALTIME,
+ [CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC,
+ [CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME,
+ [CLOCK_TAI] = HRTIMER_BASE_TAI,
+ [CLOCK_REALTIME_DEFERRABLE] = HRTIMER_BASE_REALTIME_DEFERRABLE,
+ [CLOCK_MONOTONIC_DEFERRABLE] = HRTIMER_BASE_MONOTONIC_DEFERRABLE,
+ [CLOCK_BOOTTIME_DEFERRABLE] = HRTIMER_BASE_BOOTTIME_DEFERRABLE,
};

static inline int hrtimer_clockid_to_base(clockid_t clock_id)
@@ -193,7 +214,8 @@ hrtimer_check_target(struct hrtimer *tim
#ifdef CONFIG_HIGH_RES_TIMERS
ktime_t expires;

- if (!new_base->cpu_base->hres_active)
+ if (!new_base->cpu_base->hres_active ||
+ new_base->index >= HRTIMER_BASE_MONOTONIC_DEFERRABLE)
return 0;

expires = ktime_sub(hrtimer_get_expires(timer), new_base->offset);
@@ -555,7 +577,7 @@ hrtimer_force_reprogram(struct hrtimer_c

expires_next.tv64 = KTIME_MAX;

- for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) {
+ for (i = 0; i < HRTIMER_BASE_MONOTONIC_DEFERRABLE; i++, base++) {
struct hrtimer *timer;
struct timerqueue_node *next;

@@ -614,6 +636,13 @@ static int hrtimer_reprogram(struct hrti
return 0;

/*
+ * Deferrable timers are not touching the underlying
+ * hardware.
+ */
+ if (base->index >= HRTIMER_BASE_MONOTONIC_DEFERRABLE)
+ return 0;
+
+ /*
* CLOCK_REALTIME timer might be requested with an absolute
* expiry time which is less than base->offset. Nothing wrong
* about that, just avoid to call into the tick code, which
@@ -923,7 +952,10 @@ static void __remove_hrtimer(struct hrti

expires = ktime_sub(hrtimer_get_expires(timer),
base->offset);
- if (base->cpu_base->expires_next.tv64 == expires.tv64)
+
+ /* We only care about non deferrable timers here */
+ if (base->index <= HRTIMER_BASE_MONOTONIC_DEFERRABLE &&
+ base->cpu_base->expires_next.tv64 == expires.tv64)
hrtimer_force_reprogram(base->cpu_base, 1);
}
#endif
@@ -1151,7 +1183,7 @@ ktime_t hrtimer_get_next_event(void)
raw_spin_lock_irqsave(&cpu_base->lock, flags);

if (!hrtimer_hres_active()) {
- for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) {
+ for (i = 0; i < HRTIMER_BASE_MONOTONIC_DEFERRABLE; i++, base++) {
struct hrtimer *timer;
struct timerqueue_node *next;

@@ -1283,7 +1315,8 @@ void hrtimer_interrupt(struct clock_even
{
struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
ktime_t expires_next, now, entry_time, delta;
- int i, retries = 0;
+ unsigned long bases;
+ int retries = 0;

BUG_ON(!cpu_base->hres_active);
cpu_base->nr_events++;
@@ -1301,14 +1334,16 @@ retry:
* this CPU.
*/
cpu_base->expires_next.tv64 = KTIME_MAX;
+ bases = cpu_base->active_bases;

- for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+ while (bases) {
struct hrtimer_clock_base *base;
struct timerqueue_node *node;
ktime_t basenow;
+ int i;

- if (!(cpu_base->active_bases & (1 << i)))
- continue;
+ i = __ffs(bases);
+ bases &= ~(1 << i);

base = cpu_base->clock_base + i;
basenow = ktime_add(now, base->offset);
@@ -1338,7 +1373,8 @@ retry:
base->offset);
if (expires.tv64 < 0)
expires.tv64 = KTIME_MAX;
- if (expires.tv64 < expires_next.tv64)
+ if (expires.tv64 < expires_next.tv64 &&
+ base->index <= HRTIMER_BASE_MONOTONIC_DEFERRABLE)
expires_next = expires;
break;
}



2014-02-05 22:02:49

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 02/05/2014 01:41 PM, Thomas Gleixner wrote:
> On Wed, 5 Feb 2014, Alexey Perevalov wrote:
>> On 02/04/2014 08:10 PM, Thomas Gleixner wrote:
>>> On Mon, 27 Jan 2014, Alexey Perevalov wrote:
>>>> On 01/21/2014 11:12 PM, John Stultz wrote:
>>>>> Thomas: Any thought here? Should we be trying to unify the timerfd flags
>>>>> and the posix timer flags (specifically things like TIMER_CANCEL_ON_SET,
>>>>> which is currently timerfd-only)? Should a deferrable flag be added to
>>>>> the hrtimer core or left to the timer wheel?
>>> The timer cancel on set was added only to timerfd because timerfd is a
>>> non posix interface and we are halfways free to add stuff to
>>> it. Adding extra flags to the real posix timer interfaces is a
>>> different story.
>> And what about "deferrable" possibility for hrtimers, do you consider it
>> reasonable?
> In principle, I have no objections, but we need a proper technical
> solution. Just adding a flag and keeping the timers in the same rbtree
> like we do for the timer wheel timers is not going to happen.
>
> The only feasible solution is to have separate clock ids,
> e.g. CLOCK_*_DEFERRABLE, which would enable the deferrable
> functionality for all user space interfaces. No need for magic flags
> and complex search for non deferrable timers.

So of course, I was actually arguing against having a new clockid (which
was Alexey's first approach).

My reasoning was that the deferrablity isn't a clock domain, and is more
of a modifier. Thus to keep the interfaces somewhat sane (and avoiding
having to add N new clockids for each new modifier), we should utilize
the flag arguments to timers. So instead of just having TIMER_ABSTIME,
we could add TIMER_DEFER, etc, which we could utilize instead.

Internally we can still keep separate bases, much as your patch does, to
keep the next-event searching overhead more limited.

I mainly wanted to get your thoughts on extending the flags, and doing
so in a consistent manner between the timerfd and other timer interfaces.

Of course, all this is after I added the _ALARM clockids... so you can
decide if its hypocrisy or experience.
(The "old wisdom comes from experience and experience comes from bad
decisions" bit ;).

thanks
-john









2014-02-05 22:16:47

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On Wed, 5 Feb 2014, John Stultz wrote:
> On 02/05/2014 01:41 PM, Thomas Gleixner wrote:
> > On Wed, 5 Feb 2014, Alexey Perevalov wrote:
> >> On 02/04/2014 08:10 PM, Thomas Gleixner wrote:
> >>> On Mon, 27 Jan 2014, Alexey Perevalov wrote:
> >>>> On 01/21/2014 11:12 PM, John Stultz wrote:
> >>>>> Thomas: Any thought here? Should we be trying to unify the timerfd flags
> >>>>> and the posix timer flags (specifically things like TIMER_CANCEL_ON_SET,
> >>>>> which is currently timerfd-only)? Should a deferrable flag be added to
> >>>>> the hrtimer core or left to the timer wheel?
> >>> The timer cancel on set was added only to timerfd because timerfd is a
> >>> non posix interface and we are halfways free to add stuff to
> >>> it. Adding extra flags to the real posix timer interfaces is a
> >>> different story.
> >> And what about "deferrable" possibility for hrtimers, do you consider it
> >> reasonable?
> > In principle, I have no objections, but we need a proper technical
> > solution. Just adding a flag and keeping the timers in the same rbtree
> > like we do for the timer wheel timers is not going to happen.
> >
> > The only feasible solution is to have separate clock ids,
> > e.g. CLOCK_*_DEFERRABLE, which would enable the deferrable
> > functionality for all user space interfaces. No need for magic flags
> > and complex search for non deferrable timers.
>
> So of course, I was actually arguing against having a new clockid (which
> was Alexey's first approach).

Mooo.

> My reasoning was that the deferrablity isn't a clock domain, and is more
> of a modifier. Thus to keep the interfaces somewhat sane (and avoiding
> having to add N new clockids for each new modifier), we should utilize
> the flag arguments to timers. So instead of just having TIMER_ABSTIME,
> we could add TIMER_DEFER, etc, which we could utilize instead.

I can see the point. I have no objections against that approach as
long as we map that against separate internal bases.

> Internally we can still keep separate bases, much as your patch does, to
> keep the next-event searching overhead more limited.

It's not only more limited, it's bound.

> I mainly wanted to get your thoughts on extending the flags, and doing
> so in a consistent manner between the timerfd and other timer interfaces.

So the only interface which does not support that is sys_nanosleep()
but that's not really an issue. sys_nanosleep() should die anyway :)

> Of course, all this is after I added the _ALARM clockids... so you can
> decide if its hypocrisy or experience.
> (The "old wisdom comes from experience and experience comes from bad
> decisions" bit ;).

Well, you have a valid point about the clock ids. I did not realize in
the first place that we can avoid that business if we use the flags to
select the internal representation.

Either way is preferred over reintroducing the timer wheel mess....

Thanks,

tglx

2014-02-06 17:38:51

by Alexey Perevalov

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 02/06/2014 02:16 AM, Thomas Gleixner wrote:
> On Wed, 5 Feb 2014, John Stultz wrote:
>> On 02/05/2014 01:41 PM, Thomas Gleixner wrote:
>>> On Wed, 5 Feb 2014, Alexey Perevalov wrote:
>>>> On 02/04/2014 08:10 PM, Thomas Gleixner wrote:
>>>>> On Mon, 27 Jan 2014, Alexey Perevalov wrote:
>>>>>> On 01/21/2014 11:12 PM, John Stultz wrote:
>>>>>>> Thomas: Any thought here? Should we be trying to unify the timerfd flags
>>>>>>> and the posix timer flags (specifically things like TIMER_CANCEL_ON_SET,
>>>>>>> which is currently timerfd-only)? Should a deferrable flag be added to
>>>>>>> the hrtimer core or left to the timer wheel?
>>>>> The timer cancel on set was added only to timerfd because timerfd is a
>>>>> non posix interface and we are halfways free to add stuff to
>>>>> it. Adding extra flags to the real posix timer interfaces is a
>>>>> different story.
>>>> And what about "deferrable" possibility for hrtimers, do you consider it
>>>> reasonable?
>>> In principle, I have no objections, but we need a proper technical
>>> solution. Just adding a flag and keeping the timers in the same rbtree
>>> like we do for the timer wheel timers is not going to happen.
>>>
>>> The only feasible solution is to have separate clock ids,
>>> e.g. CLOCK_*_DEFERRABLE, which would enable the deferrable
>>> functionality for all user space interfaces. No need for magic flags
>>> and complex search for non deferrable timers.
>> So of course, I was actually arguing against having a new clockid (which
>> was Alexey's first approach).
> Mooo.
>
>> My reasoning was that the deferrablity isn't a clock domain, and is more
>> of a modifier. Thus to keep the interfaces somewhat sane (and avoiding
>> having to add N new clockids for each new modifier), we should utilize
>> the flag arguments to timers. So instead of just having TIMER_ABSTIME,
>> we could add TIMER_DEFER, etc, which we could utilize instead.
> I can see the point. I have no objections against that approach as
> long as we map that against separate internal bases.
>
>> Internally we can still keep separate bases, much as your patch does, to
>> keep the next-event searching overhead more limited.
> It's not only more limited, it's bound.
>
>> I mainly wanted to get your thoughts on extending the flags, and doing
>> so in a consistent manner between the timerfd and other timer interfaces.
> So the only interface which does not support that is sys_nanosleep()
> but that's not really an issue. sys_nanosleep() should die anyway :)
>
>> Of course, all this is after I added the _ALARM clockids... so you can
>> decide if its hypocrisy or experience.
>> (The "old wisdom comes from experience and experience comes from bad
>> decisions" bit ;).
> Well, you have a valid point about the clock ids. I did not realize in
> the first place that we can avoid that business if we use the flags to
> select the internal representation.
>
> Either way is preferred over reintroducing the timer wheel mess....
>
> Thanks,
>
> tglx
>

As I truly understand, you decided - flags is better than new clockids,
and internals of timerfd could be a mix of timer_list and hrtimer.
If so, it's in v2 patch set.

--
Best regards,
Alexey Perevalov

2014-02-06 17:47:21

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 02/06/2014 09:38 AM, Alexey Perevalov wrote:
> On 02/06/2014 02:16 AM, Thomas Gleixner wrote:
>> On Wed, 5 Feb 2014, John Stultz wrote:
>>> My reasoning was that the deferrablity isn't a clock domain, and is
>>> more
>>> of a modifier. Thus to keep the interfaces somewhat sane (and avoiding
>>> having to add N new clockids for each new modifier), we should utilize
>>> the flag arguments to timers. So instead of just having TIMER_ABSTIME,
>>> we could add TIMER_DEFER, etc, which we could utilize instead.
>> I can see the point. I have no objections against that approach as
>> long as we map that against separate internal bases.
>>
>>> Internally we can still keep separate bases, much as your patch
>>> does, to
>>> keep the next-event searching overhead more limited.
>> It's not only more limited, it's bound.
>>
>>> I mainly wanted to get your thoughts on extending the flags, and doing
>>> so in a consistent manner between the timerfd and other timer
>>> interfaces.
>> So the only interface which does not support that is sys_nanosleep()
>> but that's not really an issue. sys_nanosleep() should die anyway :)
>>
>>> Of course, all this is after I added the _ALARM clockids... so you can
>>> decide if its hypocrisy or experience.
>>> (The "old wisdom comes from experience and experience comes from bad
>>> decisions" bit ;).
>> Well, you have a valid point about the clock ids. I did not realize in
>> the first place that we can avoid that business if we use the flags to
>> select the internal representation.
>>
>> Either way is preferred over reintroducing the timer wheel mess....
>
> As I truly understand, you decided - flags is better than new
> clockids, and internals of timerfd could be a mix of timer_list and
> hrtimer.
> If so, it's in v2 patch set.

So, I think Thomas is suggesting we add new deferrable HRTIMER bases,
then the timerfd code would only use the hrtimers for non-alarm-timers.
We would then use the flag from the interface to decide internally which
base to add the hrtimer to. This would also allow us to use the flag via
non-timerfd interfaces to get the same result.

Does that clarify things?

thanks
-john



2014-02-06 20:50:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On Thu, 6 Feb 2014, Alexey Perevalov wrote:
> On 02/06/2014 02:16 AM, Thomas Gleixner wrote:
> As I truly understand, you decided - flags is better than new clockids, and
> internals of timerfd could be a mix of timer_list and hrtimer.

NO, NO, NO, NO. There is no mix of timer_list and hrtimer. Either we
add proper deferrable support to hrtimers or the whole thing is not
going to happen at all. There is no way that we bring back timer_list
to user space interfaces. Period.

Thanks,

tglx

2014-02-07 17:41:41

by Alexey Perevalov

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 02/07/2014 12:50 AM, Thomas Gleixner wrote:
> On Thu, 6 Feb 2014, Alexey Perevalov wrote:
>> On 02/06/2014 02:16 AM, Thomas Gleixner wrote:
>> As I truly understand, you decided - flags is better than new clockids, and
>> internals of timerfd could be a mix of timer_list and hrtimer.
> NO, NO, NO, NO. There is no mix of timer_list and hrtimer. Either we
> add proper deferrable support to hrtimers or the whole thing is not
> going to happen at all. There is no way that we bring back timer_list
> to user space interfaces. Period.
>
> Thanks,
>
> tglx
>
>
>
Ok, thank you, I'll work with you patch for hrtimers, and I'll rework
current patch set.


--
Best regards,
Alexey Perevalov

2014-02-16 15:20:57

by Alexey Perevalov

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 02/07/2014 12:50 AM, Thomas Gleixner wrote:
> On Thu, 6 Feb 2014, Alexey Perevalov wrote:
>> On 02/06/2014 02:16 AM, Thomas Gleixner wrote:
>> As I truly understand, you decided - flags is better than new clockids, and
>> internals of timerfd could be a mix of timer_list and hrtimer.
> NO, NO, NO, NO. There is no mix of timer_list and hrtimer. Either we
> add proper deferrable support to hrtimers or the whole thing is not
> going to happen at all. There is no way that we bring back timer_list
> to user space interfaces. Period.
>
> Thanks,
>
> tglx
>
>
>
Hello Thomas,
I checked you patch on system with CONFIG_HZ=250,
Here is result for hrtimer with 200ms interval
[291943.009557] time delta 209518762
[291943.205546] time delta 195995131
[291943.403012] time delta 197466387
[291943.610090] time delta 207071848
[291943.816138] time delta 206054438
[291944.000063] time delta 183918221
[291944.209290] time delta 209234301
[291944.405104] time delta 195813256
[291944.601507] time delta 196396567
[291944.816090] time delta 214589122
[291945.000125] time delta 184034925
[291945.214207] time delta 214075670

Maximum delay is 14 ms, but sometimes timer function is called early.

On the same system deferrable timer_list looks like:
[292214.336130] time delta 248064082
[292214.548061] time delta 211932930
[292214.804028] time delta 255964215
[292215.080160] time delta 276133570
[292215.328212] time delta 248043192
[292215.576241] time delta 248043616
[292215.796066] time delta 219818102
[292216.004040] time delta 207973662
[292216.260111] time delta 256069522
[292216.568160] time delta 308049741
[292216.816120] time delta 247962559
[292217.028062] time delta 211941369
[292217.264078] time delta 236013934

Maximum delay is 108 ms. And it looks like more predictable.

After I changed CONFIG_HZ=200,
hrtimer charged for 200ms had shown following result:

[ 187.548009] time delta 200360371
[ 187.724141] time delta 176132213
[ 187.826221] time delta 102085975
[ 188.008010] time delta 181788726
[ 188.312062] time delta 304051846
[ 188.705480] time delta 393417803
[ 188.800070] time delta 94583497
[ 189.000125] time delta 200055104
[ 189.224055] time delta 223937438
[ 189.400065] time delta 176009338
[ 189.672916] time delta 272851573
[ 189.828029] time delta 155112014
[ 190.000047] time delta 172018578

Now it has maximum delay 193 ms (393417803), but in other times it
triggers early.
Normal deferrable timer behaves as expected (time delta is 200ms).
These examples were made with MONOTONIC timers.

As I understand main idea in hrtimer.c was do not decrement expires_next
in case of DEFERRABLE timers type.
Such small average delay could be explained: it's due higher resolution,
and cpu is not in idle when we in hrtimer_interrupt,
with timer_list decrementing process not so often.
In this case it's hard to me to explain such small "time delta", it
occurs almost every time we have larger delay.

--
Best regards,
Alexey Perevalov

2014-02-16 15:39:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On Sun, 16 Feb 2014, Alexey Perevalov wrote:
> As I understand main idea in hrtimer.c was do not decrement expires_next in
> case of DEFERRABLE timers type.
> Such small average delay could be explained: it's due higher resolution, and
> cpu is not in idle when we in hrtimer_interrupt,
> with timer_list decrementing process not so often.
> In this case it's hard to me to explain such small "time delta", it occurs
> almost every time we have larger delay.

Well, the point of deferrable timers is that they get executed, when
the cpu is not idle, i.e. running some other timers as well

I did not test my patch and I have no idea whether it really does what
it should do, but tracing should tell you rather fast.

So w/o instrumenting the kernel you can't tell why a timer is
expired. Just looking at random numbers does not help. You need to
create a proper test scenario which makes sure that the system goes
into an extended nohz idle and then check whether the timers are
deferred over that idle time.

Thanks,

tglx

2014-02-17 14:15:21

by Alexey Perevalov

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 02/16/2014 07:39 PM, Thomas Gleixner wrote:
> On Sun, 16 Feb 2014, Alexey Perevalov wrote:
>> As I understand main idea in hrtimer.c was do not decrement expires_next in
>> case of DEFERRABLE timers type.
>> Such small average delay could be explained: it's due higher resolution, and
>> cpu is not in idle when we in hrtimer_interrupt,
>> with timer_list decrementing process not so often.
>> In this case it's hard to me to explain such small "time delta", it occurs
>> almost every time we have larger delay.
> Well, the point of deferrable timers is that they get executed, when
> the cpu is not idle, i.e. running some other timers as well
>
> I did not test my patch and I have no idea whether it really does what
> it should do, but tracing should tell you rather fast.
>
> So w/o instrumenting the kernel you can't tell why a timer is
> expired. Just looking at random numbers does not help. You need to
> create a proper test scenario which makes sure that the system goes
> into an extended nohz idle and then check whether the timers are
> deferred over that idle time.
>
> Thanks,
>
> tglx
>
>
My system is NO_HZ=ON, I'm sure and timer_list results is proof it.
Regarding to previous result, yes, you right the rootcause of such
insane deviation was a glitch in logging (but only for really big
deviation).

I traced it by ftrace.
Bellow is trace-cmd report for HRTIMER_BASE_MONOTONIC_DEFERRABLE.
(it's only part, but it's long part, summary right after the log)

<idle>-0 [000] 339.148371: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2154000086890 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.148391: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2154300000000 softexpires=2154300000000
<idle>-0 [000] 339.157821: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2154328094166 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.157837: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2154600000000 softexpires=2154600000000
<idle>-0 [000] 339.170895: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2154632081099 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.170915: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2154900000000 softexpires=2154900000000
<idle>-0 [000] 339.180459: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2154912090729 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.180475: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2155200000000 softexpires=2155200000000
<idle>-0 [000] 339.204672: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2155228092094 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.204694: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2155500000000 softexpires=2155500000000
<idle>-0 [000] 339.213807: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2155532081987 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.213823: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2155800000000 softexpires=2155800000000
<idle>-0 [000] 339.226986: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2155816087467 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.227007: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2156100000000 softexpires=2156100000000
<idle>-0 [000] 339.272420: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2156101736770 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.272438: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2156400000000 softexpires=2156400000000
<idle>-0 [000] 339.286856: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2156432078659 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.286877: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2156700000000 softexpires=2156700000000
<idle>-0 [000] 339.363097: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2156728086117 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.363122: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2157000000000 softexpires=2157000000000
<idle>-0 [000] 339.411552: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2157000091028 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.411576: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2157300000000 softexpires=2157300000000
<idle>-0 [000] 339.492448: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2157328089996 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.492471: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2157600000000 softexpires=2157600000000
<idle>-0 [000] 339.526305: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2157600425537 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.526341: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2157900000000 softexpires=2157900000000
<idle>-0 [000] 339.572977: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2157928085502 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.572997: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2158200000000 softexpires=2158200000000
Xorg-1166 [000] 339.658862: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2158200013165 function=timerfd_tmrproc/0x0
Xorg-1166 [000] 339.658893: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2158500000000 softexpires=2158500000000
<idle>-0 [000] 339.694244: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2158532090801 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.694263: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2158800000000 softexpires=2158800000000
<idle>-0 [000] 339.769293: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2158816038362 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.769316: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2159100000000 softexpires=2159100000000
<idle>-0 [000] 339.808105: hrtimer_expire_entry:
hrtimer=0xffffffffa04a9280 now=2159122612405 function=timerfd_tmrproc/0x0
<idle>-0 [000] 339.808127: hrtimer_start:
hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc
expires=2159400000000 softexpires=2159400000000

Here is the summary:

now value | difference with previous value
2159122612405|
2158816038362| 306574043
2158532090801| 283947561
2158200013165| 332077636
2157928085502| 271927663
2157600425537| 327659965
2157328089996| 272335541
2157000091028| 327998968
2156728086117| 272004911
2156432078659| 296007458
2156101736770| 330341889
2155816087467| 285649303
2155532081987| 284005480
2155228092094| 303989893
2154912090729| 316001365
2154632081099| 280009630
2154328094166| 303986933
2154000086890| 328007276

Timer was charged with 300ms interval. And we see difference in neighborhood
of point 300 000 000 - 300ms. The average value as we can see:
301325030.294118
- the average delay is 1 ms.

Here is the summary for timer_list, timer is charged with 200 ms interval:

now value | difference with previous value
4295653665|
4295653615| 200000000
4295653563| 208000000
4295653500| 252000000
4295653438| 248000000
4295653365| 292000000
4295653313| 208000000
4295653248| 260000000
4295653198| 200000000
4295653140| 232000000
4295653090| 200000000
4295653038| 208000000
4295652988| 200000000

Average interval is 225 666 666.666667, 25ms delay, it's better.
And for 300ms timer_list shows following result:

4298170712| 300000000
4298170637| 300000000
4298170562| 300000000
4298170483| 316000000
4298170387| 384000000
4298170287| 400000000
4298170212| 300000000
4298170129| 332000000
4298170053| 304000000
4298169978| 300000000
4298169903| 300000000
4298169828| 300000000

Average delay here is 20ms.
------------------------------------------------------------------------
It's log for 300ms timer_list, you could skip it
------------------------------------------------------------------------
<idle>-0 [000] 1914.815152: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298169828
<idle>-0 [000] 1914.815183: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298169903
[timeout=75]
<idle>-0 [000] 1914.862357: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298169903
<idle>-0 [000] 1914.862378: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298169978
[timeout=75]
firefox-2898 [000] 1914.901656: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298169978
firefox-2898 [000] 1914.901680: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298170053
[timeout=75]
<idle>-0 [000] 1914.946591: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298170053
<idle>-0 [000] 1914.946625: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298170128
[timeout=75]
<idle>-0 [000] 1914.994114: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298170129
<idle>-0 [000] 1914.994137: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298170204
[timeout=75]
<idle>-0 [000] 1915.035696: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298170212
<idle>-0 [000] 1915.035719: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298170287
[timeout=75]
<idle>-0 [000] 1915.082886: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298170287
<idle>-0 [000] 1915.082920: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298170362
[timeout=75]
<idle>-0 [000] 1915.141888: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298170387
<idle>-0 [000] 1915.141923: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298170462
[timeout=75]
soffice.bin-3237 [000] 1915.201642: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298170483
soffice.bin-3237 [000] 1915.201665: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298170558
[timeout=75]
<idle>-0 [000] 1915.264248: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298170562
<idle>-0 [000] 1915.264271: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298170637
[timeout=75]
<idle>-0 [000] 1915.309599: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298170637
<idle>-0 [000] 1915.309628: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298170712
[timeout=75]
<idle>-0 [000] 1915.354971: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298170712
<idle>-0 [000] 1915.354986: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298170787
[timeout=75]
<idle>-0 [000] 1915.398761: timer_expire_entry:
timer=0xffffffffa05c2280 function=test_def_timer_fn now=4298170787
<idle>-0 [000] 1915.398794: timer_start:
timer=0xffffffffa05c2280 function=test_def_timer_fn expires=4298170862
[timeout=75]


Maybe it's due cpuidle implementation or clock_device on my system. It
was MENU cpuidle governor.


--
Best regards,
Alexey Perevalov

2014-02-18 19:43:17

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On Mon, 17 Feb 2014, Alexey Perevalov wrote:
> On 02/16/2014 07:39 PM, Thomas Gleixner wrote:

> Here is the summary:
>
> now value | difference with previous value
> 2159122612405|
> 2158816038362| 306574043
> 2158532090801| 283947561
> 2158200013165| 332077636
> 2157928085502| 271927663
> 2157600425537| 327659965
> 2157328089996| 272335541
> 2157000091028| 327998968
> 2156728086117| 272004911
> 2156432078659| 296007458
> 2156101736770| 330341889
> 2155816087467| 285649303
> 2155532081987| 284005480
> 2155228092094| 303989893
> 2154912090729| 316001365
> 2154632081099| 280009630
> 2154328094166| 303986933
> 2154000086890| 328007276
>
> Timer was charged with 300ms interval. And we see difference in neighborhood
> of point 300 000 000 - 300ms. The average value as we can see:
> 301325030.294118
> - the average delay is 1 ms.

Math is hard, right?

Lets look at the trace data:

> hrtimer=0xffffffffa04a9280 now=2154000086890 function=timerfd_tmrproc/0x0
> <idle>-0 [000] 339.148391: hrtimer_start:
> hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc expires=2154300000000
> softexpires=2154300000000

expires=2154300000000

> <idle>-0 [000] 339.157821: hrtimer_expire_entry:
> hrtimer=0xffffffffa04a9280 now=2154328094166 function=timerfd_tmrproc/0x0

now=2154328094166

-> delay = 2154328094166 - 2154300000000 = 28094166 = 28ms


> <idle>-0 [000] 339.157837: hrtimer_start:
> hrtimer=0xffffffffa04a9280 function=timerfd_tmrproc expires=2154600000000
> softexpires=2154600000000

expires=2154600000000

> <idle>-0 [000] 339.170895: hrtimer_expire_entry:
> hrtimer=0xffffffffa04a9280 now=2154632081099 function=timerfd_tmrproc/0x0

now=2154632081099

-> delay = 2154632081099 - 2154600000000 = 32081099 = 32ms

You cannot sum up the delta between the expiry times when you use an
timer on an absolute time line. You need to look at the delta between
the programmed expiry time and the actual expiry time. That tells you
the deferred time. And running your excerpt through a filter tells me:

Delta: 28.0ms
Delta: 32.0ms
Delta: 12.0ms
Delta: 28.0ms
Delta: 32.0ms
Delta: 16.0ms
Delta: 1.0ms
Delta: 32.0ms
Delta: 28.0ms
Delta: 0.0ms
Delta: 28.0ms
Delta: 0.0ms
Delta: 28.0ms
Delta: 0.0ms
Delta: 32.0ms
Delta: 16.0ms
Delta: 22.0ms

Avg: 19.0ms

> 4298170712| 300000000
> 4298170637| 300000000
> 4298170562| 300000000
> 4298170483| 316000000
> 4298170387| 384000000
> 4298170287| 400000000
> 4298170212| 300000000
> 4298170129| 332000000
> 4298170053| 304000000
> 4298169978| 300000000
> 4298169903| 300000000
> 4298169828| 300000000
>
> Average delay here is 20ms.

Your trace tells a different story:

Delta: 0.0ms
Delta: 0.0ms
Delta: 0.0ms
Delta: 1.0ms
Delta: 8.0ms
Delta: 0.0ms
Delta: 25.0ms
Delta: 21.0ms
Delta: 4.0ms
Delta: 0.0ms
Delta: 0.0ms
Delta: 0.0ms

Avg: 4.0ms

> Maybe it's due cpuidle implementation or clock_device on my system. It was
> MENU cpuidle governor.

I fear it's something else ...

Thanks,

tglx

2014-02-18 19:48:39

by Alexey Perevalov

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 02/16/2014 07:39 PM, Thomas Gleixner wrote:
> On Sun, 16 Feb 2014, Alexey Perevalov wrote:
>> As I understand main idea in hrtimer.c was do not decrement expires_next in
>> case of DEFERRABLE timers type.
>> Such small average delay could be explained: it's due higher resolution, and
>> cpu is not in idle when we in hrtimer_interrupt,
>> with timer_list decrementing process not so often.
>> In this case it's hard to me to explain such small "time delta", it occurs
>> almost every time we have larger delay.
> Well, the point of deferrable timers is that they get executed, when
> the cpu is not idle, i.e. running some other timers as well
>
> I did not test my patch and I have no idea whether it really does what
> it should do, but tracing should tell you rather fast.
>
> So w/o instrumenting the kernel you can't tell why a timer is
> expired. Just looking at random numbers does not help. You need to
> create a proper test scenario which makes sure that the system goes
> into an extended nohz idle and then check whether the timers are
> deferred over that idle time.
>
> Thanks,
>
> tglx
>
>


Dear Thomas,

I figured out with deviation, I described before.

It was due expires and especially softexpires is fixed (don't base on
delay).

For example if we have a timer like this:
hrtimer_expire_entry: hrtimer=ffffffffa056f280 function=timerfd_tmrproc
[hrtimers_mod] now=191450988244 expires=191400000000
softexpires=191400000000
It was fired at 191450988244, but softexpire is 191400000000, 50ms
delay, if I'm not wrong.
Next trigger time is 191700000000, (hrtimer_start:
hrtimer=ffffffffa056f280 function=timerfd_tmrproc [hrtimers_mod]
expires=191700000000 softexpires=191700000000)
and if there is no cpu idle at next time, we'll get 250ms interval for
such timer. But we want 300ms or more for DEFERRABLE timer.

Thomas what do you think about moving format expire/softexpire to
_!now!_ in run_hrtimer, right before we
invoke callback function? The prolongation of hrtimer usually comes from
user timer functions by
invoking hrtimer_forward, which moves expires/softexpires forward.


+static void __run_hrtimer(struct hrtimer *timer, ktime_t *now, int
deferrable)
{
struct hrtimer_clock_base *base = timer->base;
struct hrtimer_cpu_base *cpu_base = base->cpu_base;
@@ -1286,8 +1286,13 @@ static void __run_hrtimer(struct hrtimer *timer,
ktime_t *now)
* the timer base.
*/
raw_spin_unlock(&cpu_base->lock);
- trace_hrtimer_expire_entry(timer, now);
+ trace_hrtimer_expire_entry(timer, now, 0);
+
+ if (deferrable)
+ hrtimer_set_expires(timer, *now);
restart = fn(timer);


I got expected results (timer interval is 300ms):
[ 247.251609] time delta 315563612
[ 247.652398] time delta 400788785
[ 248.000169] time delta 347764697
[ 248.372070] time delta 371900444
[ 248.800175] time delta 428105238
[ 249.144178] time delta 344002733
[ 249.455920] time delta 311748775
[ 249.804166] time delta 348245676
[ 250.136165] time delta 331999061
[ 250.458606] time delta 322441758
[ 250.804184] time delta 345571829

Also I tried move just _softexpire, but timer function in this case
wasn't called at all.


--
Best regards,
Alexey Perevalov

2014-02-18 22:33:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On Tue, 18 Feb 2014, Alexey Perevalov wrote:
> On 02/16/2014 07:39 PM, Thomas Gleixner wrote:
> I figured out with deviation, I described before.

Which is wrong to begin with. Using the wrong method does not justify
the results. Are you actually trying to understand what I'm saying?

> It was due expires and especially softexpires is fixed (don't base on delay).

That's how an interval timer is supposed to work by definition. End of
discussion.

> For example if we have a timer like this:
> hrtimer_expire_entry: hrtimer=ffffffffa056f280 function=timerfd_tmrproc
> [hrtimers_mod] now=191450988244 expires=191400000000 softexpires=191400000000
> It was fired at 191450988244, but softexpire is 191400000000, 50ms delay, if
> I'm not wrong.
> Next trigger time is 191700000000, (hrtimer_start: hrtimer=ffffffffa056f280
> function=timerfd_tmrproc [hrtimers_mod] expires=191700000000
> softexpires=191700000000)
> and if there is no cpu idle at next time, we'll get 250ms interval for such
> timer.

This is complete nonsense. You schedule your hrtimer on an absolute
timeline:

191400000000
191700000000
...

So it's supposed to fire every 300ms, but it is allowed to fire later
when the system is idle. And that's what it does. If the system would
be idle for 10.3 seconds from the point where the timer is started
then it would expire the first timer at

191400000000 + 10 sec = 201400000000

and then schedule the next one at

201400000000 + 300ms = 201700000000

So if your system is not idle the timer can be expired. That's the
same for deferrable timer list timers. We expire them when a non
deferrable timer fires.

And you do the same for your timer list timer according to your trace:

expires=4298169903
expires=4298169978
expires=4298170053
expires=4298170128
expires=4298170204
expires=4298170287
expires=4298170362
expires=4298170462
expires=4298170558
expires=4298170637
expires=4298170712
expires=4298170787
expires=4298170862

The delta is always 75 ticks. And the expiry times are

now=4298169903
now=4298169978
now=4298170053
now=4298170129
now=4298170212
now=4298170287
now=4298170387
now=4298170483
now=4298170562
now=4298170637
now=4298170712
now=4298170787

Which results in the deferrements:

Delta: 0.0ms
Delta: 0.0ms
Delta: 0.0ms
Delta: 1.0ms
Delta: 8.0ms
Delta: 0.0ms
Delta: 25.0ms
Delta: 21.0ms
Delta: 4.0ms
Delta: 0.0ms
Delta: 0.0ms
Delta: 0.0ms

Avg: 4.0ms

And why? Because you scheduled your timer along an absolute
timeline. And if you use an absolute timeline, then the deltas between
the actual timer events are completely irrelevant. The only thing what
matters is the delta between the expected and the real expiry time.

Is it really that hard to understand?

> But we want 300ms or more for DEFERRABLE timer.

And I want a pony!

If you want that then simply setup the timer in relative oneshot mode,
i.e. interval = 0 and when it expires (deferred) rearm it relative to
now from user space. Then you get exactly the behaviour you want. It's
that simple, really.

> Thomas what do you think about moving format expire/softexpire to _!now!_ in
> run_hrtimer, right before we
> invoke callback function? The prolongation of hrtimer usually comes from user
> timer functions by
> invoking hrtimer_forward, which moves expires/softexpires forward.

You really don't want to know what I think about that.

> + trace_hrtimer_expire_entry(timer, now, 0);
> +
> + if (deferrable)
> + hrtimer_set_expires(timer, *now);
> restart = fn(timer);
>
>
> I got expected results (timer interval is 300ms):

So you got your pony. But it's your private pony and it stays that
way, because you made the timer interval relative. And you managed to
do that in the most disgusting way.

In course of that you broke the behaviour of the existing user space
interfaces. You can do so in your own hackery, but it's not going to
go near mainline.

Read and understand:

man timer_create
man timerfd

along with the relevant standards.

And if you need further education feel free to get a lecture from
Linus about user space interfaces or google one if you really want to
know how that works out.

We are not going to special case that deferrable stuff, simply because
it breaks the user space interfaces and you can solve your issue with
the existing user space interfaces already.

There is a simple solution for the problem. You just need to
understand what you try to solve and use the proper mechanisms. And
don't tell me you can't do that, because you need to modify your user
space code anyway as CLOCK*DEFERRABLE does not exist yet.

Just because you can do it in the kernel does not mean that it is the
correct approach.

Thanks,

tglx

2014-02-19 07:09:04

by Alexey Perevalov

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] Deferrable timers support for timerfd API

On 02/19/2014 02:33 AM, Thomas Gleixner wrote:
> On Tue, 18 Feb 2014, Alexey Perevalov wrote:
>> On 02/16/2014 07:39 PM, Thomas Gleixner wrote:
>> I figured out with deviation, I described before.
> Which is wrong to begin with. Using the wrong method does not justify
> the results. Are you actually trying to understand what I'm saying?
>
>> It was due expires and especially softexpires is fixed (don't base on delay).
> That's how an interval timer is supposed to work by definition. End of
> discussion.
>
>> For example if we have a timer like this:
>> hrtimer_expire_entry: hrtimer=ffffffffa056f280 function=timerfd_tmrproc
>> [hrtimers_mod] now=191450988244 expires=191400000000 softexpires=191400000000
>> It was fired at 191450988244, but softexpire is 191400000000, 50ms delay, if
>> I'm not wrong.
>> Next trigger time is 191700000000, (hrtimer_start: hrtimer=ffffffffa056f280
>> function=timerfd_tmrproc [hrtimers_mod] expires=191700000000
>> softexpires=191700000000)
>> and if there is no cpu idle at next time, we'll get 250ms interval for such
>> timer.
> This is complete nonsense. You schedule your hrtimer on an absolute
> timeline:
>
> 191400000000
> 191700000000
> ...
>
> So it's supposed to fire every 300ms, but it is allowed to fire later
> when the system is idle. And that's what it does. If the system would
> be idle for 10.3 seconds from the point where the timer is started
> then it would expire the first timer at
>
> 191400000000 + 10 sec = 201400000000
>
> and then schedule the next one at
>
> 201400000000 + 300ms = 201700000000
>
> So if your system is not idle the timer can be expired. That's the
> same for deferrable timer list timers. We expire them when a non
> deferrable timer fires.
>
> And you do the same for your timer list timer according to your trace:
>
> expires=4298169903
> expires=4298169978
> expires=4298170053
> expires=4298170128
> expires=4298170204
> expires=4298170287
> expires=4298170362
> expires=4298170462
> expires=4298170558
> expires=4298170637
> expires=4298170712
> expires=4298170787
> expires=4298170862
>
> The delta is always 75 ticks. And the expiry times are
>
> now=4298169903
> now=4298169978
> now=4298170053
> now=4298170129
> now=4298170212
> now=4298170287
> now=4298170387
> now=4298170483
> now=4298170562
> now=4298170637
> now=4298170712
> now=4298170787
>
> Which results in the deferrements:
>
> Delta: 0.0ms
> Delta: 0.0ms
> Delta: 0.0ms
> Delta: 1.0ms
> Delta: 8.0ms
> Delta: 0.0ms
> Delta: 25.0ms
> Delta: 21.0ms
> Delta: 4.0ms
> Delta: 0.0ms
> Delta: 0.0ms
> Delta: 0.0ms
>
> Avg: 4.0ms
>
> And why? Because you scheduled your timer along an absolute
> timeline. And if you use an absolute timeline, then the deltas between
> the actual timer events are completely irrelevant. The only thing what
> matters is the delta between the expected and the real expiry time.
>
> Is it really that hard to understand?
>
>> But we want 300ms or more for DEFERRABLE timer.
> And I want a pony!
>
> If you want that then simply setup the timer in relative oneshot mode,
> i.e. interval = 0 and when it expires (deferred) rearm it relative to
> now from user space. Then you get exactly the behaviour you want. It's
> that simple, really.
>
>> Thomas what do you think about moving format expire/softexpire to _!now!_ in
>> run_hrtimer, right before we
>> invoke callback function? The prolongation of hrtimer usually comes from user
>> timer functions by
>> invoking hrtimer_forward, which moves expires/softexpires forward.
> You really don't want to know what I think about that.
>
>> + trace_hrtimer_expire_entry(timer, now, 0);
>> +
>> + if (deferrable)
>> + hrtimer_set_expires(timer, *now);
>> restart = fn(timer);
>>
>>
>> I got expected results (timer interval is 300ms):
> So you got your pony. But it's your private pony and it stays that
> way, because you made the timer interval relative. And you managed to
> do that in the most disgusting way.
>
> In course of that you broke the behaviour of the existing user space
> interfaces. You can do so in your own hackery, but it's not going to
> go near mainline.
>
> Read and understand:
>
> man timer_create
> man timerfd
>
> along with the relevant standards.
>
> And if you need further education feel free to get a lecture from
> Linus about user space interfaces or google one if you really want to
> know how that works out.
>
> We are not going to special case that deferrable stuff, simply because
> it breaks the user space interfaces and you can solve your issue with
> the existing user space interfaces already.
>
> There is a simple solution for the problem. You just need to
> understand what you try to solve and use the proper mechanisms. And
> don't tell me you can't do that, because you need to modify your user
> space code anyway as CLOCK*DEFERRABLE does not exist yet.
In that the whole point CLOCK*DEFERRABLE doesn't exist yet, and such
reset of timer's expire didn't affect another clockids. Ok no problem,
to set new expire time based on _now_ from user space.

>
> Just because you can do it in the kernel does not mean that it is the
> correct approach.
>
> Thanks,
>
> tglx
>


--
Best regards,
Alexey Perevalov