2009-07-17 10:11:27

by Xiao Guangrong

[permalink] [raw]
Subject: [PATCH v3 0/4] ftrace: add tracepoint for timer event

Hi,

We can monitor timer's lifecycle and behaviors by these tracepoint.

Following tracepoints are introduced as ingo's suggestion in
http://marc.info/?l=linux-kernel&m=123791899529128&w=2

for timer:
trace_timer_init()
trace_timer_start()
trace_timer_expire()
trace_timer_cancel()

for hrtimer:
trace_hrtimer_init()
trace_hrtimer_start()
trace_hrtimer_expire()
trace_hrtimer_cancel()

for itimer:
itimer_state
itimer_expire()

Changlog from latest patchset:
1: ours patches are base on Anton Blanchard's patch which can been found
at:
http://marc.info/?l=linux-kernel&m=124331396919301&w=2
but this patch is can't apply on Ingo's tip tree, so we rebase this
patch against latest tip tree

2: fix those patches address Thomas's comment


2009-07-17 10:15:07

by Xiao Guangrong

[permalink] [raw]
Subject: [PATCH v3 1/4] tracing/events: Add timer and high res timer tracepoints

From: Anton Blanchard <[email protected]>

Add tracepoints for timer and high res timer execution. We add entry and
exit tracepoints so we can calculate timer latency.

Example ftrace output:

<idle>-0 [000] 264.040506: hrtimer_entry: func=.tick_sched_timer
<idle>-0 [000] 264.040508: hrtimer_exit: func=.tick_sched_timer restart=HRTIMER_RESTART
<idle>-0 [000] 264.040530: timer_entry: func=.e1000_watchdog
<idle>-0 [000] 264.040728: timer_exit: func=.e1000_watchdog

Here we can see e1000_watchdog is taking 0.2ms - it might make sense to
move this into a workqueue or kernel thread.

Changelog v1->v2:
1: Rebase this patch against latest tip tree by Xiao Guangrong.
2: Move TRACE_SYSTEM out of #if

Signed-off-by: Anton Blanchard <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xiao Guangrong <[email protected]>
---
include/trace/events/timer.h | 116 ++++++++++++++++++++++++++++++++++++++++++
kernel/hrtimer.c | 3 +
kernel/timer.c | 5 ++
3 files changed, 124 insertions(+), 0 deletions(-)
create mode 100644 include/trace/events/timer.h

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
new file mode 100644
index 0000000..ff2754a
--- /dev/null
+++ b/include/trace/events/timer.h
@@ -0,0 +1,116 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM timer
+
+#if !defined(_TRACE_TIMER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TIMER_H
+
+#include <linux/tracepoint.h>
+#include <linux/timer.h>
+#include <linux/hrtimer.h>
+
+/**
+ * timer_entry - called immediately before the timer callback
+ * @timer: pointer to struct timer_list
+ *
+ * When used in combination with the timer_exit tracepoint we can
+ * determine the timer latency.
+ **/
+TRACE_EVENT(timer_entry,
+
+ TP_PROTO(struct timer_list *timer),
+
+ TP_ARGS(timer),
+
+ TP_STRUCT__entry(
+ __field(void *, function)
+ ),
+
+ TP_fast_assign(
+ __entry->function = timer->function;
+ ),
+
+ TP_printk("func=%pf", __entry->function)
+);
+
+/**
+ * timer_exit - called immediately after the timer returns
+ * @timer: pointer to struct timer_list
+ *
+ * When used in combination with the timer_entry tracepoint we can
+ * determine the timer latency.
+ **/
+TRACE_EVENT(timer_exit,
+
+ TP_PROTO(struct timer_list *timer),
+
+ TP_ARGS(timer),
+
+ TP_STRUCT__entry(
+ __field(void *, function)
+ ),
+
+ TP_fast_assign(
+ __entry->function = timer->function;
+ ),
+
+ TP_printk("func=%pf", __entry->function)
+);
+
+/**
+ * hrtimer_entry - called immediately before the high res timer callback
+ * @timer: pointer to struct hrtimer
+ *
+ * When used in combination with the hrtimer_exit tracepoint we can
+ * determine the high res timer latency.
+ **/
+TRACE_EVENT(hrtimer_entry,
+
+ TP_PROTO(struct hrtimer *timer),
+
+ TP_ARGS(timer),
+
+ TP_STRUCT__entry(
+ __field(void *, function)
+ ),
+
+ TP_fast_assign(
+ __entry->function = timer->function;
+ ),
+
+ TP_printk("func=%pf", __entry->function)
+);
+
+/**
+ * hrtimer_exit - called immediately after the high res timer returns
+ * @timer: pointer to struct hrtimer
+ * @restart: high res timer return value
+ *
+ * High res timer will restart if @restart is set to HRTIMER_RESTART.
+ * When used in combination with the hrtimer_entry tracepoint we can
+ * determine the high res timer latency.
+ **/
+TRACE_EVENT(hrtimer_exit,
+
+ TP_PROTO(struct hrtimer *timer, int restart),
+
+ TP_ARGS(timer, restart),
+
+ TP_STRUCT__entry(
+ __field(void *, function)
+ __field(int, restart)
+ ),
+
+ TP_fast_assign(
+ __entry->function = timer->function;
+ __entry->restart = restart;
+ ),
+
+ TP_printk("func=%pf restart=%s", __entry->function,
+ (__entry->restart == HRTIMER_RESTART) ?
+ "HRTIMER_RESTART" : "HRTIMER_NORESTART")
+);
+
+#endif /* _TRACE_TIMER_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index ab5eb70..33317e4 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -47,6 +47,7 @@
#include <linux/timer.h>

#include <asm/uaccess.h>
+#include <trace/events/timer.h>

/*
* The timer bases:
@@ -1161,7 +1162,9 @@ static void __run_hrtimer(struct hrtimer *timer)
* the timer base.
*/
spin_unlock(&cpu_base->lock);
+ trace_hrtimer_entry(timer);
restart = fn(timer);
+ trace_hrtimer_exit(timer, restart);
spin_lock(&cpu_base->lock);

/*
diff --git a/kernel/timer.c b/kernel/timer.c
index 0b36b9e..9385ad5 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -46,6 +46,9 @@
#include <asm/timex.h>
#include <asm/io.h>

+#define CREATE_TRACE_POINTS
+#include <trace/events/timer.h>
+
u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;

EXPORT_SYMBOL(jiffies_64);
@@ -984,7 +987,9 @@ static inline void __run_timers(struct tvec_base *base)
*/
lock_map_acquire(&lockdep_map);

+ trace_timer_entry(timer);
fn(data);
+ trace_timer_exit(timer);

lock_map_release(&lockdep_map);

--
1.6.1.2

2009-07-17 10:17:03

by Xiao Guangrong

[permalink] [raw]
Subject: [PATCH v3 2/4] ftrace: add tracepoint for timer

This patch is modify from Mathieu's patch base on Ingo's suggestion, the
original patch can be found here:
http://marc.info/?l=linux-kernel&m=123791201816247&w=2

Great thanks to Thomas for giving me so many valuable advices.

Those tracepoints are wanted and useful:
1: We can detect a timer's delay
2: We can monitor the lifecycle and behaviors of a timer

Thus they help in analysing and debuging.

Example ftrace output:
insmod-1712 [000] 145.901459: timer_init: timer=d0ace4b4
insmod-1712 [000] 145.901469: timer_start: timer=d0ace4b4 func=timer_fun expires=4294813629 timeout=199
<idle>-0 [000] 146.100230: timer_cancel: timer=d0ace4b4
<idle>-0 [000] 146.100233: timer_expire: timer=d0ace4b4 jiffies=4294813629
<idle>-0 [000] 146.100627: timer_callback_done: func=d0ace4b4

We expect the timer expires at 4294813629, after 199 jiffies, actually the
timer expires at 4294813629, so it is delayed by 0 jiffies.
We also realize the timer's callback started at 146.100233, and it finished
at 146.100627, so it's taking 146.100627-146.100233=0.4ms.

Changelog:
v1->v2:
1: Add jiffies info into timer
2: Combine debugobjects and trace as Thomas's suggestion
v2->v3:
1: Remove function address from timer_expire and timer_cancel
as Thomas's suggestion
2: Remove debug_and_trace_timer_expire() as Thomas's suggestion
3: Rename trace_timer_entry() and trace_timer_exit() to match this patch

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Xiao Guangrong <[email protected]>
---
include/trace/events/timer.h | 84 ++++++++++++++++++++++++++++++++++++------
kernel/timer.c | 31 ++++++++++++---
2 files changed, 97 insertions(+), 18 deletions(-)

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index ff2754a..967e225 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -8,52 +8,112 @@
#include <linux/timer.h>
#include <linux/hrtimer.h>

+TRACE_EVENT(timer_init,
+
+ TP_PROTO(struct timer_list *timer),
+
+ TP_ARGS(timer),
+
+ TP_STRUCT__entry(
+ __field( void *, timer )
+ ),
+
+ TP_fast_assign(
+ __entry->timer = timer;
+ ),
+
+ TP_printk("timer=%p", __entry->timer)
+);
+
+TRACE_EVENT(timer_start,
+
+ TP_PROTO(struct timer_list *timer, unsigned long expires),
+
+ TP_ARGS(timer, expires),
+
+ TP_STRUCT__entry(
+ __field( void *, timer )
+ __field( void *, function )
+ __field( unsigned long, expires )
+ __field( int, timeout )
+ ),
+
+ TP_fast_assign(
+ __entry->timer = timer;
+ __entry->function = timer->function;
+ __entry->expires = expires;
+ __entry->timeout = timer->expires - jiffies;
+ ),
+
+ TP_printk("timer=%p func=%pf expires=%lu timeout=%d", __entry->timer,
+ __entry->function, __entry->expires, __entry->timeout)
+);
+
/**
- * timer_entry - called immediately before the timer callback
+ * timer_expire - called immediately before the timer callback
* @timer: pointer to struct timer_list
*
- * When used in combination with the timer_exit tracepoint we can
+ * When used in combination with the timer_callback_done tracepoint we can
* determine the timer latency.
**/
-TRACE_EVENT(timer_entry,
+TRACE_EVENT(timer_expire,

TP_PROTO(struct timer_list *timer),

TP_ARGS(timer),

TP_STRUCT__entry(
- __field(void *, function)
+ __field( void *, timer )
+ __field( unsigned long, jiffies )
),

TP_fast_assign(
- __entry->function = timer->function;
+ __entry->timer = timer;
+ __entry->jiffies = jiffies;
),

- TP_printk("func=%pf", __entry->function)
+ TP_printk("timer=%p jiffies=%lu", __entry->timer, __entry->jiffies)
);

/**
- * timer_exit - called immediately after the timer returns
+ * timer_callback_done - called immediately after the timer returns
* @timer: pointer to struct timer_list
*
- * When used in combination with the timer_entry tracepoint we can
+ * When used in combination with the timer_expire tracepoint we can
* determine the timer latency.
**/
-TRACE_EVENT(timer_exit,
+TRACE_EVENT(timer_callback_done,

TP_PROTO(struct timer_list *timer),

TP_ARGS(timer),

TP_STRUCT__entry(
- __field(void *, function)
+ __field(void *, timer )
),

TP_fast_assign(
- __entry->function = timer->function;
+ __entry->timer = timer;
),

- TP_printk("func=%pf", __entry->function)
+ TP_printk("func=%p", __entry->timer)
+);
+
+TRACE_EVENT(timer_cancel,
+
+ TP_PROTO(struct timer_list *timer),
+
+ TP_ARGS(timer),
+
+ TP_STRUCT__entry(
+ __field( void *, timer )
+ ),
+
+ TP_fast_assign(
+ __entry->timer = timer;
+ ),
+
+ TP_printk("timer=%p", __entry->timer)
);

/**
diff --git a/kernel/timer.c b/kernel/timer.c
index 9385ad5..0f53710 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -523,6 +523,25 @@ static inline void debug_timer_activate(struct timer_list *timer) { }
static inline void debug_timer_deactivate(struct timer_list *timer) { }
#endif

+static inline void debug_and_trace_timer_init(struct timer_list *timer)
+{
+ debug_timer_init(timer);
+ trace_timer_init(timer);
+}
+
+static inline void
+debug_and_trace_timer_activate(struct timer_list *timer, unsigned long expires)
+{
+ debug_timer_activate(timer);
+ trace_timer_start(timer, expires);
+}
+
+static inline void debug_and_trace_timer_deactivate(struct timer_list *timer)
+{
+ debug_timer_deactivate(timer);
+ trace_timer_cancel(timer);
+}
+
static void __init_timer(struct timer_list *timer,
const char *name,
struct lock_class_key *key)
@@ -551,7 +570,7 @@ void init_timer_key(struct timer_list *timer,
const char *name,
struct lock_class_key *key)
{
- debug_timer_init(timer);
+ debug_and_trace_timer_init(timer);
__init_timer(timer, name, key);
}
EXPORT_SYMBOL(init_timer_key);
@@ -570,7 +589,7 @@ static inline void detach_timer(struct timer_list *timer,
{
struct list_head *entry = &timer->entry;

- debug_timer_deactivate(timer);
+ debug_and_trace_timer_deactivate(timer);

__list_del(entry->prev, entry->next);
if (clear_pending)
@@ -631,7 +650,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
goto out_unlock;
}

- debug_timer_activate(timer);
+ debug_and_trace_timer_activate(timer, expires);

new_base = __get_cpu_var(tvec_bases);

@@ -783,7 +802,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
BUG_ON(timer_pending(timer) || !timer->function);
spin_lock_irqsave(&base->lock, flags);
timer_set_base(timer, base);
- debug_timer_activate(timer);
+ debug_and_trace_timer_activate(timer, timer->expires);
internal_add_timer(base, timer);
/*
* Check whether the other CPU is idle and needs to be
@@ -987,9 +1006,9 @@ static inline void __run_timers(struct tvec_base *base)
*/
lock_map_acquire(&lockdep_map);

- trace_timer_entry(timer);
+ trace_timer_expire(timer);
fn(data);
- trace_timer_exit(timer);
+ trace_timer_callback_done(timer);

lock_map_release(&lockdep_map);

--
1.6.1.2

2009-07-17 10:18:54

by Xiao Guangrong

[permalink] [raw]
Subject: [PATCH v3 3/4] ftrace: add tracepoint for hrtimer

Those tracepoints are wanted and useful:
1: We can detect a hrtimer's delay
2: We can monitor the lifecycle and behaviors of a hrtimer

Thus they help in analysing and debuging.

Great thanks to Thomas for giving me so many valuable advices.

Example ftrace output:
insmod-5280 [000] 10798.356372: hrtimer_init: timer=d0b044c0 clockid=CLOCK_REALTIME mode=HRTIMER_MODE_ABS
insmod-5280 [000] 10798.356385: hrtimer_start: timer=d0b044c0 func=hrtimer_fun expires=1246242896000000000 ns softexpires=1246242896000000000 ns
<idle>-0 [000] 10807.982987: hrtimer_expire: timer=d0b044c0 now=1246242896000906503 ns
<idle>-0 [000] 10807.982989: hrtimer_cancel: timer=d0b044c0
<idle>-0 [000] 10807.983562: hrtimer_callback_done: timer=d0b044c0

We expect the hrtimer expires at 1246242896000000000 ns, actually the
hrtimer expires at 1246242896000906503 ns, so it is delayed by
1246242896000906503-1246242896000000000 = 906503 ns.
We also realize the hrtimer's callback started at 10807.982987, and it
finished at 10807.983562, so it's taking 10807.983562-10807.982987=0.6ms.

Changelog:
v1->v2:
1: Remove ktime_to_ns() in TP_fast_assign()
2: Combine debugobjects and trace as Thomas's suggestion
v2->v3:
1: Remove function address from hrtimer_expire and hrtimer_cancel
as Thomas's suggestion
2: Remove debug_and_trace_hrtimer_expire() as Thomas's suggestion
3: Rename trace_hrtimer_entry() and trace_hrtimer_exit() to match this patch

Signed-off-by: Xiao Guangrong <[email protected]>
---
include/trace/events/timer.h | 107 ++++++++++++++++++++++++++++++++++--------
kernel/hrtimer.c | 35 +++++++++++---
2 files changed, 115 insertions(+), 27 deletions(-)

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 967e225..8b22b12 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -116,58 +116,125 @@ TRACE_EVENT(timer_cancel,
TP_printk("timer=%p", __entry->timer)
);

+TRACE_EVENT(hrtimer_init,
+
+ TP_PROTO(struct hrtimer *timer, clockid_t clockid, enum hrtimer_mode mode),
+
+ TP_ARGS(timer, clockid, mode),
+
+ TP_STRUCT__entry(
+ __field( void *, timer )
+ __field( clockid_t, clockid )
+ __field( enum hrtimer_mode, mode )
+ ),
+
+ TP_fast_assign(
+ __entry->timer = timer;
+ __entry->clockid = clockid;
+ __entry->mode = mode;
+ ),
+
+ TP_printk("timer=%p clockid=%s mode=%s", __entry->timer,
+ __entry->clockid == CLOCK_REALTIME ?
+ "CLOCK_REALTIME" : "CLOCK_MONOTONIC",
+ __entry->mode == HRTIMER_MODE_ABS ?
+ "HRTIMER_MODE_ABS" : "HRTIMER_MODE_REL")
+);
+
+TRACE_EVENT(hrtimer_start,
+
+ TP_PROTO(struct hrtimer *timer),
+
+ TP_ARGS(timer),
+
+ TP_STRUCT__entry(
+ __field( void *, timer )
+ __field( void *, function )
+ __field( s64, expires )
+ __field( s64, softexpires )
+ ),
+
+ TP_fast_assign(
+ __entry->timer = timer;
+ __entry->function = timer->function;
+ __entry->expires = hrtimer_get_expires(timer).tv64;
+ __entry->softexpires = hrtimer_get_softexpires(timer).tv64;
+ ),
+
+ TP_printk("timer=%p func=%pf expires=%llu ns softexpires=%llu ns",
+ __entry->timer, __entry->function,
+ (unsigned long long)__entry->expires,
+ (unsigned long long)__entry->softexpires)
+);
+
/**
- * hrtimer_entry - called immediately before the high res timer callback
+ * htimmer_expire - called before the high res timer callback
* @timer: pointer to struct hrtimer
*
- * When used in combination with the hrtimer_exit tracepoint we can
+ * When used in combination with the hrtimer_callback_done tracepoint we can
* determine the high res timer latency.
**/
-TRACE_EVENT(hrtimer_entry,
+TRACE_EVENT(hrtimer_expire,

- TP_PROTO(struct hrtimer *timer),
+ TP_PROTO(struct hrtimer *timer, s64 now),

- TP_ARGS(timer),
+ TP_ARGS(timer, now),

TP_STRUCT__entry(
- __field(void *, function)
+ __field( void *, timer )
+ __field( s64, now )
),

TP_fast_assign(
- __entry->function = timer->function;
+ __entry->timer = timer;
+ __entry->now = now;
),

- TP_printk("func=%pf", __entry->function)
+ TP_printk("timer=%p now=%llu ns", __entry->timer,
+ (unsigned long long)__entry->now)
);

/**
- * hrtimer_exit - called immediately after the high res timer returns
+ * hrtimer_callback_done - called immediately after the high res timer returns
* @timer: pointer to struct hrtimer
* @restart: high res timer return value
*
* High res timer will restart if @restart is set to HRTIMER_RESTART.
- * When used in combination with the hrtimer_entry tracepoint we can
+ * When used in combination with the hrtimer_expire tracepoint we can
* determine the high res timer latency.
**/
-TRACE_EVENT(hrtimer_exit,
+TRACE_EVENT(hrtimer_callback_done,

- TP_PROTO(struct hrtimer *timer, int restart),
+ TP_PROTO(struct hrtimer *timer),

- TP_ARGS(timer, restart),
+ TP_ARGS(timer),

TP_STRUCT__entry(
- __field(void *, function)
- __field(int, restart)
+ __field( void *, timer )
),

TP_fast_assign(
- __entry->function = timer->function;
- __entry->restart = restart;
+ __entry->timer = timer;
),

- TP_printk("func=%pf restart=%s", __entry->function,
- (__entry->restart == HRTIMER_RESTART) ?
- "HRTIMER_RESTART" : "HRTIMER_NORESTART")
+ TP_printk("timer=%p", __entry->timer)
+);
+
+TRACE_EVENT(hrtimer_cancel,
+
+ TP_PROTO(struct hrtimer *timer),
+
+ TP_ARGS(timer),
+
+ TP_STRUCT__entry(
+ __field( void *, timer )
+ ),
+
+ TP_fast_assign(
+ __entry->timer = timer;
+ ),
+
+ TP_printk("timer=%p", __entry->timer)
);

#endif /* _TRACE_TIMER_H */
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 33317e4..bd21c9b 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -442,6 +442,26 @@ static inline void debug_hrtimer_activate(struct hrtimer *timer) { }
static inline void debug_hrtimer_deactivate(struct hrtimer *timer) { }
#endif

+static inline void
+debug_and_trace_hrtimer_init(struct hrtimer *timer, clockid_t clockid,
+ enum hrtimer_mode mode)
+{
+ debug_hrtimer_init(timer);
+ trace_hrtimer_init(timer, clockid, mode);
+}
+
+static inline void debug_and_trace_hrtimer_activate(struct hrtimer *timer)
+{
+ debug_hrtimer_activate(timer);
+ trace_hrtimer_start(timer);
+}
+
+static inline void debug_and_trace_hrtimer_deactivate(struct hrtimer *timer)
+{
+ debug_hrtimer_deactivate(timer);
+ trace_hrtimer_cancel(timer);
+}
+
/* High resolution timer related functions */
#ifdef CONFIG_HIGH_RES_TIMERS

@@ -798,7 +818,7 @@ static int enqueue_hrtimer(struct hrtimer *timer,
struct hrtimer *entry;
int leftmost = 1;

- debug_hrtimer_activate(timer);
+ debug_and_trace_hrtimer_activate(timer);

/*
* Find the right place in the rbtree:
@@ -884,7 +904,7 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
* reprogramming happens in the interrupt handler. This is a
* rare case and less expensive than a smp call.
*/
- debug_hrtimer_deactivate(timer);
+ debug_and_trace_hrtimer_deactivate(timer);
timer_stats_hrtimer_clear_start_info(timer);
reprogram = base->cpu_base == &__get_cpu_var(hrtimer_bases);
__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE,
@@ -1118,7 +1138,7 @@ static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
void hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
enum hrtimer_mode mode)
{
- debug_hrtimer_init(timer);
+ debug_and_trace_hrtimer_init(timer, clock_id, mode);
__hrtimer_init(timer, clock_id, mode);
}
EXPORT_SYMBOL_GPL(hrtimer_init);
@@ -1151,7 +1171,7 @@ static void __run_hrtimer(struct hrtimer *timer)

WARN_ON(!irqs_disabled());

- debug_hrtimer_deactivate(timer);
+ debug_and_trace_hrtimer_deactivate(timer);
__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
timer_stats_account_hrtimer(timer);
fn = timer->function;
@@ -1162,9 +1182,8 @@ static void __run_hrtimer(struct hrtimer *timer)
* the timer base.
*/
spin_unlock(&cpu_base->lock);
- trace_hrtimer_entry(timer);
restart = fn(timer);
- trace_hrtimer_exit(timer, restart);
+ trace_hrtimer_callback_done(timer);
spin_lock(&cpu_base->lock);

/*
@@ -1275,6 +1294,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
break;
}

+ trace_hrtimer_expire(timer, basenow.tv64);
__run_hrtimer(timer);
}
base++;
@@ -1397,6 +1417,7 @@ void hrtimer_run_queues(void)
hrtimer_get_expires_tv64(timer))
break;

+ trace_hrtimer_expire(timer, base->softirq_time.tv64);
__run_hrtimer(timer);
}
spin_unlock(&cpu_base->lock);
@@ -1573,7 +1594,7 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
while ((node = rb_first(&old_base->active))) {
timer = rb_entry(node, struct hrtimer, node);
BUG_ON(hrtimer_callback_running(timer));
- debug_hrtimer_deactivate(timer);
+ debug_and_trace_hrtimer_deactivate(timer);

/*
* Mark it as STATE_MIGRATE not INACTIVE otherwise the
--
1.6.1.2

2009-07-17 10:20:47

by Xiao Guangrong

[permalink] [raw]
Subject: [PATCH v3 4/4] ftrace: add tracepoint for itimer

Those tracepoints are wanted and useful:
1: We can detect a itimer's delay
2: We can monitor the lifecycle and behaviors of a itimer

Thus they help in analysing and debuging.

Great thanks to Thomas for giving me so many valuable advices.

There have two case:
case 1, if we use ITIMER_REAL itimer:
Example ftrace output:
main-1517 [000] 288.826521: hrtimer_start: timer=ce807544 func=it_real_fn expires=289327664443 ns softexpires=289327664443 ns
main-1517 [000] 288.826524: itimer_state: which=0 expires=0 it_value=0.500 it_interval=0.0
main-1517 [000] 288.827313: hrtimer_expire: timer=ce807544 now=289327958161 ns
main-1517 [000] 288.827314: hrtimer_cancel: timer=ce807544
main-1517 [000] 288.827315: itimer_expire: pid=1517 which=0 now=0
main-1517 [000] 288.827319: hrtimer_callback_done: timer=ce807544
main-1517 [000] 288.827463: itimer_state: which=0 expires=0 it_value=0.0 it_interval=0.0

ITIMER_REAL itime use hrtimer to calculate time, so we can cooperate with
hrtimer's tracepoint to get it's delay.

from this output information, the process set up a ITIMER_REAL
iteimr(which=0), use a hrtime which address is ce807544, we can get the
delay is 289327958161-289327664443=293718 ns

case 2, if we use ITIMER_VIRTUAL/ITIMER_PROF itimer:
Example ftrace output:
main-5409 [001] 12763.874835: itimer_state: which=1 expires=2 it_value=0.500 it_interval=0.500
main-5409 [001] 12763.883495: itimer_expire: pid=5409 which=1 now=2
main-5409 [001] 12763.883712: itimer_state: which=1 expires=0 it_value=0.0 it_interval=0.0

from this, we can know the itimer delay is 2-2=0 jiffies.

Changelog:
v1->v2:
Remove ktime_to_ns() in TP_fast_assign()
v2->v3:
Below changes are all base on Thomas's suggestion:
1: Remove ITIMER_REAL's timer address and 'state' from trace_itimer_state()
2: Don't save task's name in trace_itimer_state()
3: Save task's pid instead of task's name in trace_itimer_expire()

Signed-off-by: Xiao Guangrong <[email protected]>
---
include/trace/events/timer.h | 52 ++++++++++++++++++++++++++++++++++++++++++
kernel/itimer.c | 5 +++-
kernel/posix-cpu-timers.c | 3 ++
3 files changed, 59 insertions(+), 1 deletions(-)

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 8b22b12..0a77f34 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -237,6 +237,58 @@ TRACE_EVENT(hrtimer_cancel,
TP_printk("timer=%p", __entry->timer)
);

+TRACE_EVENT(itimer_state,
+
+ TP_PROTO(int which, struct itimerval *value, cputime_t expires),
+
+ TP_ARGS(which, value, expires),
+
+ TP_STRUCT__entry(
+ __field( int, which )
+ __field( cputime_t, expires )
+ __field( long, value_sec )
+ __field( long, value_usec )
+ __field( long, interval_sec )
+ __field( long, interval_usec )
+ ),
+
+ TP_fast_assign(
+ __entry->which = which;
+ __entry->expires = expires;
+ __entry->value_sec = value->it_value.tv_sec;
+ __entry->value_usec = value->it_value.tv_usec;
+ __entry->interval_sec = value->it_interval.tv_sec;
+ __entry->interval_usec = value->it_interval.tv_usec;
+ ),
+
+ TP_printk("which=%d expires=%lu it_value=%lu.%lu it_interval=%lu.%lu",
+ __entry->which, __entry->expires,
+ __entry->value_sec, __entry->value_usec,
+ __entry->interval_sec, __entry->interval_usec)
+);
+
+TRACE_EVENT(itimer_expire,
+
+ TP_PROTO(int which, struct pid *pid, cputime_t now),
+
+ TP_ARGS(which, pid, now),
+
+ TP_STRUCT__entry(
+ __field( int , which )
+ __field( int, pid )
+ __field( cputime_t, now )
+ ),
+
+ TP_fast_assign(
+ __entry->which = which;
+ __entry->now = now;
+ __entry->pid = pid_vnr(pid);
+ ),
+
+ TP_printk("pid=%d which=%d now=%lu", __entry->pid, __entry->which,
+ __entry->now)
+);
+
#endif /* _TRACE_TIMER_H */

/* This part must be outside protection */
diff --git a/kernel/itimer.c b/kernel/itimer.c
index 58762f7..7e9ee89 100644
--- a/kernel/itimer.c
+++ b/kernel/itimer.c
@@ -12,6 +12,7 @@
#include <linux/time.h>
#include <linux/posix-timers.h>
#include <linux/hrtimer.h>
+#include <trace/events/timer.h>

#include <asm/uaccess.h>

@@ -123,6 +124,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer)
struct signal_struct *sig =
container_of(timer, struct signal_struct, real_timer);

+ trace_itimer_expire(ITIMER_REAL, sig->leader_pid, 0);
kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->leader_pid);

return HRTIMER_NORESTART;
@@ -139,7 +141,7 @@ int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue)
struct task_struct *tsk = current;
struct hrtimer *timer;
ktime_t expires;
- cputime_t cval, cinterval, nval, ninterval;
+ cputime_t cval, cinterval, nval = 0, ninterval;

/*
* Validate the timevals in value.
@@ -220,6 +222,7 @@ again:
default:
return -EINVAL;
}
+ trace_itimer_state(which, value, nval);
return 0;
}

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index bece7c0..63e93bf 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -8,6 +8,7 @@
#include <linux/math64.h>
#include <asm/uaccess.h>
#include <linux/kernel_stat.h>
+#include <trace/events/timer.h>

/*
* Called after updating RLIMIT_CPU to set timer expiration if necessary.
@@ -1160,6 +1161,7 @@ static void check_process_timers(struct task_struct *tsk,
sig->it_prof_expires = cputime_add(
sig->it_prof_expires, ptime);
}
+ trace_itimer_expire(ITIMER_PROF, sig->leader_pid, ptime);
__group_send_sig_info(SIGPROF, SEND_SIG_PRIV, tsk);
}
if (!cputime_eq(sig->it_prof_expires, cputime_zero) &&
@@ -1176,6 +1178,7 @@ static void check_process_timers(struct task_struct *tsk,
sig->it_virt_expires = cputime_add(
sig->it_virt_expires, utime);
}
+ trace_itimer_expire(ITIMER_VIRTUAL, sig->leader_pid, utime);
__group_send_sig_info(SIGVTALRM, SEND_SIG_PRIV, tsk);
}
if (!cputime_eq(sig->it_virt_expires, cputime_zero) &&
--
1.6.1.2

2009-07-17 10:50:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] ftrace: add tracepoint for hrtimer

On Fri, 2009-07-17 at 18:18 +0800, Xiao Guangrong wrote:
> Those tracepoints are wanted and useful:
> 1: We can detect a hrtimer's delay
> 2: We can monitor the lifecycle and behaviors of a hrtimer
>
> Thus they help in analysing and debuging.
>
> Great thanks to Thomas for giving me so many valuable advices.
>
> Example ftrace output:
> insmod-5280 [000] 10798.356372: hrtimer_init: timer=d0b044c0 clockid=CLOCK_REALTIME mode=HRTIMER_MODE_ABS
> insmod-5280 [000] 10798.356385: hrtimer_start: timer=d0b044c0 func=hrtimer_fun expires=1246242896000000000 ns softexpires=1246242896000000000 ns
> <idle>-0 [000] 10807.982987: hrtimer_expire: timer=d0b044c0 now=1246242896000906503 ns
> <idle>-0 [000] 10807.982989: hrtimer_cancel: timer=d0b044c0
> <idle>-0 [000] 10807.983562: hrtimer_callback_done: timer=d0b044c0
>
> We expect the hrtimer expires at 1246242896000000000 ns, actually the
> hrtimer expires at 1246242896000906503 ns, so it is delayed by
> 1246242896000906503-1246242896000000000 = 906503 ns.
> We also realize the hrtimer's callback started at 10807.982987, and it
> finished at 10807.983562, so it's taking 10807.983562-10807.982987=0.6ms.
>
> Changelog:
> v1->v2:
> 1: Remove ktime_to_ns() in TP_fast_assign()
> 2: Combine debugobjects and trace as Thomas's suggestion
> v2->v3:
> 1: Remove function address from hrtimer_expire and hrtimer_cancel
> as Thomas's suggestion
> 2: Remove debug_and_trace_hrtimer_expire() as Thomas's suggestion
> 3: Rename trace_hrtimer_entry() and trace_hrtimer_exit() to match this patch
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---



> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index 33317e4..bd21c9b 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -442,6 +442,26 @@ static inline void debug_hrtimer_activate(struct hrtimer *timer) { }
> static inline void debug_hrtimer_deactivate(struct hrtimer *timer) { }
> #endif
>
> +static inline void
> +debug_and_trace_hrtimer_init(struct hrtimer *timer, clockid_t clockid,
> + enum hrtimer_mode mode)
> +{
> + debug_hrtimer_init(timer);
> + trace_hrtimer_init(timer, clockid, mode);
> +}
> +
> +static inline void debug_and_trace_hrtimer_activate(struct hrtimer *timer)
> +{
> + debug_hrtimer_activate(timer);
> + trace_hrtimer_start(timer);
> +}
> +
> +static inline void debug_and_trace_hrtimer_deactivate(struct hrtimer *timer)
> +{
> + debug_hrtimer_deactivate(timer);
> + trace_hrtimer_cancel(timer);
> +}

I would argue that tracing is a form of debugging and you shouldn't need
to mangle these names like that, simply leave them debug_*().


> @@ -1162,9 +1182,8 @@ static void __run_hrtimer(struct hrtimer *timer)
> * the timer base.
> */
> spin_unlock(&cpu_base->lock);
> - trace_hrtimer_entry(timer);
> restart = fn(timer);
> - trace_hrtimer_exit(timer, restart);
> + trace_hrtimer_callback_done(timer);
> spin_lock(&cpu_base->lock);
>
> /*

Why bother introducing these tracepoints if you're going to remove them
in the same patch-set?

Also, the below:

> @@ -1275,6 +1294,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
> break;
> }
>
> + trace_hrtimer_expire(timer, basenow.tv64);
> __run_hrtimer(timer);
> }
> base++;
> @@ -1397,6 +1417,7 @@ void hrtimer_run_queues(void)
> hrtimer_get_expires_tv64(timer))
> break;
>
> + trace_hrtimer_expire(timer, base->softirq_time.tv64);
> __run_hrtimer(timer);
> }
> spin_unlock(&cpu_base->lock);

indicates you placed that tracepoint in the wrong place.

Furthermore, I don't get why you want it there and not on the old
_entry() site, because this adds all kinds of extra overhead and you
loose the exact callback timings.

2009-07-20 07:25:57

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] ftrace: add tracepoint for hrtimer

Hi Peter,

Thanks for your review.

Peter Zijlstra wrote:

>> +static inline void debug_and_trace_hrtimer_deactivate(struct hrtimer *timer)
>> +{
>> + debug_hrtimer_deactivate(timer);
>> + trace_hrtimer_cancel(timer);
>> +}
>
> I would argue that tracing is a form of debugging and you shouldn't need
> to mangle these names like that, simply leave them debug_*().
>

I think this makes sense. I'll fix it unless Thomas has objections.

>
>> @@ -1162,9 +1182,8 @@ static void __run_hrtimer(struct hrtimer *timer)
>> * the timer base.
>> */
>> spin_unlock(&cpu_base->lock);
>> - trace_hrtimer_entry(timer);
>> restart = fn(timer);
>> - trace_hrtimer_exit(timer, restart);
>> + trace_hrtimer_callback_done(timer);
>> spin_lock(&cpu_base->lock);
>>
>> /*
>
> Why bother introducing these tracepoints if you're going to remove them
> in the same patch-set?
>

Actually I'm renaming them but not removing them.

I can drop the first patch and merge it into the latter patches,
but that will lose the credit for Anton Blanchard

> Also, the below:
>
>> @@ -1275,6 +1294,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>> break;
>> }
>>
>> + trace_hrtimer_expire(timer, basenow.tv64);
>> __run_hrtimer(timer);
>> }
>> base++;
>> @@ -1397,6 +1417,7 @@ void hrtimer_run_queues(void)
>> hrtimer_get_expires_tv64(timer))
>> break;
>>
>> + trace_hrtimer_expire(timer, base->softirq_time.tv64);
>> __run_hrtimer(timer);
>> }
>> spin_unlock(&cpu_base->lock);
>
> indicates you placed that tracepoint in the wrong place.
>
> Furthermore, I don't get why you want it there and not on the old
> _entry() site, because this adds all kinds of extra overhead and you
> loose the exact callback timings.
>

Yes, it's true, but the loose is only about 1 microsecond as I tested it.
Do you think it's acceptable or not?

If we put trace_hrtimer_expire() on the old _entry() site, then we can't
get the timestamps when hrtimer expires, which is used to calculate the
latency of hrtimer.

You can see the mail which I send to Thomas last week, can be found here:
http://marc.info/?l=linux-kernel&m=124762164322497&w=2)


Thanks,
Xiao

>
>
>

2009-07-20 12:09:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] ftrace: add tracepoint for hrtimer

On Mon, 2009-07-20 at 15:25 +0800, Xiao Guangrong wrote:

> >> @@ -1162,9 +1182,8 @@ static void __run_hrtimer(struct hrtimer *timer)
> >> * the timer base.
> >> */
> >> spin_unlock(&cpu_base->lock);
> >> - trace_hrtimer_entry(timer);
> >> restart = fn(timer);
> >> - trace_hrtimer_exit(timer, restart);
> >> + trace_hrtimer_callback_done(timer);
> >> spin_lock(&cpu_base->lock);
> >>
> >> /*
> >
> > Why bother introducing these tracepoints if you're going to remove them
> > in the same patch-set?
> >
>
> Actually I'm renaming them but not removing them.

Since the new set doesn't include anything comparable to _entry, I'd say
you're deleting one.

> I can drop the first patch and merge it into the latter patches,
> but that will lose the credit for Anton Blanchard

You could fix that with something like:

Suggested-by: Anton Blanchard

> > Also, the below:
> >
> >> @@ -1275,6 +1294,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
> >> break;
> >> }
> >>
> >> + trace_hrtimer_expire(timer, basenow.tv64);
> >> __run_hrtimer(timer);
> >> }
> >> base++;
> >> @@ -1397,6 +1417,7 @@ void hrtimer_run_queues(void)
> >> hrtimer_get_expires_tv64(timer))
> >> break;
> >>
> >> + trace_hrtimer_expire(timer, base->softirq_time.tv64);
> >> __run_hrtimer(timer);
> >> }
> >> spin_unlock(&cpu_base->lock);
> >
> > indicates you placed that tracepoint in the wrong place.
> >
> > Furthermore, I don't get why you want it there and not on the old
> > _entry() site, because this adds all kinds of extra overhead and you
> > loose the exact callback timings.
> >
>
> Yes, it's true, but the loose is only about 1 microsecond as I tested it.
> Do you think it's acceptable or not?

The time is irrelevant, but look at the code, it includes a whole heap
of things, like debug code, locks, poking at hardware etc.

It simply isn't comparable to before.

> If we put trace_hrtimer_expire() on the old _entry() site, then we can't
> get the timestamps when hrtimer expires, which is used to calculate the
> latency of hrtimer.

Ah, but you don't get those anyway, I'd argue the whole expire thing is
broken. The only expiry you get is the hardware interrupt firing.
Anything after that is a free-for-all.

Look at that loop in hrtimer_interrupt(), with your tracepoint, they'd
all expire at the same time, regardless of how long previous callback's
took to complete.

Also, the whole loop can be re-tried, updating 'now' expiring a whole
new set of timers without expiry event.

The best you can get is a tracepoint when the hrtimer interrupt happens,
and the IRQ tracepoint already give you that.

So I really don't see the use-case for this _expiry tracepoint, its just
too ill-defined.

The _entry and _exit ones I can agree with, this stuff just doesn't make
any sense.

2009-07-22 09:38:51

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] ftrace: add tracepoint for hrtimer



Peter Zijlstra wrote:

> Ah, but you don't get those anyway, I'd argue the whole expire thing is
> broken. The only expiry you get is the hardware interrupt firing.
> Anything after that is a free-for-all.
>
> Look at that loop in hrtimer_interrupt(), with your tracepoint, they'd
> all expire at the same time, regardless of how long previous callback's
> took to complete.
>
> Also, the whole loop can be re-tried, updating 'now' expiring a whole
> new set of timers without expiry event.
>

Yes, the expire time that got by _expire() is incorrect and thanks for
your point out.

> The best you can get is a tracepoint when the hrtimer interrupt happens,
> and the IRQ tracepoint already give you that.
>

I'm trying to fix it address your comment, but meet some problems,
the time of ftrace output can't solve everything, because:

1: the time unit of ftrace output is microsecond, but hrtimer's unit
is nanosecond, it's not exact for us

2: the time of ftrace ouput is the time after system boot, but we need
xtime and wall_to_monotonic to calculate latency of hrtimer,
for example:
insmod-3821 [001] 3192.239335: hrtimer_start: timer=d08a1480 expires=1245162841000000000 ns
<idle>-0 [001] 3201.506127: hrtimer_expire: timer=d08a1480

we expect the timer expire at 1245162841000000000 ns, this is base on
xtime, but we don't know the interval running that we are expect hrtimer
to run if we don't know the xtime at hrtimer_start or hrtimer_expire.

But it's hard for hrtime's TRACE_EVENT to get xtime and wall_to_monotonic
since it's a fast patch, if we have to do this, the code maybe like below:

TRACE_EVENT(hrtimer_expire,

......

TP_STRUCT__entry(
__field( void *, timer )
__field( s64, now )
__field( s64, offset )
),

TP_fast_assign(
__entry->timer = timer;
__entry->now = ktime_get().tv64;
__entry->wtom = timespec_to_ktime(wall_to_monotonic).tv64;
),

TP_printk("timer=%p now=%llu ns wtom=%llu", __entry->timer,
(unsigned long long)__entry->now, (unsigned long long)__entry->wtom)
);

We need cooperate with trace_hrtimer_init() to get hrtimer's clockid.

That make trace_hrtimer_expire() slow.

Though the original patch get expire time not exactly, but It harm system's
performance very little.

Thanks,
Xiao

> So I really don't see the use-case for this _expiry tracepoint, its just
> too ill-defined.
>
> The _entry and _exit ones I can agree with, this stuff just doesn't make
> any sense.
>
>

2009-07-22 10:12:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] ftrace: add tracepoint for hrtimer

On Wed, 2009-07-22 at 17:36 +0800, Xiao Guangrong wrote:
>
> Peter Zijlstra wrote:
>
> > Ah, but you don't get those anyway, I'd argue the whole expire thing is
> > broken. The only expiry you get is the hardware interrupt firing.
> > Anything after that is a free-for-all.
> >
> > Look at that loop in hrtimer_interrupt(), with your tracepoint, they'd
> > all expire at the same time, regardless of how long previous callback's
> > took to complete.
> >
> > Also, the whole loop can be re-tried, updating 'now' expiring a whole
> > new set of timers without expiry event.
> >
>
> Yes, the expire time that got by _expire() is incorrect and thanks for
> your point out.
>
> > The best you can get is a tracepoint when the hrtimer interrupt happens,
> > and the IRQ tracepoint already give you that.
> >
>
> I'm trying to fix it address your comment, but meet some problems,
> the time of ftrace output can't solve everything, because:
>
> 1: the time unit of ftrace output is microsecond, but hrtimer's unit
> is nanosecond, it's not exact for us
>
> 2: the time of ftrace ouput is the time after system boot, but we need
> xtime and wall_to_monotonic to calculate latency of hrtimer,
> for example:
> insmod-3821 [001] 3192.239335: hrtimer_start: timer=d08a1480 expires=1245162841000000000 ns
> <idle>-0 [001] 3201.506127: hrtimer_expire: timer=d08a1480
>
> we expect the timer expire at 1245162841000000000 ns, this is base on
> xtime, but we don't know the interval running that we are expect hrtimer
> to run if we don't know the xtime at hrtimer_start or hrtimer_expire.
>
> But it's hard for hrtime's TRACE_EVENT to get xtime and wall_to_monotonic
> since it's a fast patch, if we have to do this, the code maybe like below:
>
> TRACE_EVENT(hrtimer_expire,
>
> ......
>
> TP_STRUCT__entry(
> __field( void *, timer )
> __field( s64, now )
> __field( s64, offset )
> ),
>
> TP_fast_assign(
> __entry->timer = timer;
> __entry->now = ktime_get().tv64;
> __entry->wtom = timespec_to_ktime(wall_to_monotonic).tv64;
> ),
>
> TP_printk("timer=%p now=%llu ns wtom=%llu", __entry->timer,
> (unsigned long long)__entry->now, (unsigned long long)__entry->wtom)
> );
>
> We need cooperate with trace_hrtimer_init() to get hrtimer's clockid.
>
> That make trace_hrtimer_expire() slow.
>
> Though the original patch get expire time not exactly, but It harm system's
> performance very little.

OK, so what you want to measure is the time of the actual callback
happening (hrtimer_entry) vs that where you would have expected it to
happen (hrtimer_start + delay), right?

So what's wrong with printing the expected expiration time in the
hrtimer_start tracepoint in the cheap clock units?

2009-07-22 15:36:44

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] ftrace: add tracepoint for hrtimer

* Peter Zijlstra ([email protected]) wrote:
> On Wed, 2009-07-22 at 17:36 +0800, Xiao Guangrong wrote:
> >
> > Peter Zijlstra wrote:
> >
> > > Ah, but you don't get those anyway, I'd argue the whole expire thing is
> > > broken. The only expiry you get is the hardware interrupt firing.
> > > Anything after that is a free-for-all.
> > >
> > > Look at that loop in hrtimer_interrupt(), with your tracepoint, they'd
> > > all expire at the same time, regardless of how long previous callback's
> > > took to complete.
> > >
> > > Also, the whole loop can be re-tried, updating 'now' expiring a whole
> > > new set of timers without expiry event.
> > >
> >
> > Yes, the expire time that got by _expire() is incorrect and thanks for
> > your point out.
> >
> > > The best you can get is a tracepoint when the hrtimer interrupt happens,
> > > and the IRQ tracepoint already give you that.
> > >
> >
> > I'm trying to fix it address your comment, but meet some problems,
> > the time of ftrace output can't solve everything, because:
> >
> > 1: the time unit of ftrace output is microsecond, but hrtimer's unit
> > is nanosecond, it's not exact for us
> >
> > 2: the time of ftrace ouput is the time after system boot, but we need
> > xtime and wall_to_monotonic to calculate latency of hrtimer,
> > for example:
> > insmod-3821 [001] 3192.239335: hrtimer_start: timer=d08a1480 expires=1245162841000000000 ns
> > <idle>-0 [001] 3201.506127: hrtimer_expire: timer=d08a1480
> >
> > we expect the timer expire at 1245162841000000000 ns, this is base on
> > xtime, but we don't know the interval running that we are expect hrtimer
> > to run if we don't know the xtime at hrtimer_start or hrtimer_expire.
> >
> > But it's hard for hrtime's TRACE_EVENT to get xtime and wall_to_monotonic
> > since it's a fast patch, if we have to do this, the code maybe like below:
> >
> > TRACE_EVENT(hrtimer_expire,
> >
> > ......
> >
> > TP_STRUCT__entry(
> > __field( void *, timer )
> > __field( s64, now )
> > __field( s64, offset )
> > ),
> >
> > TP_fast_assign(
> > __entry->timer = timer;
> > __entry->now = ktime_get().tv64;
> > __entry->wtom = timespec_to_ktime(wall_to_monotonic).tv64;
> > ),
> >
> > TP_printk("timer=%p now=%llu ns wtom=%llu", __entry->timer,
> > (unsigned long long)__entry->now, (unsigned long long)__entry->wtom)
> > );
> >
> > We need cooperate with trace_hrtimer_init() to get hrtimer's clockid.
> >
> > That make trace_hrtimer_expire() slow.
> >
> > Though the original patch get expire time not exactly, but It harm system's
> > performance very little.
>
> OK, so what you want to measure is the time of the actual callback
> happening (hrtimer_entry) vs that where you would have expected it to
> happen (hrtimer_start + delay), right?
>
> So what's wrong with printing the expected expiration time in the
> hrtimer_start tracepoint in the cheap clock units?

This is also my thinking: why are __entry->now and __entry->wtom needed
at all? Ftrace should be providing an accurate clock source for the
event already.

If we _really_ need to match the trace clock with the system
gettimeofday value (or monotonic time), then we can instrument the timer
tick update routine. e.g., from LTTng (not yet ported to TRACE_EVENT):

kernel/timer.c:

void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
update_times(ticks);
trace_timer_update_time(&xtime, &wall_to_monotonic);
}

ltt/probes/kernel-trace.c:

void probe_timer_update_time(struct timespec *_xtime,
struct timespec *_wall_to_monotonic)
{
trace_mark_tp(kernel, timer_update_time, timer_update_time,
probe_timer_update_time,
"jiffies #8u%llu xtime_sec %ld xtime_nsec %ld "
"walltomonotonic_sec %ld walltomonotonic_nsec %ld",
(unsigned long long)jiffies_64, _xtime->tv_sec, _xtime->tv_nsec,
_wall_to_monotonic->tv_sec, _wall_to_monotonic->tv_nsec);
}

Mathieu

>
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-07-23 10:01:28

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] ftrace: add tracepoint for hrtimer



Peter Zijlstra wrote:

> OK, so what you want to measure is the time of the actual callback
> happening (hrtimer_entry) vs that where you would have expected it to
> happen (hrtimer_start + delay), right?
>

Yes

> So what's wrong with printing the expected expiration time in the
> hrtimer_start tracepoint in the cheap clock units?
>

Is "cheap clock units" means jiffies time?

If so, we need to convert "expire time" into jiffies time, but
I searched ways for this type of converting but found nothing.

I'm afraid it's hard to do as below example:
insmod-3821 [001] 3192.239335: hrtimer_start: timer=d08a1480 expires=1245162841000000000 ns
<idle>-0 [001] 3201.506127: hrtimer_expire: timer=d08a1480

We can't convert 1245162841000000000ns to jiffies if we don't know
the xtime/wall_to_monotonic at hrimter_start.
And the expire time that got by _start() maybe wrong because user
can change time of day between hrtimer_start() and hrtimer_expire().

What your opinion?

>
>
>

2009-07-23 10:06:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] ftrace: add tracepoint for hrtimer

On Thu, 2009-07-23 at 18:01 +0800, Xiao Guangrong wrote:
>
> Peter Zijlstra wrote:
>
> > OK, so what you want to measure is the time of the actual callback
> > happening (hrtimer_entry) vs that where you would have expected it to
> > happen (hrtimer_start + delay), right?
> >
>
> Yes
>
> > So what's wrong with printing the expected expiration time in the
> > hrtimer_start tracepoint in the cheap clock units?
> >
>
> Is "cheap clock units" means jiffies time?

Nah, something like cpu_clock() which is monotonic per-cpu and should
have high resolution where available.

2009-07-24 09:40:46

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] ftrace: add tracepoint for hrtimer



Peter Zijlstra wrote:
> On Thu, 2009-07-23 at 18:01 +0800, Xiao Guangrong wrote:
>> Peter Zijlstra wrote:
>>
>>> OK, so what you want to measure is the time of the actual callback
>>> happening (hrtimer_entry) vs that where you would have expected it to
>>> happen (hrtimer_start + delay), right?
>>>
>> Yes
>>
>>> So what's wrong with printing the expected expiration time in the
>>> hrtimer_start tracepoint in the cheap clock units?
>>>
>> Is "cheap clock units" means jiffies time?
>
> Nah, something like cpu_clock() which is monotonic per-cpu and should
> have high resolution where available.
>

Hi Peter,

Thanks for your valuable suggestion.

I have do some test for it, and it very suitable for CLOCK_MONOTONIC clock,
but if the hrtimer's clock is CLOCK_REALTIME, however xtime is needed to get
the real time.

In addition, it make TP_fast_assign() become a slow path, we should avoid it
if possible.

Thanks,
Xiao

>

2009-07-24 11:09:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] ftrace: add tracepoint for hrtimer

On Fri, 2009-07-24 at 17:40 +0800, Xiao Guangrong wrote:
>
> Peter Zijlstra wrote:
> > On Thu, 2009-07-23 at 18:01 +0800, Xiao Guangrong wrote:
> >> Peter Zijlstra wrote:
> >>
> >>> OK, so what you want to measure is the time of the actual callback
> >>> happening (hrtimer_entry) vs that where you would have expected it to
> >>> happen (hrtimer_start + delay), right?
> >>>
> >> Yes
> >>
> >>> So what's wrong with printing the expected expiration time in the
> >>> hrtimer_start tracepoint in the cheap clock units?
> >>>
> >> Is "cheap clock units" means jiffies time?
> >
> > Nah, something like cpu_clock() which is monotonic per-cpu and should
> > have high resolution where available.
> >
>
> Hi Peter,
>
> Thanks for your valuable suggestion.
>
> I have do some test for it, and it very suitable for CLOCK_MONOTONIC clock,
> but if the hrtimer's clock is CLOCK_REALTIME, however xtime is needed to get
> the real time.

If you also log xtime shifts in cpu_clock() units you're good again :-)

You could of course also choose to ignore xtime shifts (they're rarely
significant on the scale of timer intervals).