2009-07-24 10:37:50

by Zhao Lei

Subject: [PATCH 0/3] Add walltime support for ring-buffer

Hello, Steven

This is prototype code to make ftrace display the walltime of events.

It needs to be applied on top of my patch:
[PATCH v3 1/2] Add function to convert between calendar time and broken-down time for universal use

This code runs, but it still needs fixes; it is only meant to demonstrate
what the result looks like.

I think displaying walltime is valuable for ftrace: it gives the system
administrator more recognizable information, so he can know "when" an
event happened.
It is also helpful for the flight-recorder work we are doing now, since
we can retrieve the walltime of events after a panic for analysis.

With this patch applied we get the following result:
# cd /mnt/debugfs/tracing/
# echo sched_switch > current_tracer
# cat trace
# tracer: sched_switch
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
bash-2457 [000] 214.895446: 2457:120:S + [000] 2457:120:S bash
bash-2457 [000] 214.898298: 2457:120:R + [000] 5:115:S events/0
bash-2457 [000] 214.898684: 2457:120:R ==> [000] 5:115:R events/0
events/0-5 [000] 214.899458: 5:115:R + [000] 2455:120:S sshd
events/0-5 [000] 214.899495: 5:115:S ==> [000] 2455:120:R sshd
...
# echo 1 > options/walltime
# cat trace
# tracer: sched_switch
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<idle>-0 [000] 2009-07-25 18:01:00.848468: 0:140:R ==> [000] 5:115:R events/0
events/0-5 [000] 2009-07-25 18:01:00.848523: 5:115:S ==> [000] 0:140:R <idle>
<idle>-0 [000] 2009-07-25 18:01:01.613479: 0:140:R + [000] 2455:120:S sshd
<idle>-0 [000] 2009-07-25 18:01:01.613678: 0:140:R ==> [000] 2455:120:R sshd
sshd-2455 [000] 2009-07-25 18:01:01.614015: 2455:120:S + [000] 2455:120:S sshd
...

This patch is a prototype because the following work remains:
1: The code that fetches walltime is not adjacent to rb_time_stamp(),
so walltime is not perfectly synchronized with the trace timestamp.
But if we fetch walltime right next to rb_time_stamp(), it will run on
every event reservation and slow the system down.
2: We can't call getnstimeofday() freely in the ring-buffer code, because
xtime_lock may already be held. The prototype code uses an ugly workaround
to get rid of this problem; maybe we need to call ftrace_disable_cpu()
before every write_seqlock() instead.
3: Someone may change the walltime while tracing, but the walltime stored
in the current ring-buffer page cannot reflect that change. We need to add
tracepoints to watch for this change, and add a walltime-change event
(already reserved in ring-buffer.c) to record it.

I'll continue with the remaining steps. What is your opinion on it?

Thanks
Zhaolei


2009-07-24 10:39:21

by Zhao Lei

Subject: [PATCH] Add walltime support for ring-buffer

Note:
This is only a prototype patch, meant to demonstrate what the result
looks like.
There is still much work to do (bugs to fix) before it can be applied.

Signed-off-by: Zhao Lei <[email protected]>
---
drivers/oprofile/cpu_buffer.c | 4 +-
include/linux/ring_buffer.h | 12 ++-
kernel/trace/ring_buffer.c | 159 ++++++++++++++++++++++++++++++----
kernel/trace/ring_buffer_benchmark.c | 3 +-
kernel/trace/trace.c | 8 +-
kernel/trace/trace_functions_graph.c | 8 +-
kernel/trace/trace_selftest.c | 2 +-
7 files changed, 161 insertions(+), 35 deletions(-)

diff --git a/drivers/oprofile/cpu_buffer.c b/drivers/oprofile/cpu_buffer.c
index a7aae24..1b9b36d 100644
--- a/drivers/oprofile/cpu_buffer.c
+++ b/drivers/oprofile/cpu_buffer.c
@@ -187,14 +187,14 @@ int op_cpu_buffer_write_commit(struct op_entry *entry)
struct op_sample *op_cpu_buffer_read_entry(struct op_entry *entry, int cpu)
{
struct ring_buffer_event *e;
- e = ring_buffer_consume(op_ring_buffer_read, cpu, NULL);
+ e = ring_buffer_consume(op_ring_buffer_read, cpu, NULL, NULL);
if (e)
goto event;
if (ring_buffer_swap_cpu(op_ring_buffer_read,
op_ring_buffer_write,
cpu))
return NULL;
- e = ring_buffer_consume(op_ring_buffer_read, cpu, NULL);
+ e = ring_buffer_consume(op_ring_buffer_read, cpu, NULL, NULL);
if (e)
goto event;
return NULL;
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 7fca716..eafffca 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -134,18 +134,22 @@ int ring_buffer_write(struct ring_buffer *buffer,
unsigned long length, void *data);

struct ring_buffer_event *
-ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts);
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts,
+ struct timespec *timespec);
struct ring_buffer_event *
-ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts);
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts,
+ struct timespec *timespec);

struct ring_buffer_iter *
ring_buffer_read_start(struct ring_buffer *buffer, int cpu);
void ring_buffer_read_finish(struct ring_buffer_iter *iter);

struct ring_buffer_event *
-ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts,
+ struct timespec *timespec);
struct ring_buffer_event *
-ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts);
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts,
+ struct timespec *timespec);
void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
int ring_buffer_iter_empty(struct ring_buffer_iter *iter);

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 51633d7..698e47f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -318,6 +318,7 @@ EXPORT_SYMBOL_GPL(ring_buffer_event_data);

struct buffer_data_page {
u64 time_stamp; /* page time stamp */
+ u32 walltime[3]; /* page wall time */
local_t commit; /* write committed index */
unsigned char data[]; /* data of buffer page */
};
@@ -442,6 +443,8 @@ struct ring_buffer_per_cpu {
unsigned long read;
u64 write_stamp;
u64 read_stamp;
+ u64 read_walltimestamp;
+ u32 read_walltime[3];
atomic_t record_disabled;
};

@@ -469,6 +472,8 @@ struct ring_buffer_iter {
unsigned long head;
struct buffer_page *head_page;
u64 read_stamp;
+ u64 read_walltimestamp;
+ u32 read_walltime[3];
};

/* buffer may be either ring_buffer or ring_buffer_per_cpu */
@@ -1499,6 +1504,14 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
static void rb_reset_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
{
cpu_buffer->read_stamp = cpu_buffer->reader_page->page->time_stamp;
+ cpu_buffer->read_walltimestamp =
+ cpu_buffer->reader_page->page->time_stamp;
+ cpu_buffer->read_walltime[0] =
+ cpu_buffer->reader_page->page->walltime[0];
+ cpu_buffer->read_walltime[1] =
+ cpu_buffer->reader_page->page->walltime[1];
+ cpu_buffer->read_walltime[2] =
+ cpu_buffer->reader_page->page->walltime[2];
cpu_buffer->reader_page->read = 0;
}

@@ -1518,6 +1531,10 @@ static void rb_inc_iter(struct ring_buffer_iter *iter)
rb_inc_page(cpu_buffer, &iter->head_page);

iter->read_stamp = iter->head_page->page->time_stamp;
+ iter->read_walltimestamp = iter->head_page->page->time_stamp;
+ iter->read_walltime[0] = iter->head_page->page->walltime[0];
+ iter->read_walltime[1] = iter->head_page->page->walltime[1];
+ iter->read_walltime[2] = iter->head_page->page->walltime[2];
iter->head = 0;
}

@@ -1927,8 +1944,48 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
* If this is the first commit on the page, then update
* its timestamp.
*/
- if (!tail)
+ if (!tail) {
+ struct timespec timespec;
tail_page->page->time_stamp = *ts;
+ /*
+ * Fixme:
+ * 1: Fetch walltime close to where the timestamp is taken.
+ * 2: No need to set the timestamp in rb_move_tail(); it will
+ * be rewritten here.
+ */
+ /*
+ * Fixme:
+ * We can't call getnstimeofday() here, because when a
+ * function is called with xtime_lock held, the kernel can
+ * reach this point via the function tracer, and
+ * getnstimeofday() will then spin forever. For example:
+ * tick_do_update_jiffies64() * write_seqlock(xtime_lock) *
+ * ->do_timer()
+ * ->update_wall_time()
+ * ->update_xtime_cache()
+ * ->ftrace_call() * here into function trace *
+ * ->function_test_events_call()
+ * ->trace_current_buffer_lock_reserve()
+ * -> ...
+ * -> here
+ *
+ * So we use get_seconds() while xtime_lock is held. This is
+ * only a temporary way to get the walltime feature working;
+ * we must fix it.
+ *
+ * Another option is to call ftrace_disable_cpu() before
+ * write_seqlock().
+ */
+ if (xtime_lock.sequence & 1) {
+ timespec.tv_sec = get_seconds();
+ timespec.tv_nsec = 0;
+ } else {
+ getnstimeofday(&timespec);
+ }
+ tail_page->page->walltime[0] = timespec.tv_nsec;
+ *(__kernel_time_t *)(tail_page->page->walltime + 1) =
+ timespec.tv_sec;
+ }

return event;
}
@@ -2674,10 +2731,19 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
iter->head_page = cpu_buffer->reader_page;
iter->head = cpu_buffer->reader_page->read;
}
- if (iter->head)
+ if (iter->head) {
iter->read_stamp = cpu_buffer->read_stamp;
- else
+ iter->read_walltimestamp = cpu_buffer->read_walltimestamp;
+ iter->read_walltime[0] = cpu_buffer->read_walltime[0];
+ iter->read_walltime[1] = cpu_buffer->read_walltime[1];
+ iter->read_walltime[2] = cpu_buffer->read_walltime[2];
+ } else {
iter->read_stamp = iter->head_page->page->time_stamp;
+ iter->read_walltimestamp = iter->head_page->page->time_stamp;
+ iter->read_walltime[0] = iter->head_page->page->walltime[0];
+ iter->read_walltime[1] = iter->head_page->page->walltime[1];
+ iter->read_walltime[2] = iter->head_page->page->walltime[2];
+ }
}

/**
@@ -2780,6 +2846,21 @@ rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
return;
}

+static void rb_cal_walltime(u32 *walltime, u64 ts, u32 *base_walltime,
+ u64 base_ts)
+{
+ u64 sec;
+ u64 nsec;
+ sec = *(__kernel_time_t *)(base_walltime + 1);
+ nsec = base_walltime[0] + ts - base_ts;
+ while (nsec >= NSEC_PER_SEC) {
+ nsec -= NSEC_PER_SEC;
+ ++sec;
+ }
+ walltime[0] = nsec;
+ *(__kernel_time_t *)(walltime + 1) = sec;
+}
+
static struct buffer_page *
rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
{
@@ -2953,12 +3034,14 @@ static void rb_advance_iter(struct ring_buffer_iter *iter)
}

static struct ring_buffer_event *
-rb_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
+rb_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts,
+ struct timespec *timespec)
{
struct ring_buffer_per_cpu *cpu_buffer;
struct ring_buffer_event *event;
struct buffer_page *reader;
int nr_loops = 0;
+ u64 timestamp;

cpu_buffer = buffer->buffers[cpu];

@@ -3004,10 +3087,21 @@ rb_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
goto again;

case RINGBUF_TYPE_DATA:
- if (ts) {
- *ts = cpu_buffer->read_stamp + event->time_delta;
+ if (ts || timespec) {
+ timestamp = cpu_buffer->read_stamp + event->time_delta;
ring_buffer_normalize_time_stamp(buffer,
- cpu_buffer->cpu, ts);
+ cpu_buffer->cpu,
+ &timestamp);
+ }
+ if (ts)
+ *ts = timestamp;
+ if (timespec) {
+ u32 walltime[3];
+ rb_cal_walltime(walltime, timestamp,
+ cpu_buffer->read_walltime,
+ cpu_buffer->read_walltimestamp);
+ timespec->tv_nsec = walltime[0];
+ timespec->tv_sec = *(__kernel_time_t *)(walltime + 1);
}
return event;

@@ -3020,12 +3114,13 @@ rb_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
EXPORT_SYMBOL_GPL(ring_buffer_peek);

static struct ring_buffer_event *
-rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts, struct timespec *timespec)
{
struct ring_buffer *buffer;
struct ring_buffer_per_cpu *cpu_buffer;
struct ring_buffer_event *event;
int nr_loops = 0;
+ u64 timestamp;

if (ring_buffer_iter_empty(iter))
return NULL;
@@ -3070,10 +3165,20 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
goto again;

case RINGBUF_TYPE_DATA:
- if (ts) {
- *ts = iter->read_stamp + event->time_delta;
+ if (ts || timespec) {
+ timestamp = iter->read_stamp + event->time_delta;
ring_buffer_normalize_time_stamp(buffer,
- cpu_buffer->cpu, ts);
+ cpu_buffer->cpu,
+ &timestamp);
+ }
+ if (ts)
+ *ts = timestamp;
+ if (timespec) {
+ u32 walltime[3];
+ rb_cal_walltime(walltime, timestamp,
+ iter->read_walltime, iter->read_walltimestamp);
+ timespec->tv_nsec = walltime[0];
+ timespec->tv_sec = *(__kernel_time_t *)(walltime + 1);
}
return event;

@@ -3110,7 +3215,8 @@ static inline int rb_ok_to_lock(void)
* not consume the data.
*/
struct ring_buffer_event *
-ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts,
+ struct timespec *timespec)
{
struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
struct ring_buffer_event *event;
@@ -3125,7 +3231,8 @@ ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
local_irq_save(flags);
if (dolock)
spin_lock(&cpu_buffer->reader_lock);
- event = rb_buffer_peek(buffer, cpu, ts);
+ event = rb_buffer_peek(buffer, cpu, ts, timespec);
+
if (dolock)
spin_unlock(&cpu_buffer->reader_lock);
local_irq_restore(flags);
@@ -3147,7 +3254,8 @@ ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
* not increment the iterator.
*/
struct ring_buffer_event *
-ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts,
+ struct timespec *timespec)
{
struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
struct ring_buffer_event *event;
@@ -3155,7 +3263,7 @@ ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)

again:
spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
- event = rb_iter_peek(iter, ts);
+ event = rb_iter_peek(iter, ts, timespec);
spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);

if (event && event->type_len == RINGBUF_TYPE_PADDING) {
@@ -3175,7 +3283,8 @@ ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
* and eventually empty the ring buffer if the producer is slower.
*/
struct ring_buffer_event *
-ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts,
+ struct timespec *timespec)
{
struct ring_buffer_per_cpu *cpu_buffer;
struct ring_buffer_event *event = NULL;
@@ -3196,7 +3305,7 @@ ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
if (dolock)
spin_lock(&cpu_buffer->reader_lock);

- event = rb_buffer_peek(buffer, cpu, ts);
+ event = rb_buffer_peek(buffer, cpu, ts, timespec);
if (!event)
goto out_unlock;

@@ -3287,7 +3396,8 @@ EXPORT_SYMBOL_GPL(ring_buffer_read_finish);
* This reads the next event in the ring buffer and increments the iterator.
*/
struct ring_buffer_event *
-ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts,
+ struct timespec *timespec)
{
struct ring_buffer_event *event;
struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
@@ -3295,7 +3405,7 @@ ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)

again:
spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
- event = rb_iter_peek(iter, ts);
+ event = rb_iter_peek(iter, ts, timespec);
if (!event)
goto out;

@@ -3353,6 +3463,10 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)

cpu_buffer->write_stamp = 0;
cpu_buffer->read_stamp = 0;
+ cpu_buffer->read_walltimestamp = 0;
+ cpu_buffer->read_walltime[0] = 0;
+ cpu_buffer->read_walltime[1] = 0;
+ cpu_buffer->read_walltime[2] = 0;

rb_head_page_activate(cpu_buffer);
}
@@ -3620,6 +3734,7 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
unsigned int commit;
unsigned int read;
u64 save_timestamp;
+ u32 save_walltime[3];
int ret = -1;

if (!cpumask_test_cpu(cpu, buffer->cpumask))
@@ -3679,6 +3794,9 @@ int ring_buffer_read_page(struct ring_buffer *buffer,

/* save the current timestamp, since the user will need it */
save_timestamp = cpu_buffer->read_stamp;
+ rb_cal_walltime(save_walltime, cpu_buffer->read_stamp,
+ cpu_buffer->read_walltime,
+ cpu_buffer->read_walltimestamp);

/* Need to copy one event at a time */
do {
@@ -3697,6 +3815,9 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
/* update bpage */
local_set(&bpage->commit, pos);
bpage->time_stamp = save_timestamp;
+ bpage->walltime[0] = save_walltime[0];
+ bpage->walltime[1] = save_walltime[1];
+ bpage->walltime[2] = save_walltime[2];

/* we copied everything to the beginning */
read = 0;
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 573d3cc..059a728 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -57,8 +57,9 @@ static enum event_status read_event(int cpu)
struct ring_buffer_event *event;
int *entry;
u64 ts;
+ struct timespec timespec;

- event = ring_buffer_consume(buffer, cpu, &ts);
+ event = ring_buffer_consume(buffer, cpu, &ts, &timespec);
if (!event)
return EVENT_DROPPED;

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e30e6b1..22787b3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1393,7 +1393,7 @@ static void trace_iterator_increment(struct trace_iterator *iter)

iter->idx++;
if (iter->buffer_iter[iter->cpu])
- ring_buffer_read(iter->buffer_iter[iter->cpu], NULL);
+ ring_buffer_read(iter->buffer_iter[iter->cpu], NULL, NULL);

ftrace_enable_cpu();
}
@@ -1408,9 +1408,9 @@ peek_next_entry(struct trace_iterator *iter, int cpu, u64 *ts)
ftrace_disable_cpu();

if (buf_iter)
- event = ring_buffer_iter_peek(buf_iter, ts);
+ event = ring_buffer_iter_peek(buf_iter, ts, NULL);
else
- event = ring_buffer_peek(iter->tr->buffer, cpu, ts);
+ event = ring_buffer_peek(iter->tr->buffer, cpu, ts, NULL);

ftrace_enable_cpu();

@@ -1489,7 +1489,7 @@ static void trace_consume(struct trace_iterator *iter)
{
/* Don't allow ftrace to trace into the ring buffers */
ftrace_disable_cpu();
- ring_buffer_consume(iter->tr->buffer, iter->cpu, &iter->ts);
+ ring_buffer_consume(iter->tr->buffer, iter->cpu, &iter->ts, NULL);
ftrace_enable_cpu();
}

diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index abf7c4a..03520a5 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -340,12 +340,12 @@ get_return_for_leaf(struct trace_iterator *iter,

/* First peek to compare current entry and the next one */
if (ring_iter)
- event = ring_buffer_iter_peek(ring_iter, NULL);
+ event = ring_buffer_iter_peek(ring_iter, NULL, NULL);
else {
/* We need to consume the current entry to see the next one */
- ring_buffer_consume(iter->tr->buffer, iter->cpu, NULL);
+ ring_buffer_consume(iter->tr->buffer, iter->cpu, NULL, NULL);
event = ring_buffer_peek(iter->tr->buffer, iter->cpu,
- NULL);
+ NULL, NULL);
}

if (!event)
@@ -362,7 +362,7 @@ get_return_for_leaf(struct trace_iterator *iter,

/* this is a leaf, now advance the iterator */
if (ring_iter)
- ring_buffer_read(ring_iter, NULL);
+ ring_buffer_read(ring_iter, NULL, NULL);

return next;
}
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 71f2edb..e6fa0d1 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -29,7 +29,7 @@ static int trace_test_buffer_cpu(struct trace_array *tr, int cpu)
struct trace_entry *entry;
unsigned int loops = 0;

- while ((event = ring_buffer_consume(tr->buffer, cpu, NULL))) {
+ while ((event = ring_buffer_consume(tr->buffer, cpu, NULL, NULL))) {
entry = ring_buffer_event_data(event);

/*
--
1.5.5.3

2009-07-24 10:40:22

by Zhao Lei

Subject: [RFC PATCH 1/3] Add walltime support for ring-buffer

Note:
This is only a prototype patch, meant to demonstrate what the result
looks like.
There is still much work to do (bugs to fix) before it can be applied.

Signed-off-by: Zhao Lei <[email protected]>
---
drivers/oprofile/cpu_buffer.c | 4 +-
include/linux/ring_buffer.h | 12 ++-
kernel/trace/ring_buffer.c | 159 ++++++++++++++++++++++++++++++----
kernel/trace/ring_buffer_benchmark.c | 3 +-
kernel/trace/trace.c | 8 +-
kernel/trace/trace_functions_graph.c | 8 +-
kernel/trace/trace_selftest.c | 2 +-
7 files changed, 161 insertions(+), 35 deletions(-)

ftrace_disable_cpu();
- ring_buffer_consume(iter->tr->buffer, iter->cpu, &iter->ts);
+ ring_buffer_consume(iter->tr->buffer, iter->cpu, &iter->ts, NULL);
ftrace_enable_cpu();
}

diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index abf7c4a..03520a5 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -340,12 +340,12 @@ get_return_for_leaf(struct trace_iterator *iter,

/* First peek to compare current entry and the next one */
if (ring_iter)
- event = ring_buffer_iter_peek(ring_iter, NULL);
+ event = ring_buffer_iter_peek(ring_iter, NULL, NULL);
else {
/* We need to consume the current entry to see the next one */
- ring_buffer_consume(iter->tr->buffer, iter->cpu, NULL);
+ ring_buffer_consume(iter->tr->buffer, iter->cpu, NULL, NULL);
event = ring_buffer_peek(iter->tr->buffer, iter->cpu,
- NULL);
+ NULL, NULL);
}

if (!event)
@@ -362,7 +362,7 @@ get_return_for_leaf(struct trace_iterator *iter,

/* this is a leaf, now advance the iterator */
if (ring_iter)
- ring_buffer_read(ring_iter, NULL);
+ ring_buffer_read(ring_iter, NULL, NULL);

return next;
}
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 71f2edb..e6fa0d1 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -29,7 +29,7 @@ static int trace_test_buffer_cpu(struct trace_array *tr, int cpu)
struct trace_entry *entry;
unsigned int loops = 0;

- while ((event = ring_buffer_consume(tr->buffer, cpu, NULL))) {
+ while ((event = ring_buffer_consume(tr->buffer, cpu, NULL, NULL))) {
entry = ring_buffer_event_data(event);

/*
--
1.5.5.3

2009-07-24 10:40:59

by Zhao Lei

Subject: [RFC PATCH 2/3] Apply walltime-supporting functions to trace system

We modified the ring-buffer functions' arguments to support walltime recording.
These functions' call sites need to be updated accordingly.

Signed-off-by: Zhao Lei <[email protected]>
---
include/linux/ftrace_event.h | 1 +
kernel/trace/trace.c | 25 +++++++++++++++----------
kernel/trace/trace.h | 3 ++-
kernel/trace/trace_output.c | 2 +-
4 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 5c093ff..4ffd87a 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -58,6 +58,7 @@ struct trace_iterator {
struct trace_entry *ent;
int cpu;
u64 ts;
+ struct timespec timespec;

loff_t pos;
long idx;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 22787b3..af25bec 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1399,7 +1399,8 @@ static void trace_iterator_increment(struct trace_iterator *iter)
}

static struct trace_entry *
-peek_next_entry(struct trace_iterator *iter, int cpu, u64 *ts)
+peek_next_entry(struct trace_iterator *iter, int cpu, u64 *ts,
+ struct timespec *timespec)
{
struct ring_buffer_event *event;
struct ring_buffer_iter *buf_iter = iter->buffer_iter[cpu];
@@ -1408,9 +1409,9 @@ peek_next_entry(struct trace_iterator *iter, int cpu, u64 *ts)
ftrace_disable_cpu();

if (buf_iter)
- event = ring_buffer_iter_peek(buf_iter, ts, NULL);
+ event = ring_buffer_iter_peek(buf_iter, ts, timespec);
else
- event = ring_buffer_peek(iter->tr->buffer, cpu, ts, NULL);
+ event = ring_buffer_peek(iter->tr->buffer, cpu, ts, timespec);

ftrace_enable_cpu();

@@ -1418,7 +1419,8 @@ peek_next_entry(struct trace_iterator *iter, int cpu, u64 *ts)
}

static struct trace_entry *
-__find_next_entry(struct trace_iterator *iter, int *ent_cpu, u64 *ent_ts)
+__find_next_entry(struct trace_iterator *iter, int *ent_cpu, u64 *ent_ts,
+ struct timespec *timespec)
{
struct ring_buffer *buffer = iter->tr->buffer;
struct trace_entry *ent, *next = NULL;
@@ -1434,7 +1436,7 @@ __find_next_entry(struct trace_iterator *iter, int *ent_cpu, u64 *ent_ts)
if (cpu_file > TRACE_PIPE_ALL_CPU) {
if (ring_buffer_empty_cpu(buffer, cpu_file))
return NULL;
- ent = peek_next_entry(iter, cpu_file, ent_ts);
+ ent = peek_next_entry(iter, cpu_file, ent_ts, timespec);
if (ent_cpu)
*ent_cpu = cpu_file;

@@ -1446,7 +1448,7 @@ __find_next_entry(struct trace_iterator *iter, int *ent_cpu, u64 *ent_ts)
if (ring_buffer_empty_cpu(buffer, cpu))
continue;

- ent = peek_next_entry(iter, cpu, &ts);
+ ent = peek_next_entry(iter, cpu, &ts, timespec);

/*
* Pick the entry with the smallest timestamp:
@@ -1469,15 +1471,17 @@ __find_next_entry(struct trace_iterator *iter, int *ent_cpu, u64 *ent_ts)

/* Find the next real entry, without updating the iterator itself */
struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
- int *ent_cpu, u64 *ent_ts)
+ int *ent_cpu, u64 *ent_ts,
+ struct timespec *timespec)
{
- return __find_next_entry(iter, ent_cpu, ent_ts);
+ return __find_next_entry(iter, ent_cpu, ent_ts, timespec);
}

/* Find the next real entry, and increment the iterator to the next entry */
static void *find_next_entry_inc(struct trace_iterator *iter)
{
- iter->ent = __find_next_entry(iter, &iter->cpu, &iter->ts);
+ iter->ent = __find_next_entry(iter, &iter->cpu, &iter->ts,
+ &iter->timespec);

if (iter->ent)
trace_iterator_increment(iter);
@@ -1489,7 +1493,8 @@ static void trace_consume(struct trace_iterator *iter)
{
/* Don't allow ftrace to trace into the ring buffers */
ftrace_disable_cpu();
- ring_buffer_consume(iter->tr->buffer, iter->cpu, &iter->ts, NULL);
+ ring_buffer_consume(iter->tr->buffer, iter->cpu, &iter->ts,
+ &iter->timespec);
ftrace_enable_cpu();
}

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 06886a0..fb3dcc8 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -452,7 +452,8 @@ struct trace_entry *tracing_get_trace_entry(struct trace_array *tr,
struct trace_array_cpu *data);

struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
- int *ent_cpu, u64 *ent_ts);
+ int *ent_cpu, u64 *ent_ts,
+ struct timespec *timespec);

void tracing_generic_entry_update(struct trace_entry *entry,
unsigned long flags,
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index e0c2545..f5c7f0c 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -519,7 +519,7 @@ int trace_print_lat_context(struct trace_iterator *iter)
struct trace_seq *s = &iter->seq;
struct trace_entry *entry = iter->ent,
*next_entry = trace_find_next_entry(iter, NULL,
- &next_ts);
+ &next_ts, NULL);
unsigned long verbose = (trace_flags & TRACE_ITER_VERBOSE);
unsigned long abs_usecs = ns2usecs(iter->ts - iter->tr->time_start);
unsigned long rel_usecs;
--
1.5.5.3

2009-07-24 10:41:35

by Zhao Lei

Subject: [RFC PATCH 3/3] Make ftrace display walltime in output

Now that the ring-buffer system is able to record events' walltime,
we can make ftrace display walltime in its output:
# cd /mnt/debugfs/tracing/
# echo sched_switch > current_tracer
# cat trace
# tracer: sched_switch
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
bash-2457 [000] 214.895446: 2457:120:S + [000] 2457:120:S bash
bash-2457 [000] 214.898298: 2457:120:R + [000] 5:115:S events/0
bash-2457 [000] 214.898684: 2457:120:R ==> [000] 5:115:R events/0
events/0-5 [000] 214.899458: 5:115:R + [000] 2455:120:S sshd
events/0-5 [000] 214.899495: 5:115:S ==> [000] 2455:120:R sshd
...
# echo 1 > options/walltime
# cat trace
# tracer: sched_switch
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<idle>-0 [000] 2009-07-25 18:01:00.848468: 0:140:R ==> [000] 5:115:R events/0
events/0-5 [000] 2009-07-25 18:01:00.848523: 5:115:S ==> [000] 0:140:R <idle>
<idle>-0 [000] 2009-07-25 18:01:01.613479: 0:140:R + [000] 2455:120:S sshd
<idle>-0 [000] 2009-07-25 18:01:01.613678: 0:140:R ==> [000] 2455:120:R sshd
sshd-2455 [000] 2009-07-25 18:01:01.614015: 2455:120:S + [000] 2455:120:S sshd
...

Signed-off-by: Zhao Lei <[email protected]>
---
kernel/trace/trace.c | 14 ++++++++++++--
kernel/trace/trace.h | 1 +
kernel/trace/trace_output.c | 21 +++++++++++++++++----
3 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index af25bec..4e360cf 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -326,6 +326,7 @@ static const char *trace_options[] = {
"global-clock",
"sleep-time",
"graph-time",
+ "walltime",
NULL
};

@@ -1601,8 +1602,17 @@ static void print_lat_help_header(struct seq_file *m)

static void print_func_help_header(struct seq_file *m)
{
- seq_puts(m, "# TASK-PID CPU# TIMESTAMP FUNCTION\n");
- seq_puts(m, "# | | | | |\n");
+ if (trace_flags & TRACE_ITER_WALLTIME) {
+ seq_puts(m, "# TASK-PID CPU# "
+ " TIMESTAMP FUNCTION\n");
+ seq_puts(m, "# | | | "
+ " | |\n");
+ } else {
+ seq_puts(m, "# TASK-PID CPU# "
+ " TIMESTAMP FUNCTION\n");
+ seq_puts(m, "# | | | "
+ " | |\n");
+ }
}


diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index fb3dcc8..0860b3b 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -664,6 +664,7 @@ enum trace_iterator_flags {
TRACE_ITER_GLOBAL_CLK = 0x80000,
TRACE_ITER_SLEEP_TIME = 0x100000,
TRACE_ITER_GRAPH_TIME = 0x200000,
+ TRACE_ITER_WALLTIME = 0x400000, /* Print walltime */
};

/*
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index f5c7f0c..3a52bb5 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -501,15 +501,28 @@ int trace_print_context(struct trace_iterator *iter)
{
struct trace_seq *s = &iter->seq;
struct trace_entry *entry = iter->ent;
- unsigned long long t = ns2usecs(iter->ts);
- unsigned long usec_rem = do_div(t, USEC_PER_SEC);
- unsigned long secs = (unsigned long)t;
char comm[TASK_COMM_LEN];

trace_find_cmdline(entry->pid, comm);

- return trace_seq_printf(s, "%16s-%-5d [%03d] %5lu.%06lu: ",
+ if (trace_flags & TRACE_ITER_WALLTIME) {
+ struct tm tm;
+ localtime_r(iter->timespec.tv_sec, &tm);
+ return trace_seq_printf(s, "%16s-%-5d [%03d] "
+ "%04u-%02u-%02u "
+ "%02u:%02u:%02u.%06llu: ",
+ comm, entry->pid, iter->cpu,
+ tm.tm_year + 1900, tm.tm_mon + 1,
+ tm.tm_mday, tm.tm_hour, tm.tm_min,
+ tm.tm_sec,
+ ns2usecs(iter->timespec.tv_nsec));
+ } else {
+ unsigned long long t = ns2usecs(iter->ts);
+ unsigned long usec_rem = do_div(t, USEC_PER_SEC);
+ unsigned long secs = (unsigned long)t;
+ return trace_seq_printf(s, "%16s-%-5d [%03d] %5lu.%06lu: ",
comm, entry->pid, iter->cpu, secs, usec_rem);
+ }
}

int trace_print_lat_context(struct trace_iterator *iter)
--
1.5.5.3

2009-07-24 10:44:41

by Zhao Lei

Subject: Re: [PATCH] Add walltime support for ring-buffer

Zhaolei wrote:
> Note:
> It is only a prototype patch to for demostrate what is result
> looks like.
> There still have many works to do(bug to fix) before apply.
>
Hello,

I forgot to add [RFC 1/3] to the title, and have resent it with the correct title.
Please ignore this mail, sorry.

Thanks
Zhaolei

2009-07-24 13:05:21

by Steven Rostedt

Subject: Re: [PATCH 0/3] Add walltime support for ring-buffer


On Fri, 24 Jul 2009, Zhaolei wrote:

> Hello, Steven
>
> It is a prototype code to make ftrace display walltime of events.
>
> It need to applys on my patch of:
> [PATCH v3 1/2] Add function to convert between calendar time and broken-down time for universal use
>
> This code can run but need to be fixed, it only for demostrate what is result
> looks like.
>
> I think display walltime is valuable for ftrace, it can provide more
> recognizable information to system manager that he can known "when"
> event happened.
> And it is also helpful for flight-recorder we are doing now, we can get
> walltime of events after panic for analyse.
>
> We can get following result on this patch:
> # cd /mnt/debugfs/tracing/
> # echo sched_switch > current_tracer
> # cat trace
> # tracer: sched_switch
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> bash-2457 [000] 214.895446: 2457:120:S + [000] 2457:120:S bash
> bash-2457 [000] 214.898298: 2457:120:R + [000] 5:115:S events/0
> bash-2457 [000] 214.898684: 2457:120:R ==> [000] 5:115:R events/0
> events/0-5 [000] 214.899458: 5:115:R + [000] 2455:120:S sshd
> events/0-5 [000] 214.899495: 5:115:S ==> [000] 2455:120:R sshd
> ...
> # echo 1 > options/walltime
> # cat trace
> # tracer: sched_switch
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> <idle>-0 [000] 2009-07-25 18:01:00.848468: 0:140:R ==> [000] 5:115:R events/0
> events/0-5 [000] 2009-07-25 18:01:00.848523: 5:115:S ==> [000] 0:140:R <idle>
> <idle>-0 [000] 2009-07-25 18:01:01.613479: 0:140:R + [000] 2455:120:S sshd
> <idle>-0 [000] 2009-07-25 18:01:01.613678: 0:140:R ==> [000] 2455:120:R sshd
> sshd-2455 [000] 2009-07-25 18:01:01.614015: 2455:120:S + [000] 2455:120:S sshd
> ...
>
> This patch is prototype because we have following things to do:
> 1: function to get walltime is not near rb_time_stamp()
> So walltime is not absolutely synchronize with trace time.
> But if we place code to get walltime near rb_time_stamp(), it will be called
> on every reserve-event and will make system slow.
> 2: We can't call getnstimeofday() freely in ring-buffer's code, because
> it is possibility already hold xtime_lock. Prototype code used a ugly way
> to get gid of this problem, maybe we need to ftrace_disable_cpu() before
> every write_seqlock() instead.
> 3: People maybe change walltime when tracing, but current ring-buffer page's
> walltime can not reflect this change. We need to add tracepoints to watch
> this change, and add a walltime-change event(already reserved in
> ring-buffer.c) to record this change.
>
> I'll continue to finish remain steps, what is your opinion on it?

Hi Zhaolei,

Could you not just save the wall time in some 64bit format within the
current timestamp? Right now there's three clocks that can be used by
ftrace. Two that you can really get to (I need to add access to the
third). The default clock is sched_clock. But there's an option to get to
global_clock:

debug/tracing/options/global-clock

# echo 1 > /debug/tracing/options/global-clock

Will enable the global clock. Perhaps we can add a "wall clock" version
that will map the timestamps to the wall clock?

I really hate adding any more fields to the ring buffer headers. That
takes away from the amount that you can record per page.

-- Steve

2009-07-28 01:43:15

by KOSAKI Motohiro

Subject: Re: [PATCH 0/3] Add walltime support for ring-buffer

Hi Zhaolei,

> Hi Zhaolei,
>
> Could you not just save the wall time in some 64bit format within the
> current timestamp? Right now there's three clocks that can be used by
> ftrace. Two that you can really get two (I need to add access to the
> third). The default clock is sched_clock. But there's an option to get to
> global_clock:
>
> debug/tracing/options/global-clock
>
> # echo 1 > /debug/tracing/options/global-clock
>
> Will enable the global clock. Perhaps we can add a "wall clock" version
> that will map the timestamps to the wall clock?
>
> I really hate adding any more fields to the ring buffer headers. That
> takes away from the amount that you can record per page.

I think wall-time recording on every event is too slow, and a slower
tracer isn't useful.
Instead, can we emit a pseudo wall-time recording event periodically?

Nowadays, many workloads run in cluster environments, so we often need
to compare tracing logs from different machines; that means the TSC isn't
sufficient. But nobody wants a performance reduction.
Fortunately, ntp time adjustments don't happen very frequently, so we
don't need wall-time on _every_ event.

Thanks.

2009-07-28 01:53:56

by Frederic Weisbecker

Subject: Re: [PATCH 0/3] Add walltime support for ring-buffer

On Tue, Jul 28, 2009 at 10:43:11AM +0900, KOSAKI Motohiro wrote:
> Hi Zhaolei,
>
> > Hi Zhaolei,
> >
> > Could you not just save the wall time in some 64bit format within the
> > current timestamp? Right now there's three clocks that can be used by
> > ftrace. Two that you can really get two (I need to add access to the
> > third). The default clock is sched_clock. But there's an option to get to
> > global_clock:
> >
> > debug/tracing/options/global-clock
> >
> > # echo 1 > /debug/tracing/options/global-clock
> >
> > Will enable the global clock. Perhaps we can add a "wall clock" version
> > that will map the timestamps to the wall clock?
> >
> > I really hate adding any more fields to the ring buffer headers. That
> > takes away from the amount that you can record per page.
>
> I think wall-time recording on every event is too slow. slower tracer
> isn't useful.
> Instead, Can we make pseudo wall-time recording event periodically?
>
> In nowadays, many workload run on cluster machine environment. Then
> We often need compare different machines tracing log. it mean TSC isn't
> sufficient. but nobody want performance reduce.
> Fortunatelly, ntp time adjustment is not happend so frequently. we
> don't need wall-time on _every_ event.
>
> Thanks.


Yeah, that would really slow down the tracing.

Can't we take a snapshot of the pair (walltime, cpu clock) once
at startup?
Then we can retrieve the walltime when cpu_clock was 0 and compute
the walltime for each trace at output time, using the trace timestamp as
a delta?

Thanks,
Frederic.

2009-07-28 02:19:49

by Steven Rostedt

Subject: Re: [PATCH 0/3] Add walltime support for ring-buffer


On Tue, 28 Jul 2009, Frederic Weisbecker wrote:
> >
> > In nowadays, many workload run on cluster machine environment. Then
> > We often need compare different machines tracing log. it mean TSC isn't
> > sufficient. but nobody want performance reduce.
> > Fortunatelly, ntp time adjustment is not happend so frequently. we
> > don't need wall-time on _every_ event.
> >
> > Thanks.
>
>
> Yeah, that would really slow the tracing.
>
> Can't we get a snapshot of the couple (walltime, cpu clock) once
> at startup.
> Then we can retrieve the walltime when cpu_clock was 0 and compute
> the walltime for each traces at output time using the trace timestamp as
> a delta?

This should be something in one of the trace clock sources. It can perhaps
have a helper function called by gettimeofday or something, but it should
be something that gets plugged into the ring buffer and not something that
needs to modify any of its code.

-- Steve

2009-07-28 02:23:19

by KOSAKI Motohiro

Subject: Re: [PATCH 0/3] Add walltime support for ring-buffer

> On Tue, Jul 28, 2009 at 10:43:11AM +0900, KOSAKI Motohiro wrote:
> > Hi Zhaolei,
> >
> > > Hi Zhaolei,
> > >
> > > Could you not just save the wall time in some 64bit format within the
> > > current timestamp? Right now there's three clocks that can be used by
> > > ftrace. Two that you can really get two (I need to add access to the
> > > third). The default clock is sched_clock. But there's an option to get to
> > > global_clock:
> > >
> > > debug/tracing/options/global-clock
> > >
> > > # echo 1 > /debug/tracing/options/global-clock
> > >
> > > Will enable the global clock. Perhaps we can add a "wall clock" version
> > > that will map the timestamps to the wall clock?
> > >
> > > I really hate adding any more fields to the ring buffer headers. That
> > > takes away from the amount that you can record per page.
> >
> > I think wall-time recording on every event is too slow. slower tracer
> > isn't useful.
> > Instead, Can we make pseudo wall-time recording event periodically?
> >
> > In nowadays, many workload run on cluster machine environment. Then
> > We often need compare different machines tracing log. it mean TSC isn't
> > sufficient. but nobody want performance reduce.
> > Fortunatelly, ntp time adjustment is not happend so frequently. we
> > don't need wall-time on _every_ event.
> >
> > Thanks.
>
>
> Yeah, that would really slow the tracing.
>
> Can't we get a snapshot of the couple (walltime, cpu clock) once
> at startup.
> Then we can retrieve the walltime when cpu_clock was 0 and compute
> the walltime for each traces at output time using the trace timestamp as
> a delta?


It can, but I don't think it is sufficient.
In the flight-recorder use case, tracing runs for more than a year, and
the administrator only sees the last N seconds of the log (earlier logs
were overwritten as the ring-buffer wrapped around).

Thanks.