2010-12-04 00:13:36

by David Sharp

Subject: [Patch 00/15] Reduce tracing payload size.

This is a series of patches for tracing aimed at reducing the size of events in
the trace buffer. For some tracing tasks, the machine is so heavily loaded that
events are produced much faster than they can be written to disk or the
network. In this situation, the sizes of the trace events and the ring buffer
are the primary factors limiting the length of time that can be traced. Even
with ktrace (which has significantly smaller event sizes), a fairly minimal set
of enabled events, and a 16MB buffer per cpu, we are finding we can only get
about 20 seconds of tracing. This is not enough when the problem being
diagnosed only occurs at longer timescales and cannot be easily detected at
runtime.

These patches are just a start to shrinking the size of trace events. I am
planning to also make small versions of trace events that are used when
CONFIG_SMALL_TRACES is enabled. I'm also open to discussion of other ways to
shrink event sizes.


The patches can be divided into three major groups.

The first group consists of self-explanatory patches, not actually related to
shrinking events:

tracing: Add a 'buffer_overwrite' debugfs file
ring_buffer.c: Remove unused #include <linux/ftrace_irq.h>
ring_buffer: Align buffer_page struct allocations only to fit the flags.

The second group consists of patches that adjust the packing and alignment of
tracing event structures:

ftrace: pack event structures.
ftrace: fix event alignment: ftrace:context_switch and ftrace:wakeup
ftrace: fix event alignment: module:module_request
ftrace: fix event alignment: kvm:kvm_hv_hypercall
ftrace: fix event alignment: mce:mce_record
ftrace: fix event alignment: skb:kfree_skb
ftrace: fix event alignment: jbd2:*
ftrace: fix event alignment: ext4:*

The third group consists of patches that remove 8 bytes from struct trace_entry:

trace_output.c: adjust conditional expression formatting.
small_traces: Add config option to shrink trace events.
small_traces: Remove trace output of large fields.
small_traces: Remove 8 bytes from trace_entry.


Thanks,
d#


2010-12-04 00:13:46

by David Sharp

Subject: [PATCH 01/15] tracing: Add a 'buffer_overwrite' debugfs file

From: Jiaying Zhang <[email protected]>

Add a "buffer_overwrite" debugfs file for ftrace to control whether the buffer
should be overwritten on overflow or not. The default remains to overwrite old
events when the buffer is full. This patch adds the option to instead discard
newest events when the buffer is full. This is useful to get a snapshot of
traces just after enabling traces. Dropping the current event is also a simpler
code path.
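
For illustration, usage looks like this (assuming debugfs is mounted at the
usual /sys/kernel/debug; the mount point itself is not part of this patch):

# cd /sys/kernel/debug/tracing
# echo 0 > buffer_overwrite     <-- drop the newest events when the buffer fills
# echo 1 > buffer_overwrite     <-- default: overwrite the oldest events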

Signed-off-by: David Sharp <[email protected]>
---
Documentation/trace/ftrace.txt | 6 ++++
include/linux/ring_buffer.h | 2 +
kernel/trace/ring_buffer.c | 11 +++++++
kernel/trace/trace.c | 59 +++++++++++++++++++++++++++++++++++++---
4 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 557c1ed..9237da3 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -138,6 +138,12 @@ of ftrace. Here is a list of some of the key files:
This can only be updated when the current_tracer
is set to "nop".

+ buffer_overwrite:
+
+ This controls what happens when the trace buffer is full.
+ If "1" (default), the oldest events are discarded and
+ overwritten. If "0", then the newest events are discarded.
+
tracing_cpumask:

This is a mask that lets the user only trace
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 8d3a248..ab38ac8 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -100,6 +100,8 @@ void ring_buffer_free(struct ring_buffer *buffer);

int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);

+void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
+
struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer,
unsigned long length);
int ring_buffer_unlock_commit(struct ring_buffer *buffer,
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9ed509a..3207147 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1429,6 +1429,17 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
}
EXPORT_SYMBOL_GPL(ring_buffer_resize);

+void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val)
+{
+ mutex_lock(&buffer->mutex);
+ if (val)
+ buffer->flags |= RB_FL_OVERWRITE;
+ else
+ buffer->flags &= ~RB_FL_OVERWRITE;
+ mutex_unlock(&buffer->mutex);
+}
+EXPORT_SYMBOL_GPL(ring_buffer_change_overwrite);
+
static inline void *
__rb_data_page_index(struct buffer_data_page *bpage, unsigned index)
{
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c380612..ed5c14f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -41,8 +41,6 @@
#include "trace.h"
#include "trace_output.h"

-#define TRACE_BUFFER_FLAGS (RB_FL_OVERWRITE)
-
/*
* On boot up, the ring buffer is set to the minimum size, so that
* we do not waste memory on systems that are not using tracing.
@@ -241,6 +239,9 @@ int tracing_is_enabled(void)

static unsigned long trace_buf_size = TRACE_BUF_SIZE_DEFAULT;

+/* whether the trace buffer should be overwritten on overflow or not. */
+static enum ring_buffer_flags trace_buffer_flags = RB_FL_OVERWRITE;
+
/* trace_types holds a link list of available tracers. */
static struct tracer *trace_types __read_mostly;

@@ -3466,6 +3467,47 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
return cnt;
}

+static ssize_t
+tracing_overwrite_read(struct file *filp, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ char buf[64];
+ int r;
+ r = snprintf(buf, 64, "%u\n", trace_buffer_flags);
+ return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
+}
+
+static ssize_t
+tracing_overwrite_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ unsigned long val;
+ char buf[64];
+ int ret;
+
+ if (cnt >= sizeof(buf))
+ return -EINVAL;
+
+ if (copy_from_user(&buf, ubuf, cnt))
+ return -EFAULT;
+
+ buf[cnt] = 0;
+
+ ret = strict_strtoul(buf, 10, &val);
+ if (ret < 0)
+ return ret;
+ *ppos += cnt;
+
+ if (val != 0 && val != 1)
+ return -EINVAL;
+
+ if (trace_buffer_flags != val) {
+ trace_buffer_flags = val;
+ ring_buffer_change_overwrite(global_trace.buffer, val);
+ }
+ return cnt;
+}
+
static int mark_printk(const char *fmt, ...)
{
int ret;
@@ -3611,6 +3653,12 @@ static const struct file_operations tracing_entries_fops = {
.llseek = generic_file_llseek,
};

+static const struct file_operations tracing_overwrite_fops = {
+ .open = tracing_open_generic,
+ .read = tracing_overwrite_read,
+ .write = tracing_overwrite_write,
+};
+
static const struct file_operations tracing_mark_fops = {
.open = tracing_open_generic,
.write = tracing_mark_write,
@@ -4336,6 +4384,9 @@ static __init int tracer_init_debugfs(void)
trace_create_file("buffer_size_kb", 0644, d_tracer,
&global_trace, &tracing_entries_fops);

+ trace_create_file("buffer_overwrite", 0644, d_tracer,
+ &global_trace, &tracing_overwrite_fops);
+
trace_create_file("trace_marker", 0220, d_tracer,
NULL, &tracing_mark_fops);

@@ -4565,7 +4616,7 @@ __init static int tracer_alloc_buffers(void)

/* TODO: make the number of buffers hot pluggable with CPUS */
global_trace.buffer = ring_buffer_alloc(ring_buf_size,
- TRACE_BUFFER_FLAGS);
+ trace_buffer_flags);
if (!global_trace.buffer) {
printk(KERN_ERR "tracer: failed to allocate ring buffer!\n");
WARN_ON(1);
@@ -4575,7 +4626,7 @@ __init static int tracer_alloc_buffers(void)


#ifdef CONFIG_TRACER_MAX_TRACE
- max_tr.buffer = ring_buffer_alloc(1, TRACE_BUFFER_FLAGS);
+ max_tr.buffer = ring_buffer_alloc(1, trace_buffer_flags);
if (!max_tr.buffer) {
printk(KERN_ERR "tracer: failed to allocate max ring buffer!\n");
WARN_ON(1);
--
1.7.3.1

2010-12-04 00:13:56

by David Sharp

Subject: [PATCH 13/15] small_traces: Add config option to shrink trace events.

This option will remove several fields from trace events to reduce the
payload size. Under heavy load, events can be produced faster than they
can possibly be written to disk or sent on the network. Shrinking the
payload size will enable fitting more traces in the ring buffer, which
therefore enables tracing for longer periods, at the cost of losing some
relatively unimportant details.
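
As a rough illustration (the ~40-byte event size is an assumption for the
sake of arithmetic, not a measurement; actual sizes vary by tracepoint):

  16MB per-cpu buffer / ~40 bytes per event ~= 420k events
  16MB per-cpu buffer / ~32 bytes per event ~= 525k events

i.e. dropping 8 bytes per event fits roughly 25% more events, and therefore
roughly 25% more trace time, in the same buffer.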

Google-Bug-Id: 3224547

Signed-off-by: David Sharp <[email protected]>
---
kernel/trace/Kconfig | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index ea37e2f..9c24fe5 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -403,6 +403,18 @@ config FUNCTION_PROFILER

If in doubt, say N.

+config SMALL_TRACES
+ bool "Shrink trace events"
+ default n
+ help
+ Minimize the size of trace events. Omits fields from trace event
+ header like preempt_count, and lock_depth. Chooses smaller
+ alternative event structures. Shrinking the payload size enables
+ fitting more traces in the ring buffer, which enables tracing for
+ longer periods when the buffer can not be read as fast as it is
+ filled, at the cost of losing some details.
+
+
config FTRACE_MCOUNT_RECORD
def_bool y
depends on DYNAMIC_FTRACE
--
1.7.3.1

2010-12-04 00:14:02

by David Sharp

Subject: [PATCH 02/15] ring_buffer.c: Remove unused #include <linux/ftrace_irq.h>


Signed-off-by: David Sharp <[email protected]>
---
kernel/trace/ring_buffer.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3207147..8ef7cc4 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5,7 +5,6 @@
*/
#include <linux/ring_buffer.h>
#include <linux/trace_clock.h>
-#include <linux/ftrace_irq.h>
#include <linux/spinlock.h>
#include <linux/debugfs.h>
#include <linux/uaccess.h>
--
1.7.3.1

2010-12-04 00:14:07

by David Sharp

Subject: [PATCH 10/15] ftrace: fix event alignment: jbd2:*


Signed-off-by: David Sharp <[email protected]>
---
include/trace/events/jbd2.h | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
index 7447ea9..2f9f362 100644
--- a/include/trace/events/jbd2.h
+++ b/include/trace/events/jbd2.h
@@ -41,15 +41,15 @@ DECLARE_EVENT_CLASS(jbd2_commit,
TP_STRUCT__entry(
__field( int, dev_major )
__field( int, dev_minor )
- __field( char, sync_commit )
__field( int, transaction )
+ __field( char, sync_commit )
),

TP_fast_assign(
__entry->dev_major = MAJOR(journal->j_fs_dev->bd_dev);
__entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev);
- __entry->sync_commit = commit_transaction->t_synchronous_commit;
__entry->transaction = commit_transaction->t_tid;
+ __entry->sync_commit = commit_transaction->t_synchronous_commit;
),

TP_printk("dev %d,%d transaction %d sync %d",
@@ -93,17 +93,17 @@ TRACE_EVENT(jbd2_end_commit,
TP_STRUCT__entry(
__field( int, dev_major )
__field( int, dev_minor )
- __field( char, sync_commit )
__field( int, transaction )
__field( int, head )
+ __field( char, sync_commit )
),

TP_fast_assign(
__entry->dev_major = MAJOR(journal->j_fs_dev->bd_dev);
__entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev);
- __entry->sync_commit = commit_transaction->t_synchronous_commit;
__entry->transaction = commit_transaction->t_tid;
__entry->head = journal->j_tail_sequence;
+ __entry->sync_commit = commit_transaction->t_synchronous_commit;
),

TP_printk("dev %d,%d transaction %d sync %d head %d",
--
1.7.3.1

2010-12-04 00:13:54

by David Sharp

Subject: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall


Signed-off-by: David Sharp <[email protected]>
---
arch/x86/kvm/trace.h | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index a6544b8..ab41fb0 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -62,21 +62,21 @@ TRACE_EVENT(kvm_hv_hypercall,
TP_ARGS(code, fast, rep_cnt, rep_idx, ingpa, outgpa),

TP_STRUCT__entry(
- __field( __u16, code )
- __field( bool, fast )
__field( __u16, rep_cnt )
__field( __u16, rep_idx )
__field( __u64, ingpa )
__field( __u64, outgpa )
+ __field( __u16, code )
+ __field( bool, fast )
),

TP_fast_assign(
- __entry->code = code;
- __entry->fast = fast;
__entry->rep_cnt = rep_cnt;
__entry->rep_idx = rep_idx;
__entry->ingpa = ingpa;
__entry->outgpa = outgpa;
+ __entry->code = code;
+ __entry->fast = fast;
),

TP_printk("code 0x%x %s cnt 0x%x idx 0x%x in 0x%llx out 0x%llx",
--
1.7.3.1

2010-12-04 00:13:55

by David Sharp

Subject: [PATCH 04/15] ftrace: pack event structures.

Ftrace event structures have a 12-byte struct trace_entry at the beginning.
If the structure is naturally aligned, this means that if the first field is
64 bits wide, there will be 4 bytes of padding after the header. Ironically,
due to the 4-byte ring buffer header, this makes 64-bit writes unaligned
whenever the ring buffer position is currently 64-bit aligned:
4(rb)+12(ftrace)+4(pad) = 20; 20%8 = 4

Adding __attribute__((packed)) to the event structures removes the extra
space from the trace events, and actually improves alignment of trace
events whose first field is 64 bits wide.
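
To illustrate with a sketch ("example_event" is made up here, not an event
from the tree):

struct trace_entry {                    /* 12 bytes */
        unsigned short  type;
        unsigned char   flags;
        unsigned char   preempt_count;
        int             pid;
        int             lock_depth;
};

struct example_event {
        struct trace_entry ent;         /* offsets 0..11 */
        u64 first;      /* unpacked: offset 16, i.e. 4 bytes of padding;
                         * packed: offset 12, which is 64-bit aligned once
                         * the 4-byte ring buffer header is placed in front
                         * of the record */
};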

About 65 tracepoints have a 4-byte pad at offset 12:
# find events -name format | xargs -n1 awk '
$1=="name:" {name=$2}
$1=="format:"{FS="\t"}
$3=="offset:12;" && $4=="size:4;"{okay=1}
$3=="offset:16;" && !okay {print name}' | wc -l
65

With all 'syscalls' and 'timer' events enabled, this results in a 5%
improvement in a simple 512MB read benchmark with warm caches.

Tested:

setup:
# echo 1 >events/syscalls/enable
# echo 1 >events/timer/enable
# echo 0 > tracing_enabled
off:
# for n in $(seq 10) ; do \
time dd if=/dev/hda3 of=/dev/null bs=1K count=512K ; \
done
on:
# for n in $(seq 10) ; do \
echo 1 >tracing_enabled; \
time dd if=/dev/hda3 of=/dev/null bs=1K count=512K ; \
echo 0 >tracing_enabled; \
echo > trace; \
done

real time mean/median/stdev
w/o patch off: 1.1679/1.164/0.0169
w/o patch on : 1.9432/1.936/0.0274
w/ patch off: 1.1715/1.159/0.0431
w/ patch on : 1.8425/1.836/0.0138
"on" delta: -0.1007 --> -5.2%

Google-Bug-Id: 2895627

Signed-off-by: David Sharp <[email protected]>
---
include/trace/ftrace.h | 5 +++--
kernel/trace/trace.h | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index a9377c0..51d1f52 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -48,7 +48,8 @@
#define __array(type, item, len) type item[len];

#undef __dynamic_array
-#define __dynamic_array(type, item, len) u32 __data_loc_##item;
+#define __dynamic_array(type, item, len) \
+ u32 __data_loc_##item __attribute__((aligned(4)));

#undef __string
#define __string(item, src) __dynamic_array(char, item, -1)
@@ -62,7 +63,7 @@
struct trace_entry ent; \
tstruct \
char __data[0]; \
- }; \
+ } __attribute__((packed)); \
\
static struct ftrace_event_class event_class_##name;

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 9021f8c..2e80433 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -60,7 +60,7 @@ enum trace_type {
struct struct_name { \
struct trace_entry ent; \
tstruct \
- }
+ } __attribute__((packed))

#undef TP_ARGS
#define TP_ARGS(args...) args
--
1.7.3.1

2010-12-04 00:14:43

by David Sharp

Subject: [PATCH 11/15] ftrace: fix event alignment: ext4:*


Signed-off-by: David Sharp <[email protected]>
---
include/trace/events/ext4.h | 70 +++++++++++++++++++++---------------------
1 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index e5e345f..9887cdb 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -23,21 +23,21 @@ TRACE_EVENT(ext4_free_inode,
TP_STRUCT__entry(
__field( int, dev_major )
__field( int, dev_minor )
+ __field( blkcnt_t, blocks )
__field( ino_t, ino )
- __field( umode_t, mode )
__field( uid_t, uid )
__field( gid_t, gid )
- __field( blkcnt_t, blocks )
+ __field( umode_t, mode )
),

TP_fast_assign(
__entry->dev_major = MAJOR(inode->i_sb->s_dev);
__entry->dev_minor = MINOR(inode->i_sb->s_dev);
+ __entry->blocks = inode->i_blocks;
__entry->ino = inode->i_ino;
- __entry->mode = inode->i_mode;
__entry->uid = inode->i_uid;
__entry->gid = inode->i_gid;
- __entry->blocks = inode->i_blocks;
+ __entry->mode = inode->i_mode;
),

TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u blocks %llu",
@@ -342,10 +342,10 @@ TRACE_EVENT(ext4_da_writepages,
__field( long, pages_skipped )
__field( loff_t, range_start )
__field( loff_t, range_end )
+ __field( pgoff_t, writeback_index )
__field( char, for_kupdate )
__field( char, for_reclaim )
__field( char, range_cyclic )
- __field( pgoff_t, writeback_index )
),

TP_fast_assign(
@@ -356,10 +356,10 @@ TRACE_EVENT(ext4_da_writepages,
__entry->pages_skipped = wbc->pages_skipped;
__entry->range_start = wbc->range_start;
__entry->range_end = wbc->range_end;
+ __entry->writeback_index = inode->i_mapping->writeback_index;
__entry->for_kupdate = wbc->for_kupdate;
__entry->for_reclaim = wbc->for_reclaim;
__entry->range_cyclic = wbc->range_cyclic;
- __entry->writeback_index = inode->i_mapping->writeback_index;
),

TP_printk("dev %d,%d ino %lu nr_to_write %ld pages_skipped %ld "
@@ -425,8 +425,8 @@ TRACE_EVENT(ext4_da_writepages_result,
__field( int, ret )
__field( int, pages_written )
__field( long, pages_skipped )
- __field( char, more_io )
__field( pgoff_t, writeback_index )
+ __field( char, more_io )
),

TP_fast_assign(
@@ -436,8 +436,8 @@ TRACE_EVENT(ext4_da_writepages_result,
__entry->ret = ret;
__entry->pages_written = pages_written;
__entry->pages_skipped = wbc->pages_skipped;
- __entry->more_io = wbc->more_io;
__entry->writeback_index = inode->i_mapping->writeback_index;
+ __entry->more_io = wbc->more_io;
),

TP_printk("dev %d,%d ino %lu ret %d pages_written %d pages_skipped %ld more_io %d writeback_index %lu",
@@ -485,8 +485,8 @@ DECLARE_EVENT_CLASS(ext4__mb_new_pa,
__field( int, dev_minor )
__field( ino_t, ino )
__field( __u64, pa_pstart )
- __field( __u32, pa_len )
__field( __u64, pa_lstart )
+ __field( __u32, pa_len )

),

@@ -495,8 +495,8 @@ DECLARE_EVENT_CLASS(ext4__mb_new_pa,
__entry->dev_minor = MINOR(ac->ac_sb->s_dev);
__entry->ino = ac->ac_inode->i_ino;
__entry->pa_pstart = pa->pa_pstart;
- __entry->pa_len = pa->pa_len;
__entry->pa_lstart = pa->pa_lstart;
+ __entry->pa_len = pa->pa_len;
),

TP_printk("dev %d,%d ino %lu pstart %llu len %u lstart %llu",
@@ -724,20 +724,20 @@ TRACE_EVENT(ext4_free_blocks,
__field( int, dev_major )
__field( int, dev_minor )
__field( ino_t, ino )
- __field( umode_t, mode )
__field( __u64, block )
__field( unsigned long, count )
__field( int, flags )
+ __field( umode_t, mode )
),

TP_fast_assign(
__entry->dev_major = MAJOR(inode->i_sb->s_dev);
__entry->dev_minor = MINOR(inode->i_sb->s_dev);
__entry->ino = inode->i_ino;
- __entry->mode = inode->i_mode;
__entry->block = block;
__entry->count = count;
__entry->flags = flags;
+ __entry->mode = inode->i_mode;
),

TP_printk("dev %d,%d ino %lu mode 0%o block %llu count %lu flags %d",
@@ -766,8 +766,8 @@ TRACE_EVENT(ext4_sync_file,
__entry->dev_major = MAJOR(dentry->d_inode->i_sb->s_dev);
__entry->dev_minor = MINOR(dentry->d_inode->i_sb->s_dev);
__entry->ino = dentry->d_inode->i_ino;
- __entry->datasync = datasync;
__entry->parent = dentry->d_parent->d_inode->i_ino;
+ __entry->datasync = datasync;
),

TP_printk("dev %d,%d ino %ld parent %ld datasync %d ",
@@ -834,12 +834,6 @@ TRACE_EVENT(ext4_mballoc_alloc,
__field( int, dev_major )
__field( int, dev_minor )
__field( ino_t, ino )
- __field( __u16, found )
- __field( __u16, groups )
- __field( __u16, buddy )
- __field( __u16, flags )
- __field( __u16, tail )
- __field( __u8, cr )
__field( __u32, orig_logical )
__field( int, orig_start )
__field( __u32, orig_group )
@@ -852,18 +846,18 @@ TRACE_EVENT(ext4_mballoc_alloc,
__field( int, result_start )
__field( __u32, result_group )
__field( int, result_len )
+ __field( __u16, found )
+ __field( __u16, groups )
+ __field( __u16, buddy )
+ __field( __u16, flags )
+ __field( __u16, tail )
+ __field( __u8, cr )
),

TP_fast_assign(
__entry->dev_major = MAJOR(ac->ac_inode->i_sb->s_dev);
__entry->dev_minor = MINOR(ac->ac_inode->i_sb->s_dev);
__entry->ino = ac->ac_inode->i_ino;
- __entry->found = ac->ac_found;
- __entry->flags = ac->ac_flags;
- __entry->groups = ac->ac_groups_scanned;
- __entry->buddy = ac->ac_buddy;
- __entry->tail = ac->ac_tail;
- __entry->cr = ac->ac_criteria;
__entry->orig_logical = ac->ac_o_ex.fe_logical;
__entry->orig_start = ac->ac_o_ex.fe_start;
__entry->orig_group = ac->ac_o_ex.fe_group;
@@ -876,6 +870,12 @@ TRACE_EVENT(ext4_mballoc_alloc,
__entry->result_start = ac->ac_f_ex.fe_start;
__entry->result_group = ac->ac_f_ex.fe_group;
__entry->result_len = ac->ac_f_ex.fe_len;
+ __entry->found = ac->ac_found;
+ __entry->groups = ac->ac_groups_scanned;
+ __entry->buddy = ac->ac_buddy;
+ __entry->flags = ac->ac_flags;
+ __entry->tail = ac->ac_tail;
+ __entry->cr = ac->ac_criteria;
),

TP_printk("dev %d,%d inode %lu orig %u/%d/%u@%u goal %u/%d/%u@%u "
@@ -1001,18 +1001,18 @@ TRACE_EVENT(ext4_forget,
__field( int, dev_major )
__field( int, dev_minor )
__field( ino_t, ino )
- __field( umode_t, mode )
- __field( int, is_metadata )
__field( __u64, block )
+ __field( int, is_metadata )
+ __field( umode_t, mode )
),

TP_fast_assign(
__entry->dev_major = MAJOR(inode->i_sb->s_dev);
__entry->dev_minor = MINOR(inode->i_sb->s_dev);
__entry->ino = inode->i_ino;
- __entry->mode = inode->i_mode;
- __entry->is_metadata = is_metadata;
__entry->block = block;
+ __entry->is_metadata = is_metadata;
+ __entry->mode = inode->i_mode;
),

TP_printk("dev %d,%d ino %lu mode 0%o is_metadata %d block %llu",
@@ -1030,24 +1030,24 @@ TRACE_EVENT(ext4_da_update_reserve_space,
__field( int, dev_major )
__field( int, dev_minor )
__field( ino_t, ino )
- __field( umode_t, mode )
__field( __u64, i_blocks )
__field( int, used_blocks )
__field( int, reserved_data_blocks )
__field( int, reserved_meta_blocks )
__field( int, allocated_meta_blocks )
+ __field( umode_t, mode )
),

TP_fast_assign(
__entry->dev_major = MAJOR(inode->i_sb->s_dev);
__entry->dev_minor = MINOR(inode->i_sb->s_dev);
__entry->ino = inode->i_ino;
- __entry->mode = inode->i_mode;
__entry->i_blocks = inode->i_blocks;
__entry->used_blocks = used_blocks;
__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
__entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks;
__entry->allocated_meta_blocks = EXT4_I(inode)->i_allocated_meta_blocks;
+ __entry->mode = inode->i_mode;
),

TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu used_blocks %d reserved_data_blocks %d reserved_meta_blocks %d allocated_meta_blocks %d",
@@ -1067,22 +1067,22 @@ TRACE_EVENT(ext4_da_reserve_space,
__field( int, dev_major )
__field( int, dev_minor )
__field( ino_t, ino )
- __field( umode_t, mode )
__field( __u64, i_blocks )
__field( int, md_needed )
__field( int, reserved_data_blocks )
__field( int, reserved_meta_blocks )
+ __field( umode_t, mode )
),

TP_fast_assign(
__entry->dev_major = MAJOR(inode->i_sb->s_dev);
__entry->dev_minor = MINOR(inode->i_sb->s_dev);
__entry->ino = inode->i_ino;
- __entry->mode = inode->i_mode;
__entry->i_blocks = inode->i_blocks;
__entry->md_needed = md_needed;
__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
__entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks;
+ __entry->mode = inode->i_mode;
),

TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu md_needed %d reserved_data_blocks %d reserved_meta_blocks %d",
@@ -1102,24 +1102,24 @@ TRACE_EVENT(ext4_da_release_space,
__field( int, dev_major )
__field( int, dev_minor )
__field( ino_t, ino )
- __field( umode_t, mode )
__field( __u64, i_blocks )
__field( int, freed_blocks )
__field( int, reserved_data_blocks )
__field( int, reserved_meta_blocks )
__field( int, allocated_meta_blocks )
+ __field( umode_t, mode )
),

TP_fast_assign(
__entry->dev_major = MAJOR(inode->i_sb->s_dev);
__entry->dev_minor = MINOR(inode->i_sb->s_dev);
__entry->ino = inode->i_ino;
- __entry->mode = inode->i_mode;
__entry->i_blocks = inode->i_blocks;
__entry->freed_blocks = freed_blocks;
__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
__entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks;
__entry->allocated_meta_blocks = EXT4_I(inode)->i_allocated_meta_blocks;
+ __entry->mode = inode->i_mode;
),

TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu freed_blocks %d reserved_data_blocks %d reserved_meta_blocks %d allocated_meta_blocks %d",
--
1.7.3.1

2010-12-04 00:14:56

by David Sharp

Subject: [PATCH 08/15] ftrace: fix event alignment: mce:mce_record


Signed-off-by: David Sharp <[email protected]>
---
include/trace/events/mce.h | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 7eee778..4cbbcef 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -17,36 +17,36 @@ TRACE_EVENT(mce_record,
TP_STRUCT__entry(
__field( u64, mcgcap )
__field( u64, mcgstatus )
- __field( u8, bank )
__field( u64, status )
__field( u64, addr )
__field( u64, misc )
__field( u64, ip )
- __field( u8, cs )
__field( u64, tsc )
__field( u64, walltime )
__field( u32, cpu )
__field( u32, cpuid )
__field( u32, apicid )
__field( u32, socketid )
+ __field( u8, cs )
+ __field( u8, bank )
__field( u8, cpuvendor )
),

TP_fast_assign(
__entry->mcgcap = m->mcgcap;
__entry->mcgstatus = m->mcgstatus;
- __entry->bank = m->bank;
__entry->status = m->status;
__entry->addr = m->addr;
__entry->misc = m->misc;
__entry->ip = m->ip;
- __entry->cs = m->cs;
__entry->tsc = m->tsc;
__entry->walltime = m->time;
__entry->cpu = m->extcpu;
__entry->cpuid = m->cpuid;
__entry->apicid = m->apicid;
__entry->socketid = m->socketid;
+ __entry->cs = m->cs;
+ __entry->bank = m->bank;
__entry->cpuvendor = m->cpuvendor;
),

--
1.7.3.1

2010-12-04 00:14:55

by David Sharp

Subject: [PATCH 15/15] small_traces: Remove 8 bytes from trace_entry.

Remove flags, preempt_count, and lock_depth from trace_entry. These fields
were used only by the latency tracer output and are relatively unimportant,
while incurring high payload overhead.

Shrink pid to 16 bits. pid_max is usually 32768, so the potential for loss of
fidelity here is minimal, and it can be eliminated entirely by keeping pid_max
at or below 32768.
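
For reference, the per-event saving in the header (using the usual Linux ABI
sizes for char/short/int):

  before: 2 (type) + 1 (flags) + 1 (preempt_count) + 4 (pid) + 4 (lock_depth) = 12 bytes
  after:  2 (type) + 2 (pid)                                                  =  4 bytes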

Google-Bug-Id: 3224547

Signed-off-by: David Sharp <[email protected]>
---
include/linux/ftrace_event.h | 4 ++++
kernel/trace/trace.c | 7 ++++++-
kernel/trace/trace_events.c | 4 ++++
3 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 8beabb9..3a0bdd1 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -34,10 +34,14 @@ const char *ftrace_print_hex_seq(struct trace_seq *p,
*/
struct trace_entry {
unsigned short type;
+#ifdef CONFIG_SMALL_TRACES
+ short pid;
+#else
unsigned char flags;
unsigned char preempt_count;
int pid;
int lock_depth;
+#endif
};

#define FTRACE_MAX_EVENT \
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 177d4e5..73b6482 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1101,8 +1101,11 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
{
struct task_struct *tsk = current;

- entry->preempt_count = pc & 0xff;
+#ifdef CONFIG_SMALL_TRACES
+ entry->pid = (tsk) ? (typeof(entry->pid))tsk->pid : 0;
+#else
entry->pid = (tsk) ? tsk->pid : 0;
+ entry->preempt_count = pc & 0xff;
entry->lock_depth = (tsk) ? tsk->lock_depth : 0;
entry->flags =
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
@@ -1113,9 +1116,11 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) |
(need_resched() ? TRACE_FLAG_NEED_RESCHED : 0);
+#endif
}
EXPORT_SYMBOL_GPL(tracing_generic_entry_update);

+
struct ring_buffer_event *
trace_buffer_lock_reserve(struct ring_buffer *buffer,
int type,
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 0725eea..3684a3e 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -107,10 +107,14 @@ static int trace_define_common_fields(void)
struct trace_entry ent;

__common_field(unsigned short, type);
+#ifdef CONFIG_SMALL_TRACES
+ __common_field(short, pid);
+#else
__common_field(unsigned char, flags);
__common_field(unsigned char, preempt_count);
__common_field(int, pid);
__common_field(int, lock_depth);
+#endif

return ret;
}
--
1.7.3.1

2010-12-04 00:15:38

by David Sharp

Subject: [PATCH 14/15] small_traces: Remove trace output of large fields.

Google-Bug-Id: 3224547

Signed-off-by: David Sharp <[email protected]>
---
kernel/trace/trace.c | 6 ++++++
kernel/trace/trace_output.c | 18 ++++++++++++++++++
2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ed5c14f..177d4e5 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1747,6 +1747,7 @@ static void s_stop(struct seq_file *m, void *p)

static void print_lat_help_header(struct seq_file *m)
{
+#ifndef CONFIG_SMALL_TRACES
seq_puts(m, "# _------=> CPU# \n");
seq_puts(m, "# / _-----=> irqs-off \n");
seq_puts(m, "# | / _----=> need-resched \n");
@@ -1756,6 +1757,11 @@ static void print_lat_help_header(struct seq_file *m)
seq_puts(m, "# |||||/ delay \n");
seq_puts(m, "# cmd pid |||||| time | caller \n");
seq_puts(m, "# \\ / |||||| \\ | / \n");
+#else
+ seq_puts(m, "# CPU# delay \n");
+ seq_puts(m, "# cmd pid | time | caller \n");
+ seq_puts(m, "# \\ / | \\ | / \n");
+#endif
}

static void print_func_help_header(struct seq_file *m)
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 9e33b1f..d5d7575 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -531,6 +531,7 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
* Prints the generic fields of irqs off, in hard or softirq, preempt
* count and lock depth.
*/
+#ifndef CONFIG_SMALL_TRACES
int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
{
int ret;
@@ -566,6 +567,12 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)

return trace_seq_printf(s, "%d", entry->lock_depth);
}
+#else
+int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
+{
+ return 1;
+}
+#endif

static int
lat_print_generic(struct trace_seq *s, struct trace_entry *entry, int cpu)
@@ -628,6 +635,7 @@ int trace_print_lat_context(struct trace_iterator *iter)

trace_find_cmdline(entry->pid, comm);

+#ifndef CONFIG_SMALL_TRACES
ret = trace_seq_printf(s, "%16s %5d %3d %d %08x %08lx [%08llx]"
" %ld.%03ldms (+%ld.%03ldms): ", comm,
entry->pid, iter->cpu, entry->flags,
@@ -637,6 +645,16 @@ int trace_print_lat_context(struct trace_iterator *iter)
abs_usecs % USEC_PER_MSEC,
rel_usecs / USEC_PER_MSEC,
rel_usecs % USEC_PER_MSEC);
+#else
+ ret = trace_seq_printf(s, "%16s %5d %3d %08lx [%08llx]"
+ " %ld.%03ldms (+%ld.%03ldms): ", comm,
+ entry->pid, iter->cpu, iter->idx,
+ ns2usecs(iter->ts),
+ abs_usecs / USEC_PER_MSEC,
+ abs_usecs % USEC_PER_MSEC,
+ rel_usecs / USEC_PER_MSEC,
+ rel_usecs % USEC_PER_MSEC);
+#endif
} else {
ret = lat_print_generic(s, entry, iter->cpu);
if (ret)
--
1.7.3.1

2010-12-04 00:15:39

by David Sharp

Subject: [PATCH 12/15] trace_output.c: adjust conditional expression formatting.

Formatting change only to improve readability. No code changes except to
introduce intermediate variables.

Signed-off-by: David Sharp <[email protected]>
---
kernel/trace/trace_output.c | 26 +++++++++++++++-----------
1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 02272ba..9e33b1f 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -533,20 +533,24 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
*/
int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
{
- int hardirq, softirq;
int ret;
-
- hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
- softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
+ int hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
+ int softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
+
+ const char irqs_off =
+ (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
+ (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+ '.';
+ const char need_resched =
+ (entry->flags & TRACE_FLAG_NEED_RESCHED) ? 'N' : '.';
+ const char hardsoft_irq =
+ (hardirq && softirq) ? 'H' :
+ hardirq ? 'h' :
+ softirq ? 's' :
+ '.';

if (!trace_seq_printf(s, "%c%c%c",
- (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
- (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ?
- 'X' : '.',
- (entry->flags & TRACE_FLAG_NEED_RESCHED) ?
- 'N' : '.',
- (hardirq && softirq) ? 'H' :
- hardirq ? 'h' : softirq ? 's' : '.'))
+ irqs_off, need_resched, hardsoft_irq))
return 0;

if (entry->preempt_count)
--
1.7.3.1

2010-12-04 00:13:51

by David Sharp

Subject: [PATCH 03/15] ring_buffer: Align buffer_page struct allocations only to fit the flags.

buffer_page structs need to be aligned to 4-byte boundaries because the page
flags are stored in the two least-significant bits of the pointers in the page
list. Aligning to cache lines is sufficient, but doesn't seem to be necessary.
Reducing the alignment to only 4 bytes may improve cache efficiency.

Testing with Autotest's tracing_microbenchmark showed no significant change in
overhead from this patch.
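
For reference, a simplified sketch of why only the two low bits need to be
free (modelled on ring_buffer.c, but illustrative rather than part of this
patch):

#define RB_FLAG_MASK    3UL     /* holds RB_PAGE_HEAD / RB_PAGE_UPDATE */

/* The list pointer and the page flags share one word, so a buffer_page
 * address only needs its two least-significant bits to be clear. */
static struct list_head *rb_list_head(struct list_head *list)
{
        unsigned long val = (unsigned long)list;

        return (struct list_head *)(val & ~RB_FLAG_MASK);
}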

Signed-off-by: David Sharp <[email protected]>
---
kernel/trace/ring_buffer.c | 20 ++++++++++++--------
1 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8ef7cc4..957a8b8 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,14 @@ EXPORT_SYMBOL_GPL(ring_buffer_normalize_time_stamp);
#define RB_PAGE_HEAD 1UL
#define RB_PAGE_UPDATE 2UL

-
#define RB_FLAG_MASK 3UL
+#define RB_PAGE_ALIGNMENT (RB_FLAG_MASK+1)
+
+/* Ensure alignment of struct buffer_page */
+static __attribute__((unused)) void check_buffer_page_alignment(void)
+{
+ BUILD_BUG_ON(__alignof__(struct buffer_page) % RB_PAGE_ALIGNMENT != 0);
+}

/* PAGE_MOVED is not part of the mask */
#define RB_PAGE_MOVED 4UL
@@ -1004,8 +1010,8 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
WARN_ON(!nr_pages);

for (i = 0; i < nr_pages; i++) {
- bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
- GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
+ bpage = kzalloc_node(sizeof(*bpage), GFP_KERNEL,
+ cpu_to_node(cpu_buffer->cpu));
if (!bpage)
goto free_pages;

@@ -1059,8 +1065,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
lockdep_set_class(&cpu_buffer->reader_lock, buffer->reader_lock_key);
cpu_buffer->lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;

- bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
- GFP_KERNEL, cpu_to_node(cpu));
+ bpage = kzalloc_node(sizeof(*bpage), GFP_KERNEL, cpu_to_node(cpu));
if (!bpage)
goto fail_free_buffer;

@@ -1375,9 +1380,8 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)

for_each_buffer_cpu(buffer, cpu) {
for (i = 0; i < new_pages; i++) {
- bpage = kzalloc_node(ALIGN(sizeof(*bpage),
- cache_line_size()),
- GFP_KERNEL, cpu_to_node(cpu));
+ bpage = kzalloc_node(sizeof(*bpage), GFP_KERNEL,
+ cpu_to_node(cpu));
if (!bpage)
goto free_pages;
list_add(&bpage->list, &pages);
--
1.7.3.1

2010-12-04 00:16:17

by David Sharp

Subject: [PATCH 09/15] ftrace: fix event alignment: skb:kfree_skb


Signed-off-by: David Sharp <[email protected]>
---
include/trace/events/skb.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 75ce9d5..aa4a56a 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -19,16 +19,16 @@ TRACE_EVENT(kfree_skb,

TP_STRUCT__entry(
__field( void *, skbaddr )
- __field( unsigned short, protocol )
__field( void *, location )
+ __field( unsigned short, protocol )
),

TP_fast_assign(
__entry->skbaddr = skb;
+ __entry->location = location;
if (skb) {
__entry->protocol = ntohs(skb->protocol);
}
- __entry->location = location;
),

TP_printk("skbaddr=%p protocol=%u location=%p",
--
1.7.3.1

2010-12-04 00:15:59

by David Sharp

Subject: [PATCH 06/15] ftrace: fix event alignment: module:module_request


Signed-off-by: David Sharp <[email protected]>
---
include/trace/events/module.h | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/module.h b/include/trace/events/module.h
index c7bb2f0..8154f5f 100644
--- a/include/trace/events/module.h
+++ b/include/trace/events/module.h
@@ -98,14 +98,14 @@ TRACE_EVENT(module_request,
TP_ARGS(name, wait, ip),

TP_STRUCT__entry(
- __field( bool, wait )
__field( unsigned long, ip )
+ __field( bool, wait )
__string( name, name )
),

TP_fast_assign(
- __entry->wait = wait;
__entry->ip = ip;
+ __entry->wait = wait;
__assign_str(name, name);
),

@@ -119,4 +119,3 @@ TRACE_EVENT(module_request,

/* This part must be outside protection */
#include <trace/define_trace.h>
-
--
1.7.3.1

2010-12-04 00:16:34

by David Sharp

Subject: [PATCH 05/15] ftrace: fix event alignment: ftrace:context_switch and ftrace:wakeup


Signed-off-by: David Sharp <[email protected]>
---
kernel/trace/trace_entries.h | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index e3dfeca..81ab9c3 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -109,12 +109,12 @@ FTRACE_ENTRY(funcgraph_exit, ftrace_graph_ret_entry,
*/
#define FTRACE_CTX_FIELDS \
__field( unsigned int, prev_pid ) \
+ __field( unsigned int, next_pid ) \
+ __field( unsigned int, next_cpu ) \
__field( unsigned char, prev_prio ) \
__field( unsigned char, prev_state ) \
- __field( unsigned int, next_pid ) \
__field( unsigned char, next_prio ) \
- __field( unsigned char, next_state ) \
- __field( unsigned int, next_cpu )
+ __field( unsigned char, next_state )

FTRACE_ENTRY(context_switch, ctx_switch_entry,

--
1.7.3.1

2010-12-04 01:43:32

by Steven Rostedt

Subject: Re: [PATCH 03/15] ring_buffer: Align buffer_page struct allocations only to fit the flags.

On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> buffer_page structs need to be aligned to 4-byte boundaries because the page
> flags are stored in the two least-significant bits of the pointers in the page
> list. Aligning to cache lines is sufficient, but doesn't seem to be necessary.
> Reducing the alignment to only 4 bytes may improve cache efficiency.

The reason for the cache line boundaries is that the bpages are per cpu,
and if they are allocated next to some variable that gets changed
globally, or even next to another bpage that is used for another cpu's
buffer, then I would expect we would get some cache line bouncing.

Perhaps we should make bpage have its own slab allocation
(kmem_cache_alloc). And perhaps even try to keep all bpages that share
the same cache line on the same cpu buffer.
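
Something like this, perhaps (untested sketch, names made up):

        static struct kmem_cache *bpage_cachep;

        bpage_cachep = kmem_cache_create("buffer_page",
                                         sizeof(struct buffer_page),
                                         cache_line_size(), 0, NULL);
        ...
        bpage = kmem_cache_alloc_node(bpage_cachep,
                                      GFP_KERNEL | __GFP_ZERO,
                                      cpu_to_node(cpu));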

-- Steve

2010-12-04 01:47:48

by Steven Rostedt

Subject: Re: [PATCH 06/15] ftrace: fix event alignment: module:module_request

[ Ccing Li for an Acked-by. He wrote this event ]

-- Steve


On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> Signed-off-by: David Sharp <[email protected]>
> ---
> include/trace/events/module.h | 5 ++---
> 1 files changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/include/trace/events/module.h b/include/trace/events/module.h
> index c7bb2f0..8154f5f 100644
> --- a/include/trace/events/module.h
> +++ b/include/trace/events/module.h
> @@ -98,14 +98,14 @@ TRACE_EVENT(module_request,
> TP_ARGS(name, wait, ip),
>
> TP_STRUCT__entry(
> - __field( bool, wait )
> __field( unsigned long, ip )
> + __field( bool, wait )
> __string( name, name )
> ),
>
> TP_fast_assign(
> - __entry->wait = wait;
> __entry->ip = ip;
> + __entry->wait = wait;
> __assign_str(name, name);
> ),
>
> @@ -119,4 +119,3 @@ TRACE_EVENT(module_request,
>
> /* This part must be outside protection */
> #include <trace/define_trace.h>
> -

2010-12-04 01:49:13

by Steven Rostedt

Subject: Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall

[ Ccing Marcelo and Avi for Acked-by's ]

-- Steve


On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> Signed-off-by: David Sharp <[email protected]>
> ---
> arch/x86/kvm/trace.h | 8 ++++----
> 1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index a6544b8..ab41fb0 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -62,21 +62,21 @@ TRACE_EVENT(kvm_hv_hypercall,
> TP_ARGS(code, fast, rep_cnt, rep_idx, ingpa, outgpa),
>
> TP_STRUCT__entry(
> - __field( __u16, code )
> - __field( bool, fast )
> __field( __u16, rep_cnt )
> __field( __u16, rep_idx )
> __field( __u64, ingpa )
> __field( __u64, outgpa )
> + __field( __u16, code )
> + __field( bool, fast )
> ),
>
> TP_fast_assign(
> - __entry->code = code;
> - __entry->fast = fast;
> __entry->rep_cnt = rep_cnt;
> __entry->rep_idx = rep_idx;
> __entry->ingpa = ingpa;
> __entry->outgpa = outgpa;
> + __entry->code = code;
> + __entry->fast = fast;
> ),
>
> TP_printk("code 0x%x %s cnt 0x%x idx 0x%x in 0x%llx out 0x%llx",

2010-12-04 01:50:24

by Steven Rostedt

Subject: Re: [PATCH 08/15] ftrace: fix event alignment: mce:mce_record

[ Ccing Hidetoshi and Frederic for Acked-by's ]

-- Steve


On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> Signed-off-by: David Sharp <[email protected]>
> ---
> include/trace/events/mce.h | 8 ++++----
> 1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
> index 7eee778..4cbbcef 100644
> --- a/include/trace/events/mce.h
> +++ b/include/trace/events/mce.h
> @@ -17,36 +17,36 @@ TRACE_EVENT(mce_record,
> TP_STRUCT__entry(
> __field( u64, mcgcap )
> __field( u64, mcgstatus )
> - __field( u8, bank )
> __field( u64, status )
> __field( u64, addr )
> __field( u64, misc )
> __field( u64, ip )
> - __field( u8, cs )
> __field( u64, tsc )
> __field( u64, walltime )
> __field( u32, cpu )
> __field( u32, cpuid )
> __field( u32, apicid )
> __field( u32, socketid )
> + __field( u8, cs )
> + __field( u8, bank )
> __field( u8, cpuvendor )
> ),
>
> TP_fast_assign(
> __entry->mcgcap = m->mcgcap;
> __entry->mcgstatus = m->mcgstatus;
> - __entry->bank = m->bank;
> __entry->status = m->status;
> __entry->addr = m->addr;
> __entry->misc = m->misc;
> __entry->ip = m->ip;
> - __entry->cs = m->cs;
> __entry->tsc = m->tsc;
> __entry->walltime = m->time;
> __entry->cpu = m->extcpu;
> __entry->cpuid = m->cpuid;
> __entry->apicid = m->apicid;
> __entry->socketid = m->socketid;
> + __entry->cs = m->cs;
> + __entry->bank = m->bank;
> __entry->cpuvendor = m->cpuvendor;
> ),
>

2010-12-04 01:52:08

by Steven Rostedt

Subject: Re: [PATCH 09/15] ftrace: fix event alignment: skb:kfree_skb

[ Ccing Neil for Acked-by ]

-- Steve


On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> Signed-off-by: David Sharp <[email protected]>
> ---
> include/trace/events/skb.h | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
> index 75ce9d5..aa4a56a 100644
> --- a/include/trace/events/skb.h
> +++ b/include/trace/events/skb.h
> @@ -19,16 +19,16 @@ TRACE_EVENT(kfree_skb,
>
> TP_STRUCT__entry(
> __field( void *, skbaddr )
> - __field( unsigned short, protocol )
> __field( void *, location )
> + __field( unsigned short, protocol )
> ),
>
> TP_fast_assign(
> __entry->skbaddr = skb;
> + __entry->location = location;
> if (skb) {
> __entry->protocol = ntohs(skb->protocol);
> }
> - __entry->location = location;
> ),
>
> TP_printk("skbaddr=%p protocol=%u location=%p",

2010-12-04 01:53:01

by Steven Rostedt

Subject: Re: [PATCH 10/15] ftrace: fix event alignment: jbd2:*

[ Ccing Ted for Acked-by ]

-- Steve


On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> Signed-off-by: David Sharp <[email protected]>
> ---
> include/trace/events/jbd2.h | 8 ++++----
> 1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
> index 7447ea9..2f9f362 100644
> --- a/include/trace/events/jbd2.h
> +++ b/include/trace/events/jbd2.h
> @@ -41,15 +41,15 @@ DECLARE_EVENT_CLASS(jbd2_commit,
> TP_STRUCT__entry(
> __field( int, dev_major )
> __field( int, dev_minor )
> - __field( char, sync_commit )
> __field( int, transaction )
> + __field( char, sync_commit )
> ),
>
> TP_fast_assign(
> __entry->dev_major = MAJOR(journal->j_fs_dev->bd_dev);
> __entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev);
> - __entry->sync_commit = commit_transaction->t_synchronous_commit;
> __entry->transaction = commit_transaction->t_tid;
> + __entry->sync_commit = commit_transaction->t_synchronous_commit;
> ),
>
> TP_printk("dev %d,%d transaction %d sync %d",
> @@ -93,17 +93,17 @@ TRACE_EVENT(jbd2_end_commit,
> TP_STRUCT__entry(
> __field( int, dev_major )
> __field( int, dev_minor )
> - __field( char, sync_commit )
> __field( int, transaction )
> __field( int, head )
> + __field( char, sync_commit )
> ),
>
> TP_fast_assign(
> __entry->dev_major = MAJOR(journal->j_fs_dev->bd_dev);
> __entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev);
> - __entry->sync_commit = commit_transaction->t_synchronous_commit;
> __entry->transaction = commit_transaction->t_tid;
> __entry->head = journal->j_tail_sequence;
> + __entry->sync_commit = commit_transaction->t_synchronous_commit;
> ),
>
> TP_printk("dev %d,%d transaction %d sync %d head %d",

2010-12-04 01:53:30

by Steven Rostedt

Subject: Re: [PATCH 11/15] ftrace: fix event alignment: ext4:*

[ Ccing Ted for Acked-by ]

-- Steve


On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> Signed-off-by: David Sharp <[email protected]>
> ---
> include/trace/events/ext4.h | 70 +++++++++++++++++++++---------------------
> 1 files changed, 35 insertions(+), 35 deletions(-)
>
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index e5e345f..9887cdb 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -23,21 +23,21 @@ TRACE_EVENT(ext4_free_inode,
> TP_STRUCT__entry(
> __field( int, dev_major )
> __field( int, dev_minor )
> + __field( blkcnt_t, blocks )
> __field( ino_t, ino )
> - __field( umode_t, mode )
> __field( uid_t, uid )
> __field( gid_t, gid )
> - __field( blkcnt_t, blocks )
> + __field( umode_t, mode )
> ),
>
> TP_fast_assign(
> __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> + __entry->blocks = inode->i_blocks;
> __entry->ino = inode->i_ino;
> - __entry->mode = inode->i_mode;
> __entry->uid = inode->i_uid;
> __entry->gid = inode->i_gid;
> - __entry->blocks = inode->i_blocks;
> + __entry->mode = inode->i_mode;
> ),
>
> TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u blocks %llu",
> @@ -342,10 +342,10 @@ TRACE_EVENT(ext4_da_writepages,
> __field( long, pages_skipped )
> __field( loff_t, range_start )
> __field( loff_t, range_end )
> + __field( pgoff_t, writeback_index )
> __field( char, for_kupdate )
> __field( char, for_reclaim )
> __field( char, range_cyclic )
> - __field( pgoff_t, writeback_index )
> ),
>
> TP_fast_assign(
> @@ -356,10 +356,10 @@ TRACE_EVENT(ext4_da_writepages,
> __entry->pages_skipped = wbc->pages_skipped;
> __entry->range_start = wbc->range_start;
> __entry->range_end = wbc->range_end;
> + __entry->writeback_index = inode->i_mapping->writeback_index;
> __entry->for_kupdate = wbc->for_kupdate;
> __entry->for_reclaim = wbc->for_reclaim;
> __entry->range_cyclic = wbc->range_cyclic;
> - __entry->writeback_index = inode->i_mapping->writeback_index;
> ),
>
> TP_printk("dev %d,%d ino %lu nr_to_write %ld pages_skipped %ld "
> @@ -425,8 +425,8 @@ TRACE_EVENT(ext4_da_writepages_result,
> __field( int, ret )
> __field( int, pages_written )
> __field( long, pages_skipped )
> - __field( char, more_io )
> __field( pgoff_t, writeback_index )
> + __field( char, more_io )
> ),
>
> TP_fast_assign(
> @@ -436,8 +436,8 @@ TRACE_EVENT(ext4_da_writepages_result,
> __entry->ret = ret;
> __entry->pages_written = pages_written;
> __entry->pages_skipped = wbc->pages_skipped;
> - __entry->more_io = wbc->more_io;
> __entry->writeback_index = inode->i_mapping->writeback_index;
> + __entry->more_io = wbc->more_io;
> ),
>
> TP_printk("dev %d,%d ino %lu ret %d pages_written %d pages_skipped %ld more_io %d writeback_index %lu",
> @@ -485,8 +485,8 @@ DECLARE_EVENT_CLASS(ext4__mb_new_pa,
> __field( int, dev_minor )
> __field( ino_t, ino )
> __field( __u64, pa_pstart )
> - __field( __u32, pa_len )
> __field( __u64, pa_lstart )
> + __field( __u32, pa_len )
>
> ),
>
> @@ -495,8 +495,8 @@ DECLARE_EVENT_CLASS(ext4__mb_new_pa,
> __entry->dev_minor = MINOR(ac->ac_sb->s_dev);
> __entry->ino = ac->ac_inode->i_ino;
> __entry->pa_pstart = pa->pa_pstart;
> - __entry->pa_len = pa->pa_len;
> __entry->pa_lstart = pa->pa_lstart;
> + __entry->pa_len = pa->pa_len;
> ),
>
> TP_printk("dev %d,%d ino %lu pstart %llu len %u lstart %llu",
> @@ -724,20 +724,20 @@ TRACE_EVENT(ext4_free_blocks,
> __field( int, dev_major )
> __field( int, dev_minor )
> __field( ino_t, ino )
> - __field( umode_t, mode )
> __field( __u64, block )
> __field( unsigned long, count )
> __field( int, flags )
> + __field( umode_t, mode )
> ),
>
> TP_fast_assign(
> __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> __entry->ino = inode->i_ino;
> - __entry->mode = inode->i_mode;
> __entry->block = block;
> __entry->count = count;
> __entry->flags = flags;
> + __entry->mode = inode->i_mode;
> ),
>
> TP_printk("dev %d,%d ino %lu mode 0%o block %llu count %lu flags %d",
> @@ -766,8 +766,8 @@ TRACE_EVENT(ext4_sync_file,
> __entry->dev_major = MAJOR(dentry->d_inode->i_sb->s_dev);
> __entry->dev_minor = MINOR(dentry->d_inode->i_sb->s_dev);
> __entry->ino = dentry->d_inode->i_ino;
> - __entry->datasync = datasync;
> __entry->parent = dentry->d_parent->d_inode->i_ino;
> + __entry->datasync = datasync;
> ),
>
> TP_printk("dev %d,%d ino %ld parent %ld datasync %d ",
> @@ -834,12 +834,6 @@ TRACE_EVENT(ext4_mballoc_alloc,
> __field( int, dev_major )
> __field( int, dev_minor )
> __field( ino_t, ino )
> - __field( __u16, found )
> - __field( __u16, groups )
> - __field( __u16, buddy )
> - __field( __u16, flags )
> - __field( __u16, tail )
> - __field( __u8, cr )
> __field( __u32, orig_logical )
> __field( int, orig_start )
> __field( __u32, orig_group )
> @@ -852,18 +846,18 @@ TRACE_EVENT(ext4_mballoc_alloc,
> __field( int, result_start )
> __field( __u32, result_group )
> __field( int, result_len )
> + __field( __u16, found )
> + __field( __u16, groups )
> + __field( __u16, buddy )
> + __field( __u16, flags )
> + __field( __u16, tail )
> + __field( __u8, cr )
> ),
>
> TP_fast_assign(
> __entry->dev_major = MAJOR(ac->ac_inode->i_sb->s_dev);
> __entry->dev_minor = MINOR(ac->ac_inode->i_sb->s_dev);
> __entry->ino = ac->ac_inode->i_ino;
> - __entry->found = ac->ac_found;
> - __entry->flags = ac->ac_flags;
> - __entry->groups = ac->ac_groups_scanned;
> - __entry->buddy = ac->ac_buddy;
> - __entry->tail = ac->ac_tail;
> - __entry->cr = ac->ac_criteria;
> __entry->orig_logical = ac->ac_o_ex.fe_logical;
> __entry->orig_start = ac->ac_o_ex.fe_start;
> __entry->orig_group = ac->ac_o_ex.fe_group;
> @@ -876,6 +870,12 @@ TRACE_EVENT(ext4_mballoc_alloc,
> __entry->result_start = ac->ac_f_ex.fe_start;
> __entry->result_group = ac->ac_f_ex.fe_group;
> __entry->result_len = ac->ac_f_ex.fe_len;
> + __entry->found = ac->ac_found;
> + __entry->groups = ac->ac_groups_scanned;
> + __entry->buddy = ac->ac_buddy;
> + __entry->flags = ac->ac_flags;
> + __entry->tail = ac->ac_tail;
> + __entry->cr = ac->ac_criteria;
> ),
>
> TP_printk("dev %d,%d inode %lu orig %u/%d/%u@%u goal %u/%d/%u@%u "
> @@ -1001,18 +1001,18 @@ TRACE_EVENT(ext4_forget,
> __field( int, dev_major )
> __field( int, dev_minor )
> __field( ino_t, ino )
> - __field( umode_t, mode )
> - __field( int, is_metadata )
> __field( __u64, block )
> + __field( int, is_metadata )
> + __field( umode_t, mode )
> ),
>
> TP_fast_assign(
> __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> __entry->ino = inode->i_ino;
> - __entry->mode = inode->i_mode;
> - __entry->is_metadata = is_metadata;
> __entry->block = block;
> + __entry->is_metadata = is_metadata;
> + __entry->mode = inode->i_mode;
> ),
>
> TP_printk("dev %d,%d ino %lu mode 0%o is_metadata %d block %llu",
> @@ -1030,24 +1030,24 @@ TRACE_EVENT(ext4_da_update_reserve_space,
> __field( int, dev_major )
> __field( int, dev_minor )
> __field( ino_t, ino )
> - __field( umode_t, mode )
> __field( __u64, i_blocks )
> __field( int, used_blocks )
> __field( int, reserved_data_blocks )
> __field( int, reserved_meta_blocks )
> __field( int, allocated_meta_blocks )
> + __field( umode_t, mode )
> ),
>
> TP_fast_assign(
> __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> __entry->ino = inode->i_ino;
> - __entry->mode = inode->i_mode;
> __entry->i_blocks = inode->i_blocks;
> __entry->used_blocks = used_blocks;
> __entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
> __entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks;
> __entry->allocated_meta_blocks = EXT4_I(inode)->i_allocated_meta_blocks;
> + __entry->mode = inode->i_mode;
> ),
>
> TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu used_blocks %d reserved_data_blocks %d reserved_meta_blocks %d allocated_meta_blocks %d",
> @@ -1067,22 +1067,22 @@ TRACE_EVENT(ext4_da_reserve_space,
> __field( int, dev_major )
> __field( int, dev_minor )
> __field( ino_t, ino )
> - __field( umode_t, mode )
> __field( __u64, i_blocks )
> __field( int, md_needed )
> __field( int, reserved_data_blocks )
> __field( int, reserved_meta_blocks )
> + __field( umode_t, mode )
> ),
>
> TP_fast_assign(
> __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> __entry->ino = inode->i_ino;
> - __entry->mode = inode->i_mode;
> __entry->i_blocks = inode->i_blocks;
> __entry->md_needed = md_needed;
> __entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
> __entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks;
> + __entry->mode = inode->i_mode;
> ),
>
> TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu md_needed %d reserved_data_blocks %d reserved_meta_blocks %d",
> @@ -1102,24 +1102,24 @@ TRACE_EVENT(ext4_da_release_space,
> __field( int, dev_major )
> __field( int, dev_minor )
> __field( ino_t, ino )
> - __field( umode_t, mode )
> __field( __u64, i_blocks )
> __field( int, freed_blocks )
> __field( int, reserved_data_blocks )
> __field( int, reserved_meta_blocks )
> __field( int, allocated_meta_blocks )
> + __field( umode_t, mode )
> ),
>
> TP_fast_assign(
> __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> __entry->ino = inode->i_ino;
> - __entry->mode = inode->i_mode;
> __entry->i_blocks = inode->i_blocks;
> __entry->freed_blocks = freed_blocks;
> __entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
> __entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks;
> __entry->allocated_meta_blocks = EXT4_I(inode)->i_allocated_meta_blocks;
> + __entry->mode = inode->i_mode;
> ),
>
> TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu freed_blocks %d reserved_data_blocks %d reserved_meta_blocks %d allocated_meta_blocks %d",

2010-12-04 01:56:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 13/15] small_traces: Add config option to shrink trace events.

[ Added Frederic on Cc ]

I would rather have this as a run-time option than a compile-time one.

You could use two different trace event structures: the normal one and a
"compact" one. Add a flag to "trace_options", and when it is set the traces
are compact.
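
i.e. something along the lines of (the option name is just an example):

# echo compact > /sys/kernel/debug/tracing/trace_options
# echo nocompact > /sys/kernel/debug/tracing/trace_options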

-- Steve


On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> This option will remove several fields from trace events to reduce the
> payload size. Under heavy load, events can be produced faster than they
> can possibly be written to disk or sent on the network. Shrinking the
> payload size will enable fitting more traces in the ring buffer, which
> therefore enables tracing for longer periods, at the cost of losing some
> relatively unimportant details.
>
> Google-Bug-Id: 3224547
>
> Signed-off-by: David Sharp <[email protected]>
> ---
> kernel/trace/Kconfig | 12 ++++++++++++
> 1 files changed, 12 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index ea37e2f..9c24fe5 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -403,6 +403,18 @@ config FUNCTION_PROFILER
>
> If in doubt, say N.
>
> +config SMALL_TRACES
> + bool "Shrink trace events"
> + default n
> + help
> + Minimize the size of trace events. Omits fields from trace event
> + header like preempt_count, and lock_depth. Chooses smaller
> + alternative event structures. Shrinking the payload size enables
> + fitting more traces in the ring buffer, which enables tracing for
> + longer periods when the buffer can not be read as fast as it is
> + filled, at the cost of losing some details.
> +
> +
> config FTRACE_MCOUNT_RECORD
> def_bool y
> depends on DYNAMIC_FTRACE

2010-12-04 01:57:12

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 01/15] tracing: Add a 'buffer_overwrite' debugfs file

On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> From: Jiaying Zhang <[email protected]>
>
> Add a "buffer_overwrite" debugfs file for ftrace to control whether the buffer

Instead of adding a new file, make this another "trace_option".

Thanks,

-- Steve


> should be overwritten on overflow or not. The default remains to overwrite old
> events when the buffer is full. This patch adds the option to instead discard
> newest events when the buffer is full. This is useful to get a snapshot of
> traces just after enabling traces. Dropping the current event is also a simpler
> code path.
>
> Signed-off-by: David Sharp <[email protected]>
> ---
> Documentation/trace/ftrace.txt | 6 ++++
> include/linux/ring_buffer.h | 2 +
> kernel/trace/ring_buffer.c | 11 +++++++
> kernel/trace/trace.c | 59 +++++++++++++++++++++++++++++++++++++---
> 4 files changed, 74 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
> index 557c1ed..9237da3 100644
> --- a/Documentation/trace/ftrace.txt
> +++ b/Documentation/trace/ftrace.txt
> @@ -138,6 +138,12 @@ of ftrace. Here is a list of some of the key files:
> This can only be updated when the current_tracer
> is set to "nop".
>
> + buffer_overwrite:
> +
> + This controls what happens when the trace buffer is full.
> + If "1" (default), the oldest events are discarded and
> + overwritten. If "0", then the newest events are discarded.
> +
> tracing_cpumask:
>
> This is a mask that lets the user only trace
> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> index 8d3a248..ab38ac8 100644
> --- a/include/linux/ring_buffer.h
> +++ b/include/linux/ring_buffer.h
> @@ -100,6 +100,8 @@ void ring_buffer_free(struct ring_buffer *buffer);
>
> int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
>
> +void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
> +
> struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer,
> unsigned long length);
> int ring_buffer_unlock_commit(struct ring_buffer *buffer,
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 9ed509a..3207147 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1429,6 +1429,17 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
> }
> EXPORT_SYMBOL_GPL(ring_buffer_resize);
>
> +void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val)
> +{
> + mutex_lock(&buffer->mutex);
> + if (val)
> + buffer->flags |= RB_FL_OVERWRITE;
> + else
> + buffer->flags &= ~RB_FL_OVERWRITE;
> + mutex_unlock(&buffer->mutex);
> +}
> +EXPORT_SYMBOL_GPL(ring_buffer_change_overwrite);
> +
> static inline void *
> __rb_data_page_index(struct buffer_data_page *bpage, unsigned index)
> {
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index c380612..ed5c14f 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -41,8 +41,6 @@
> #include "trace.h"
> #include "trace_output.h"
>
> -#define TRACE_BUFFER_FLAGS (RB_FL_OVERWRITE)
> -
> /*
> * On boot up, the ring buffer is set to the minimum size, so that
> * we do not waste memory on systems that are not using tracing.
> @@ -241,6 +239,9 @@ int tracing_is_enabled(void)
>
> static unsigned long trace_buf_size = TRACE_BUF_SIZE_DEFAULT;
>
> +/* whether the trace buffer should be overwritten on overflow or not. */
> +static enum ring_buffer_flags trace_buffer_flags = RB_FL_OVERWRITE;
> +
> /* trace_types holds a link list of available tracers. */
> static struct tracer *trace_types __read_mostly;
>
> @@ -3466,6 +3467,47 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
> return cnt;
> }
>
> +static ssize_t
> +tracing_overwrite_read(struct file *filp, char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + char buf[64];
> + int r;
> + r = snprintf(buf, 64, "%u\n", trace_buffer_flags);
> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
> +}
> +
> +static ssize_t
> +tracing_overwrite_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + unsigned long val;
> + char buf[64];
> + int ret;
> +
> + if (cnt >= sizeof(buf))
> + return -EINVAL;
> +
> + if (copy_from_user(&buf, ubuf, cnt))
> + return -EFAULT;
> +
> + buf[cnt] = 0;
> +
> + ret = strict_strtoul(buf, 10, &val);
> + if (ret < 0)
> + return ret;
> + *ppos += cnt;
> +
> + if (val != 0 && val != 1)
> + return -EINVAL;
> +
> + if (trace_buffer_flags != val) {
> + trace_buffer_flags = val;
> + ring_buffer_change_overwrite(global_trace.buffer, val);
> + }
> + return cnt;
> +}
> +
> static int mark_printk(const char *fmt, ...)
> {
> int ret;
> @@ -3611,6 +3653,12 @@ static const struct file_operations tracing_entries_fops = {
> .llseek = generic_file_llseek,
> };
>
> +static const struct file_operations tracing_overwrite_fops = {
> + .open = tracing_open_generic,
> + .read = tracing_overwrite_read,
> + .write = tracing_overwrite_write,
> +};
> +
> static const struct file_operations tracing_mark_fops = {
> .open = tracing_open_generic,
> .write = tracing_mark_write,
> @@ -4336,6 +4384,9 @@ static __init int tracer_init_debugfs(void)
> trace_create_file("buffer_size_kb", 0644, d_tracer,
> &global_trace, &tracing_entries_fops);
>
> + trace_create_file("buffer_overwrite", 0644, d_tracer,
> + &global_trace, &tracing_overwrite_fops);
> +
> trace_create_file("trace_marker", 0220, d_tracer,
> NULL, &tracing_mark_fops);
>
> @@ -4565,7 +4616,7 @@ __init static int tracer_alloc_buffers(void)
>
> /* TODO: make the number of buffers hot pluggable with CPUS */
> global_trace.buffer = ring_buffer_alloc(ring_buf_size,
> - TRACE_BUFFER_FLAGS);
> + trace_buffer_flags);
> if (!global_trace.buffer) {
> printk(KERN_ERR "tracer: failed to allocate ring buffer!\n");
> WARN_ON(1);
> @@ -4575,7 +4626,7 @@ __init static int tracer_alloc_buffers(void)
>
>
> #ifdef CONFIG_TRACER_MAX_TRACE
> - max_tr.buffer = ring_buffer_alloc(1, TRACE_BUFFER_FLAGS);
> + max_tr.buffer = ring_buffer_alloc(1, trace_buffer_flags);
> if (!max_tr.buffer) {
> printk(KERN_ERR "tracer: failed to allocate max ring buffer!\n");
> WARN_ON(1);

2010-12-04 02:34:23

by David Sharp

[permalink] [raw]
Subject: Re: [PATCH 13/15] small_traces: Add config option to shrink trace events.

I considered that, and I generally think it's a good idea. However, I
also want to use this switch to shrink individual tracepoint event
structures.

eg: sched switch is a high frequency event and it is 68 bytes (60
after these patches)

Can you suggest a syntax for TRACE_EVENT, DECLARE_EVENT_CLASS, etc,
that could express the two versions and produce the right code?

I'm worried about adding even further complexity to the TRACE_EVENT
macros. I could add TRACE_EVENT_SMALL that takes two versions of
TP_STRUCT__entry, TP_fast_assign, and TP_printk each, but then this
will need to be permuted with your TP_CONDITIONAL patches as well.

Thanks,
d#

On Fri, Dec 3, 2010 at 5:56 PM, Steven Rostedt <[email protected]> wrote:
> [ Added Frederic on Cc ]
>
> I rather have this as a run time option than a compile time.
>
> You could use two different trace event structures. The normal one and a
> "compact" one. Add to the "trace_options" and when it is set the traces
> are compact.
>
> -- Steve
>
>
> On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
>> This option will remove several fields from trace events to reduce the
>> payload size. Under heavy load, events can be produced faster than they
>> can possibly be written to disk or sent on the network. Shrinking the
>> payload size will enable fitting more traces in the ring buffer, which
>> therefore enables tracing for longer periods, at the cost of losing some
>> relatively unimportant details.
>>
>> Google-Bug-Id: 3224547
>>
>> Signed-off-by: David Sharp <[email protected]>
>> ---
>>  kernel/trace/Kconfig |   12 ++++++++++++
>>  1 files changed, 12 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
>> index ea37e2f..9c24fe5 100644
>> --- a/kernel/trace/Kconfig
>> +++ b/kernel/trace/Kconfig
>> @@ -403,6 +403,18 @@ config FUNCTION_PROFILER
>>
>>         If in doubt, say N.
>>
>> +config SMALL_TRACES
>> +     bool "Shrink trace events"
>> +     default n
>> +     help
>> +       Minimize the size of trace events. Omits fields from trace event
>> +       header like preempt_count, and lock_depth. Chooses smaller
>> +       alternative event structures. Shrinking the payload size enables
>> +       fitting more traces in the ring buffer, which enables tracing for
>> +       longer periods when the buffer can not be read as fast as it is
>> +       filled, at the cost of losing some details.
>> +
>> +
>>  config FTRACE_MCOUNT_RECORD
>>       def_bool y
>>       depends on DYNAMIC_FTRACE
>
>
>

2010-12-04 02:54:16

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 13/15] small_traces: Add config option to shrink trace events.

On Fri, 2010-12-03 at 18:33 -0800, David Sharp wrote:
> I considered that, and I generally think it's a good idea. However, I
> also want to use this switch to shrink individual tracepoint event
> structures.
>
> eg: sched switch is a high frequency event and it is 68 bytes (60
> after these patches)
>
> Can you suggest a syntax for TRACE_EVENT, DECLARE_EVENT_CLASS, etc,
> that could express the two versions and produce the right code?
>
> I'm worried about adding even further complexity to the TRACE_EVENT
> macros. I could add TRACE_EVENT_SMALL that takes two versions of
> TP_STRUCT__entry, TP_fast_assign, and TP_printk each, but then this
> will need to be permuted with your TP_CONDITIONAL patches as well.

I would not touch the TRACE_EVENT() structures. They are there as is and
I would not think about changing them. Something like that would never
make it into mainline.

Now what you can do, is to make your own events based off of the same
tracepoints. For example, the TRACE_EVENT(sched_switch...) has in
sched.c:

trace_sched_switch(prev, next);


You could even write a module that does something like this:

register_trace_sched_switch(probe_sched_switch, mydata);



void probe_sched_switch(void *mydata,
			struct task_struct *prev,
			struct task_struct *next)
{
	struct ring_buffer *buffer;
	struct ring_buffer_event *event;
	struct myentry *entry;

	event = trace_current_buffer_lock_reserve(buffer,
						  mytype, sizeof(*entry),
						  0, 0);

	if (!event)
		return;

	entry = ring_buffer_event_data(event);

	entry->myfield = prev->x;
	...

	trace_nowake_buffer_unlock_commit(buffer, event,
					  0, 0);
}

You will need to do a register_ftrace_event() to register that 'mytype'
and how to output it. Otherwise it would just be ignored in the "trace"
file.

All of the above would work fine as a loadable module that you could
easily maintain out of tree, and still uses the internals of the system.

-- Steve

2010-12-04 08:11:33

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall

On 12/04/2010 02:13 AM, David Sharp wrote:
> Signed-off-by: David Sharp<[email protected]>
> ---
> arch/x86/kvm/trace.h | 8 ++++----
> 1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index a6544b8..ab41fb0 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -62,21 +62,21 @@ TRACE_EVENT(kvm_hv_hypercall,
> TP_ARGS(code, fast, rep_cnt, rep_idx, ingpa, outgpa),
>
> TP_STRUCT__entry(
> - __field( __u16, code )
> - __field( bool, fast )
> __field( __u16, rep_cnt )
> __field( __u16, rep_idx )
> __field( __u64, ingpa )
> __field( __u64, outgpa )
> + __field( __u16, code )
> + __field( bool, fast )
> ),
>

Looks like a pessimisation.

Before: 24 bytes
After: 32 bytes

(on a 64-bit machine, assuming no packing)

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-12-04 13:39:03

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCH 09/15] ftrace: fix event alignment: skb:kfree_skb

On Fri, Dec 03, 2010 at 08:52:05PM -0500, Steven Rostedt wrote:
> [ Ccing Neil for Acked-by ]
>
> -- Steve
>
>
Looks good, thanks!
Acked-by: Neil Horman <[email protected]>


> On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> > Signed-off-by: David Sharp <[email protected]>
> > ---
> > include/trace/events/skb.h | 4 ++--
> > 1 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
> > index 75ce9d5..aa4a56a 100644
> > --- a/include/trace/events/skb.h
> > +++ b/include/trace/events/skb.h
> > @@ -19,16 +19,16 @@ TRACE_EVENT(kfree_skb,
> >
> > TP_STRUCT__entry(
> > __field( void *, skbaddr )
> > - __field( unsigned short, protocol )
> > __field( void *, location )
> > + __field( unsigned short, protocol )
> > ),
> >
> > TP_fast_assign(
> > __entry->skbaddr = skb;
> > + __entry->location = location;
> > if (skb) {
> > __entry->protocol = ntohs(skb->protocol);
> > }
> > - __entry->location = location;
> > ),
> >
> > TP_printk("skbaddr=%p protocol=%u location=%p",
>
>
>

2010-12-06 01:26:27

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH 06/15] ftrace: fix event alignment: module:module_request

Steven Rostedt wrote:
> [ Ccing Li for an Acked-by. He wrote this event ]
>
> -- Steve
>
>
> On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
>> Signed-off-by: David Sharp <[email protected]>

Acked-by: Li Zefan <[email protected]>

>> ---
>> include/trace/events/module.h | 5 ++---
>> 1 files changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/trace/events/module.h b/include/trace/events/module.h
>> index c7bb2f0..8154f5f 100644
>> --- a/include/trace/events/module.h
>> +++ b/include/trace/events/module.h
>> @@ -98,14 +98,14 @@ TRACE_EVENT(module_request,
>> TP_ARGS(name, wait, ip),
>>
>> TP_STRUCT__entry(
>> - __field( bool, wait )
>> __field( unsigned long, ip )
>> + __field( bool, wait )
>> __string( name, name )
>> ),
>>
>> TP_fast_assign(
>> - __entry->wait = wait;
>> __entry->ip = ip;
>> + __entry->wait = wait;
>> __assign_str(name, name);
>> ),
>>
>> @@ -119,4 +119,3 @@ TRACE_EVENT(module_request,
>>
>> /* This part must be outside protection */
>> #include <trace/define_trace.h>
>> -
>
>
>

2010-12-06 13:22:05

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 00/15] Reduce tracing payload size.

David Sharp <[email protected]> writes:
>
> These patches are just a start to shrinking the size of trace events. I am
> planning to also make small versions of trace events that are used when
> CONFIG_SMALL_TRACES is enabled. I'm also open to discussion of other ways to
> shrink event sizes.

Maybe the simplest would be to lzo them as they get logged?
I assume you already considered that?

-Andi

2010-12-06 13:56:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Patch 00/15] Reduce tracing payload size.

On Mon, Dec 06, 2010 at 02:22:00PM +0100, Andi Kleen wrote:
> David Sharp <[email protected]> writes:
> >
> > These patches are just a start to shrinking the size of trace events. I am
> > planning to also make small versions of trace events that are used when
> > CONFIG_SMALL_TRACES is enabled. I'm also open to discussion of other ways to
> > shrink event sizes.
>
> Maybe the simplest would be to lzo them as they get logged?
> I assume you already considered that?

The tracing subsystem is supposed to be extremely low-overhead.
Compressing the event log would add considerable CPU overhead. If we
can shrink the trace events by being more careful about unnecessary
padding, and not logging fields that aren't needed, it's all win....

The problem of course with the latter is that different people will
have disagreements about what is needed or not. And even adding
conditionals will slow down the tracing hotpath.

- Ted

2010-12-06 14:58:58

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 00/15] Reduce tracing payload size.

On Mon, Dec 06, 2010 at 08:56:37AM -0500, Ted Ts'o wrote:
> On Mon, Dec 06, 2010 at 02:22:00PM +0100, Andi Kleen wrote:
> > David Sharp <[email protected]> writes:
> > >
> > > These patches are just a start to shrinking the size of trace events. I am
> > > planning to also make small versions of trace events that are used when
> > > CONFIG_SMALL_TRACES is enabled. I'm also open to discussion of other ways to
> > > shrink event sizes.
> >
> > Maybe the simplest would be to lzo them as they get logged?
> > I assume you already considered that?
>
> The tracing subsystem is supposed to be extremely low-overhead.
> Compressing the event log would add considerable CPU overhead. If we

lzo as a rule of thumb is about as expensive as 3x memcpy()
That's not really expensive.

-Andi

--
[email protected] -- Speaking for myself only.

2010-12-06 16:17:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [Patch 00/15] Reduce tracing payload size.

On Mon, 2010-12-06 at 15:58 +0100, Andi Kleen wrote:
> On Mon, Dec 06, 2010 at 08:56:37AM -0500, Ted Ts'o wrote:
> > On Mon, Dec 06, 2010 at 02:22:00PM +0100, Andi Kleen wrote:
> > > David Sharp <[email protected]> writes:
> > > >
> > > > These patches are just a start to shrinking the size of trace events. I am
> > > > planning to also make small versions of trace events that are used when
> > > > CONFIG_SMALL_TRACES is enabled. I'm also open to discussion of other ways to
> > > > shrink event sizes.
> > >
> > > Maybe the simplest would be to lzo them as they get logged?
> > > I assume you already considered that?
> >
> > The tracing subsystem is supposed to be extremely low-overhead.
> > Compressing the event log would add considerable CPU overhead. If we
>
> lzo as a rule of thumb is about as expensive as 3x memcpy()
> That's not really expensive.

Usually the most expensive thing in the trace is the memcpy(). Which is
why we try to do it only once.

-- Steve

2010-12-06 16:31:07

by Miguel Ojeda

[permalink] [raw]
Subject: Re: [Patch 00/15] Reduce tracing payload size.

On Mon, Dec 6, 2010 at 3:58 PM, Andi Kleen <[email protected]> wrote:
> On Mon, Dec 06, 2010 at 08:56:37AM -0500, Ted Ts'o wrote:
>> On Mon, Dec 06, 2010 at 02:22:00PM +0100, Andi Kleen wrote:
>> > David Sharp <[email protected]> writes:
>> > >
>> > > These patches are just a start to shrinking the size of trace events. I am
>> > > planning to also make small versions of trace events that are used when
>> > > CONFIG_SMALL_TRACES is enabled. I'm also open to discussion of other ways to
>> > > shrink event sizes.
>> >
>> > Maybe the simplest would be to lzo them as they get logged?
>> > I assume you already considered that?
>>
>> The tracing subsystem is supposed to be extremely low-overhead.
>> Compressing the event log would add considerable CPU overhead. If we
>
> lzo as a rule of thumb is about as expensive as 3x memcpy()
> That's not really expensive.

That is true for the decompression step but not for the compression
one, which takes more than 10 memcpys().

>
> -Andi
>
> --
> [email protected] -- Speaking for myself only.

2010-12-06 16:41:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 00/15] Reduce tracing payload size.

> That is true for the decompression step but not for the compression
> one, which takes more than 10 memcpys().

That's a good point. At 10x it's definitely too slow. Too bad, would have been too easy.

I guess one could still consider some very cheap, data-specific delta compression
(I assume there's a lot of redundancy in a tracing stream, with similar events
occurring in a row or nearby). I used this technique successfully with large log
files in the past, and overall moving less data around ends up being cheaper.

-Andi

--
[email protected] -- Speaking for myself only.
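
As an aside, here is a minimal userspace sketch of the kind of cheap,
data-specific delta encoding being suggested (purely illustrative: the fixed
array of timestamps and the helper name are invented for the example; this is
not ftrace's actual event format):

#include <stdio.h>
#include <stdint.h>

/* Encode each 64-bit timestamp as the difference from the previous one.
 * Events that occur close together produce small deltas with few
 * significant bytes, which is where a cheap encoder could save space. */
static void delta_encode(const uint64_t *ts, uint64_t *out, int n)
{
	uint64_t prev = 0;
	int i;

	for (i = 0; i < n; i++) {
		out[i] = ts[i] - prev;
		prev = ts[i];
	}
}

int main(void)
{
	uint64_t ts[] = { 1000000, 1000210, 1000215, 1000900 };
	uint64_t delta[4];
	int i;

	delta_encode(ts, delta, 4);
	for (i = 0; i < 4; i++)
		printf("%llu -> delta %llu\n",
		       (unsigned long long)ts[i],
		       (unsigned long long)delta[i]);
	return 0;
}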

2010-12-06 20:38:51

by David Sharp

[permalink] [raw]
Subject: Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall

On Sat, Dec 4, 2010 at 12:11 AM, Avi Kivity <[email protected]> wrote:
> On 12/04/2010 02:13 AM, David Sharp wrote:
>>
>> Signed-off-by: David Sharp<[email protected]>
>> ---
>>  arch/x86/kvm/trace.h |    8 ++++----
>>  1 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
>> index a6544b8..ab41fb0 100644
>> --- a/arch/x86/kvm/trace.h
>> +++ b/arch/x86/kvm/trace.h
>> @@ -62,21 +62,21 @@ TRACE_EVENT(kvm_hv_hypercall,
>>        TP_ARGS(code, fast, rep_cnt, rep_idx, ingpa, outgpa),
>>
>>        TP_STRUCT__entry(
>> -               __field(        __u16,          code            )
>> -               __field(        bool,           fast            )
>>                __field(        __u16,          rep_cnt         )
>>                __field(        __u16,          rep_idx         )
>>                __field(        __u64,          ingpa           )
>>                __field(        __u64,          outgpa          )
>> +               __field(        __u16,          code            )
>> +               __field(        bool,           fast            )
>>        ),
>>
>
> Looks like a pessimisation.
>
> Before: 24 bytes
> After: 32 bytes
>
> (on a 64-bit machine, assuming no packing)

This patch is predicated on packing the event structures. And since
the ring buffer is 32-bit addressable, I don't attempt to improve
alignment beyond 32-bit boundaries.

>
> --
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
>
>

2010-12-07 09:23:06

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall

On 12/06/2010 10:38 PM, David Sharp wrote:
> On Sat, Dec 4, 2010 at 12:11 AM, Avi Kivity<[email protected]> wrote:
> > On 12/04/2010 02:13 AM, David Sharp wrote:
> >>
> >> Signed-off-by: David Sharp<[email protected]>
> >> ---
> >> arch/x86/kvm/trace.h | 8 ++++----
> >> 1 files changed, 4 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> >> index a6544b8..ab41fb0 100644
> >> --- a/arch/x86/kvm/trace.h
> >> +++ b/arch/x86/kvm/trace.h
> >> @@ -62,21 +62,21 @@ TRACE_EVENT(kvm_hv_hypercall,
> >> TP_ARGS(code, fast, rep_cnt, rep_idx, ingpa, outgpa),
> >>
> >> TP_STRUCT__entry(
> >> - __field( __u16, code )
> >> - __field( bool, fast )
> >> __field( __u16, rep_cnt )
> >> __field( __u16, rep_idx )
> >> __field( __u64, ingpa )
> >> __field( __u64, outgpa )
> >> + __field( __u16, code )
> >> + __field( bool, fast )
> >> ),
> >>
> >
> > Looks like a pessimisation.
> >
> > Before: 24 bytes
> > After: 32 bytes
> >
> > (on a 64-bit machine, assuming no packing)
>
> This patch is predicated on packing the event structures. And since
> the ring buffer is 32-bit addressable, I don't attempt to improve
> alignment beyond 32-bit boundaries.

I don't understand this. Can you elaborate? What does "32-bit
addressable" mean? And "predicated on packing the event structures"?
Is the structure __attribute__((packed)), or is it not?

--
error compiling committee.c: too many arguments to function

2010-12-07 21:17:13

by David Sharp

[permalink] [raw]
Subject: Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall

On Tue, Dec 7, 2010 at 1:22 AM, Avi Kivity <[email protected]> wrote:
> On 12/06/2010 10:38 PM, David Sharp wrote:
>> On Sat, Dec 4, 2010 at 12:11 AM, Avi Kivity<[email protected]>  wrote:
>> >  On 12/04/2010 02:13 AM, David Sharp wrote:
>> >>
>> >>  Signed-off-by: David Sharp<[email protected]>
>> >>  ---
>> >>    arch/x86/kvm/trace.h |    8 ++++----
>> >>    1 files changed, 4 insertions(+), 4 deletions(-)
>> >>
>> >>  diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
>> >>  index a6544b8..ab41fb0 100644
>> >>  --- a/arch/x86/kvm/trace.h
>> >>  +++ b/arch/x86/kvm/trace.h
>> >>  @@ -62,21 +62,21 @@ TRACE_EVENT(kvm_hv_hypercall,
>> >>          TP_ARGS(code, fast, rep_cnt, rep_idx, ingpa, outgpa),
>> >>
>> >>          TP_STRUCT__entry(
>> >>  -               __field(        __u16,          code            )
>> >>  -               __field(        bool,           fast            )
>> >>                  __field(        __u16,          rep_cnt         )
>> >>                  __field(        __u16,          rep_idx         )
>> >>                  __field(        __u64,          ingpa           )
>> >>                  __field(        __u64,          outgpa          )
>> >>  +               __field(        __u16,          code            )
>> >>  +               __field(        bool,           fast            )
>> >>          ),
>> >>
>> >
>> >  Looks like a pessimisation.
>> >
>> >  Before: 24 bytes
>> >  After: 32 bytes
>> >
>> >  (on a 64-bit machine, assuming no packing)
>>
>> This patch is predicated on packing the event structures. And since
>> the ring buffer is 32-bit addressable, I don't attempt to improve
>> alignment beyond 32-bit boundaries.
>
> I don't understand this.  Can you elaborate?  What does "32-bit addressable"
> mean?

The ring buffer gives you space that is a multiple of 4 bytes in
length and 32-bit aligned. Therefore it is useless to attempt to
align the structure beyond 32-bit boundaries, e.g., to a 64-bit boundary,
because it is unpredictable whether the memory the structure will be
written to starts at a 64-bit boundary (addr % 8 could be 0 or 4).

> And "predicated on packing the event structures"?  Is the structure
> __attribute__((packed)), or is it not?

It is not packed in Linus' tree, but an earlier patch in this series
adds __attribute__((packed)). This patch assumes that the event packing
patch has been applied, and should not be applied if the packing patch
is not (hence, "predicated").

>
> --
> error compiling committee.c: too many arguments to function
>
>

2010-12-07 22:44:27

by David Sharp

[permalink] [raw]
Subject: Re: [PATCH 03/15] ring_buffer: Align buffer_page struct allocations only to fit the flags.

On Fri, Dec 3, 2010 at 5:43 PM, Steven Rostedt <[email protected]> wrote:
> On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
>> buffer_page structs need to be aligned to 4-byte boundaries because the page
>> flags are stored in the two least-significant bits of the pointers in the page
>> list. Aligning to cache lines is sufficient, but doesn't seem to be necessary.
>> Reducing the alignment to only 4 bytes may improve cache efficiency.
>
> The reason for the cache line boundaries is that the bpages are per cpu,
> and if they are allocated next to some variable that gets changed
> globally, or even another bpage that is used for another cpu buffer, then
> I would expect we would get some cache line bouncing.

I see. That makes sense. I'm not attached to this patch, we can drop it.

> Perhaps we should make bpage have its own slab allocation
> (kmem_cache_alloc). And perhaps even try to keep all bpages that share
> the same cache line on the same cpu buffer.

Maybe. Although, I was thinking of some patches to keep pages in a
free-pool, so that when a cpu buffer needs a new page, it takes from
the pool, and when a reader is done with a page, it returns to the
free-pool. (This helps with the situation where more events occur on
one CPU than others: pages are not locked to a CPU that isn't
producing events.) In that case, it would be impossible to predict
what cpu a bpage belongs to. But since this is theoretical, maybe it's
not relevant.
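
For readers unfamiliar with the constraint discussed in the changelog above,
here is a minimal userspace sketch of the "flags in the low pointer bits"
trick (an illustration under the stated 4-byte-alignment assumption, not the
kernel's actual buffer_page list code):

#include <assert.h>
#include <stdint.h>

#define PAGE_FLAG_MASK	3UL	/* the two least-significant pointer bits */

struct bpage {
	struct bpage *next;	/* low two bits double as per-page flags */
};

static struct bpage *bpage_next(struct bpage *p)
{
	return (struct bpage *)((uintptr_t)p->next & ~(uintptr_t)PAGE_FLAG_MASK);
}

static unsigned long bpage_flags(struct bpage *p)
{
	return (uintptr_t)p->next & PAGE_FLAG_MASK;
}

static void bpage_set_next(struct bpage *p, struct bpage *next, unsigned long flags)
{
	/* only safe if 'next' is at least 4-byte aligned */
	assert(((uintptr_t)next & PAGE_FLAG_MASK) == 0);
	p->next = (struct bpage *)((uintptr_t)next | flags);
}

int main(void)
{
	struct bpage a, b;

	bpage_set_next(&a, &b, 1UL);	/* tag 'b' with flag bit 0 */
	return (bpage_next(&a) == &b && bpage_flags(&a) == 1UL) ? 0 : 1;
}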

2010-12-08 09:18:14

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall

On 12/07/2010 11:16 PM, David Sharp wrote:
> >
> > I don't understand this. Can you elaborate? What does "32-bit addressable"
> > mean?
>
> The ring buffer gives you space that is a multiple of 4 bytes in
> length, and 32-bit aligned. Therefore it is useless to attempt to
> align the structure beyond 32-bit boundaries, eg, a 64-bit boundary,
> because it is unpredictable if the memory the structure will be
> written to is at a 64-bit boundary (addr % 8 could be 0 or 4).
>
> > And "predicated on packing the event structures"? Is the structure
> > __attribute__((packed)), or is it not?
>
> It is not packed in Linus' tree, but one of the patches before this
> patch in this patch series adds __attribute__((packed)). This patch
> assumes that the event packing patch has been applied. This patch
> should not be applied if the packing patch is not (hence,
> "predicated").

Thanks for the explanations, it makes sense now.

--
error compiling committee.c: too many arguments to function

2010-12-08 20:16:40

by David Sharp

[permalink] [raw]
Subject: Re: [PATCH 01/15] tracing: Add a 'buffer_overwrite' debugfs file

On Fri, Dec 3, 2010 at 5:57 PM, Steven Rostedt <[email protected]> wrote:
> On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
>> From: Jiaying Zhang <[email protected]>
>>
>> Add a "buffer_overwrite" debugfs file for ftrace to control whether the buffer
>
> Instead of adding a new file, make this another "trace_option".

To do this, I had to add a special case to set_tracer_flags() to call
ring_buffer_change_overwrite. It also means adding something not
iterator-related to trace_iterator_flags. Although, maybe it's just
the name of the enum and value prefix that should change.

I'll send the new patch to you separately today.

>
> Thanks,
>
> -- Steve
>
>
>> should be overwritten on overflow or not. The default remains to overwrite old
>> events when the buffer is full. This patch adds the option to instead discard
>> newest events when the buffer is full. This is useful to get a snapshot of
>> traces just after enabling traces. Dropping the current event is also a simpler
>> code path.
>>
>> Signed-off-by: David Sharp <[email protected]>
>> ---
>>  Documentation/trace/ftrace.txt |    6 ++++
>>  include/linux/ring_buffer.h    |    2 +
>>  kernel/trace/ring_buffer.c     |   11 +++++++
>>  kernel/trace/trace.c           |   59 +++++++++++++++++++++++++++++++++++++---
>>  4 files changed, 74 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
>> index 557c1ed..9237da3 100644
>> --- a/Documentation/trace/ftrace.txt
>> +++ b/Documentation/trace/ftrace.txt
>> @@ -138,6 +138,12 @@ of ftrace. Here is a list of some of the key files:
>>       This can only be updated when the current_tracer
>>       is set to "nop".
>>
>> +  buffer_overwrite:
>> +
>> +     This controls what happens when the trace buffer is full.
>> +     If "1" (default), the oldest events are discarded and
>> +     overwritten. If "0", then the newest events are discarded.
>> +
>>    tracing_cpumask:
>>
>>       This is a mask that lets the user only trace
>> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
>> index 8d3a248..ab38ac8 100644
>> --- a/include/linux/ring_buffer.h
>> +++ b/include/linux/ring_buffer.h
>> @@ -100,6 +100,8 @@ void ring_buffer_free(struct ring_buffer *buffer);
>>
>>  int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
>>
>> +void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
>> +
>>  struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer,
>>                                                  unsigned long length);
>>  int ring_buffer_unlock_commit(struct ring_buffer *buffer,
>> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
>> index 9ed509a..3207147 100644
>> --- a/kernel/trace/ring_buffer.c
>> +++ b/kernel/trace/ring_buffer.c
>> @@ -1429,6 +1429,17 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
>>  }
>>  EXPORT_SYMBOL_GPL(ring_buffer_resize);
>>
>> +void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val)
>> +{
>> +     mutex_lock(&buffer->mutex);
>> +     if (val)
>> +             buffer->flags |= RB_FL_OVERWRITE;
>> +     else
>> +             buffer->flags &= ~RB_FL_OVERWRITE;
>> +     mutex_unlock(&buffer->mutex);
>> +}
>> +EXPORT_SYMBOL_GPL(ring_buffer_change_overwrite);
>> +
>>  static inline void *
>>  __rb_data_page_index(struct buffer_data_page *bpage, unsigned index)
>>  {
>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>> index c380612..ed5c14f 100644
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -41,8 +41,6 @@
>>  #include "trace.h"
>>  #include "trace_output.h"
>>
>> -#define TRACE_BUFFER_FLAGS   (RB_FL_OVERWRITE)
>> -
>>  /*
>>   * On boot up, the ring buffer is set to the minimum size, so that
>>   * we do not waste memory on systems that are not using tracing.
>> @@ -241,6 +239,9 @@ int tracing_is_enabled(void)
>>
>>  static unsigned long         trace_buf_size = TRACE_BUF_SIZE_DEFAULT;
>>
>> +/* whether the trace buffer should be overwritten on overflow or not. */
>> +static enum ring_buffer_flags  trace_buffer_flags = RB_FL_OVERWRITE;
>> +
>>  /* trace_types holds a link list of available tracers. */
>>  static struct tracer         *trace_types __read_mostly;
>>
>> @@ -3466,6 +3467,47 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
>>       return cnt;
>>  }
>>
>> +static ssize_t
>> +tracing_overwrite_read(struct file *filp, char __user *ubuf,
>> +                  size_t cnt, loff_t *ppos)
>> +{
>> +     char buf[64];
>> +     int r;
>> +     r = snprintf(buf, 64, "%u\n", trace_buffer_flags);
>> +     return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
>> +}
>> +
>> +static ssize_t
>> +tracing_overwrite_write(struct file *filp, const char __user *ubuf,
>> +                   size_t cnt, loff_t *ppos)
>> +{
>> +     unsigned long val;
>> +     char buf[64];
>> +     int ret;
>> +
>> +     if (cnt >= sizeof(buf))
>> +             return -EINVAL;
>> +
>> +     if (copy_from_user(&buf, ubuf, cnt))
>> +             return -EFAULT;
>> +
>> +     buf[cnt] = 0;
>> +
>> +     ret = strict_strtoul(buf, 10, &val);
>> +     if (ret < 0)
>> +             return ret;
>> +     *ppos += cnt;
>> +
>> +     if (val != 0 && val != 1)
>> +             return -EINVAL;
>> +
>> +     if (trace_buffer_flags != val) {
>> +             trace_buffer_flags = val;
>> +             ring_buffer_change_overwrite(global_trace.buffer, val);
>> +     }
>> +     return cnt;
>> +}
>> +
>>  static int mark_printk(const char *fmt, ...)
>>  {
>>       int ret;
>> @@ -3611,6 +3653,12 @@ static const struct file_operations tracing_entries_fops = {
>>       .llseek         = generic_file_llseek,
>>  };
>>
>> +static const struct file_operations tracing_overwrite_fops = {
>> +     .open           = tracing_open_generic,
>> +     .read           = tracing_overwrite_read,
>> +     .write          = tracing_overwrite_write,
>> +};
>> +
>>  static const struct file_operations tracing_mark_fops = {
>>       .open           = tracing_open_generic,
>>       .write          = tracing_mark_write,
>> @@ -4336,6 +4384,9 @@ static __init int tracer_init_debugfs(void)
>>       trace_create_file("buffer_size_kb", 0644, d_tracer,
>>                       &global_trace, &tracing_entries_fops);
>>
>> +     trace_create_file("buffer_overwrite", 0644, d_tracer,
>> +                     &global_trace, &tracing_overwrite_fops);
>> +
>>       trace_create_file("trace_marker", 0220, d_tracer,
>>                       NULL, &tracing_mark_fops);
>>
>> @@ -4565,7 +4616,7 @@ __init static int tracer_alloc_buffers(void)
>>
>>       /* TODO: make the number of buffers hot pluggable with CPUS */
>>       global_trace.buffer = ring_buffer_alloc(ring_buf_size,
>> -                                                TRACE_BUFFER_FLAGS);
>> +                                                trace_buffer_flags);
>>       if (!global_trace.buffer) {
>>               printk(KERN_ERR "tracer: failed to allocate ring buffer!\n");
>>               WARN_ON(1);
>> @@ -4575,7 +4626,7 @@ __init static int tracer_alloc_buffers(void)
>>
>>
>>  #ifdef CONFIG_TRACER_MAX_TRACE
>> -     max_tr.buffer = ring_buffer_alloc(1, TRACE_BUFFER_FLAGS);
>> +     max_tr.buffer = ring_buffer_alloc(1, trace_buffer_flags);
>>       if (!max_tr.buffer) {
>>               printk(KERN_ERR "tracer: failed to allocate max ring buffer!\n");
>>               WARN_ON(1);
>
>
>

2010-12-08 21:47:27

by David Sharp

[permalink] [raw]
Subject: [PATCH] tracing: Add an 'overwrite' trace_option.

Add an "overwrite" trace_option for ftrace to control whether the buffer should
be overwritten on overflow or not. The default remains to overwrite old events
when the buffer is full. This patch adds the option to instead discard newest
events when the buffer is full. This is useful to get a snapshot of traces just
after enabling traces. Dropping the current event is also a simpler code path.
---
Documentation/trace/ftrace.txt | 5 +++++
include/linux/ring_buffer.h | 2 ++
kernel/trace/ring_buffer.c | 11 +++++++++++
kernel/trace/trace.c | 17 +++++++++++------
kernel/trace/trace.h | 1 +
5 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 557c1ed..2842ea7 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -491,6 +491,11 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
latencies, as described in "Latency
trace format".

+ overwrite - This controls what happens when the trace buffer is
+ full. If "1" (default), the oldest events are
+ discarded and overwritten. If "0", then the newest
+ events are discarded.
+
sched_switch
------------

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 8d3a248..ab38ac8 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -100,6 +100,8 @@ void ring_buffer_free(struct ring_buffer *buffer);

int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);

+void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
+
struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer,
unsigned long length);
int ring_buffer_unlock_commit(struct ring_buffer *buffer,
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9ed509a..3207147 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1429,6 +1429,17 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
}
EXPORT_SYMBOL_GPL(ring_buffer_resize);

+void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val)
+{
+ mutex_lock(&buffer->mutex);
+ if (val)
+ buffer->flags |= RB_FL_OVERWRITE;
+ else
+ buffer->flags &= ~RB_FL_OVERWRITE;
+ mutex_unlock(&buffer->mutex);
+}
+EXPORT_SYMBOL_GPL(ring_buffer_change_overwrite);
+
static inline void *
__rb_data_page_index(struct buffer_data_page *bpage, unsigned index)
{
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c380612..6f7494c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -41,8 +41,6 @@
#include "trace.h"
#include "trace_output.h"

-#define TRACE_BUFFER_FLAGS (RB_FL_OVERWRITE)
-
/*
* On boot up, the ring buffer is set to the minimum size, so that
* we do not waste memory on systems that are not using tracing.
@@ -340,7 +338,7 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
/* trace_flags holds trace_options default values */
unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO | TRACE_ITER_SLEEP_TIME |
- TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD;
+ TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD | TRACE_ITER_OVERWRITE;

static int trace_stop_count;
static DEFINE_SPINLOCK(tracing_start_lock);
@@ -425,6 +423,7 @@ static const char *trace_options[] = {
"sleep-time",
"graph-time",
"record-cmd",
+ "overwrite",
NULL
};

@@ -2523,6 +2522,9 @@ static void set_tracer_flags(unsigned int mask, int enabled)

if (mask == TRACE_ITER_RECORD_CMD)
trace_event_enable_cmd_record(enabled);
+
+ if (mask == TRACE_ITER_OVERWRITE)
+ ring_buffer_change_overwrite(global_trace.buffer, enabled);
}

static ssize_t
@@ -4545,9 +4547,11 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
__init static int tracer_alloc_buffers(void)
{
int ring_buf_size;
+ enum ring_buffer_flags rb_flags;
int i;
int ret = -ENOMEM;

+
if (!alloc_cpumask_var(&tracing_buffer_mask, GFP_KERNEL))
goto out;

@@ -4560,12 +4564,13 @@ __init static int tracer_alloc_buffers(void)
else
ring_buf_size = 1;

+ rb_flags = trace_flags & TRACE_ITER_OVERWRITE ? RB_FL_OVERWRITE : 0;
+
cpumask_copy(tracing_buffer_mask, cpu_possible_mask);
cpumask_copy(tracing_cpumask, cpu_all_mask);

/* TODO: make the number of buffers hot pluggable with CPUS */
- global_trace.buffer = ring_buffer_alloc(ring_buf_size,
- TRACE_BUFFER_FLAGS);
+ global_trace.buffer = ring_buffer_alloc(ring_buf_size, rb_flags);
if (!global_trace.buffer) {
printk(KERN_ERR "tracer: failed to allocate ring buffer!\n");
WARN_ON(1);
@@ -4575,7 +4580,7 @@ __init static int tracer_alloc_buffers(void)


#ifdef CONFIG_TRACER_MAX_TRACE
- max_tr.buffer = ring_buffer_alloc(1, TRACE_BUFFER_FLAGS);
+ max_tr.buffer = ring_buffer_alloc(1, rb_flags);
if (!max_tr.buffer) {
printk(KERN_ERR "tracer: failed to allocate max ring buffer!\n");
WARN_ON(1);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 9021f8c..9cd025a 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -606,6 +606,7 @@ enum trace_iterator_flags {
TRACE_ITER_SLEEP_TIME = 0x40000,
TRACE_ITER_GRAPH_TIME = 0x80000,
TRACE_ITER_RECORD_CMD = 0x100000,
+ TRACE_ITER_OVERWRITE = 0x200000,
};

/*
--
1.7.3.1
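
A small usage sketch for the option added above (hedged: the path assumes
debugfs is mounted at /sys/kernel/debug and that, like other trace_options
entries, the flag also shows up as a per-option file under tracing/options/).
Writing '0' makes the ring buffer stop accepting events once it is full, so a
snapshot taken right after enabling tracing is preserved until it is read:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* needs root and a mounted debugfs */
	int fd = open("/sys/kernel/debug/tracing/options/overwrite", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "0", 1) != 1)	/* "0": discard newest events when full */
		perror("write");
	close(fd);
	return 0;
}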

2010-12-09 13:33:15

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 08/15] ftrace: fix event alignment: mce:mce_record

On Fri, Dec 03, 2010 at 08:50:22PM -0500, Steven Rostedt wrote:
> [ Ccing Hidetoshi and Frederic for Acked-by's ]

No problem for me.

Thanks.

> -- Steve
>
>
> On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> > Signed-off-by: David Sharp <[email protected]>
> > ---
> > include/trace/events/mce.h | 8 ++++----
> > 1 files changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
> > index 7eee778..4cbbcef 100644
> > --- a/include/trace/events/mce.h
> > +++ b/include/trace/events/mce.h
> > @@ -17,36 +17,36 @@ TRACE_EVENT(mce_record,
> > TP_STRUCT__entry(
> > __field( u64, mcgcap )
> > __field( u64, mcgstatus )
> > - __field( u8, bank )
> > __field( u64, status )
> > __field( u64, addr )
> > __field( u64, misc )
> > __field( u64, ip )
> > - __field( u8, cs )
> > __field( u64, tsc )
> > __field( u64, walltime )
> > __field( u32, cpu )
> > __field( u32, cpuid )
> > __field( u32, apicid )
> > __field( u32, socketid )
> > + __field( u8, cs )
> > + __field( u8, bank )
> > __field( u8, cpuvendor )
> > ),
> >
> > TP_fast_assign(
> > __entry->mcgcap = m->mcgcap;
> > __entry->mcgstatus = m->mcgstatus;
> > - __entry->bank = m->bank;
> > __entry->status = m->status;
> > __entry->addr = m->addr;
> > __entry->misc = m->misc;
> > __entry->ip = m->ip;
> > - __entry->cs = m->cs;
> > __entry->tsc = m->tsc;
> > __entry->walltime = m->time;
> > __entry->cpu = m->extcpu;
> > __entry->cpuid = m->cpuid;
> > __entry->apicid = m->apicid;
> > __entry->socketid = m->socketid;
> > + __entry->cs = m->cs;
> > + __entry->bank = m->bank;
> > __entry->cpuvendor = m->cpuvendor;
> > ),
> >
>
>

2010-12-09 14:55:51

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 13/15] small_traces: Add config option to shrink trace events.

On Fri, Dec 03, 2010 at 09:54:12PM -0500, Steven Rostedt wrote:
> On Fri, 2010-12-03 at 18:33 -0800, David Sharp wrote:
> > I considered that, and I generally thing it's a good idea. However, I
> > also want to use this switch to shrink individual tracepoint event
> > structures.
> >
> > eg: sched switch is a high frequency event and it is 68 bytes (60
> > after these patches)
> >
> > Can you suggest a syntax for TRACE_EVENT, DECLARE_EVENT_CLASS, etc,
> > that could express the two versions and produce the right code?
> >
> > I'm worried about adding even further complexity to the TRACE_EVENT
> > macros. I could add TRACE_EVENT_SMALL that takes two versions of
> > TP_STRUCT__entry, TP_fast_assign, and TP_printk each, but then this
> > will need to be permuted with your TP_CONDITIONAL patches as well.
>
> I would not touch the TRACE_EVENT() structures. They are there as is and
> I would not think about changing them. Something like that would never
> make it into mainline.
>
> Now what you can do, is to make your own events based off of the same
> tracepoints. For example, the TRACE_EVENT(sched_switch...) has in
> sched.c:
>
> trace_sched_switch(prev, next);
>
>
> You could even write a module that does something like this:
>
> register_trace_sched_switch(probe_sched_switch, mydata);
>
>
>
> void probe_sched_switch(void *mydata,
> struct task_struct *prev,
> struct task_struct *next)
> {
> struct ring_buffer *buffer;
> struct ring_buffer_event *event;
> struct myentry *entry;
>
> event = trace_current_buffer_lock_reserve(buffer,
> mytype, sizeof(*entry),
> 0, 0);
>
> if (!event)
> return;
>
> entry = ring_buffer_event_data(event);
>
> entry->myfield = prev->x;
> ...
>
> trace_nowake_buffer_unlock_commit(buffer, event,
> 0, 0);
> }
>
> You will need to do a register_ftrace_event() to register that 'mytype'
> and how to output it. Otherwise it would just be ignored in the "trace"
> file.
>
> All of the above would work fine as a loadable module that you could
> easily maintain out of tree, and still uses the internals of the system.
>
> -- Steve
>
>


But this would improve only Google's tracing, while this is a general
mainline tracing problem.

The first thing is that we need to get rid of the lock_depth field; the BKL
is dying.

For the rest, what about having a bitmap of the fields we want to ignore,
which can be set up from a trace file for ftrace and as an ioctl for perf?

So this bitmap is easy to implement on the common fields.

For the rest, one could choose between using TP_fast_assign()
and TP_cond_assign().

TP_fast_assign() stays as is and doesn't implement bitmap field
ignoring. Those who want conditional record will need
TP_cond_assign().
Well, unfortunately this probably requires us to play
the same trickery as SYSCALL_DEFINE(), in that we'll probably
need TP_cond_assign1(), TP_cond_assign2(), TP_cond_assign3(), etc.

#define TP_cond_assignx(nr, assign)		\
	if (call->bitmask & (1 << (nr))) {	\
		assign				\
	}

#define TP_cond_assign2(nr, assign, ...)	\
	TP_cond_assignx(nr, assign)		\
	TP_cond_assign1(nr + 1, __VA_ARGS__)

#define TP_cond_assign3(nr, assign, ...)	\
	TP_cond_assignx(nr, assign)		\
	TP_cond_assign2(nr + 1, __VA_ARGS__)

That will also require a bit more trickery to dynamically
pre-compute the size of the trace entry.
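
As a toy, userspace-only illustration of the bitmap idea (the field bits and
the record mask are invented for the example; the real implementation would
live in the TRACE_EVENT macros), the conditional-assign wrapper could behave
like this:

#include <stdio.h>

struct entry {
	unsigned long ip;
	int pid;
	int prio;
};

#define FIELD_IP	(1U << 0)
#define FIELD_PID	(1U << 1)
#define FIELD_PRIO	(1U << 2)

/* skip the assignment entirely when the field's bit is cleared */
#define cond_assign(bitmap, bit, lval, rval)	\
	do {					\
		if ((bitmap) & (bit))		\
			(lval) = (rval);	\
	} while (0)

int main(void)
{
	unsigned int record_mask = FIELD_IP | FIELD_PID;	/* prio ignored */
	struct entry e = { 0, 0, -1 };

	cond_assign(record_mask, FIELD_IP, e.ip, 0xc0ffee);
	cond_assign(record_mask, FIELD_PID, e.pid, 1234);
	cond_assign(record_mask, FIELD_PRIO, e.prio, 120);

	printf("ip=%#lx pid=%d prio=%d\n", e.ip, e.pid, e.prio);
	return 0;
}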

2010-12-09 15:08:22

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 13/15] small_traces: Add config option to shrink trace events.

On Thu, 2010-12-09 at 15:55 +0100, Frederic Weisbecker wrote:
> On Fri, Dec 03, 2010 at 09:54:12PM -0500, Steven Rostedt wrote:
> > On Fri, 2010-12-03 at 18:33 -0800, David Sharp wrote:
> > > I considered that, and I generally thing it's a good idea. However, I
> > > also want to use this switch to shrink individual tracepoint event
> > > structures.
> > >
> > > eg: sched switch is a high frequency event and it is 68 bytes (60
> > > after these patches)
> > >
> > > Can you suggest a syntax for TRACE_EVENT, DECLARE_EVENT_CLASS, etc,
> > > that could express the two versions and produce the right code?
> > >
> > > I'm worried about adding even further complexity to the TRACE_EVENT
> > > macros. I could add TRACE_EVENT_SMALL that takes two versions of
> > > TP_STRUCT__entry, TP_fast_assign, and TP_printk each, but then this
> > > will need to be permuted with your TP_CONDITIONAL patches as well.
> >
> > I would not touch the TRACE_EVENT() structures. They are there as is and
> > I would not think about changing them. Something like that would never
> > make it into mainline.
> >
> > Now what you can do, is to make your own events based off of the same
> > tracepoints. For example, the TRACE_EVENT(sched_switch...) has in
> > sched.c:
> >
> > trace_sched_switch(prev, next);
> >
> >
> > You could even write a module that does something like this:
> >
> > register_trace_sched_switch(probe_sched_switch, mydata);
> >
> >
> >
> > void probe_sched_switch(void *mydata,
> > struct task_struct *prev,
> > struct task_struct *next)
> > {
> > struct ring_buffer *buffer;
> > struct ring_buffer_event *event;
> > struct myentry *entry;
> >
> > event = trace_current_buffer_lock_reserve(buffer,
> > mytype, sizeof(*entry),
> > 0, 0);
> >
> > if (!event)
> > return;
> >
> > entry = ring_buffer_event_data(event);
> >
> > entry->myfield = prev->x;
> > ...
> >
> > trace_nowake_buffer_unlock_commit(buffer, event,
> > 0, 0);
> > }
> >
> > You will need to do a register_ftrace_event() to register that 'mytype'
> > and how to output it. Otherwise it would just be ignored in the "trace"
> > file.
> >
> > All of the above would work fine as a loadable module that you could
> > easily maintain out of tree, and still uses the internals of the system.
> >
> > -- Steve
> >
> >
>
>
> But this would improve only google's tracing while this is a general
> mainline tracing problem.
>
> The first thing is that we need to get rid of the lock_depth field, the bkl
> is dying.

Yeah that needs to go :-)

>
> For the rest what about having a bitmap of the fields we want to ignore,
> which can be setup from a trace file for ftrace and as an ioctl for perf.
>
> So this bitmap is easy to implement on the common fields.
>
> For the rest, one could choose between using TP_fast_assign()
> and TP_cond_assign().
>
> TP_fast_assign() stays as is and doesn't implement bitmap field
> ignoring. Those who want conditional record will need
> TP_cond_assign().
> Well, unfortunately this probably requires us to play
> the same trickery than SYSCALL_DEFINE() in that we'll probably
> need TP_cond_assign1(), TP_cond_assign2(), TP_cond_assign3(), etc...
>
> #define TP_cond_assignx(nr, assign) \
> if (call->bitmask & nr) { \
> assign
> }
>
> #define TP_cond_assign2(nr, assign, ...) \
> TP_cond_assignx(nr, assign) \
> TP_cond_assign1(nr + 1, __VA_ARGS__)
>
> #define TP_cond_assign3(nr, assign, ...) \
> TP_cond_assignx(nr, assign) \
> TP_cond_assign2(nr + 1, __VA_ARGS__)
>
> That will also require a bit more trickery to dynamically
> pre-compute the size of the trace entry.

Mathieu is working on encapsulating the assignments in their own macros.

Instead of doing:

__entry->foo = bar;

We will have:

tp_assign(foo, bar);

This way we could probably use this to dynamically figure out what to
assign.

-- Steve

2010-12-09 15:28:16

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 13/15] small_traces: Add config option to shrink trace events.

On Thu, Dec 09, 2010 at 10:08:17AM -0500, Steven Rostedt wrote:
> On Thu, 2010-12-09 at 15:55 +0100, Frederic Weisbecker wrote:
> > On Fri, Dec 03, 2010 at 09:54:12PM -0500, Steven Rostedt wrote:
> > > On Fri, 2010-12-03 at 18:33 -0800, David Sharp wrote:
> > > > I considered that, and I generally thing it's a good idea. However, I
> > > > also want to use this switch to shrink individual tracepoint event
> > > > structures.
> > > >
> > > > eg: sched switch is a high frequency event and it is 68 bytes (60
> > > > after these patches)
> > > >
> > > > Can you suggest a syntax for TRACE_EVENT, DECLARE_EVENT_CLASS, etc,
> > > > that could express the two versions and produce the right code?
> > > >
> > > > I'm worried about adding even further complexity to the TRACE_EVENT
> > > > macros. I could add TRACE_EVENT_SMALL that takes two versions of
> > > > TP_STRUCT__entry, TP_fast_assign, and TP_printk each, but then this
> > > > will need to be permuted with your TP_CONDITIONAL patches as well.
> > >
> > > I would not touch the TRACE_EVENT() structures. They are there as is and
> > > I would not think about changing them. Something like that would never
> > > make it into mainline.
> > >
> > > Now what you can do, is to make your own events based off of the same
> > > tracepoints. For example, the TRACE_EVENT(sched_switch...) has in
> > > sched.c:
> > >
> > > trace_sched_switch(prev, next);
> > >
> > >
> > > You could even write a module that does something like this:
> > >
> > > register_trace_sched_switch(probe_sched_switch, mydata);
> > >
> > >
> > >
> > > void probe_sched_switch(void *mydata,
> > > struct task_struct *prev,
> > > struct task_struct *next)
> > > {
> > > struct ring_buffer *buffer;
> > > struct ring_buffer_event *event;
> > > struct myentry *entry;
> > >
> > > event = trace_current_buffer_lock_reserve(buffer,
> > > mytype, sizeof(*entry),
> > > 0, 0);
> > >
> > > if (!event)
> > > return;
> > >
> > > entry = ring_buffer_event_data(event);
> > >
> > > entry->myfield = prev->x;
> > > ...
> > >
> > > trace_nowake_buffer_unlock_commit(buffer, event,
> > > 0, 0);
> > > }
> > >
> > > You will need to do a register_ftrace_event() to register that 'mytype'
> > > and how to output it. Otherwise it would just be ignored in the "trace"
> > > file.
> > >
> > > All of the above would work fine as a loadable module that you could
> > > easily maintain out of tree, and still uses the internals of the system.
> > >
> > > -- Steve
> > >
> > >
> >
> >
> > But this would improve only google's tracing while this is a general
> > mainline tracing problem.
> >
> > The first thing is that we need to get rid of the lock_depth field, the bkl
> > is dying.
>
> Yeah that needs to go :-)
>
> >
> > For the rest what about having a bitmap of the fields we want to ignore,
> > which can be setup from a trace file for ftrace and as an ioctl for perf.
> >
> > So this bitmap is easy to implement on the common fields.
> >
> > For the rest, one could choose between using TP_fast_assign()
> > and TP_cond_assign().
> >
> > TP_fast_assign() stays as is and doesn't implement bitmap field
> > ignoring. Those who want conditional record will need
> > TP_cond_assign().
> > Well, unfortunately this probably requires us to play
> > the same trickery than SYSCALL_DEFINE() in that we'll probably
> > need TP_cond_assign1(), TP_cond_assign2(), TP_cond_assign3(), etc...
> >
> > #define TP_cond_assignx(nr, assign) \
> > if (call->bitmask & nr) { \
> > assign
> > }
> >
> > #define TP_cond_assign2(nr, assign, ...) \
> > TP_cond_assignx(nr, assign) \
> > TP_cond_assign1(nr + 1, __VA_ARGS__)
> >
> > #define TP_cond_assign3(nr, assign, ...) \
> > TP_cond_assignx(nr, assign) \
> > TP_cond_assign2(nr + 1, __VA_ARGS__)
> >
> > That will also require a bit more trickery to dynamically
> > pre-compute the size of the trace entry.
>
> Mathieu is working on encapsulating the assignments in their own macros.
>
> Instead of doing:
>
> __entry->foo = bar;
>
> We will have:
>
> tp_assign(foo, bar);
>
> This way we could probably use this to dynamically figure out what to
> assign.
>
> -- Steve
>
>


Yep, it should also work that way.

2010-12-09 16:16:44

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH 13/15] small_traces: Add config option to shrink trace events.

* Frederic Weisbecker ([email protected]) wrote:
> On Thu, Dec 09, 2010 at 10:08:17AM -0500, Steven Rostedt wrote:
> > On Thu, 2010-12-09 at 15:55 +0100, Frederic Weisbecker wrote:
> > > On Fri, Dec 03, 2010 at 09:54:12PM -0500, Steven Rostedt wrote:
> > > > On Fri, 2010-12-03 at 18:33 -0800, David Sharp wrote:
> > > > > I considered that, and I generally thing it's a good idea. However, I
> > > > > also want to use this switch to shrink individual tracepoint event
> > > > > structures.
> > > > >
> > > > > e.g.: sched switch is a high-frequency event and it is 68 bytes (60
> > > > > after these patches)
> > > > >
> > > > > Can you suggest a syntax for TRACE_EVENT, DECLARE_EVENT_CLASS, etc,
> > > > > that could express the two versions and produce the right code?
> > > > >
> > > > > I'm worried about adding even further complexity to the TRACE_EVENT
> > > > > macros. I could add TRACE_EVENT_SMALL that takes two versions of
> > > > > TP_STRUCT__entry, TP_fast_assign, and TP_printk each, but then this
> > > > > will need to be permuted with your TP_CONDITIONAL patches as well.
> > > >
> > > > I would not touch the TRACE_EVENT() structures. They are there as is and
> > > > I would not think about changing them. Something like that would never
> > > > make it into mainline.
> > > >
> > > > Now what you can do, is to make your own events based off of the same
> > > > tracepoints. For example, the TRACE_EVENT(sched_switch...) has in
> > > > sched.c:
> > > >
> > > > trace_sched_switch(prev, next);
> > > >
> > > >
> > > > You could even write a module that does something like this:
> > > >
> > > > register_trace_sched_switch(probe_sched_switch, mydata);
> > > >
> > > >
> > > >
> > > > void probe_sched_switch(void *mydata,
> > > > struct task_struct *prev,
> > > > struct task_struct *next)
> > > > {
> > > > struct ring_buffer *buffer;
> > > > struct ring_buffer_event *event;
> > > > struct myentry *entry;
> > > >
> > > > event = trace_current_buffer_lock_reserve(buffer,
> > > > mytype, sizeof(*entry),
> > > > 0, 0);
> > > >
> > > > if (!event)
> > > > return;
> > > >
> > > > entry = ring_buffer_event_data(event);
> > > >
> > > > entry->myfield = prev->x;
> > > > ...
> > > >
> > > > trace_nowake_buffer_unlock_commit(buffer, event,
> > > > 0, 0);
> > > > }
> > > >
> > > > You will need to do a register_ftrace_event() to register that 'mytype'
> > > > and how to output it. Otherwise it would just be ignored in the "trace"
> > > > file.
> > > >
> > > > All of the above would work fine as a loadable module that you could
> > > > easily maintain out of tree, and still uses the internals of the system.
> > > >
> > > > -- Steve
> > > >
> > > >
> > >
> > >
> > > But this would improve only google's tracing while this is a general
> > > mainline tracing problem.
> > >
> > > The first thing is that we need to get rid of the lock_depth field, the bkl
> > > is dying.
> >
> > Yeah that needs to go :-)
> >
> > >
> > > For the rest what about having a bitmap of the fields we want to ignore,
> > > which can be setup from a trace file for ftrace and as an ioctl for perf.
> > >
> > > So this bitmap is easy to implement on the common fields.
> > >
> > > For the rest, one could choose between using TP_fast_assign()
> > > and TP_cond_assign().
> > >
> > > TP_fast_assign() stays as is and doesn't implement bitmap field
> > > ignoring. Those who want conditional record will need
> > > TP_cond_assign().
> > > Well, unfortunately this probably requires us to play
> > > the same trickery as SYSCALL_DEFINE() in that we'll probably
> > > need TP_cond_assign1(), TP_cond_assign2(), TP_cond_assign3(), etc...
> > >
> > > #define TP_cond_assignx(nr, assign) \
> > > if (call->bitmask & nr) { \
> > > assign
> > > }
> > >
> > > #define TP_cond_assign2(nr, assign, ...) \
> > > TP_cond_assignx(nr, assign) \
> > > TP_cond_assign1(nr + 1, __VA_ARGS__)
> > >
> > > #define TP_cond_assign3(nr, assign, ...) \
> > > TP_cond_assignx(nr, assign) \
> > > TP_cond_assign2(nr + 1, __VA_ARGS__)
> > >
> > > That will also require a bit more trickery to dynamically
> > > pre-compute the size of the trace entry.
> >
> > Mathieu is working on encapsulating the assignments in their own macros.
> >
> > Instead of doing:
> >
> > __entry->foo = bar;
> >
> > We will have:
> >
> > tp_assign(foo, bar);
> >
> > This way we could probably use this to dynamically figure out what to
> > assign.
> >
> > -- Steve
> >
> >
>
> Yep, it should also work that way.

In an incredible exercise of good timing, I'm planning to post them just now :)

Comments will be very welcome,

Thanks!

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-12-14 00:40:14

by David Sharp

[permalink] [raw]
Subject: Re: [PATCH] tracing: Add an 'overwrite' trace_option.

Steve, any feedback on this patch?

On Wed, Dec 8, 2010 at 1:46 PM, David Sharp <[email protected]> wrote:
> Add an "overwrite" trace_option for ftrace to control whether the buffer should
> be overwritten on overflow or not. The default remains to overwrite old events
> when the buffer is full. This patch adds the option to instead discard newest
> events when the buffer is full. This is useful to get a snapshot of traces just
> after enabling traces. Dropping the current event is also a simpler code path.
> ---
>  Documentation/trace/ftrace.txt |    5 +++++
>  include/linux/ring_buffer.h    |    2 ++
>  kernel/trace/ring_buffer.c     |   11 +++++++++++
>  kernel/trace/trace.c           |   17 +++++++++++------
>  kernel/trace/trace.h           |    1 +
>  5 files changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
> index 557c1ed..2842ea7 100644
> --- a/Documentation/trace/ftrace.txt
> +++ b/Documentation/trace/ftrace.txt
> @@ -491,6 +491,11 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
>                    latencies, as described in "Latency
>                    trace format".
>
> +  overwrite - This controls what happens when the trace buffer is
> +              full. If "1" (default), the oldest events are
> +              discarded and overwritten. If "0", then the newest
> +              events are discarded.
> +
>  sched_switch
>  ------------
>
> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> index 8d3a248..ab38ac8 100644
> --- a/include/linux/ring_buffer.h
> +++ b/include/linux/ring_buffer.h
> @@ -100,6 +100,8 @@ void ring_buffer_free(struct ring_buffer *buffer);
>
>  int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
>
> +void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
> +
>  struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer,
>                                                   unsigned long length);
>  int ring_buffer_unlock_commit(struct ring_buffer *buffer,
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 9ed509a..3207147 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1429,6 +1429,17 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
>  }
>  EXPORT_SYMBOL_GPL(ring_buffer_resize);
>
> +void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val)
> +{
> +       mutex_lock(&buffer->mutex);
> +       if (val)
> +               buffer->flags |= RB_FL_OVERWRITE;
> +       else
> +               buffer->flags &= ~RB_FL_OVERWRITE;
> +       mutex_unlock(&buffer->mutex);
> +}
> +EXPORT_SYMBOL_GPL(ring_buffer_change_overwrite);
> +
>  static inline void *
>  __rb_data_page_index(struct buffer_data_page *bpage, unsigned index)
>  {
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index c380612..6f7494c 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -41,8 +41,6 @@
>  #include "trace.h"
>  #include "trace_output.h"
>
> -#define TRACE_BUFFER_FLAGS     (RB_FL_OVERWRITE)
> -
>  /*
>  * On boot up, the ring buffer is set to the minimum size, so that
>  * we do not waste memory on systems that are not using tracing.
> @@ -340,7 +338,7 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
>  /* trace_flags holds trace_options default values */
>  unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
>        TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO | TRACE_ITER_SLEEP_TIME |
> -       TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD;
> +       TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD | TRACE_ITER_OVERWRITE;
>
>  static int trace_stop_count;
>  static DEFINE_SPINLOCK(tracing_start_lock);
> @@ -425,6 +423,7 @@ static const char *trace_options[] = {
>        "sleep-time",
>        "graph-time",
>        "record-cmd",
> +       "overwrite",
>        NULL
>  };
>
> @@ -2523,6 +2522,9 @@ static void set_tracer_flags(unsigned int mask, int enabled)
>
>        if (mask == TRACE_ITER_RECORD_CMD)
>                trace_event_enable_cmd_record(enabled);
> +
> +       if (mask == TRACE_ITER_OVERWRITE)
> +               ring_buffer_change_overwrite(global_trace.buffer, enabled);
>  }
>
>  static ssize_t
> @@ -4545,9 +4547,11 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
>  __init static int tracer_alloc_buffers(void)
>  {
>        int ring_buf_size;
> +       enum ring_buffer_flags rb_flags;
>        int i;
>        int ret = -ENOMEM;
>
> +
>        if (!alloc_cpumask_var(&tracing_buffer_mask, GFP_KERNEL))
>                goto out;
>
> @@ -4560,12 +4564,13 @@ __init static int tracer_alloc_buffers(void)
>        else
>                ring_buf_size = 1;
>
> +       rb_flags = trace_flags & TRACE_ITER_OVERWRITE ? RB_FL_OVERWRITE : 0;
> +
>        cpumask_copy(tracing_buffer_mask, cpu_possible_mask);
>        cpumask_copy(tracing_cpumask, cpu_all_mask);
>
>        /* TODO: make the number of buffers hot pluggable with CPUS */
> -       global_trace.buffer = ring_buffer_alloc(ring_buf_size,
> -                                                  TRACE_BUFFER_FLAGS);
> +       global_trace.buffer = ring_buffer_alloc(ring_buf_size, rb_flags);
>        if (!global_trace.buffer) {
>                printk(KERN_ERR "tracer: failed to allocate ring buffer!\n");
>                WARN_ON(1);
> @@ -4575,7 +4580,7 @@ __init static int tracer_alloc_buffers(void)
>
>
>  #ifdef CONFIG_TRACER_MAX_TRACE
> -       max_tr.buffer = ring_buffer_alloc(1, TRACE_BUFFER_FLAGS);
> +       max_tr.buffer = ring_buffer_alloc(1, rb_flags);
>        if (!max_tr.buffer) {
>                printk(KERN_ERR "tracer: failed to allocate max ring buffer!\n");
>                WARN_ON(1);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 9021f8c..9cd025a 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -606,6 +606,7 @@ enum trace_iterator_flags {
>        TRACE_ITER_SLEEP_TIME           = 0x40000,
>        TRACE_ITER_GRAPH_TIME           = 0x80000,
>        TRACE_ITER_RECORD_CMD           = 0x100000,
> +       TRACE_ITER_OVERWRITE            = 0x200000,
>  };
>
>  /*
> --
> 1.7.3.1
>
>

2011-03-08 23:30:08

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 04/15] ftrace: pack event structures.

David (Miller),

I'm going back through this patch set and looking to incorporate it into
2.6.39. But I know how touchy sparc is with "packed" data structures. Do
you see this patch as hurting sparc?

I wonder if we should change this to something like:

#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
# define do_event_packed __attribute__((packed))
#else
# define do_event_packed
#endif

and use "do_event_packed" instead?


On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> Ftrace event structures have a 12-byte struct trace_entry at the beginning.
> If the structure is aligned, this means that if the first field is 64-bits,
> there will be 4 bytes of padding. Ironically, due to the 4-byte ringbuffer
> header, this will make 64-bit writes unaligned, if the ring buffer position
> is currently 64-bit aligned:
> 4(rb)+12(ftrace)+4(pad) = 20; 20%8 = 4

Note, on 64bit archs without CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS, the
ring buffer forces the 8 byte header, to keep the data portion of the
event load 8byte aligned.

-- Steve

>
> Adding __attribute__((packed)) to the event structures removes the extra
> space from the trace events, and actually improves alignment of trace
> events with a first field that is 64-bits.
>
> About 65 tracepoints have a 4-byte pad at offset 12:
> # find events -name format | xargs -n1 awk '
> $1=="name:" {name=$2}
> $1=="format:"{FS="\t"}
> $3=="offset:12;" && $4=="size:4;"{okay=1}
> $3=="offset:16;" && !okay {print name}' | wc -l
> 65
>
> With all 'syscalls' and 'timer' events enabled, this results in a 5%
> improvement in a simple 512MB read benchmark with warm caches.
>
> Tested:
>
> setup:
> # echo 1 >events/syscalls/enable
> # echo 1 >events/timer/enable
> # echo 0 > tracing_enabled
> off:
> # for n in $(seq 10) ; do \
> time dd if=/dev/hda3 of=/dev/null bs=1K count=512K ; \
> done
> on:
> # for n in $(seq 10) ; do \
> echo 1 >tracing_enabled; \
> time dd if=/dev/hda3 of=/dev/null bs=1K count=512K ; \
> echo 0 >tracing_enabled; \
> echo > trace; \
> done
>
> real time mean/median/stdev
> w/o patch off: 1.1679/1.164/0.0169
> w/o patch on : 1.9432/1.936/0.0274
> w/ patch off: 1.1715/1.159/0.0431
> w/ patch on : 1.8425/1.836/0.0138
> "on" delta: -0.1007 --> -5.2%
>
> Google-Bug-Id: 2895627
>
> Signed-off-by: David Sharp <[email protected]>
> ---
> include/trace/ftrace.h | 5 +++--
> kernel/trace/trace.h | 2 +-
> 2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index a9377c0..51d1f52 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -48,7 +48,8 @@
> #define __array(type, item, len) type item[len];
>
> #undef __dynamic_array
> -#define __dynamic_array(type, item, len) u32 __data_loc_##item;
> +#define __dynamic_array(type, item, len) \
> + u32 __data_loc_##item __attribute__((aligned(4)));
>
> #undef __string
> #define __string(item, src) __dynamic_array(char, item, -1)
> @@ -62,7 +63,7 @@
> struct trace_entry ent; \
> tstruct \
> char __data[0]; \
> - }; \
> + } __attribute__((packed)); \
> \
> static struct ftrace_event_class event_class_##name;
>
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 9021f8c..2e80433 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -60,7 +60,7 @@ enum trace_type {
> struct struct_name { \
> struct trace_entry ent; \
> tstruct \
> - }
> + } __attribute__((packed))
>
> #undef TP_ARGS
> #define TP_ARGS(args...) args

2011-03-08 23:55:45

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall

Hi Avi,

This patch set got lost in the shuffle, and I'm not looking to include
it.

On Wed, 2010-12-08 at 11:18 +0200, Avi Kivity wrote:
> On 12/07/2010 11:16 PM, David Sharp wrote:
> > >
> > > I don't understand this. Can you elaborate? What does "32-bit addressable"
> > > mean?
> >
> > The ring buffer gives you space that is a multiple of 4 bytes in
> > length, and 32-bit aligned. Therefore it is useless to attempt to
> > align the structure beyond 32-bit boundaries, e.g., a 64-bit boundary,
> > because it is unpredictable if the memory the structure will be
> > written to is at a 64-bit boundary (addr % 8 could be 0 or 4).
> >
> > > And "predicated on packing the event structures"? Is the structure
> > > __attribute__((packed)), or is it not?
> >
> > It is not packed in Linus' tree, but one of the patches before this
> > patch in this patch series adds __attribute__((packed)). This patch
> > assumes that the event packing patch has been applied. This patch
> > should not be applied if the packing patch is not (hence,
> > "predicated").
>
> Thanks for the explanations, it makes sense now.
>

Does this mean I can add your "Acked-by"?

-- Steve

2011-03-09 00:03:28

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 10/15] ftrace: fix event alignment: jbd2:*

Now I know why I was so late at adding this. I was waiting for
Acked-bys ;)

On Fri, 2010-12-03 at 20:52 -0500, Steven Rostedt wrote:
> [ Ccing Ted for Acked-by ]

Ted, Ping?

-- Steve

>
> -- Steve
>
>
> On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> > Signed-off-by: David Sharp <[email protected]>
> > ---
> > include/trace/events/jbd2.h | 8 ++++----
> > 1 files changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
> > index 7447ea9..2f9f362 100644
> > --- a/include/trace/events/jbd2.h
> > +++ b/include/trace/events/jbd2.h
> > @@ -41,15 +41,15 @@ DECLARE_EVENT_CLASS(jbd2_commit,
> > TP_STRUCT__entry(
> > __field( int, dev_major )
> > __field( int, dev_minor )
> > - __field( char, sync_commit )
> > __field( int, transaction )
> > + __field( char, sync_commit )
> > ),
> >
> > TP_fast_assign(
> > __entry->dev_major = MAJOR(journal->j_fs_dev->bd_dev);
> > __entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev);
> > - __entry->sync_commit = commit_transaction->t_synchronous_commit;
> > __entry->transaction = commit_transaction->t_tid;
> > + __entry->sync_commit = commit_transaction->t_synchronous_commit;
> > ),
> >
> > TP_printk("dev %d,%d transaction %d sync %d",
> > @@ -93,17 +93,17 @@ TRACE_EVENT(jbd2_end_commit,
> > TP_STRUCT__entry(
> > __field( int, dev_major )
> > __field( int, dev_minor )
> > - __field( char, sync_commit )
> > __field( int, transaction )
> > __field( int, head )
> > + __field( char, sync_commit )
> > ),
> >
> > TP_fast_assign(
> > __entry->dev_major = MAJOR(journal->j_fs_dev->bd_dev);
> > __entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev);
> > - __entry->sync_commit = commit_transaction->t_synchronous_commit;
> > __entry->transaction = commit_transaction->t_tid;
> > __entry->head = journal->j_tail_sequence;
> > + __entry->sync_commit = commit_transaction->t_synchronous_commit;
> > ),
> >
> > TP_printk("dev %d,%d transaction %d sync %d head %d",
>

2011-03-09 00:04:51

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 11/15] ftrace: fix event alignment: ext4:*

On Fri, 2010-12-03 at 20:53 -0500, Steven Rostedt wrote:
> [ Ccing Ted for Acked-by ]

Ted, Ping on this too.

-- Steve

>
> -- Steve
>
>
> On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> > Signed-off-by: David Sharp <[email protected]>
> > ---
> > include/trace/events/ext4.h | 70 +++++++++++++++++++++---------------------
> > 1 files changed, 35 insertions(+), 35 deletions(-)
> >
> > diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> > index e5e345f..9887cdb 100644
> > --- a/include/trace/events/ext4.h
> > +++ b/include/trace/events/ext4.h
> > @@ -23,21 +23,21 @@ TRACE_EVENT(ext4_free_inode,
> > TP_STRUCT__entry(
> > __field( int, dev_major )
> > __field( int, dev_minor )
> > + __field( blkcnt_t, blocks )
> > __field( ino_t, ino )
> > - __field( umode_t, mode )
> > __field( uid_t, uid )
> > __field( gid_t, gid )
> > - __field( blkcnt_t, blocks )
> > + __field( umode_t, mode )
> > ),
> >
> > TP_fast_assign(
> > __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> > __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> > + __entry->blocks = inode->i_blocks;
> > __entry->ino = inode->i_ino;
> > - __entry->mode = inode->i_mode;
> > __entry->uid = inode->i_uid;
> > __entry->gid = inode->i_gid;
> > - __entry->blocks = inode->i_blocks;
> > + __entry->mode = inode->i_mode;
> > ),
> >
> > TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u blocks %llu",
> > @@ -342,10 +342,10 @@ TRACE_EVENT(ext4_da_writepages,
> > __field( long, pages_skipped )
> > __field( loff_t, range_start )
> > __field( loff_t, range_end )
> > + __field( pgoff_t, writeback_index )
> > __field( char, for_kupdate )
> > __field( char, for_reclaim )
> > __field( char, range_cyclic )
> > - __field( pgoff_t, writeback_index )
> > ),
> >
> > TP_fast_assign(
> > @@ -356,10 +356,10 @@ TRACE_EVENT(ext4_da_writepages,
> > __entry->pages_skipped = wbc->pages_skipped;
> > __entry->range_start = wbc->range_start;
> > __entry->range_end = wbc->range_end;
> > + __entry->writeback_index = inode->i_mapping->writeback_index;
> > __entry->for_kupdate = wbc->for_kupdate;
> > __entry->for_reclaim = wbc->for_reclaim;
> > __entry->range_cyclic = wbc->range_cyclic;
> > - __entry->writeback_index = inode->i_mapping->writeback_index;
> > ),
> >
> > TP_printk("dev %d,%d ino %lu nr_to_write %ld pages_skipped %ld "
> > @@ -425,8 +425,8 @@ TRACE_EVENT(ext4_da_writepages_result,
> > __field( int, ret )
> > __field( int, pages_written )
> > __field( long, pages_skipped )
> > - __field( char, more_io )
> > __field( pgoff_t, writeback_index )
> > + __field( char, more_io )
> > ),
> >
> > TP_fast_assign(
> > @@ -436,8 +436,8 @@ TRACE_EVENT(ext4_da_writepages_result,
> > __entry->ret = ret;
> > __entry->pages_written = pages_written;
> > __entry->pages_skipped = wbc->pages_skipped;
> > - __entry->more_io = wbc->more_io;
> > __entry->writeback_index = inode->i_mapping->writeback_index;
> > + __entry->more_io = wbc->more_io;
> > ),
> >
> > TP_printk("dev %d,%d ino %lu ret %d pages_written %d pages_skipped %ld more_io %d writeback_index %lu",
> > @@ -485,8 +485,8 @@ DECLARE_EVENT_CLASS(ext4__mb_new_pa,
> > __field( int, dev_minor )
> > __field( ino_t, ino )
> > __field( __u64, pa_pstart )
> > - __field( __u32, pa_len )
> > __field( __u64, pa_lstart )
> > + __field( __u32, pa_len )
> >
> > ),
> >
> > @@ -495,8 +495,8 @@ DECLARE_EVENT_CLASS(ext4__mb_new_pa,
> > __entry->dev_minor = MINOR(ac->ac_sb->s_dev);
> > __entry->ino = ac->ac_inode->i_ino;
> > __entry->pa_pstart = pa->pa_pstart;
> > - __entry->pa_len = pa->pa_len;
> > __entry->pa_lstart = pa->pa_lstart;
> > + __entry->pa_len = pa->pa_len;
> > ),
> >
> > TP_printk("dev %d,%d ino %lu pstart %llu len %u lstart %llu",
> > @@ -724,20 +724,20 @@ TRACE_EVENT(ext4_free_blocks,
> > __field( int, dev_major )
> > __field( int, dev_minor )
> > __field( ino_t, ino )
> > - __field( umode_t, mode )
> > __field( __u64, block )
> > __field( unsigned long, count )
> > __field( int, flags )
> > + __field( umode_t, mode )
> > ),
> >
> > TP_fast_assign(
> > __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> > __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> > __entry->ino = inode->i_ino;
> > - __entry->mode = inode->i_mode;
> > __entry->block = block;
> > __entry->count = count;
> > __entry->flags = flags;
> > + __entry->mode = inode->i_mode;
> > ),
> >
> > TP_printk("dev %d,%d ino %lu mode 0%o block %llu count %lu flags %d",
> > @@ -766,8 +766,8 @@ TRACE_EVENT(ext4_sync_file,
> > __entry->dev_major = MAJOR(dentry->d_inode->i_sb->s_dev);
> > __entry->dev_minor = MINOR(dentry->d_inode->i_sb->s_dev);
> > __entry->ino = dentry->d_inode->i_ino;
> > - __entry->datasync = datasync;
> > __entry->parent = dentry->d_parent->d_inode->i_ino;
> > + __entry->datasync = datasync;
> > ),
> >
> > TP_printk("dev %d,%d ino %ld parent %ld datasync %d ",
> > @@ -834,12 +834,6 @@ TRACE_EVENT(ext4_mballoc_alloc,
> > __field( int, dev_major )
> > __field( int, dev_minor )
> > __field( ino_t, ino )
> > - __field( __u16, found )
> > - __field( __u16, groups )
> > - __field( __u16, buddy )
> > - __field( __u16, flags )
> > - __field( __u16, tail )
> > - __field( __u8, cr )
> > __field( __u32, orig_logical )
> > __field( int, orig_start )
> > __field( __u32, orig_group )
> > @@ -852,18 +846,18 @@ TRACE_EVENT(ext4_mballoc_alloc,
> > __field( int, result_start )
> > __field( __u32, result_group )
> > __field( int, result_len )
> > + __field( __u16, found )
> > + __field( __u16, groups )
> > + __field( __u16, buddy )
> > + __field( __u16, flags )
> > + __field( __u16, tail )
> > + __field( __u8, cr )
> > ),
> >
> > TP_fast_assign(
> > __entry->dev_major = MAJOR(ac->ac_inode->i_sb->s_dev);
> > __entry->dev_minor = MINOR(ac->ac_inode->i_sb->s_dev);
> > __entry->ino = ac->ac_inode->i_ino;
> > - __entry->found = ac->ac_found;
> > - __entry->flags = ac->ac_flags;
> > - __entry->groups = ac->ac_groups_scanned;
> > - __entry->buddy = ac->ac_buddy;
> > - __entry->tail = ac->ac_tail;
> > - __entry->cr = ac->ac_criteria;
> > __entry->orig_logical = ac->ac_o_ex.fe_logical;
> > __entry->orig_start = ac->ac_o_ex.fe_start;
> > __entry->orig_group = ac->ac_o_ex.fe_group;
> > @@ -876,6 +870,12 @@ TRACE_EVENT(ext4_mballoc_alloc,
> > __entry->result_start = ac->ac_f_ex.fe_start;
> > __entry->result_group = ac->ac_f_ex.fe_group;
> > __entry->result_len = ac->ac_f_ex.fe_len;
> > + __entry->found = ac->ac_found;
> > + __entry->groups = ac->ac_groups_scanned;
> > + __entry->buddy = ac->ac_buddy;
> > + __entry->flags = ac->ac_flags;
> > + __entry->tail = ac->ac_tail;
> > + __entry->cr = ac->ac_criteria;
> > ),
> >
> > TP_printk("dev %d,%d inode %lu orig %u/%d/%u@%u goal %u/%d/%u@%u "
> > @@ -1001,18 +1001,18 @@ TRACE_EVENT(ext4_forget,
> > __field( int, dev_major )
> > __field( int, dev_minor )
> > __field( ino_t, ino )
> > - __field( umode_t, mode )
> > - __field( int, is_metadata )
> > __field( __u64, block )
> > + __field( int, is_metadata )
> > + __field( umode_t, mode )
> > ),
> >
> > TP_fast_assign(
> > __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> > __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> > __entry->ino = inode->i_ino;
> > - __entry->mode = inode->i_mode;
> > - __entry->is_metadata = is_metadata;
> > __entry->block = block;
> > + __entry->is_metadata = is_metadata;
> > + __entry->mode = inode->i_mode;
> > ),
> >
> > TP_printk("dev %d,%d ino %lu mode 0%o is_metadata %d block %llu",
> > @@ -1030,24 +1030,24 @@ TRACE_EVENT(ext4_da_update_reserve_space,
> > __field( int, dev_major )
> > __field( int, dev_minor )
> > __field( ino_t, ino )
> > - __field( umode_t, mode )
> > __field( __u64, i_blocks )
> > __field( int, used_blocks )
> > __field( int, reserved_data_blocks )
> > __field( int, reserved_meta_blocks )
> > __field( int, allocated_meta_blocks )
> > + __field( umode_t, mode )
> > ),
> >
> > TP_fast_assign(
> > __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> > __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> > __entry->ino = inode->i_ino;
> > - __entry->mode = inode->i_mode;
> > __entry->i_blocks = inode->i_blocks;
> > __entry->used_blocks = used_blocks;
> > __entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
> > __entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks;
> > __entry->allocated_meta_blocks = EXT4_I(inode)->i_allocated_meta_blocks;
> > + __entry->mode = inode->i_mode;
> > ),
> >
> > TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu used_blocks %d reserved_data_blocks %d reserved_meta_blocks %d allocated_meta_blocks %d",
> > @@ -1067,22 +1067,22 @@ TRACE_EVENT(ext4_da_reserve_space,
> > __field( int, dev_major )
> > __field( int, dev_minor )
> > __field( ino_t, ino )
> > - __field( umode_t, mode )
> > __field( __u64, i_blocks )
> > __field( int, md_needed )
> > __field( int, reserved_data_blocks )
> > __field( int, reserved_meta_blocks )
> > + __field( umode_t, mode )
> > ),
> >
> > TP_fast_assign(
> > __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> > __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> > __entry->ino = inode->i_ino;
> > - __entry->mode = inode->i_mode;
> > __entry->i_blocks = inode->i_blocks;
> > __entry->md_needed = md_needed;
> > __entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
> > __entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks;
> > + __entry->mode = inode->i_mode;
> > ),
> >
> > TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu md_needed %d reserved_data_blocks %d reserved_meta_blocks %d",
> > @@ -1102,24 +1102,24 @@ TRACE_EVENT(ext4_da_release_space,
> > __field( int, dev_major )
> > __field( int, dev_minor )
> > __field( ino_t, ino )
> > - __field( umode_t, mode )
> > __field( __u64, i_blocks )
> > __field( int, freed_blocks )
> > __field( int, reserved_data_blocks )
> > __field( int, reserved_meta_blocks )
> > __field( int, allocated_meta_blocks )
> > + __field( umode_t, mode )
> > ),
> >
> > TP_fast_assign(
> > __entry->dev_major = MAJOR(inode->i_sb->s_dev);
> > __entry->dev_minor = MINOR(inode->i_sb->s_dev);
> > __entry->ino = inode->i_ino;
> > - __entry->mode = inode->i_mode;
> > __entry->i_blocks = inode->i_blocks;
> > __entry->freed_blocks = freed_blocks;
> > __entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
> > __entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks;
> > __entry->allocated_meta_blocks = EXT4_I(inode)->i_allocated_meta_blocks;
> > + __entry->mode = inode->i_mode;
> > ),
> >
> > TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu freed_blocks %d reserved_data_blocks %d reserved_meta_blocks %d allocated_meta_blocks %d",
>

2011-03-09 00:23:15

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 13/15] small_traces: Add config option to shrink trace events.

On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> This option will remove several fields from trace events to reduce the
> payload size. Under heavy load, events can be produced faster than they
> can possibly be written to disk or sent on the network. Shrinking the
> payload size will enable fitting more traces in the ring buffer, which
> therefore enables tracing for longer periods, at the cost of losing some
> relatively unimportant details.
>

OK, I'm trying to get all your patches in for 2.6.39 except the last 3,
as those may break user interfaces. I would like a way to dynamically
shrink the payload, but we need to figure out how to do that.

For example, currently trace-cmd looks at the event files to know what data
to record. Sure we could force it with a config option, but that is not
a good solution for mainline. What we would want is something more
dynamic. But then tools like trace-cmd, and I suspect perf, will need to
have a way to know how to parse the data. Especially perf, which is user
specific (whereas trace-cmd is global). The change in events needs to
be atomic per user, so one user changing what gets recorded in an event
does not affect another user.
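
[ For context, the per-event files referred to here are the debugfs "format"
files that trace-cmd and perf parse to decode the binary records. The example
below is illustrative only, typical of a kernel from this era; IDs, offsets
and the exact field set vary by kernel version and architecture, and it is
not taken from the thread itself:

	# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format
	name: sched_wakeup
	ID: 58
	format:
		field:unsigned short common_type;	offset:0;	size:2;	signed:0;
		field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
		field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
		field:int common_pid;	offset:4;	size:4;	signed:1;
		field:int common_lock_depth;	offset:8;	size:4;	signed:1;

		field:char comm[16];	offset:12;	size:16;	signed:1;
		field:pid_t pid;	offset:28;	size:4;	signed:1;
		field:int prio;	offset:32;	size:4;	signed:1;
		field:int success;	offset:36;	size:4;	signed:1;
		field:int target_cpu;	offset:40;	size:4;	signed:1;

	print fmt: "comm=%s pid=%d prio=%d success=%d target_cpu=%03d", REC->comm, REC->pid, REC->prio, REC->success, REC->target_cpu

Any scheme that drops fields dynamically would have to keep this description
in sync with what is actually recorded, which is the parsing problem described
above. ]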

Perhaps in the future we can figure something out. But for now, patches
1-12 look good (with the exception of patch 3).

-- Steve


> Google-Bug-Id: 3224547
>
> Signed-off-by: David Sharp <[email protected]>
> ---
> kernel/trace/Kconfig | 12 ++++++++++++
> 1 files changed, 12 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index ea37e2f..9c24fe5 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -403,6 +403,18 @@ config FUNCTION_PROFILER
>
> If in doubt, say N.
>
> +config SMALL_TRACES
> + bool "Shrink trace events"
> + default n
> + help
> + Minimize the size of trace events. Omits fields from trace event
> + header like preempt_count, and lock_depth. Chooses smaller
> + alternative event structures. Shrinking the payload size enables
> + fitting more traces in the ring buffer, which enables tracing for
> + longer periods when the buffer can not be read as fast as it is
> + filled, at the cost of losing some details.
> +
> +
> config FTRACE_MCOUNT_RECORD
> def_bool y
> depends on DYNAMIC_FTRACE

2011-03-09 00:31:50

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 10/15] ftrace: fix event alignment: jbd2:*

On Tue, Mar 08, 2011 at 07:03:25PM -0500, Steven Rostedt wrote:
> Now I know why I was so late at adding this. I was waiting for
> Acked-bys ;)
>
> On Fri, 2010-12-03 at 20:52 -0500, Steven Rostedt wrote:
> > [ Ccing Ted for Acked-by ]
>
> Ted, Ping?

A month or two ago I did an analysis using the pahole binary (part of
the 'dwarves' package in Debian/Ubuntu) and noticed many more
structures that were flagged as having 'holes' where there were
optimization opportunities.

In addition, (in an interesting example of synchronicity) just this
afternoon, I had talked to Jiaying about a further optimization of the
ext4 trace header files, since she pointed out to me that the 'perf'
binary can actually handle the MAJOR() and MINOR() macros. So we can
replace 8 bytes of "int major, minor", with a 4 byte dev_t, and save
another 4 bytes (assuming we can move the structure elements around so
we can save a full 8 bytes to avoid structure padding).
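
[ A rough sketch of the dev_t approach Ted describes, modeled on the
ext4_free_inode event quoted earlier in the thread; this is not Jiaying's
actual patch, and the field order and print format are illustrative:

	TRACE_EVENT(ext4_free_inode,

		TP_PROTO(struct inode *inode),

		TP_ARGS(inode),

		TP_STRUCT__entry(
			__field(	dev_t,	dev	)	/* 4 bytes instead of two ints */
			__field(	ino_t,	ino	)
			__field(	uid_t,	uid	)
			__field(	gid_t,	gid	)
			__field(	blkcnt_t,	blocks	)
			__field(	umode_t,	mode	)
		),

		TP_fast_assign(
			__entry->dev	= inode->i_sb->s_dev;	/* major+minor kept packed */
			__entry->ino	= inode->i_ino;
			__entry->uid	= inode->i_uid;
			__entry->gid	= inode->i_gid;
			__entry->blocks	= inode->i_blocks;
			__entry->mode	= inode->i_mode;
		),

		/* perf and trace-cmd evaluate MAJOR()/MINOR() from the print format */
		TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u blocks %llu",
			  MAJOR(__entry->dev), MINOR(__entry->dev),
			  (unsigned long) __entry->ino, __entry->mode,
			  __entry->uid, __entry->gid,
			  (unsigned long long) __entry->blocks)
	);
]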

Jiaying was also going to send some patches to add some additional
tracepoints to the upstream kernel.

So probably the best thing to do at this point is to drop the ext4 and
jbd2 patches, since Jiaying is about to send me patches that handle
these optimizations plus some further ones --- and add some additional
tracepoints. Getting one patch from her that I'll feed in via the
ext4 tree probably will be much simpler than coordinating patches to
the same file from two different trees.

- Ted

2011-03-09 00:46:14

by David Sharp

[permalink] [raw]
Subject: Re: [PATCH] tracing: Add an 'overwrite' trace_option.

On Wed, Dec 8, 2010 at 1:46 PM, David Sharp <[email protected]> wrote:
> Add an "overwrite" trace_option for ftrace to control whether the buffer should
> be overwritten on overflow or not. The default remains to overwrite old events
> when the buffer is full. This patch adds the option to instead discard newest
> events when the buffer is full. This is useful to get a snapshot of traces just
> after enabling traces. Dropping the current event is also a simpler code path.

Signed-off-by: David Sharp <[email protected]>

> ---
>  Documentation/trace/ftrace.txt |    5 +++++
>  include/linux/ring_buffer.h    |    2 ++
>  kernel/trace/ring_buffer.c     |   11 +++++++++++
>  kernel/trace/trace.c           |   17 +++++++++++------
>  kernel/trace/trace.h           |    1 +
>  5 files changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
> index 557c1ed..2842ea7 100644
> --- a/Documentation/trace/ftrace.txt
> +++ b/Documentation/trace/ftrace.txt
> @@ -491,6 +491,11 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
>                    latencies, as described in "Latency
>                    trace format".
>
> +  overwrite - This controls what happens when the trace buffer is
> +              full. If "1" (default), the oldest events are
> +              discarded and overwritten. If "0", then the newest
> +              events are discarded.
> +
>  sched_switch
>  ------------
>
> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> index 8d3a248..ab38ac8 100644
> --- a/include/linux/ring_buffer.h
> +++ b/include/linux/ring_buffer.h
> @@ -100,6 +100,8 @@ void ring_buffer_free(struct ring_buffer *buffer);
>
>  int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
>
> +void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
> +
>  struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer,
>                                                   unsigned long length);
>  int ring_buffer_unlock_commit(struct ring_buffer *buffer,
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 9ed509a..3207147 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1429,6 +1429,17 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
>  }
>  EXPORT_SYMBOL_GPL(ring_buffer_resize);
>
> +void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val)
> +{
> +       mutex_lock(&buffer->mutex);
> +       if (val)
> +               buffer->flags |= RB_FL_OVERWRITE;
> +       else
> +               buffer->flags &= ~RB_FL_OVERWRITE;
> +       mutex_unlock(&buffer->mutex);
> +}
> +EXPORT_SYMBOL_GPL(ring_buffer_change_overwrite);
> +
>  static inline void *
>  __rb_data_page_index(struct buffer_data_page *bpage, unsigned index)
>  {
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index c380612..6f7494c 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -41,8 +41,6 @@
>  #include "trace.h"
>  #include "trace_output.h"
>
> -#define TRACE_BUFFER_FLAGS     (RB_FL_OVERWRITE)
> -
>  /*
>  * On boot up, the ring buffer is set to the minimum size, so that
>  * we do not waste memory on systems that are not using tracing.
> @@ -340,7 +338,7 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
>  /* trace_flags holds trace_options default values */
>  unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
>        TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO | TRACE_ITER_SLEEP_TIME |
> -       TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD;
> +       TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD | TRACE_ITER_OVERWRITE;
>
>  static int trace_stop_count;
>  static DEFINE_SPINLOCK(tracing_start_lock);
> @@ -425,6 +423,7 @@ static const char *trace_options[] = {
>        "sleep-time",
>        "graph-time",
>        "record-cmd",
> +       "overwrite",
>        NULL
>  };
>
> @@ -2523,6 +2522,9 @@ static void set_tracer_flags(unsigned int mask, int enabled)
>
>        if (mask == TRACE_ITER_RECORD_CMD)
>                trace_event_enable_cmd_record(enabled);
> +
> +       if (mask == TRACE_ITER_OVERWRITE)
> +               ring_buffer_change_overwrite(global_trace.buffer, enabled);
>  }
>
>  static ssize_t
> @@ -4545,9 +4547,11 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
>  __init static int tracer_alloc_buffers(void)
>  {
>        int ring_buf_size;
> +       enum ring_buffer_flags rb_flags;
>        int i;
>        int ret = -ENOMEM;
>
> +
>        if (!alloc_cpumask_var(&tracing_buffer_mask, GFP_KERNEL))
>                goto out;
>
> @@ -4560,12 +4564,13 @@ __init static int tracer_alloc_buffers(void)
>        else
>                ring_buf_size = 1;
>
> +       rb_flags = trace_flags & TRACE_ITER_OVERWRITE ? RB_FL_OVERWRITE : 0;
> +
>        cpumask_copy(tracing_buffer_mask, cpu_possible_mask);
>        cpumask_copy(tracing_cpumask, cpu_all_mask);
>
>        /* TODO: make the number of buffers hot pluggable with CPUS */
> -       global_trace.buffer = ring_buffer_alloc(ring_buf_size,
> -                                                  TRACE_BUFFER_FLAGS);
> +       global_trace.buffer = ring_buffer_alloc(ring_buf_size, rb_flags);
>        if (!global_trace.buffer) {
>                printk(KERN_ERR "tracer: failed to allocate ring buffer!\n");
>                WARN_ON(1);
> @@ -4575,7 +4580,7 @@ __init static int tracer_alloc_buffers(void)
>
>
>  #ifdef CONFIG_TRACER_MAX_TRACE
> -       max_tr.buffer = ring_buffer_alloc(1, TRACE_BUFFER_FLAGS);
> +       max_tr.buffer = ring_buffer_alloc(1, rb_flags);
>        if (!max_tr.buffer) {
>                printk(KERN_ERR "tracer: failed to allocate max ring buffer!\n");
>                WARN_ON(1);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 9021f8c..9cd025a 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -606,6 +606,7 @@ enum trace_iterator_flags {
>        TRACE_ITER_SLEEP_TIME           = 0x40000,
>        TRACE_ITER_GRAPH_TIME           = 0x80000,
>        TRACE_ITER_RECORD_CMD           = 0x100000,
> +       TRACE_ITER_OVERWRITE            = 0x200000,
>  };
>
>  /*
> --
> 1.7.3.1
>
>

2011-03-09 00:52:09

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 10/15] ftrace: fix event alignment: jbd2:*

On Tue, 2011-03-08 at 19:31 -0500, Ted Ts'o wrote:
> On Tue, Mar 08, 2011 at 07:03:25PM -0500, Steven Rostedt wrote:
> > Now I know why I was so late at adding this. I was waiting for
> > Acked-bys ;)
> >
> > On Fri, 2010-12-03 at 20:52 -0500, Steven Rostedt wrote:
> > > [ Ccing Ted for Acked-by ]
> >
> > Ted, Ping?
>
> A month or two ago I did an analysis using the pahole binary (part of
> the 'dwarves' package in Debian/Ubuntu) and noticed many more
> structures that were flagged as having 'holes' where there were
> optimization opportunities.
>
> In addition, (in an interesting example of synchronicity) just this
> afternoon, I had talked to Jiaying about a further optimization of the
> ext4 trace header files, since she pointed out to me that the 'perf'
> binary can actually handle the MAJOR() and MINOR() macros. So we can
> replace 8 bytes of "int major, minor", with a 4 byte dev_t, and save
> another 4 bytes (assuming we can move the structure elements around so
> we can save a full 8 bytes to avoid structure padding).
>
> Jiaying was also going to send some patches to add some additional
> tracepoints to the upstream kernel.
>
> So probably the best thing to do at this point is to drop the ext4 and
> jbd2 patches, since Jiaying is about to send me patches that handle
> these optimizations plus some further ones --- and add some additional
> tracepoints. Getting one patch from her that I'll feed in via the
> ext4 tree probably will be much simpler than coordinating patches to
> the same file from two different trees.

OK, thanks for the explanation. I'll go ahead and drop those two patches
from my queue.

-- Steve

2011-03-09 02:01:22

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 04/15] ftrace: pack event structures.

On Tue, 2011-03-08 at 18:30 -0500, Steven Rostedt wrote:
> David (Miller),
>
> I'm going back through this patch set and looking to incorporate it into
> 2.6.39. But I know how touchy sparc is with "packed" data structures. Do
> you see this patch as hurting sparc?
>
> I wonder if we should change this to something like:
>
> #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> # define do_event_packed __attribute__((packed))
> #else
> # define do_event_packed
> #endif
>
> and use "do_event_packed" instead?
>

Here's the modified patch, if you think the current one is not OK:

-- Steve

commit c49fd97767b112a9a7a50c184caf5059f798d5a0
Author: David Sharp <[email protected]>
Date: Fri Dec 3 16:13:18 2010 -0800

tracing: Pack event structures.

Ftrace event structures have a 12-byte struct trace_entry at the beginning.
If the structure is aligned, this means that if the first field is 64-bits,
there will be 4 bytes of padding. Ironically, due to the 4-byte ringbuffer
header, this will make 64-bit writes unaligned, if the ring buffer position
is currently 64-bit aligned:
4(rb)+12(ftrace)+4(pad) = 20; 20%8 = 4

Adding __attribute__((packed)) to the event structures removes the extra
space from the trace events, and actually improves alignment of trace
events with a first field that is 64-bits.

About 65 tracepoints have a 4-byte pad at offset 12:
# find events -name format | xargs -n1 awk '
$1=="name:" {name=$2}
$1=="format:"{FS="\t"}
$3=="offset:12;" && $4=="size:4;"{okay=1}
$3=="offset:16;" && !okay {print name}' | wc -l
65

With all 'syscalls' and 'timer' events enabled, this results in a 5%
improvement in a simple 512MB read benchmark with warm caches.

Tested:

setup:
# echo 1 >events/syscalls/enable
# echo 1 >events/timer/enable
# echo 0 > tracing_enabled
off:
# for n in $(seq 10) ; do \
time dd if=/dev/hda3 of=/dev/null bs=1K count=512K ; \
done
on:
# for n in $(seq 10) ; do \
echo 1 >tracing_enabled; \
time dd if=/dev/hda3 of=/dev/null bs=1K count=512K ; \
echo 0 >tracing_enabled; \
echo > trace; \
done

real time mean/median/stdev
w/o patch off: 1.1679/1.164/0.0169
w/o patch on : 1.9432/1.936/0.0274
w/ patch off: 1.1715/1.159/0.0431
w/ patch on : 1.8425/1.836/0.0138
"on" delta: -0.1007 --> -5.2%

Google-Bug-Id: 2895627

Cc: David S. Miller <[email protected]>
Signed-off-by: David Sharp <[email protected]>
LKML-Reference: <[email protected]>

[
Replaced "__attibute__((packed))" with "ftrace_do_packed" which
is a macro that is defined as the packed attribute only if the
arch has efficient unaligned access.
]

Signed-off-by: Steven Rostedt <[email protected]>

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 47e3997..c920ab5 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -282,4 +282,16 @@ perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
}
#endif

+/*
+ * On systems that can access unaligned data efficiently we
+ * pack the data structures. This actually gives a boost
+ * in tracing speed. But leave the structs alone for archs
+ * that do not do as well with packed data structures.
+ */
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+# define ftrace_do_packed __attribute__((packed))
+#else
+# define ftrace_do_packed
+#endif
+
#endif /* _LINUX_FTRACE_EVENT_H */
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 3e68366..6b8f366 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -48,7 +48,8 @@
#define __array(type, item, len) type item[len];

#undef __dynamic_array
-#define __dynamic_array(type, item, len) u32 __data_loc_##item;
+#define __dynamic_array(type, item, len) \
+ u32 __data_loc_##item __attribute__((aligned(4)));

#undef __string
#define __string(item, src) __dynamic_array(char, item, -1)
@@ -62,7 +63,7 @@
struct trace_entry ent; \
tstruct \
char __data[0]; \
- }; \
+ } ftrace_do_packed; \
\
static struct ftrace_event_class event_class_##name;

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 9cd025a..ce185fe 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -60,7 +60,7 @@ enum trace_type {
struct struct_name { \
struct trace_entry ent; \
tstruct \
- }
+ } ftrace_do_packed

#undef TP_ARGS
#define TP_ARGS(args...) args

2011-03-09 06:39:15

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 04/15] ftrace: pack event structures.

From: Steven Rostedt <[email protected]>
Date: Tue, 08 Mar 2011 18:30:03 -0500

> I'm going back through this patch set and looking to incorporate it into
> 2.6.39. But I know how touchy sparc is with "packed" data structures. Do
> you see this patch as hurting sparc?
>
> I wonder if we should change this to something like:
>
> #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> # define do_event_packed __attribute__((packed))
> #else
> # define do_event_packed
> #endif
>
> and use "do_event_packed" instead?

I think you should elide packed at all costs, and instead tell the compiler
what your intentions are by exposing the real types using unions or
similar instead of hiding it in an opaque way behind final char[] arrays.
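
[ One way to read that suggestion -- purely illustrative, not a patch from
this thread -- is to expose the per-event fields to the compiler as a typed
struct inside a union, rather than hiding them behind the trailing
char __data[0] array, using sched_switch as an example:

	struct ftrace_raw_sched_switch {
		struct trace_entry	ent;
		union {
			struct {
				char	prev_comm[TASK_COMM_LEN];
				pid_t	prev_pid;
				int	prev_prio;
				long	prev_state;
				char	next_comm[TASK_COMM_LEN];
				pid_t	next_pid;
				int	next_prio;
			};			/* compiler sees the real types and alignment */
			char	__data[0];	/* raw view kept for existing parsers */
		};
	};

With the real members visible, the compiler can lay them out (or warn about
padding) without any packed attribute. ]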

2011-03-09 08:50:15

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall

On 03/09/2011 01:55 AM, Steven Rostedt wrote:
> Hi Avi,
>
> This patch set got lost in the shuffle, and I'm not looking to include
> it.

Anything I need to do?

--
error compiling committee.c: too many arguments to function

2011-03-09 12:55:00

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall

On Wed, 2011-03-09 at 10:50 +0200, Avi Kivity wrote:
> On 03/09/2011 01:55 AM, Steven Rostedt wrote:
> > Hi Avi,
> >
> > This patch set got lost in the shuffle, and I'm not looking to include

Nasty typo: s/not/now/

> > it.
>
> Anything I need to do?
>

Yeah, give me your Acked-by ;)

-- Steve


2011-03-09 13:01:37

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 07/15] ftrace: fix event alignment: kvm:kvm_hv_hypercall

On 03/09/2011 02:54 PM, Steven Rostedt wrote:
> On Wed, 2011-03-09 at 10:50 +0200, Avi Kivity wrote:
> > On 03/09/2011 01:55 AM, Steven Rostedt wrote:
> > > Hi Avi,
> > >
> > > This patch set got lost in the shuffle, and I'm not looking to include
>
> Nasty typo: s/not/now/

Makes sense now.

> > > it.
> >
> > Anything I need to do?
> >
>
> Yeah, give me your Acked-by ;)

Acked-by: $me.

Sorry, your previous mail got lost.

--
error compiling committee.c: too many arguments to function

2011-03-09 13:01:59

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 04/15] ftrace: pack event structures.

On Fri, 2010-12-03 at 16:13 -0800, David Sharp wrote:
> Ftrace event structures have a 12-byte struct trace_entry at the beginning.
> If the structure is aligned, this means that if the first field is 64-bits,
> there will be 4 bytes of padding. Ironically, due to the 4-byte ringbuffer
> header, this will make 64-bit writes unaligned, if the ring buffer position
> is currently 64-bit aligned:
> 4(rb)+12(ftrace)+4(pad) = 20; 20%8 = 4

Actually the better answer is: It's time to nuke the lock_depth field.
That was added temporarily to help remove the BKL, and now that
it's pretty much gone, I say we need to nuke it. That will remove 4
bytes from the 12-byte ftrace header, giving us an 8-byte header. Much
nicer :)
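
[ For reference, dropping lock_depth would leave the common header looking
roughly like this -- 8 bytes on the usual architectures; a sketch, not the
committed patch:

	struct trace_entry {
		unsigned short		type;
		unsigned char		flags;
		unsigned char		preempt_count;
		int			pid;
	};
]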

I'll do that today. Thanks!

-- Steve

>
> Adding __attribute__((packed)) to the event structures removes the extra
> space from the trace events, and actually improves alignment of trace
> events with a first field that is 64-bits.
>
> About 65 tracepoints have a 4-byte pad at offset 12:
> # find events -name format | xargs -n1 awk '
> $1=="name:" {name=$2}
> $1=="format:"{FS="\t"}
> $3=="offset:12;" && $4=="size:4;"{okay=1}
> $3=="offset:16;" && !okay {print name}' | wc -l
> 65
>
> With all 'syscalls' and 'timer' events enabled, this results in a 5%
> improvement in a simple 512MB read benchmark with warm caches.

2011-03-09 15:18:11

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 04/15] ftrace: pack event structures.

On Tue, 2011-03-08 at 22:39 -0800, David Miller wrote:
> From: Steven Rostedt <[email protected]>
> Date: Tue, 08 Mar 2011 18:30:03 -0500
>
> > I'm going back through this patch set and looking to incorporate it into
> > 2.6.39. But I know how touchy sparc is with "packed" data structures. Do
> > you see this patch as hurting sparc?
> >
> > I wonder if we should change this to something like:
> >
> > #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > # define do_event_packed __attribute__((packed))
> > #else
> > # define do_event_packed
> > #endif
> >
> > and use "do_event_packed" instead?
>
> I think you should elide packed at all costs, and instead tell the compiler
> what your intentions are by exposing the real types using unions or
> similar instead of hiding it in an opaque way behind final char[] arrays.

The problem here is not the issue of char[], but that ftrace's header
was 12 bytes and caused everything that had an 8-byte word to use
another 4 bytes to align it. As I stated in another email, I think
removing the "int lock_depth" from the ftrace header should solve this.

David (Sharp), the reason for the "packed" was because of that extra 4
bytes in the ftrace header, right? So removing the lock_depth should fix
the issues that you've seen?

-- Steve


2011-03-10 23:22:19

by David Sharp

[permalink] [raw]
Subject: Re: [PATCH 04/15] ftrace: pack event structures.

On Wed, Mar 9, 2011 at 7:18 AM, Steven Rostedt <[email protected]> wrote:
> On Tue, 2011-03-08 at 22:39 -0800, David Miller wrote:
>> From: Steven Rostedt <[email protected]>
>> Date: Tue, 08 Mar 2011 18:30:03 -0500
>>
>> > I'm going back through this patch set and looking to incorporate it into
>> > 2.6.39. But I know how touchy sparc is with "packed" data structures. Do
>> > you see this patch as hurting sparc?
>> >
>> > I wonder if we should change this to something like:
>> >
>> > #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>> > # define do_event_packed __attribute__((packed))
>> > #else
>> > # define do_event_packed
>> > #endif
>> >
>> > and use "do_event_packed" instead?
>>
>> I think you should elide packed at all costs, and instead tell the compiler
>> what your intentions are by exposing the real types using unions or
>> similar instead of hiding it in an opaque way behind final char[] arrays.
>
> The problem here is not the issue of char[], but that ftrace's header
> was 12 bytes and caused everything that had an 8-byte word to use
> another 4 bytes to align it.

Not only that, but those extra "4 bytes to align it" have a good
chance of in fact causing the 8 byte word to become *un*aligned, so
it's really just wasted space.

> As I stated in another email, I think
> removing the "int lock_depth" from the ftrace header should solve this.
>
> David (Sharp), the reason for the "packed" was because of that extra 4
> bytes in the ftrace header, right? So removing the lock_depth should fix
> the issues that you've seen?

Yes, I think that should have the same effect. Of course the next time
we come along and remove other useless fields like 'flags' and
'preempt_count', a similar condition could arise.

(Note I don't necessarily think that 'flags' and 'preempt_count' are
useless; although I don't know what they're used for, or why they need
to be in every entry. fwiw, we've taken trace_entry down to 4 bytes of
{unsigned short type; unsigned short pid;}, and we even have plans to
completely remove pid.)
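
[ For comparison, the 4-byte variant mentioned here would look something like
this -- a sketch of the out-of-tree layout described above, not code from
this series:

	struct trace_entry {
		unsigned short	type;
		unsigned short	pid;	/* 16-bit pid; planned to be dropped entirely */
	};
]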

>
> -- Steve
>
>
>
>

2011-03-11 03:37:25

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 04/15] ftrace: pack event structures.

On Thu, 2011-03-10 at 15:21 -0800, David Sharp wrote:

> Yes, I think that should have the same effect. Of course the next time
> we come along and remove other useless fields like 'flags' and
> 'preempt_count', a similar condition could arise.
>
> (Note I don't necessarily think that 'flags' and 'preempt_count' are
> useless; although I don't know what they're used for, or why they need
> to be in every entry. fwiw, we've taken trace_entry down to 4 bytes of
> {unsigned short type; unsigned short pid;}, and we even have plans to
> completely remove pid.)

The flags and preempt count are legacy from the latency-tracer. But they
do come in quite handy when debugging -rt code. I've gotten trace data
from customers where knowing if preemption was enabled or disabled, and
whether or not something happened in interrupt context, was invaluable.

That said, I would love to figure out a nice way to make these optional.

-- Steve

2011-03-11 09:46:21

by David Sharp

[permalink] [raw]
Subject: [tip:perf/core] tracing: Add an 'overwrite' trace_option.

Commit-ID: 750912fa366312e9c5bc83eab352898a26750401
Gitweb: http://git.kernel.org/tip/750912fa366312e9c5bc83eab352898a26750401
Author: David Sharp <[email protected]>
AuthorDate: Wed, 8 Dec 2010 13:46:47 -0800
Committer: Steven Rostedt <[email protected]>
CommitDate: Wed, 9 Mar 2011 13:52:27 -0500

tracing: Add an 'overwrite' trace_option.

Add an "overwrite" trace_option for ftrace to control whether the buffer should
be overwritten on overflow or not. The default remains to overwrite old events
when the buffer is full. This patch adds the option to instead discard newest
events when the buffer is full. This is useful to get a snapshot of traces just
after enabling traces. Dropping the current event is also a simpler code path.

Signed-off-by: David Sharp <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
Documentation/trace/ftrace.txt | 5 +++++
include/linux/ring_buffer.h | 2 ++
kernel/trace/ring_buffer.c | 11 +++++++++++
kernel/trace/trace.c | 17 +++++++++++------
kernel/trace/trace.h | 1 +
5 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 67f1cc4..1ebc24c 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -454,6 +454,11 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
latencies, as described in "Latency
trace format".

+ overwrite - This controls what happens when the trace buffer is
+ full. If "1" (default), the oldest events are
+ discarded and overwritten. If "0", then the newest
+ events are discarded.
+
ftrace_enabled
--------------

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 8d3a248..ab38ac8 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -100,6 +100,8 @@ void ring_buffer_free(struct ring_buffer *buffer);

int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);

+void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
+
struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer,
unsigned long length);
int ring_buffer_unlock_commit(struct ring_buffer *buffer,
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index bd1c35a..269db80 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1429,6 +1429,17 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
}
EXPORT_SYMBOL_GPL(ring_buffer_resize);

+void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val)
+{
+ mutex_lock(&buffer->mutex);
+ if (val)
+ buffer->flags |= RB_FL_OVERWRITE;
+ else
+ buffer->flags &= ~RB_FL_OVERWRITE;
+ mutex_unlock(&buffer->mutex);
+}
+EXPORT_SYMBOL_GPL(ring_buffer_change_overwrite);
+
static inline void *
__rb_data_page_index(struct buffer_data_page *bpage, unsigned index)
{
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8dc8da6..85e3ee1 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -41,8 +41,6 @@
#include "trace.h"
#include "trace_output.h"

-#define TRACE_BUFFER_FLAGS (RB_FL_OVERWRITE)
-
/*
* On boot up, the ring buffer is set to the minimum size, so that
* we do not waste memory on systems that are not using tracing.
@@ -340,7 +338,7 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
/* trace_flags holds trace_options default values */
unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO | TRACE_ITER_SLEEP_TIME |
- TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD;
+ TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD | TRACE_ITER_OVERWRITE;

static int trace_stop_count;
static DEFINE_SPINLOCK(tracing_start_lock);
@@ -425,6 +423,7 @@ static const char *trace_options[] = {
"sleep-time",
"graph-time",
"record-cmd",
+ "overwrite",
NULL
};

@@ -2529,6 +2528,9 @@ static void set_tracer_flags(unsigned int mask, int enabled)

if (mask == TRACE_ITER_RECORD_CMD)
trace_event_enable_cmd_record(enabled);
+
+ if (mask == TRACE_ITER_OVERWRITE)
+ ring_buffer_change_overwrite(global_trace.buffer, enabled);
}

static ssize_t
@@ -4555,9 +4557,11 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
__init static int tracer_alloc_buffers(void)
{
int ring_buf_size;
+ enum ring_buffer_flags rb_flags;
int i;
int ret = -ENOMEM;

+
if (!alloc_cpumask_var(&tracing_buffer_mask, GFP_KERNEL))
goto out;

@@ -4570,12 +4574,13 @@ __init static int tracer_alloc_buffers(void)
else
ring_buf_size = 1;

+ rb_flags = trace_flags & TRACE_ITER_OVERWRITE ? RB_FL_OVERWRITE : 0;
+
cpumask_copy(tracing_buffer_mask, cpu_possible_mask);
cpumask_copy(tracing_cpumask, cpu_all_mask);

/* TODO: make the number of buffers hot pluggable with CPUS */
- global_trace.buffer = ring_buffer_alloc(ring_buf_size,
- TRACE_BUFFER_FLAGS);
+ global_trace.buffer = ring_buffer_alloc(ring_buf_size, rb_flags);
if (!global_trace.buffer) {
printk(KERN_ERR "tracer: failed to allocate ring buffer!\n");
WARN_ON(1);
@@ -4585,7 +4590,7 @@ __init static int tracer_alloc_buffers(void)


#ifdef CONFIG_TRACER_MAX_TRACE
- max_tr.buffer = ring_buffer_alloc(1, TRACE_BUFFER_FLAGS);
+ max_tr.buffer = ring_buffer_alloc(1, rb_flags);
if (!max_tr.buffer) {
printk(KERN_ERR "tracer: failed to allocate max ring buffer!\n");
WARN_ON(1);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 856e73c..951d0b7 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -606,6 +606,7 @@ enum trace_iterator_flags {
TRACE_ITER_SLEEP_TIME = 0x40000,
TRACE_ITER_GRAPH_TIME = 0x80000,
TRACE_ITER_RECORD_CMD = 0x100000,
+ TRACE_ITER_OVERWRITE = 0x200000,
};

/*
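
With this patch applied, the option shows up alongside the other trace_options,
so it can be flipped from userspace before a measurement run. Below is a minimal
sketch, not part of the patch set; the debugfs mount point and the per-option
file under tracing/options/ are assumptions about the local setup, and a plain
shell redirect into the same file has the same effect.

/* snapshot_mode.c: a minimal sketch, assuming the tracing debugfs is
 * mounted at /sys/kernel/debug and the kernel carries this patch. */
#include <stdio.h>

int main(void)
{
	const char *opt = "/sys/kernel/debug/tracing/options/overwrite";
	FILE *f = fopen(opt, "w");

	if (!f) {
		perror(opt);	/* needs root and a mounted debugfs */
		return 1;
	}
	/* "0": once the ring buffer fills, new events are dropped, so the
	 * buffer keeps a snapshot from just after tracing was enabled. */
	fputs("0\n", f);
	fclose(f);
	return 0;
}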

2011-03-11 09:46:37

by David Sharp

[permalink] [raw]
Subject: [tip:perf/core] ring-buffer: Remove unused #include <linux/trace_irq.h>

Commit-ID: de29be5e712dc8b7eef2bef9417af3bb6a88e47a
Gitweb: http://git.kernel.org/tip/de29be5e712dc8b7eef2bef9417af3bb6a88e47a
Author: David Sharp <[email protected]>
AuthorDate: Fri, 3 Dec 2010 16:13:16 -0800
Committer: Steven Rostedt <[email protected]>
CommitDate: Wed, 9 Mar 2011 13:52:28 -0500

ring-buffer: Remove unused #include <linux/trace_irq.h>

Signed-off-by: David Sharp <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
kernel/trace/ring_buffer.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 269db80..3237d96 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5,7 +5,6 @@
*/
#include <linux/ring_buffer.h>
#include <linux/trace_clock.h>
-#include <linux/ftrace_irq.h>
#include <linux/spinlock.h>
#include <linux/debugfs.h>
#include <linux/uaccess.h>

2011-03-11 09:48:59

by David Sharp

[permalink] [raw]
Subject: [tip:perf/core] tracing: Fix event alignment: ftrace:context_switch and ftrace:wakeup

Commit-ID: 140e4f2d1cd816aed196705c036763313c0e4bd3
Gitweb: http://git.kernel.org/tip/140e4f2d1cd816aed196705c036763313c0e4bd3
Author: David Sharp <[email protected]>
AuthorDate: Fri, 3 Dec 2010 16:13:19 -0800
Committer: Steven Rostedt <[email protected]>
CommitDate: Thu, 10 Mar 2011 10:34:16 -0500

tracing: Fix event alignment: ftrace:context_switch and ftrace:wakeup

Signed-off-by: David Sharp <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
kernel/trace/trace_entries.h | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 6cf2237..1516cb3 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -109,12 +109,12 @@ FTRACE_ENTRY(funcgraph_exit, ftrace_graph_ret_entry,
*/
#define FTRACE_CTX_FIELDS \
__field( unsigned int, prev_pid ) \
+ __field( unsigned int, next_pid ) \
+ __field( unsigned int, next_cpu ) \
__field( unsigned char, prev_prio ) \
__field( unsigned char, prev_state ) \
- __field( unsigned int, next_pid ) \
__field( unsigned char, next_prio ) \
- __field( unsigned char, next_state ) \
- __field( unsigned int, next_cpu )
+ __field( unsigned char, next_state )

FTRACE_ENTRY(context_switch, ctx_switch_entry,
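
The saving here comes purely from C structure padding rules: grouping the
4-byte fields ahead of the single-byte ones removes the padding the compiler
has to insert before each unsigned int. The sketch below uses stand-in structs
(field names mirror the event above, but these are not the kernel definitions)
to show the effect; with natural alignment it typically prints 20 versus 16
bytes.

/* padding_demo.c: illustrative only, not from the patch set. */
#include <stdio.h>

struct ctx_old {		/* original field order */
	unsigned int  prev_pid;
	unsigned char prev_prio;
	unsigned char prev_state;
	unsigned int  next_pid;		/* padded to a 4-byte boundary */
	unsigned char next_prio;
	unsigned char next_state;
	unsigned int  next_cpu;		/* padded again */
};

struct ctx_new {		/* reordered as in the patch */
	unsigned int  prev_pid;
	unsigned int  next_pid;
	unsigned int  next_cpu;
	unsigned char prev_prio;
	unsigned char prev_state;
	unsigned char next_prio;
	unsigned char next_state;
};

int main(void)
{
	printf("old: %zu bytes, new: %zu bytes\n",
	       sizeof(struct ctx_old), sizeof(struct ctx_new));
	return 0;
}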

2011-03-11 09:49:19

by David Sharp

[permalink] [raw]
Subject: [tip:perf/core] tracing: Fix event alignment: module:module_request

Commit-ID: b5e3008e489f5a00c6d5db914a4c4338c9ef5e8b
Gitweb: http://git.kernel.org/tip/b5e3008e489f5a00c6d5db914a4c4338c9ef5e8b
Author: David Sharp <[email protected]>
AuthorDate: Fri, 3 Dec 2010 16:13:20 -0800
Committer: Steven Rostedt <[email protected]>
CommitDate: Thu, 10 Mar 2011 10:34:20 -0500

tracing: Fix event alignment: module:module_request

Acked-by: Li Zefan <[email protected]>
Signed-off-by: David Sharp <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/trace/events/module.h | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/module.h b/include/trace/events/module.h
index c6bae36..21a546d 100644
--- a/include/trace/events/module.h
+++ b/include/trace/events/module.h
@@ -108,14 +108,14 @@ TRACE_EVENT(module_request,
TP_ARGS(name, wait, ip),

TP_STRUCT__entry(
- __field( bool, wait )
__field( unsigned long, ip )
+ __field( bool, wait )
__string( name, name )
),

TP_fast_assign(
- __entry->wait = wait;
__entry->ip = ip;
+ __entry->wait = wait;
__assign_str(name, name);
),

@@ -129,4 +129,3 @@ TRACE_EVENT(module_request,

/* This part must be outside protection */
#include <trace/define_trace.h>
-
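
The resulting layout can be checked without reading the header: each event
exposes its field offsets and sizes through its "format" file under the
tracing debugfs directory. A minimal sketch follows; the path is an assumption
about where debugfs is mounted, and simply cat-ing the file gives the same
output.

/* dump_format.c: a minimal sketch that prints an event's field layout,
 * assuming debugfs is mounted at /sys/kernel/debug. */
#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/kernel/debug/tracing/events/module/module_request/format";
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	/* Each "field:" line reports the field's offset and size, so the
	 * removed padding is directly visible after applying the patch. */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}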

2011-03-11 09:49:42

by David Sharp

[permalink] [raw]
Subject: [tip:perf/core] tracing: Fix event alignment: kvm:kvm_hv_hypercall

Commit-ID: d5bf2ff07230a4a1b73ecb22363f77c02e1d85ab
Gitweb: http://git.kernel.org/tip/d5bf2ff07230a4a1b73ecb22363f77c02e1d85ab
Author: David Sharp <[email protected]>
AuthorDate: Fri, 3 Dec 2010 16:13:21 -0800
Committer: Steven Rostedt <[email protected]>
CommitDate: Thu, 10 Mar 2011 10:34:24 -0500

tracing: Fix event alignment: kvm:kvm_hv_hypercall

Acked-by: Avi Kivity <[email protected]>
Signed-off-by: David Sharp <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86/kvm/trace.h | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 1357d7c..db93276 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -62,21 +62,21 @@ TRACE_EVENT(kvm_hv_hypercall,
TP_ARGS(code, fast, rep_cnt, rep_idx, ingpa, outgpa),

TP_STRUCT__entry(
- __field( __u16, code )
- __field( bool, fast )
__field( __u16, rep_cnt )
__field( __u16, rep_idx )
__field( __u64, ingpa )
__field( __u64, outgpa )
+ __field( __u16, code )
+ __field( bool, fast )
),

TP_fast_assign(
- __entry->code = code;
- __entry->fast = fast;
__entry->rep_cnt = rep_cnt;
__entry->rep_idx = rep_idx;
__entry->ingpa = ingpa;
__entry->outgpa = outgpa;
+ __entry->code = code;
+ __entry->fast = fast;
),

TP_printk("code 0x%x %s cnt 0x%x idx 0x%x in 0x%llx out 0x%llx",

2011-03-11 09:50:05

by David Sharp

[permalink] [raw]
Subject: [tip:perf/core] tracing: Fix event alignment: mce:mce_record

Commit-ID: ad440ad66f1617194738bf674dfe2d38978ac54d
Gitweb: http://git.kernel.org/tip/ad440ad66f1617194738bf674dfe2d38978ac54d
Author: David Sharp <[email protected]>
AuthorDate: Fri, 3 Dec 2010 16:13:22 -0800
Committer: Steven Rostedt <[email protected]>
CommitDate: Thu, 10 Mar 2011 10:34:28 -0500

tracing: Fix event alignment: mce:mce_record

Signed-off-by: David Sharp <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/trace/events/mce.h | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 7eee778..4cbbcef 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -17,36 +17,36 @@ TRACE_EVENT(mce_record,
TP_STRUCT__entry(
__field( u64, mcgcap )
__field( u64, mcgstatus )
- __field( u8, bank )
__field( u64, status )
__field( u64, addr )
__field( u64, misc )
__field( u64, ip )
- __field( u8, cs )
__field( u64, tsc )
__field( u64, walltime )
__field( u32, cpu )
__field( u32, cpuid )
__field( u32, apicid )
__field( u32, socketid )
+ __field( u8, cs )
+ __field( u8, bank )
__field( u8, cpuvendor )
),

TP_fast_assign(
__entry->mcgcap = m->mcgcap;
__entry->mcgstatus = m->mcgstatus;
- __entry->bank = m->bank;
__entry->status = m->status;
__entry->addr = m->addr;
__entry->misc = m->misc;
__entry->ip = m->ip;
- __entry->cs = m->cs;
__entry->tsc = m->tsc;
__entry->walltime = m->time;
__entry->cpu = m->extcpu;
__entry->cpuid = m->cpuid;
__entry->apicid = m->apicid;
__entry->socketid = m->socketid;
+ __entry->cs = m->cs;
+ __entry->bank = m->bank;
__entry->cpuvendor = m->cpuvendor;
),

2011-03-11 09:50:37

by David Sharp

[permalink] [raw]
Subject: [tip:perf/core] tracing: Fix event alignment: skb:kfree_skb

Commit-ID: ca9da2dd63b0b32de1b693953dff66cadeb6400b
Gitweb: http://git.kernel.org/tip/ca9da2dd63b0b32de1b693953dff66cadeb6400b
Author: David Sharp <[email protected]>
AuthorDate: Fri, 3 Dec 2010 16:13:23 -0800
Committer: Steven Rostedt <[email protected]>
CommitDate: Thu, 10 Mar 2011 10:34:31 -0500

tracing: Fix event alignment: skb:kfree_skb

Acked-by: Neil Horman <[email protected]>
Signed-off-by: David Sharp <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/trace/events/skb.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index f10293c..0c68ae22 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -19,14 +19,14 @@ TRACE_EVENT(kfree_skb,

TP_STRUCT__entry(
__field( void *, skbaddr )
- __field( unsigned short, protocol )
__field( void *, location )
+ __field( unsigned short, protocol )
),

TP_fast_assign(
__entry->skbaddr = skb;
- __entry->protocol = ntohs(skb->protocol);
__entry->location = location;
+ __entry->protocol = ntohs(skb->protocol);
),

TP_printk("skbaddr=%p protocol=%u location=%p",

2011-03-11 09:50:52

by David Sharp

[permalink] [raw]
Subject: [tip:perf/core] tracing: Adjust conditional expression latency formatting.

Commit-ID: 10da37a645b5e915d8572cc2b1f5eb11ada3ea4f
Gitweb: http://git.kernel.org/tip/10da37a645b5e915d8572cc2b1f5eb11ada3ea4f
Author: David Sharp <[email protected]>
AuthorDate: Fri, 3 Dec 2010 16:13:26 -0800
Committer: Steven Rostedt <[email protected]>
CommitDate: Thu, 10 Mar 2011 10:34:35 -0500

tracing: Adjust conditional expression latency formatting.

Formatting change only to improve code readability. No code changes except to
introduce intermediate variables.

Signed-off-by: David Sharp <[email protected]>
LKML-Reference: <[email protected]>

[ Keep variable declarations and assignment separate ]

Signed-off-by: Steven Rostedt <[email protected]>
---
kernel/trace/trace_output.c | 26 ++++++++++++++++++--------
1 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 151f32e..456be90 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -533,20 +533,30 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
*/
int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
{
- int hardirq, softirq;
+ char hardsoft_irq;
+ char need_resched;
+ char irqs_off;
+ int hardirq;
+ int softirq;
int ret;

hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
softirq = entry->flags & TRACE_FLAG_SOFTIRQ;

+ irqs_off =
+ (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
+ (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+ '.';
+ need_resched =
+ (entry->flags & TRACE_FLAG_NEED_RESCHED) ? 'N' : '.';
+ hardsoft_irq =
+ (hardirq && softirq) ? 'H' :
+ hardirq ? 'h' :
+ softirq ? 's' :
+ '.';
+
if (!trace_seq_printf(s, "%c%c%c",
- (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
- (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ?
- 'X' : '.',
- (entry->flags & TRACE_FLAG_NEED_RESCHED) ?
- 'N' : '.',
- (hardirq && softirq) ? 'H' :
- hardirq ? 'h' : softirq ? 's' : '.'))
+ irqs_off, need_resched, hardsoft_irq))
return 0;

if (entry->preempt_count)