2009-01-09 15:05:05

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability

* KOSAKI Motohiro ([email protected]) wrote:
> FYI
>
> following patch merged linus tree today.
>
>
>
> Forwarded by KOSAKI Motohiro <[email protected]>
> ----------------------- Original Message -----------------------
> From: Theodore Ts'o <[email protected]>
> To: Ext4 Developers List <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Date: Sun, 4 Jan 2009 15:33:58 -0500
> Subject: [PATCH] ext4: Add markers for better debuggability
> ----
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/ialloc.c | 9 ++++++++
> fs/ext4/inode.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/ext4/mballoc.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 116 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 88ab940..3bcf197 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -210,6 +210,11 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
>
> ino = inode->i_ino;
> ext4_debug("freeing inode %lu\n", ino);
> + trace_mark(ext4_free_inode,
> + "dev %s ino %lu mode %d uid %lu gid %lu bocks %llu",
> + sb->s_id, inode->i_ino, inode->i_mode,
> + (unsigned long) inode->i_uid, (unsigned long) inode->i_gid,
> + (unsigned long long) inode->i_blocks);
>

Hi Ted,

Given this is meant to be in the mainline kernel, using tracepoints
rather than markers would be more appropriate. Please see
Documentation/tracepoints.txt and samples/tracepoints/ for details.

Best regards,

Mathieu


> /*
> * Note: we must free any quota before locking the superblock,
> @@ -698,6 +703,8 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
> return ERR_PTR(-EPERM);
>
> sb = dir->i_sb;
> + trace_mark(ext4_request_inode, "dev %s dir %lu mode %d", sb->s_id,
> + dir->i_ino, mode);
> inode = new_inode(sb);
> if (!inode)
> return ERR_PTR(-ENOMEM);
> @@ -922,6 +929,8 @@ got:
> }
>
> ext4_debug("allocating inode %lu\n", inode->i_ino);
> + trace_mark(ext4_allocate_inode, "dev %s ino %lu dir %lu mode %d",
> + sb->s_id, inode->i_ino, dir->i_ino, mode);
> goto really_out;
> fail:
> ext4_std_error(sb, err);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index aa402db..8f7884d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1350,6 +1350,10 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
> pgoff_t index;
> unsigned from, to;
>
> + trace_mark(ext4_write_begin,
> + "dev %s ino %lu pos %llu len %u flags %u",
> + inode->i_sb->s_id, inode->i_ino,
> + (unsigned long long) pos, len, flags);
> index = pos >> PAGE_CACHE_SHIFT;
> from = pos & (PAGE_CACHE_SIZE - 1);
> to = from + len;
> @@ -1421,6 +1425,10 @@ static int ext4_ordered_write_end(struct file *file,
> struct inode *inode = mapping->host;
> int ret = 0, ret2;
>
> + trace_mark(ext4_ordered_write_end,
> + "dev %s ino %lu pos %llu len %u copied %u",
> + inode->i_sb->s_id, inode->i_ino,
> + (unsigned long long) pos, len, copied);
> ret = ext4_jbd2_file_inode(handle, inode);
>
> if (ret == 0) {
> @@ -1459,6 +1467,10 @@ static int ext4_writeback_write_end(struct file *file,
> int ret = 0, ret2;
> loff_t new_i_size;
>
> + trace_mark(ext4_writeback_write_end,
> + "dev %s ino %lu pos %llu len %u copied %u",
> + inode->i_sb->s_id, inode->i_ino,
> + (unsigned long long) pos, len, copied);
> new_i_size = pos + copied;
> if (new_i_size > EXT4_I(inode)->i_disksize) {
> ext4_update_i_disksize(inode, new_i_size);
> @@ -1494,6 +1506,10 @@ static int ext4_journalled_write_end(struct file *file,
> unsigned from, to;
> loff_t new_i_size;
>
> + trace_mark(ext4_journalled_write_end,
> + "dev %s ino %lu pos %llu len %u copied %u",
> + inode->i_sb->s_id, inode->i_ino,
> + (unsigned long long) pos, len, copied);
> from = pos & (PAGE_CACHE_SIZE - 1);
> to = from + len;
>
> @@ -2310,6 +2326,9 @@ static int ext4_da_writepage(struct page *page,
> struct buffer_head *page_bufs;
> struct inode *inode = page->mapping->host;
>
> + trace_mark(ext4_da_writepage,
> + "dev %s ino %lu page_index %lu",
> + inode->i_sb->s_id, inode->i_ino, page->index);
> size = i_size_read(inode);
> if (page->index == size >> PAGE_CACHE_SHIFT)
> len = size & ~PAGE_CACHE_MASK;
> @@ -2420,6 +2439,20 @@ static int ext4_da_writepages(struct address_space *mapping,
> int needed_blocks, ret = 0, nr_to_writebump = 0;
> struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
>
> + trace_mark(ext4_da_writepages,
> + "dev %s ino %lu nr_t_write %ld "
> + "pages_skipped %ld range_start %llu "
> + "range_end %llu nonblocking %d "
> + "for_kupdate %d for_reclaim %d "
> + "for_writepages %d range_cyclic %d",
> + inode->i_sb->s_id, inode->i_ino,
> + wbc->nr_to_write, wbc->pages_skipped,
> + (unsigned long long) wbc->range_start,
> + (unsigned long long) wbc->range_end,
> + wbc->nonblocking, wbc->for_kupdate,
> + wbc->for_reclaim, wbc->for_writepages,
> + wbc->range_cyclic);
> +
> /*
> * No pages to write? This is mainly a kludge to avoid starting
> * a transaction for special inodes like journal inode on last iput()
> @@ -2538,6 +2571,14 @@ out_writepages:
> if (!no_nrwrite_index_update)
> wbc->no_nrwrite_index_update = 0;
> wbc->nr_to_write -= nr_to_writebump;
> + trace_mark(ext4_da_writepage_result,
> + "dev %s ino %lu ret %d pages_written %d "
> + "pages_skipped %ld congestion %d "
> + "more_io %d no_nrwrite_index_update %d",
> + inode->i_sb->s_id, inode->i_ino, ret,
> + pages_written, wbc->pages_skipped,
> + wbc->encountered_congestion, wbc->more_io,
> + wbc->no_nrwrite_index_update);
> return ret;
> }
>
> @@ -2589,6 +2630,11 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> len, flags, pagep, fsdata);
> }
> *fsdata = (void *)0;
> +
> + trace_mark(ext4_da_write_begin,
> + "dev %s ino %lu pos %llu len %u flags %u",
> + inode->i_sb->s_id, inode->i_ino,
> + (unsigned long long) pos, len, flags);
> retry:
> /*
> * With delayed allocation, we don't log the i_disksize update
> @@ -2678,6 +2724,10 @@ static int ext4_da_write_end(struct file *file,
> }
> }
>
> + trace_mark(ext4_da_write_end,
> + "dev %s ino %lu pos %llu len %u copied %u",
> + inode->i_sb->s_id, inode->i_ino,
> + (unsigned long long) pos, len, copied);
> start = pos & (PAGE_CACHE_SIZE - 1);
> end = start + copied - 1;
>
> @@ -2891,6 +2941,9 @@ static int ext4_normal_writepage(struct page *page,
> loff_t size = i_size_read(inode);
> loff_t len;
>
> + trace_mark(ext4_normal_writepage,
> + "dev %s ino %lu page_index %lu",
> + inode->i_sb->s_id, inode->i_ino, page->index);
> J_ASSERT(PageLocked(page));
> if (page->index == size >> PAGE_CACHE_SHIFT)
> len = size & ~PAGE_CACHE_MASK;
> @@ -2976,6 +3029,9 @@ static int ext4_journalled_writepage(struct page *page,
> loff_t size = i_size_read(inode);
> loff_t len;
>
> + trace_mark(ext4_journalled_writepage,
> + "dev %s ino %lu page_index %lu",
> + inode->i_sb->s_id, inode->i_ino, page->index);
> J_ASSERT(PageLocked(page));
> if (page->index == size >> PAGE_CACHE_SHIFT)
> len = size & ~PAGE_CACHE_MASK;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 0372626..78e826d 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2878,8 +2878,9 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
> discard_block = (ext4_fsblk_t) entry->group * EXT4_BLOCKS_PER_GROUP(sb)
> + entry->start_blk
> + le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
> - trace_mark(ext4_discard_blocks, "dev %s blk %llu count %u", sb->s_id,
> - (unsigned long long) discard_block, entry->count);
> + trace_mark(ext4_discard_blocks, "dev %s blk %llu count %u",
> + sb->s_id, (unsigned long long) discard_block,
> + entry->count);
> sb_issue_discard(sb, discard_block, entry->count);
>
> kmem_cache_free(ext4_free_ext_cachep, entry);
> @@ -3697,6 +3698,10 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>
> mb_debug("new inode pa %p: %llu/%u for %u\n", pa,
> pa->pa_pstart, pa->pa_len, pa->pa_lstart);
> + trace_mark(ext4_mb_new_inode_pa,
> + "dev %s ino %lu pstart %llu len %u lstart %u",
> + sb->s_id, ac->ac_inode->i_ino,
> + pa->pa_pstart, pa->pa_len, pa->pa_lstart);
>
> ext4_mb_use_inode_pa(ac, pa);
> atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);
> @@ -3754,7 +3759,9 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
> pa->pa_linear = 1;
>
> mb_debug("new group pa %p: %llu/%u for %u\n", pa,
> - pa->pa_pstart, pa->pa_len, pa->pa_lstart);
> + pa->pa_pstart, pa->pa_len, pa->pa_lstart);
> + trace_mark(ext4_mb_new_group_pa, "dev %s pstart %llu len %u lstart %u",
> + sb->s_id, pa->pa_pstart, pa->pa_len, pa->pa_lstart);
>
> ext4_mb_use_group_pa(ac, pa);
> atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);
> @@ -3807,12 +3814,14 @@ ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
> unsigned int next;
> ext4_group_t group;
> ext4_grpblk_t bit;
> + unsigned long long grp_blk_start;
> sector_t start;
> int err = 0;
> int free = 0;
>
> BUG_ON(pa->pa_deleted == 0);
> ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
> + grp_blk_start = pa->pa_pstart - bit;
> BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
> end = bit + pa->pa_len;
>
> @@ -3842,6 +3851,10 @@ ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
> ext4_mb_store_history(ac);
> }
>
> + trace_mark(ext4_mb_release_inode_pa,
> + "dev %s ino %lu block %llu count %u",
> + sb->s_id, pa->pa_inode->i_ino, grp_blk_start + bit,
> + next - bit);
> mb_free_blocks(pa->pa_inode, e4b, bit, next - bit);
> bit = next + 1;
> }
> @@ -3875,6 +3888,8 @@ ext4_mb_release_group_pa(struct ext4_buddy *e4b,
> if (ac)
> ac->ac_op = EXT4_MB_HISTORY_DISCARD;
>
> + trace_mark(ext4_mb_release_group_pa, "dev %s pstart %llu len %d",
> + sb->s_id, pa->pa_pstart, pa->pa_len);
> BUG_ON(pa->pa_deleted == 0);
> ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
> BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
> @@ -4040,6 +4055,8 @@ void ext4_discard_preallocations(struct inode *inode)
> }
>
> mb_debug("discard preallocation for inode %lu\n", inode->i_ino);
> + trace_mark(ext4_discard_preallocations, "dev %s ino %lu", sb->s_id,
> + inode->i_ino);
>
> INIT_LIST_HEAD(&list);
>
> @@ -4492,6 +4509,8 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
> int ret;
> int freed = 0;
>
> + trace_mark(ext4_mb_discard_preallocations, "dev %s needed %d",
> + sb->s_id, needed);
> for (i = 0; i < EXT4_SB(sb)->s_groups_count && needed > 0; i++) {
> ret = ext4_mb_discard_group_preallocations(sb, i, needed);
> freed += ret;
> @@ -4520,6 +4539,18 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> sb = ar->inode->i_sb;
> sbi = EXT4_SB(sb);
>
> + trace_mark(ext4_request_blocks, "dev %s flags %u len %u ino %lu "
> + "lblk %llu goal %llu lleft %llu lright %llu "
> + "pleft %llu pright %llu ",
> + sb->s_id, ar->flags, ar->len,
> + ar->inode ? ar->inode->i_ino : 0,
> + (unsigned long long) ar->logical,
> + (unsigned long long) ar->goal,
> + (unsigned long long) ar->lleft,
> + (unsigned long long) ar->lright,
> + (unsigned long long) ar->pleft,
> + (unsigned long long) ar->pright);
> +
> if (!EXT4_I(ar->inode)->i_delalloc_reserved_flag) {
> /*
> * With delalloc we already reserved the blocks
> @@ -4622,6 +4653,19 @@ out3:
> reserv_blks);
> }
>
> + trace_mark(ext4_allocate_blocks,
> + "dev %s block %llu flags %u len %u ino %lu "
> + "logical %llu goal %llu lleft %llu lright %llu "
> + "pleft %llu pright %llu ",
> + sb->s_id, (unsigned long long) block,
> + ar->flags, ar->len, ar->inode ? ar->inode->i_ino : 0,
> + (unsigned long long) ar->logical,
> + (unsigned long long) ar->goal,
> + (unsigned long long) ar->lleft,
> + (unsigned long long) ar->lright,
> + (unsigned long long) ar->pleft,
> + (unsigned long long) ar->pright);
> +
> return block;
> }
>
> @@ -4755,6 +4799,10 @@ void ext4_mb_free_blocks(handle_t *handle, struct inode *inode,
> }
>
> ext4_debug("freeing block %lu\n", block);
> + trace_mark(ext4_free_blocks,
> + "dev %s block %llu count %lu metadata %d ino %lu",
> + sb->s_id, (unsigned long long) block, count, metadata,
> + inode ? inode->i_ino : 0);
>
> ac = kmem_cache_alloc(ext4_ac_cachep, GFP_NOFS);
> if (ac) {
> --
> 1.6.0.4.8.g36f27.dirty
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --------------------- Original Message Ends --------------------
>
>
>
>
> _______________________________________________
> ltt-dev mailing list
> [email protected]
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2009-01-09 18:58:45

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability

On Fri, Jan 09, 2009 at 09:49:01AM -0500, Mathieu Desnoyers wrote:
>
> Given this is meant to be in the mainline kernel, using tracepoints
> rather than markers would be more appropriate. Please see
> Documentation/tracepoints.txt and samples/tracepoints/ for details.

I saw that, but I'm still not sure how I would actually *use*
tracepoints. I can use markers today with Systemtap, and I need an
easy way to be able to selectively do things like "trap and print all
of the logged data when the ext4_request_inode tracepoint/marker is
tripped where dev=='dm-8'".

Unfortunately, Systemtap seems doomed given its current course and
direction, so I'm not opposed to moving to another solution --- I just
need a HOWTO that demonstrates how I can do this painlessly and
easily; preferably without having me have to manually hack up a kernel
module each time I want to implement the tracing

Is there an easy way to do that using ltt today? Can you point me at
a sample of how this could be done?

- Ted

2009-01-10 16:19:08

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability

* Theodore Tso ([email protected]) wrote:
> On Fri, Jan 09, 2009 at 09:49:01AM -0500, Mathieu Desnoyers wrote:
> >
> > Given this is meant to be in the mainline kernel, using tracepoints
> > rather than markers would be more appropriate. Please see
> > Documentation/tracepoints.txt and samples/tracepoints/ for details.
>
> I saw that, but I'm still not sure how I would actually *use*
> tracepoints. I can use markers today with Systemtap, and I need an
> easy way to be able to selectively do things like "trap and print all
> of the logged data when the ext4_request_inode tracepoint/marker is
> tripped where dev=='dm-8'".
>
> Unfortunately, Systemtap seems doomed given its current course and
> direction, so I'm not opposed to moving to another solution --- I just
> need a HOWTO that demonstrates how I can do this painlessly and
> easily; preferably without having me have to manually hack up a kernel
> module each time I want to implement the tracing
>
> Is there an easy way to do that using ltt today? Can you point me at
> a sample of how this could be done?
>

Hi Ted,

I just ported LTTng to 2.6.28 yesterday and started doing the port of
ext4/jbd2 to tracepoints. As you can see in my 0.74 announcement, the
tracepoint work for both jbd2 and ext4 is done. I also did the lttng
probe module for jbd2. Now I just have to create the probe module for
ext4. I also want to create debugfs files to control per-probe module
filtering, e.g. :

/mnt/debugfs/ltt/filter/jbd2/dev
/mnt/debugfs/ltt/filter/ext4/dev

Where writing to it would add device names to the filter list. I would
like a scheme where we can easily add/remove devices, list all
devices... I think ftrace already has something similar for
instrumentation activation.

The main question I am facing is : What interface semantic do we want
for such filter control file ?

Comments/ideas are welcome.

Mathieu

> - Ted
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-01-10 18:20:03

by Mathieu Desnoyers

[permalink] [raw]
Subject: LTTng for ext4 tracing HOWTO

* Theodore Tso ([email protected]) wrote:
> On Fri, Jan 09, 2009 at 09:49:01AM -0500, Mathieu Desnoyers wrote:
> >
> > Given this is meant to be in the mainline kernel, using tracepoints
> > rather than markers would be more appropriate. Please see
> > Documentation/tracepoints.txt and samples/tracepoints/ for details.
>
> I saw that, but I'm still not sure how I would actually *use*
> tracepoints. I can use markers today with Systemtap, and I need an
> easy way to be able to selectively do things like "trap and print all
> of the logged data when the ext4_request_inode tracepoint/marker is
> tripped where dev=='dm-8'".
>
> Unfortunately, Systemtap seems doomed given its current course and
> direction, so I'm not opposed to moving to another solution --- I just
> need a HOWTO that demonstrates how I can do this painlessly and
> easily; preferably without having me have to manually hack up a kernel
> module each time I want to implement the tracing
>
> Is there an easy way to do that using ltt today? Can you point me at
> a sample of how this could be done?
>
> - Ted
>

Hi Ted,

I just finished adding the ext4 tracing feature to LTTng. It's in LTTng
0.75 as I write this. The only requirement currently not fulfilled is
the "filtering by dev == 'dm-8'", which I plan to implement later. We
currently have the whole stream of ext4 operations recorded to trace
buffers (all on) when tracing is enabled.

The starting point for the LTTng project is http://ltt.polymtl.ca

You will need LTTng 0.75 and the matching ltt-control and lttv packages
to get this running. Matching versions are listed here :
http://ltt.polymtl.ca/svn/trunk/lttv/doc/developer/lttng-lttv-compatibility.html

To save you time, here are the direct links to the sources :
LTTng :
http://ltt.polymtl.ca/lttng/patch-2.6.28-lttng-0.75.tar.bz2
or
git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-lttng.git

ltt-control :
http://ltt.polymtl.ca/lttng/ltt-control-0.63-03012009.tar.gz

lttv :
http://ltt.polymtl.ca/packages/lttv-0.12.5-09012009.tar.gz

Follow the QUICKSTART GUIDE to perform the installation :
http://ltt.polymtl.ca/svn/trunk/lttv/QUICKSTART

Once you are setted up, reboot your LTTng-enabled kernel, and try the
following (as root) :

ltt-armall
(you should see all the markers within lttng probes being enabled on
the screen. This should include lines beginning with
"Connecting ext4 ....". If not, make sure you have the lttng ext4
probe module loaded.)
lttctl -C -w /tmp/trace trace

(perform some ext4 operations)

lttctl -D trace

lttv -m textDump -e channel.name=ext4 -t /tmp/trace
This will show events from the ext4 channel only, e.g. :

ext4.da_write_begin: 197.030265213 (/tmp/trace/ext4_3), 4190, 4190, \
/bin/cp, , 4021, 0x0, SYSCALL \
{ dev = "loop0", ino = 15, pos = 5320704, len = 4096, flags = 0 }
ext4.da_write_end: 197.030268714 (/tmp/trace/ext4_3), 4190, 4190, \
/bin/cp, , 4021, 0x0, SYSCALL \
{ dev = "loop0", ino = 15, pos = 5320704, len = 4096, copied = 4096 }
ext4.da_write_begin: 197.030279441 (/tmp/trace/ext4_3), 4190, 4190, \
/bin/cp, , 4021, 0x0, SYSCALL \
{ dev = "loop0", ino = 15, pos = 5324800, len = 4096, flags = 0 }
....

You can also try the GUI interface if you like, with :
lttv-gui -t /tmp/trace

You will find it useful if you need to dig in the ext4 trace information
along with the instrumentation collected from the rest of the kernel.

This is a very basic HOWTO. Don't hesitate to ask if you need help to
tune the amount of events logged (e.g. by disabling some markers, which
can be done by doing a custom ltt-armall script).

Also, lttv filtering currently does not support filtering by event
fields. A quick hack to do it is :

lttv -m textDump -e channel.name=ext4 -t /tmp/trace |grep "dev = \"loop0\","

Enjoy :)

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-01-10 18:42:45

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability

On Sat, Jan 10, 2009 at 11:19:06AM -0500, Mathieu Desnoyers wrote:
>
> I just ported LTTng to 2.6.28 yesterday and started doing the port of
> ext4/jbd2 to tracepoints. As you can see in my 0.74 announcement, the
> tracepoint work for both jbd2 and ext4 is done. I also did the lttng
> probe module for jbd2. Now I just have to create the probe module for
> ext4. I also want to create debugfs files to control per-probe module
> filtering, e.g. :
>
> /mnt/debugfs/ltt/filter/jbd2/dev
> /mnt/debugfs/ltt/filter/ext4/dev
>
> Where writing to it would add device names to the filter list. I would
> like a scheme where we can easily add/remove devices, list all
> devices... I think ftrace already has something similar for
> instrumentation activation.
>
> The main question I am facing is : What interface semantic do we want
> for such filter control file ?

Hmm, we'll let's see. The most common filtering restriction will be
by device, but I'll occasionally want to filter based on a single
inode; the next most common thing I could forsee wanting to do is to
filter on a based of inode numbers or on one or more block groups.
(This would be when trying to figure out what is going on with a
particular filesystem benchmark.)

Past a certain point, I recognize that I'll probably have to write a
custom probe module --- although I have to admit that's one of the
things that has spoiled me about SystemTap; it automates the job of
creating the custom probe modules, and allows me to create
turing-equivalent filtering and data collection.

So the question is where do we draw the line between the most common
filters that is worth putting into a probe module that goes into
mainline, versus what should be done via custome probe modules,
probably via modifying the probe module as an example.

If it's not too much trouble, being able to filter on a single device
(or report the data from all trace points) and filtering on a single
inode (or reporting all inodes) seems to make the most mount of sense.
Does tht seem reasonable to you?

The other question is how much data gets reported back; normally with
Systemtap I would report back to userspace only the bits that I needed
to debug whatever issue I was looking at. However, with the LTTNG
approach, can we send back all of the bits of data tht was in the
markers? It won't be needed for all problems, and I know that sending
too much data back will cause us to potentilly overflow the kernel
buffering, since there is a limited bandwidth of stuff we can send
back through the kernel<->user interface. I'll have to admit to being
that well informed about LTTNG; what are the practical bandwidth
limitations that I should expect to see when it's in use?

Thanks, regards,

- Ted

P.S. At the kernel summit, there was some talk about a very
simplified interface that would allow us to extract text dumps from
tracepoints without having to download a huge userspace utility like
SystemTap --- or LTTng. Has that been written yet, or is the only way
for me to use Tracepoints right now is to figure out how to use the
whole LTTng infrstructure?

2009-01-10 20:40:21

by Steven Rostedt

[permalink] [raw]
Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability


On Sat, 10 Jan 2009, Theodore Tso wrote:
>
> P.S. At the kernel summit, there was some talk about a very
> simplified interface that would allow us to extract text dumps from
> tracepoints without having to download a huge userspace utility like
> SystemTap --- or LTTng. Has that been written yet, or is the only way
> for me to use Tracepoints right now is to figure out how to use the
> whole LTTng infrstructure?

Actually, that's one of the things ftrace is suppose to do, and is on my
todo list. To track all trace points, and be able to hook into any one and
print them out to the trace.

-- Steve


2009-01-10 21:50:08

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability

On Sat, Jan 10, 2009 at 03:40:16PM -0500, Steven Rostedt wrote:
>
> Actually, that's one of the things ftrace is suppose to do, and is on my
> todo list. To track all trace points, and be able to hook into any one and
> print them out to the trace.

As a feature request, can this include actually printing the data
which gets captured at a tracepoint, and not just the fact that we hit
a particular tracepoint?

One nice thing about the markers infrastructure was it included a
printk format strong, so this would have been very easy. The current
tracepoints insertion hook, while convenient in that you can just pass
it a data structure like this:

+ trace_ext4_sync_file(file, dentry, datasync);

....instead of this...

- trace_mark(ext4_sync_file, "dev %s datasync %d ino %ld parent %ld",
- inode->i_sb->s_id, datasync, inode->i_ino,
- dentry->d_parent->d_inode->i_ino);


Has as its downside the fact that it's going to be difficult for
ftrace to be able to pretty-print the contents of data structure.
Where as with the trace_mark() format string, it would be trivially
easy for ftrace to print the information captured at the trace point.

For the bulk of the ext4 tracepoints, merely logging the fact that we
hit the "ext4_sync_file" tracepoint isn't going to be particularly
interesting; it's the data associated with hitting that particular
tracepoint which I really want to be able to get access to.

(And why I decided to use markers instead of tracepoints. With
systemtap, I can access the markers *today* looks like being able to
do something useful with tracepoints is still a work in progress....
But as I said, I've been warning the Systemtap folks for a while now
that they don't get their act together, there will be replacements
written by kernel developers that will be designed to be useful for
kernel developers --- but they chose not to believe James Bottomley or
myself, and blew us off --- so when there is a functional equivalent
ready, I'm happy to switch; it just doesn't seem quite there yet.)


- Ted

2009-01-12 01:34:19

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [ltt-dev] Fw: [PATCH] ext4: Add markers for better debuggability

* Theodore Tso ([email protected]) wrote:
> On Sat, Jan 10, 2009 at 11:19:06AM -0500, Mathieu Desnoyers wrote:
> >
> > I just ported LTTng to 2.6.28 yesterday and started doing the port of
> > ext4/jbd2 to tracepoints. As you can see in my 0.74 announcement, the
> > tracepoint work for both jbd2 and ext4 is done. I also did the lttng
> > probe module for jbd2. Now I just have to create the probe module for
> > ext4. I also want to create debugfs files to control per-probe module
> > filtering, e.g. :
> >
> > /mnt/debugfs/ltt/filter/jbd2/dev
> > /mnt/debugfs/ltt/filter/ext4/dev
> >
> > Where writing to it would add device names to the filter list. I would
> > like a scheme where we can easily add/remove devices, list all
> > devices... I think ftrace already has something similar for
> > instrumentation activation.
> >
> > The main question I am facing is : What interface semantic do we want
> > for such filter control file ?
>
> Hmm, we'll let's see. The most common filtering restriction will be
> by device, but I'll occasionally want to filter based on a single
> inode; the next most common thing I could forsee wanting to do is to
> filter on a based of inode numbers or on one or more block groups.
> (This would be when trying to figure out what is going on with a
> particular filesystem benchmark.)
>
> Past a certain point, I recognize that I'll probably have to write a
> custom probe module --- although I have to admit that's one of the
> things that has spoiled me about SystemTap; it automates the job of
> creating the custom probe modules, and allows me to create
> turing-equivalent filtering and data collection.
>
> So the question is where do we draw the line between the most common
> filters that is worth putting into a probe module that goes into
> mainline, versus what should be done via custome probe modules,
> probably via modifying the probe module as an example.
>
> If it's not too much trouble, being able to filter on a single device
> (or report the data from all trace points) and filtering on a single
> inode (or reporting all inodes) seems to make the most mount of sense.
> Does tht seem reasonable to you?
>

Yes, very reasonable. It's now implemented in LTTng 0.77. Usage looks
like :

echo "*" > /mnt/debugfs/ltt/filter/ext4/dev
(or just echo "" > /mnt/debugfs/ltt/filter/ext4/dev)
(enables all devices, this is the default)
cat /mnt/debugfs/ltt/filter/ext4/dev
(shows current filter)

echo sda1 > /mnt/debugfs/ltt/filter/ext4/dev
(ext4 probes will only record information for device sda1)
echo 15 > /mnt/debugfs/ltt/filter/ext4/inode
(in addition to device filtering, only inode 15 will be traced)

I think it makes sense to let people edit
ltt/probes/ext4-trace.c to implement their own more complex filtering
suiting their need. It can be compiled as a kernel module.

Here is the ext4 filtering feature diff :
http://ltt.polymtl.ca/cgi-bin/gitweb.cgi?p=linux-2.6-lttng.git;a=blob;f=ltt/probes/ext4-trace.c;h=4824a82fe9c2056d8dea83c57473ecf6c7f1c25d;hb=ad03cc9ead62d85d7641dcd19ea883f07bb7900a

> The other question is how much data gets reported back; normally with
> Systemtap I would report back to userspace only the bits that I needed
> to debug whatever issue I was looking at. However, with the LTTNG
> approach, can we send back all of the bits of data tht was in the
> markers?

Yes, LTTng consists of a "probe layer" which converts the data taken
from tracepoints into markers. For ext4 :
http://ltt.polymtl.ca/cgi-bin/gitweb.cgi?p=linux-2.6-lttng.git;a=blob;f=ltt/probes/ext4-trace.c;h=167357d87e730e33c3ca1fb335358ddc469473ad;hb=1861c6b862df1e6aad2d658984fbff137b15496e

By default, LTTng sends all the data, including the payload, identified
by the markers. Note that the marker format string is only recorded once
per trace in the metadata channel in a table, which makes everything
very compact and efficient.

> It won't be needed for all problems, and I know that sending
> too much data back will cause us to potentilly overflow the kernel
> buffering, since there is a limited bandwidth of stuff we can send
> back through the kernel<->user interface. I'll have to admit to being
> that well informed about LTTNG; what are the practical bandwidth
> limitations that I should expect to see when it's in use?
>

LTTng has been heavily optimized to provide good performances. It uses
lightweight locking (RCU and per-cpu local atomic operations) and
splices the data directly from the memory buffers to disk without
intermediate copy.

A small test to give you an idea of the amount of data generated :

Copying 125MB worth of data (the drivers/ subdirectory) from an ext3 to
an ext4 partition, both on the same disk (raid1) + issuing sync, while
tracing with the trace output on the same disk (on the ext3 partition).

Copy duration : 12.99s
sync duration : 3.59s

Trace generated : 39M

ext4-related data:
516K /tmp/trace/ext4_0
516K /tmp/trace/ext4_1
516K /tmp/trace/ext4_2
3.1M /tmp/trace/ext4_3
516K /tmp/trace/ext4_4
516K /tmp/trace/ext4_5
516K /tmp/trace/ext4_6
516K /tmp/trace/ext4_7

Therefore, most of the trace size is padding due to unused buffer space.
I should say that this amount of ext4 data output (187.0kB/s) is really
not a problem neither for LTTng nor the disk.

A small speed test to figure out the tracer impact :
echo 3 > /proc/sys/vm/drop_caches; time (cp -ax drivers /mnt/testext4/; sync)

Without tracing (1):
real 0m18.182s to 0m19.026s

While tracing the overall kernel (including ext4) to disk :
real 0m19.309s to 0m19.986s (~4.8% execution time overhead over (1))

If you cannot afford to dirturb disk activity, you can use the flight
recorder mode, which only keep the data in memory buffers and overwrites
the oldest data until you stop tracing. At that point, you can grab the
buffers and dump them to disk to analyse the last events. It's very
useful if you know what particular condition you are looking for and if
you are only interested in the latest prior to the condition.

Flight recorder trace example :

lttctl -C -w /tmp/trace3 -o channel.all.overwrite=1 trace3
...
lttctl -D -w /tmp/trace3 trace3

While tracing the overall kernel (including ext4) to memory buffers :
real 0m18.748s to 0m19.388s (~2.2% execution time overhead over (1))

Note that the current ext4 probe implementation uses the format-string
serializer (printk-like code which iterates on the format string). For
very high throughput events, we can trivially change to
ltt-type-serializer.[ch] and diminish the execution time impact a lot.
It's just a bit more probe code to write. See kernel-trace.c to see
examples using such mechanism. The only reason why I use this dynamic
serializer is because I am lazy. ;)

> Thanks, regards,
>
> - Ted
>
> P.S. At the kernel summit, there was some talk about a very
> simplified interface that would allow us to extract text dumps from
> tracepoints without having to download a huge userspace utility like
> SystemTap --- or LTTng. Has that been written yet, or is the only way
> for me to use Tracepoints right now is to figure out how to use the
> whole LTTng infrstructure?
>

Yes, we're working on having a binary-to-ascii converter in the kernel.
People at Fujitsu are helping on this, and I'm waiting for news in this
regard. That's the main element I'm waiting for before I start
submitting LTTng for mainline inclusion.

The current LTTng work flow does not require "huge" userspace utilities.
lttctl (from the ltt-control package) is just a helper over the
/mnt/debugfs/ltt/ interface; this can be done using simple "echo"
commands too. It's therefore optional.

lttd is the daemon which splices the data to disk. It is required.

lttv is built as a single executable with multiple libraries. The
package itself depends on GTK/GDK, but the textDump module really only
needs the glib library.

However, I agree that also providing a text output directly from a
debugfs file would be useful for low-overhead tracing. Note, however,
that this would increase the tracer footprint.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68