2010-01-23 00:07:00

by Keiichi KII

[permalink] [raw]
Subject: [RFC PATCH -tip 0/2 v2] pagecache tracepoints proposal

Hello,

This is v2 of a patchset to add some tracepoints for pagecache.

I would propose several tracepoints for tracing pagecache behaviors and
a script for these.
By using both the tracepoints and the script, we can monitor pagecache usage
with high resolution. Example output of the script looks like:

[file list]
cached
device inode pages
--------------------------------
253:0 1051413 130
253:0 1051399 2
253:0 1051414 44
253:0 1051417 154

o postmaster-2330
cached added removed indirect
device inode pages pages pages removed pages
----------------------------------------------------------------
253:0 1051399 0 2 0 0
253:0 1051417 154 0 0 0
253:0 1051413 130 0 0 0
253:0 1051414 44 0 0 0
----------------------------------------------------------------
total: 337 2 0 0

We can now know system-wide pagecache usage by /proc/meminfo.
But we have no method to get higher resolution information like per file or
per process usage than system-wide one.
A process may share some pagecache or add a pagecache to the memory or
remove a pagecache from the memory.
If a pagecache miss hit ratio rises, maybe it leads to extra I/O and
affects system performance.

So, by using the tracepoints we can get the following information.
1. how many pagecaches each process has per each file
2. how many pages are cached per each file
3. how many pagecaches each process shares
4. how often each process adds/removes pagecache
5. how long a pagecache stays in the memory
6. pagecache hit rate per file

Especially, the monitoring pagecache usage per each file would help us tune
some applications like database.

Changelog since v1
o Add a script based on "perf trace stream scripting support".

Any comments are welcome.
--
Keiichi Kii <[email protected]>


2010-01-23 00:11:15

by Keiichi KII

[permalink] [raw]
Subject: [RFC PATCH -tip 1/2 v2] add tracepoints for pagecache

This patch adds several tracepoints to track pagecach behavior.
These trecepoints would help us monitor pagecache usage with high resolution.

Signed-off-by: Keiichi Kii <[email protected]>
Cc: Atsushi Tsuji <[email protected]>
---
include/trace/events/filemap.h | 83 +++++++++++++++++++++++++++++++++++++++++
mm/filemap.c | 5 ++
mm/truncate.c | 2
mm/vmscan.c | 3 +
4 files changed, 93 insertions(+)

Index: linux-2.6-tip/include/trace/events/filemap.h
===================================================================
--- /dev/null
+++ linux-2.6-tip/include/trace/events/filemap.h
@@ -0,0 +1,83 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM filemap
+
+#if !defined(_TRACE_FILEMAP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FILEMAP_H
+
+#include <linux/fs.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(find_get_page,
+
+ TP_PROTO(struct address_space *mapping, pgoff_t offset,
+ struct page *page),
+
+ TP_ARGS(mapping, offset, page),
+
+ TP_STRUCT__entry(
+ __field(dev_t, s_dev)
+ __field(ino_t, i_ino)
+ __field(pgoff_t, offset)
+ __field(struct page *, page)
+ ),
+
+ TP_fast_assign(
+ __entry->s_dev = mapping->host ? mapping->host->i_sb->s_dev : 0;
+ __entry->i_ino = mapping->host ? mapping->host->i_ino : 0;
+ __entry->offset = offset;
+ __entry->page = page;
+ ),
+
+ TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu %s", MAJOR(__entry->s_dev),
+ MINOR(__entry->s_dev), __entry->i_ino, __entry->offset,
+ __entry->page == NULL ? "page_not_found" : "page_found")
+);
+
+TRACE_EVENT(add_to_page_cache,
+
+ TP_PROTO(struct address_space *mapping, pgoff_t offset),
+
+ TP_ARGS(mapping, offset),
+
+ TP_STRUCT__entry(
+ __field(dev_t, s_dev)
+ __field(ino_t, i_ino)
+ __field(pgoff_t, offset)
+ ),
+
+ TP_fast_assign(
+ __entry->s_dev = mapping->host->i_sb->s_dev;
+ __entry->i_ino = mapping->host->i_ino;
+ __entry->offset = offset;
+ ),
+
+ TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu", MAJOR(__entry->s_dev),
+ MINOR(__entry->s_dev), __entry->i_ino, __entry->offset)
+);
+
+TRACE_EVENT(remove_from_page_cache,
+
+ TP_PROTO(struct address_space *mapping, pgoff_t offset),
+
+ TP_ARGS(mapping, offset),
+
+ TP_STRUCT__entry(
+ __field(dev_t, s_dev)
+ __field(ino_t, i_ino)
+ __field(pgoff_t, offset)
+ ),
+
+ TP_fast_assign(
+ __entry->s_dev = mapping->host->i_sb->s_dev;
+ __entry->i_ino = mapping->host->i_ino;
+ __entry->offset = offset;
+ ),
+
+ TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu", MAJOR(__entry->s_dev),
+ MINOR(__entry->s_dev), __entry->i_ino, __entry->offset)
+);
+
+#endif /* _TRACE_FILEMAP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
Index: linux-2.6-tip/mm/filemap.c
===================================================================
--- linux-2.6-tip.orig/mm/filemap.c
+++ linux-2.6-tip/mm/filemap.c
@@ -34,6 +34,8 @@
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
+#define CREATE_TRACE_POINTS
+#include <trace/events/filemap.h>
#include "internal.h"

/*
@@ -149,6 +151,7 @@ void remove_from_page_cache(struct page
spin_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
+ trace_remove_from_page_cache(mapping, page->index);
mem_cgroup_uncharge_cache_page(page);
}

@@ -419,6 +422,7 @@ int add_to_page_cache_locked(struct page
if (PageSwapBacked(page))
__inc_zone_page_state(page, NR_SHMEM);
spin_unlock_irq(&mapping->tree_lock);
+ trace_add_to_page_cache(mapping, offset);
} else {
page->mapping = NULL;
spin_unlock_irq(&mapping->tree_lock);
@@ -642,6 +646,7 @@ repeat:
}
rcu_read_unlock();

+ trace_find_get_page(mapping, offset, page);
return page;
}
EXPORT_SYMBOL(find_get_page);
Index: linux-2.6-tip/mm/truncate.c
===================================================================
--- linux-2.6-tip.orig/mm/truncate.c
+++ linux-2.6-tip/mm/truncate.c
@@ -20,6 +20,7 @@
do_invalidatepage */
#include "internal.h"

+#include <trace/events/filemap.h>

/**
* do_invalidatepage - invalidate part or all of a page
@@ -388,6 +389,7 @@ invalidate_complete_page2(struct address
BUG_ON(page_has_private(page));
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
+ trace_remove_from_page_cache(mapping, page->index);
mem_cgroup_uncharge_cache_page(page);
page_cache_release(page); /* pagecache ref */
return 1;
Index: linux-2.6-tip/mm/vmscan.c
===================================================================
--- linux-2.6-tip.orig/mm/vmscan.c
+++ linux-2.6-tip/mm/vmscan.c
@@ -48,6 +48,8 @@

#include "internal.h"

+#include <trace/events/filemap.h>
+
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -477,6 +479,7 @@ static int __remove_mapping(struct addre
} else {
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
+ trace_remove_from_page_cache(mapping, page->index);
mem_cgroup_uncharge_cache_page(page);
}


2010-01-23 00:11:30

by Keiichi KII

[permalink] [raw]
Subject: [RFC PATCH -tip 2/2 v2] add a scripts for pagecache usage per process

The scripts are implemented based on the trace stream scripting support.
And the scripts implement the following.
- how many pagecaches each process has per each file
- how many pages are cached per each file
- how many pagecaches each process shares

To monitor pagecache usage per a process, run "pagecache-usage-record" to
record perf data for "pagecache-usage.pl" and run "pagecache-usage-report"
to display.

The below outputs show execution sample.

[file list]
device inode caches
--------------------------------
253:0 1051413 130
253:0 1051399 2
253:0 1051414 44
253:0 1051417 154

[process list]
o postmaster-2330
cached added removed indirect
device inode pages pages pages removed pages
----------------------------------------------------------------
253:0 1051399 0 2 0 0
253:0 1051417 154 0 0 0
253:0 1051413 130 0 0 0
253:0 1051414 44 0 0 0
----------------------------------------------------------------
total: 337 2 0 0

>From the output, we can know some information like:

- if "added pages" > "cached pages" on process list then
It means repeating add/remove pagecache many times.
=> Bad case for pagecache usage

- if "added pages" <= "cached pages" on process list then
It means no unnecessary I/O operations.
=> Good case for pagecache usage.

- if "caches" on file list >
sum "cached pages" per each file on process list then
It means there are unneccessary pagecaches in the memory.
=> Bad case for pagecache usage

Signed-off-by: Keiichi Kii <[email protected]>
Cc: Atsushi Tsuji <[email protected]>
---
tools/perf/scripts/perl/bin/pagecache-usage-record | 7
tools/perf/scripts/perl/bin/pagecache-usage-report | 6
tools/perf/scripts/perl/pagecache-usage.pl | 160 +++++++++++++++++++++
3 files changed, 173 insertions(+)

Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-record
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-record
@@ -0,0 +1,7 @@
+#!/bin/bash
+perf record -c 1 -f -a -M -R -e filemap:add_to_page_cache -e filemap:find_get_page -e filemap:remove_from_page_cache
+
+
+
+
+
Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-report
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-report
@@ -0,0 +1,6 @@
+#!/bin/bash
+# description: pagecache usage per process
+perf trace -s ~/libexec/perf-core/scripts/perl/pagecache-usage.pl
+
+
+
Index: linux-2.6-tip/tools/perf/scripts/perl/pagecache-usage.pl
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/pagecache-usage.pl
@@ -0,0 +1,160 @@
+# perf trace event handlers, generated by perf trace -g perl
+# Licensed under the terms of the GNU GPL License version 2
+
+# The common_* event handler fields are the most useful fields common to
+# all events. They don't necessarily correspond to the 'common_*' fields
+# in the format files. Those fields not available as handler params can
+# be retrieved using Perl functions of the form common_*($context).
+# See Context.pm for the list of available functions.
+
+use lib "$ENV{'PERF_EXEC_PATH'}/scripts/perl/Perf-Trace-Util/lib";
+use lib "./Perf-Trace-Util/lib";
+use Perf::Trace::Core;
+use Perf::Trace::Context;
+use Perf::Trace::Util;
+use List::Util qw/sum/;
+my %files;
+my %processes;
+my %records;
+
+sub trace_end
+{
+ print_pagecache_usage_per_file();
+ print "\n";
+ print_pagecache_usage_per_process();
+ print_unhandled();
+}
+
+sub filemap::remove_from_page_cache
+{
+ my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+ $common_pid, $common_comm,
+ $s_dev, $i_ino, $offset) = @_;
+ my $f = \%{$files{$s_dev}{$i_ino}};
+ my $r = \%{$records{$common_comm."-".$common_pid}{$f}};
+
+ delete $$f{$offset};
+ $$r{inode} = $i_ino;
+ $$r{dev} = $s_dev;
+ if (exists $$r{added}{$offset}) {
+ $$r{removed}++;
+ } else {
+ $$r{indirect_removed}++;
+ }
+}
+
+sub filemap::add_to_page_cache
+{
+ my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+ $common_pid, $common_comm,
+ $s_dev, $i_ino, $offset) = @_;
+ my $f = \%{$files{$s_dev}{$i_ino}};
+ my $r = \%{$records{$common_comm."-".$common_pid}{$f}};
+
+ $$f{$offset}++;
+ $$r{added}{$offset}++;
+ $$r{inode} = $i_ino;
+ $$r{dev} = $s_dev;
+}
+
+sub filemap::find_get_page
+{
+ my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+ $common_pid, $common_comm,
+ $s_dev, $i_ino, $offset, $page) = @_;
+ my $f = \%{$files{$s_dev}{$i_ino}};
+ my $r = \%{$records{$common_comm."-".$common_pid}{$f}};
+
+ if ($page != 0) {
+ $$f{$offset}++;
+ $$r{cached}++;
+ $$r{inode} = $i_ino;
+ $$r{dev} = $s_dev;
+ }
+}
+
+my %unhandled;
+
+sub trace_unhandled
+{
+ my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+ $common_pid, $common_comm) = @_;
+
+ $unhandled{$event_name}++;
+}
+
+sub print_unhandled
+{
+ if ((scalar keys %unhandled) == 0) {
+ print "unhandled events nothing\n";
+ return;
+ }
+
+ print "\nunhandled events:\n\n";
+
+ printf("%-40s %10s\n", "event", "count");
+ printf("%-40s %10s\n", "----------------------------------------",
+ "-----------");
+
+ foreach my $event_name (keys %unhandled) {
+ printf("%-40s %10d\n", $event_name, $unhandled{$event_name});
+ }
+}
+
+sub minor
+{
+ my $dev = shift;
+ return $dev & ((1 << 20) - 1);
+}
+
+sub major
+{
+ my $dev = shift;
+ return $dev >> 20;
+}
+
+sub print_pagecache_usage_per_file
+{
+ print "[file list]\n";
+ printf(" %12s %10s %8s\n", "", "", "cached");
+ printf(" %12s %10s %8s\n", "device", "inode", "pages");
+ printf(" %s\n", '-' x 32);
+ while(my($dev, $file) = each(%files)) {
+ while(my($inode, $r) = each(%$file)) {
+ my $count = values %$r;
+ next if $count == 0;
+ printf(" %12s %10d %8d\n",
+ major($dev).":".minor($dev), $inode, $count);
+ }
+ }
+}
+
+sub print_pagecache_usage_per_process
+{
+ print "[process list]\n";
+ while(my ($pid, $v) = each(%records)) {
+ my ($sum_cached, $sum_added, $sum_removed, $sum_indirect_removed);
+
+ print "o $pid\n";
+ printf(" %12s %10s %8s %8s %8s %13s\n", "", "",
+ "cached", "added", "removed", "indirect");
+ printf(" %12s %10s %8s %8s %8s %13s\n", "device", "inode",
+ "pages", "pages", "pages", "removed pages");
+ printf(" %s\n", '-' x 64);
+ while(my ($file, $r) = each(%$v)) {
+ my $added_num = List::Util::sum(values %{$$r{added}});
+ $sum_cached += $$r{cached};
+ $sum_added += $added_num;
+ $sum_removed += $$r{removed};
+ $sum_indirect_removed += $$r{indirect_removed};
+ printf(" %12s %10d %8d %8d %8d %13d\n",
+ major($$r{dev}).":".minor($$r{dev}), $$r{inode},
+ $$r{cached}, $added_num, $$r{removed},
+ $$r{indirect_removed});
+ }
+ printf(" %s\n", '-' x 64);
+ printf(" total: %5s %10s %8d %8d %8d %13d\n", "", "", $sum_cached,
+ $sum_added, $sum_removed, $sum_indirect_removed);
+ print "\n";
+ }
+}


2010-01-23 02:28:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 1/2 v2] add tracepoints for pagecache

On Fri, 2010-01-22 at 19:07 -0500, Keiichi KII wrote:
> This patch adds several tracepoints to track pagecach behavior.
> These trecepoints would help us monitor pagecache usage with high resolution.
>
> Signed-off-by: Keiichi Kii <[email protected]>
> Cc: Atsushi Tsuji <[email protected]>
> ---
> include/trace/events/filemap.h | 83 +++++++++++++++++++++++++++++++++++++++++
> mm/filemap.c | 5 ++
> mm/truncate.c | 2
> mm/vmscan.c | 3 +
> 4 files changed, 93 insertions(+)
>
> Index: linux-2.6-tip/include/trace/events/filemap.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6-tip/include/trace/events/filemap.h
> @@ -0,0 +1,83 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM filemap
> +
> +#if !defined(_TRACE_FILEMAP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_FILEMAP_H
> +
> +#include <linux/fs.h>
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(find_get_page,
> +
> + TP_PROTO(struct address_space *mapping, pgoff_t offset,
> + struct page *page),
> +
> + TP_ARGS(mapping, offset, page),
> +
> + TP_STRUCT__entry(
> + __field(dev_t, s_dev)
> + __field(ino_t, i_ino)
> + __field(pgoff_t, offset)
> + __field(struct page *, page)
> + ),
> +
> + TP_fast_assign(
> + __entry->s_dev = mapping->host ? mapping->host->i_sb->s_dev : 0;
> + __entry->i_ino = mapping->host ? mapping->host->i_ino : 0;
> + __entry->offset = offset;
> + __entry->page = page;
> + ),
> +
> + TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu %s", MAJOR(__entry->s_dev),
> + MINOR(__entry->s_dev), __entry->i_ino, __entry->offset,
> + __entry->page == NULL ? "page_not_found" : "page_found")
> +);
> +
> +TRACE_EVENT(add_to_page_cache,
> +
> + TP_PROTO(struct address_space *mapping, pgoff_t offset),
> +
> + TP_ARGS(mapping, offset),
> +
> + TP_STRUCT__entry(
> + __field(dev_t, s_dev)
> + __field(ino_t, i_ino)
> + __field(pgoff_t, offset)
> + ),
> +
> + TP_fast_assign(
> + __entry->s_dev = mapping->host->i_sb->s_dev;
> + __entry->i_ino = mapping->host->i_ino;
> + __entry->offset = offset;
> + ),
> +
> + TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu", MAJOR(__entry->s_dev),
> + MINOR(__entry->s_dev), __entry->i_ino, __entry->offset)
> +);
> +
> +TRACE_EVENT(remove_from_page_cache,
> +
> + TP_PROTO(struct address_space *mapping, pgoff_t offset),
> +
> + TP_ARGS(mapping, offset),
> +
> + TP_STRUCT__entry(
> + __field(dev_t, s_dev)
> + __field(ino_t, i_ino)
> + __field(pgoff_t, offset)
> + ),
> +
> + TP_fast_assign(
> + __entry->s_dev = mapping->host->i_sb->s_dev;
> + __entry->i_ino = mapping->host->i_ino;
> + __entry->offset = offset;
> + ),
> +
> + TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu", MAJOR(__entry->s_dev),
> + MINOR(__entry->s_dev), __entry->i_ino, __entry->offset)
> +);
> +

The above qualify in converting to templates or DECLACE_TRACE_CLASS, and
DEFINE_EVENT, That is, rename the above TRACE_EVENT into
DECLARE_TRACE_CLASS, and then have the other one be a DEFINE_EVENT().
See the trace/event/sched.h for examples.

The TRACE_EVENT can add a bit of code, so use DECLARE_TRACE_CLASS when
possible and it will save on the size overhead.

-- Steve

> +#endif /* _TRACE_FILEMAP_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> Index: linux-2.6-tip/mm/filemap.c
> ===================================================================
> --- linux-2.6-tip.orig/mm/filemap.c
> +++ linux-2.6-tip/mm/filemap.c
> @@ -34,6 +34,8 @@
> #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
> #include <linux/memcontrol.h>
> #include <linux/mm_inline.h> /* for page_is_file_cache() */
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/filemap.h>
> #include "internal.h"
>
> /*
> @@ -149,6 +151,7 @@ void remove_from_page_cache(struct page
> spin_lock_irq(&mapping->tree_lock);
> __remove_from_page_cache(page);
> spin_unlock_irq(&mapping->tree_lock);
> + trace_remove_from_page_cache(mapping, page->index);
> mem_cgroup_uncharge_cache_page(page);
> }
>
> @@ -419,6 +422,7 @@ int add_to_page_cache_locked(struct page
> if (PageSwapBacked(page))
> __inc_zone_page_state(page, NR_SHMEM);
> spin_unlock_irq(&mapping->tree_lock);
> + trace_add_to_page_cache(mapping, offset);
> } else {
> page->mapping = NULL;
> spin_unlock_irq(&mapping->tree_lock);
> @@ -642,6 +646,7 @@ repeat:
> }
> rcu_read_unlock();
>
> + trace_find_get_page(mapping, offset, page);
> return page;
> }
> EXPORT_SYMBOL(find_get_page);
> Index: linux-2.6-tip/mm/truncate.c
> ===================================================================
> --- linux-2.6-tip.orig/mm/truncate.c
> +++ linux-2.6-tip/mm/truncate.c
> @@ -20,6 +20,7 @@
> do_invalidatepage */
> #include "internal.h"
>
> +#include <trace/events/filemap.h>
>
> /**
> * do_invalidatepage - invalidate part or all of a page
> @@ -388,6 +389,7 @@ invalidate_complete_page2(struct address
> BUG_ON(page_has_private(page));
> __remove_from_page_cache(page);
> spin_unlock_irq(&mapping->tree_lock);
> + trace_remove_from_page_cache(mapping, page->index);
> mem_cgroup_uncharge_cache_page(page);
> page_cache_release(page); /* pagecache ref */
> return 1;
> Index: linux-2.6-tip/mm/vmscan.c
> ===================================================================
> --- linux-2.6-tip.orig/mm/vmscan.c
> +++ linux-2.6-tip/mm/vmscan.c
> @@ -48,6 +48,8 @@
>
> #include "internal.h"
>
> +#include <trace/events/filemap.h>
> +
> struct scan_control {
> /* Incremented by the number of inactive pages that were scanned */
> unsigned long nr_scanned;
> @@ -477,6 +479,7 @@ static int __remove_mapping(struct addre
> } else {
> __remove_from_page_cache(page);
> spin_unlock_irq(&mapping->tree_lock);
> + trace_remove_from_page_cache(mapping, page->index);
> mem_cgroup_uncharge_cache_page(page);
> }
>
>
>

2010-01-23 08:21:15

by Tom Zanussi

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 2/2 v2] add a scripts for pagecache usage per process

Hi,

On Fri, 2010-01-22 at 19:08 -0500, Keiichi KII wrote:
> The scripts are implemented based on the trace stream scripting support.
> And the scripts implement the following.
> - how many pagecaches each process has per each file
> - how many pages are cached per each file
> - how many pagecaches each process shares
>

Nice, it looks like a very useful script - I gave it a quick try and it
seems to work well...

The only problem I see, nothing to do with your script and nothing you
can do anything about at the moment, is that the record step generates a
huge amount of data, which of course makes the event processing take
awhile. A lot of it appears to be due to perf itself - being able to
filter out the perf-generated events in the kernel would make a big
difference, I think; you normally don't want to see those anyway...

BTW, I see that you did your first version in Python - not that you'd
want to redo it again, but just FYI I now have working Python support
that I'll be posting soon - I still have some small details to hammer
out, but if you have any other scripts in the pipeline, in a couple days
you'll be able to use Python instead if you want.

A few more small comments below...

> To monitor pagecache usage per a process, run "pagecache-usage-record" to
> record perf data for "pagecache-usage.pl" and run "pagecache-usage-report"
> to display.

Another way of course would be to use 'perf trace record/report' and the
script name as shown by perf trace -l:

$ perf trace record pagecache-usage
$ perf trace report pagecache-usage

>
> The below outputs show execution sample.
>
> [file list]
> device inode caches
> --------------------------------
> 253:0 1051413 130
> 253:0 1051399 2
> 253:0 1051414 44
> 253:0 1051417 154
>
> [process list]
> o postmaster-2330
> cached added removed indirect
> device inode pages pages pages removed pages
> ----------------------------------------------------------------
> 253:0 1051399 0 2 0 0
> 253:0 1051417 154 0 0 0
> 253:0 1051413 130 0 0 0
> 253:0 1051414 44 0 0 0
> ----------------------------------------------------------------
> total: 337 2 0 0
>
> >From the output, we can know some information like:
>
> - if "added pages" > "cached pages" on process list then
> It means repeating add/remove pagecache many times.
> => Bad case for pagecache usage
>
> - if "added pages" <= "cached pages" on process list then
> It means no unnecessary I/O operations.
> => Good case for pagecache usage.
>
> - if "caches" on file list >
> sum "cached pages" per each file on process list then
> It means there are unneccessary pagecaches in the memory.
> => Bad case for pagecache usage
>
> Signed-off-by: Keiichi Kii <[email protected]>
> Cc: Atsushi Tsuji <[email protected]>
> ---
> tools/perf/scripts/perl/bin/pagecache-usage-record | 7
> tools/perf/scripts/perl/bin/pagecache-usage-report | 6
> tools/perf/scripts/perl/pagecache-usage.pl | 160 +++++++++++++++++++++
> 3 files changed, 173 insertions(+)
>
> Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-record
> ===================================================================
> --- /dev/null
> +++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-record
> @@ -0,0 +1,7 @@
> +#!/bin/bash
> +perf record -c 1 -f -a -M -R -e filemap:add_to_page_cache -e filemap:find_get_page -e filemap:remove_from_page_cache
> +
> +
> +
> +
> +
> Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-report
> ===================================================================
> --- /dev/null
> +++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-report
> @@ -0,0 +1,6 @@
> +#!/bin/bash
> +# description: pagecache usage per process
> +perf trace -s ~/libexec/perf-core/scripts/perl/pagecache-usage.pl
> +
> +
> +
> Index: linux-2.6-tip/tools/perf/scripts/perl/pagecache-usage.pl
> ===================================================================
> --- /dev/null
> +++ linux-2.6-tip/tools/perf/scripts/perl/pagecache-usage.pl
> @@ -0,0 +1,160 @@
> +# perf trace event handlers, generated by perf trace -g perl

You might want to get rid of this and add a short description and your
name, if you want to take credit for it. ;-)

> +# Licensed under the terms of the GNU GPL License version 2
> +

> +# The common_* event handler fields are the most useful fields common to
> +# all events. They don't necessarily correspond to the 'common_*' fields
> +# in the format files. Those fields not available as handler params can
> +# be retrieved using Perl functions of the form common_*($context).
> +# See Context.pm for the list of available functions.
> +

You can get rid of this part too - it's just meant to be helpful
information generated when starting a script.

> +use lib "$ENV{'PERF_EXEC_PATH'}/scripts/perl/Perf-Trace-Util/lib";
> +use lib "./Perf-Trace-Util/lib";
> +use Perf::Trace::Core;
> +use Perf::Trace::Context;
> +use Perf::Trace::Util;
> +use List::Util qw/sum/;
> +my %files;
> +my %processes;
> +my %records;
> +
> +sub trace_end
> +{
> + print_pagecache_usage_per_file();
> + print "\n";
> + print_pagecache_usage_per_process();
> + print_unhandled();
> +}
> +
> +sub filemap::remove_from_page_cache
> +{
> + my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
> + $common_pid, $common_comm,
> + $s_dev, $i_ino, $offset) = @_;
> + my $f = \%{$files{$s_dev}{$i_ino}};
> + my $r = \%{$records{$common_comm."-".$common_pid}{$f}};
> +
> + delete $$f{$offset};
> + $$r{inode} = $i_ino;
> + $$r{dev} = $s_dev;
> + if (exists $$r{added}{$offset}) {
> + $$r{removed}++;
> + } else {
> + $$r{indirect_removed}++;
> + }
> +}
> +
> +sub filemap::add_to_page_cache
> +{
> + my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
> + $common_pid, $common_comm,
> + $s_dev, $i_ino, $offset) = @_;
> + my $f = \%{$files{$s_dev}{$i_ino}};
> + my $r = \%{$records{$common_comm."-".$common_pid}{$f}};
> +
> + $$f{$offset}++;
> + $$r{added}{$offset}++;
> + $$r{inode} = $i_ino;
> + $$r{dev} = $s_dev;
> +}
> +
> +sub filemap::find_get_page
> +{
> + my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
> + $common_pid, $common_comm,
> + $s_dev, $i_ino, $offset, $page) = @_;
> + my $f = \%{$files{$s_dev}{$i_ino}};
> + my $r = \%{$records{$common_comm."-".$common_pid}{$f}};
> +
> + if ($page != 0) {
> + $$f{$offset}++;
> + $$r{cached}++;
> + $$r{inode} = $i_ino;
> + $$r{dev} = $s_dev;
> + }
> +}
> +
> +my %unhandled;
> +
> +sub trace_unhandled
> +{
> + my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
> + $common_pid, $common_comm) = @_;
> +
> + $unhandled{$event_name}++;
> +}
> +
> +sub print_unhandled
> +{
> + if ((scalar keys %unhandled) == 0) {
> + print "unhandled events nothing\n";

This is kind of distracting - it's not too useful to know that you don't
have unhandled events, but if you do have some, it is useful to print
those as you do below - it points out that some event type are being
unnecessarily recorded or the script is being run on the wrong trace
data.

Thanks,

Tom

> + return;
> + }
> +
> + print "\nunhandled events:\n\n";
> +
> + printf("%-40s %10s\n", "event", "count");
> + printf("%-40s %10s\n", "----------------------------------------",
> + "-----------");
> +
> + foreach my $event_name (keys %unhandled) {
> + printf("%-40s %10d\n", $event_name, $unhandled{$event_name});
> + }
> +}
> +
> +sub minor
> +{
> + my $dev = shift;
> + return $dev & ((1 << 20) - 1);
> +}
> +
> +sub major
> +{
> + my $dev = shift;
> + return $dev >> 20;
> +}
> +
> +sub print_pagecache_usage_per_file
> +{
> + print "[file list]\n";
> + printf(" %12s %10s %8s\n", "", "", "cached");
> + printf(" %12s %10s %8s\n", "device", "inode", "pages");
> + printf(" %s\n", '-' x 32);
> + while(my($dev, $file) = each(%files)) {
> + while(my($inode, $r) = each(%$file)) {
> + my $count = values %$r;
> + next if $count == 0;
> + printf(" %12s %10d %8d\n",
> + major($dev).":".minor($dev), $inode, $count);
> + }
> + }
> +}
> +
> +sub print_pagecache_usage_per_process
> +{
> + print "[process list]\n";
> + while(my ($pid, $v) = each(%records)) {
> + my ($sum_cached, $sum_added, $sum_removed, $sum_indirect_removed);
> +
> + print "o $pid\n";
> + printf(" %12s %10s %8s %8s %8s %13s\n", "", "",
> + "cached", "added", "removed", "indirect");
> + printf(" %12s %10s %8s %8s %8s %13s\n", "device", "inode",
> + "pages", "pages", "pages", "removed pages");
> + printf(" %s\n", '-' x 64);
> + while(my ($file, $r) = each(%$v)) {
> + my $added_num = List::Util::sum(values %{$$r{added}});
> + $sum_cached += $$r{cached};
> + $sum_added += $added_num;
> + $sum_removed += $$r{removed};
> + $sum_indirect_removed += $$r{indirect_removed};
> + printf(" %12s %10d %8d %8d %8d %13d\n",
> + major($$r{dev}).":".minor($$r{dev}), $$r{inode},
> + $$r{cached}, $added_num, $$r{removed},
> + $$r{indirect_removed});
> + }
> + printf(" %s\n", '-' x 64);
> + printf(" total: %5s %10s %8d %8d %8d %13d\n", "", "", $sum_cached,
> + $sum_added, $sum_removed, $sum_indirect_removed);
> + print "\n";
> + }
> +}
>
>
>

2010-01-25 22:19:32

by Keiichi KII

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 2/2 v2] add a scripts for pagecache usage per process

(2010年01月23日 03:21), Tom Zanussi wrote:
> Hi,
>
> On Fri, 2010-01-22 at 19:08 -0500, Keiichi KII wrote:
>> The scripts are implemented based on the trace stream scripting support.
>> And the scripts implement the following.
>> - how many pagecaches each process has per each file
>> - how many pages are cached per each file
>> - how many pagecaches each process shares
>>
>
> Nice, it looks like a very useful script - I gave it a quick try and it
> seems to work well...
>
> The only problem I see, nothing to do with your script and nothing you
> can do anything about at the moment, is that the record step generates a
> huge amount of data, which of course makes the event processing take
> awhile. A lot of it appears to be due to perf itself - being able to
> filter out the perf-generated events in the kernel would make a big
> difference, I think; you normally don't want to see those anyway...

Yes, right. I don't want to process the data of perf itself.
I will try to find any way to solve this problem.

> BTW, I see that you did your first version in Python - not that you'd
> want to redo it again, but just FYI I now have working Python support
> that I'll be posting soon - I still have some small details to hammer
> out, but if you have any other scripts in the pipeline, in a couple days
> you'll be able to use Python instead if you want.

It will help me create some scripts. I will use Python support if it is posted.

>> To monitor pagecache usage per a process, run "pagecache-usage-record" to
>> record perf data for "pagecache-usage.pl" and run "pagecache-usage-report"
>> to display.
>
> Another way of course would be to use 'perf trace record/report' and the
> script name as shown by perf trace -l:
>
> $ perf trace record pagecache-usage
> $ perf trace report pagecache-usage

Thank you for your information.
I will use this way.

>> Index: linux-2.6-tip/tools/perf/scripts/perl/pagecache-usage.pl
>> ===================================================================
>> --- /dev/null
>> +++ linux-2.6-tip/tools/perf/scripts/perl/pagecache-usage.pl
>> @@ -0,0 +1,160 @@
>> +# perf trace event handlers, generated by perf trace -g perl
>
> You might want to get rid of this and add a short description and your
> name, if you want to take credit for it. ;-)
>
>> +# Licensed under the terms of the GNU GPL License version 2
>> +
>
>> +# The common_* event handler fields are the most useful fields common to
>> +# all events. They don't necessarily correspond to the 'common_*' fields
>> +# in the format files. Those fields not available as handler params can
>> +# be retrieved using Perl functions of the form common_*($context).
>> +# See Context.pm for the list of available functions.
>> +
>
> You can get rid of this part too - it's just meant to be helpful
> information generated when starting a script.
>

I'll remove unnecessary comments and add a short description next time.

>> +my %unhandled;
>> +
>> +sub trace_unhandled
>> +{
>> + my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
>> + $common_pid, $common_comm) = @_;
>> +
>> + $unhandled{$event_name}++;
>> +}
>> +
>> +sub print_unhandled
>> +{
>> + if ((scalar keys %unhandled) == 0) {
>> + print "unhandled events nothing\n";
>
> This is kind of distracting - it's not too useful to know that you don't
> have unhandled events, but if you do have some, it is useful to print
> those as you do below - it points out that some event type are being
> unnecessarily recorded or the script is being run on the wrong trace
> data.

I don't have unhandled events to monitor pagecache usage as you say.
So I will remove these code.

Thanks,
Keiichi

2010-01-25 22:19:48

by Keiichi KII

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 1/2 v2] add tracepoints for pagecache

(2010年01月22日 21:28), Steven Rostedt wrote:
>> +TRACE_EVENT(remove_from_page_cache,
>> +
>> + TP_PROTO(struct address_space *mapping, pgoff_t offset),
>> +
>> + TP_ARGS(mapping, offset),
>> +
>> + TP_STRUCT__entry(
>> + __field(dev_t, s_dev)
>> + __field(ino_t, i_ino)
>> + __field(pgoff_t, offset)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->s_dev = mapping->host->i_sb->s_dev;
>> + __entry->i_ino = mapping->host->i_ino;
>> + __entry->offset = offset;
>> + ),
>> +
>> + TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu", MAJOR(__entry->s_dev),
>> + MINOR(__entry->s_dev), __entry->i_ino, __entry->offset)
>> +);
>> +
>
> The above qualify in converting to templates or DECLACE_TRACE_CLASS, and
> DEFINE_EVENT, That is, rename the above TRACE_EVENT into
> DECLARE_TRACE_CLASS, and then have the other one be a DEFINE_EVENT().
> See the trace/event/sched.h for examples.
>
> The TRACE_EVENT can add a bit of code, so use DECLARE_TRACE_CLASS when
> possible and it will save on the size overhead.

Thank you for your information. I'll fix it next time.

Thanks,
Keiichi

2010-02-01 08:40:44

by Tom Zanussi

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 2/2 v2] add a scripts for pagecache usage per process

Hi,

On Mon, 2010-01-25 at 17:16 -0500, Keiichi KII wrote:
> (2010年01月23日 03:21), Tom Zanussi wrote:
> > Hi,
> >
> > On Fri, 2010-01-22 at 19:08 -0500, Keiichi KII wrote:
> >> The scripts are implemented based on the trace stream scripting support.
> >> And the scripts implement the following.
> >> - how many pagecaches each process has per each file
> >> - how many pages are cached per each file
> >> - how many pagecaches each process shares
> >>
> >
> > Nice, it looks like a very useful script - I gave it a quick try and it
> > seems to work well...
> >
> > The only problem I see, nothing to do with your script and nothing you
> > can do anything about at the moment, is that the record step generates a
> > huge amount of data, which of course makes the event processing take
> > awhile. A lot of it appears to be due to perf itself - being able to
> > filter out the perf-generated events in the kernel would make a big
> > difference, I think; you normally don't want to see those anyway...
>
> Yes, right. I don't want to process the data of perf itself.
> I will try to find any way to solve this problem.
>

Here's one way, using the tracepoint filters - it does make a big
difference in this case.

Before (using the new -P option, which includes perf in the trace
data):

root@tropicana:~# perf record -c 1 -f -a -M -R -e filemap:add_to_page_cache -e filemap:find_get_page -e filemap:remove_from_page_cache -P sleep 5
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 71.201 MB perf.data (~3110815 samples) ]

After (filters out events generated by perf):

root@tropicana:~# perf record -c 1 -f -a -M -R -e filemap:add_to_page_cache -e filemap:find_get_page -e filemap:remove_from_page_cache sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.309 MB perf.data (~13479 samples) ]

Tom

[PATCH] perf record: filter out perf process tracepoint events

The perf process itself can generate a lot of trace data, which most
of the time isn't of any interest. This patch adds a predicate to the
kernel tracepoint filter of each recorded event type which effectively
screens out any event generated by perf.

Assuming the common case would be to ignore perf, this makes it the
default; the old behavior can be selected by using 'perf record -P'.

Signed-off-by: Tom Zanussi <[email protected]>
---
tools/perf/builtin-record.c | 44 ++++++++++++++++++++++++++++++++++++++++
tools/perf/util/parse-events.h | 3 +-
2 files changed, 46 insertions(+), 1 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index eea5691..5fa113a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -49,6 +49,8 @@ static int no_samples = 0;
static int sample_address = 0;
static int multiplex = 0;
static int multiplex_fd = -1;
+static int include_perf = 0;
+static char exclude_perf_pred[MAX_FILTER_STR_VAL];

static long samples = 0;
static struct timeval last_read;
@@ -236,6 +238,37 @@ static struct perf_header_attr *get_header_attr(struct perf_event_attr *a, int n
return h_attr;
}

+static char *add_exclude_perf_pred(char *old_filter)
+{
+ int len = strlen(exclude_perf_pred);
+ char *filter;
+
+ if (old_filter != NULL)
+ len = len + strlen(" && ()") + strlen(old_filter);
+
+ if (len >= MAX_FILTER_STR_VAL) {
+ fprintf(stderr, "excluding perf exceeds max filter length,"
+ " use -P instead\n");
+ exit(-1);
+ }
+
+ filter = malloc(len + 1);
+ if (!filter) {
+ fprintf(stderr, "not enough memory to hold filter string\n");
+ exit(-1);
+ }
+
+ strcpy(filter, exclude_perf_pred);
+
+ if (old_filter) {
+ strcat(filter, " && (");
+ strcat(filter, old_filter);
+ strcat(filter, ")");
+ }
+
+ return filter;
+}
+
static void create_counter(int counter, int cpu, pid_t pid)
{
char *filter = filters[counter];
@@ -375,6 +408,12 @@ try_again:
}
}

+ if (include_perf == 0 && cpu == 0) {
+ filters[counter] = add_exclude_perf_pred(filter);
+ free(filter);
+ filter = filters[counter];
+ }
+
if (filter != NULL) {
ret = ioctl(fd[nr_cpu][counter],
PERF_EVENT_IOC_SET_FILTER, filter);
@@ -677,6 +716,8 @@ static const struct option options[] = {
"don't sample"),
OPT_BOOLEAN('M', "multiplex", &multiplex,
"multiplex counter output in a single channel"),
+ OPT_BOOLEAN('P', "include-perf", &include_perf,
+ "include perf in trace data (normally excluded)"),
OPT_END()
};

@@ -716,5 +757,8 @@ int cmd_record(int argc, const char **argv, const char *prefix __used)
attrs[counter].sample_period = default_interval;
}

+ if (include_perf == 0)
+ sprintf(exclude_perf_pred, "common_pid != %d", getpid());
+
return __cmd_record(argc, argv);
}
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index b8c1f64..d43d9b0 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -25,7 +25,8 @@ extern const char *__event_name(int type, u64 config);
extern int parse_events(const struct option *opt, const char *str, int unset);
extern int parse_filter(const struct option *opt, const char *str, int unset);

-#define EVENTS_HELP_MAX (128*1024)
+#define EVENTS_HELP_MAX (128*1024)
+#define MAX_FILTER_STR_VAL 256

extern void print_events(void);

--
1.6.4.GIT


2010-02-01 21:23:35

by Keiichi KII

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 2/2 v2] add a scripts for pagecache usage per process

Hello,

(02/01/10 03:17), Tom Zanussi wrote:
> Here's one way, using the tracepoint filters - it does make a big
> difference in this case.
>
> Before (using the new -P option, which includes perf in the trace
> data):
>
> root@tropicana:~# perf record -c 1 -f -a -M -R -e filemap:add_to_page_cache -e filemap:find_get_page -e filemap:remove_from_page_cache -P sleep 5
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 71.201 MB perf.data (~3110815 samples) ]
>
> After (filters out events generated by perf):
>
> root@tropicana:~# perf record -c 1 -f -a -M -R -e filemap:add_to_page_cache -e filemap:find_get_page -e filemap:remove_from_page_cache sleep 5
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.309 MB perf.data (~13479 samples) ]

I tried your patch.
It works well for me. Thank you.

Keiichi

2010-02-23 17:54:13

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 2/2 v2] add a scripts for pagecache usage per process

On Mon, Feb 01, 2010 at 02:17:35AM -0600, Tom Zanussi wrote:
> Here's one way, using the tracepoint filters - it does make a big
> difference in this case.
>
> Before (using the new -P option, which includes perf in the trace
> data):
>
> root@tropicana:~# perf record -c 1 -f -a -M -R -e filemap:add_to_page_cache -e filemap:find_get_page -e filemap:remove_from_page_cache -P sleep 5
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 71.201 MB perf.data (~3110815 samples) ]
>
> After (filters out events generated by perf):
>
> root@tropicana:~# perf record -c 1 -f -a -M -R -e filemap:add_to_page_cache -e filemap:find_get_page -e filemap:remove_from_page_cache sleep 5
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.309 MB perf.data (~13479 samples) ]
>
> Tom
>
> [PATCH] perf record: filter out perf process tracepoint events
>
> The perf process itself can generate a lot of trace data, which most
> of the time isn't of any interest. This patch adds a predicate to the
> kernel tracepoint filter of each recorded event type which effectively
> screens out any event generated by perf.
>
> Assuming the common case would be to ignore perf, this makes it the
> default; the old behavior can be selected by using 'perf record -P'.


I think filtering out perf from the instrumentation is a very
desirable features.

But I see two drawbacks with this patch.
First of all, we want to keep perf as a part of the instrumentation
as a default behaviour I think, as it is a true part of the system
wide load. So I would rather suggest to keep it as a default and
have an exclude_perf option instead of include_perf.

The other downside is that this filtering only applies to ftrace events
and not to other perf events. I would expect an exclude_perf option
to apply to every events, not just a family of them.

This is not that easy though. It's trivial for a process bound
instrumentation as we only need to use enable_on_exec for that
(assuming we create the targeted process from perf).

Otherwise we need the cpu events to filter out a given context, which
needs to be done from the kernel, on events scheduling time.
It's just an idea, I'm adding more interested parties in Cc.

Thanks.

2010-02-23 18:13:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 2/2 v2] add a scripts for pagecache usage per process

On Tue, 2010-02-23 at 18:54 +0100, Frederic Weisbecker wrote:

> > [PATCH] perf record: filter out perf process tracepoint events
> >
> > The perf process itself can generate a lot of trace data, which most
> > of the time isn't of any interest. This patch adds a predicate to the
> > kernel tracepoint filter of each recorded event type which effectively
> > screens out any event generated by perf.
> >
> > Assuming the common case would be to ignore perf, this makes it the
> > default; the old behavior can be selected by using 'perf record -P'.
>
>
> I think filtering out perf from the instrumentation is a very
> desirable features.
>
> But I see two drawbacks with this patch.
> First of all, we want to keep perf as a part of the instrumentation
> as a default behaviour I think, as it is a true part of the system
> wide load. So I would rather suggest to keep it as a default and
> have an exclude_perf option instead of include_perf.

Agreed, excluding by default doesn't seem to be a sane idea.