2010-02-05 02:29:20

by Keiichi KII

Subject: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

Hello,

This is v3 of a patchset to add some tracepoints for pagecache.

I would like to propose several tracepoints for tracing pagecache behavior,
together with scripts that use them.
Using the tracepoints and the scripts together, we can analyze pagecache
behavior, such as usage or hit ratio, at a fine granularity, for example per
process or per file.
Example output of the scripts looks like:

[process list]
o yum-3215
                         cache find  cache hit  cache hit
       device      inode      count      count      ratio
 --------------------------------------------------------
        253:0         16      34434      34130     99.12%
        253:0        198       9692       9463     97.64%
        253:0        639        647        628     97.06%
        253:0        778         32         29     90.62%
        253:0       7305      50225      49005     97.57%
        253:0     144217         12         10     83.33%
        253:0     262775         16         13     81.25%
*snip*

-------------------------------------------------------------------------------

[file list]
       device              cached
    (maj:min)      inode    pages
 --------------------------------
        253:0         16     5752
        253:0        198     2233
        253:0        639       51
        253:0        778       86
        253:0       7305    12307
        253:0     144217       11
        253:0     262775       39
*snip*

[process list]
o yum-3215
       device              cached    added  removed      indirect
    (maj:min)      inode    pages    pages    pages removed pages
 ----------------------------------------------------------------
        253:0         16    34130     5752        0             0
        253:0        198     9463     2233        0             0
        253:0        639      628       51        0             0
        253:0        778       29       78        0             0
        253:0       7305    49005    12307        0             0
        253:0     144217       10       11        0             0
        253:0     262775       13       39        0             0
*snip*
 ----------------------------------------------------------------
 total:                    102346    26165        1             0

Currently, we can get system-wide pagecache usage from /proc/meminfo.
But we have no way to get finer-grained information, such as per-file or
per-process usage.
A process may share pagecache pages with other processes, add pages to the
pagecache, or remove pages from it.
If the pagecache miss ratio rises, it may lead to extra I/O and
affect system performance.

So, by using the tracepoints, we can get the following information:
1. how many pagecache pages each process has for each file
2. how many pages are cached for each file
3. how many pagecache pages each process shares
4. how often each process adds/removes pagecache pages
5. how long a page stays in the pagecache
6. pagecache hit ratio per file

In particular, monitoring pagecache usage per file and the pagecache hit
ratio would help us tune applications such as databases.
It would also help us tune kernel parameters such as "vm.dirty_*".
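
(Illustrative sketch, not part of the posted patchset: the per-file hit ratio mentioned above could be derived by counting find_get_page events per process and per file. The event tuples and the name `hit_ratios` are hypothetical; the actual scripts are the Perl ones in patch 2/2.)

```python
from collections import defaultdict


def hit_ratios(events):
    """Count lookups and hits per (process, device, inode).

    `events` is a hypothetical iterable of
    (process, device, inode, page_found) tuples, one per
    find_get_page event.
    """
    finds = defaultdict(int)
    hits = defaultdict(int)
    for process, device, inode, page_found in events:
        key = (process, device, inode)
        finds[key] += 1
        if page_found:
            hits[key] += 1
    # Ratio in percent; every key has at least one lookup.
    return {key: 100.0 * hits[key] / finds[key] for key in finds}


# Made-up events: three lookups on one file (one miss), one on another.
sample = [
    ("yum-3215", "253:0", 16, True),
    ("yum-3215", "253:0", 16, False),
    ("yum-3215", "253:0", 16, True),
    ("yum-3215", "253:0", 198, True),
]
for key, ratio in sorted(hit_ratios(sample).items()):
    print(key, "%.2f%%" % ratio)
```

Keying on the (process, device, inode) triple keeps the per-process and per-file views in one table, so either view can be obtained by summing over the other part of the key.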

Changelog since v2
o add new script to monitor pagecache hit ratio per process.
o use DECLARE_EVENT_CLASS

Changelog since v1
o Add a script based on "perf trace stream scripting support".

Any comments are welcome.
--
Keiichi Kii <[email protected]>


2010-02-05 02:29:31

by Keiichi KII

Subject: [RFC PATCH -tip 1/2 v3] tracepoints: add tracepoints for pagecache

This patch adds several tracepoints to track pagecache behavior.
These tracepoints would help us monitor pagecache usage at a fine granularity.
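
(Illustrative sketch, not part of the patch: the find_get_page event in this patch prints its payload as "s_dev=%u:%u i_ino=%lu offset=%lu %s", where the last field is "page_found" or "page_not_found". A consumer of the text output could parse that payload roughly as follows. The function name and the sample line are hypothetical.)

```python
import re

# Mirrors the TP_printk format of the find_get_page event in this patch.
FIND_GET_PAGE = re.compile(
    r"s_dev=(\d+):(\d+) i_ino=(\d+) offset=(\d+) (page_found|page_not_found)"
)


def parse_find_get_page(text):
    """Return the event fields as a dict, or None if the text does not match."""
    match = FIND_GET_PAGE.search(text)
    if match is None:
        return None
    major, minor, ino, offset, status = match.groups()
    return {
        "major": int(major),
        "minor": int(minor),
        "ino": int(ino),
        "offset": int(offset),
        "hit": status == "page_found",
    }


# Hypothetical payload, shaped like the format string above.
print(parse_find_get_page("s_dev=253:0 i_ino=16 offset=42 page_found"))
```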

Signed-off-by: Keiichi Kii <[email protected]>
Cc: Atsushi Tsuji <[email protected]>
---
include/trace/events/filemap.h | 83 +++++++++++++++++++++++++++++++++++++++++
mm/filemap.c | 5 ++
mm/truncate.c | 2
mm/vmscan.c | 3 +
4 files changed, 93 insertions(+)

Index: linux-2.6-tip/include/trace/events/filemap.h
===================================================================
--- /dev/null
+++ linux-2.6-tip/include/trace/events/filemap.h
@@ -0,0 +1,75 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM filemap
+
+#if !defined(_TRACE_FILEMAP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FILEMAP_H
+
+#include <linux/fs.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(find_get_page,
+
+ TP_PROTO(struct address_space *mapping, pgoff_t offset,
+ struct page *page),
+
+ TP_ARGS(mapping, offset, page),
+
+ TP_STRUCT__entry(
+ __field(dev_t, s_dev)
+ __field(ino_t, i_ino)
+ __field(pgoff_t, offset)
+ __field(struct page *, page)
+ ),
+
+ TP_fast_assign(
+ __entry->s_dev = mapping->host ? mapping->host->i_sb->s_dev : 0;
+ __entry->i_ino = mapping->host ? mapping->host->i_ino : 0;
+ __entry->offset = offset;
+ __entry->page = page;
+ ),
+
+ TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu %s", MAJOR(__entry->s_dev),
+ MINOR(__entry->s_dev), __entry->i_ino, __entry->offset,
+ __entry->page == NULL ? "page_not_found" : "page_found")
+);
+
+DECLARE_EVENT_CLASS(page_cache_template,
+
+ TP_PROTO(struct address_space *mapping, pgoff_t offset),
+
+ TP_ARGS(mapping, offset),
+
+ TP_STRUCT__entry(
+ __field(dev_t, s_dev)
+ __field(ino_t, i_ino)
+ __field(pgoff_t, offset)
+ ),
+
+ TP_fast_assign(
+ __entry->s_dev = mapping->host->i_sb->s_dev;
+ __entry->i_ino = mapping->host->i_ino;
+ __entry->offset = offset;
+ ),
+
+ TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu", MAJOR(__entry->s_dev),
+ MINOR(__entry->s_dev), __entry->i_ino, __entry->offset)
+);
+
+DEFINE_EVENT(page_cache_template, add_to_page_cache,
+
+ TP_PROTO(struct address_space *mapping, pgoff_t offset),
+
+ TP_ARGS(mapping, offset)
+);
+
+DEFINE_EVENT(page_cache_template, remove_from_page_cache,
+
+ TP_PROTO(struct address_space *mapping, pgoff_t offset),
+
+ TP_ARGS(mapping, offset)
+);
+
+#endif /* _TRACE_FILEMAP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
Index: linux-2.6-tip/mm/filemap.c
===================================================================
--- linux-2.6-tip.orig/mm/filemap.c
+++ linux-2.6-tip/mm/filemap.c
@@ -34,6 +34,8 @@
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
+#define CREATE_TRACE_POINTS
+#include <trace/events/filemap.h>
#include "internal.h"

/*
@@ -149,6 +151,7 @@ void remove_from_page_cache(struct page
spin_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
+ trace_remove_from_page_cache(mapping, page->index);
mem_cgroup_uncharge_cache_page(page);
}

@@ -419,6 +422,7 @@ int add_to_page_cache_locked(struct page
if (PageSwapBacked(page))
__inc_zone_page_state(page, NR_SHMEM);
spin_unlock_irq(&mapping->tree_lock);
+ trace_add_to_page_cache(mapping, offset);
} else {
page->mapping = NULL;
spin_unlock_irq(&mapping->tree_lock);
@@ -642,6 +646,7 @@ repeat:
}
rcu_read_unlock();

+ trace_find_get_page(mapping, offset, page);
return page;
}
EXPORT_SYMBOL(find_get_page);
Index: linux-2.6-tip/mm/truncate.c
===================================================================
--- linux-2.6-tip.orig/mm/truncate.c
+++ linux-2.6-tip/mm/truncate.c
@@ -20,6 +20,7 @@
do_invalidatepage */
#include "internal.h"

+#include <trace/events/filemap.h>

/**
* do_invalidatepage - invalidate part or all of a page
@@ -388,6 +389,7 @@ invalidate_complete_page2(struct address
BUG_ON(page_has_private(page));
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
+ trace_remove_from_page_cache(mapping, page->index);
mem_cgroup_uncharge_cache_page(page);
page_cache_release(page); /* pagecache ref */
return 1;
Index: linux-2.6-tip/mm/vmscan.c
===================================================================
--- linux-2.6-tip.orig/mm/vmscan.c
+++ linux-2.6-tip/mm/vmscan.c
@@ -48,6 +48,8 @@

#include "internal.h"

+#include <trace/events/filemap.h>
+
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -477,6 +479,7 @@ static int __remove_mapping(struct addre
} else {
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
+ trace_remove_from_page_cache(mapping, page->index);
mem_cgroup_uncharge_cache_page(page);
}



2010-02-05 02:29:34

by Keiichi KII

Subject: [RFC PATCH -tip 2/2 v3] add scripts for pagecache analysis per process

The scripts are built on the perf trace stream scripting support and depend
on the pagecache tracepoints. They report the following:

- pagecache hit ratio per process
- how many pagecache pages each process has for each file
- how many pages are cached for each file
- how many pagecache pages each process shares

To monitor pagecache hit ratio per process, run "pagecache-hit-ratio-record"
or "perf trace record pagecache-hit-ratio" to record perf data for
"pagecache-hit-ratio.pl" and run "pagecache-hit-ratio-report" or
"perf trace report pagecache-hit-ratio" to display.

The output below shows a sample run.

[process list]
o yum-3215
                         cache find  cache hit  cache hit
       device      inode      count      count      ratio
 --------------------------------------------------------
        253:0         16      34434      34130     99.12%
        253:0        198       9692       9463     97.64%
        253:0        639        647        628     97.06%
        253:0        778         32         29     90.62%
        253:0       7305      50225      49005     97.57%
        253:0     144217         12         10     83.33%
        253:0     262775         16         13     81.25%
*snip*

To monitor pagecache usage per process, run "pagecache-usage-record" or
"perf trace record pagecache-usage" to record perf data for
"pagecache-usage.pl" and run "pagecache-usage-report" or "perf trace report
pagecache-usage" to display.

The output below shows a sample run.

[file list]
       device              cached
    (maj:min)      inode    pages
 --------------------------------
        253:0         16     5752
        253:0        198     2233
        253:0        639       51
        253:0        778       86
        253:0       7305    12307
        253:0     144217       11
        253:0     262775       39
*snip*

[process list]
o yum-3215
       device              cached    added  removed      indirect
    (maj:min)      inode    pages    pages    pages removed pages
 ----------------------------------------------------------------
        253:0         16    34130     5752        0             0
        253:0        198     9463     2233        0             0
        253:0        639      628       51        0             0
        253:0        778       29       78        0             0
        253:0       7305    49005    12307        0             0
        253:0     144217       10       11        0             0
        253:0     262775       13       39        0             0
*snip*
 ----------------------------------------------------------------
 total:                    102346    26165        1             0

From the output, we can tell, for example:

- if "added pages" > "cached pages" on process list then
  It means pagecache pages are being added and removed repeatedly.
  => Bad case for pagecache usage

- if "added pages" <= "cached pages" on process list then
  It means there are no unnecessary I/O operations.
  => Good case for pagecache usage.

- if "cached pages" on file list >
  sum "cached pages" per each file on process list then
  It means there are unnecessary pagecache pages in the memory.
  => Bad case for pagecache usage
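
(Illustrative sketch, not part of the patch: the three rules above can be written as two small checks. The function names are hypothetical, and the numbers are taken from the yum-3215 sample output shown earlier in this mail.)

```python
def classify_row(cached_pages, added_pages):
    """Apply the first two rules to one row of the process list."""
    if added_pages > cached_pages:
        return "bad"   # pages were added and removed repeatedly
    return "good"      # no unnecessary I/O


def has_unneeded_pages(file_cached_pages, per_process_cached_pages):
    """Third rule: the file list holds more pages than the processes account for."""
    return file_cached_pages > sum(per_process_cached_pages)


# (inode, cached pages, added pages) from the process list sample.
rows = [(16, 34130, 5752), (778, 29, 78), (144217, 10, 11)]
for inode, cached, added in rows:
    print(inode, classify_row(cached, added))

# inode 778: 86 cached pages on the file list, 29 on yum's process list.
print(has_unneeded_pages(86, [29]))
```

These checks only restate the heuristics as given; whether a "bad" row really indicates a problem depends on the workload.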

Signed-off-by: Keiichi Kii <[email protected]>
Cc: Atsushi Tsuji <[email protected]>
---
tools/perf/scripts/perl/bin/pagecache-hit-ratio-record | 7
tools/perf/scripts/perl/bin/pagecache-hit-ratio-report | 6
tools/perf/scripts/perl/bin/pagecache-usage-record | 7
tools/perf/scripts/perl/bin/pagecache-usage-report | 6
tools/perf/scripts/perl/pagecache-hit-ratio.pl | 75 +++++++++
tools/perf/scripts/perl/pagecache-usage.pl | 136 +++++++++++++++++
6 files changed, 237 insertions(+)

Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-record
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-record
@@ -0,0 +1,7 @@
+#!/bin/bash
+perf record -c 1 -f -a -M -R -e filemap:add_to_page_cache -e filemap:find_get_page -e filemap:remove_from_page_cache
+
+
+
+
+
Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-report
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-report
@@ -0,0 +1,6 @@
+#!/bin/bash
+# description: pagecache usage per process
+perf trace -s ~/libexec/perf-core/scripts/perl/pagecache-usage.pl
+
+
+
Index: linux-2.6-tip/tools/perf/scripts/perl/pagecache-usage.pl
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/pagecache-usage.pl
@@ -0,0 +1,136 @@
+#!/usr/bin/perl -w
+# (C) 2010, Keiichi Kii <[email protected]>
+# Licensed under the terms of the GNU GPL License version 2
+
+# Display pagecache usage per a process
+
+use lib "$ENV{'PERF_EXEC_PATH'}/scripts/perl/Perf-Trace-Util/lib";
+use lib "./Perf-Trace-Util/lib";
+use Perf::Trace::Core;
+use Perf::Trace::Context;
+use Perf::Trace::Util;
+use List::Util qw/sum/;
+
+my %files;
+my %processes;
+my %records;
+
+sub trace_end
+{
+ print_pagecache_usage_per_file();
+ print "\n";
+ print_pagecache_usage_per_process();
+}
+
+sub filemap::remove_from_page_cache
+{
+ my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+ $common_pid, $common_comm,
+ $s_dev, $i_ino, $offset) = @_;
+ my ($f, $r) = get_record($common_comm."-".$common_pid, $s_dev, $i_ino);
+
+ delete $$f{$offset};
+ if (defined $$r{added}{$offset}) {
+ $$r{removed}++;
+ } else {
+ $$r{indirect_removed}++;
+ }
+}
+
+sub filemap::add_to_page_cache
+{
+ my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+ $common_pid, $common_comm,
+ $s_dev, $i_ino, $offset) = @_;
+ my ($f, $r) = get_record($common_comm."-".$common_pid, $s_dev, $i_ino);
+
+ $$f{$offset}++;
+ $$r{added}{$offset}++;
+}
+
+sub filemap::find_get_page
+{
+ my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+ $common_pid, $common_comm,
+ $s_dev, $i_ino, $offset, $page) = @_;
+ my ($f, $r) = get_record($common_comm."-".$common_pid, $s_dev, $i_ino);
+
+ if ($page != 0) {
+ $$f{$offset}++;
+ $$r{cached}++;
+ }
+}
+
+sub get_record
+{
+ my ($p, $dev, $inode) = @_;
+
+ unless (defined($files{$dev}{$inode})) {
+ $files{$dev}{$inode} = {};
+ }
+ $f = $files{$dev}{$inode};
+ unless (defined($records{$p}{$f})) {
+ $records{$p}{$f} =
+ {inode => $inode, dev => $dev, added => {},
+ cached => 0, removed => 0, indirect_removed => 0};
+ }
+ return $f, $records{$p}{$f};
+}
+
+sub minor
+{
+ my $dev = shift;
+ return $dev & ((1 << 20) - 1);
+}
+
+sub major
+{
+ my $dev = shift;
+ return $dev >> 20;
+}
+
+sub print_pagecache_usage_per_file
+{
+ print "[file list]\n";
+ printf(" %12s %10s %8s\n", "device", "", "cached");
+ printf(" %12s %10s %8s\n", "(maj:min)", "inode", "pages");
+ printf(" %s\n", '-' x 32);
+ while(my($dev, $file) = each(%files)) {
+ foreach my $inode (sort { $a <=> $b } keys %$file) {
+ my $count = values %{$$file{$inode}};
+ next if $count == 0;
+ printf(" %12s %10d %8d\n",
+ major($dev).":".minor($dev), $inode, $count);
+ }
+ }
+}
+
+sub print_pagecache_usage_per_process
+{
+ print "[process list]\n";
+ while(my ($pid, $v) = each(%records)) {
+ my ($sum_cached, $sum_added, $sum_removed, $sum_indirect_removed);
+ print "o $pid\n";
+ printf(" %12s %10s %8s %8s %8s %13s\n", "device", "",
+ "cached", "added", "removed", "indirect");
+ printf(" %12s %10s %8s %8s %8s %13s\n", "(maj:min)", "inode",
+ "pages", "pages", "pages", "removed pages");
+ printf(" %s\n", '-' x 64);
+ foreach my $r (sort { $$a{inode} <=> $$b{inode} } values %$v) {
+ my $added_num = scalar(keys %{$$r{added}}) == 0 ?
+ 0 : List::Util::sum(values %{$$r{added}});
+ $sum_cached += $$r{cached};
+ $sum_added += $added_num;
+ $sum_removed += $$r{removed};
+ $sum_indirect_removed += $$r{indirect_removed};
+ printf(" %12s %10d %8d %8d %8d %13d\n",
+ major($$r{dev}).":".minor($$r{dev}), $$r{inode},
+ $$r{cached}, $added_num, $$r{removed},
+ $$r{indirect_removed});
+ }
+ printf(" %s\n", '-' x 64);
+ printf(" total: %5s %10s %8d %8d %8d %13d\n", "", "", $sum_cached,
+ $sum_added, $sum_removed, $sum_indirect_removed);
+ print "\n";
+ }
+}
Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-hit-ratio-record
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-hit-ratio-record
@@ -0,0 +1,7 @@
+#!/bin/bash
+perf record -c 1 -f -a -M -R -e filemap:find_get_page
+
+
+
+
+
Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-hit-ratio-report
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-hit-ratio-report
@@ -0,0 +1,6 @@
+#!/bin/bash
+# description: monitor pagecache hit ratio per process
+perf trace -s ~/libexec/perf-core/scripts/perl/pagecache-hit-ratio.pl
+
+
+
Index: linux-2.6-tip/tools/perf/scripts/perl/pagecache-hit-ratio.pl
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/pagecache-hit-ratio.pl
@@ -0,0 +1,75 @@
+#!/usr/bin/perl -w
+# (C) 2010, Keiichi Kii <[email protected]>
+# Licensed under the terms of the GNU GPL License version 2
+
+# Display pagecache hit ratio per process
+
+use lib "$ENV{'PERF_EXEC_PATH'}/scripts/perl/Perf-Trace-Util/lib";
+use lib "./Perf-Trace-Util/lib";
+use Perf::Trace::Core;
+use Perf::Trace::Context;
+use Perf::Trace::Util;
+
+my %records;
+
+sub trace_end
+{
+ print_pagecache_hit_ratio();
+}
+
+sub filemap::find_get_page
+{
+ my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+ $common_pid, $common_comm,
+ $s_dev, $i_ino, $offset, $page) = @_;
+ my $r = get_record($common_comm."-".$common_pid, $s_dev, $i_ino);
+
+ if ($page != 0) {
+ $$r{hit}++;
+ } else {
+ $$r{miss}++;
+ }
+}
+
+sub get_record
+{
+ my ($p, $dev, $inode) = @_;
+
+ unless (defined($records{$p}{$dev.":".$inode})) {
+ $records{$p}{$dev.":".$inode} = {inode => $inode, dev => $dev,
+ hit => 0, miss => 0};
+ }
+ return $records{$p}{$dev.":".$inode};
+}
+
+sub minor
+{
+ my $dev = shift;
+ return $dev & ((1 << 20) - 1);
+}
+
+sub major
+{
+ my $dev = shift;
+ return $dev >> 20;
+}
+
+sub print_pagecache_hit_ratio
+{
+ print "[process list]\n";
+ while(my ($pid, $v) = each(%records)) {
+ print "o $pid\n";
+ printf(" %12s %10s %10s %10s %10s\n", "", "",
+ "cache find", "cache hit", "cache hit");
+ printf(" %12s %10s %10s %10s %10s\n", "device", "inode",
+ "count", "count", "ratio");
+ printf(" %s\n", '-' x 56);
+ foreach my $r (sort { $$a{inode} <=> $$b{inode} } values %$v) {
+ printf(" %12s %10d %10d %10d %9.2f%%\n",
+ major($$r{dev}).":".minor($$r{dev}), $$r{inode},
+ $$r{miss} + $$r{hit}, $$r{hit},
+ $$r{hit} / ($$r{miss} + $$r{hit}) * 100);
+ }
+ print "\n";
+ }
+}



2010-02-05 07:29:29

by Ingo Molnar

Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal


* Keiichi KII <[email protected]> wrote:

> Hello,
>
> This is v3 of a patchset to add some tracepoints for pagecache.
>
> I would propose several tracepoints for tracing pagecache behavior and
> a script for these.
> By using both the tracepoints and the script, we can analysis pagecache behavior
> like usage or hit ratio with high resolution like per process or per file.
> Example output of the script looks like:
>
> [process list]
> o yum-3215
>                          cache find  cache hit  cache hit
>        device      inode      count      count      ratio
>  --------------------------------------------------------
>         253:0         16      34434      34130     99.12%
>         253:0        198       9692       9463     97.64%
>         253:0        639        647        628     97.06%
>         253:0        778         32         29     90.62%
>         253:0       7305      50225      49005     97.57%
>         253:0     144217         12         10     83.33%
>         253:0     262775         16         13     81.25%
> *snip*
>
> -------------------------------------------------------------------------------
>
> [file list]
>        device              cached
>     (maj:min)      inode    pages
>  --------------------------------
>         253:0         16     5752
>         253:0        198     2233
>         253:0        639       51
>         253:0        778       86
>         253:0       7305    12307
>         253:0     144217       11
>         253:0     262775       39
> *snip*
>
> [process list]
> o yum-3215
>        device              cached    added  removed      indirect
>     (maj:min)      inode    pages    pages    pages removed pages
>  ----------------------------------------------------------------
>         253:0         16    34130     5752        0             0
>         253:0        198     9463     2233        0             0
>         253:0        639      628       51        0             0
>         253:0        778       29       78        0             0
>         253:0       7305    49005    12307        0             0
>         253:0     144217       10       11        0             0
>         253:0     262775       13       39        0             0
> *snip*
>  ----------------------------------------------------------------
>  total:                    102346    26165        1             0
>
> We can now know system-wide pagecache usage by /proc/meminfo.
> But we have no method to get higher resolution information like per file or
> per process usage than system-wide one.
> A process may share some pagecache or add a pagecache to the memory or
> remove a pagecache from the memory.
> If a pagecache miss hit ratio rises, maybe it leads to extra I/O and
> affects system performance.
>
> So, by using the tracepoints we can get the following information.
> 1. how many pagecaches each process has per each file
> 2. how many pages are cached per each file
> 3. how many pagecaches each process shares
> 4. how often each process adds/removes pagecache
> 5. how long a pagecache stays in the memory
> 6. pagecache hit rate per file
>
> Especially, the monitoring pagecache usage per each file and pagecache hit
> ratio would help us tune some applications like database.
> And it will also help us tune the kernel parameters like "vm.dirty_*".
>
> Changelog since v2
> o add new script to monitor pagecache hit ratio per process.
> o use DECLARE_EVENT_CLASS
>
> Changelog since v1
> o Add a script based on "perf trace stream scripting support".
>
> Any comments are welcome.

Looks really nice IMO! It also demonstrates nicely the extensibility via
Tom's perf trace scripting engine. (which will soon get a Python script
engine as well, so Perl and C won't be the only possibilities for extending
perf.)

I've Cc:-ed a few parties who might be interested in this. Wu Fengguang has
done MM instrumentation in this area before - there might be some common
ground instead of scattered functionality in /proc, debugfs, perf and
elsewhere?

Note that there's also these older experimental commits in tip:tracing/mm
that introduce the notion of 'object collections' and add the ability to
trace them:

3383e37: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
c33b359: tracing, page-allocator: Add trace event for page traffic related to the buddy lists
0d524fb: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
b9a2817: tracing, page-allocator: Add trace events for page allocation and page freeing
08b6cb8: perf_counter tools: Provide default bfd_demangle() function in case it's not around
eb46710: tracing/mm: rename 'trigger' file to 'dump_range'
1487a7a: tracing/mm: fix mapcount trace record field
dcac8cd: tracing/mm: add page frame snapshot trace

this concept, if refreshed a bit and extended to the page cache, would allow
the recording/snapshotting of the MM state of all currently present pages in
the page-cache - a possibly nice addition to the dynamic technique you apply
in your patches.

there's similar "object collections" work underway for 'perf lock' btw., by
Hitoshi Mitake and Frederic.

So there's lots of common ground and lots of interest.

Btw., instead of "perf trace record pagecache-usage", you might want to think
about introducing a higher level tool as well: 'perf mm' or 'perf pagecache'
- just like we have 'perf kmem' for SLAB instrumentation, 'perf sched' for
scheduler instrumentation and 'perf lock' for locking instrumentation. [with
'perf timer' having been posted too.]

'perf mm' could then still map to Perl scripts, it's just a convenience. It
could then harbor other MM related instrumentation bits as well. Just an idea
- this is a possibility, if you are trying to achieve higher organization.

Thanks,

Ingo

2010-02-05 21:20:55

by Keiichi KII

Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

Hello,

(02/05/10 02:28), Ingo Molnar wrote:
> Looks really nice IMO! It also demonstrates nicely the extensibility via
> Tom's perf trace scripting engine. (which will soon get a Python script
> engine as well, so Perl and C wont be the only possibility to extend perf
> with.)
>
> I've Cc:-ed a few parties who might be interested in this. Wu Fengguang has
> done MM instrumentation in this area before - there might be some common
> ground instead of scattered functionality in /proc, debugfs, perf and
> elsewhere?
>
> Note that there's also these older experimental commits in tip:tracing/mm
> that introduce the notion of 'object collections' and adds the ability to
> trace them:
>
> 3383e37: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
> c33b359: tracing, page-allocator: Add trace event for page traffic related to the buddy lists
> 0d524fb: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
> b9a2817: tracing, page-allocator: Add trace events for page allocation and page freeing
> 08b6cb8: perf_counter tools: Provide default bfd_demangle() function in case it's not around
> eb46710: tracing/mm: rename 'trigger' file to 'dump_range'
> 1487a7a: tracing/mm: fix mapcount trace record field
> dcac8cd: tracing/mm: add page frame snapshot trace
>
> this concept, if refreshed a bit and extended to the page cache, would allow
> the recording/snapshotting of the MM state of all currently present pages in
> the page-cache - a possibly nice addition to the dynamic technique you apply
> in your patches.
> there's similar "object collections" work underway for 'perf lock' btw., by
> Hitoshi Mitake and Frederic.
>
> So there's lots of common ground and lots of interest.
>
> Btw., instead of "perf trace record pagecache-usage", you might want to think
> about introducing a higher level tool as well: 'perf mm' or 'perf pagecache'
> - just like we have 'perf kmem' for SLAB instrumentation, 'perf sched' for
> scheduler instrumentation and 'perf lock' for locking instrumentation. [with
> 'perf timer' having been posted too.]
>
> 'perf mm' could then still map to Perl scripts, it's just a convenience. It
> could then harbor other MM related instrumentation bits as well. Just an idea
> - this is a possibility, if you are trying to achieve higher organization.

Thank you for the information about "perf lock" and "tip:tracing/mm".
I think it would be very useful to merge the tracing/mm 'object collections'
work into "perf mm". So, as a next step, I will introduce a higher-level tool
like "perf mm" for MM-related things.
That work will help me implement "perf mm".

Tom's perf trace scripting engine is very flexible.
I will try to implement "perf mm" on top of his scripting engine and
have it harbor other MM-related instrumentation like the above if I can.

Thanks,
Keiichi

2010-02-08 13:04:20

by Balbir Singh

Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

* Keiichi KII <[email protected]> [2010-02-04 21:17:35]:

> Hello,
>
> This is v3 of a patchset to add some tracepoints for pagecache.
>
> I would propose several tracepoints for tracing pagecache behavior and
> a script for these.
> By using both the tracepoints and the script, we can analysis pagecache behavior
> like usage or hit ratio with high resolution like per process or per file.
> Example output of the script looks like:
>
> [process list]
> o yum-3215
>                          cache find  cache hit  cache hit
>        device      inode      count      count      ratio
>  --------------------------------------------------------
>         253:0         16      34434      34130     99.12%
>         253:0        198       9692       9463     97.64%
>         253:0        639        647        628     97.06%
>         253:0        778         32         29     90.62%
>         253:0       7305      50225      49005     97.57%
>         253:0     144217         12         10     83.33%
>         253:0     262775         16         13     81.25%
> *snip*

Very nice, we should be able to sum these to get a system wide view
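
(Illustrative sketch, not from the thread: summing the per-file counts gives an aggregate ratio. The numbers are the rows visible in the quoted sample; rows hidden behind *snip* and other processes are not included, so this is not a true system-wide figure.)

```python
# (cache find count, cache hit count) for the rows shown in the sample.
rows = [
    (34434, 34130),
    (9692, 9463),
    (647, 628),
    (32, 29),
    (50225, 49005),
    (12, 10),
    (16, 13),
]

total_finds = sum(finds for finds, _ in rows)
total_hits = sum(hits for _, hits in rows)
ratio = 100.0 * total_hits / total_finds

print("finds=%d hits=%d ratio=%.2f%%" % (total_finds, total_hits, ratio))
```

The aggregate is weighted by lookup count, so it is dominated by the busiest files rather than being an average of the per-file ratios.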

>
> -------------------------------------------------------------------------------
>
> [file list]
>        device              cached
>     (maj:min)      inode    pages
>  --------------------------------
>         253:0         16     5752
>         253:0        198     2233
>         253:0        639       51
>         253:0        778       86
>         253:0       7305    12307
>         253:0     144217       11
>         253:0     262775       39
> *snip*
>
> [process list]
> o yum-3215
>        device              cached    added  removed      indirect
>     (maj:min)      inode    pages    pages    pages removed pages
>  ----------------------------------------------------------------
>         253:0         16    34130     5752        0             0
>         253:0        198     9463     2233        0             0
>         253:0        639      628       51        0             0
>         253:0        778       29       78        0             0
>         253:0       7305    49005    12307        0             0
>         253:0     144217       10       11        0             0
>         253:0     262775       13       39        0             0
> *snip*
>  ----------------------------------------------------------------
>  total:                    102346    26165        1             0
                                                   ^^^
                                             Is this 1 stray?
>
> We can now know system-wide pagecache usage by /proc/meminfo.
> But we have no method to get higher resolution information like per file or
> per process usage than system-wide one.

It would be really nice if we could distinguish the mapped from the
unmapped page cache.

> A process may share some pagecache or add a pagecache to the memory or
> remove a pagecache from the memory.
> If a pagecache miss hit ratio rises, maybe it leads to extra I/O and
> affects system performance.
>
> So, by using the tracepoints we can get the following information.
> 1. how many pagecaches each process has per each file
> 2. how many pages are cached per each file
> 3. how many pagecaches each process shares
> 4. how often each process adds/removes pagecache
> 5. how long a pagecache stays in the memory
> 6. pagecache hit rate per file
>
> Especially, the monitoring pagecache usage per each file and pagecache hit
> ratio would help us tune some applications like database.
> And it will also help us tune the kernel parameters like "vm.dirty_*".
>
> Changelog since v2
> o add new script to monitor pagecache hit ratio per process.
> o use DECLARE_EVENT_CLASS
>
> Changelog since v1
> o Add a script based on "perf trace stream scripting support".
>
> Any comments are welcome.

--
Three Cheers,
Balbir

2010-02-08 16:11:32

by Fengguang Wu

Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

Hi Ingo,

> Note that there's also these older experimental commits in tip:tracing/mm
> that introduce the notion of 'object collections' and adds the ability to
> trace them:
>
> 3383e37: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
> c33b359: tracing, page-allocator: Add trace event for page traffic related to the buddy lists
> 0d524fb: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
> b9a2817: tracing, page-allocator: Add trace events for page allocation and page freeing
> 08b6cb8: perf_counter tools: Provide default bfd_demangle() function in case it's not around
> eb46710: tracing/mm: rename 'trigger' file to 'dump_range'
> 1487a7a: tracing/mm: fix mapcount trace record field
> dcac8cd: tracing/mm: add page frame snapshot trace
>
> this concept, if refreshed a bit and extended to the page cache, would allow
> the recording/snapshotting of the MM state of all currently present pages in
> the page-cache - a possibly nice addition to the dynamic technique you apply
> in your patches.
>
> there's similar "object collections" work underway for 'perf lock' btw., by
> Hitoshi Mitake and Frederic.
>
> So there's lots of common ground and lots of interest.

Here is a scratch patch to exercise the "object collections" idea :)

Interestingly, the pagecache walk is pretty fast, while copying out the trace
data takes more time:

# time (echo / > walk-fs)
(; echo / > walk-fs; ) 0.01s user 0.11s system 82% cpu 0.145 total

# time wc /debug/tracing/trace
4570 45893 551282 /debug/tracing/trace
wc /debug/tracing/trace 0.75s user 0.55s system 88% cpu 1.470 total

# time (cat /debug/tracing/trace > /dev/shm/t)
(; cat /debug/tracing/trace > /dev/shm/t; ) 0.04s user 0.49s system 95% cpu 0.548 total

# time (dd if=/debug/tracing/trace of=/dev/shm/t bs=1M)
0+138 records in
0+138 records out
551282 bytes (551 kB) copied, 0.380454 s, 1.4 MB/s
(; dd if=/debug/tracing/trace of=/dev/shm/t bs=1M; ) 0.09s user 0.48s system 96% cpu 0.600 total

The patch is based on tip/tracing/mm.

Thanks,
Fengguang
---
tracing: pagecache object collections

This dumps
- all cached files of a mounted fs (the inode-cache)
- all cached pages of a cached file (the page-cache)

Usage and Sample output:

# echo / > /debug/tracing/objects/mm/pages/walk-fs
# head /debug/tracing/trace

# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-3078 [000] 526.272587: dump_inode: ino=102223 size=169291 cached=172032 age=9 dirty=6 dev=0:15 file=<TODO>
zsh-3078 [000] 526.274260: dump_pagecache_range: index=0 len=41 flags=10000000000002c count=1 mapcount=0
zsh-3078 [000] 526.274340: dump_pagecache_range: index=41 len=1 flags=10000000000006c count=1 mapcount=0
zsh-3078 [000] 526.274401: dump_inode: ino=8966 size=442 cached=4096 age=49 dirty=0 dev=0:15 file=<TODO>
zsh-3078 [000] 526.274425: dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0
zsh-3078 [000] 526.274440: dump_inode: ino=8964 size=4096 cached=0 age=49 dirty=0 dev=0:15 file=<TODO>

Here "age" is the number of seconds since the inode was last dirtied, or
since it was created if it has never been dirtied.
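
As a rough illustration of how the "age" value comes about: the tracepoint
stores jiffies - inode->dirtied_when and prints it divided by HZ, and
inode_init_always() now starts dirtied_when at jiffies. The user-space model
below is only a sketch; MODEL_HZ and inode_age_seconds() are made-up names,
and the tick rate is an assumption (the kernel's HZ depends on the config).

```c
/* User-space model of the "age" field; names and tick rate are illustrative. */
#define MODEL_HZ 250UL	/* assumed tick rate, not taken from the patch */

/*
 * Seconds elapsed since dirtied_when, computed the way the tracepoint
 * prints it. Unsigned subtraction keeps the result correct even if the
 * jiffies counter has wrapped around in between.
 */
static unsigned long inode_age_seconds(unsigned long now_jiffies,
				       unsigned long dirtied_when)
{
	return (now_jiffies - dirtied_when) / MODEL_HZ;
}
```

With dirtied_when initialized to the creation-time jiffies, an inode that was
never dirtied reports the time since its creation.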

TODO:

correctness
- show file path name
XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
- reliably prevent ring buffer overflow,
by replacing cond_resched() with some wait function
(eg. wait until 2+ pages are free in ring buffer)
- use stable_page_flags() in recent kernel

output style
- use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
- clear ring buffer before dumping the objects?
- output format: key=value pairs ==> header + tabbed values?
- add filtering options if necessary

CC: Ingo Molnar <[email protected]>
CC: Chris Frost <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Frederic Weisbecker <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/inode.c | 2
include/trace/events/mm.h | 67 ++++++++++++++
kernel/trace/trace_mm.c | 165 ++++++++++++++++++++++++++++++++++++
3 files changed, 233 insertions(+), 1 deletion(-)

--- linux-mm.orig/include/trace/events/mm.h 2010-02-08 23:19:09.000000000 +0800
+++ linux-mm/include/trace/events/mm.h 2010-02-08 23:19:16.000000000 +0800
@@ -2,6 +2,7 @@
#define _TRACE_MM_H

#include <linux/tracepoint.h>
+#include <linux/pagemap.h>
#include <linux/mm.h>

#undef TRACE_SYSTEM
@@ -42,6 +43,72 @@ TRACE_EVENT(dump_pages,
__entry->mapcount, __entry->index)
);

+TRACE_EVENT(dump_pagecache_range,
+
+ TP_PROTO(struct page *page, unsigned long len),
+
+ TP_ARGS(page, len),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, index )
+ __field( unsigned long, len )
+ __field( unsigned long, flags )
+ __field( unsigned int, count )
+ __field( unsigned int, mapcount )
+ ),
+
+ TP_fast_assign(
+ __entry->index = page->index;
+ __entry->len = len;
+ __entry->flags = page->flags;
+ __entry->count = atomic_read(&page->_count);
+ __entry->mapcount = page_mapcount(page);
+ ),
+
+ TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u",
+ __entry->index,
+ __entry->len,
+ __entry->flags,
+ __entry->count,
+ __entry->mapcount)
+);
+
+TRACE_EVENT(dump_inode,
+
+ TP_PROTO(struct inode *inode),
+
+ TP_ARGS(inode),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, ino )
+ __field( loff_t, size )
+ __field( unsigned long, nrpages )
+ __field( unsigned long, age )
+ __field( unsigned long, state )
+ __field( dev_t, dev )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->size = i_size_read(inode);
+ __entry->nrpages = inode->i_mapping->nrpages;
+ __entry->age = jiffies - inode->dirtied_when;
+ __entry->state = inode->i_state;
+ __entry->dev = inode->i_sb->s_dev;
+ ),
+
+ TP_printk("ino=%lu size=%llu cached=%lu age=%lu dirty=%lu "
+ "dev=%u:%u file=<TODO>",
+ __entry->ino,
+ __entry->size,
+ __entry->nrpages << PAGE_CACHE_SHIFT,
+ __entry->age / HZ,
+ __entry->state & I_DIRTY,
+ MAJOR(__entry->dev),
+ MINOR(__entry->dev))
+);
+
+
#endif /* _TRACE_MM_H */

/* This part must be outside protection */
--- linux-mm.orig/kernel/trace/trace_mm.c 2010-02-08 23:19:09.000000000 +0800
+++ linux-mm/kernel/trace/trace_mm.c 2010-02-08 23:19:16.000000000 +0800
@@ -9,6 +9,9 @@
#include <linux/bootmem.h>
#include <linux/debugfs.h>
#include <linux/uaccess.h>
+#include <linux/pagevec.h>
+#include <linux/writeback.h>
+#include <linux/file.h>

#include "trace_output.h"

@@ -95,6 +98,162 @@ static const struct file_operations trac
.write = trace_mm_dump_range_write,
};

+static unsigned long page_flags(struct page* page)
+{
+ return page->flags & ((1 << NR_PAGEFLAGS) - 1);
+}
+
+static int pages_similiar(struct page* page0, struct page* page)
+{
+ if (page_count(page0) != page_count(page))
+ return 0;
+
+ if (page_mapcount(page0) != page_mapcount(page))
+ return 0;
+
+ if (page_flags(page0) != page_flags(page))
+ return 0;
+
+ return 1;
+}
+
+#define BATCH_LINES 100
+static void dump_pagecache(struct address_space *mapping)
+{
+ int i;
+ int lines = 0;
+ pgoff_t len = 0;
+ struct pagevec pvec;
+ struct page *page;
+ struct page *page0 = NULL;
+ unsigned long start = 0;
+
+ for (;;) {
+ pagevec_init(&pvec, 0);
+ pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
+ (void **)pvec.pages, start + len, PAGEVEC_SIZE);
+
+ if (pvec.nr == 0) {
+ if (len)
+ trace_dump_pagecache_range(page0, len);
+ break;
+ }
+
+ if (!page0)
+ page0 = pvec.pages[0];
+
+ for (i = 0; i < pvec.nr; i++) {
+ page = pvec.pages[i];
+
+ if (page->index == start + len &&
+ pages_similiar(page0, page))
+ len++;
+ else {
+ trace_dump_pagecache_range(page0, len);
+ page0 = page;
+ start = page->index;
+ len = 1;
+ if (++lines > BATCH_LINES) {
+ lines = 0;
+ cond_resched();
+ }
+ }
+ }
+ }
+}
+
+static void dump_fs_pagecache(struct super_block *sb)
+{
+ struct inode *inode;
+ struct inode *prev_inode = NULL;
+
+ down_read(&sb->s_umount);
+ if (!sb->s_root)
+ goto out;
+ spin_lock(&inode_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
+ continue;
+ __iget(inode);
+ spin_unlock(&inode_lock);
+ trace_dump_inode(inode);
+ if (inode->i_mapping->nrpages)
+ dump_pagecache(inode->i_mapping);
+ iput(prev_inode);
+ prev_inode = inode;
+ cond_resched();
+ spin_lock(&inode_lock);
+ }
+ spin_unlock(&inode_lock);
+ iput(prev_inode);
+out:
+ up_read(&sb->s_umount);
+}
+
+static ssize_t
+trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
+ loff_t *ppos)
+{
+ struct file *file = NULL;
+ char *name;
+ int err = 0;
+
+ if (count > PATH_MAX + 1)
+ return -ENAMETOOLONG;
+
+ name = kmalloc(count+1, GFP_KERNEL);
+ if (!name)
+ return -ENOMEM;
+
+ if (copy_from_user(name, ubuf, count)) {
+ err = -EFAULT;
+ goto out;
+ }
+
+ /* strip the newline added by `echo` */
+ if (count)
+ name[count-1] = '\0';
+
+ file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
+ if (IS_ERR(file)) {
+ err = PTR_ERR(file);
+ file = NULL;
+ goto out;
+ }
+
+ if (tracing_update_buffers() < 0) {
+ err = -ENOMEM;
+ goto out;
+ }
+ if (trace_set_clr_event("mm", "dump_pagecache_range", 1)) {
+ err = -EINVAL;
+ goto out;
+ }
+ if (trace_set_clr_event("mm", "dump_inode", 1)) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (filp->f_path.dentry->d_inode->i_private) {
+ dump_fs_pagecache(file->f_path.dentry->d_sb);
+ } else {
+ dump_pagecache(file->f_mapping);
+ }
+
+out:
+ if (file)
+ fput(file);
+ kfree(name);
+
+ return err ? err : count;
+}
+
+static const struct file_operations trace_pagecache_fops = {
+ .open = tracing_open_generic,
+ .read = trace_mm_dump_range_read,
+ .write = trace_pagecache_write,
+};
+
/* move this into trace_objects.c when that file is created */
static struct dentry *trace_objects_dir(void)
{
@@ -167,6 +326,12 @@ static __init int trace_objects_mm_init(
trace_create_file("dump_range", 0600, d_pages, NULL,
&trace_mm_fops);

+ trace_create_file("walk-file", 0600, d_pages, NULL,
+ &trace_pagecache_fops);
+
+ trace_create_file("walk-fs", 0600, d_pages, (void *)1,
+ &trace_pagecache_fops);
+
return 0;
}
fs_initcall(trace_objects_mm_init);
--- linux-mm.orig/fs/inode.c 2010-02-08 23:19:12.000000000 +0800
+++ linux-mm/fs/inode.c 2010-02-08 23:19:22.000000000 +0800
@@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
inode->i_bdev = NULL;
inode->i_cdev = NULL;
inode->i_rdev = 0;
- inode->dirtied_when = 0;
+ inode->dirtied_when = jiffies;

if (security_inode_alloc(inode))
goto out_free_inode;

2010-02-09 16:21:13

by Fengguang Wu

Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

> Here is a scratch patch to exercise the "object collections" idea :)
>
> Interestingly, the pagecache walk is pretty fast, while copying out the trace
> data takes more time:
>
> # time (echo / > walk-fs)
> (; echo / > walk-fs; ) 0.01s user 0.11s system 82% cpu 0.145 total
>
> # time wc /debug/tracing/trace
> 4570 45893 551282 /debug/tracing/trace
> wc /debug/tracing/trace 0.75s user 0.55s system 88% cpu 1.470 total

Ah, got it: most of the time goes into "printing" (formatting) the raw trace data.

> TODO:
>
> correctness
> - show file path name
> XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?

OK, the file name is now shown, obtained via d_path(). I chose not to
mangle a possible '\n' in file names and simply show "?" for such files,
for the sake of speed.
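
The newline handling amounts to one check at print time, mirroring the
strchr() test in the TP_printk() of dump_inode in the patch below. A minimal
user-space sketch, where printable_name() is a hypothetical helper that does
not exist in the patch:

```c
#include <string.h>

/*
 * Return the string to print for a file name: "?" if the name contains
 * a newline (which would break the one-record-per-line output), else the
 * name itself, unmodified.
 */
static const char *printable_name(const char *name)
{
	return strchr(name, '\n') ? "?" : name;
}
```

This costs a single scan of the name and no copying, at the price of hiding
the real name for the rare file that has a newline in it.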

Thanks,
Fengguang
---
tracing: pagecache object collections

This dumps
- all cached files of a mounted fs (the inode-cache)
- all cached pages of a cached file (the page-cache)

Usage and Sample output:

# echo /dev > /debug/tracing/objects/mm/pages/walk-fs
# tail /debug/tracing/trace
zsh-2528 [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
zsh-2528 [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
zsh-2528 [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
zsh-2528 [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
zsh-2528 [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
zsh-2528 [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
zsh-2528 [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
zsh-2528 [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
zsh-2528 [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
zsh-2528 [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev

Here "age" is the number of seconds since the inode was last dirtied, or
since it was created if it has never been dirtied.

TODO:

correctness
- reliably prevent ring buffer overflow,
by replacing cond_resched() with some wait function
(eg. wait until 2+ pages are free in ring buffer)
- use stable_page_flags() in recent kernel

output style
- use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
- clear ring buffer before dumping the objects?
- output format: key=value pairs ==> header + tabbed values?
- add filtering options if necessary

CC: Ingo Molnar <[email protected]>
CC: Chris Frost <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Frederic Weisbecker <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/inode.c | 2
include/trace/events/mm.h | 70 ++++++++++++
kernel/trace/trace_mm.c | 204 ++++++++++++++++++++++++++++++++++++
3 files changed, 275 insertions(+), 1 deletion(-)

--- linux-mm.orig/include/trace/events/mm.h 2010-02-08 23:19:09.000000000 +0800
+++ linux-mm/include/trace/events/mm.h 2010-02-09 23:39:03.000000000 +0800
@@ -2,6 +2,7 @@
#define _TRACE_MM_H

#include <linux/tracepoint.h>
+#include <linux/pagemap.h>
#include <linux/mm.h>

#undef TRACE_SYSTEM
@@ -42,6 +43,75 @@ TRACE_EVENT(dump_pages,
__entry->mapcount, __entry->index)
);

+TRACE_EVENT(dump_pagecache_range,
+
+ TP_PROTO(struct page *page, unsigned long len),
+
+ TP_ARGS(page, len),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, index )
+ __field( unsigned long, len )
+ __field( unsigned long, flags )
+ __field( unsigned int, count )
+ __field( unsigned int, mapcount )
+ ),
+
+ TP_fast_assign(
+ __entry->index = page->index;
+ __entry->len = len;
+ __entry->flags = page->flags;
+ __entry->count = atomic_read(&page->_count);
+ __entry->mapcount = page_mapcount(page);
+ ),
+
+ TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u",
+ __entry->index,
+ __entry->len,
+ __entry->flags,
+ __entry->count,
+ __entry->mapcount)
+);
+
+TRACE_EVENT(dump_inode,
+
+ TP_PROTO(struct inode *inode, char *name, int len),
+
+ TP_ARGS(inode, name, len),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, ino )
+ __field( loff_t, size )
+ __field( unsigned long, nrpages )
+ __field( unsigned long, age )
+ __field( unsigned long, state )
+ __field( dev_t, dev )
+ __dynamic_array(char, file, len )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->size = i_size_read(inode);
+ __entry->nrpages = inode->i_mapping->nrpages;
+ __entry->age = jiffies - inode->dirtied_when;
+ __entry->state = inode->i_state;
+ __entry->dev = inode->i_sb->s_dev;
+ memcpy(__get_str(file), name, len);
+ ),
+
+ TP_printk("ino=%lu size=%llu cached=%lu age=%lu dirty=%lu "
+ "dev=%u:%u file=%s",
+ __entry->ino,
+ __entry->size,
+ __entry->nrpages << PAGE_CACHE_SHIFT,
+ __entry->age / HZ,
+ __entry->state & I_DIRTY,
+ MAJOR(__entry->dev),
+ MINOR(__entry->dev),
+ strchr(__get_str(file), '\n') ? "?" : __get_str(file))
+);
+
+
#endif /* _TRACE_MM_H */

/* This part must be outside protection */
--- linux-mm.orig/kernel/trace/trace_mm.c 2010-02-08 23:19:09.000000000 +0800
+++ linux-mm/kernel/trace/trace_mm.c 2010-02-10 00:04:47.000000000 +0800
@@ -9,6 +9,9 @@
#include <linux/bootmem.h>
#include <linux/debugfs.h>
#include <linux/uaccess.h>
+#include <linux/pagevec.h>
+#include <linux/writeback.h>
+#include <linux/file.h>

#include "trace_output.h"

@@ -95,6 +98,201 @@ static const struct file_operations trac
.write = trace_mm_dump_range_write,
};

+static unsigned long page_flags(struct page* page)
+{
+ return page->flags & ((1 << NR_PAGEFLAGS) - 1);
+}
+
+static int pages_similiar(struct page* page0, struct page* page)
+{
+ if (page_count(page0) != page_count(page))
+ return 0;
+
+ if (page_mapcount(page0) != page_mapcount(page))
+ return 0;
+
+ if (page_flags(page0) != page_flags(page))
+ return 0;
+
+ return 1;
+}
+
+#define BATCH_LINES 100
+static void dump_pagecache(struct address_space *mapping)
+{
+ int i;
+ int lines = 0;
+ pgoff_t len = 0;
+ struct pagevec pvec;
+ struct page *page;
+ struct page *page0 = NULL;
+ unsigned long start = 0;
+
+ for (;;) {
+ pagevec_init(&pvec, 0);
+ pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
+ (void **)pvec.pages, start + len, PAGEVEC_SIZE);
+
+ if (pvec.nr == 0) {
+ if (len)
+ trace_dump_pagecache_range(page0, len);
+ break;
+ }
+
+ if (!page0)
+ page0 = pvec.pages[0];
+
+ for (i = 0; i < pvec.nr; i++) {
+ page = pvec.pages[i];
+
+ if (page->index == start + len &&
+ pages_similiar(page0, page))
+ len++;
+ else {
+ trace_dump_pagecache_range(page0, len);
+ page0 = page;
+ start = page->index;
+ len = 1;
+ if (++lines > BATCH_LINES) {
+ lines = 0;
+ cond_resched();
+ }
+ }
+ }
+ }
+}
+
+static void dump_inode(struct inode *inode,
+ char *name_buf,
+ struct vfsmount *mnt)
+{
+ struct path path = {
+ .mnt = mnt,
+ .dentry = d_find_alias(inode)
+ };
+ char *name;
+ int len;
+
+ if (!path.dentry) {
+ trace_dump_inode(inode, "?", 2);
+ return;
+ }
+
+ name = d_path(&path, name_buf, PAGE_SIZE);
+ if (IS_ERR(name)) {
+ name = "?";
+ len = 2;
+ } else
+ len = PAGE_SIZE + name_buf - name;
+
+ trace_dump_inode(inode, name, len);
+
+ if (path.dentry)
+ dput(path.dentry);
+}
+
+static void dump_fs_pagecache(struct super_block *sb, struct vfsmount *mnt)
+{
+ struct inode *inode;
+ struct inode *prev_inode = NULL;
+ char *name_buf;
+
+ name_buf = (char *)__get_free_page(GFP_TEMPORARY);
+ if (!name_buf)
+ return;
+
+ down_read(&sb->s_umount);
+ if (!sb->s_root)
+ goto out;
+
+ spin_lock(&inode_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
+ continue;
+ __iget(inode);
+ spin_unlock(&inode_lock);
+ dump_inode(inode, name_buf, mnt);
+ if (inode->i_mapping->nrpages)
+ dump_pagecache(inode->i_mapping);
+ iput(prev_inode);
+ prev_inode = inode;
+ cond_resched();
+ spin_lock(&inode_lock);
+ }
+ spin_unlock(&inode_lock);
+ iput(prev_inode);
+out:
+ up_read(&sb->s_umount);
+ free_page((unsigned long)name_buf);
+}
+
+static ssize_t
+trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
+ loff_t *ppos)
+{
+ struct file *file = NULL;
+ char *name;
+ int err = 0;
+
+ if (count <= 1)
+ return -EINVAL;
+ if (count > PATH_MAX + 1)
+ return -ENAMETOOLONG;
+
+ name = kmalloc(count+1, GFP_KERNEL);
+ if (!name)
+ return -ENOMEM;
+
+ if (copy_from_user(name, ubuf, count)) {
+ err = -EFAULT;
+ goto out;
+ }
+
+ /* strip the newline added by `echo` */
+ if (name[count-1] != '\n')
+ return -EINVAL;
+ name[count-1] = '\0';
+
+ file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
+ if (IS_ERR(file)) {
+ err = PTR_ERR(file);
+ file = NULL;
+ goto out;
+ }
+
+ if (tracing_update_buffers() < 0) {
+ err = -ENOMEM;
+ goto out;
+ }
+ if (trace_set_clr_event("mm", "dump_pagecache_range", 1)) {
+ err = -EINVAL;
+ goto out;
+ }
+ if (trace_set_clr_event("mm", "dump_inode", 1)) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (filp->f_path.dentry->d_inode->i_private) {
+ dump_fs_pagecache(file->f_path.dentry->d_sb, file->f_path.mnt);
+ } else {
+ dump_pagecache(file->f_mapping);
+ }
+
+out:
+ if (file)
+ fput(file);
+ kfree(name);
+
+ return err ? err : count;
+}
+
+static const struct file_operations trace_pagecache_fops = {
+ .open = tracing_open_generic,
+ .read = trace_mm_dump_range_read,
+ .write = trace_pagecache_write,
+};
+
/* move this into trace_objects.c when that file is created */
static struct dentry *trace_objects_dir(void)
{
@@ -167,6 +365,12 @@ static __init int trace_objects_mm_init(
trace_create_file("dump_range", 0600, d_pages, NULL,
&trace_mm_fops);

+ trace_create_file("walk-file", 0600, d_pages, NULL,
+ &trace_pagecache_fops);
+
+ trace_create_file("walk-fs", 0600, d_pages, (void *)1,
+ &trace_pagecache_fops);
+
return 0;
}
fs_initcall(trace_objects_mm_init);
--- linux-mm.orig/fs/inode.c 2010-02-08 23:19:12.000000000 +0800
+++ linux-mm/fs/inode.c 2010-02-08 23:19:22.000000000 +0800
@@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
inode->i_bdev = NULL;
inode->i_cdev = NULL;
inode->i_rdev = 0;
- inode->dirtied_when = 0;
+ inode->dirtied_when = jiffies;

if (security_inode_alloc(inode))
goto out_free_inode;

2010-02-13 13:30:08

by Balbir Singh

Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

* Wu Fengguang <[email protected]> [2010-02-10 00:21:01]:

> > Here is a scratch patch to exercise the "object collections" idea :)
> >
> > Interestingly, the pagecache walk is pretty fast, while copying out the trace
> > data takes more time:
> >
> > # time (echo / > walk-fs)
> > (; echo / > walk-fs; ) 0.01s user 0.11s system 82% cpu 0.145 total
> >
> > # time wc /debug/tracing/trace
> > 4570 45893 551282 /debug/tracing/trace
> > wc /debug/tracing/trace 0.75s user 0.55s system 88% cpu 1.470 total
>
> Ah got it: it takes much time to "print" the raw trace data.
>
> > TODO:
> >
> > correctness
> > - show file path name
> > XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
>
> OK, finished with the file name with d_path(). I choose not to mangle
> the possible '\n' in file names, and simply show "?" for such files,
> for the sake of speed.
>
> Thanks,
> Fengguang
> ---
> tracing: pagecache object collections
>
> This dumps
> - all cached files of a mounted fs (the inode-cache)
> - all cached pages of a cached file (the page-cache)
>
> Usage and Sample output:
>
> # echo /dev > /debug/tracing/objects/mm/pages/walk-fs
> # tail /debug/tracing/trace
> zsh-2528 [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
> zsh-2528 [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
> zsh-2528 [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
> zsh-2528 [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
> zsh-2528 [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
> zsh-2528 [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
> zsh-2528 [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
> zsh-2528 [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
> zsh-2528 [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
> zsh-2528 [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev
>
> Here "age" is either age from inode create time, or from last dirty time.
>

It would be nice to see mapped/unmapped information as well.

> TODO:
>
> correctness
> - reliably prevent ring buffer overflow,
> by replacing cond_resched() with some wait function
> (eg. wait until 2+ pages are free in ring buffer)
> - use stable_page_flags() in recent kernel
>
> output style
> - use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
> - clear ring buffer before dumping the objects?
> - output format: key=value pairs ==> header + tabbed values?
> - add filtering options if necessary
>
> CC: Ingo Molnar <[email protected]>
> CC: Chris Frost <[email protected]>
> CC: Steven Rostedt <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Frederic Weisbecker <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> fs/inode.c | 2
> include/trace/events/mm.h | 70 ++++++++++++
> kernel/trace/trace_mm.c | 204 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 275 insertions(+), 1 deletion(-)
>
> --- linux-mm.orig/include/trace/events/mm.h 2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/include/trace/events/mm.h 2010-02-09 23:39:03.000000000 +0800
> @@ -2,6 +2,7 @@
> #define _TRACE_MM_H
>
> #include <linux/tracepoint.h>
> +#include <linux/pagemap.h>
> #include <linux/mm.h>
>
> #undef TRACE_SYSTEM
> @@ -42,6 +43,75 @@ TRACE_EVENT(dump_pages,
> __entry->mapcount, __entry->index)
> );
>
> +TRACE_EVENT(dump_pagecache_range,
> +
> + TP_PROTO(struct page *page, unsigned long len),
> +
> + TP_ARGS(page, len),
> +
> + TP_STRUCT__entry(
> + __field( unsigned long, index )
> + __field( unsigned long, len )
> + __field( unsigned long, flags )
> + __field( unsigned int, count )
> + __field( unsigned int, mapcount )
> + ),
> +
> + TP_fast_assign(
> + __entry->index = page->index;
> + __entry->len = len;
> + __entry->flags = page->flags;
> + __entry->count = atomic_read(&page->_count);
> + __entry->mapcount = page_mapcount(page);
> + ),
> +
> + TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u",
> + __entry->index,
> + __entry->len,
> + __entry->flags,
> + __entry->count,
> + __entry->mapcount)
> +);
> +
> +TRACE_EVENT(dump_inode,
> +
> + TP_PROTO(struct inode *inode, char *name, int len),
> +
> + TP_ARGS(inode, name, len),
> +
> + TP_STRUCT__entry(
> + __field( unsigned long, ino )
> + __field( loff_t, size )
> + __field( unsigned long, nrpages )
> + __field( unsigned long, age )
> + __field( unsigned long, state )
> + __field( dev_t, dev )
> + __dynamic_array(char, file, len )
> + ),
> +
> + TP_fast_assign(
> + __entry->ino = inode->i_ino;
> + __entry->size = i_size_read(inode);
> + __entry->nrpages = inode->i_mapping->nrpages;
> + __entry->age = jiffies - inode->dirtied_when;
> + __entry->state = inode->i_state;
> + __entry->dev = inode->i_sb->s_dev;
> + memcpy(__get_str(file), name, len);
> + ),
> +
> + TP_printk("ino=%lu size=%llu cached=%lu age=%lu dirty=%lu "
> + "dev=%u:%u file=%s",
> + __entry->ino,
> + __entry->size,
> + __entry->nrpages << PAGE_CACHE_SHIFT,
> + __entry->age / HZ,
> + __entry->state & I_DIRTY,
> + MAJOR(__entry->dev),
> + MINOR(__entry->dev),
> + strchr(__get_str(file), '\n') ? "?" : __get_str(file))
> +);
> +
> +
> #endif /* _TRACE_MM_H */
>
> /* This part must be outside protection */
> --- linux-mm.orig/kernel/trace/trace_mm.c 2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/kernel/trace/trace_mm.c 2010-02-10 00:04:47.000000000 +0800
> @@ -9,6 +9,9 @@
> #include <linux/bootmem.h>
> #include <linux/debugfs.h>
> #include <linux/uaccess.h>
> +#include <linux/pagevec.h>
> +#include <linux/writeback.h>
> +#include <linux/file.h>
>
> #include "trace_output.h"
>
> @@ -95,6 +98,201 @@ static const struct file_operations trac
> .write = trace_mm_dump_range_write,
> };
>
> +static unsigned long page_flags(struct page* page)
> +{
> + return page->flags & ((1 << NR_PAGEFLAGS) - 1);
> +}
> +
> +static int pages_similiar(struct page* page0, struct page* page)
> +{
> + if (page_count(page0) != page_count(page))
> + return 0;
> +
> + if (page_mapcount(page0) != page_mapcount(page))
> + return 0;
> +
> + if (page_flags(page0) != page_flags(page))
> + return 0;
> +
> + return 1;
> +}
> +

OK, so pages_similar() is used to identify a range of pages in the
cache?
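
To make the question concrete, here is how I read the intent, as a user-space
sketch rather than the kernel code: adjacent pages whose count, mapcount and
flags all match are merged into one (start, len) record. struct model_page,
struct model_range and coalesce_ranges() are made up for illustration, and
the model only emits a range once a run has actually started.

```c
#include <stddef.h>

/* Simplified stand-in for struct page; only the compared fields are modeled. */
struct model_page {
	unsigned long index;
	unsigned long flags;
	int count;
	int mapcount;
};

struct model_range {
	unsigned long start;
	unsigned long len;
};

/* Two pages are "similar" when refcount, mapcount and flags all match. */
static int pages_similar(const struct model_page *a, const struct model_page *b)
{
	return a->count == b->count &&
	       a->mapcount == b->mapcount &&
	       a->flags == b->flags;
}

/*
 * Coalesce pages (sorted by index) into runs of adjacent, similar pages.
 * Stores at most max ranges in out and returns the number of runs found.
 */
static size_t coalesce_ranges(const struct model_page *pages, size_t n,
			      struct model_range *out, size_t max)
{
	size_t nr = 0;
	size_t first = 0;	/* position of the first page of the current run */
	unsigned long len = 0;	/* length of the current run, 0 = no run yet */

	for (size_t i = 0; i < n; i++) {
		if (len && pages[i].index == pages[first].index + len &&
		    pages_similar(&pages[first], &pages[i])) {
			len++;	/* page extends the current run */
			continue;
		}
		if (len) {	/* close the previous run */
			if (nr < max) {
				out[nr].start = pages[first].index;
				out[nr].len = len;
			}
			nr++;
		}
		first = i;	/* start a new run at this page */
		len = 1;
	}
	if (len) {		/* close the final run */
		if (nr < max) {
			out[nr].start = pages[first].index;
			out[nr].len = len;
		}
		nr++;
	}
	return nr;
}
```

For pages at index 0, 1, 2 with identical state, a page at index 3 with
different flags, and a page at index 10, this yields three records:
(0, 3), (3, 1) and (10, 1).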

> +#define BATCH_LINES 100
> +static void dump_pagecache(struct address_space *mapping)
> +{
> + int i;
> + int lines = 0;
> + pgoff_t len = 0;
> + struct pagevec pvec;
> + struct page *page;
> + struct page *page0 = NULL;
> + unsigned long start = 0;
> +
> + for (;;) {
> + pagevec_init(&pvec, 0);
> + pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> + (void **)pvec.pages, start + len, PAGEVEC_SIZE);

Is radix_tree_gang_lookup() synchronized somewhere? Don't we need to
call it under RCU or a lock (on the mapping)?

> +
> + if (pvec.nr == 0) {
> + if (len)
> + trace_dump_pagecache_range(page0, len);
> + break;
> + }
> +
> + if (!page0)
> + page0 = pvec.pages[0];
> +
> + for (i = 0; i < pvec.nr; i++) {
> + page = pvec.pages[i];
> +
> + if (page->index == start + len &&
> + pages_similiar(page0, page))
> + len++;
> + else {
> + trace_dump_pagecache_range(page0, len);
> + page0 = page;
> + start = page->index;
> + len = 1;
> + if (++lines > BATCH_LINES) {
> + lines = 0;
> + cond_resched();
> + }
> + }
> + }
> + }
> +}
> +
> +static void dump_inode(struct inode *inode,
> + char *name_buf,
> + struct vfsmount *mnt)
> +{
> + struct path path = {
> + .mnt = mnt,
> + .dentry = d_find_alias(inode)
> + };
> + char *name;
> + int len;
> +
> + if (!path.dentry) {
> + trace_dump_inode(inode, "?", 2);
> + return;
> + }
> +
> + name = d_path(&path, name_buf, PAGE_SIZE);
> + if (IS_ERR(name)) {
> + name = "?";
> + len = 2;
> + } else
> + len = PAGE_SIZE + name_buf - name;
> +
> + trace_dump_inode(inode, name, len);
> +
> + if (path.dentry)
> + dput(path.dentry);
> +}
> +
> +static void dump_fs_pagecache(struct super_block *sb, struct vfsmount *mnt)
> +{
> + struct inode *inode;
> + struct inode *prev_inode = NULL;
> + char *name_buf;
> +
> + name_buf = (char *)__get_free_page(GFP_TEMPORARY);
> + if (!name_buf)
> + return;
> +
> + down_read(&sb->s_umount);
> + if (!sb->s_root)
> + goto out;
> +
> + spin_lock(&inode_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> + continue;
> + __iget(inode);
> + spin_unlock(&inode_lock);
> + dump_inode(inode, name_buf, mnt);
> + if (inode->i_mapping->nrpages)
> + dump_pagecache(inode->i_mapping);
> + iput(prev_inode);
> + prev_inode = inode;
> + cond_resched();
> + spin_lock(&inode_lock);
> + }
> + spin_unlock(&inode_lock);
> + iput(prev_inode);
> +out:
> + up_read(&sb->s_umount);
> + free_page((unsigned long)name_buf);
> +}
> +
> +static ssize_t
> +trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
> + loff_t *ppos)
> +{
> + struct file *file = NULL;
> + char *name;
> + int err = 0;
> +

Can't we use the trace_parser here?

> + if (count <= 1)
> + return -EINVAL;
> + if (count > PATH_MAX + 1)
> + return -ENAMETOOLONG;
> +
> + name = kmalloc(count+1, GFP_KERNEL);
> + if (!name)
> + return -ENOMEM;
> +
> + if (copy_from_user(name, ubuf, count)) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + /* strip the newline added by `echo` */
> + if (name[count-1] != '\n')
> + return -EINVAL;

That doesn't look correct: what happens if we use echo -n?
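
Note also that the early return in the hunk above bypasses the kfree(name)
at the out: label. One possible shape for the parsing, shown as a user-space
sketch: treat the trailing newline as optional instead of mandatory.
terminate_path() is a hypothetical helper, not an existing kernel function,
and it assumes the buffer has room for count + 1 bytes, as the kmalloc()
above provides.

```c
#include <stddef.h>

/*
 * NUL-terminate buf (which must have room for count + 1 bytes) and drop
 * one trailing newline if there is one. Returns the resulting path length,
 * or -1 if nothing is left (empty input or a lone newline).
 */
static long terminate_path(char *buf, size_t count)
{
	if (count && buf[count - 1] == '\n')
		count--;	/* input came from plain `echo` */
	if (count == 0)
		return -1;	/* caller would map this to -EINVAL */
	buf[count] = '\0';
	return (long)count;
}
```

In the write handler, the error case would then set err and jump to out:
so the buffer is freed on every path.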

> + name[count-1] = '\0';
> +
> + file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
> + if (IS_ERR(file)) {
> + err = PTR_ERR(file);
> + file = NULL;
> + goto out;
> + }
> +
> + if (tracing_update_buffers() < 0) {
> + err = -ENOMEM;
> + goto out;
> + }
> + if (trace_set_clr_event("mm", "dump_pagecache_range", 1)) {
> + err = -EINVAL;
> + goto out;
> + }
> + if (trace_set_clr_event("mm", "dump_inode", 1)) {
> + err = -EINVAL;
> + goto out;
> + }
> +
> + if (filp->f_path.dentry->d_inode->i_private) {
> + dump_fs_pagecache(file->f_path.dentry->d_sb, file->f_path.mnt);
> + } else {
> + dump_pagecache(file->f_mapping);
> + }
> +
> +out:
> + if (file)
> + fput(file);
> + kfree(name);
> +
> + return err ? err : count;
> +}
> +
> +static const struct file_operations trace_pagecache_fops = {
> + .open = tracing_open_generic,
> + .read = trace_mm_dump_range_read,
> + .write = trace_pagecache_write,
> +};
> +
> /* move this into trace_objects.c when that file is created */
> static struct dentry *trace_objects_dir(void)
> {
> @@ -167,6 +365,12 @@ static __init int trace_objects_mm_init(
> trace_create_file("dump_range", 0600, d_pages, NULL,
> &trace_mm_fops);
>
> + trace_create_file("walk-file", 0600, d_pages, NULL,
> + &trace_pagecache_fops);
> +
> + trace_create_file("walk-fs", 0600, d_pages, (void *)1,
> + &trace_pagecache_fops);
> +
> return 0;
> }
> fs_initcall(trace_objects_mm_init);
> --- linux-mm.orig/fs/inode.c 2010-02-08 23:19:12.000000000 +0800
> +++ linux-mm/fs/inode.c 2010-02-08 23:19:22.000000000 +0800
> @@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
> inode->i_bdev = NULL;
> inode->i_cdev = NULL;
> inode->i_rdev = 0;
> - inode->dirtied_when = 0;
> + inode->dirtied_when = jiffies;
>

Hmmm... Is the inode really dirtied when it is initialized? I know the
change is for tracing, but the code is confusing to read.


--
Three Cheers,
Balbir

2010-02-14 10:52:56

by Balbir Singh

Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

* Balbir Singh <[email protected]> [2010-02-13 18:59:52]:

> * Wu Fengguang <[email protected]> [2010-02-10 00:21:01]:
>
> > > Here is a scratch patch to exercise the "object collections" idea :)
> > >
> > > Interestingly, the pagecache walk is pretty fast, while copying out the trace
> > > data takes more time:
> > >
> > > # time (echo / > walk-fs)
> > > (; echo / > walk-fs; ) 0.01s user 0.11s system 82% cpu 0.145 total
> > >
> > > # time wc /debug/tracing/trace
> > > 4570 45893 551282 /debug/tracing/trace
> > > wc /debug/tracing/trace 0.75s user 0.55s system 88% cpu 1.470 total
> >
> > Ah got it: it takes much time to "print" the raw trace data.
> >
> > > TODO:
> > >
> > > correctness
> > > - show file path name
> > > XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
> >
> > OK, finished with the file name with d_path(). I choose not to mangle
> > the possible '\n' in file names, and simply show "?" for such files,
> > for the sake of speed.
> >
> > Thanks,
> > Fengguang
> > ---
> > tracing: pagecache object collections
> >
> > This dumps
> > - all cached files of a mounted fs (the inode-cache)
> > - all cached pages of a cached file (the page-cache)
> >
> > Usage and Sample output:
> >
> > # echo /dev > /debug/tracing/objects/mm/pages/walk-fs
> > # tail /debug/tracing/trace
> > zsh-2528 [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
> > zsh-2528 [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
> > zsh-2528 [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
> > zsh-2528 [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
> > zsh-2528 [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
> > zsh-2528 [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
> > zsh-2528 [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
> > zsh-2528 [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
> > zsh-2528 [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
> > zsh-2528 [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev
> >
> > Here "age" is either age from inode create time, or from last dirty time.
> >
>
> It would be nice to see mapped/unmapped information as well.
>

OK, I see you got mapcount, thanks!

--
Three Cheers,
Balbir

2010-02-16 03:22:56

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

> > Here is a scratch patch to exercise the "object collections" idea :)
> >
> > Interestingly, the pagecache walk is pretty fast, while copying out the trace
> > data takes more time:
> >
> > # time (echo / > walk-fs)
> > (; echo / > walk-fs; ) 0.01s user 0.11s system 82% cpu 0.145 total
> >
> > # time wc /debug/tracing/trace
> > 4570 45893 551282 /debug/tracing/trace
> > wc /debug/tracing/trace 0.75s user 0.55s system 88% cpu 1.470 total
>
> Ah got it: it takes much time to "print" the raw trace data.
>
> > TODO:
> >
> > correctness
> > - show file path name
> > XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
>
> OK, finished with the file name with d_path(). I choose not to mangle
> the possible '\n' in file names, and simply show "?" for such files,
> for the sake of speed.


This patch is nicer than KII-san's one. I plan to test it in
my local test environment for a while.

thanks.


>
> Thanks,
> Fengguang
> ---
> tracing: pagecache object collections
>
> This dumps
> - all cached files of a mounted fs (the inode-cache)
> - all cached pages of a cached file (the page-cache)
>
> Usage and Sample output:
>
> # echo /dev > /debug/tracing/objects/mm/pages/walk-fs
> # tail /debug/tracing/trace
> zsh-2528 [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
> zsh-2528 [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
> zsh-2528 [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
> zsh-2528 [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
> zsh-2528 [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
> zsh-2528 [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
> zsh-2528 [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
> zsh-2528 [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
> zsh-2528 [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
> zsh-2528 [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev
>
> Here "age" is either age from inode create time, or from last dirty time.
>
> TODO:
>
> correctness
> - reliably prevent ring buffer overflow,
> by replacing cond_resched() with some wait function
> (eg. wait until 2+ pages are free in ring buffer)
> - use stable_page_flags() in recent kernel
>
> output style
> - use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
> - clear ring buffer before dumping the objects?
> - output format: key=value pairs ==> header + tabbed values?
> - add filtering options if necessary
>
> CC: Ingo Molnar <[email protected]>
> CC: Chris Frost <[email protected]>
> CC: Steven Rostedt <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Frederic Weisbecker <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> fs/inode.c | 2
> include/trace/events/mm.h | 70 ++++++++++++
> kernel/trace/trace_mm.c | 204 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 275 insertions(+), 1 deletion(-)
>
> --- linux-mm.orig/include/trace/events/mm.h 2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/include/trace/events/mm.h 2010-02-09 23:39:03.000000000 +0800
> @@ -2,6 +2,7 @@
> #define _TRACE_MM_H
>
> #include <linux/tracepoint.h>
> +#include <linux/pagemap.h>
> #include <linux/mm.h>
>
> #undef TRACE_SYSTEM
> @@ -42,6 +43,75 @@ TRACE_EVENT(dump_pages,
> __entry->mapcount, __entry->index)
> );
>
> +TRACE_EVENT(dump_pagecache_range,
> +
> + TP_PROTO(struct page *page, unsigned long len),
> +
> + TP_ARGS(page, len),
> +
> + TP_STRUCT__entry(
> + __field( unsigned long, index )
> + __field( unsigned long, len )
> + __field( unsigned long, flags )
> + __field( unsigned int, count )
> + __field( unsigned int, mapcount )
> + ),
> +
> + TP_fast_assign(
> + __entry->index = page->index;
> + __entry->len = len;
> + __entry->flags = page->flags;
> + __entry->count = atomic_read(&page->_count);
> + __entry->mapcount = page_mapcount(page);
> + ),
> +
> + TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u",
> + __entry->index,
> + __entry->len,
> + __entry->flags,
> + __entry->count,
> + __entry->mapcount)
> +);
> +
> +TRACE_EVENT(dump_inode,
> +
> + TP_PROTO(struct inode *inode, char *name, int len),
> +
> + TP_ARGS(inode, name, len),
> +
> + TP_STRUCT__entry(
> + __field( unsigned long, ino )
> + __field( loff_t, size )
> + __field( unsigned long, nrpages )
> + __field( unsigned long, age )
> + __field( unsigned long, state )
> + __field( dev_t, dev )
> + __dynamic_array(char, file, len )
> + ),
> +
> + TP_fast_assign(
> + __entry->ino = inode->i_ino;
> + __entry->size = i_size_read(inode);
> + __entry->nrpages = inode->i_mapping->nrpages;
> + __entry->age = jiffies - inode->dirtied_when;
> + __entry->state = inode->i_state;
> + __entry->dev = inode->i_sb->s_dev;
> + memcpy(__get_str(file), name, len);
> + ),
> +
> + TP_printk("ino=%lu size=%llu cached=%lu age=%lu dirty=%lu "
> + "dev=%u:%u file=%s",
> + __entry->ino,
> + __entry->size,
> + __entry->nrpages << PAGE_CACHE_SHIFT,
> + __entry->age / HZ,
> + __entry->state & I_DIRTY,
> + MAJOR(__entry->dev),
> + MINOR(__entry->dev),
> + strchr(__get_str(file), '\n') ? "?" : __get_str(file))
> +);
> +
> +
> #endif /* _TRACE_MM_H */
>
> /* This part must be outside protection */
> --- linux-mm.orig/kernel/trace/trace_mm.c 2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/kernel/trace/trace_mm.c 2010-02-10 00:04:47.000000000 +0800
> @@ -9,6 +9,9 @@
> #include <linux/bootmem.h>
> #include <linux/debugfs.h>
> #include <linux/uaccess.h>
> +#include <linux/pagevec.h>
> +#include <linux/writeback.h>
> +#include <linux/file.h>
>
> #include "trace_output.h"
>
> @@ -95,6 +98,201 @@ static const struct file_operations trac
> .write = trace_mm_dump_range_write,
> };
>
> +static unsigned long page_flags(struct page* page)
> +{
> + return page->flags & ((1 << NR_PAGEFLAGS) - 1);
> +}
> +
> +static int pages_similiar(struct page* page0, struct page* page)
> +{
> + if (page_count(page0) != page_count(page))
> + return 0;
> +
> + if (page_mapcount(page0) != page_mapcount(page))
> + return 0;
> +
> + if (page_flags(page0) != page_flags(page))
> + return 0;
> +
> + return 1;
> +}
> +
> +#define BATCH_LINES 100
> +static void dump_pagecache(struct address_space *mapping)
> +{
> + int i;
> + int lines = 0;
> + pgoff_t len = 0;
> + struct pagevec pvec;
> + struct page *page;
> + struct page *page0 = NULL;
> + unsigned long start = 0;
> +
> + for (;;) {
> + pagevec_init(&pvec, 0);
> + pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> + (void **)pvec.pages, start + len, PAGEVEC_SIZE);
> +
> + if (pvec.nr == 0) {
> + if (len)
> + trace_dump_pagecache_range(page0, len);
> + break;
> + }
> +
> + if (!page0)
> + page0 = pvec.pages[0];
> +
> + for (i = 0; i < pvec.nr; i++) {
> + page = pvec.pages[i];
> +
> + if (page->index == start + len &&
> + pages_similiar(page0, page))
> + len++;
> + else {
> + trace_dump_pagecache_range(page0, len);
> + page0 = page;
> + start = page->index;
> + len = 1;
> + if (++lines > BATCH_LINES) {
> + lines = 0;
> + cond_resched();
> + }
> + }
> + }
> + }
> +}
> +
> +static void dump_inode(struct inode *inode,
> + char *name_buf,
> + struct vfsmount *mnt)
> +{
> + struct path path = {
> + .mnt = mnt,
> + .dentry = d_find_alias(inode)
> + };
> + char *name;
> + int len;
> +
> + if (!path.dentry) {
> + trace_dump_inode(inode, "?", 2);
> + return;
> + }
> +
> + name = d_path(&path, name_buf, PAGE_SIZE);
> + if (IS_ERR(name)) {
> + name = "?";
> + len = 2;
> + } else
> + len = PAGE_SIZE + name_buf - name;
> +
> + trace_dump_inode(inode, name, len);
> +
> + if (path.dentry)
> + dput(path.dentry);
> +}
> +
> +static void dump_fs_pagecache(struct super_block *sb, struct vfsmount *mnt)
> +{
> + struct inode *inode;
> + struct inode *prev_inode = NULL;
> + char *name_buf;
> +
> + name_buf = (char *)__get_free_page(GFP_TEMPORARY);
> + if (!name_buf)
> + return;
> +
> + down_read(&sb->s_umount);
> + if (!sb->s_root)
> + goto out;
> +
> + spin_lock(&inode_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> + continue;
> + __iget(inode);
> + spin_unlock(&inode_lock);
> + dump_inode(inode, name_buf, mnt);
> + if (inode->i_mapping->nrpages)
> + dump_pagecache(inode->i_mapping);
> + iput(prev_inode);
> + prev_inode = inode;
> + cond_resched();
> + spin_lock(&inode_lock);
> + }
> + spin_unlock(&inode_lock);
> + iput(prev_inode);
> +out:
> + up_read(&sb->s_umount);
> + free_page((unsigned long)name_buf);
> +}
> +
> +static ssize_t
> +trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
> + loff_t *ppos)
> +{
> + struct file *file = NULL;
> + char *name;
> + int err = 0;
> +
> + if (count <= 1)
> + return -EINVAL;
> + if (count > PATH_MAX + 1)
> + return -ENAMETOOLONG;
> +
> + name = kmalloc(count+1, GFP_KERNEL);
> + if (!name)
> + return -ENOMEM;
> +
> + if (copy_from_user(name, ubuf, count)) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + /* strip the newline added by `echo` */
> + if (name[count-1] != '\n')
> + return -EINVAL;
> + name[count-1] = '\0';
> +
> + file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
> + if (IS_ERR(file)) {
> + err = PTR_ERR(file);
> + file = NULL;
> + goto out;
> + }
> +
> + if (tracing_update_buffers() < 0) {
> + err = -ENOMEM;
> + goto out;
> + }
> + if (trace_set_clr_event("mm", "dump_pagecache_range", 1)) {
> + err = -EINVAL;
> + goto out;
> + }
> + if (trace_set_clr_event("mm", "dump_inode", 1)) {
> + err = -EINVAL;
> + goto out;
> + }
> +
> + if (filp->f_path.dentry->d_inode->i_private) {
> + dump_fs_pagecache(file->f_path.dentry->d_sb, file->f_path.mnt);
> + } else {
> + dump_pagecache(file->f_mapping);
> + }
> +
> +out:
> + if (file)
> + fput(file);
> + kfree(name);
> +
> + return err ? err : count;
> +}
> +
> +static const struct file_operations trace_pagecache_fops = {
> + .open = tracing_open_generic,
> + .read = trace_mm_dump_range_read,
> + .write = trace_pagecache_write,
> +};
> +
> /* move this into trace_objects.c when that file is created */
> static struct dentry *trace_objects_dir(void)
> {
> @@ -167,6 +365,12 @@ static __init int trace_objects_mm_init(
> trace_create_file("dump_range", 0600, d_pages, NULL,
> &trace_mm_fops);
>
> + trace_create_file("walk-file", 0600, d_pages, NULL,
> + &trace_pagecache_fops);
> +
> + trace_create_file("walk-fs", 0600, d_pages, (void *)1,
> + &trace_pagecache_fops);
> +
> return 0;
> }
> fs_initcall(trace_objects_mm_init);
> --- linux-mm.orig/fs/inode.c 2010-02-08 23:19:12.000000000 +0800
> +++ linux-mm/fs/inode.c 2010-02-08 23:19:22.000000000 +0800
> @@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
> inode->i_bdev = NULL;
> inode->i_cdev = NULL;
> inode->i_rdev = 0;
> - inode->dirtied_when = 0;
> + inode->dirtied_when = jiffies;
>
> if (security_inode_alloc(inode))
> goto out_free_inode;
>


2010-02-17 22:41:37

by Keiichi KII

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

Hello,

(02/15/10 22:22), KOSAKI Motohiro wrote:
>>> Here is a scratch patch to exercise the "object collections" idea :)
>>>
>>> Interestingly, the pagecache walk is pretty fast, while copying out the trace
>>> data takes more time:
>>>
>>> # time (echo / > walk-fs)
>>> (; echo / > walk-fs; ) 0.01s user 0.11s system 82% cpu 0.145 total
>>>
>>> # time wc /debug/tracing/trace
>>> 4570 45893 551282 /debug/tracing/trace
>>> wc /debug/tracing/trace 0.75s user 0.55s system 88% cpu 1.470 total
>>
>> Ah got it: it takes much time to "print" the raw trace data.
>>
>>> TODO:
>>>
>>> correctness
>>> - show file path name
>>> XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
>>
>> OK, finished with the file name with d_path(). I choose not to mangle
>> the possible '\n' in file names, and simply show "?" for such files,
>> for the sake of speed.
>
>
> This patch is nicer than KII-san's one. I plan to test it on
> my local test environment awhile.

I don't think Wu's patch completely replaces mine.
Both patches focus on the pagecache, and they can work together toward
an mm analysis tool for perf, something like "perf mm".

As he said, his patch can efficiently dump a snapshot of pagecache usage
for a file system or a file.
By taking several such snapshots, we can monitor how the pagecache grows
and shrinks.
My patch can monitor dynamic pagecache behavior such as the hit ratio and
the use frequency (see the output below).

For example, the output below shows an analysis of yum's pagecache behavior
using my patch.
Please focus on inodes 16 and 778 on device 253:0.
The system has 5752 cached pages for inode 16 and 86 cached pages for
inode 778.
We can get the same information using his patch as well.
But my patch gives further detail about the pagecache in the system.
There is a big difference in use frequency between inode 16 and inode 778:
yum looked up inode 16 34434 times against 5752 cached pages, but
inode 778 only 32 times against 86 cached pages, so it reuses the same
cached pages of inode 16 far more often than those of inode 778.

This may also be useful for improving or tuning pagecache management
such as pdflush.

[process list]
o yum-3215
cache find cache hit cache hit
device inode count count ratio
--------------------------------------------------------
253:0 16 34434 34130 99.12%
253:0 198 9692 9463 97.64%
253:0 639 647 628 97.06%
253:0 778 32 29 90.62%
253:0 7305 50225 49005 97.57%
253:0 144217 12 10 83.33%
253:0 262775 16 13 81.25%
*snip*

[file list]
device cached
(maj:min) inode pages
--------------------------------
253:0 16 5752
253:0 198 2233
253:0 639 51
253:0 778 86
253:0 7305 12307
253:0 144217 11
253:0 262775 39
*snip*

[process list]
o yum-3215
device cached added removed indirect
(maj:min) inode pages pages pages removed pages
----------------------------------------------------------------
253:0 16 34130 5752 0 0
253:0 198 9463 2233 0 0
253:0 639 628 51 0 0
253:0 778 29 78 0 0
253:0 7305 49005 12307 0 0
253:0 144217 10 11 0 0
253:0 262775 13 39 0 0
*snip*
----------------------------------------------------------------
total: 102346 26165 1 0
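
To make the comparison between inode 16 and inode 778 concrete, both figures can be recomputed from the tables above. This is only a minimal user-space sketch, not part of either patch; the helper names `hit_ratio_percent`, `finds_per_cached_page` and `print_reuse` are made up for illustration.

```c
#include <stdio.h>

/* Cache hit ratio in percent: hits / finds * 100 (0 when there were no finds). */
double hit_ratio_percent(unsigned long finds, unsigned long hits)
{
	return finds ? 100.0 * (double)hits / (double)finds : 0.0;
}

/* Average number of lookups per page currently cached for the file. */
double finds_per_cached_page(unsigned long finds, unsigned long cached_pages)
{
	return cached_pages ? (double)finds / (double)cached_pages : 0.0;
}

/* Print one summary row for a file (hypothetical helper). */
void print_reuse(unsigned long ino, unsigned long finds,
		 unsigned long hits, unsigned long cached_pages)
{
	printf("inode %lu: hit ratio %.2f%%, %.2f finds per cached page\n",
	       ino, hit_ratio_percent(finds, hits),
	       finds_per_cached_page(finds, cached_pages));
}
```

Calling `print_reuse(16, 34434, 34130, 5752)` and `print_reuse(778, 32, 29, 86)` with the numbers from the tables gives roughly 6 finds per cached page for inode 16 and well under 1 for inode 778, which is the difference in use frequency described above.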

Any comments are welcome.

Thanks,
Keiichi

2010-02-18 05:38:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

On Mon, 8 Feb 2010 23:54:50 +0800
Wu Fengguang <[email protected]> wrote:

> Hi Ingo,
>
> > Note that there's also these older experimental commits in tip:tracing/mm
> > that introduce the notion of 'object collections' and adds the ability to
> > trace them:
> >
> > 3383e37: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
> > c33b359: tracing, page-allocator: Add trace event for page traffic related to the buddy lists
> > 0d524fb: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
> > b9a2817: tracing, page-allocator: Add trace events for page allocation and page freeing
> > 08b6cb8: perf_counter tools: Provide default bfd_demangle() function in case it's not around
> > eb46710: tracing/mm: rename 'trigger' file to 'dump_range'
> > 1487a7a: tracing/mm: fix mapcount trace record field
> > dcac8cd: tracing/mm: add page frame snapshot trace
> >
> > this concept, if refreshed a bit and extended to the page cache, would allow
> > the recording/snapshotting of the MM state of all currently present pages in
> > the page-cache - a possibly nice addition to the dynamic technique you apply
> > in your patches.
> >
> > there's similar "object collections" work underway for 'perf lock' btw., by
> > Hitoshi Mitake and Frederic.
> >
> > So there's lots of common ground and lots of interest.
>
> Here is a scratch patch to exercise the "object collections" idea :)
>
> Interestingly, the pagecache walk is pretty fast, while copying out the trace
> data takes more time:
>
> # time (echo / > walk-fs)
> (; echo / > walk-fs; ) 0.01s user 0.11s system 82% cpu 0.145 total
>
> # time wc /debug/tracing/trace
> 4570 45893 551282 /debug/tracing/trace
> wc /debug/tracing/trace 0.75s user 0.55s system 88% cpu 1.470 total
>
> # time (cat /debug/tracing/trace > /dev/shm/t)
> (; cat /debug/tracing/trace > /dev/shm/t; ) 0.04s user 0.49s system 95% cpu 0.548 total
>
> # time (dd if=/debug/tracing/trace of=/dev/shm/t bs=1M)
> 0+138 records in
> 0+138 records out
> 551282 bytes (551 kB) copied, 0.380454 s, 1.4 MB/s
> (; dd if=/debug/tracing/trace of=/dev/shm/t bs=1M; ) 0.09s user 0.48s system 96% cpu 0.600 total
>
> The patch is based on tip/tracing/mm.
>
> Thanks,
> Fengguang
> ---
> tracing: pagecache object collections
>
> This dumps
> - all cached files of a mounted fs (the inode-cache)
> - all cached pages of a cached file (the page-cache)
>
> Usage and Sample output:
>
> # echo / > /debug/tracing/objects/mm/pages/walk-fs
> # head /debug/tracing/trace
>
> # tracer: nop
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> zsh-3078 [000] 526.272587: dump_inode: ino=102223 size=169291 cached=172032 age=9 dirty=6 dev=0:15 file=<TODO>
> zsh-3078 [000] 526.274260: dump_pagecache_range: index=0 len=41 flags=10000000000002c count=1 mapcount=0
> zsh-3078 [000] 526.274340: dump_pagecache_range: index=41 len=1 flags=10000000000006c count=1 mapcount=0
> zsh-3078 [000] 526.274401: dump_inode: ino=8966 size=442 cached=4096 age=49 dirty=0 dev=0:15 file=<TODO>
> zsh-3078 [000] 526.274425: dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0
> zsh-3078 [000] 526.274440: dump_inode: ino=8964 size=4096 cached=0 age=49 dirty=0 dev=0:15 file=<TODO>
>
> Here "age" is either age from inode create time, or from last dirty time.
>
> TODO:
>
> correctness
> - show file path name
> XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
> - reliably prevent ring buffer overflow,
> by replacing cond_resched() with some wait function
> (eg. wait until 2+ pages are free in ring buffer)
> - use stable_page_flags() in recent kernel
>
> output style
> - use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
> - clear ring buffer before dumping the objects?
> - output format: key=value pairs ==> header + tabbed values?
> - add filtering options if necessary
>

Can we dump the page's cgroup? If so, I'd be happy.
Maybe
==
struct page_cgroup *pc = lookup_page_cgroup(page);
struct mem_cgroup *mem = pc->mem_cgroup;
short mem_cgroup_id = mem->css.css_id;

And statistics can be counted per css_id.

And then, some output like

dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0 file=XXX memcg=group_A:x,group_B:y

Is it okay to add a new field after your work is finished?

If so, I'll think about some infrastructure to get above based on your patch.
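
The per-id counting and the proposed `memcg=` field could be modelled roughly as follows. This is a user-space sketch of the bookkeeping only, under the assumption that each page in a range yields one numeric id; `struct range_stats`, `account_page`, `format_memcg` and `MAX_GROUPS` are hypothetical names, and it prints numeric ids rather than the group names shown in the example output.

```c
#include <stdio.h>
#include <string.h>

#define MAX_GROUPS 8	/* arbitrary limit for this sketch */

struct memcg_count {
	unsigned short id;	/* stand-in for the cgroup's css id */
	unsigned long pages;	/* pages of the range charged to that id */
};

struct range_stats {
	struct memcg_count groups[MAX_GROUPS];
	int nr;			/* number of slots in use */
};

/* Account one page to the group with the given id; -1 if the table is full. */
int account_page(struct range_stats *st, unsigned short id)
{
	int i;

	for (i = 0; i < st->nr; i++) {
		if (st->groups[i].id == id) {
			st->groups[i].pages++;
			return 0;
		}
	}
	if (st->nr == MAX_GROUPS)
		return -1;
	st->groups[st->nr].id = id;
	st->groups[st->nr].pages = 1;
	st->nr++;
	return 0;
}

/* Format "memcg=id:pages,id:pages"; returns the untruncated length, like snprintf. */
int format_memcg(const struct range_stats *st, char *buf, size_t size)
{
	int i, len;

	len = snprintf(buf, size, "memcg=");
	for (i = 0; i < st->nr; i++) {
		/* Never write past the end of buf once output is truncated. */
		size_t off = (size_t)len < size ? (size_t)len : size;

		len += snprintf(buf + off, size - off, "%s%hu:%lu",
				i ? "," : "", st->groups[i].id,
				st->groups[i].pages);
	}
	return len;
}
```

A small fixed table keeps the sketch simple; a range whose pages belong to more than `MAX_GROUPS` groups would need either a larger table or an "others" bucket.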

Thanks,
-Kame





> CC: Ingo Molnar <[email protected]>
> CC: Chris Frost <[email protected]>
> CC: Steven Rostedt <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Frederic Weisbecker <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> fs/inode.c | 2
> include/trace/events/mm.h | 67 ++++++++++++++
> kernel/trace/trace_mm.c | 165 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 233 insertions(+), 1 deletion(-)
>
> --- linux-mm.orig/include/trace/events/mm.h 2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/include/trace/events/mm.h 2010-02-08 23:19:16.000000000 +0800
> @@ -2,6 +2,7 @@
> #define _TRACE_MM_H
>
> #include <linux/tracepoint.h>
> +#include <linux/pagemap.h>
> #include <linux/mm.h>
>
> #undef TRACE_SYSTEM
> @@ -42,6 +43,72 @@ TRACE_EVENT(dump_pages,
> __entry->mapcount, __entry->index)
> );
>
> +TRACE_EVENT(dump_pagecache_range,
> +
> + TP_PROTO(struct page *page, unsigned long len),
> +
> + TP_ARGS(page, len),
> +
> + TP_STRUCT__entry(
> + __field( unsigned long, index )
> + __field( unsigned long, len )
> + __field( unsigned long, flags )
> + __field( unsigned int, count )
> + __field( unsigned int, mapcount )
> + ),
> +
> + TP_fast_assign(
> + __entry->index = page->index;
> + __entry->len = len;
> + __entry->flags = page->flags;
> + __entry->count = atomic_read(&page->_count);
> + __entry->mapcount = page_mapcount(page);
> + ),
> +
> + TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u",
> + __entry->index,
> + __entry->len,
> + __entry->flags,
> + __entry->count,
> + __entry->mapcount)
> +);
> +
> +TRACE_EVENT(dump_inode,
> +
> + TP_PROTO(struct inode *inode),
> +
> + TP_ARGS(inode),
> +
> + TP_STRUCT__entry(
> + __field( unsigned long, ino )
> + __field( loff_t, size )
> + __field( unsigned long, nrpages )
> + __field( unsigned long, age )
> + __field( unsigned long, state )
> + __field( dev_t, dev )
> + ),
> +
> + TP_fast_assign(
> + __entry->ino = inode->i_ino;
> + __entry->size = i_size_read(inode);
> + __entry->nrpages = inode->i_mapping->nrpages;
> + __entry->age = jiffies - inode->dirtied_when;
> + __entry->state = inode->i_state;
> + __entry->dev = inode->i_sb->s_dev;
> + ),
> +
> + TP_printk("ino=%lu size=%llu cached=%lu age=%lu dirty=%lu "
> + "dev=%u:%u file=<TODO>",
> + __entry->ino,
> + __entry->size,
> + __entry->nrpages << PAGE_CACHE_SHIFT,
> + __entry->age / HZ,
> + __entry->state & I_DIRTY,
> + MAJOR(__entry->dev),
> + MINOR(__entry->dev))
> +);
> +
> +
> #endif /* _TRACE_MM_H */
>
> /* This part must be outside protection */
> --- linux-mm.orig/kernel/trace/trace_mm.c 2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/kernel/trace/trace_mm.c 2010-02-08 23:19:16.000000000 +0800
> @@ -9,6 +9,9 @@
> #include <linux/bootmem.h>
> #include <linux/debugfs.h>
> #include <linux/uaccess.h>
> +#include <linux/pagevec.h>
> +#include <linux/writeback.h>
> +#include <linux/file.h>
>
> #include "trace_output.h"
>
> @@ -95,6 +98,162 @@ static const struct file_operations trac
> .write = trace_mm_dump_range_write,
> };
>
> +static unsigned long page_flags(struct page* page)
> +{
> + return page->flags & ((1 << NR_PAGEFLAGS) - 1);
> +}
> +
> +static int pages_similiar(struct page* page0, struct page* page)
> +{
> + if (page_count(page0) != page_count(page))
> + return 0;
> +
> + if (page_mapcount(page0) != page_mapcount(page))
> + return 0;
> +
> + if (page_flags(page0) != page_flags(page))
> + return 0;
> +
> + return 1;
> +}
> +
> +#define BATCH_LINES 100
> +static void dump_pagecache(struct address_space *mapping)
> +{
> + int i;
> + int lines = 0;
> + pgoff_t len = 0;
> + struct pagevec pvec;
> + struct page *page;
> + struct page *page0 = NULL;
> + unsigned long start = 0;
> +
> + for (;;) {
> + pagevec_init(&pvec, 0);
> + pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> + (void **)pvec.pages, start + len, PAGEVEC_SIZE);
> +
> + if (pvec.nr == 0) {
> + if (len)
> + trace_dump_pagecache_range(page0, len);
> + break;
> + }
> +
> + if (!page0)
> + page0 = pvec.pages[0];
> +
> + for (i = 0; i < pvec.nr; i++) {
> + page = pvec.pages[i];
> +
> + if (page->index == start + len &&
> + pages_similiar(page0, page))
> + len++;
> + else {
> + trace_dump_pagecache_range(page0, len);
> + page0 = page;
> + start = page->index;
> + len = 1;
> + if (++lines > BATCH_LINES) {
> + lines = 0;
> + cond_resched();
> + }
> + }
> + }
> + }
> +}
> +
> +static void dump_fs_pagecache(struct super_block *sb)
> +{
> + struct inode *inode;
> + struct inode *prev_inode = NULL;
> +
> + down_read(&sb->s_umount);
> + if (!sb->s_root)
> + goto out;
> + spin_lock(&inode_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> + continue;
> + __iget(inode);
> + spin_unlock(&inode_lock);
> + trace_dump_inode(inode);
> + if (inode->i_mapping->nrpages)
> + dump_pagecache(inode->i_mapping);
> + iput(prev_inode);
> + prev_inode = inode;
> + cond_resched();
> + spin_lock(&inode_lock);
> + }
> + spin_unlock(&inode_lock);
> + iput(prev_inode);
> +out:
> + up_read(&sb->s_umount);
> +}
> +
> +static ssize_t
> +trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
> + loff_t *ppos)
> +{
> + struct file *file = NULL;
> + char *name;
> + int err = 0;
> +
> + if (count > PATH_MAX + 1)
> + return -ENAMETOOLONG;
> +
> + name = kmalloc(count+1, GFP_KERNEL);
> + if (!name)
> + return -ENOMEM;
> +
> + if (copy_from_user(name, ubuf, count)) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + /* strip the newline added by `echo` */
> + if (count)
> + name[count-1] = '\0';
> +
> + file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
> + if (IS_ERR(file)) {
> + err = PTR_ERR(file);
> + file = NULL;
> + goto out;
> + }
> +
> + if (tracing_update_buffers() < 0) {
> + err = -ENOMEM;
> + goto out;
> + }
> + if (trace_set_clr_event("mm", "dump_pagecache_range", 1)) {
> + err = -EINVAL;
> + goto out;
> + }
> + if (trace_set_clr_event("mm", "dump_inode", 1)) {
> + err = -EINVAL;
> + goto out;
> + }
> +
> + if (filp->f_path.dentry->d_inode->i_private) {
> + dump_fs_pagecache(file->f_path.dentry->d_sb);
> + } else {
> + dump_pagecache(file->f_mapping);
> + }
> +
> +out:
> + if (file)
> + fput(file);
> + kfree(name);
> +
> + return err ? err : count;
> +}
> +
> +static const struct file_operations trace_pagecache_fops = {
> + .open = tracing_open_generic,
> + .read = trace_mm_dump_range_read,
> + .write = trace_pagecache_write,
> +};
> +
> /* move this into trace_objects.c when that file is created */
> static struct dentry *trace_objects_dir(void)
> {
> @@ -167,6 +326,12 @@ static __init int trace_objects_mm_init(
> trace_create_file("dump_range", 0600, d_pages, NULL,
> &trace_mm_fops);
>
> + trace_create_file("walk-file", 0600, d_pages, NULL,
> + &trace_pagecache_fops);
> +
> + trace_create_file("walk-fs", 0600, d_pages, (void *)1,
> + &trace_pagecache_fops);
> +
> return 0;
> }
> fs_initcall(trace_objects_mm_init);
> --- linux-mm.orig/fs/inode.c 2010-02-08 23:19:12.000000000 +0800
> +++ linux-mm/fs/inode.c 2010-02-08 23:19:22.000000000 +0800
> @@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
> inode->i_bdev = NULL;
> inode->i_cdev = NULL;
> inode->i_rdev = 0;
> - inode->dirtied_when = 0;
> + inode->dirtied_when = jiffies;
>
> if (security_inode_alloc(inode))
> goto out_free_inode;
>

2010-02-18 09:59:06

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

* KAMEZAWA Hiroyuki <[email protected]> [2010-02-18 14:34:29]:

> On Mon, 8 Feb 2010 23:54:50 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > Hi Ingo,
> >
> > > Note that there's also these older experimental commits in tip:tracing/mm
> > > that introduce the notion of 'object collections' and adds the ability to
> > > trace them:
> > >
> > > 3383e37: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
> > > c33b359: tracing, page-allocator: Add trace event for page traffic related to the buddy lists
> > > 0d524fb: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
> > > b9a2817: tracing, page-allocator: Add trace events for page allocation and page freeing
> > > 08b6cb8: perf_counter tools: Provide default bfd_demangle() function in case it's not around
> > > eb46710: tracing/mm: rename 'trigger' file to 'dump_range'
> > > 1487a7a: tracing/mm: fix mapcount trace record field
> > > dcac8cd: tracing/mm: add page frame snapshot trace
> > >
> > > this concept, if refreshed a bit and extended to the page cache, would allow
> > > the recording/snapshotting of the MM state of all currently present pages in
> > > the page-cache - a possibly nice addition to the dynamic technique you apply
> > > in your patches.
> > >
> > > there's similar "object collections" work underway for 'perf lock' btw., by
> > > Hitoshi Mitake and Frederic.
> > >
> > > So there's lots of common ground and lots of interest.
> >
> > Here is a scratch patch to exercise the "object collections" idea :)
> >
> > Interestingly, the pagecache walk is pretty fast, while copying out the trace
> > data takes more time:
> >
> > # time (echo / > walk-fs)
> > (; echo / > walk-fs; ) 0.01s user 0.11s system 82% cpu 0.145 total
> >
> > # time wc /debug/tracing/trace
> > 4570 45893 551282 /debug/tracing/trace
> > wc /debug/tracing/trace 0.75s user 0.55s system 88% cpu 1.470 total
> >
> > # time (cat /debug/tracing/trace > /dev/shm/t)
> > (; cat /debug/tracing/trace > /dev/shm/t; ) 0.04s user 0.49s system 95% cpu 0.548 total
> >
> > # time (dd if=/debug/tracing/trace of=/dev/shm/t bs=1M)
> > 0+138 records in
> > 0+138 records out
> > 551282 bytes (551 kB) copied, 0.380454 s, 1.4 MB/s
> > (; dd if=/debug/tracing/trace of=/dev/shm/t bs=1M; ) 0.09s user 0.48s system 96% cpu 0.600 total
> >
> > The patch is based on tip/tracing/mm.
> >
> > Thanks,
> > Fengguang
> > ---
> > tracing: pagecache object collections
> >
> > This dumps
> > - all cached files of a mounted fs (the inode-cache)
> > - all cached pages of a cached file (the page-cache)
> >
> > Usage and Sample output:
> >
> > # echo / > /debug/tracing/objects/mm/pages/walk-fs
> > # head /debug/tracing/trace
> >
> > # tracer: nop
> > #
> > # TASK-PID CPU# TIMESTAMP FUNCTION
> > # | | | | |
> > zsh-3078 [000] 526.272587: dump_inode: ino=102223 size=169291 cached=172032 age=9 dirty=6 dev=0:15 file=<TODO>
> > zsh-3078 [000] 526.274260: dump_pagecache_range: index=0 len=41 flags=10000000000002c count=1 mapcount=0
> > zsh-3078 [000] 526.274340: dump_pagecache_range: index=41 len=1 flags=10000000000006c count=1 mapcount=0
> > zsh-3078 [000] 526.274401: dump_inode: ino=8966 size=442 cached=4096 age=49 dirty=0 dev=0:15 file=<TODO>
> > zsh-3078 [000] 526.274425: dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0
> > zsh-3078 [000] 526.274440: dump_inode: ino=8964 size=4096 cached=0 age=49 dirty=0 dev=0:15 file=<TODO>
> >
> > Here "age" is either age from inode create time, or from last dirty time.
> >
> > TODO:
> >
> > correctness
> > - show file path name
> > XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
> > - reliably prevent ring buffer overflow,
> > by replacing cond_resched() with some wait function
> > (eg. wait until 2+ pages are free in ring buffer)
> > - use stable_page_flags() in recent kernel
> >
> > output style
> > - use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
> > - clear ring buffer before dumping the objects?
> > - output format: key=value pairs ==> header + tabbed values?
> > - add filtering options if necessary
> >
>
> Can we dump page's cgroup ? If so, I'm happy.
> Maybe
> ==
> struct page_cgroup *pc = lookup_page_cgroup(page);
> struct mem_cgroup *mem = pc->mem_cgroup;
> short mem_cgroup_id = mem->css.css_id;
>
> And statistics can be counted per css_id.
>

Good idea. All of this also needs a check for whether memcg is
enabled or disabled at boot. pc can be NULL if
CONFIG_CGROUP_MEM_RES_CTLR is not enabled.
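
The guard could look roughly like this. It is a user-space model, not kernel code: the two structs are simplified stand-ins for the kernel structures of the same name, `memcg_enabled` is a hypothetical flag standing for the boot-time check, and using 0 to mean "no group" is an assumption of this sketch.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; layouts are illustrative only. */
struct mem_cgroup {
	unsigned short id;
};

struct page_cgroup {
	struct mem_cgroup *mem_cgroup;
};

/*
 * Return the id to record for a page, or 0 when no cgroup information is
 * available: memcg disabled, no page_cgroup for the page, or a page that
 * is not charged to any group.
 */
unsigned short page_memcg_id(int memcg_enabled, const struct page_cgroup *pc)
{
	if (!memcg_enabled)
		return 0;
	if (!pc || !pc->mem_cgroup)
		return 0;
	return pc->mem_cgroup->id;
}
```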

> And then, some output like
>
> dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0 file=XXX memcg=group_A:x,group_B:y
>
> Is it okay to add a new field after your work finish ?
>
--
Three Cheers,
Balbir

2010-02-21 02:28:37

by Fengguang Wu

Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

Hi Balbir,

> > tracing: pagecache object collections
> >
> > This dumps
> > - all cached files of a mounted fs (the inode-cache)
> > - all cached pages of a cached file (the page-cache)
> >
> > Usage and Sample output:
> >
> > # echo /dev > /debug/tracing/objects/mm/pages/walk-fs
> > # tail /debug/tracing/trace
> > zsh-2528 [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
> > zsh-2528 [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
> > zsh-2528 [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
> > zsh-2528 [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
> > zsh-2528 [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
> > zsh-2528 [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
> > zsh-2528 [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
> > zsh-2528 [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
> > zsh-2528 [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
> > zsh-2528 [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev
> >
> > Here "age" is either age from inode create time, or from last dirty time.
> >
>
> It would be nice to see mapped/unmapped information as well.

As you noticed, we have mapcount for individual pages :)

> > +static int pages_similiar(struct page* page0, struct page* page)
> > +{
> > + if (page_count(page0) != page_count(page))
> > + return 0;
> > +
> > + if (page_mapcount(page0) != page_mapcount(page))
> > + return 0;
> > +
> > + if (page_flags(page0) != page_flags(page))
> > + return 0;
> > +
> > + return 1;
> > +}
> > +
>
> OK, so pages_similar() is used to identify a range of pages in the
> cache?

Right. Many files are accessed sequentially or clustered, so
pages_similar() can save lots of output lines :)
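
To illustrate the saving, here is a minimal userspace sketch of the same
idea: consecutive pages with equal count, mapcount and flags collapse into
one (index, len) record, as in the dump_pagecache_range lines. The struct
and function names are hypothetical stand-ins, not the patch's code, and the
contiguity check on the page index is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for the per-page state being compared. */
struct page_state {
	unsigned long index;	/* page offset within the file */
	unsigned long flags;
	int count;
	int mapcount;
};

/* One output record: a run of 'len' similar pages starting at 'index'. */
struct page_range {
	unsigned long index;
	unsigned long len;
	unsigned long flags;
	int count;
	int mapcount;
};

/*
 * Coalesce pages (sorted by index) into ranges. A run is extended only when
 * the next page is contiguous and has the same count, mapcount and flags.
 * Returns the number of ranges written, never more than max_out.
 */
size_t coalesce_ranges(const struct page_state *pages, size_t n,
		       struct page_range *out, size_t max_out)
{
	size_t nr = 0;

	for (size_t i = 0; i < n; i++) {
		const struct page_state *p = &pages[i];

		if (nr > 0) {
			struct page_range *r = &out[nr - 1];

			if (p->index == r->index + r->len &&
			    p->count == r->count &&
			    p->mapcount == r->mapcount &&
			    p->flags == r->flags) {
				r->len++;	/* same run, no new line */
				continue;
			}
		}
		if (nr == max_out)
			break;		/* output buffer is full */
		out[nr++] = (struct page_range){
			p->index, 1, p->flags, p->count, p->mapcount
		};
	}
	return nr;
}
```

A sequentially read file whose pages all share one state comes out as a
single record regardless of its size, which is where the line savings come
from.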

> > +#define BATCH_LINES 100
> > +static void dump_pagecache(struct address_space *mapping)
> > +{
> > + int i;
> > + int lines = 0;
> > + pgoff_t len = 0;
> > + struct pagevec pvec;
> > + struct page *page;
> > + struct page *page0 = NULL;
> > + unsigned long start = 0;
> > +
> > + for (;;) {
> > + pagevec_init(&pvec, 0);
> > + pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> > + (void **)pvec.pages, start + len, PAGEVEC_SIZE);
>
> Is radix_tree_gang_lookup synchronized somewhere? Don't we need to
> call it under RCU or a lock (mapping) ?

No. This function is inherently non-atomic, and it seems that most in-kernel
users do not bother to take rcu_read_lock(). So let's leave it as is?

> > +static ssize_t
> > +trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
> > + loff_t *ppos)
> > +{
> > + struct file *file = NULL;
> > + char *name;
> > + int err = 0;
> > +
>
> Can't we use the trace_parser here?

Seems not necessary? It's merely one file name, which could contain spaces.

> > + if (count <= 1)
> > + return -EINVAL;
> > + if (count > PATH_MAX + 1)
> > + return -ENAMETOOLONG;
> > +
> > + name = kmalloc(count+1, GFP_KERNEL);
> > + if (!name)
> > + return -ENOMEM;
> > +
> > + if (copy_from_user(name, ubuf, count)) {
> > + err = -EFAULT;
> > + goto out;
> > + }
> > +
> > + /* strip the newline added by `echo` */
> > + if (name[count-1] != '\n')
> > + return -EINVAL;
>
> Doesn't sound correct, what happens if we use echo -n?

It's a bit sad. If we accept both "echo" and "echo -n" with some
smart logic to test for trailing '\n', then it will go wrong for a
'\n'-terminated file name.

Or shall we support only "echo -n"? I can do with either one.
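
For reference, a minimal userspace sketch of the strict "plain echo only"
rule: the input must end in '\n' and exactly one trailing '\n' is stripped,
so a file name that itself ends in a newline stays expressible.
parse_name_strict() is a hypothetical helper, with malloc() standing in for
kmalloc(); it is not the patch's code.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Accept only input terminated by '\n' and strip exactly that one byte.
 * On success returns 0 and stores a malloc'ed, NUL-terminated name in *out.
 * On failure returns a negative errno value and leaves *out untouched.
 */
int parse_name_strict(const char *buf, size_t count, char **out)
{
	char *name;

	if (count <= 1)
		return -EINVAL;		/* empty, or nothing but the newline */
	if (buf[count - 1] != '\n')
		return -EINVAL;		/* e.g. written with "echo -n" */

	name = malloc(count);		/* count - 1 name bytes plus the NUL */
	if (!name)
		return -ENOMEM;

	memcpy(name, buf, count - 1);
	name[count - 1] = '\0';
	*out = name;
	return 0;
}
```

Validating before allocating means the error returns need no cleanup path.
With this rule, "echo /dev" yields the name "/dev", "echo -n /dev" is
rejected, and input ending in two newlines yields a name ending in one.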

> > --- linux-mm.orig/fs/inode.c 2010-02-08 23:19:12.000000000 +0800
> > +++ linux-mm/fs/inode.c 2010-02-08 23:19:22.000000000 +0800
> > @@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
> > inode->i_bdev = NULL;
> > inode->i_cdev = NULL;
> > inode->i_rdev = 0;
> > - inode->dirtied_when = 0;
> > + inode->dirtied_when = jiffies;
> >
>
> Hmmm... Is the inode really dirtied when initialized? I know the
> change is for tracing, but the code when read is confusing.

Huh. Not really dirtied (for that you need to check I_DIRTY), but
dirtied_when is only used in writeback code when I_DIRTY is set.

So I overload dirtied_when in the clean case to indicate the inode
load time. This is a useful trick for fastboot to collect cache
footprint shortly after boot, when most inodes are clean.

It does ask for a comment:

/*
* This records inode load time. It will be invalidated once inode is
* dirtied, or jiffies wraps around. Despite the pitfalls it still
* provides useful information for some use cases like fastboot.
*/
inode->dirtied_when = jiffies;
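
The wrap-around caveat can be sketched in userspace as below. The helper
names and SKETCH_HZ are hypothetical, and reporting the age in seconds is an
assumption of this sketch. Unsigned subtraction gives the right tick delta
across a single counter wrap; once a full counter period has passed, the
result is ambiguous, which is the pitfall the comment refers to.

```c
/* Assumed tick rate for illustration; the kernel's HZ is a build option. */
#define SKETCH_HZ 1000UL

/* Ticks elapsed from 'then' to 'now'; still correct across one wrap. */
unsigned long ticks_since(unsigned long now, unsigned long then)
{
	return now - then;	/* unsigned arithmetic is modulo 2^N */
}

/* Age of a timestamp in whole seconds at the assumed tick rate. */
unsigned long age_in_seconds(unsigned long now, unsigned long stamp)
{
	return ticks_since(now, stamp) / SKETCH_HZ;
}
```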


Thanks,
Fengguang

2010-02-21 03:10:31

by Fengguang Wu

Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

Kame,

On Thu, Feb 18, 2010 at 01:34:29PM +0800, KAMEZAWA Hiroyuki wrote:

> Can we dump page's cgroup ? If so, I'm happy.

Good idea. page_cgroup is an extension of mem_map anyway.

> Maybe
> ==
> struct page_cgroup *pc = lookup_page_cgroup(page);
> struct mem_cgroup *mem = pc->mem_cgroup;
> short mem_cgroup_id = mem->css.css_id;
>
> And statistics can be counted per css_id.
>
> And then, some output like
>
> dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0 file=XXX memcg=group_A:x,group_B:y

Is it possible for a page to be owned by two cgroups?
For hierarchical cgroups, it would be easier to report only the bottom level cgroup.

> Is it okay to add a new field after your work is finished?

Sure.

> If so, I'll think about some infrastructure to get above based on your patch.

Then you may want to include this patch (with modification),
if you record the css id as raw tracing data.

Thanks,
Fengguang
---
memcg: show memory.id in cgroupfs

The hwpoison test suite needs to selectively inject hwpoison to some
targeted task pages, and must not kill important system processes
such as init.

The memory cgroup serves this purpose well. We can put the target
processes under the control of a memory cgroup, tell the hwpoison
injection code the id of that memory cgroup so that it will only
poison pages associated with it.

Signed-off-by: Wu Fengguang <[email protected]>
---
mm/memcontrol.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

--- linux-mm.orig/mm/memcontrol.c 2009-09-07 16:01:02.000000000 +0800
+++ linux-mm/mm/memcontrol.c 2009-09-11 18:20:55.000000000 +0800
@@ -2510,6 +2510,13 @@ mem_cgroup_get_recursive_idx_stat(struct
*val = d.val;
}

+#ifdef CONFIG_HWPOISON_INJECT
+static u64 mem_cgroup_id_read(struct cgroup *cont, struct cftype *cft)
+{
+ return css_id(cgroup_subsys_state(cont, mem_cgroup_subsys_id));
+}
+#endif
+
static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
@@ -2841,6 +2848,12 @@ static int mem_cgroup_swappiness_write(s


static struct cftype mem_cgroup_files[] = {
+#ifdef CONFIG_HWPOISON_INJECT /* for now, only user is hwpoison testing */
+ {
+ .name = "id",
+ .read_u64 = mem_cgroup_id_read,
+ },
+#endif
{
.name = "usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),

2010-02-23 14:05:54

by Fengguang Wu

Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

On Thu, Feb 18, 2010 at 05:58:50PM +0800, Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <[email protected]> [2010-02-18 14:34:29]:
> > Can we dump page's cgroup ? If so, I'm happy.
> > Maybe
> > ==
> > struct page_cgroup *pc = lookup_page_cgroup(page);
> > struct mem_cgroup *mem = pc->mem_cgroup;
> > short mem_cgroup_id = mem->css.css_id;
> >
> > And statistics can be counted per css_id.
> >
>
> Good idea. All of this also needs a check for whether memcg was
> enabled or disabled at boot. pc can be NULL if
> CONFIG_CGROUP_MEM_RES_CTLR is not enabled.

Not sure if this is what you had in mind, but I defined a function in
memcontrol.c for the trace code. Compile tested.

It'll be used like this:

TP_fast_assign(
__entry->memcg = page_memcg_id(page);
)

TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u memcg=%d",
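
On the consumer side, per-memcg statistics could then be gathered from the
trace text. A minimal userspace sketch, assuming the output ends up with the
"len=" and "memcg=" fields in the format above; account_line() and
parse_field() are hypothetical helpers, and a "file=" value containing
" len=" would confuse this simple matcher.

```c
#include <stdio.h>
#include <string.h>

/* Find " key=" in the line and parse the decimal number that follows. */
static int parse_field(const char *line, const char *key, unsigned long *val)
{
	char pat[32];
	const char *p;

	if (snprintf(pat, sizeof(pat), " %s=", key) >= (int)sizeof(pat))
		return 0;		/* key too long for the pattern */
	p = strstr(line, pat);
	if (!p)
		return 0;
	return sscanf(p + strlen(pat), "%lu", val) == 1;
}

/*
 * Add the page count of one dump_pagecache_range line to its memcg slot.
 * Returns 1 if the line was accounted, 0 if it was skipped.
 */
int account_line(const char *line, unsigned long pages_per_memcg[], size_t n)
{
	unsigned long len, id;

	if (!strstr(line, "dump_pagecache_range:"))
		return 0;		/* some other event, e.g. dump_inode */
	if (!parse_field(line, "len", &len) ||
	    !parse_field(line, "memcg", &id))
		return 0;
	if (id >= n)
		return 0;		/* id does not fit the caller's table */

	pages_per_memcg[id] += len;
	return 1;
}
```

Feeding every line of the trace file through account_line() leaves the
number of cached pages charged to each memcg id in the table.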

Thanks,
Fengguang

---
memcg: introduce page_memcg_id()

This will be used to dump the memcg id associated with a pagecache page.

CC: Balbir Singh <[email protected]>
CC: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/linux/memcontrol.h | 6 ++++++
mm/memcontrol.c | 16 ++++++++++++++++
2 files changed, 22 insertions(+)

--- linux-mm.orig/include/linux/memcontrol.h 2010-02-23 21:49:39.000000000 +0800
+++ linux-mm/include/linux/memcontrol.h 2010-02-23 21:50:14.000000000 +0800
@@ -69,6 +69,7 @@ extern void mem_cgroup_out_of_memory(str
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);

extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
+extern unsigned short page_memcg_id(struct page *page);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

static inline
@@ -142,6 +143,11 @@ static inline int mem_cgroup_try_charge_
return 0;
}

+static inline unsigned short page_memcg_id(struct page *page)
+{
+ return 0;
+}
+
static inline void mem_cgroup_commit_charge_swapin(struct page *page,
struct mem_cgroup *ptr)
{
--- linux-mm.orig/mm/memcontrol.c 2010-02-23 21:48:23.000000000 +0800
+++ linux-mm/mm/memcontrol.c 2010-02-23 21:49:33.000000000 +0800
@@ -324,6 +324,22 @@ static struct mem_cgroup *try_get_mem_cg
return mem;
}

+unsigned short page_memcg_id(struct page *page)
+{
+ struct mem_cgroup *mem;
+ struct cgroup_subsys_state *css;
+ unsigned short id = 0;
+
+ mem = try_get_mem_cgroup_from_page(page);
+ if (mem) {
+ css = mem_cgroup_css(mem);
+ id = css_id(css);
+ css_put(css);
+ }
+
+ return id;
+}
+
/*
* Call callback function against all cgroup under hierarchy tree.
*/