Hello,
I would like to propose several tracepoints for tracing pagecache behavior.
With these tracepoints, we can monitor pagecache usage at high resolution.
-----------------------------------------------------------------------------
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
postmaster-7293 [002] 104039.093744: find_get_page: s_dev=8:2 i_ino=1967242 offset=22302 page_found
postmaster-7293 [000] 104047.138110: add_to_page_cache: s_dev=8:2 i_ino=1967242 offset=5672
postmaster-7293 [000] 104072.590885: remove_from_page_cache: s_dev=8:2 i_ino=5016146 offset=1
-----------------------------------------------------------------------------
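To make the event format above concrete, here is a minimal Python 3 sketch that splits one event line into its fields. It is not the attached script; the regular expression and the name parse_trace_line are illustrative assumptions, and it expects each event on a single unwrapped line.

```python
import re

# Matches one event line of the ftrace text output shown above.
# Assumption: each event is on a single line (no mail-client wrapping).
TRACE_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s+"
    r"s_dev=(?P<major>\d+):(?P<minor>\d+)\s+i_ino=(?P<ino>\d+)\s+"
    r"offset=(?P<offset>\d+)(?:\s+(?P<result>page_found|page_not_found))?\s*$"
)


def parse_trace_line(line):
    """Return a dict of fields for an event line, or None for any other line."""
    m = TRACE_RE.match(line)
    if m is None:
        return None  # header lines such as "# tracer: nop" end up here
    result = m.group("result")
    return {
        "task": m.group("task"),
        "pid": int(m.group("pid")),
        "cpu": int(m.group("cpu")),
        "timestamp": float(m.group("ts")),
        "event": m.group("event"),
        "dev": (int(m.group("major")), int(m.group("minor"))),
        "ino": int(m.group("ino")),
        "offset": int(m.group("offset")),
        # Only find_get_page reports a lookup result; None for other events.
        "found": None if result is None else result == "page_found",
    }
```

Lines that do not match (the "#" header lines, for instance) return None, so a caller can simply skip them.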
Currently we can see system-wide pagecache usage in /proc/meminfo,
but there is no way to get higher-resolution information such as per-file or
per-process usage.
A process may share pages in the pagecache, add pages to it, or
remove pages from it.
If the pagecache miss ratio rises, it may lead to extra I/O and
affect system performance.
So, by using the tracepoints we can get the following information:
1. how many pagecache pages each process has for each file
2. how many pages are cached for each file
3. how many pagecache pages each process shares
4. how often each process adds/removes pagecache pages
5. how long a page stays in the pagecache
6. pagecache hit rate per file
In particular, monitoring pagecache usage per file would help us tune
applications such as databases.
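As an illustration of item 6, the per-file hit rate could be derived from find_get_page events roughly as follows. This is a Python 3 sketch under stated assumptions: the (event, dev, ino, found) tuple layout and the name hit_rate_per_file are hypothetical and not part of the patch or the attached script.

```python
from collections import defaultdict


def hit_rate_per_file(events):
    """Map (dev, ino) -> fraction of find_get_page lookups that found a page.

    `events` is an iterable of (event_name, dev, ino, found) tuples; this
    layout is an assumption made for the sketch.
    """
    lookups = defaultdict(int)
    hits = defaultdict(int)
    for name, dev, ino, found in events:
        if name != "find_get_page":
            continue  # only lookups contribute to the hit rate
        lookups[(dev, ino)] += 1
        if found:
            hits[(dev, ino)] += 1
    # Every key in `lookups` has at least one lookup, so no division by zero.
    return {key: hits[key] / lookups[key] for key in lookups}
```

A file with no find_get_page events in the trace simply does not appear in the result.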
I attach a sample script for counting file-by-file pagecache usage per process.
The script processes raw data from <debugfs>/tracing/trace to produce
human-readable output.
You can run it as:
# echo 1 > <debugfs>/tracing/events/filemap/enable
# cat <debugfs>/tracing/trace | python trace-pagecache-postprocess.py
The script implements items 1, 2 and 3 of the list above.
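For readers without the attachment, the per-file count (item 2) can be approximated as below. This Python 3 sketch is not the attached script: the event tuple layout, the name cached_kb_per_file and the 4 KiB page size are assumptions. Pages that were already cached before tracing started are invisible to it.

```python
from collections import defaultdict

PAGE_SIZE_KB = 4  # assumption: 4 KiB pages


def cached_kb_per_file(events):
    """Map (dev, ino) -> KB added during the trace and not yet removed.

    `events` is an iterable of (event_name, dev, ino, offset) tuples in
    timestamp order; this layout is an assumption made for the sketch.
    """
    resident = defaultdict(set)
    for name, dev, ino, offset in events:
        key = (dev, ino)
        if name == "add_to_page_cache":
            resident[key].add(offset)
        elif name == "remove_from_page_cache":
            # discard() tolerates pages that were added before tracing began
            resident[key].discard(offset)
    return {key: len(offsets) * PAGE_SIZE_KB for key, offsets in resident.items()}
```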
o script output format
[file list]
< pagecache usage on a file basis >
...
[process list]
process: < pagecache usage of this process >
dev: < pagecache usage of above process on this file >
...
...
For example:
The output below shows pagecache usage while pgbench (a benchmark for PostgreSQL)
runs.
Inode 1967121 is a file (75M) that is part of the PostgreSQL database.
Inode 5019039 is the PostgreSQL executable (2.9M),
"/usr/bin/postgres".
- if "added"(L8) > "cached"(L2), then
pagecache pages are being added and removed repeatedly.
=> Bad case for pagecache usage
- if "cached"(L3) >= "added"(L9) && "cached"(L6) > 0, then
there are no unnecessary I/O operations.
=> Good case for pagecache usage.
("L2" means the second line of the output, "2: dev:8:2, ...".)
-----------------------------------------------------------------------------
1: [file list]
2: dev:8:2, inode:1967121, cached: 13M
3: dev:8:2, inode:5019039, cached: 1M
4: [process list]
5: process: kswapd0-369 (cached:0K, added:0K, removed:0K, indirect removed:10M)
6: dev:8:2, inode:1967121, cached:0K, added:0K, removed:0K, indirect removed:10M
7: process: postmaster-5025 (cached:23M, added:26M, removed:616K, indirect removed:176K)
8: dev:8:2, inode:1967121, cached:22M, added:26M, removed:616K, indirect removed:0K
9: dev:8:2, inode:5019039, cached:1M, added:64K, removed:0K, indirect removed:176K
10: process: dd-5028 (cached:0K, added:0K, removed:0K, indirect removed:1M)
11: dev:8:2, inode:1967121, cached:0K, added:0K, removed:0K, indirect removed:848K
12: dev:8:2, inode:5019039, cached:0K, added:0K, removed:0K, indirect removed:396K
-----------------------------------------------------------------------------
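The two rules of thumb given before the sample output can be written down as a small check. This Python 3 sketch assumes sizes already converted to KB, compares a process's "added" value for a file with the file-level "cached" value, and treats the last condition of the good case as "cached > 0"; the function name and the "unknown" result are illustrative assumptions.

```python
def classify_file_usage(added_kb, cached_kb):
    """Classify pagecache usage of one file as "bad", "good" or "unknown".

    added_kb:  KB a process added to the pagecache for the file
    cached_kb: KB of the file currently cached (file list entry)
    """
    if added_kb > cached_kb:
        # More was added than remains cached: pages are being re-read.
        return "bad"
    if cached_kb >= added_kb and cached_kb > 0:
        # Everything that was added is still cached: no repeated I/O.
        return "good"
    return "unknown"  # nothing cached and nothing added


# Values from the sample output, converted to KB.
print(classify_file_usage(added_kb=26 * 1024, cached_kb=13 * 1024))  # inode 1967121
print(classify_file_usage(added_kb=64, cached_kb=1 * 1024))          # inode 5019039
```

With the sample numbers, inode 1967121 falls into the bad case and inode 5019039 into the good case, matching the reading given above.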
Any comments are welcome.
--
Keiichi Kii <[email protected]>
Hello,
Sorry, I misspelled the linux-mm ML address,
so I am sending the patch again.
This patch adds several tracepoints to track pagecache behavior.
These tracepoints would help us monitor pagecache usage at high resolution.
Signed-off-by: Keiichi Kii <[email protected]>
Cc: Atsushi Tsuji <[email protected]>
---
include/trace/events/filemap.h | 83 +++++++++++++++++++++++++++++++++++++++++
mm/filemap.c | 5 ++
mm/truncate.c | 2
mm/vmscan.c | 3 +
4 files changed, 93 insertions(+)
Index: linux-2.6-tip/include/trace/events/filemap.h
===================================================================
--- /dev/null
+++ linux-2.6-tip/include/trace/events/filemap.h
@@ -0,0 +1,83 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM filemap
+
+#if !defined(_TRACE_FILEMAP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FILEMAP_H
+
+#include <linux/fs.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(find_get_page,
+
+ TP_PROTO(struct address_space *mapping, pgoff_t offset,
+ struct page *page),
+
+ TP_ARGS(mapping, offset, page),
+
+ TP_STRUCT__entry(
+ __field(dev_t, s_dev)
+ __field(ino_t, i_ino)
+ __field(pgoff_t, offset)
+ __field(struct page *, page)
+ ),
+
+ TP_fast_assign(
+ __entry->s_dev = mapping->host ? mapping->host->i_sb->s_dev : 0;
+ __entry->i_ino = mapping->host ? mapping->host->i_ino : 0;
+ __entry->offset = offset;
+ __entry->page = page;
+ ),
+
+ TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu %s", MAJOR(__entry->s_dev),
+ MINOR(__entry->s_dev), __entry->i_ino, __entry->offset,
+ __entry->page == NULL ? "page_not_found" : "page_found")
+);
+
+TRACE_EVENT(add_to_page_cache,
+
+ TP_PROTO(struct address_space *mapping, pgoff_t offset),
+
+ TP_ARGS(mapping, offset),
+
+ TP_STRUCT__entry(
+ __field(dev_t, s_dev)
+ __field(ino_t, i_ino)
+ __field(pgoff_t, offset)
+ ),
+
+ TP_fast_assign(
+ __entry->s_dev = mapping->host->i_sb->s_dev;
+ __entry->i_ino = mapping->host->i_ino;
+ __entry->offset = offset;
+ ),
+
+ TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu", MAJOR(__entry->s_dev),
+ MINOR(__entry->s_dev), __entry->i_ino, __entry->offset)
+);
+
+TRACE_EVENT(remove_from_page_cache,
+
+ TP_PROTO(struct address_space *mapping, pgoff_t offset),
+
+ TP_ARGS(mapping, offset),
+
+ TP_STRUCT__entry(
+ __field(dev_t, s_dev)
+ __field(ino_t, i_ino)
+ __field(pgoff_t, offset)
+ ),
+
+ TP_fast_assign(
+ __entry->s_dev = mapping->host->i_sb->s_dev;
+ __entry->i_ino = mapping->host->i_ino;
+ __entry->offset = offset;
+ ),
+
+ TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu", MAJOR(__entry->s_dev),
+ MINOR(__entry->s_dev), __entry->i_ino, __entry->offset)
+);
+
+#endif /* _TRACE_FILEMAP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
Index: linux-2.6-tip/mm/filemap.c
===================================================================
--- linux-2.6-tip.orig/mm/filemap.c
+++ linux-2.6-tip/mm/filemap.c
@@ -34,6 +34,8 @@
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
+#define CREATE_TRACE_POINTS
+#include <trace/events/filemap.h>
#include "internal.h"
/*
@@ -149,6 +151,7 @@ void remove_from_page_cache(struct page
spin_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
+ trace_remove_from_page_cache(mapping, page->index);
mem_cgroup_uncharge_cache_page(page);
}
@@ -440,6 +443,7 @@ int add_to_page_cache_locked(struct page
if (PageSwapBacked(page))
__inc_zone_page_state(page, NR_SHMEM);
spin_unlock_irq(&mapping->tree_lock);
+ trace_add_to_page_cache(mapping, offset);
} else {
page->mapping = NULL;
spin_unlock_irq(&mapping->tree_lock);
@@ -663,6 +667,7 @@ repeat:
}
rcu_read_unlock();
+ trace_find_get_page(mapping, offset, page);
return page;
}
EXPORT_SYMBOL(find_get_page);
Index: linux-2.6-tip/mm/truncate.c
===================================================================
--- linux-2.6-tip.orig/mm/truncate.c
+++ linux-2.6-tip/mm/truncate.c
@@ -20,6 +20,7 @@
do_invalidatepage */
#include "internal.h"
+#include <trace/events/filemap.h>
/**
* do_invalidatepage - invalidate part or all of a page
@@ -384,6 +385,7 @@ invalidate_complete_page2(struct address
BUG_ON(page_has_private(page));
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
+ trace_remove_from_page_cache(mapping, page->index);
mem_cgroup_uncharge_cache_page(page);
page_cache_release(page); /* pagecache ref */
return 1;
Index: linux-2.6-tip/mm/vmscan.c
===================================================================
--- linux-2.6-tip.orig/mm/vmscan.c
+++ linux-2.6-tip/mm/vmscan.c
@@ -48,6 +48,8 @@
#include "internal.h"
+#include <trace/events/filemap.h>
+
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -478,6 +480,7 @@ static int __remove_mapping(struct addre
} else {
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
+ trace_remove_from_page_cache(mapping, page->index);
mem_cgroup_uncharge_cache_page(page);
}