2010-08-19 20:58:49

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 0/3] writeback: kernel visibility

Patch #1 sets up some helper functions for accounting.

Patch #2 adds writeback visibility in /proc/sys/vm.

To help developers and applications gain visibility into writeback
behaviour adding two read-only sysctl files into /proc/sys/vm.
These files allow user apps to understand writeback behaviour over time
and learn how it is impacting their performance.

# cat /proc/sys/vm/pages_dirtied
3747
# cat /proc/sys/vm/pages_entered_writeback
3618

These two new files are necessary to give visibility into writeback
behaviour. We have /proc/diskstats which lets us understand the io in
the block layer. We have blktrace for more in depth understanding. We have
e2fsprogs and debugsfs to give insight into the file systems behaviour,
but we don't offer our users the ability understand what writeback is
doing. There is no way to know how active it is over the whole system,
if it's falling behind or to quantify it's efforts. With these values
exported users can easily see how much data applications are sending
through writeback and also at what rates writeback is processing this
data. Comparing the rates of change between the two allow developers
to see when writeback is not able to keep up with incoming traffic and
the rate of dirty memory being sent to the IO back end. This allows
folks to understand their io workloads and track kernel issues. Non
kernel engineers at Google often use these counters to solve puzzling
performance problems.

There is a strong case for putting these files in /proc/sys/vm.
It was Christoph's suggestion that made me realize that /proc/sys/vm is
their proper home. Most if not all the tunables for writeback are
there. When one is trying to find the state of the system's writeback
activity that's the directory. In addition while the internals of
writeback will probably change the use of the dirty and writeback states
and pages will be around for a long time. Exposing them in /proc/sys/vm
should be acceptable.

Having these pages_dirtied and pages_entered_writeback in /proc/vmstat
to me feels like a way to make sure that users who would need them won't
find them unless they are reading source. And these are folks who aren't
reading source.

Patch #3 adds dirty thresholds to /proc/vmstat.

# grep threshold /proc/vmstat
nr_pages_dirty_threshold 409111
nr_pages_dirty_background_threshold 818223

The files that report the dirty thresholds belong in /proc/vmstat. They
are meant for application writers so should not be in debugfs. But since
they are more related to internals of writeback, albeit internals that
are fundamental to how it works, /proc/sys/vm is not appropriate.

Michael Rubin (3):
mm: helper functions for dirty and writeback accounting
writeback: Adding pages_dirtied and pages_entered_writeback
writeback: Reporting dirty thresholds in /proc/vmstat

Documentation/sysctl/vm.txt | 20 +++++++++++++++---
drivers/base/node.c | 14 +++++++++++++
fs/ceph/addr.c | 8 +-----
fs/nilfs2/segment.c | 2 +-
include/linux/mm.h | 1 +
include/linux/mmzone.h | 4 +++
include/linux/writeback.h | 9 ++++++++
kernel/sysctl.c | 14 +++++++++++++
mm/page-writeback.c | 45 ++++++++++++++++++++++++++++++++++++++++--
mm/vmstat.c | 10 +++++++++
10 files changed, 113 insertions(+), 14 deletions(-)


2010-08-19 20:58:19

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 1/3] mm: helper functions for dirty and writeback accounting

Exporting account_pages_dirty and adding a symmetric routine
account_pages_writeback.

This allows code outside of the mm core to safely manipulate page state
and not worry about the other accounting. Not using these routines means
that some code will lose track of the accounting and we get bugs. This
has happened once already.

Signed-off-by: Michael Rubin <[email protected]>
---
fs/ceph/addr.c | 8 ++------
fs/nilfs2/segment.c | 2 +-
include/linux/mm.h | 1 +
mm/page-writeback.c | 15 +++++++++++++++
4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index d9c60b8..359aa3a 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -106,12 +106,8 @@ static int ceph_set_page_dirty(struct page *page)
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(!PageUptodate(page));

- if (mapping_cap_account_dirty(mapping)) {
- __inc_zone_page_state(page, NR_FILE_DIRTY);
- __inc_bdi_stat(mapping->backing_dev_info,
- BDI_RECLAIMABLE);
- task_io_account_write(PAGE_CACHE_SIZE);
- }
+ if (mapping_cap_account_dirty(mapping))
+ account_page_dirtied(page, page->mapping);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);

diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index c920164..967ed7d 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1599,7 +1599,7 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
kunmap_atomic(kaddr, KM_USER0);

if (!TestSetPageWriteback(clone_page))
- inc_zone_page_state(clone_page, NR_WRITEBACK);
+ account_page_writeback(clone_page, page_mapping(clone_page));
unlock_page(clone_page);

return 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a2b4804..b138392 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -855,6 +855,7 @@ int __set_page_dirty_no_writeback(struct page *page);
int redirty_page_for_writepage(struct writeback_control *wbc,
struct page *page);
void account_page_dirtied(struct page *page, struct address_space *mapping);
+void account_page_writeback(struct page *page, struct address_space *mapping);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
int clear_page_dirty_for_io(struct page *page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 37498ef..b8e7b3b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1096,6 +1096,21 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
task_io_account_write(PAGE_CACHE_SIZE);
}
}
+EXPORT_SYMBOL(account_page_dirtied);
+
+/*
+ * Helper function for set_page_writeback family.
+ * NOTE: Unlike account_page_dirtied this does not rely on being atomic
+ * wrt interrupts.
+ */
+
+void account_page_writeback(struct page *page, struct address_space *mapping)
+{
+ if (mapping_cap_account_dirty(mapping))
+ inc_zone_page_state(page, NR_WRITEBACK);
+}
+EXPORT_SYMBOL(account_page_writeback);
+

/*
* For address_spaces which do not use buffers. Just tag the page as dirty in
--
1.7.1

2010-08-19 20:58:16

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 2/3] writeback: Adding pages_dirtied and pages_entered_writeback

To help developers and applications gain visibility into writeback
behaviour adding four read only sysctl files into /proc/sys/vm.
These files allow user apps to understand writeback behaviour over time
and learn how it is impacting their performance.

# cat /proc/sys/vm/pages_dirtied
3747
# cat /proc/sys/vm/pages_entered_writeback
3618

Documentation/vm.txt has been updated.

In order to track the "cleaned" and "dirtied" counts we added two
vm_stat_items. Per memory node stats have been added also. So we can
see per node granularity:

# cat /sys/devices/system/node/node20/writebackstat
Node 20 pages_writeback: 0 times
Node 20 pages_dirtied: 0 times

Signed-off-by: Michael Rubin <[email protected]>
---
Documentation/sysctl/vm.txt | 20 ++++++++++++++++----
drivers/base/node.c | 14 ++++++++++++++
include/linux/mmzone.h | 2 ++
include/linux/writeback.h | 9 +++++++++
kernel/sysctl.c | 14 ++++++++++++++
mm/page-writeback.c | 36 ++++++++++++++++++++++++++++++------
mm/vmstat.c | 2 ++
7 files changed, 87 insertions(+), 10 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 5fdbb61..de9ec6a 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -50,6 +50,8 @@ Currently, these files are in /proc/sys/vm:
- overcommit_memory
- overcommit_ratio
- page-cluster
+- pages_dirtied
+- pages_entered_writeback
- panic_on_oom
- percpu_pagelist_fraction
- stat_interval
@@ -425,10 +427,7 @@ See Documentation/vm/hugetlbpage.txt
nr_pdflush_threads

The current number of pdflush threads. This value is read-only.
-The value changes according to the number of dirty pages in the system.
-
-When necessary, additional pdflush threads are created, one per second, up to
-nr_pdflush_threads_max.
+This value is obsolete.

==============================================================

@@ -582,6 +581,19 @@ swap-intensive.

=============================================================

+pages_dirtied
+
+Number of pages that have ever been dirtied since boot.
+This value is read-only.
+
+=============================================================
+
+pages_entered_writeback
+
+Number of pages that have been moved from dirty to writeback since boot.
+This is only a count of file pages. This value is read-only.
+
+=============================================================
panic_on_oom

This enables or disables panic on out-of-memory feature.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 2bdd8a9..b321d32 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -160,6 +160,18 @@ static ssize_t node_read_numastat(struct sys_device * dev,
}
static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);

+static ssize_t node_read_writebackstat(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ int nid = dev->id;
+ return sprintf(buf,
+ "Node %d pages_writeback: %lu times\n"
+ "Node %d pages_dirtied: %lu times\n",
+ nid, node_page_state(nid, NR_PAGES_ENTERED_WRITEBACK),
+ nid, node_page_state(nid, NR_FILE_PAGES_DIRTIED));
+}
+static SYSDEV_ATTR(writebackstat, S_IRUGO, node_read_writebackstat, NULL);
+
static ssize_t node_read_distance(struct sys_device * dev,
struct sysdev_attribute *attr, char * buf)
{
@@ -243,6 +255,7 @@ int register_node(struct node *node, int num, struct node *parent)
sysdev_create_file(&node->sysdev, &attr_meminfo);
sysdev_create_file(&node->sysdev, &attr_numastat);
sysdev_create_file(&node->sysdev, &attr_distance);
+ sysdev_create_file(&node->sysdev, &attr_writebackstat);

scan_unevictable_register_node(node);

@@ -267,6 +280,7 @@ void unregister_node(struct node *node)
sysdev_remove_file(&node->sysdev, &attr_meminfo);
sysdev_remove_file(&node->sysdev, &attr_numastat);
sysdev_remove_file(&node->sysdev, &attr_distance);
+ sysdev_remove_file(&node->sysdev, &attr_writebackstat);

scan_unevictable_unregister_node(node);
hugetlb_unregister_node(node); /* no-op, if memoryless node */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..f160481 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -112,6 +112,8 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
#endif
+ NR_PAGES_ENTERED_WRITEBACK, /* number of times pages enter writeback */
+ NR_FILE_PAGES_DIRTIED, /* number of times pages get dirtied */
NR_VM_ZONE_STAT_ITEMS };

/*
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index c24eca7..2d47afb 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -99,6 +99,8 @@ extern int dirty_background_ratio;
extern unsigned long dirty_background_bytes;
extern int vm_dirty_ratio;
extern unsigned long vm_dirty_bytes;
+extern unsigned long vm_pages_dirtied;
+extern unsigned long vm_pages_entered_writeback;
extern unsigned int dirty_writeback_interval;
extern unsigned int dirty_expire_interval;
extern int vm_highmem_is_dirtyable;
@@ -120,6 +122,13 @@ extern int dirty_bytes_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);

+extern int pages_dirtied_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+extern int pages_entered_writeback_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+
struct ctl_table;
int dirty_writeback_centisecs_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d24f761..33c3589 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1053,6 +1053,20 @@ static struct ctl_table vm_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "pages_dirtied",
+ .data = &vm_pages_dirtied,
+ .maxlen = sizeof(vm_pages_dirtied),
+ .mode = 0444 /* read-only */,
+ .proc_handler = pages_dirtied_handler,
+ },
+ {
+ .procname = "pages_entered_writeback",
+ .data = &vm_pages_entered_writeback,
+ .maxlen = sizeof(vm_pages_entered_writeback),
+ .mode = 0444 /* read-only */,
+ .proc_handler = pages_entered_writeback_handler,
+ },
+ {
.procname = "nr_pdflush_threads",
.data = &nr_pdflush_threads,
.maxlen = sizeof nr_pdflush_threads,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b8e7b3b..4ed5dec 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -95,6 +95,14 @@ unsigned int dirty_writeback_interval = 5 * 100; /* centiseconds */
*/
unsigned int dirty_expire_interval = 30 * 100; /* centiseconds */

+
+/*
+ * Number of pages dirtied and entered writeback state
+ */
+
+unsigned long vm_pages_dirtied;
+unsigned long vm_pages_entered_writeback;
+
/*
* Flag that makes the machine dump writes/reads and block dirtyings.
*/
@@ -196,7 +204,6 @@ int dirty_ratio_handler(struct ctl_table *table, int write,
return ret;
}

-
int dirty_bytes_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -212,6 +219,23 @@ int dirty_bytes_handler(struct ctl_table *table, int write,
return ret;
}

+int pages_dirtied_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ vm_pages_dirtied = global_page_state(NR_FILE_PAGES_DIRTIED);
+ return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
+}
+
+int pages_entered_writeback_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ vm_pages_entered_writeback =
+ global_page_state(NR_PAGES_ENTERED_WRITEBACK);
+ return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
+}
+
/*
* Increment the BDI's writeout completion count and the global writeout
* completion count. Called from test_clear_page_writeback().
@@ -1091,6 +1115,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
{
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+ __inc_zone_page_state(page, NR_FILE_PAGES_DIRTIED);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
task_dirty_inc(current);
task_io_account_write(PAGE_CACHE_SIZE);
@@ -1103,15 +1128,15 @@ EXPORT_SYMBOL(account_page_dirtied);
* NOTE: Unlike account_page_dirtied this does not rely on being atomic
* wrt interrupts.
*/
-
void account_page_writeback(struct page *page, struct address_space *mapping)
{
- if (mapping_cap_account_dirty(mapping))
+ if (mapping_cap_account_dirty(mapping)) {
inc_zone_page_state(page, NR_WRITEBACK);
+ inc_zone_page_state(page, NR_PAGES_ENTERED_WRITEBACK);
+ }
}
EXPORT_SYMBOL(account_page_writeback);

-
/*
* For address_spaces which do not use buffers. Just tag the page as dirty in
* its radix tree.
@@ -1347,9 +1372,8 @@ int test_set_page_writeback(struct page *page)
ret = TestSetPageWriteback(page);
}
if (!ret)
- inc_zone_page_state(page, NR_WRITEBACK);
+ account_page_writeback(page, mapping);
return ret;
-
}
EXPORT_SYMBOL(test_set_page_writeback);

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..e177a40 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -740,6 +740,8 @@ static const char * const vmstat_text[] = {
"numa_local",
"numa_other",
#endif
+ "nr_pages_entered_writeback",
+ "nr_file_pages_dirtied",

#ifdef CONFIG_VM_EVENT_COUNTERS
"pgpgin",
--
1.7.1

2010-08-19 20:58:33

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 3/3] writeback: Reporting dirty thresholds in /proc/vmstat

The kernel already exposes the desired thresholds in /proc/sys/vm with
dirty_background_ratio and background_ratio. Instead the kernel may
alter the number requested without giving the user any indication that
is the case.

Knowing the actual ratios the kernel is honoring can help app developers
understand how their buffered IO will be sent to the disk.

$ grep threshold /proc/vmstat
nr_pages_dirty_threshold 409111
nr_pages_dirty_background_threshold 818223

Signed-off-by: Michael Rubin <[email protected]>
---
include/linux/mmzone.h | 2 ++
mm/vmstat.c | 8 ++++++++
2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f160481..7c4a3bf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -114,6 +114,8 @@ enum zone_stat_item {
#endif
NR_PAGES_ENTERED_WRITEBACK, /* number of times pages enter writeback */
NR_FILE_PAGES_DIRTIED, /* number of times pages get dirtied */
+ NR_PAGES_DIRTY_THRESHOLD, /* writeback threshold */
+ NR_PAGES_DIRTY_BG_THRESHOLD,/* bg writeback threshold */
NR_VM_ZONE_STAT_ITEMS };

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e177a40..8b5bc78 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -17,6 +17,7 @@
#include <linux/vmstat.h>
#include <linux/sched.h>
#include <linux/math64.h>
+#include <linux/writeback.h>

#ifdef CONFIG_VM_EVENT_COUNTERS
DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -742,6 +743,8 @@ static const char * const vmstat_text[] = {
#endif
"nr_pages_entered_writeback",
"nr_file_pages_dirtied",
+ "nr_pages_dirty_threshold",
+ "nr_pages_dirty_background_threshold",

#ifdef CONFIG_VM_EVENT_COUNTERS
"pgpgin",
@@ -901,6 +904,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
#ifdef CONFIG_VM_EVENT_COUNTERS
unsigned long *e;
#endif
+ unsigned long dirty_thresh, dirty_bg_thresh;
int i;

if (*pos >= ARRAY_SIZE(vmstat_text))
@@ -918,6 +922,10 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
return ERR_PTR(-ENOMEM);
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
v[i] = global_page_state(i);
+
+ get_dirty_limits(&dirty_thresh, &dirty_bg_thresh, NULL, NULL);
+ v[NR_PAGES_DIRTY_THRESHOLD] = dirty_thresh;
+ v[NR_PAGES_DIRTY_BG_THRESHOLD] = dirty_bg_thresh;
#ifdef CONFIG_VM_EVENT_COUNTERS
e = v + NR_VM_ZONE_STAT_ITEMS;
all_vm_events(e);
--
1.7.1

2010-08-19 22:12:16

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 0/3] writeback: kernel visibility

On 08/19/2010 04:57 PM, Michael Rubin wrote:
> Patch #1 sets up some helper functions for accounting.
>
> Patch #2 adds writeback visibility in /proc/sys/vm.
>
> To help developers and applications gain visibility into writeback
> behaviour adding two read-only sysctl files into /proc/sys/vm.
> These files allow user apps to understand writeback behaviour over time
> and learn how it is impacting their performance.
>
> # cat /proc/sys/vm/pages_dirtied
> 3747
> # cat /proc/sys/vm/pages_entered_writeback
> 3618

Would it be better to have these values in /proc/vmstat
and /proc/zoneinfo ?

I don't really see why they need to be in /proc/sys
at all...

--
All rights reversed

2010-08-20 02:34:26

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm: helper functions for dirty and writeback accounting

On Thu, Aug 19, 2010 at 01:57:25PM -0700, Michael Rubin wrote:
> Exporting account_pages_dirty and adding a symmetric routine
> account_pages_writeback.

s/account_pages_writeback/account_page_writeback/

I'd recommend to separate the changes into two patches.
It's actually a bug fix to export account_pages_dirty() for ceph,
which should be a good candidate for 2.6.36.

> This allows code outside of the mm core to safely manipulate page state
> and not worry about the other accounting. Not using these routines means
> that some code will lose track of the accounting and we get bugs. This
> has happened once already.
>
> Signed-off-by: Michael Rubin <[email protected]>
> ---
> fs/ceph/addr.c | 8 ++------
> fs/nilfs2/segment.c | 2 +-
> include/linux/mm.h | 1 +
> mm/page-writeback.c | 15 +++++++++++++++
> 4 files changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index d9c60b8..359aa3a 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -106,12 +106,8 @@ static int ceph_set_page_dirty(struct page *page)
> if (page->mapping) { /* Race with truncate? */
> WARN_ON_ONCE(!PageUptodate(page));
>
> - if (mapping_cap_account_dirty(mapping)) {
> - __inc_zone_page_state(page, NR_FILE_DIRTY);
> - __inc_bdi_stat(mapping->backing_dev_info,
> - BDI_RECLAIMABLE);
> - task_io_account_write(PAGE_CACHE_SIZE);
> - }
> + if (mapping_cap_account_dirty(mapping))

That 'if' is not necessary. account_page_dirtied() already has one.
The extra 'if' is not an optimization either, because the ceph fs is
not likely to have un-accountable mappings.

> + account_page_dirtied(page, page->mapping);
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
>
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index c920164..967ed7d 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1599,7 +1599,7 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
> kunmap_atomic(kaddr, KM_USER0);
>
> if (!TestSetPageWriteback(clone_page))
> - inc_zone_page_state(clone_page, NR_WRITEBACK);
> + account_page_writeback(clone_page, page_mapping(clone_page));
> unlock_page(clone_page);
>
> return 0;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2b4804..b138392 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -855,6 +855,7 @@ int __set_page_dirty_no_writeback(struct page *page);
> int redirty_page_for_writepage(struct writeback_control *wbc,
> struct page *page);
> void account_page_dirtied(struct page *page, struct address_space *mapping);
> +void account_page_writeback(struct page *page, struct address_space *mapping);
> int set_page_dirty(struct page *page);
> int set_page_dirty_lock(struct page *page);
> int clear_page_dirty_for_io(struct page *page);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 37498ef..b8e7b3b 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1096,6 +1096,21 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
> task_io_account_write(PAGE_CACHE_SIZE);
> }
> }
> +EXPORT_SYMBOL(account_page_dirtied);
> +
> +/*
> + * Helper function for set_page_writeback family.
> + * NOTE: Unlike account_page_dirtied this does not rely on being atomic
> + * wrt interrupts.
> + */
> +
> +void account_page_writeback(struct page *page, struct address_space *mapping)
> +{
> + if (mapping_cap_account_dirty(mapping))

The 'if' test and *mapping parameter looks unnecessary at least for
now. The only place a mapping has BDI_CAP_NO_ACCT_WB but not
BDI_CAP_NO_WRITEBACK is fuse, which does its own accounting.

Thanks,
Fengguang

> + inc_zone_page_state(page, NR_WRITEBACK);
> +}
> +EXPORT_SYMBOL(account_page_writeback);
> +
>
> /*
> * For address_spaces which do not use buffers. Just tag the page as dirty in
> --
> 1.7.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2010-08-20 02:51:18

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Adding pages_dirtied and pages_entered_writeback

On Thu, Aug 19, 2010 at 01:57:26PM -0700, Michael Rubin wrote:
> To help developers and applications gain visibility into writeback
> behaviour adding four read only sysctl files into /proc/sys/vm.
> These files allow user apps to understand writeback behaviour over time
> and learn how it is impacting their performance.
>
> # cat /proc/sys/vm/pages_dirtied
> 3747
> # cat /proc/sys/vm/pages_entered_writeback
> 3618

As Rik said, /proc/sys is not a suitable place.

Frankly speaking I've worked on writeback for years and never felt
the need to add these counters. What I often do is:

$ vmmon -d 1 nr_writeback nr_dirty nr_unstable

nr_writeback nr_dirty nr_unstable
68738 0 39568
66051 0 42255
63406 0 44900
60643 0 47663
57954 0 50352
55264 0 53042
52592 0 55715
49922 0 58385
That is what I get when copying /dev/zero to NFS.

You can find vmmon.c in Andrew Morton's ext3-tools package.
Also attached for your convenience.

I'm very interested in Google's use case for this patch, and why
the simple /proc/vmstat based vmmon tool is not enough.

Thanks,
Fengguang

2010-08-20 03:16:54

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: Reporting dirty thresholds in /proc/vmstat

On Thu, Aug 19, 2010 at 01:57:27PM -0700, Michael Rubin wrote:
> The kernel already exposes the desired thresholds in /proc/sys/vm with
> dirty_background_ratio and background_ratio. Instead the kernel may
> alter the number requested without giving the user any indication that
> is the case.

You mean the 5% lower bound in global_dirty_limits()? Let's rip it :)

> Knowing the actual ratios the kernel is honoring can help app developers
> understand how their buffered IO will be sent to the disk.
>
> $ grep threshold /proc/vmstat
> nr_pages_dirty_threshold 409111
> nr_pages_dirty_background_threshold 818223

It's redundant to have _pages in the names. /proc/vmstat has the
tradition to use nr_dirty instead of nr_pages_dirty.

They do look like useful counters to export, especially when we do
dynamic dirty limits in future.

> Signed-off-by: Michael Rubin <[email protected]>
> ---
> include/linux/mmzone.h | 2 ++
> mm/vmstat.c | 8 ++++++++
> 2 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f160481..7c4a3bf 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -114,6 +114,8 @@ enum zone_stat_item {
> #endif
> NR_PAGES_ENTERED_WRITEBACK, /* number of times pages enter writeback */
> NR_FILE_PAGES_DIRTIED, /* number of times pages get dirtied */
> + NR_PAGES_DIRTY_THRESHOLD, /* writeback threshold */
> + NR_PAGES_DIRTY_BG_THRESHOLD,/* bg writeback threshold */

s/_PAGES//

> NR_VM_ZONE_STAT_ITEMS };
>
> /*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index e177a40..8b5bc78 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -17,6 +17,7 @@
> #include <linux/vmstat.h>
> #include <linux/sched.h>
> #include <linux/math64.h>
> +#include <linux/writeback.h>
>
> #ifdef CONFIG_VM_EVENT_COUNTERS
> DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
> @@ -742,6 +743,8 @@ static const char * const vmstat_text[] = {
> #endif
> "nr_pages_entered_writeback",
> "nr_file_pages_dirtied",
> + "nr_pages_dirty_threshold",
> + "nr_pages_dirty_background_threshold",

s/_pages//

> #ifdef CONFIG_VM_EVENT_COUNTERS
> "pgpgin",
> @@ -901,6 +904,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
> #ifdef CONFIG_VM_EVENT_COUNTERS
> unsigned long *e;
> #endif
> + unsigned long dirty_thresh, dirty_bg_thresh;
> int i;
>
> if (*pos >= ARRAY_SIZE(vmstat_text))
> @@ -918,6 +922,10 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
> return ERR_PTR(-ENOMEM);
> for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> v[i] = global_page_state(i);
> +
> + get_dirty_limits(&dirty_thresh, &dirty_bg_thresh, NULL, NULL);

2.6.36-rc1 will need this:

global_dirty_limits(v + NR_DIRTY_THRESHOLD, v + NR_DIRTY_BG_THRESHOLD);

Thanks,
Fengguang

> + v[NR_PAGES_DIRTY_THRESHOLD] = dirty_thresh;
> + v[NR_PAGES_DIRTY_BG_THRESHOLD] = dirty_bg_thresh;
> #ifdef CONFIG_VM_EVENT_COUNTERS
> e = v + NR_VM_ZONE_STAT_ITEMS;
> all_vm_events(e);
> --
> 1.7.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2010-08-20 06:54:37

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Adding pages_dirtied and pages_entered_writeback

Sorry, here is the mentioned vmmon source file.

The full ext3-tools package is here:

http://userweb.kernel.org/~akpm/stuff/ext3-tools.tar.gz

Thanks,
Fengguang

On Fri, Aug 20, 2010 at 10:51:11AM +0800, Wu Fengguang wrote:
> On Thu, Aug 19, 2010 at 01:57:26PM -0700, Michael Rubin wrote:
> > To help developers and applications gain visibility into writeback
> > behaviour adding four read only sysctl files into /proc/sys/vm.
> > These files allow user apps to understand writeback behaviour over time
> > and learn how it is impacting their performance.
> >
> > # cat /proc/sys/vm/pages_dirtied
> > 3747
> > # cat /proc/sys/vm/pages_entered_writeback
> > 3618
>
> As Rik said, /proc/sys is not a suitable place.
>
> Frankly speaking I've worked on writeback for years and never felt
> the need to add these counters. What I often do is:
>
> $ vmmon -d 1 nr_writeback nr_dirty nr_unstable
>
> nr_writeback nr_dirty nr_unstable
> 68738 0 39568
> 66051 0 42255
> 63406 0 44900
> 60643 0 47663
> 57954 0 50352
> 55264 0 53042
> 52592 0 55715
> 49922 0 58385
> That is what I get when copying /dev/zero to NFS.
>
> You can find vmmon.c in Andrew Morton's ext3-tools package.
> Also attached for your convenience.
>
> I'm very interested in Google's use case for this patch, and why
> the simple /proc/vmstat based vmmon tool is not enough.
>
> Thanks,
> Fengguang

2010-08-20 08:16:39

by Michael Rubin

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Adding pages_dirtied and pages_entered_writeback

On Thu, Aug 19, 2010 at 7:51 PM, Wu Fengguang <[email protected]> wrote:
> As Rik said, /proc/sys is not a suitable place.

OK I'm convinced.

> Frankly speaking I've worked on writeback for years and never felt
> the need to add these counters. What I often do is:
>
> $ vmmon -d 1 nr_writeback nr_dirty nr_unstable
>
> ? ? nr_writeback ? ? ? ? nr_dirty ? ? ?nr_unstable
> ? ? ? ? ? ?68738 ? ? ? ? ? ? ? ?0 ? ? ? ? ? ?39568
> ? ? ? ? ? ?66051 ? ? ? ? ? ? ? ?0 ? ? ? ? ? ?42255
> ? ? ? ? ? ?63406 ? ? ? ? ? ? ? ?0 ? ? ? ? ? ?44900
> ? ? ? ? ? ?60643 ? ? ? ? ? ? ? ?0 ? ? ? ? ? ?47663
> ? ? ? ? ? ?57954 ? ? ? ? ? ? ? ?0 ? ? ? ? ? ?50352
> ? ? ? ? ? ?55264 ? ? ? ? ? ? ? ?0 ? ? ? ? ? ?53042
> ? ? ? ? ? ?52592 ? ? ? ? ? ? ? ?0 ? ? ? ? ? ?55715
> ? ? ? ? ? ?49922 ? ? ? ? ? ? ? ?0 ? ? ? ? ? ?58385
> That is what I get when copying /dev/zero to NFS.
>
> I'm very interested in Google's use case for this patch, and why
> the simple /proc/vmstat based vmmon tool is not enough.

So as I understand it from looking at the code vmmon is sampling
nr_writeback, nr_dirty which are exported versions of
global_page_state for NR_FILE_DIRTY and NR_WRITEBACK. These states are
a snapshot of the state of the kernel's pages. Namely how many dpages
ar ein writeback or dirty at the moment vmmon's acquire routine is
called.

vmmon is sampling /proc/vstat and then displaying the difference from
the last time they sampled. If I am misunderstanding let me know.

This is good for the state of the system but as we compare
application, mm and io performance over long periods of time we are
interested in the surges and fluctuations of the rates of the
producing and consuming of dirty pages also. It can help isolate where
the problem is and also to compare performance between kernels and/or
applications.

mrubin

2010-08-20 08:18:42

by Michael Rubin

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: Reporting dirty thresholds in /proc/vmstat

Thank you for your quick reply and comments.

On Thu, Aug 19, 2010 at 8:16 PM, Wu Fengguang <[email protected]> wrote:
> On Thu, Aug 19, 2010 at 01:57:27PM -0700, Michael Rubin wrote:
>> The kernel already exposes the desired thresholds in /proc/sys/vm with
>> dirty_background_ratio and background_ratio. Instead the kernel may
>> alter the number requested without giving the user any indication that
>> is the case.
>
> You mean the 5% lower bound in global_dirty_limits()? Let's rip it :)
>
>> Knowing the actual ratios the kernel is honoring can help app developers
>> understand how their buffered IO will be sent to the disk.
>>
>> ? ? ? $ grep threshold /proc/vmstat
>> ? ? ? nr_pages_dirty_threshold 409111
>> ? ? ? nr_pages_dirty_background_threshold 818223
>
> It's redundant to have _pages in the names. /proc/vmstat has the
> tradition to use nr_dirty instead of nr_pages_dirty.
>
> They do look like useful counters to export, especially when we do
> dynamic dirty limits in future.
>
>> Signed-off-by: Michael Rubin <[email protected]>
>> ---
>> ?include/linux/mmzone.h | ? ?2 ++
>> ?mm/vmstat.c ? ? ? ? ? ?| ? ?8 ++++++++
>> ?2 files changed, 10 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index f160481..7c4a3bf 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -114,6 +114,8 @@ enum zone_stat_item {
>> ?#endif
>> ? ? ? NR_PAGES_ENTERED_WRITEBACK, /* number of times pages enter writeback */
>> ? ? ? NR_FILE_PAGES_DIRTIED, ? ? ?/* number of times pages get dirtied */
>> + ? ? NR_PAGES_DIRTY_THRESHOLD, ? /* writeback threshold */
>> + ? ? NR_PAGES_DIRTY_BG_THRESHOLD,/* bg writeback threshold */
>
> s/_PAGES//

Cool. Thanks.

>
>> ? ? ? NR_VM_ZONE_STAT_ITEMS };
>>
>> ?/*
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index e177a40..8b5bc78 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -17,6 +17,7 @@
>> ?#include <linux/vmstat.h>
>> ?#include <linux/sched.h>
>> ?#include <linux/math64.h>
>> +#include <linux/writeback.h>
>>
>> ?#ifdef CONFIG_VM_EVENT_COUNTERS
>> ?DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
>> @@ -742,6 +743,8 @@ static const char * const vmstat_text[] = {
>> ?#endif
>> ? ? ? "nr_pages_entered_writeback",
>> ? ? ? "nr_file_pages_dirtied",
>> + ? ? "nr_pages_dirty_threshold",
>> + ? ? "nr_pages_dirty_background_threshold",
>
> s/_pages//

Got it.

>
>> ?#ifdef CONFIG_VM_EVENT_COUNTERS
>> ? ? ? "pgpgin",
>> @@ -901,6 +904,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>> ?#ifdef CONFIG_VM_EVENT_COUNTERS
>> ? ? ? unsigned long *e;
>> ?#endif
>> + ? ? unsigned long dirty_thresh, dirty_bg_thresh;
>> ? ? ? int i;
>>
>> ? ? ? if (*pos >= ARRAY_SIZE(vmstat_text))
>> @@ -918,6 +922,10 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>> ? ? ? ? ? ? ? return ERR_PTR(-ENOMEM);
>> ? ? ? for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>> ? ? ? ? ? ? ? v[i] = global_page_state(i);
>> +
>> + ? ? get_dirty_limits(&dirty_thresh, &dirty_bg_thresh, NULL, NULL);
>
> 2.6.36-rc1 will need this:
>
> ? ? ? ?global_dirty_limits(v + NR_DIRTY_THRESHOLD, v + NR_DIRTY_BG_THRESHOLD);

Yeah I noticed when I rebased. Thanks.

mrubin

2010-08-20 08:19:59

by Michael Rubin

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm: helper functions for dirty and writeback accounting

On Thu, Aug 19, 2010 at 7:34 PM, Wu Fengguang <[email protected]> wrote:
> On Thu, Aug 19, 2010 at 01:57:25PM -0700, Michael Rubin wrote:
>> Exporting account_pages_dirty and adding a symmetric routine
>> account_pages_writeback.
>
> s/account_pages_writeback/account_page_writeback/

Got it. thanks.

> I'd recommend to separate the changes into two patches.
> It's actually a bug fix to export account_pages_dirty() for ceph,
> which should be a good candidate for 2.6.36.

Good idea.

>> This allows code outside of the mm core to safely manipulate page state
>> and not worry about the other accounting. Not using these routines means
>> that some code will lose track of the accounting and we get bugs. This
>> has happened once already.
>>
>> Signed-off-by: Michael Rubin <[email protected]>
>> ---
>> ?fs/ceph/addr.c ? ? ?| ? ?8 ++------
>> ?fs/nilfs2/segment.c | ? ?2 +-
>> ?include/linux/mm.h ?| ? ?1 +
>> ?mm/page-writeback.c | ? 15 +++++++++++++++
>> ?4 files changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index d9c60b8..359aa3a 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -106,12 +106,8 @@ static int ceph_set_page_dirty(struct page *page)
>> ? ? ? if (page->mapping) { ? ?/* Race with truncate? */
>> ? ? ? ? ? ? ? WARN_ON_ONCE(!PageUptodate(page));
>>
>> - ? ? ? ? ? ? if (mapping_cap_account_dirty(mapping)) {
>> - ? ? ? ? ? ? ? ? ? ? __inc_zone_page_state(page, NR_FILE_DIRTY);
>> - ? ? ? ? ? ? ? ? ? ? __inc_bdi_stat(mapping->backing_dev_info,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? BDI_RECLAIMABLE);
>> - ? ? ? ? ? ? ? ? ? ? task_io_account_write(PAGE_CACHE_SIZE);
>> - ? ? ? ? ? ? }
>> + ? ? ? ? ? ? if (mapping_cap_account_dirty(mapping))
>
> That 'if' is not necessary. account_page_dirtied() already has one.
> The extra 'if' is not an optimization either, because the ceph fs is
> not likely to have un-accountable mappings.

Sweet. Thanks.
>
>> + ? ? ? ? ? ? ? ? ? ? account_page_dirtied(page, page->mapping);
>> ? ? ? ? ? ? ? radix_tree_tag_set(&mapping->page_tree,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? page_index(page), PAGECACHE_TAG_DIRTY);
>>
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index c920164..967ed7d 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -1599,7 +1599,7 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>> ? ? ? kunmap_atomic(kaddr, KM_USER0);
>>
>> ? ? ? if (!TestSetPageWriteback(clone_page))
>> - ? ? ? ? ? ? inc_zone_page_state(clone_page, NR_WRITEBACK);
>> + ? ? ? ? ? ? account_page_writeback(clone_page, page_mapping(clone_page));
>> ? ? ? unlock_page(clone_page);
>>
>> ? ? ? return 0;
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a2b4804..b138392 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -855,6 +855,7 @@ int __set_page_dirty_no_writeback(struct page *page);
>> ?int redirty_page_for_writepage(struct writeback_control *wbc,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct page *page);
>> ?void account_page_dirtied(struct page *page, struct address_space *mapping);
>> +void account_page_writeback(struct page *page, struct address_space *mapping);
>> ?int set_page_dirty(struct page *page);
>> ?int set_page_dirty_lock(struct page *page);
>> ?int clear_page_dirty_for_io(struct page *page);
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 37498ef..b8e7b3b 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1096,6 +1096,21 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
>> ? ? ? ? ? ? ? task_io_account_write(PAGE_CACHE_SIZE);
>> ? ? ? }
>> ?}
>> +EXPORT_SYMBOL(account_page_dirtied);
>> +
>> +/*
>> + * Helper function for set_page_writeback family.
>> + * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>> + * wrt interrupts.
>> + */
>> +
>> +void account_page_writeback(struct page *page, struct address_space *mapping)
>> +{
>> + ? ? if (mapping_cap_account_dirty(mapping))
>
> The 'if' test and *mapping parameter looks unnecessary at least for
> now. The only place a mapping has BDI_CAP_NO_ACCT_WB but not
> BDI_CAP_NO_WRITEBACK is fuse, which does its own accounting.

Cool.
Thanks.

2010-08-20 08:43:12

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Adding pages_dirtied and pages_entered_writeback

On Fri, Aug 20, 2010 at 04:16:09PM +0800, Michael Rubin wrote:
> On Thu, Aug 19, 2010 at 7:51 PM, Wu Fengguang <[email protected]> wrote:
> > As Rik said, /proc/sys is not a suitable place.
>
> OK I'm convinced.
>
> > Frankly speaking I've worked on writeback for years and never felt
> > the need to add these counters. What I often do is:
> >
> > $ vmmon -d 1 nr_writeback nr_dirty nr_unstable
> >
> >     nr_writeback         nr_dirty      nr_unstable
> >            68738                0            39568
> >            66051                0            42255
> >            63406                0            44900
> >            60643                0            47663
> >            57954                0            50352
> >            55264                0            53042
> >            52592                0            55715
> >            49922                0            58385
> > That is what I get when copying /dev/zero to NFS.
> >
> > I'm very interested in Google's use case for this patch, and why
> > the simple /proc/vmstat based vmmon tool is not enough.
>
> So as I understand it from looking at the code vmmon is sampling
> nr_writeback, nr_dirty which are exported versions of
> global_page_state for NR_FILE_DIRTY and NR_WRITEBACK. These states are
> a snapshot of the state of the kernel's pages. Namely how many dpages
> ar ein writeback or dirty at the moment vmmon's acquire routine is
> called.
>
> vmmon is sampling /proc/vstat and then displaying the difference from
> the last time they sampled. If I am misunderstanding let me know.

Maybe Andrew's vmmon does that. My vmmon always display the raw values
:) It could be improved to do raw values for nr_dirty and differences
for pgpgin by default.

> This is good for the state of the system but as we compare
> application, mm and io performance over long periods of time we are
> interested in the surges and fluctuations of the rates of the
> producing and consuming of dirty pages also. It can help isolate where
> the problem is and also to compare performance between kernels and/or
> applications.

Yeah the accumulated dirty and writeback page counts could be useful.
For example, for inspecting the dirty and writeback speed over time.
That's not possible for nr_dirty/nr_writeback.

Thanks,
Fengguang