2010-08-28 02:41:16

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 0/4] writeback: kernel visibility

Patch #1 sets up some helper functions for account_page_dirty

Patch #2 sets up some helper functions for account_page_writeback

Patch #3 adds writeback visibility in /proc/vmstat

To help developers and applications gain visibility into writeback
behaviour this patch adds two counters to /proc/vmstat.

# grep nr_dirtied /proc/vmstat
nr_dirtied 3747
# grep nr_cleaned /proc/vmstat
nr_cleaned 3618

These entries allow user apps to understand writeback behaviour over
time and learn how it is impacting their performance. Currently there
is no way to inspect dirty and writeback speed over time. It's not
possible for nr_dirty/nr_writeback.

These entries are necessary to give visibility into writeback
behaviour. We have /proc/diskstats which lets us understand the io in
the block layer. We have blktrace for more in depth understanding. We have
e2fsprogs and debugsfs to give insight into the file systems behaviour,
but we don't offer our users the ability understand what writeback is
doing. There is no way to know how active it is over the whole system,
if it's falling behind or to quantify it's efforts. With these values
exported users can easily see how much data applications are sending
through writeback and also at what rates writeback is processing this
data. Comparing the rates of change between the two allow developers
to see when writeback is not able to keep up with incoming traffic and
the rate of dirty memory being sent to the IO back end. This allows
folks to understand their io workloads and track kernel issues. Non
kernel engineers at Google often use these counters to solve puzzling
performance problems.

Patch #4 add writeback thresholds to /proc/vmstat

# grep threshold /proc/vmstat
nr_pages_dirty_threshold 409111
nr_pages_dirty_background_threshold 818223

The files that report the dirty thresholds belong in /proc/vmstat. They
are meant for application writers so should not be in debugfs. But since
they are more related to internals of writeback, albeit internals that
are fundamental to how it works, /proc/sys/vm is not appropriate.

These values are reported in debugfs already in
/debug/bdi/default/stats. Since debugfs is intended for kernel developers
and /proc for applications there is an argument to put it in /proc. Not
sure if that's enough but thought it worth attaching.

Michael Rubin (4):
mm: exporting account_page_dirty
mm: account_page_writeback added
writeback: nr_dirtied and nr_cleaned in /proc/vmstat
writeback: Reporting dirty thresholds in /proc/vmstat

drivers/base/node.c | 14 ++++++++++++++
fs/ceph/addr.c | 8 +-------
fs/nilfs2/segment.c | 2 +-
include/linux/mm.h | 1 +
include/linux/mmzone.h | 4 ++++
mm/page-writeback.c | 16 +++++++++++++++-
mm/vmstat.c | 8 ++++++++
7 files changed, 44 insertions(+), 9 deletions(-)


2010-08-28 02:41:18

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 1/4] mm: exporting account_page_dirty

This allows code outside of the mm core to safely manipulate page state
and not worry about the other accounting. Not using these routines means
that some code will lose track of the accounting and we get bugs. This
has happened once already.

Modified cephs to use the interface.

Signed-off-by: Michael Rubin <[email protected]>
Reviewed-by: Wu Fengguang <[email protected]>

---
fs/ceph/addr.c | 8 +-------
mm/page-writeback.c | 1 +
2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 5598a0d..420d469 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -105,13 +105,7 @@ static int ceph_set_page_dirty(struct page *page)
spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(!PageUptodate(page));
-
- if (mapping_cap_account_dirty(mapping)) {
- __inc_zone_page_state(page, NR_FILE_DIRTY);
- __inc_bdi_stat(mapping->backing_dev_info,
- BDI_RECLAIMABLE);
- task_io_account_write(PAGE_CACHE_SIZE);
- }
+ account_page_dirtied(page, page->mapping);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7262aac..9d07a8d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1131,6 +1131,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
task_io_account_write(PAGE_CACHE_SIZE);
}
}
+EXPORT_SYMBOL(account_page_dirtied);

/*
* For address_spaces which do not use buffers. Just tag the page as dirty in
--
1.7.1

2010-08-28 02:41:14

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 2/4] mm: account_page_writeback added

This allows code outside of the mm core to safely manipulate page
writeback state and not worry about the other accounting. Not using
these routines means that some code will lose track of the accounting
and we get bugs.

Modified nilfs2 to use interface.

Signed-off-by: Michael Rubin <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
fs/nilfs2/segment.c | 2 +-
include/linux/mm.h | 1 +
mm/page-writeback.c | 13 ++++++++++++-
3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 9fd051a..5617f16 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1599,7 +1599,7 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
kunmap_atomic(kaddr, KM_USER0);

if (!TestSetPageWriteback(clone_page))
- inc_zone_page_state(clone_page, NR_WRITEBACK);
+ account_page_writeback(clone_page);
unlock_page(clone_page);

return 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 709f672..4b2f38b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -856,6 +856,7 @@ int __set_page_dirty_no_writeback(struct page *page);
int redirty_page_for_writepage(struct writeback_control *wbc,
struct page *page);
void account_page_dirtied(struct page *page, struct address_space *mapping);
+void account_page_writeback(struct page *page);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
int clear_page_dirty_for_io(struct page *page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9d07a8d..ae5f5d5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1134,6 +1134,17 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
EXPORT_SYMBOL(account_page_dirtied);

/*
+ * Helper function for set_page_writeback family.
+ * NOTE: Unlike account_page_dirtied this does not rely on being atomic
+ * wrt interrupts.
+ */
+void account_page_writeback(struct page *page)
+{
+ inc_zone_page_state(page, NR_WRITEBACK);
+}
+EXPORT_SYMBOL(account_page_writeback);
+
+/*
* For address_spaces which do not use buffers. Just tag the page as dirty in
* its radix tree.
*
@@ -1371,7 +1382,7 @@ int test_set_page_writeback(struct page *page)
ret = TestSetPageWriteback(page);
}
if (!ret)
- inc_zone_page_state(page, NR_WRITEBACK);
+ account_page_writeback(page);
return ret;

}
--
1.7.1

2010-08-28 02:41:43

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 4/4] writeback: Reporting dirty thresholds in /proc/vmstat

The kernel already exposes the user desired thresholds in /proc/sys/vm
with dirty_background_ratio and background_ratio. But the kernel may
alter the number requested without giving the user any indication that
is the case.

Knowing the actual ratios the kernel is honoring can help app developers
understand how their buffered IO will be sent to the disk.

$ grep threshold /proc/vmstat
nr_dirty_threshold 409111
nr_dirty_background_threshold 818223

Signed-off-by: Michael Rubin <[email protected]>
---
include/linux/mmzone.h | 2 ++
mm/vmstat.c | 5 +++++
2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d42f179..ad48963 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -106,6 +106,8 @@ enum zone_stat_item {
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_FILE_PAGES_DIRTIED, /* number of times pages get dirtied */
NR_PAGES_CLEANED, /* number of times pages enter writeback */
+ NR_DIRTY_THRESHOLD, /* writeback threshold */
+ NR_DIRTY_BG_THRESHOLD, /* bg writeback threshold */
#ifdef CONFIG_NUMA
NUMA_HIT, /* allocated in intended node */
NUMA_MISS, /* allocated in non intended node */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8521475..2342010 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -17,6 +17,7 @@
#include <linux/vmstat.h>
#include <linux/sched.h>
#include <linux/math64.h>
+#include <linux/writeback.h>

#ifdef CONFIG_VM_EVENT_COUNTERS
DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -734,6 +735,8 @@ static const char * const vmstat_text[] = {
"nr_shmem",
"nr_dirtied",
"nr_cleaned",
+ "nr_dirty_threshold",
+ "nr_dirty_background_threshold",

#ifdef CONFIG_NUMA
"numa_hit",
@@ -917,6 +920,8 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
return ERR_PTR(-ENOMEM);
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
v[i] = global_page_state(i);
+
+ global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD, v + NR_DIRTY_THRESHOLD);
#ifdef CONFIG_VM_EVENT_COUNTERS
e = v + NR_VM_ZONE_STAT_ITEMS;
all_vm_events(e);
--
1.7.1

2010-08-28 02:42:01

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 3/4] writeback: nr_dirtied and nr_cleaned in /proc/vmstat

To help developers and applications gain visibility into writeback
behaviour adding two entries to /proc/vmstat.

# grep nr_dirtied /proc/vmstat
nr_dirtied 3747
# grep nr_cleaned /proc/vmstat
nr_cleaned 3618

In order to track the "cleaned" and "dirtied" counts we added two
vm_stat_items. Per memory node stats have been added also. So we can
see per node granularity:

# cat /sys/devices/system/node/node20/vmstat
Node 20 pages_cleaned: 0 times
Node 20 pages_dirtied: 0 times

Signed-off-by: Michael Rubin <[email protected]>
---
drivers/base/node.c | 14 ++++++++++++++
include/linux/mmzone.h | 2 ++
mm/page-writeback.c | 2 ++
mm/vmstat.c | 3 +++
4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 2872e86..facd920 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -160,6 +160,18 @@ static ssize_t node_read_numastat(struct sys_device * dev,
}
static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);

+static ssize_t node_read_vmstat(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ int nid = dev->id;
+ return sprintf(buf,
+ "Node %d pages_cleaned: %lu times\n"
+ "Node %d pages_dirtied: %lu times\n",
+ nid, node_page_state(nid, NR_PAGES_CLEANED),
+ nid, node_page_state(nid, NR_FILE_PAGES_DIRTIED));
+}
+static SYSDEV_ATTR(vmstat, S_IRUGO, node_read_vmstat, NULL);
+
static ssize_t node_read_distance(struct sys_device * dev,
struct sysdev_attribute *attr, char * buf)
{
@@ -243,6 +255,7 @@ int register_node(struct node *node, int num, struct node *parent)
sysdev_create_file(&node->sysdev, &attr_meminfo);
sysdev_create_file(&node->sysdev, &attr_numastat);
sysdev_create_file(&node->sysdev, &attr_distance);
+ sysdev_create_file(&node->sysdev, &attr_vmstat);

scan_unevictable_register_node(node);

@@ -267,6 +280,7 @@ void unregister_node(struct node *node)
sysdev_remove_file(&node->sysdev, &attr_meminfo);
sysdev_remove_file(&node->sysdev, &attr_numastat);
sysdev_remove_file(&node->sysdev, &attr_distance);
+ sysdev_remove_file(&node->sysdev, &attr_vmstat);

scan_unevictable_unregister_node(node);
hugetlb_unregister_node(node); /* no-op, if memoryless node */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e6e626..d42f179 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -104,6 +104,8 @@ enum zone_stat_item {
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
+ NR_FILE_PAGES_DIRTIED, /* number of times pages get dirtied */
+ NR_PAGES_CLEANED, /* number of times pages enter writeback */
#ifdef CONFIG_NUMA
NUMA_HIT, /* allocated in intended node */
NUMA_MISS, /* allocated in non intended node */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ae5f5d5..19bb8c0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1126,6 +1126,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
{
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+ __inc_zone_page_state(page, NR_FILE_PAGES_DIRTIED);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
task_dirty_inc(current);
task_io_account_write(PAGE_CACHE_SIZE);
@@ -1141,6 +1142,7 @@ EXPORT_SYMBOL(account_page_dirtied);
void account_page_writeback(struct page *page)
{
inc_zone_page_state(page, NR_WRITEBACK);
+ inc_zone_page_state(page, NR_PAGES_CLEANED);
}
EXPORT_SYMBOL(account_page_writeback);

diff --git a/mm/vmstat.c b/mm/vmstat.c
index f389168..8521475 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -732,6 +732,9 @@ static const char * const vmstat_text[] = {
"nr_isolated_anon",
"nr_isolated_file",
"nr_shmem",
+ "nr_dirtied",
+ "nr_cleaned",
+
#ifdef CONFIG_NUMA
"numa_hit",
"numa_miss",
--
1.7.1

2010-08-28 22:11:03

by Sage Weil

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: exporting account_page_dirty

This one was just merged with the other Ceph stuff.

Thanks!
sage


On Fri, 27 Aug 2010, Michael Rubin wrote:

> This allows code outside of the mm core to safely manipulate page state
> and not worry about the other accounting. Not using these routines means
> that some code will lose track of the accounting and we get bugs. This
> has happened once already.
>
> Modified cephs to use the interface.
>
> Signed-off-by: Michael Rubin <[email protected]>
> Reviewed-by: Wu Fengguang <[email protected]>
>
> ---
> fs/ceph/addr.c | 8 +-------
> mm/page-writeback.c | 1 +
> 2 files changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 5598a0d..420d469 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -105,13 +105,7 @@ static int ceph_set_page_dirty(struct page *page)
> spin_lock_irq(&mapping->tree_lock);
> if (page->mapping) { /* Race with truncate? */
> WARN_ON_ONCE(!PageUptodate(page));
> -
> - if (mapping_cap_account_dirty(mapping)) {
> - __inc_zone_page_state(page, NR_FILE_DIRTY);
> - __inc_bdi_stat(mapping->backing_dev_info,
> - BDI_RECLAIMABLE);
> - task_io_account_write(PAGE_CACHE_SIZE);
> - }
> + account_page_dirtied(page, page->mapping);
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 7262aac..9d07a8d 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1131,6 +1131,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
> task_io_account_write(PAGE_CACHE_SIZE);
> }
> }
> +EXPORT_SYMBOL(account_page_dirtied);
>
> /*
> * For address_spaces which do not use buffers. Just tag the page as dirty in
> --
> 1.7.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

2010-08-28 23:50:41

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 3/4] writeback: nr_dirtied and nr_cleaned in /proc/vmstat

On Sat, Aug 28, 2010 at 10:40:26AM +0800, Michael Rubin wrote:
> To help developers and applications gain visibility into writeback
> behaviour adding two entries to /proc/vmstat.
>
> # grep nr_dirtied /proc/vmstat
> nr_dirtied 3747
> # grep nr_cleaned /proc/vmstat
> nr_cleaned 3618
>
> In order to track the "cleaned" and "dirtied" counts we added two
> vm_stat_items. Per memory node stats have been added also. So we can
> see per node granularity:
>
> # cat /sys/devices/system/node/node20/vmstat
> Node 20 pages_cleaned: 0 times
> Node 20 pages_dirtied: 0 times

It's silly to have the different names nr_dirtied and pages_cleaned
for the same item.

> Signed-off-by: Michael Rubin <[email protected]>
> ---
> drivers/base/node.c | 14 ++++++++++++++
> include/linux/mmzone.h | 2 ++
> mm/page-writeback.c | 2 ++
> mm/vmstat.c | 3 +++
> 4 files changed, 21 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 2872e86..facd920 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -160,6 +160,18 @@ static ssize_t node_read_numastat(struct sys_device * dev,
> }
> static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>
> +static ssize_t node_read_vmstat(struct sys_device *dev,
> + struct sysdev_attribute *attr, char *buf)
> +{
> + int nid = dev->id;
> + return sprintf(buf,
> + "Node %d pages_cleaned: %lu times\n"
> + "Node %d pages_dirtied: %lu times\n",
> + nid, node_page_state(nid, NR_PAGES_CLEANED),
> + nid, node_page_state(nid, NR_FILE_PAGES_DIRTIED));
> +}

The output format is quite different from /proc/vmstat.
Do we really need to "Node X", ":" and "times" decorations?

And the "_PAGES" in NR_FILE_PAGES_DIRTIED looks redundant to
the "_page" in node_page_state(). It's a bit long to be a pleasant
name. NR_FILE_DIRTIED/NR_CLEANED looks nicer.

> +static SYSDEV_ATTR(vmstat, S_IRUGO, node_read_vmstat, NULL);
> +
> static ssize_t node_read_distance(struct sys_device * dev,
> struct sysdev_attribute *attr, char * buf)
> {
> @@ -243,6 +255,7 @@ int register_node(struct node *node, int num, struct node *parent)
> sysdev_create_file(&node->sysdev, &attr_meminfo);
> sysdev_create_file(&node->sysdev, &attr_numastat);
> sysdev_create_file(&node->sysdev, &attr_distance);
> + sysdev_create_file(&node->sysdev, &attr_vmstat);
>
> scan_unevictable_register_node(node);
>
> @@ -267,6 +280,7 @@ void unregister_node(struct node *node)
> sysdev_remove_file(&node->sysdev, &attr_meminfo);
> sysdev_remove_file(&node->sysdev, &attr_numastat);
> sysdev_remove_file(&node->sysdev, &attr_distance);
> + sysdev_remove_file(&node->sysdev, &attr_vmstat);
>
> scan_unevictable_unregister_node(node);
> hugetlb_unregister_node(node); /* no-op, if memoryless node */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 6e6e626..d42f179 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -104,6 +104,8 @@ enum zone_stat_item {
> NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
> NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
> NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
> + NR_FILE_PAGES_DIRTIED, /* number of times pages get dirtied */
> + NR_PAGES_CLEANED, /* number of times pages enter writeback */

How about the comments /* accumulated number of pages ... */?

Note that NR_CLEANED won't match NR_FILE_DIRTIED in long term because
it also accounts for anon pages, and does not account for dirty pages
that are truncated before they go writeback.

Thanks,
Fengguang

2010-08-30 00:28:39

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 4/4] writeback: Reporting dirty thresholds in /proc/vmstat

> The kernel already exposes the user desired thresholds in /proc/sys/vm
> with dirty_background_ratio and background_ratio. But the kernel may
> alter the number requested without giving the user any indication that
> is the case.
>
> Knowing the actual ratios the kernel is honoring can help app developers
> understand how their buffered IO will be sent to the disk.
>
> $ grep threshold /proc/vmstat
> nr_dirty_threshold 409111
> nr_dirty_background_threshold 818223

?
afaict, you and wu agreed /debug/bdi/default/stats is enough good.
why do you change your mention?


>
> Signed-off-by: Michael Rubin <[email protected]>
> ---
> include/linux/mmzone.h | 2 ++
> mm/vmstat.c | 5 +++++
> 2 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d42f179..ad48963 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -106,6 +106,8 @@ enum zone_stat_item {
> NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
> NR_FILE_PAGES_DIRTIED, /* number of times pages get dirtied */
> NR_PAGES_CLEANED, /* number of times pages enter writeback */
> + NR_DIRTY_THRESHOLD, /* writeback threshold */
> + NR_DIRTY_BG_THRESHOLD, /* bg writeback threshold */

Don't need this even though we add this two fields into /proc/sys/vm.
It can be calculated at displaing time.



> #ifdef CONFIG_NUMA
> NUMA_HIT, /* allocated in intended node */
> NUMA_MISS, /* allocated in non intended node */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 8521475..2342010 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -17,6 +17,7 @@
> #include <linux/vmstat.h>
> #include <linux/sched.h>
> #include <linux/math64.h>
> +#include <linux/writeback.h>
>
> #ifdef CONFIG_VM_EVENT_COUNTERS
> DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
> @@ -734,6 +735,8 @@ static const char * const vmstat_text[] = {
> "nr_shmem",
> "nr_dirtied",
> "nr_cleaned",
> + "nr_dirty_threshold",
> + "nr_dirty_background_threshold",
>
> #ifdef CONFIG_NUMA
> "numa_hit",
> @@ -917,6 +920,8 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
> return ERR_PTR(-ENOMEM);
> for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> v[i] = global_page_state(i);
> +
> + global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD, v + NR_DIRTY_THRESHOLD);
> #ifdef CONFIG_VM_EVENT_COUNTERS
> e = v + NR_VM_ZONE_STAT_ITEMS;
> all_vm_events(e);
> --
> 1.7.1
>


2010-08-30 16:26:18

by Michael Rubin

[permalink] [raw]
Subject: Re: [PATCH 4/4] writeback: Reporting dirty thresholds in /proc/vmstat

On Sun, Aug 29, 2010 at 5:28 PM, KOSAKI Motohiro
<[email protected]> wrote:
> afaict, you and wu agreed /debug/bdi/default/stats is enough good.
> why do you change your mention?

I commented on this in the 0/4 email of the bug. I think these belong
in /proc/vmstat but I saw they exist in /debug/bdi/default/stats. I
figure they will probably not be accepted but I thought it was worth
attaching for consideration of upgrading from debugfs to /proc.

mrubin

2010-08-31 01:07:38

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 4/4] writeback: Reporting dirty thresholds in /proc/vmstat

> On Sun, Aug 29, 2010 at 5:28 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> > afaict, you and wu agreed /debug/bdi/default/stats is enough good.
> > why do you change your mention?
>
> I commented on this in the 0/4 email of the bug. I think these belong
> in /proc/vmstat but I saw they exist in /debug/bdi/default/stats. I
> figure they will probably not be accepted but I thought it was worth
> attaching for consideration of upgrading from debugfs to /proc.

For reviewers view, we are reviewing your patch to merge immediately if all issue are fixed.
Then, I'm unhappy if you don't drop merge blocker item even though you merely want asking.
At least, you can make separate thread, no?

Of cource, wen other user also want to expose via /proc interface, we are resume
this discusstion gradly.


2010-08-31 01:33:09

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 4/4] writeback: Reporting dirty thresholds in /proc/vmstat

On Tue, Aug 31, 2010 at 09:07:32AM +0800, KOSAKI Motohiro wrote:
> > On Sun, Aug 29, 2010 at 5:28 PM, KOSAKI Motohiro
> > <[email protected]> wrote:
> > > afaict, you and wu agreed /debug/bdi/default/stats is enough good.
> > > why do you change your mention?
> >
> > I commented on this in the 0/4 email of the bug. I think these belong
> > in /proc/vmstat but I saw they exist in /debug/bdi/default/stats. I
> > figure they will probably not be accepted but I thought it was worth
> > attaching for consideration of upgrading from debugfs to /proc.
>
> For reviewers view, we are reviewing your patch to merge immediately if all issue are fixed.
> Then, I'm unhappy if you don't drop merge blocker item even though you merely want asking.
> At least, you can make separate thread, no?
>
> Of cource, wen other user also want to expose via /proc interface, we are resume
> this discusstion gradly.

Michael asked promoting the dirty thresholds from debugfs to /proc.
As a developer I'd interpret the question as: will there be enough
applications/admins using it? If not, we'd better keep it as debugfs.
Otherwise it benefits to do the interface promotion now, because it
will hurt to accumulate many end user dependencies on debugfs over
time..

Thanks,
Fengguang

2010-08-31 06:09:40

by Michael Rubin

[permalink] [raw]
Subject: Re: [PATCH 3/4] writeback: nr_dirtied and nr_cleaned in /proc/vmstat

On Sat, Aug 28, 2010 at 4:50 PM, Wu Fengguang <[email protected]> wrote:
> It's silly to have the different names nr_dirtied and pages_cleaned
> for the same item.

I agree. Will fix.

> The output format is quite different from /proc/vmstat.
> Do we really need to "Node X", ":" and "times" decorations?

Node X is based on the meminfo file but I agree it's redundant information.

> And the "_PAGES" in NR_FILE_PAGES_DIRTIED looks redundant to
> the "_page" in node_page_state(). It's a bit long to be a pleasant
> name. NR_FILE_DIRTIED/NR_CLEANED looks nicer.

Yeah. Will fix.


>> +static SYSDEV_ATTR(vmstat, S_IRUGO, node_read_vmstat, NULL);
>> +
>> ?static ssize_t node_read_distance(struct sys_device * dev,
>> ? ? ? ? ? ? ? ? ? ? ? struct sysdev_attribute *attr, char * buf)
>> ?{
>> @@ -243,6 +255,7 @@ int register_node(struct node *node, int num, struct node *parent)
>> ? ? ? ? ? ? ? sysdev_create_file(&node->sysdev, &attr_meminfo);
>> ? ? ? ? ? ? ? sysdev_create_file(&node->sysdev, &attr_numastat);
>> ? ? ? ? ? ? ? sysdev_create_file(&node->sysdev, &attr_distance);
>> + ? ? ? ? ? ? sysdev_create_file(&node->sysdev, &attr_vmstat);
>>
>> ? ? ? ? ? ? ? scan_unevictable_register_node(node);
>>
>> @@ -267,6 +280,7 @@ void unregister_node(struct node *node)
>> ? ? ? sysdev_remove_file(&node->sysdev, &attr_meminfo);
>> ? ? ? sysdev_remove_file(&node->sysdev, &attr_numastat);
>> ? ? ? sysdev_remove_file(&node->sysdev, &attr_distance);
>> + ? ? sysdev_remove_file(&node->sysdev, &attr_vmstat);
>>
>> ? ? ? scan_unevictable_unregister_node(node);
>> ? ? ? hugetlb_unregister_node(node); ? ? ? ? ?/* no-op, if memoryless node */
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 6e6e626..d42f179 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -104,6 +104,8 @@ enum zone_stat_item {
>> ? ? ? NR_ISOLATED_ANON, ? ? ? /* Temporary isolated pages from anon lru */
>> ? ? ? NR_ISOLATED_FILE, ? ? ? /* Temporary isolated pages from file lru */
>> ? ? ? NR_SHMEM, ? ? ? ? ? ? ? /* shmem pages (included tmpfs/GEM pages) */
>> + ? ? NR_FILE_PAGES_DIRTIED, ?/* number of times pages get dirtied */
>> + ? ? NR_PAGES_CLEANED, ? ? ? /* number of times pages enter writeback */
>
> How about the comments /* accumulated number of pages ... */?

OK.
May not get patch out today but will incorporate these fixes. Thank you again.

mrubin

2010-08-31 07:48:45

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 3/4] writeback: nr_dirtied and nr_cleaned in /proc/vmstat

> > The output format is quite different from /proc/vmstat.
> > Do we really need to "Node X", ":" and "times" decorations?
>
> Node X is based on the meminfo file but I agree it's redundant information.

Thanks. In the same directory you can find a different style example
/sys/devices/system/node/node0/numastat :) If ever the file was named
vmstat! In the other hand, shall we put the numbers there? I'm confused..

> > And the "_PAGES" in NR_FILE_PAGES_DIRTIED looks redundant to
> > the "_page" in node_page_state(). It's a bit long to be a pleasant
> > name. NR_FILE_DIRTIED/NR_CLEANED looks nicer.
>
> Yeah. Will fix.

Thanks. This is kind of nitpick, however here is another name by
Jan Kara: BDI_WRITTEN. BDI_WRITTEN may not be a lot better than
BDI_CLEANED, but here is a patch based on Jan's code. I'm cooking
more patches that make use of this per-bdi counter to estimate the
bdi's write bandwidth, and to further decide the optimal (large)
writeback chunk size as well as to do IO-less balance_dirty_pages().

Basically BDI_WRITTEN and NR_CLEANED are accounting for the same
thing in different dimensions. So it would be good if we can use
the same naming scheme to avoid confusing users: either to use
BDI_WRITTEN and NR_WRITTEN, or use BDI_CLEANED and NR_CLEANED.
What's your opinion?

Thanks,
Fengguang

---
writeback: account per-bdi accumulated written pages

This steals code from Jan's balance_dirty_pages() patch.

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/linux/backing-dev.h | 1 +
mm/backing-dev.c | 6 ++++--
mm/page-writeback.c | 1 +
3 files changed, 6 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h 2010-08-29 23:32:46.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2010-08-29 23:52:50.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
enum bdi_stat_item {
BDI_RECLAIMABLE,
BDI_WRITEBACK,
+ BDI_WRITTEN,
NR_BDI_STAT_ITEMS
};

--- linux-next.orig/mm/backing-dev.c 2010-08-29 23:32:46.000000000 +0800
+++ linux-next/mm/backing-dev.c 2010-08-29 23:52:50.000000000 +0800
@@ -91,6 +91,7 @@ static int bdi_debug_stats_show(struct s
"BdiDirtyThresh: %8lu kB\n"
"DirtyThresh: %8lu kB\n"
"BackgroundThresh: %8lu kB\n"
+ "BdiWritten: %8lu kB\n"
"b_dirty: %8lu\n"
"b_io: %8lu\n"
"b_more_io: %8lu\n"
@@ -98,8 +99,9 @@ static int bdi_debug_stats_show(struct s
"state: %8lx\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
- K(bdi_thresh), K(dirty_thresh),
- K(background_thresh), nr_dirty, nr_io, nr_more_io,
+ K(bdi_thresh), K(dirty_thresh), K(background_thresh),
+ (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+ nr_dirty, nr_io, nr_more_io,
!list_empty(&bdi->bdi_list), bdi->state);
#undef K

--- linux-next.orig/mm/page-writeback.c 2010-08-29 23:52:47.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-08-29 23:52:50.000000000 +0800
@@ -1305,6 +1305,7 @@ int test_clear_page_writeback(struct pag
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi)) {
__dec_bdi_stat(bdi, BDI_WRITEBACK);
+ __inc_bdi_stat(bdi, BDI_WRITTEN);
__bdi_writeout_inc(bdi);
}
}