2010-06-19 00:31:06

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 0/3] writeback visibility

Debugging writeback issues and tuning an application's writeback activity is
easier when the activity is visible. With large clusters, classifying
and root causing writeback problems has been a big headache. This patch
series contains a series of patches that our team has been using to start
getting a handle on writeback behaviour. These changes should be helpful
for single system maintainers also. It's still a big headache.

Once these changes are reviewed I will make sure the Documentation files
are updated, but I expect some back and forth first.

Michael Rubin (3):
writeback: Creating /sys/kernel/mm/writeback/writeback
writeback: per bdi monitoring
writeback: tracking subsystems causing writeback

drivers/base/node.c | 14 +++++
fs/buffer.c | 2 +-
fs/fs-writeback.c | 28 +++++++--
fs/nilfs2/segment.c | 4 +-
fs/sync.c | 2 +-
include/linux/backing-dev.h | 9 +++
include/linux/mmzone.h | 2 +
include/linux/writeback.h | 50 +++++++++++++++-
mm/backing-dev.c | 137 ++++++++++++++++++++++---------------------
mm/mm_init.c | 122 ++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 18 ++++--
mm/vmscan.c | 3 +-
mm/vmstat.c | 2 +
13 files changed, 311 insertions(+), 82 deletions(-)


2010-06-19 00:31:11

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 1/3] writeback: Creating /sys/kernel/mm/writeback/writeback

Adding the /sys/kernel/mm/writeback/writeback file. It contains data
to help developers and applications gain visibility into writeback
behaviour.

# cat /sys/kernel/mm/writeback/writeback
pages_dirtied: 3747
pages_cleaned: 3618
dirty_threshold: 816673
bg_threshold: 408336

The motivation of a sys file as opposed to debugfs is that applications
that do not have permissions to mount file systems often need this data.

In order to track the "cleaned" and "dirtied" counts we added two
vm_stat_items. Per memory node stats have been added also. So we can
see per node granularity:

# cat /sys/devices/system/node/node20/writebackstat
Node 20 pages_writeback: 0 times
Node 20 pages_dirtied: 0 times

Signed-off-by: Michael Rubin <[email protected]>
---
drivers/base/node.c | 14 ++++++++++++++
fs/nilfs2/segment.c | 4 +++-
include/linux/mmzone.h | 2 ++
mm/mm_init.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 6 ++++--
mm/vmstat.c | 2 ++
6 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 2bdd8a9..b321d32 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -160,6 +160,18 @@ static ssize_t node_read_numastat(struct sys_device * dev,
}
static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);

+static ssize_t node_read_writebackstat(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ int nid = dev->id;
+ return sprintf(buf,
+ "Node %d pages_writeback: %lu times\n"
+ "Node %d pages_dirtied: %lu times\n",
+ nid, node_page_state(nid, NR_PAGES_ENTERED_WRITEBACK),
+ nid, node_page_state(nid, NR_FILE_PAGES_DIRTIED));
+}
+static SYSDEV_ATTR(writebackstat, S_IRUGO, node_read_writebackstat, NULL);
+
static ssize_t node_read_distance(struct sys_device * dev,
struct sysdev_attribute *attr, char * buf)
{
@@ -243,6 +255,7 @@ int register_node(struct node *node, int num, struct node *parent)
sysdev_create_file(&node->sysdev, &attr_meminfo);
sysdev_create_file(&node->sysdev, &attr_numastat);
sysdev_create_file(&node->sysdev, &attr_distance);
+ sysdev_create_file(&node->sysdev, &attr_writebackstat);

scan_unevictable_register_node(node);

@@ -267,6 +280,7 @@ void unregister_node(struct node *node)
sysdev_remove_file(&node->sysdev, &attr_meminfo);
sysdev_remove_file(&node->sysdev, &attr_numastat);
sysdev_remove_file(&node->sysdev, &attr_distance);
+ sysdev_remove_file(&node->sysdev, &attr_writebackstat);

scan_unevictable_unregister_node(node);
hugetlb_unregister_node(node); /* no-op, if memoryless node */
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index c920164..84b0181 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1598,8 +1598,10 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
kunmap_atomic(kaddr, KM_USER0);

- if (!TestSetPageWriteback(clone_page))
+ if (!TestSetPageWriteback(clone_page)) {
inc_zone_page_state(clone_page, NR_WRITEBACK);
+ inc_zone_page_state(clone_page, NR_PAGES_ENTERED_WRITEBACK);
+ }
unlock_page(clone_page);

return 0;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..c0cd2bd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -112,6 +112,8 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
#endif
+ NR_PAGES_ENTERED_WRITEBACK, /* number of times pages enter writeback */
+ NR_FILE_PAGES_DIRTIED, /* number of times pages get dirtied */
NR_VM_ZONE_STAT_ITEMS };

/*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4e0e265..8f2ebdb 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -9,6 +9,7 @@
#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/module.h>
+#include <linux/writeback.h>
#include "internal.h"

#ifdef CONFIG_DEBUG_MEMORY_INIT
@@ -137,6 +138,52 @@ static __init int set_mminit_loglevel(char *str)
early_param("mminit_loglevel", set_mminit_loglevel);
#endif /* CONFIG_DEBUG_MEMORY_INIT */

+#define KERNEL_ATTR_RO(_name) \
+static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
+
+static ssize_t writeback_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ unsigned long dirty, background;
+ get_dirty_limits(&background, &dirty, NULL, NULL);
+ return sprintf(buf,
+ "pages_dirtied: %lu\n"
+ "pages_cleaned: %lu\n"
+ "dirty_threshold: %lu\n"
+ "bg_threshold: %lu\n",
+ global_page_state(NR_FILE_PAGES_DIRTIED),
+ global_page_state(NR_PAGES_ENTERED_WRITEBACK),
+ dirty, background);
+}
+
+KERNEL_ATTR_RO(writeback);
+
+static struct attribute *writeback_attrs[] = {
+ &writeback_attr.attr,
+ NULL,
+};
+
+static struct attribute_group writeback_attr_group = {
+ .attrs = writeback_attrs,
+};
+
+static int mm_sysfs_writeback_init(void)
+{
+ int error;
+
+ struct kobject *writeback_kobj =
+ kobject_create_and_add("writeback", mm_kobj);
+ if (writeback_kobj == NULL)
+ return -ENOMEM;
+
+ error = sysfs_create_group(writeback_kobj, &writeback_attr_group);
+ if (error) {
+ kobject_put(mm_kobj);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
struct kobject *mm_kobj;
EXPORT_SYMBOL_GPL(mm_kobj);

@@ -146,6 +193,7 @@ static int __init mm_sysfs_init(void)
if (!mm_kobj)
return -ENOMEM;

+ mm_sysfs_writeback_init();
return 0;
}

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bbd396a..4cea5c0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1093,6 +1093,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
{
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
+ __inc_zone_page_state(page, NR_FILE_PAGES_DIRTIED);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
task_dirty_inc(current);
task_io_account_write(PAGE_CACHE_SIZE);
@@ -1333,10 +1334,11 @@ int test_set_page_writeback(struct page *page)
} else {
ret = TestSetPageWriteback(page);
}
- if (!ret)
+ if (!ret) {
inc_zone_page_state(page, NR_WRITEBACK);
+ inc_zone_page_state(page, NR_PAGES_ENTERED_WRITEBACK);
+ }
return ret;
-
}
EXPORT_SYMBOL(test_set_page_writeback);

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..e177a40 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -740,6 +740,8 @@ static const char * const vmstat_text[] = {
"numa_local",
"numa_other",
#endif
+ "nr_pages_entered_writeback",
+ "nr_file_pages_dirtied",

#ifdef CONFIG_VM_EVENT_COUNTERS
"pgpgin",
--
1.7.0.1

2010-06-19 00:31:09

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 2/3] writeback: per bdi monitoring

To allow users and applications to gain visibility into the writeback
behaviour debugfs files are moved into /sys. bdi granularity
of visibility is important to root cause both rogue user apps and/or
kernel issues.

This patch converts bdi debug files of bdi into
/sys/block/<dev>/bdi/writeback.

# cat /sys/class/block/sda/bdi/writeback
BdiWriteback: 0 kB
BdiReclaimable: 96 kB
BdiDirtyThresh: 2117144 kB
DirtyThresh: 12192420 kB
BackgroundThresh: 1625656 kB
WritebackThreads: 1
WorkWaiting: 0
InodesOnDirty: 37
InodesOnIo: 0
InodesOnMoreIo: 0
wb_mask: 1
wb_cnt: 1

Also it adds a sys file for the bdi state in
/sys/block/<block>/bdi/state. The file will expose the state of
the bdi in a comma separated list.

# cat /sys/class/block/sda/bdi/state
registered,sync_congested

TESTED:
Injected false states in the code stream to ensure the state file worked
correctly. Then removed false states.

Performed IO on different disks to see the correct writeback values
fluctuate and report what was expected.

Signed-off-by: Michael Rubin <[email protected]>
---
fs/fs-writeback.c | 13 +++++
include/linux/backing-dev.h | 1 +
mm/backing-dev.c | 126 +++++++++++++++++++------------------------
3 files changed, 70 insertions(+), 70 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1d1088f..62a899e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -95,6 +95,19 @@ int writeback_in_progress(struct backing_dev_info *bdi)
return !list_empty(&bdi->work_list);
}

+int writeback_work_waiting(struct backing_dev_info *bdi)
+{
+ struct bdi_work *work;
+ int nr_wb_work = 0;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(work, &bdi->work_list, list) {
+ nr_wb_work++;
+ }
+ rcu_read_unlock();
+ return nr_wb_work;
+}
+
static void bdi_work_clear(struct bdi_work *work)
{
clear_bit(WS_USED_B, &work->state);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index aee5f6c..e1fb11c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -255,6 +255,7 @@ extern struct backing_dev_info noop_backing_dev_info;
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page);

int writeback_in_progress(struct backing_dev_info *bdi);
+int writeback_work_waiting(struct backing_dev_info *bdi);

static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
{
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 660a87a..9a89296 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -51,25 +51,52 @@ static void sync_supers_timer_fn(unsigned long);

static void bdi_add_default_flusher_task(struct backing_dev_info *bdi);

-#ifdef CONFIG_DEBUG_FS
-#include <linux/debugfs.h>
-#include <linux/seq_file.h>
-
-static struct dentry *bdi_debug_root;
+static const char *bdi_state_labels[] = {
+ "pending",
+ "wb_alloc",
+ "async_congested",
+ "sync_congested",
+ "registered",
+};

-static void bdi_debug_init(void)
+static ssize_t state_show(struct device *dev, struct device_attribute *attr,
+ char *page)
{
- bdi_debug_root = debugfs_create_dir("bdi", NULL);
+ struct backing_dev_info *bdi = dev_get_drvdata(dev);
+ int statebit, printed = 0;
+ int bufsize = PAGE_SIZE - 1;
+
+ /* Caching for consistency */
+ unsigned long state = ((1 << BDI_unused) - 1) & bdi->state;
+
+ if (state & 0)
+ return snprintf(page, bufsize, "No state\n");
+
+ for (statebit = 0; statebit < BDI_unused; statebit++) {
+ if (!test_bit(statebit, &state))
+ continue;
+ printed += snprintf(page + printed, bufsize - printed,
+ "%s,", bdi_state_labels[statebit]);
+ if (printed >= bufsize) {
+ page[PAGE_SIZE - 1] = '\n';
+ return PAGE_SIZE;
+ }
+ }
+
+ page[printed - 1] = '\n';
+ return printed;
}

-static int bdi_debug_stats_show(struct seq_file *m, void *v)
+
+static ssize_t writeback_show(struct device *dev, struct device_attribute *attr,
+ char *page)
{
- struct backing_dev_info *bdi = m->private;
+ struct backing_dev_info *bdi = dev_get_drvdata(dev);
struct bdi_writeback *wb;
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
- unsigned long nr_dirty, nr_io, nr_more_io, nr_wb;
+ unsigned long nr_dirty, nr_io, nr_more_io, nr_wb, nr_wb_work;
struct inode *inode;

/*
@@ -90,70 +117,30 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
spin_unlock(&inode_lock);

get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
+ nr_wb_work = writeback_work_waiting(bdi);

#define K(x) ((x) << (PAGE_SHIFT - 10))
- seq_printf(m,
- "BdiWriteback: %8lu kB\n"
- "BdiReclaimable: %8lu kB\n"
- "BdiDirtyThresh: %8lu kB\n"
- "DirtyThresh: %8lu kB\n"
- "BackgroundThresh: %8lu kB\n"
- "WritebackThreads: %8lu\n"
- "b_dirty: %8lu\n"
- "b_io: %8lu\n"
- "b_more_io: %8lu\n"
- "bdi_list: %8u\n"
- "state: %8lx\n"
- "wb_mask: %8lx\n"
- "wb_list: %8u\n"
- "wb_cnt: %8u\n",
+ return snprintf(page, PAGE_SIZE-1,
+ "BdiWriteback: %8lu kB\n"
+ "BdiReclaimable: %8lu kB\n"
+ "BdiDirtyThresh: %8lu kB\n"
+ "DirtyThresh: %8lu kB\n"
+ "BackgroundThresh: %8lu kB\n"
+ "WritebackThreads: %8lu\n"
+ "WorkWaiting: %8lu\n"
+ "InodesOnDirty: %8lu\n"
+ "InodesOnIo: %8lu\n"
+ "InodesOnMoreIo: %8lu\n"
+ "wb_mask: %8lx\n"
+ "wb_cnt: %8u\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
K(bdi_thresh), K(dirty_thresh),
- K(background_thresh), nr_wb, nr_dirty, nr_io, nr_more_io,
- !list_empty(&bdi->bdi_list), bdi->state, bdi->wb_mask,
- !list_empty(&bdi->wb_list), bdi->wb_cnt);
+ K(background_thresh), nr_wb, nr_wb_work,
+ nr_dirty, nr_io, nr_more_io,
+ bdi->wb_mask, bdi->wb_cnt);
#undef K
-
- return 0;
-}
-
-static int bdi_debug_stats_open(struct inode *inode, struct file *file)
-{
- return single_open(file, bdi_debug_stats_show, inode->i_private);
-}
-
-static const struct file_operations bdi_debug_stats_fops = {
- .open = bdi_debug_stats_open,
- .read = seq_read,
- .llseek = seq_lseek,
- .release = single_release,
-};
-
-static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
-{
- bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root);
- bdi->debug_stats = debugfs_create_file("stats", 0444, bdi->debug_dir,
- bdi, &bdi_debug_stats_fops);
-}
-
-static void bdi_debug_unregister(struct backing_dev_info *bdi)
-{
- debugfs_remove(bdi->debug_stats);
- debugfs_remove(bdi->debug_dir);
-}
-#else
-static inline void bdi_debug_init(void)
-{
-}
-static inline void bdi_debug_register(struct backing_dev_info *bdi,
- const char *name)
-{
-}
-static inline void bdi_debug_unregister(struct backing_dev_info *bdi)
-{
}
-#endif

static ssize_t read_ahead_kb_store(struct device *dev,
struct device_attribute *attr,
@@ -224,6 +211,8 @@ BDI_SHOW(max_ratio, bdi->max_ratio)
#define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)

static struct device_attribute bdi_dev_attrs[] = {
+ __ATTR_RO(state),
+ __ATTR_RO(writeback),
__ATTR_RW(read_ahead_kb),
__ATTR_RW(min_ratio),
__ATTR_RW(max_ratio),
@@ -237,7 +226,6 @@ static __init int bdi_class_init(void)
return PTR_ERR(bdi_class);

bdi_class->dev_attrs = bdi_dev_attrs;
- bdi_debug_init();
return 0;
}
postcore_initcall(bdi_class_init);
@@ -583,7 +571,6 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
}
}

- bdi_debug_register(bdi, dev_name(dev));
set_bit(BDI_registered, &bdi->state);
exit:
return ret;
@@ -651,7 +638,6 @@ void bdi_unregister(struct backing_dev_info *bdi)

if (!bdi_cap_flush_forker(bdi))
bdi_wb_shutdown(bdi);
- bdi_debug_unregister(bdi);
device_unregister(bdi->dev);
bdi->dev = NULL;
}
--
1.7.0.1

2010-06-19 00:31:45

by Michael Rubin

[permalink] [raw]
Subject: [PATCH 3/3] writeback: tracking subsystems causing writeback

Adding stats to indicate which subsystem is causing writeback.
Implemented as per cpu counters rather than tracepoints so we can gather
statistics continuously and cheaply.

Knowing what is the source of writeback acitivity can help tune and
debug applications.

The data is exported in bdi granularity at /sys/block/<dev>/bdi/writeback_stats.

# cat /sys/block/sda/bdi/writeback_stats
balance dirty pages 0
balance dirty pages waiting 0
periodic writeback 92024
periodic writeback exited 0
laptop periodic 0
laptop or bg threshold 0
free more memory 0
try to free pages 271
syc_sync 6
sync filesystem 0

For the entire system you can find them at /sys/kernel/mm/writeback/stats.

# cat /sys/kernel/mm/writeback/stats
balance dirty pages 0
balance dirty pages waiting 0
periodic writeback 92892
periodic writeback exited 13
laptop periodic 0
laptop or bg threshold 0
free more memory 0
try to free pages 271
syc_sync 6
sync filesystem 0

Signed-off-by: Michael Rubin <[email protected]>
---
fs/buffer.c | 2 +-
fs/fs-writeback.c | 15 +++++---
fs/sync.c | 2 +-
include/linux/backing-dev.h | 8 ++++
include/linux/writeback.h | 50 +++++++++++++++++++++++++-
mm/backing-dev.c | 17 +++++++++
mm/mm_init.c | 82 ++++++++++++++++++++++++++++++++++++++++--
mm/page-writeback.c | 12 +++++--
mm/vmscan.c | 3 +-
9 files changed, 175 insertions(+), 16 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index d54812b..2a0ebe0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -288,7 +288,7 @@ static void free_more_memory(void)
struct zone *zone;
int nid;

- wakeup_flusher_threads(1024);
+ wakeup_flusher_threads(1024, WB_STAT_FREE_MORE_MEM);
yield();

for_each_online_node(nid) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 62a899e..56331b0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -956,6 +956,7 @@ int bdi_writeback_task(struct bdi_writeback *wb)
long pages_written;

while (!kthread_should_stop()) {
+ bdi_writeback_stat_inc(wb->bdi, WB_STAT_PERIODIC);
pages_written = wb_do_writeback(wb, 0);

if (pages_written)
@@ -969,8 +970,11 @@ int bdi_writeback_task(struct bdi_writeback *wb)
* recreated automatically.
*/
max_idle = max(5UL * 60 * HZ, wait_jiffies);
- if (time_after(jiffies, max_idle + last_active))
+ if (time_after(jiffies, max_idle + last_active)) {
+ bdi_writeback_stat_inc(wb->bdi,
+ WB_STAT_PERIODIC_EXIT);
break;
+ }
}

if (dirty_writeback_interval) {
@@ -994,7 +998,8 @@ int bdi_writeback_task(struct bdi_writeback *wb)
* Schedule writeback for all backing devices. This does WB_SYNC_NONE
* writeback, for integrity writeback see bdi_sync_writeback().
*/
-static void bdi_writeback_all(struct super_block *sb, long nr_pages)
+static void bdi_writeback_all(struct super_block *sb, long nr_pages,
+ enum wb_stats stat)
{
struct wb_writeback_args args = {
.sb = sb,
@@ -1008,7 +1013,7 @@ static void bdi_writeback_all(struct super_block *sb, long nr_pages)
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
if (!bdi_has_dirty_io(bdi))
continue;
-
+ bdi_writeback_stat_inc(bdi, stat);
bdi_alloc_queue_work(bdi, &args);
}

@@ -1019,12 +1024,12 @@ static void bdi_writeback_all(struct super_block *sb, long nr_pages)
* Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back
* the whole world.
*/
-void wakeup_flusher_threads(long nr_pages)
+void wakeup_flusher_threads(long nr_pages, enum wb_stats stat)
{
if (nr_pages == 0)
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
- bdi_writeback_all(NULL, nr_pages);
+ bdi_writeback_all(NULL, nr_pages, stat);
}

static noinline void block_dump___mark_inode_dirty(struct inode *inode)
diff --git a/fs/sync.c b/fs/sync.c
index 15aa6f0..6059e83 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -97,7 +97,7 @@ static void sync_filesystems(int wait)
*/
SYSCALL_DEFINE0(sync)
{
- wakeup_flusher_threads(0);
+ wakeup_flusher_threads(0, WB_STAT_SYNC);
sync_filesystems(0);
sync_filesystems(1);
if (unlikely(laptop_mode))
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index e1fb11c..dfa7c78 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -72,6 +72,7 @@ struct backing_dev_info {
char *name;

struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+ struct writeback_stats *wb_stat;

struct prop_local_percpu completions;
int dirty_exceeded;
@@ -184,6 +185,13 @@ static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
return sum;
}

+static inline void bdi_writeback_stat_inc(struct backing_dev_info *bdi,
+ enum wb_stats stat)
+{
+ if (bdi)
+ writeback_stats_inc(bdi->wb_stat, stat);
+}
+
extern void bdi_writeout_inc(struct backing_dev_info *bdi);

/*
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d63ef8f..f874410 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -22,6 +22,54 @@ enum writeback_sync_modes {
};

/*
+ * why this writeback was initiated
+ */
+enum wb_stats {
+ WB_STAT_BALANCE_DIRTY,
+ WB_STAT_BALANCE_DIRTY_WAIT,
+ WB_STAT_PERIODIC,
+ WB_STAT_PERIODIC_EXIT,
+ WB_STAT_LAPTOP_TIMER,
+ WB_STAT_LAPTOP_BG,
+ WB_STAT_FREE_MORE_MEM,
+ WB_STAT_TRY_TO_FREE_PAGES,
+ WB_STAT_SYNC,
+ WB_STAT_SYNC_FS,
+ WB_STAT_MAX,
+};
+
+struct writeback_stats {
+ u64 stats[WB_STAT_MAX];
+};
+
+extern struct writeback_stats *writeback_sys_stats;
+
+static inline struct writeback_stats *writeback_stats_alloc(void)
+{
+ return alloc_percpu(struct writeback_stats);
+}
+
+static inline void writeback_stats_free(struct writeback_stats *stats)
+{
+ free_percpu(stats);
+}
+
+static inline void writeback_stats_inc(struct writeback_stats *stats, int stat)
+{
+ BUG_ON(stat >= WB_STAT_MAX);
+ preempt_disable();
+ stats = per_cpu_ptr(stats, smp_processor_id());
+ stats->stats[stat]++;
+ if (likely(writeback_sys_stats)) {
+ stats = per_cpu_ptr(writeback_sys_stats, smp_processor_id());
+ stats->stats[stat]++;
+ }
+ preempt_enable();
+}
+
+size_t writeback_stats_print(struct writeback_stats *, char *buf, size_t);
+
+/*
* A control structure which tells the writeback code what to do. These are
* always on the stack, and hence need no locking. They are always initialised
* in a manner such that unspecified fields are set to zero.
@@ -68,7 +116,7 @@ int writeback_inodes_sb_if_idle(struct super_block *);
void sync_inodes_sb(struct super_block *);
void writeback_inodes_wbc(struct writeback_control *wbc);
long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
-void wakeup_flusher_threads(long nr_pages);
+void wakeup_flusher_threads(long nr_pages, enum wb_stats);

/* writeback.h requires fs.h; it, too, is not included from here. */
static inline void wait_on_inode(struct inode *inode)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 9a89296..83efc39 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -142,6 +142,13 @@ static ssize_t writeback_show(struct device *dev, struct device_attribute *attr,
#undef K
}

+static ssize_t writeback_stats_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct backing_dev_info *bdi = dev_get_drvdata(dev);
+ return writeback_stats_print(bdi->wb_stat, buf, PAGE_SIZE);
+}
+
static ssize_t read_ahead_kb_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
@@ -213,6 +220,7 @@ BDI_SHOW(max_ratio, bdi->max_ratio)
static struct device_attribute bdi_dev_attrs[] = {
__ATTR_RO(state),
__ATTR_RO(writeback),
+ __ATTR_RO(writeback_stats),
__ATTR_RW(read_ahead_kb),
__ATTR_RW(min_ratio),
__ATTR_RW(max_ratio),
@@ -673,11 +681,18 @@ int bdi_init(struct backing_dev_info *bdi)
goto err;
}

+ bdi->wb_stat = writeback_stats_alloc();
+ if (bdi->wb_stat == NULL) {
+ err = -ENOMEM;
+ goto err;
+ }
+
bdi->dirty_exceeded = 0;
err = prop_local_init_percpu(&bdi->completions);

if (err) {
err:
+ writeback_stats_free(bdi->wb_stat);
while (i--)
percpu_counter_destroy(&bdi->bdi_stat[i]);
}
@@ -709,6 +724,8 @@ void bdi_destroy(struct backing_dev_info *bdi)
for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
percpu_counter_destroy(&bdi->bdi_stat[i]);

+ writeback_stats_free(bdi->wb_stat);
+
prop_local_destroy_percpu(&bdi->completions);
}
EXPORT_SYMBOL(bdi_destroy);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 8f2ebdb..9ca5377 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -158,8 +158,74 @@ static ssize_t writeback_show(struct kobject *kobj,

KERNEL_ATTR_RO(writeback);

+struct writeback_stats *writeback_sys_stats;
+
+static const char *wb_stats_labels[WB_STAT_MAX] = {
+ [WB_STAT_BALANCE_DIRTY] = "balance dirty pages",
+ [WB_STAT_BALANCE_DIRTY_WAIT] = "balance dirty pages waiting",
+ [WB_STAT_PERIODIC] = "periodic writeback",
+ [WB_STAT_PERIODIC_EXIT] = "periodic writeback exited",
+ [WB_STAT_LAPTOP_TIMER] = "laptop periodic",
+ [WB_STAT_LAPTOP_BG] = "laptop or bg threshold",
+ [WB_STAT_FREE_MORE_MEM] = "free more memory",
+ [WB_STAT_TRY_TO_FREE_PAGES] = "try to free pages",
+ [WB_STAT_SYNC] = "syc_sync",
+ [WB_STAT_SYNC_FS] = "sync filesystem",
+};
+
+static void writeback_stats_collect(struct writeback_stats *src,
+ struct writeback_stats *target)
+{
+ int cpu;
+ for_each_online_cpu(cpu) {
+ int stat;
+ struct writeback_stats *stats = per_cpu_ptr(src, cpu);
+ for (stat = 0; stat < WB_STAT_MAX; stat++)
+ target->stats[stat] += stats->stats[stat];
+ }
+}
+
+static size_t writeback_stats_to_str(struct writeback_stats *stats,
+ char *buf, size_t len)
+{
+ int bufsize = len - 1;
+ int i, printed = 0;
+ for (i = 0; i < WB_STAT_MAX; i++) {
+ const char *label = wb_stats_labels[i];
+ if (label == NULL)
+ continue;
+ printed += snprintf(buf + printed, bufsize - printed,
+ "%-32s %10llu\n", label, stats->stats[i]);
+ if (printed >= bufsize) {
+ buf[len - 1] = '\n';
+ return len;
+ }
+ }
+
+ buf[printed - 1] = '\n';
+ return printed;
+}
+
+size_t writeback_stats_print(struct writeback_stats *stats,
+ char *buf, size_t len)
+{
+ struct writeback_stats total;
+ memset(&total, 0, sizeof(total));
+ writeback_stats_collect(stats, &total);
+ return writeback_stats_to_str(&total, buf, len);
+}
+
+static ssize_t stats_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return writeback_stats_print(writeback_sys_stats, buf, PAGE_SIZE);
+}
+
+KERNEL_ATTR_RO(stats);
+
static struct attribute *writeback_attrs[] = {
&writeback_attr.attr,
+ &stats_attr.attr,
NULL,
};

@@ -177,11 +243,19 @@ static int mm_sysfs_writeback_init(void)
return -ENOMEM;

error = sysfs_create_group(writeback_kobj, &writeback_attr_group);
- if (error) {
- kobject_put(mm_kobj);
- return -ENOMEM;
- }
+ if (error)
+ goto err;
+
+ writeback_sys_stats = alloc_percpu(struct writeback_stats);
+ if (writeback_sys_stats == NULL)
+ goto stats_err;
return 0;
+
+stats_err:
+ sysfs_remove_group(writeback_kobj, &writeback_attr_group);
+err:
+ kobject_put(mm_kobj);
+ return -ENOMEM;
}

struct kobject *mm_kobj;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4cea5c0..39d7ff2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -537,6 +537,7 @@ static void balance_dirty_pages(struct address_space *mapping,
* up.
*/
if (bdi_nr_reclaimable > bdi_thresh) {
+ bdi_writeback_stat_inc(bdi, WB_STAT_BALANCE_DIRTY);
writeback_inodes_wbc(&wbc);
pages_written += write_chunk - wbc.nr_to_write;
get_dirty_limits(&background_thresh, &dirty_thresh,
@@ -567,6 +568,7 @@ static void balance_dirty_pages(struct address_space *mapping,
break; /* We've done our duty */

__set_current_state(TASK_INTERRUPTIBLE);
+ bdi_writeback_stat_inc(bdi, WB_STAT_BALANCE_DIRTY_WAIT);
io_schedule_timeout(pause);

/*
@@ -596,8 +598,10 @@ static void balance_dirty_pages(struct address_space *mapping,
if ((laptop_mode && pages_written) ||
(!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
+ global_page_state(NR_UNSTABLE_NFS))
- > background_thresh)))
+ > background_thresh))) {
+ bdi_writeback_stat_inc(bdi, WB_STAT_LAPTOP_BG);
bdi_start_writeback(bdi, NULL, 0);
+ }
}

void set_page_dirty_balance(struct page *page, int page_mkwrite)
@@ -700,14 +704,16 @@ void laptop_mode_timer_fn(unsigned long data)
struct request_queue *q = (struct request_queue *)data;
int nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
-
/*
* We want to write everything out, not just down to the dirty
* threshold
*/

- if (bdi_has_dirty_io(&q->backing_dev_info))
+ if (bdi_has_dirty_io(&q->backing_dev_info)) {
+ writeback_stats_inc(q->backing_dev_info.wb_stat,
+ WB_STAT_LAPTOP_TIMER);
bdi_start_writeback(&q->backing_dev_info, NULL, nr_pages);
+ }
}

/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c7e57c..ba6966e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1838,7 +1838,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
*/
writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
if (total_scanned > writeback_threshold) {
- wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
+ wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
+ WB_STAT_TRY_TO_FREE_PAGES);
sc->may_writepage = 1;
}

--
1.7.0.1

2010-06-19 08:17:06

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: tracking subsystems causing writeback

Michael Rubin <[email protected]> writes:
>
> # cat /sys/block/sda/bdi/writeback_stats
> balance dirty pages 0
> balance dirty pages waiting 0
> periodic writeback 92024
> periodic writeback exited 0
> laptop periodic 0
> laptop or bg threshold 0
> free more memory 0
> try to free pages 271
> syc_sync 6
> sync filesystem 0

That exports a lot of kernel internals in /sys, presumably read by some
applications. What happens with the applications if the kernel internals
ever change? Will the application break?

It would be bad to not be able to change the kernel because of
such an interface.

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-19 10:45:10

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/3] writeback: Creating /sys/kernel/mm/writeback/writeback

On Fri, Jun 18, 2010 at 05:30:13PM -0700, Michael Rubin wrote:
> Adding the /sys/kernel/mm/writeback/writeback file. It contains data
> to help developers and applications gain visibility into writeback
> behaviour.
>
> # cat /sys/kernel/mm/writeback/writeback
> pages_dirtied: 3747
> pages_cleaned: 3618
> dirty_threshold: 816673
> bg_threshold: 408336

I'm fine with exposting this. but the interface is rather awkward.
These kinds of multiple value per file interface require addition
parsing and are a pain to extend. Please do something like

/proc/sys/vm/writeback/

pages_dirtied
pages_cleaned
dirty_threshold
background_threshold

where you can just read the value from the file.

> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index c920164..84b0181 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1598,8 +1598,10 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
> } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
> kunmap_atomic(kaddr, KM_USER0);
>
> - if (!TestSetPageWriteback(clone_page))
> + if (!TestSetPageWriteback(clone_page)) {
> inc_zone_page_state(clone_page, NR_WRITEBACK);
> + inc_zone_page_state(clone_page, NR_PAGES_ENTERED_WRITEBACK);
> + }
> unlock_page(clone_page);

I'm not very happy about having this opencoded in a filesystem.

2010-06-19 17:45:16

by Michael Rubin

[permalink] [raw]
Subject: Re: [PATCH 1/3] writeback: Creating /sys/kernel/mm/writeback/writeback

Thanks for looking at this.

On Sat, Jun 19, 2010 at 3:44 AM, Christoph Hellwig <[email protected]> wrote:
> I'm fine with exposting this. but the interface is rather awkward.
> These kinds of multiple value per file interface require addition
> parsing and are a pain to extend. ?Please do something like
>
> /proc/sys/vm/writeback/
>
> ? ? ? ? ? ? ? ? ? ? ? ?pages_dirtied
> ? ? ? ? ? ? ? ? ? ? ? ?pages_cleaned
> ? ? ? ? ? ? ? ? ? ? ? ?dirty_threshold
> ? ? ? ? ? ? ? ? ? ? ? ?background_threshold
>
> where you can just read the value from the file.

Cool. This is kind of funny. In the google tree I implemented this in
the same multi-file-one-value-in-file manner. The debate on one file
for all vs that style was heated. And I changed it before sending
upstream. I really don't care either way. So I will just change the
patch and move the values to that location

Do you mind explaining why something would go in /proc/ vs /sys? I
thought the idea was to not put things in /proc anymore.

>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index c920164..84b0181 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -1598,8 +1598,10 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>> ? ? ? } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>> ? ? ? kunmap_atomic(kaddr, KM_USER0);
>>
>> - ? ? if (!TestSetPageWriteback(clone_page))
>> + ? ? if (!TestSetPageWriteback(clone_page)) {
>> ? ? ? ? ? ? ? inc_zone_page_state(clone_page, NR_WRITEBACK);
>> + ? ? ? ? ? ? inc_zone_page_state(clone_page, NR_PAGES_ENTERED_WRITEBACK);
>> + ? ? }
>> ? ? ? unlock_page(clone_page);
>
> I'm not very happy about having this opencoded in a filesystem.

I wasn't excited about this section either. What does opencoded mean?
Do you mean it should not be exposed to specific fs code?

mrubin

2010-06-19 17:50:01

by Michael Rubin

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: tracking subsystems causing writeback

Thanks for looking at this.

On Sat, Jun 19, 2010 at 1:17 AM, Andi Kleen <[email protected]> wrote:
> Michael Rubin <[email protected]> writes:
>> ? ? # cat /sys/block/sda/bdi/writeback_stats
>> ? ? balance dirty pages ? ? ? ? ? ? ? ? ? ? ? 0
>> ? ? balance dirty pages waiting ? ? ? ? ? ? ? 0
>> ? ? periodic writeback ? ? ? ? ? ? ? ? ? ?92024
>> ? ? periodic writeback exited ? ? ? ? ? ? ? ? 0
>> ? ? laptop periodic ? ? ? ? ? ? ? ? ? ? ? ? ? 0
>> ? ? laptop or bg threshold ? ? ? ? ? ? ? ? ? ?0
>> ? ? free more memory ? ? ? ? ? ? ? ? ? ? ? ? ?0
>> ? ? try to free pages ? ? ? ? ? ? ? ? ? ? ? 271
>> ? ? syc_sync ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?6
>> ? ? sync filesystem ? ? ? ? ? ? ? ? ? ? ? ? ? 0
>
> That exports a lot of kernel internals in /sys, presumably read by some
> applications. What happens with the applications if the kernel internals
> ever change? ?Will the application break?
>
> It would be bad to not be able to change the kernel because of
> such an interface.

I agree. This would put the kernel in a box a bit. Some of them
(sys_sync, periodic writeback, free_more_memory) I feel are generic
enough concepts that with some rewording of the labels they could be
exposed with no issue. "Balance_dirty_pages" is an example where that
won't work.

Are there alternatives to this? Maybe tracepoints that are compiled to be on?
A CONFIG_WRITEBACK_DEBUG that would expose this file?

Having this set of info readily available and collected makes
debugging a lot easier. But I admit I am not sure the best way to
expose them.

mrubin

2010-06-19 20:23:51

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: tracking subsystems causing writeback

Michael Rubin <[email protected]> writes:
>
> I agree. This would put the kernel in a box a bit. Some of them
> (sys_sync, periodic writeback, free_more_memory) I feel are generic
> enough concepts that with some rewording of the labels they could be
> exposed with no issue. "Balance_dirty_pages" is an example where that
> won't work.

Yes some rewording would be good.

> Are there alternatives to this? Maybe tracepoints that are compiled to be on?
> A CONFIG_WRITEBACK_DEBUG that would expose this file?

The classic way is to put it into debugfs which has a appropiate
disclaimer.

(although I fear we're weaning apps that depend on debugfs too
The growing ftrace user space code seems to all depend on debugfs)

> Having this set of info readily available and collected makes
> debugging a lot easier. But I admit I am not sure the best way to
> expose them.

Maybe we just need a simpler writeback path that is not as complicated
to debug.

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-20 23:11:01

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/3] writeback visibility

On Fri, Jun 18, 2010 at 05:30:12PM -0700, Michael Rubin wrote:
> Debugging writeback issues and tuning an application's writeback activity is
> easier when the activity is visible. With large clusters, classifying
> and root causing writeback problems has been a big headache. This patch
> series contains a series of patches that our team has been using to start
> getting a handle on writeback behaviour. These changes should be helpful
> for single system maintainers also. It's still a big headache.
>
> Once these changes are reviewed I will make sure the Documentation files
> are updated, but I expect some back and forth first.
>
> Michael Rubin (3):
> writeback: Creating /sys/kernel/mm/writeback/writeback
> writeback: per bdi monitoring
> writeback: tracking subsystems causing writeback

I'm not sure we want to export statistics that represent internal
implementation details into a fixed userspace API. Who, other than
developers, are going to understand and be able to make use of this
information?

FWIW, I've got to resend the writeback tracing patches to Jens that I
have that give better visibility into the writeback behaviour.
Perhaps those tracing events are a better basis for tracking down
writeback problems - the bugs I found with the tracing could not
have been found with these statistics...

That's really why I'm asking - if the stats are just there to help
development and debugging, then I think that improving the writeback
tracing is a better approach to improving visibility of writeback
behaviour...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-06-21 17:09:51

by Michael Rubin

[permalink] [raw]
Subject: Re: [PATCH 0/3] writeback visibility

Thanks for looking at this.

On Sun, Jun 20, 2010 at 4:10 PM, Dave Chinner <[email protected]> wrote:
>> Michael Rubin (3):
>> ? writeback: Creating /sys/kernel/mm/writeback/writeback
>> ? writeback: per bdi monitoring
>> ? writeback: tracking subsystems causing writeback
>
> I'm not sure we want to export statistics that represent internal
> implementation details into a fixed userspace API. Who, other than
> developers, are going to understand and be able to make use of this
> information?

I think there are varying degrees of internal exposure on the patches.

>> ? writeback: Creating /sys/kernel/mm/writeback/writeback
This one seems to not expose any new internals. We already expose the
concept of "dirty", "writeback" and thresholds in /proc/meminfo.

>>? writeback: per bdi monitoring

Looking at it again. I think this one is somewhat of a mixed bag.
BDIReclaimable, BdiWriteback, and the dirty thresholds seems safe to
export.While I agree the rest should stay in debugfs. Would that be
amenable?

>> writeback: tracking subsystems causing writeback

I definitely agree that this one is too revealing and needs to be
redone. But I think we might want to add the details for concepts
which we already expose.
The idea of a "periodic writeback" is already exposed in /proc/sys/vm/
and I don't see that changing in the kernel as a method to deal with
buffered IO. Neither will sync. The laptop stuff and the names of
"balance_dirty_pages" are bad, but maybe we can come up with something
more high level. Like "writeback due to low memory"

> FWIW, I've got to resend the writeback tracing patches to Jens that I
> have that give better visibility into the writeback behaviour.
> Perhaps those tracing events are a better basis for tracking down
> writeback problems - the bugs I found with the tracing could not
> have been found with these statistics...

Yeah I have been watching the tracing stuff you have posted and I
think it will help. There were some other trace points I wanted to add
to this patch but was waiting to learn from your submission on the
best way to integrate them.

> That's really why I'm asking - if the stats are just there to help
> development and debugging, then I think that improving the writeback
> tracing is a better approach to improving visibility of writeback
> behaviour...

Maybe I should not have put all these patches in one series. The first
one with the /sys/kernel/vm file is very useful for user space
developers. System Administrators who are trying to classify IO
problems often need to know if the disk is bad or if the buffered data
is not even being written to disk over time.. Also at Google we tend
to run our jobs with very little unused RAM. Pushing things close to
their limits results in many surprises and writeback is often one of
them. Knowing the thresholds and rate of dirty and cleaning of pages
can help systems do the right thing.

mrubin

2010-06-24 00:03:35

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/3] writeback visibility

On Mon, Jun 21, 2010 at 10:09:24AM -0700, Michael Rubin wrote:
> Thanks for looking at this.
>
> On Sun, Jun 20, 2010 at 4:10 PM, Dave Chinner <[email protected]> wrote:
> >> Michael Rubin (3):
> >>   writeback: Creating /sys/kernel/mm/writeback/writeback
> >>   writeback: per bdi monitoring
> >>   writeback: tracking subsystems causing writeback
> >
> > I'm not sure we want to export statistics that represent internal
> > implementation details into a fixed userspace API. Who, other than
> > developers, are going to understand and be able to make use of this
> > information?
>
> I think there are varying degrees of internal exposure on the patches.
>
> >>   writeback: Creating /sys/kernel/mm/writeback/writeback
> This one seems to not expose any new internals. We already expose the
> concept of "dirty", "writeback" and thresholds in /proc/meminfo.

I don't see any probems with these stats - no matter the
implementation, they'll still be relevant.

> >>  writeback: per bdi monitoring
>
> Looking at it again. I think this one is somewhat of a mixed bag.
> BDIReclaimable, BdiWriteback, and the dirty thresholds seems safe to
> export.While I agree the rest should stay in debugfs. Would that be
> amenable?

I'd much prefer all the bdi stats in the one spot. It's hard enough
to find what you're looking for without splitting them into multiple
locations.

The other thing to consider is that tracing requires debugfѕ to be
mounted. Hence most kernels are going to have the debug stats
available, anyway....

> >> writeback: tracking subsystems causing writeback
>
> I definitely agree that this one is too revealing and needs to be
> redone. But I think we might want to add the details for concepts
> which we already expose.
> The idea of a "periodic writeback" is already exposed in /proc/sys/vm/
> and I don't see that changing in the kernel as a method to deal with
> buffered IO. Neither will sync.

I don't see much value in exposing this information outside of
development environments. I think it's much better to add trace
points for events like this so that we do fine-grained analysis of
when the events occur during problematic workloads....

> The laptop stuff and the names of
> "balance_dirty_pages" are bad, but maybe we can come up with something
> more high level. Like "writeback due to low memory"
>
> > FWIW, I've got to resend the writeback tracing patches to Jens that I
> > have that give better visibility into the writeback behaviour.
> > Perhaps those tracing events are a better basis for tracking down
> > writeback problems - the bugs I found with the tracing could not
> > have been found with these statistics...
>
> Yeah I have been watching the tracing stuff you have posted and I
> think it will help. There were some other trace points I wanted to add
> to this patch but was waiting to learn from your submission on the
> best way to integrate them.

I've got more work to do on them first.... :/

> > That's really why I'm asking - if the stats are just there to help
> > development and debugging, then I think that improving the writeback
> > tracing is a better approach to improving visibility of writeback
> > behaviour...
>
> Maybe I should not have put all these patches in one series. The first
> one with the /sys/kernel/vm file is very useful for user space
> developers. System Administrators who are trying to classify IO
> problems often need to know if the disk is bad

These stats aren't the place for observing that a disk is bad ;)

> or if the buffered data
> is not even being written to disk over time..

I think this can be obtained from the existing info in /proc/meminfo.
I'm not saying the new stats aren't necessary, just that we already
have the high level information available for this....

> Also at Google we tend
> to run our jobs with very little unused RAM. Pushing things close to
> their limits results in many surprises and writeback is often one of
> them. Knowing the thresholds and rate of dirty and cleaning of pages
> can help systems do the right thing.

Yes, I hear this all the time from appliance developers that cache
everything they need in userspace - they just want the kernel to
stay out of the way and not use the unused RAM for caching stuff that
doesn't matter to the application. Normally the issue is unbounded
growth of the inode and dentry caches, but I can see how exceeding
writeback limits can be just as much of a problem.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-06-25 07:16:12

by Michael Rubin

[permalink] [raw]
Subject: Re: [PATCH 0/3] writeback visibility

On Wed, Jun 23, 2010 at 5:02 PM, Dave Chinner <[email protected]> wrote:
> I don't see any probems with these stats - no matter the
> implementation, they'll still be relevant.

Cool. I have a new patch I will send out tomorrow for these. They have
been moved to /proc/sys/vm too as Christoph recommended. Makes more
sense too.

> I'd much prefer all the bdi stats in the one spot. It's hard enough
> to find what you're looking for without splitting them into multiple
> locations.

Yeah I hear ya.

> The other thing to consider is that tracing requires debugfѕ to be
> mounted. Hence most kernels are going to have the debug stats
> available, anyway....

This thread has made me reconsider pursuing if there is a way that we
can access debugfs safely in our environment. It would make things a
lot easier.

>> >> writeback: tracking subsystems causing writeback

> I don't see much value in exposing this information outside of
> development environments. I think it's much better to add trace
> points for events like this so that we do fine-grained analysis of
> when the events occur during problematic workloads....

> These stats aren't the place for observing that a disk is bad ;)

They do help grant visibility in the whole stack of behaviour.
Writeback has created a whole lot of confusion and time waste. I do
agree with you that these should be folded into tracing
infrastructure.

> Yes, I hear this all the time from appliance developers that cache
> everything they need in userspace - they just want the kernel to
> stay out of the way and not use the unused RAM for caching stuff that
> doesn't matter to the application. Normally the issue is unbounded
> growth of the inode and dentry caches, but I can see how exceeding
> writeback limits can be just as much of a problem.

You hit the nail on the head. There's nothing like writing back logs
to create latency spikes for direct IO traffic that make folks scratch
their heads. In low memory environments this can get more confusing
for a appliance developers trying to find out what happened after the
fact.

Thanks again. I think (or hope) the next set of patches will be more applicable.

mrubin