2010-04-20 02:42:49

by Dave Chinner

Subject: [PATCH 0/4] writeback: tracing and wbc->nr_to_write fixes

This series contains the initial writeback tracing patches from
Jens, as well as the extensions I added to provide visibility into
writeback control structures as they are used by the writeback code.
The visibility given is sufficient to understand what is happening
in the writeback path - what path is writing data, what path is
blocking on congestion, etc, and to determine the differences in
behaviour for different sync modes and calling contexts. This
tracing really needs to be integrated into mainline so that anyone
can improve the tracing as they use it to track down problems
in our convoluted writeback paths.

The remaining patches are fixes to problems that the new tracing
highlighted.


2010-04-20 02:42:39

by Dave Chinner

Subject: [PATCH 1/4] writeback: initial tracing support

From: Jens Axboe <[email protected]>

Trace queue/sched/exec parts of the writeback loop.

Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
---
fs/fs-writeback.c | 51 +++++-------
include/linux/writeback.h | 27 ++++++
include/trace/events/writeback.h | 171 ++++++++++++++++++++++++++++++++++++++
mm/backing-dev.c | 5 +
4 files changed, 224 insertions(+), 30 deletions(-)
create mode 100644 include/trace/events/writeback.h

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 76fc4d5..3f5f0a5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -25,7 +25,9 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/buffer_head.h>
+#include <linux/ftrace.h>
#include "internal.h"
+#include <trace/events/writeback.h>

#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)

@@ -34,33 +36,6 @@
*/
int nr_pdflush_threads;

-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_args {
- long nr_pages;
- struct super_block *sb;
- enum writeback_sync_modes sync_mode;
- int for_kupdate:1;
- int range_cyclic:1;
- int for_background:1;
-};
-
-/*
- * Work items for the bdi_writeback threads
- */
-struct bdi_work {
- struct list_head list; /* pending work list */
- struct rcu_head rcu_head; /* for RCU free/clear of work */
-
- unsigned long seen; /* threads that have seen this work */
- atomic_t pending; /* number of threads still to do work */
-
- struct wb_writeback_args args; /* writeback arguments */
-
- unsigned long state; /* flag bits, see WS_* */
-};
-
enum {
WS_USED_B = 0,
WS_ONSTACK_B,
@@ -135,6 +110,8 @@ static void wb_work_complete(struct bdi_work *work)

static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
{
+ trace_writeback_clear(work);
+
/*
* The caller has retrieved the work arguments from this work,
* drop our reference. If this is the last ref, delete and free it
@@ -170,12 +147,16 @@ static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
* If the default thread isn't there, make sure we add it. When
* it gets created and wakes up, we'll run this work.
*/
- if (unlikely(list_empty_careful(&bdi->wb_list)))
+ if (unlikely(list_empty_careful(&bdi->wb_list))) {
+ trace_writeback_sched(bdi, work, "default");
wake_up_process(default_backing_dev_info.wb.task);
- else {
+ } else {
struct bdi_writeback *wb = &bdi->wb;
+ struct task_struct *task = wb->task;

- if (wb->task)
+ trace_writeback_sched(bdi, work, task ? "task" : "notask");
+
+ if (task)
wake_up_process(wb->task);
}
}
@@ -202,6 +183,7 @@ static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
work = kmalloc(sizeof(*work), GFP_ATOMIC);
if (work) {
bdi_work_init(work, args);
+ trace_writeback_queue(bdi, args);
bdi_queue_work(bdi, work);
} else {
struct bdi_writeback *wb = &bdi->wb;
@@ -235,6 +217,7 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
bdi_work_init(&work, &args);
work.state |= WS_ONSTACK;

+ trace_writeback_queue(bdi, &args);
bdi_queue_work(bdi, &work);
bdi_wait_on_work_clear(&work);
}
@@ -880,6 +863,8 @@ long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
if (force_wait)
work->args.sync_mode = args.sync_mode = WB_SYNC_ALL;

+ trace_writeback_exec(work);
+
/*
* If this isn't a data integrity operation, just notify
* that we have seen this work and we are now starting it.
@@ -915,9 +900,13 @@ int bdi_writeback_task(struct bdi_writeback *wb)
unsigned long wait_jiffies = -1UL;
long pages_written;

+ trace_writeback_thread_start(1);
+
while (!kthread_should_stop()) {
pages_written = wb_do_writeback(wb, 0);

+ trace_writeback_pages_written(pages_written);
+
if (pages_written)
last_active = jiffies;
else if (wait_jiffies != -1UL) {
@@ -938,6 +927,8 @@ int bdi_writeback_task(struct bdi_writeback *wb)
try_to_freeze();
}

+ trace_writeback_thread_start(0);
+
return 0;
}

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 76e8903..b2d615f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -22,6 +22,33 @@ enum writeback_sync_modes {
};

/*
+ * Passed into wb_writeback(), essentially a subset of writeback_control
+ */
+struct wb_writeback_args {
+ long nr_pages;
+ struct super_block *sb;
+ enum writeback_sync_modes sync_mode;
+ int for_kupdate:1;
+ int range_cyclic:1;
+ int for_background:1;
+};
+
+/*
+ * Work items for the bdi_writeback threads
+ */
+struct bdi_work {
+ struct list_head list; /* pending work list */
+ struct rcu_head rcu_head; /* for RCU free/clear of work */
+
+ unsigned long seen; /* threads that have seen this work */
+ atomic_t pending; /* number of threads still to do work */
+
+ struct wb_writeback_args args; /* writeback arguments */
+
+ unsigned long state; /* flag bits, see WS_* */
+};
+
+/*
* A control structure which tells the writeback code what to do. These are
* always on the stack, and hence need no locking. They are always initialised
* in a manner such that unspecified fields are set to zero.
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
new file mode 100644
index 0000000..df76457
--- /dev/null
+++ b/include/trace/events/writeback.h
@@ -0,0 +1,171 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM writeback
+
+#if !defined(_TRACE_WRITEBACK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_WRITEBACK_H
+
+#include <linux/backing-dev.h>
+#include <linux/writeback.h>
+
+TRACE_EVENT(writeback_queue,
+
+ TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_args *args),
+
+ TP_ARGS(bdi, args),
+
+ TP_STRUCT__entry(
+ __array(char, name, 16)
+ __field(long, nr_pages)
+ __field(int, sb)
+ __field(int, sync_mode)
+ __field(int, for_kupdate)
+ __field(int, range_cyclic)
+ __field(int, for_background)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->name, dev_name(bdi->dev), 16);
+ __entry->nr_pages = args->nr_pages;
+ __entry->sb = !!args->sb;
+ __entry->for_kupdate = args->for_kupdate;
+ __entry->range_cyclic = args->range_cyclic;
+ __entry->for_background = args->for_background;
+ ),
+
+ TP_printk("%s: pages=%ld, sb=%d, kupdate=%d, range_cyclic=%d "
+ "for_background=%d", __entry->name, __entry->nr_pages,
+ __entry->sb, __entry->for_kupdate,
+ __entry->range_cyclic, __entry->for_background)
+);
+
+TRACE_EVENT(writeback_sched,
+
+ TP_PROTO(struct backing_dev_info *bdi, struct bdi_work *work,
+ const char *msg),
+
+ TP_ARGS(bdi, work, msg),
+
+ TP_STRUCT__entry(
+ __array(char, name, 16)
+ __field(unsigned int, work)
+ __array(char, task, 8)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->name, dev_name(bdi->dev), 16);
+ __entry->work = (unsigned long) work & 0xffff;
+ snprintf(__entry->task, 8, "%s", msg);
+ ),
+
+ TP_printk("work=%x, task=%s", __entry->work, __entry->task)
+);
+
+TRACE_EVENT(writeback_exec,
+
+ TP_PROTO(struct bdi_work *work),
+
+ TP_ARGS(work),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, work)
+ __field(long, nr_pages)
+ __field(int, sb)
+ __field(int, sync_mode)
+ __field(int, for_kupdate)
+ __field(int, range_cyclic)
+ __field(int, for_background)
+ ),
+
+ TP_fast_assign(
+ __entry->work = (unsigned long) work & 0xffff;
+ __entry->nr_pages = work->args.nr_pages;
+ __entry->sb = !!work->args.sb;
+ __entry->for_kupdate = work->args.for_kupdate;
+ __entry->range_cyclic = work->args.range_cyclic;
+ __entry->for_background = work->args.for_background;
+
+ ),
+
+ TP_printk("work=%x pages=%ld, sb=%d, kupdate=%d, range_cyclic=%d"
+ " for_background=%d", __entry->work,
+ __entry->nr_pages, __entry->sb, __entry->for_kupdate,
+ __entry->range_cyclic, __entry->for_background)
+);
+
+TRACE_EVENT(writeback_clear,
+
+ TP_PROTO(struct bdi_work *work),
+
+ TP_ARGS(work),
+
+ TP_STRUCT__entry(
+ __field(struct bdi_work *, work)
+ __field(int, refs)
+ ),
+
+ TP_fast_assign(
+ __entry->work = work;
+ __entry->refs = atomic_read(&work->pending);
+ ),
+
+ TP_printk("work=%p, refs=%d", __entry->work, __entry->refs)
+);
+
+TRACE_EVENT(writeback_pages_written,
+
+ TP_PROTO(long pages_written),
+
+ TP_ARGS(pages_written),
+
+ TP_STRUCT__entry(
+ __field(long, pages)
+ ),
+
+ TP_fast_assign(
+ __entry->pages = pages_written;
+ ),
+
+ TP_printk("%ld", __entry->pages)
+);
+
+
+TRACE_EVENT(writeback_thread_start,
+
+ TP_PROTO(int start),
+
+ TP_ARGS(start),
+
+ TP_STRUCT__entry(
+ __field(int, start)
+ ),
+
+ TP_fast_assign(
+ __entry->start = start;
+ ),
+
+ TP_printk("%s", __entry->start ? "started" : "exited")
+);
+
+TRACE_EVENT(writeback_bdi_register,
+
+ TP_PROTO(const char *name, int start),
+
+ TP_ARGS(name, start),
+
+ TP_STRUCT__entry(
+ __array(char, name, 16)
+ __field(int, start)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->name, name, 16);
+ __entry->start = start;
+ ),
+
+ TP_printk("%s: %s", __entry->name,
+ __entry->start ? "registered" : "unregistered")
+);
+#endif /* _TRACE_WRITEBACK_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0e8ca03..2323e92 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,9 @@
#include <linux/writeback.h>
#include <linux/device.h>

+#define CREATE_TRACE_POINTS
+#include <trace/events/writeback.h>
+
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
}
@@ -570,6 +573,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,

bdi_debug_register(bdi, dev_name(dev));
set_bit(BDI_registered, &bdi->state);
+ trace_writeback_bdi_register(dev_name(dev), 1);
exit:
return ret;
}
@@ -632,6 +636,7 @@ static void bdi_prune_sb(struct backing_dev_info *bdi)
void bdi_unregister(struct backing_dev_info *bdi)
{
if (bdi->dev) {
+ trace_writeback_bdi_register(dev_name(bdi->dev), 0);
bdi_prune_sb(bdi);

if (!bdi_cap_flush_forker(bdi))
--
1.6.5

2010-04-20 02:42:45

by Dave Chinner

Subject: [PATCH 4/4] xfs: remove nr_to_write writeback windup.

From: Dave Chinner <[email protected]>

Now that the background flush code has been fixed, we shouldn't need to
silently multiply the wbc->nr_to_write to get good writeback. Remove
that code.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/linux-2.6/xfs_aops.c | 8 --------
1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 9962850..2b2225d 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1336,14 +1336,6 @@ xfs_vm_writepage(
if (!page_has_buffers(page))
create_empty_buffers(page, 1 << inode->i_blkbits, 0);

-
- /*
- * VM calculation for nr_to_write seems off. Bump it way
- * up, this gets simple streaming writes zippy again.
- * To be reviewed again after Jens' writeback changes.
- */
- wbc->nr_to_write *= 4;
-
/*
* Convert delayed allocate, unwritten or unmapped space
* to real space and flush out to disk.
--
1.6.5

2010-04-20 02:42:58

by Dave Chinner

Subject: [PATCH 2/4] writeback: Add tracing to balance_dirty_pages

From: Dave Chinner <[email protected]>

Tracing high level background writeback events is good, but it doesn't
give the entire picture. Add IO dispatched by foreground throttling to the
writeback events.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/fs-writeback.c | 5 ++
include/trace/events/writeback.h | 77 ++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 4 ++
3 files changed, 86 insertions(+), 0 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3f5f0a5..5214b61 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -752,7 +752,11 @@ static long wb_writeback(struct bdi_writeback *wb,
wbc.more_io = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
+
+ trace_wbc_writeback_start(&wbc);
writeback_inodes_wb(wb, &wbc);
+ trace_wbc_writeback_written(&wbc);
+
args->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;

@@ -780,6 +784,7 @@ static long wb_writeback(struct bdi_writeback *wb,
if (!list_empty(&wb->b_more_io)) {
inode = list_entry(wb->b_more_io.prev,
struct inode, i_list);
+ trace_wbc_writeback_wait(&wbc);
inode_wait_for_writeback(inode);
}
spin_unlock(&inode_lock);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index df76457..02f34a5 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -165,6 +165,83 @@ TRACE_EVENT(writeback_bdi_register,
TP_printk("%s: %s", __entry->name,
__entry->start ? "registered" : "unregistered")
);
+
+/* pass flags explicitly */
+DECLARE_EVENT_CLASS(wbc_class,
+ TP_PROTO(struct writeback_control *wbc),
+ TP_ARGS(wbc),
+ TP_STRUCT__entry(
+ __field(unsigned int, wbc)
+ __array(char, name, 16)
+ __field(long, nr_to_write)
+ __field(long, pages_skipped)
+ __field(int, sb)
+ __field(int, sync_mode)
+ __field(int, nonblocking)
+ __field(int, encountered_congestion)
+ __field(int, for_kupdate)
+ __field(int, for_background)
+ __field(int, for_reclaim)
+ __field(int, range_cyclic)
+ __field(int, more_io)
+ __field(unsigned long, older_than_this)
+ __field(long, range_start)
+ __field(long, range_end)
+ ),
+
+ TP_fast_assign(
+ char *__name = "(none)";
+
+ __entry->wbc = (unsigned long)wbc & 0xffff;
+ if (wbc->bdi)
+ strncpy(__entry->name, dev_name(wbc->bdi->dev), 16);
+ else
+ strncpy(__entry->name, __name, 16);
+ __entry->nr_to_write = wbc->nr_to_write;
+ __entry->pages_skipped = wbc->pages_skipped;
+ __entry->sb = !!wbc->sb;
+ __entry->sync_mode = wbc->sync_mode;
+ __entry->for_kupdate = wbc->for_kupdate;
+ __entry->for_background = wbc->for_background;
+ __entry->for_reclaim = wbc->for_reclaim;
+ __entry->range_cyclic = wbc->range_cyclic;
+ __entry->more_io = wbc->more_io;
+ __entry->older_than_this = wbc->older_than_this ?
+ *wbc->older_than_this : 0;
+ __entry->range_start = (long)wbc->range_start;
+ __entry->range_end = (long)wbc->range_end;
+ ),
+
+ TP_printk("dev %s wbc=%x towrt=%ld skip=%ld sb=%d mode=%d kupd=%d "
+ "bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+ "start=0x%lx end=0x%lx",
+ __entry->name,
+ __entry->wbc,
+ __entry->nr_to_write,
+ __entry->pages_skipped,
+ __entry->sb,
+ __entry->sync_mode,
+ __entry->for_kupdate,
+ __entry->for_background,
+ __entry->for_reclaim,
+ __entry->range_cyclic,
+ __entry->more_io,
+ __entry->older_than_this,
+ __entry->range_start,
+ __entry->range_end)
+);
+
+#define DEFINE_WBC_EVENT(name) \
+DEFINE_EVENT(wbc_class, name, \
+ TP_PROTO(struct writeback_control *wbc), \
+ TP_ARGS(wbc))
+DEFINE_WBC_EVENT(wbc_writeback_start);
+DEFINE_WBC_EVENT(wbc_writeback_written);
+DEFINE_WBC_EVENT(wbc_writeback_wait);
+DEFINE_WBC_EVENT(wbc_balance_dirty_start);
+DEFINE_WBC_EVENT(wbc_balance_dirty_written);
+DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
+
#endif /* _TRACE_WRITEBACK_H */

/* This part must be outside protection */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b19943..d45f59e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -34,6 +34,7 @@
#include <linux/syscalls.h>
#include <linux/buffer_head.h>
#include <linux/pagevec.h>
+#include <trace/events/writeback.h>

/*
* After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
@@ -536,11 +537,13 @@ static void balance_dirty_pages(struct address_space *mapping,
* threshold otherwise wait until the disk writes catch
* up.
*/
+ trace_wbc_balance_dirty_start(&wbc);
if (bdi_nr_reclaimable > bdi_thresh) {
writeback_inodes_wbc(&wbc);
pages_written += write_chunk - wbc.nr_to_write;
get_dirty_limits(&background_thresh, &dirty_thresh,
&bdi_thresh, bdi);
+ trace_wbc_balance_dirty_written(&wbc);
}

/*
@@ -566,6 +569,7 @@ static void balance_dirty_pages(struct address_space *mapping,
if (pages_written >= write_chunk)
break; /* We've done our duty */

+ trace_wbc_balance_dirty_wait(&wbc);
__set_current_state(TASK_INTERRUPTIBLE);
io_schedule_timeout(pause);

--
1.6.5

2010-04-20 02:43:32

by Dave Chinner

Subject: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages

From: Dave Chinner <[email protected]>

If a filesystem writes more than one page in ->writepage, write_cache_pages
fails to notice this and continues to attempt writeback when wbc->nr_to_write
has gone negative - this trace was captured from XFS:


wbc_writeback_start: towrt=1024
wbc_writepage: towrt=1024
wbc_writepage: towrt=0
wbc_writepage: towrt=-1
wbc_writepage: towrt=-5
wbc_writepage: towrt=-21
wbc_writepage: towrt=-85

This has adverse effects on filesystem writeback behaviour. write_cache_pages()
needs to terminate after a certain number of pages are written, not after a
certain number of calls to ->writepage are made. Make it observe the current
value of wbc->nr_to_write and treat a value of <= 0 as either a
termination condition or a trigger to reset to MAX_WRITEBACK_PAGES for data
integrity syncs.
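The failure mode and the fix can be sketched as a small userspace model
(a simulation of the accounting only, not the kernel code; the function
name and the clustering parameters are inventions of this example):

```c
#include <assert.h>

/*
 * Each ->writepage call writes 'cluster' pages, as XFS does when it
 * clusters adjacent dirty pages, but the loop's local counter is
 * charged only one page per call.
 */
static long pages_written(long budget, long cluster, long dirty, int fixed)
{
    long calls_left = budget;   /* local nr_to_write: -1 per call */
    long wbc_nr = budget;       /* wbc->nr_to_write: -n per call */
    long written = 0;

    while (dirty > 0 && calls_left > 0) {
        long n = cluster < dirty ? cluster : dirty;

        wbc_nr -= n;            /* can go far negative, unnoticed */
        written += n;
        dirty -= n;
        calls_left--;

        /*
         * The fix: also terminate once wbc->nr_to_write itself is
         * exhausted (the WB_SYNC_NONE case; an integrity sync would
         * instead reset it to MAX_WRITEBACK_PAGES and continue).
         */
        if (fixed && wbc_nr <= 0)
            break;
    }
    return written;
}
```

With a budget of 1024 and 64-page clusters, the unfixed loop writes 64x the
intended number of pages before its call counter runs out, matching the
negative towrt values in the trace above; the fixed loop stops at the budget.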

Signed-off-by: Dave Chinner <[email protected]>
---
fs/fs-writeback.c | 9 ---------
include/linux/writeback.h | 9 +++++++++
include/trace/events/writeback.h | 1 +
mm/page-writeback.c | 20 +++++++++++++++++++-
4 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5214b61..d8271d5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -675,15 +675,6 @@ void writeback_inodes_wbc(struct writeback_control *wbc)
writeback_inodes_wb(&bdi->wb, wbc);
}

-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation. We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode. Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES 1024
-
static inline bool over_bground_thresh(void)
{
unsigned long background_thresh, dirty_thresh;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index b2d615f..8533a0f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -14,6 +14,15 @@ extern struct list_head inode_in_use;
extern struct list_head inode_unused;

/*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation. We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode. Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES 1024
+
+/*
* fs/fs-writeback.c
*/
enum writeback_sync_modes {
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 02f34a5..3bcbd83 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -241,6 +241,7 @@ DEFINE_WBC_EVENT(wbc_writeback_wait);
DEFINE_WBC_EVENT(wbc_balance_dirty_start);
DEFINE_WBC_EVENT(wbc_balance_dirty_written);
DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
+DEFINE_WBC_EVENT(wbc_writepage);

#endif /* _TRACE_WRITEBACK_H */

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d45f59e..e22af84 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -917,6 +917,7 @@ continue_unlock:
if (!clear_page_dirty_for_io(page))
goto continue_unlock;

+ trace_wbc_writepage(wbc);
ret = (*writepage)(page, wbc, data);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
@@ -935,7 +936,7 @@ continue_unlock:
done = 1;
break;
}
- }
+ }

if (nr_to_write > 0) {
nr_to_write--;
@@ -955,6 +956,23 @@ continue_unlock:
break;
}
}
+
+ /*
+ * Some filesystems will write multiple pages in
+ * ->writepage, so wbc->nr_to_write can change much,
+ * much faster than nr_to_write. Check this as an exit
+ * condition, or if we are doing a data integrity sync,
+ * reset the wbc to MAX_WRITEBACK_PAGES so that such
+ * filesystems can do optimal writeout here.
+ */
+ if (wbc->nr_to_write <= 0) {
+ if (wbc->sync_mode == WB_SYNC_NONE) {
+ done = 1;
+ nr_to_write = 0;
+ break;
+ }
+ wbc->nr_to_write = MAX_WRITEBACK_PAGES;
+ }
}
pagevec_release(&pvec);
cond_resched();
--
1.6.5

2010-04-20 03:40:16

by Dave Chinner

Subject: [PATCH 5/4] writeback: limit write_cache_pages integrity scanning to current EOF


sync can currently take a really long time if a concurrent writer is
extending a file. The problem is that the dirty pages on the address
space grow in the same direction as write_cache_pages scans, so if
the writer keeps ahead of writeback, the writeback will not
terminate until the writer stops adding dirty pages.

For a data integrity sync, we only need to write the pages dirty at
the time we start the writeback, so we can stop scanning once we get
to the page that was at the end of the file at the time the scan
started.

This will prevent operations like copying a large file from stalling
sync indefinitely, as it will not write back pages that were
dirtied after the sync was started. This does not impact the
existing integrity guarantees, as any dirty page (old or new)
within the EOF range at the start of the scan will still be
captured.
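The race described above can be sketched with a small userspace model (an
illustration under assumed behaviour, not the kernel scan; the function
names and the one-page-per-step model are inventions of this example):

```c
#include <assert.h>

/*
 * Old behaviour: chase the live EOF. The writer dirties 'append' new
 * pages for every page the sync writes back, so with append >= 1 the
 * scan never catches up; max_steps bounds the model.
 */
static long chasing_scan(unsigned long eof_at_start, unsigned long append,
                         long max_steps)
{
    unsigned long index = 0, live_eof = eof_at_start;
    long written = 0;

    while (index <= live_eof && written < max_steps) {
        index++;
        written++;
        live_eof += append;     /* concurrent extender stays ahead */
    }
    return written;
}

/*
 * Fixed behaviour: cap the scan at the page index covering EOF at the
 * time the sync started; later extensions are ignored.
 */
static long capped_scan(unsigned long eof_at_start)
{
    unsigned long index = 0;
    long written = 0;

    while (index <= eof_at_start) {
        index++;
        written++;
    }
    return written;
}
```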

Signed-off-by: Dave Chinner <[email protected]>
---
mm/page-writeback.c | 15 +++++++++++++++
1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e22af84..4ba2728 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -852,7 +852,22 @@ int write_cache_pages(struct address_space *mapping,
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
cycled = 1; /* ignore range_cyclic tests */
+
+ /*
+ * If this is a data integrity sync, cap the writeback to the
+ * current end of file. Any extension to the file that occurs
+ * after this is a new write and we don't need to write those
+ * pages out to fulfil our data integrity requirements. If we
+ * try to write them out, we can get stuck in this scan until
+ * the concurrent writer stops adding dirty pages and extending
+ * EOF.
+ */
+ if (wbc->sync_mode == WB_SYNC_ALL &&
+ wbc->range_end == LLONG_MAX) {
+ end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
+ }
}
+
retry:
done_index = index;
while (!done && (index <= end)) {

2010-04-20 13:20:26

by Richard Kennedy

Subject: Re: [PATCH 0/4] writeback: tracing and wbc->nr_to_write fixes

On 20/04/10 03:41, Dave Chinner wrote:
> This series contains the initial writeback tracing patches from
> Jens, as well as the extensions I added to provide visibility into
> writeback control structures as they are used by the writeback code.
> The visibility given is sufficient to understand what is happening
> in the writeback path - what path is writing data, what path is
> blocking on congestion, etc, and to determine the differences in
> behaviour for different sync modes and calling contexts. This
> tracing really needs to be integrated into mainline so that anyone
> can improve the tracing as they use it to track down problems
> in our convoluted writeback paths.
>
> The remaining patches are fixes to problems that the new tracing
> highlighted.
>

Hi Dave,

Thanks for adding tracing to this, it will be really useful.

The fix to write_cache_pages looks really interesting, I'm going to test
it on my machine. Maybe it should be a separate patch to get more
visibility?

Ext4 also multiplies nr_to_write, so will that need fixing too?

regards
Richard

2010-04-20 23:28:23

by Jamie Lokier

Subject: Re: [PATCH 5/4] writeback: limit write_cache_pages integrity scanning to current EOF

Dave Chinner wrote:
> sync can currently take a really long time if a concurrent writer is
> extending a file. The problem is that the dirty pages on the address
> space grow in the same direction as write_cache_pages scans, so if
> the writer keeps ahead of writeback, the writeback will not
> terminate until the writer stops adding dirty pages.
>
> For a data integrity sync, we only need to write the pages dirty at
> the time we start the writeback, so we can stop scanning once we get
> to the page that was at the end of the file at the time the scan
> started.
>
> This will prevent operations like copying a large file from stalling
> sync indefinitely, as it will not write back pages that were
> dirtied after the sync was started. This does not impact the
> existing integrity guarantees, as any dirty page (old or new)
> within the EOF range at the start of the scan will still be
> captured.

I guess it can still get stuck if someone does ftruncate() first, then
writes to the hole?

-- Jamie

2010-04-20 23:29:15

by Dave Chinner

Subject: Re: [PATCH 0/4] writeback: tracing and wbc->nr_to_write fixes

On Tue, Apr 20, 2010 at 01:02:16PM +0100, Richard Kennedy wrote:
> On 20/04/10 03:41, Dave Chinner wrote:
> > This series contains the initial writeback tracing patches from
> > Jens, as well as the extensions I added to provide visibility into
> > writeback control structures as they are used by the writeback code.
> > The visibility given is sufficient to understand what is happening
> > in the writeback path - what path is writing data, what path is
> > blocking on congestion, etc, and to determine the differences in
> > behaviour for different sync modes and calling contexts. This
> > tracing really needs to be integrated into mainline so that anyone
> > can improve the tracing as they use it to track down problems
> > in our convoluted writeback paths.
> >
> > The remaining patches are fixes to problems that the new tracing
> > highlighted.
>
> Hi Dave,
>
> Thanks for adding tracing to this, it will be really useful.
>
> The fix to write_cache_pages looks really interesting, I'm going to test
> it on my machine. Maybe it should be a separate patch to get more
> visibility?

I don't see a big need to separate the series at this point. Once
there's been a review and testing we can decide how to push them
into mainline. IMO, the tracing is just as important as the bug
fixes....

> Ext4 also multiplies nr_to_write, so will that need fixing too?

No idea. I don't claim to understand ext4's convoluted delayed
allocation path and all its constraints, so I guess you'd need to
ask the ext4 developers about that one. After all, with the tracing
they'd be able to see if there is a problem. ;)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-20 23:32:09

by Dave Chinner

Subject: Re: [PATCH 5/4] writeback: limit write_cache_pages integrity scanning to current EOF

On Wed, Apr 21, 2010 at 12:28:19AM +0100, Jamie Lokier wrote:
> Dave Chinner wrote:
> > sync can currently take a really long time if a concurrent writer is
> > extending a file. The problem is that the dirty pages on the address
> > space grow in the same direction as write_cache_pages scans, so if
> > the writer keeps ahead of writeback, the writeback will not
> > terminate until the writer stops adding dirty pages.
> >
> > For a data integrity sync, we only need to write the pages dirty at
> > the time we start the writeback, so we can stop scanning once we get
> > to the page that was at the end of the file at the time the scan
> > started.
> >
> > This will prevent operations like copying a large file from stalling
> > sync indefinitely, as it will not write back pages that were
> > dirtied after the sync was started. This does not impact the
> > existing integrity guarantees, as any dirty page (old or new)
> > within the EOF range at the start of the scan will still be
> > captured.
>
> I guess it can still get stuck if someone does ftruncate() first, then
> writes to the hole?

Yes, it would. It only deals with extending files because fixing the
problem w.r.t. writes into holes requires something much more
invasive like Jan's radix tree mark-and-sweep algorithm....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-22 19:07:23

by Jan Kara

Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Tue 20-04-10 12:41:53, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> If a filesystem writes more than one page in ->writepage, write_cache_pages
> fails to notice this and continues to attempt writeback when wbc->nr_to_write
> has gone negative - this trace was captured from XFS:
>
>
> wbc_writeback_start: towrt=1024
> wbc_writepage: towrt=1024
> wbc_writepage: towrt=0
> wbc_writepage: towrt=-1
> wbc_writepage: towrt=-5
> wbc_writepage: towrt=-21
> wbc_writepage: towrt=-85
>
> This has adverse effects on filesystem writeback behaviour. write_cache_pages()
> needs to terminate after a certain number of pages are written, not after a
> certain number of calls to ->writepage are made. Make it observe the current
> value of wbc->nr_to_write and treat a value of <= 0 as either a
> termination condition or a trigger to reset to MAX_WRITEBACK_PAGES for data
> integrity syncs.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> fs/fs-writeback.c | 9 ---------
> include/linux/writeback.h | 9 +++++++++
> include/trace/events/writeback.h | 1 +
> mm/page-writeback.c | 20 +++++++++++++++++++-
> 4 files changed, 29 insertions(+), 10 deletions(-)
>
<snip>

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d45f59e..e22af84 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -917,6 +917,7 @@ continue_unlock:
> if (!clear_page_dirty_for_io(page))
> goto continue_unlock;
>
> + trace_wbc_writepage(wbc);
> ret = (*writepage)(page, wbc, data);
> if (unlikely(ret)) {
> if (ret == AOP_WRITEPAGE_ACTIVATE) {
> @@ -935,7 +936,7 @@ continue_unlock:
> done = 1;
> break;
> }
> - }
> + }
>
> if (nr_to_write > 0) {
> nr_to_write--;
> @@ -955,6 +956,23 @@ continue_unlock:
> break;
> }
> }
> +
> + /*
> + * Some filesystems will write multiple pages in
> + * ->writepage, so wbc->nr_to_write can change much,
> + * much faster than nr_to_write. Check this as an exit
> + * condition, or if we are doing a data integrity sync,
> + * reset the wbc to MAX_WRITEBACK_PAGES so that such
> + * filesystems can do optimal writeout here.
> + */
> + if (wbc->nr_to_write <= 0) {
> + if (wbc->sync_mode == WB_SYNC_NONE) {
> + done = 1;
> + nr_to_write = 0;
> + break;
> + }
> + wbc->nr_to_write = MAX_WRITEBACK_PAGES;
> + }
Honestly, this is an ugly hack. I'd rather work towards ignoring
nr_to_write completely in WB_SYNC_ALL mode since it doesn't really make
any sense to say "write me *safely* 5 pages".

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-04-22 19:09:36

by Jan Kara

Subject: Re: [PATCH 4/4] xfs: remove nr_to_write writeback windup.

On Tue 20-04-10 12:41:54, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> Now that the background flush code has been fixed, we shouldn't need to
> silently multiply the wbc->nr_to_write to get good writeback. Remove
> that code.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> fs/xfs/linux-2.6/xfs_aops.c | 8 --------
> 1 files changed, 0 insertions(+), 8 deletions(-)
>
> diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
> index 9962850..2b2225d 100644
> --- a/fs/xfs/linux-2.6/xfs_aops.c
> +++ b/fs/xfs/linux-2.6/xfs_aops.c
> @@ -1336,14 +1336,6 @@ xfs_vm_writepage(
> if (!page_has_buffers(page))
> create_empty_buffers(page, 1 << inode->i_blkbits, 0);
>
> -
> - /*
> - * VM calculation for nr_to_write seems off. Bump it way
> - * up, this gets simple streaming writes zippy again.
> - * To be reviewed again after Jens' writeback changes.
> - */
> - wbc->nr_to_write *= 4;
> -
Hum, are you sure about this? I thought it's there because VM passes at
most 1024 pages to write from background writeback and you wanted to write
more in one go (at least ext4 wants to do this).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-04-22 19:13:10

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/4] writeback: limit write_cache_pages integrity scanning to current EOF

On Tue 20-04-10 13:40:05, Dave Chinner wrote:
>
> sync can currently take a really long time if a concurrent writer is
> extending a file. The problem is that the dirty pages on the address
> space grow in the same direction as write_cache_pages scans, so if
> the writer keeps ahead of writeback, the writeback will not
> terminate until the writer stops adding dirty pages.
>
> For a data integrity sync, we only need to write the pages dirty at
> the time we start the writeback, so we can stop scanning once we get
> to the page that was at the end of the file at the time the scan
> started.
>
> This will prevent operations like copying a large file from preventing
> sync from completing, as sync will not write back pages that were
> dirtied after it was started. This does not impact the
> existing integrity guarantees, as any dirty page (old or new)
> within the EOF range at the start of the scan will still be
> captured.
Looks good.

Acked-by: Jan Kara <[email protected]>

> Signed-off-by: Dave Chinner <[email protected]>
> ---
> mm/page-writeback.c | 15 +++++++++++++++
> 1 files changed, 15 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e22af84..4ba2728 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -852,7 +852,22 @@ int write_cache_pages(struct address_space *mapping,
> if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
> range_whole = 1;
> cycled = 1; /* ignore range_cyclic tests */
> +
> + /*
> + * If this is a data integrity sync, cap the writeback to the
> + * current end of file. Any extension to the file that occurs
> + * after this is a new write and we don't need to write those
> + * pages out to fulfil our data integrity requirements. If we
> + * try to write them out, we can get stuck in this scan until
> + * the concurrent writer stops adding dirty pages and extending
> + * EOF.
> + */
> + if (wbc->sync_mode == WB_SYNC_ALL &&
> + wbc->range_end == LLONG_MAX) {
> + end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
> + }
> }
> +
> retry:
> done_index = index;
> while (!done && (index <= end)) {
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-04-25 03:33:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Tue, Apr 20, 2010 at 12:41:53PM +1000, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> If a filesystem writes more than one page in ->writepage, write_cache_pages
> fails to notice this and continues to attempt writeback when wbc->nr_to_write
> has gone negative - this trace was captured from XFS:
>
>
> wbc_writeback_start: towrt=1024
> wbc_writepage: towrt=1024
> wbc_writepage: towrt=0
> wbc_writepage: towrt=-1
> wbc_writepage: towrt=-5
> wbc_writepage: towrt=-21
> wbc_writepage: towrt=-85
>
> This has adverse effects on filesystem writeback behaviour. write_cache_pages()
> needs to terminate after a certain number of pages are written, not after a
> certain number of calls to ->writepage are made. Make it observe the current
> value of wbc->nr_to_write and treat a value of <= 0 as either a
> termination condition or a trigger to reset to MAX_WRITEBACK_PAGES for data
> integrity syncs.

Be careful here. If you are going to write more pages than what the
writeback code has requested (the stupid no more than 1024 pages
restriction in the writeback code before it jumps to start writing
some other inode), you actually need to let the returned
wbc->nr_to_write go negative, so that wb_writeback() knows how many
pages it has written.

In other words, the writeback code assumes that

<original value of nr_to_write> - <returned wbc->nr_to_write>

is

<number of pages actually written>

If you don't let wbc->nr_to_write go negative, the writeback code will
be confused about how many pages were _actually_ written, and the
writeback code ends up writing too much. See commit 2faf2e1.

All of this is a crock of course. The file system shouldn't be
second-guessing the writeback code. Instead the writeback code should
be adaptively measuring how long it takes to write out N pages
to a particular block device, and then decide what's the appropriate
setting for nr_to_write. What makes sense for a USB stick, or a 4200
RPM laptop drive, may not make sense for a massive RAID array....

But since we don't have that, both XFS and ext4 have workarounds for
brain-damaged writeback behaviour. (I did some testing, and even for
standard laptop drives the cap of 1024 pages is just Way Too Small;
that limit was set something like a decade ago, and everyone has been
afraid to change it, even though disks have gotten a wee bit faster
since those days.)

- Ted

2010-04-26 00:46:55

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 4/4] xfs: remove nr_to_write writeback windup.

On Thu, Apr 22, 2010 at 09:09:37PM +0200, Jan Kara wrote:
> On Tue 20-04-10 12:41:54, Dave Chinner wrote:
> > From: Dave Chinner <[email protected]>
> >
> > Now that the background flush code has been fixed, we shouldn't need to
> > silently multiply the wbc->nr_to_write to get good writeback. Remove
> > that code.
> >
> > Signed-off-by: Dave Chinner <[email protected]>
> > ---
> > fs/xfs/linux-2.6/xfs_aops.c | 8 --------
> > 1 files changed, 0 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
> > index 9962850..2b2225d 100644
> > --- a/fs/xfs/linux-2.6/xfs_aops.c
> > +++ b/fs/xfs/linux-2.6/xfs_aops.c
> > @@ -1336,14 +1336,6 @@ xfs_vm_writepage(
> > if (!page_has_buffers(page))
> > create_empty_buffers(page, 1 << inode->i_blkbits, 0);
> >
> > -
> > - /*
> > - * VM calculation for nr_to_write seems off. Bump it way
> > - * up, this gets simple streaming writes zippy again.
> > - * To be reviewed again after Jens' writeback changes.
> > - */
> > - wbc->nr_to_write *= 4;
> > -
> Hum, are you sure about this? I thought it's there because VM passes at
> most 1024 pages to write from background writeback and you wanted to write
> more in one go (at least ext4 wants to do this).

About 500MB/s sure. ;)

Seriously though, the problem that led us to add this
multiplication was that writeback was not feeding XFS 1024 pages at
a time - we were getting much less than that (somewhere in the order
of 32-64 pages at a time). With the fixes I posted, in every
circumstance I can see, the correct number of pages (1024
pages or whatever is left over from the last inode) is being passed into
->writepages, and writeback is back to full speed without needing
this crutch. Indeed, this multiplication now causes nr_to_write to
go ballistic in some circumstances, and that causes latency and
fairness problems that will significantly reduce write rates for
applications like NFS servers.

Realistically, XFS doesn't need to write more than 1024 pages in one
go - the reason ext4 needs to do this is its amazingly convoluted
delayed allocation path and the fact that its allocator is nowhere
near as good at contiguous allocation across multiple invocations as
the XFS allocator is. IOWs, XFS really just needs enough contiguous
pages to be able to form large IOs, and given that most hardware
limits the IO size to 1MB on x86_64, then 1024 pages is more than
enough to provide this.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-26 01:49:18

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Sat, Apr 24, 2010 at 11:33:15PM -0400, [email protected] wrote:
> On Tue, Apr 20, 2010 at 12:41:53PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <[email protected]>
> >
> > If a filesystem writes more than one page in ->writepage, write_cache_pages
> > fails to notice this and continues to attempt writeback when wbc->nr_to_write
> > has gone negative - this trace was captured from XFS:
> >
> >
> > wbc_writeback_start: towrt=1024
> > wbc_writepage: towrt=1024
> > wbc_writepage: towrt=0
> > wbc_writepage: towrt=-1
> > wbc_writepage: towrt=-5
> > wbc_writepage: towrt=-21
> > wbc_writepage: towrt=-85
> >
> > This has adverse effects on filesystem writeback behaviour. write_cache_pages()
> > needs to terminate after a certain number of pages are written, not after a
> > certain number of calls to ->writepage are made. Make it observe the current
> > value of wbc->nr_to_write and treat a value of <= 0 as either a
> > termination condition or a trigger to reset to MAX_WRITEBACK_PAGES for data
> > integrity syncs.
>
> Be careful here. If you are going to write more pages than what the
> writeback code has requested (the stupid no more than 1024 pages
> restriction in the writeback code before it jumps to start writing
> some other inode), you actually need to let the returned
> wbc->nr_to_write go negative, so that wb_writeback() knows how many
> pages it has written.
>
> In other words, the writeback code assumes that
>
> <original value of nr_to_write> - <returned wbc->nr_to_write>
>
> is
>
> <number of pages actually written>

Yes, but that does not require a negative value to get right. None
of the code relies on negative nr_to_write values to do anything
correctly, and all the termination checks are for wbc->nr_to_write
<= 0. And the tracing shows it behaves correctly when
wbc->nr_to_write = 0 on return. Requiring a negative number is not
documented in any of the comments, write_cache_pages() does not
return a negative number, etc, so I can't see why you think this is
necessary....

> If you don't let wbc->nr_to_write go negative, the writeback code will
> be confused about how many pages were _actually_ written, and the
> writeback code ends up writing too much. See commit 2faf2e1.

ext4 added a "bump" to wbc->nr_to_write, then in some cases forgot
to remove it so it never returned to <= 0. Well, of course this
causes writeback to write too much! But that's an ext4 bug not
allowing nr_to_write to reach zero (not negative, but zero), not a
general writeback bug....

> All of this is a crock of course. The file system shouldn't be
> second-guessing the writeback code. Instead the writeback code should
> be adaptively measuring how long it takes to write out N pages
> to a particular block device, and then decide what's the appropriate
> setting for nr_to_write. What makes sense for a USB stick, or a 4200
> RPM laptop drive, may not make sense for a massive RAID array....

Why? Writeback should just keep pushing pages down until it congests
the block device. Then it throttles itself in get_request() and so
writeback already adapts to the load on the device. Multiple passes
of 1024 pages per dirty inode are fine for this - a larger
nr_to_write doesn't get the block device to congestion any faster or
slower, nor does it change the behaviour once at congestion....

> But since we don't have that, both XFS and ext4 have workarounds for
> brain-damaged writeback behaviour. (I did some testing, and even for
> standard laptop drives the cap of 1024 pages is just Way Too Small;
> that limit was set something like a decade ago, and everyone has been
> afraid to change it, even though disks have gotten a wee bit faster
> since those days.)

XFS put a workaround in for a different reason to ext4. ext4 put it
in to improve delayed allocation by working with larger chunks of
pages. XFS put it in to get large IOs to be issued through
submit_bio(), not to help the allocator...

And to be the nasty person to shoot down your modern hardware
theory: nr_to_write = 1024 pages works just fine on my laptop (XFS
on indilix SSD) as well as my big test server (XFS on 12 disk RAID0).
The server gets 1.5GB/s with pretty much perfect IO patterns with
the fixes I posted, unlike the mess of single page IOs that occurs
without them....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-26 02:43:10

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Mon, Apr 26, 2010 at 11:49:08AM +1000, Dave Chinner wrote:
>
> Yes, but that does not require a negative value to get right. None
> of the code relies on negative nr_to_write values to do anything
> correctly, and all the termination checks are for wbc->nr_to_write
> <= 0. And the tracing shows it behaves correctly when
> wbc->nr_to_write = 0 on return. Requiring a negative number is not
> documented in any of the comments, write_cache_pages() does not
> return a negative number, etc, so I can't see why you think this is
> necessary....

In fs/fs-writeback.c, wb_writeback(), around line 774:

wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;

If we want "wrote" to accurately reflect the number of pages that
the filesystem actually wrote, then if you write more pages than
wbc.nr_to_write requested, it needs to go negative.

> XFS put a workaround in for a different reason to ext4. ext4 put it
> in to improve delayed allocation by working with larger chunks of
> pages. XFS put it in to get large IOs to be issued through
> submit_bio(), not to help the allocator...

That's why I put it in ext4, at least initially, yes. I'm working on
rewriting the ext4_writepages() code to make this unnecessary....

However...

> And to be the nasty person to shoot down your modern hardware
> theory: nr_to_write = 1024 pages works just fine on my laptop (XFS
> on indilix SSD) as well as my big test server (XFS on 12 disk RAID0)
> The server gets 1.5GB/s with pretty much perfect IO patterns with
> the fixes I posted, unlike the mess of single page IOs that occurs
> without them....

Have you tested with multiple files that are subject to writeout at
the same time? After all, if your I/O allocator does a great job of
keeping the files contiguous in chunks larger tham 4MB, then if you
have two or more files that need to be written out, the page allocator
will round robin between the two files in 4MB chunks, and that might
not be considered an ideal I/O pattern.

Regards,

- Ted

2010-04-26 02:45:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Sun, Apr 25, 2010 at 10:43:02PM -0400, [email protected] wrote:
> Have you tested with multiple files that are subject to writeout at
> the same time? After all, if your I/O allocator does a great job of
> keeping the files contiguous in chunks larger tham 4MB, then if you
> have two or more files that need to be written out, the page allocator
> will round robin between the two files in 4MB chunks, and that might
> not be considered an ideal I/O pattern.

Argh. Sorry for not proof reading better before hitting the send
key....

s/tham/than/
s/page allocator/writeback code/

- Ted

2010-04-27 03:30:37

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Sun, Apr 25, 2010 at 10:43:02PM -0400, [email protected] wrote:
> On Mon, Apr 26, 2010 at 11:49:08AM +1000, Dave Chinner wrote:
> >
> > Yes, but that does not require a negative value to get right. None
> > of the code relies on negative nr_to_write values to do anything
> > correctly, and all the termination checks are for wbc->nr_to_write
> > <= 0. And the tracing shows it behaves correctly when
> > wbc->nr_to_write = 0 on return. Requiring a negative number is not
> > documented in any of the comments, write_cache_pages() does not
> > return a negative number, etc, so I can't see why you think this is
> > necessary....
>
> In fs/fs-writeback.c, wb_writeback(), around line 774:
>
> wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>
> If we want "wrote" to accurately reflect the number of pages that
> the filesystem actually wrote, then if you write more pages than
> wbc.nr_to_write requested, it needs to go negative.

Yes, but the change I made:

a) prevented it from writing more than requested in the
async writeback case, and
b) prevented it from going massively negative so that the
higher levels wouldn't have over-accounted for pages
written.

And if we consider that for the sync case we actually return the
number of pages written - it gets capped at zero even when we
write a lot more than that.

Hence exact accounting of pages written is really not important.
Indeed, the exact number of written pages is not actually used for
anything specific - only to determine whether there was activity or not:

919 pages_written = wb_do_writeback(wb, 0);
920
921 if (pages_written)
922 last_active = jiffies;

> > XFS put a workaround in for a different reason to ext4. ext4 put it
> > in to improve delayed allocation by working with larger chunks of
> > pages. XFS put it in to get large IOs to be issued through
> > submit_bio(), not to help the allocator...
>
> That's why I put it in ext4, at least initially, yes. I'm working on
> rewriting the ext4_writepages() code to make this unnecessary....
>
> However...
>
> > And to be the nasty person to shoot down your modern hardware
> > theory: nr_to_write = 1024 pages works just fine on my laptop (XFS
> > on indilix SSD) as well as my big test server (XFS on 12 disk RAID0)
> > The server gets 1.5GB/s with pretty much perfect IO patterns with
> > the fixes I posted, unlike the mess of single page IOs that occurs
> > without them....
>
> Have you tested with multiple files that are subject to writeout at
> the same time?

Of course.

> After all, if your I/O allocator does a great job of
> keeping the files contiguous in chunks larger tham 4MB, then if you
> have two or more files that need to be written out, the page allocator
> will round robin between the two files in 4MB chunks, and that might
> not be considered an ideal I/O pattern.

4MB chunks translate into 4-8 IOs at the block layer with typical
setups that set the maximum IO size to 512k or 1MB. So that is
_plenty_ to keep a single disk or several disks in a RAID stripe
busy before seeking to another location to do the next set of 4-8
writes. And if the drive has any amount of cache (we're seeing
64-128MB in SATA drives now), then it will be aggregating these writes in
the cache into even larger sequential chunks. Hence seeks in _modern
hardware_ are going to be almost entirely mitigated for most large
sequential write workloads as long as the contiguous chunks are more
than a few MB in size.

Some numbers for you:

One 4GB file (baseline):

$ dd if=/dev/zero of=/mnt/scratch/$i/test bs=1024k count=4000
.....
$ sudo xfs_bmap -vp /mnt/scratch/*/test
/mnt/scratch/0/test:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..4710271]: 96..4710367 0 (96..4710367) 4710272 00000
1: [4710272..8191999]: 5242976..8724703 1 (96..3481823) 3481728 00000

Ideal layout - the AG size is about 2.4GB, so it'll be two extents
as we see (average gives 2GB per extent). This completed at about 440MB/s.

Two 4GB files in parallel into the same directory:

$ for i in `seq 0 1 1`; do dd if=/dev/zero of=/mnt/scratch/test$i bs=1024k count=4000 & done
$ sudo xfs_bmap -vp /mnt/scratch/test* | awk '/ [0-9]*:/ { tot += $6; cnt++ } END { print tot / cnt }'
712348
$

So the average extent size is ~355MB, and throughput was roughly
520MB/s.

Two 4GB files in parallel into different directories (to trigger a
different allocator placement heuristic):

$ for i in `seq 0 1 1`; do dd if=/dev/zero of=/mnt/scratch/$i/test bs=1024k count=4000 & done
$ sudo xfs_bmap -vp /mnt/scratch/*/test | awk '/ [0-9]*:/ { tot += $6; cnt++ } END { printf "%d\n", tot / cnt }'
1170285
$

~600MB average extent size and throughput was roughly 530MB/s.

Let's make it harder - eight 1GB files in parallel into the same directory:

$ for i in `seq 0 1 7`; do dd if=/dev/zero of=/mnt/scratch/test$i bs=1024k count=1000 & done
...
$ sudo xfs_bmap -vp /mnt/scratch/test* | awk '/[0-9]:/ { tot += $6; cnt++ } END { print tot / cnt }'
157538
$

An average of 78MB per extent with throughput at roughly 520MB/s.
IOWs, the extent size is still large enough to provide full
bandwidth to pretty much any application that does sequential IO.
i.e. it is not ideal, but it's not badly fragmented enough to be a
problem for most people.

FWIW, with the current code I am seeing average extent sizes of
roughly 55MB for this same test, so there is significant _reduction_
in fragmentation by making sure we interleave chunks of pages
_consistently_ in writeback. Mind you, throughput didn't change
because extents of 55MB are still large enough to maintain full disk
throughput for this workload....

FYI, if this level of fragmentation were a problem for this
workload (e.g. a mythTV box) I could use something like the
allocsize mount option to specify the EOF preallocation size:

$ sudo umount /mnt/scratch
$ sudo mount -o logbsize=262144,nobarrier,allocsize=512m /dev/vdb /mnt/scratch
$ for i in `seq 0 1 7`; do dd if=/dev/zero of=/mnt/scratch/test$i bs=1024k count=1000 & done
....
$ sudo xfs_bmap -vp /mnt/scratch/test* | awk '/ [0-9]*:/ { tot += $6; cnt++ } END { print tot / cnt }'
1024000
$

512MB extent size average, exactly, with throughput at 510MB/s (so
no real reduction in throughput). IOWs, fragmentation for this
workload can be directly controlled without any performance penalty
if necessary.

I hope this answers your question, Ted. ;)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-30 17:09:54

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Tue, 20 Apr 2010 12:41:53 +1000
Dave Chinner <[email protected]> wrote:

> If a filesystem writes more than one page in ->writepage, write_cache_pages
> fails to notice this and continues to attempt writeback when wbc->nr_to_write
> has gone negative - this trace was captured from XFS:
>
>
> wbc_writeback_start: towrt=1024
> wbc_writepage: towrt=1024
> wbc_writepage: towrt=0
> wbc_writepage: towrt=-1
> wbc_writepage: towrt=-5
> wbc_writepage: towrt=-21
> wbc_writepage: towrt=-85
>

Bug.

AFAICT it's a regression introduced by

: commit 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4
: Author: Aneesh Kumar K.V <[email protected]>
: AuthorDate: Thu Oct 16 10:09:17 2008 -0400
: Commit: Theodore Ts'o <[email protected]>
: CommitDate: Thu Oct 16 10:09:17 2008 -0400
:
: vfs: Add no_nrwrite_index_update writeback control flag

I suggest that what you do here is remove the local `nr_to_write' from
write_cache_pages() and go back to directly using wbc->nr_to_write
within the loop.

And thus we restore the convention that if the fs writes back more than
a single page, it subtracts (nr_written - 1) from wbc->nr_to_write.

2010-04-30 19:08:33

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Thu, 29 Apr 2010 14:39:31 -0700, Andrew Morton <[email protected]> wrote:
> On Tue, 20 Apr 2010 12:41:53 +1000
> Dave Chinner <[email protected]> wrote:
>
> > If a filesystem writes more than one page in ->writepage, write_cache_pages
> > fails to notice this and continues to attempt writeback when wbc->nr_to_write
> > has gone negative - this trace was captured from XFS:
> >
> >
> > wbc_writeback_start: towrt=1024
> > wbc_writepage: towrt=1024
> > wbc_writepage: towrt=0
> > wbc_writepage: towrt=-1
> > wbc_writepage: towrt=-5
> > wbc_writepage: towrt=-21
> > wbc_writepage: towrt=-85
> >
>
> Bug.
>
> AFAICT it's a regression introduced by
>
> : commit 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4
> : Author: Aneesh Kumar K.V <[email protected]>
> : AuthorDate: Thu Oct 16 10:09:17 2008 -0400
> : Commit: Theodore Ts'o <[email protected]>
> : CommitDate: Thu Oct 16 10:09:17 2008 -0400
> :
> : vfs: Add no_nrwrite_index_update writeback control flag
>
> I suggest that what you do here is remove the local `nr_to_write' from
> write_cache_pages() and go back to directly using wbc->nr_to_write
> within the loop.
>
> And thus we restore the convention that if the fs writes back more than
> a single page, it subtracts (nr_written - 1) from wbc->nr_to_write.
>

My mistake, I never expected writepage to write more than one page. The
interface said 'writepage' so it was natural to expect that it writes only
one page. BTW the reason for the change is to give filesystems which
accumulate dirty pages using write_cache_pages() and attempt to write
them out later a chance to properly manage nr_to_write. Something like

ext4_da_writepages
-- write_cache_pages
---- collect dirty page
---- return
--return
--now try to writeout all the collected dirty pages ( say 100)
----Only able to allocate blocks for 50 pages
so update nr_to_write -= 50 and mark rest of 50 pages as dirty
again

So we want wbc->nr_to_write updated only by ext4_da_writepages.


-aneesh

2010-04-30 19:44:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Fri, 30 Apr 2010 11:31:53 +0530
"Aneesh Kumar K. V" <[email protected]> wrote:

> On Thu, 29 Apr 2010 14:39:31 -0700, Andrew Morton <[email protected]> wrote:
> > On Tue, 20 Apr 2010 12:41:53 +1000
> > Dave Chinner <[email protected]> wrote:
> >
> > > If a filesystem writes more than one page in ->writepage, write_cache_pages
> > > fails to notice this and continues to attempt writeback when wbc->nr_to_write
> > > has gone negative - this trace was captured from XFS:
> > >
> > >
> > > wbc_writeback_start: towrt=1024
> > > wbc_writepage: towrt=1024
> > > wbc_writepage: towrt=0
> > > wbc_writepage: towrt=-1
> > > wbc_writepage: towrt=-5
> > > wbc_writepage: towrt=-21
> > > wbc_writepage: towrt=-85
> > >
> >
> > Bug.
> >
> > AFAICT it's a regression introduced by
> >
> > : commit 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4
> > : Author: Aneesh Kumar K.V <[email protected]>
> > : AuthorDate: Thu Oct 16 10:09:17 2008 -0400
> > : Commit: Theodore Ts'o <[email protected]>
> > : CommitDate: Thu Oct 16 10:09:17 2008 -0400
> > :
> > : vfs: Add no_nrwrite_index_update writeback control flag
> >
> > I suggest that what you do here is remove the local `nr_to_write' from
> > write_cache_pages() and go back to directly using wbc->nr_to_write
> > within the loop.
> >
> > And thus we restore the convention that if the fs writes back more than
> > a single page, it subtracts (nr_written - 1) from wbc->nr_to_write.
> >
>
> My mistake, I never expected writepage to write more than one page.

The writeback code is tricky and easy to break in subtle ways.

> The
> interface said 'writepage' so it was natural to expect that it writes only
> one page. BTW the reason for the change is to give filesystems which
> accumulate dirty pages using write_cache_pages() and attempt to write
> them out later a chance to properly manage nr_to_write. Something like
>
> ext4_da_writepages
> -- write_cache_pages
> ---- collect dirty page
> ---- return
> --return
> --now try to writeout all the collected dirty pages ( say 100)
> ----Only able to allocate blocks for 50 pages
> so update nr_to_write -= 50 and mark rest of 50 pages as dirty
> again
>
> So we want wbc->nr_to_write updated only by ext4_da_writepages.

So you want a ->writepage() implementation which doesn't actually write
a page at all - it just remembers that page for later.

Maybe that fs shouldn't be calling write_cache_pages() at all. After
all, write_cache_pages() is a wrapper which emits a sequence of calls
to ->writepage(), and ->writepage() writes a page.

Rather than hacking around, subverting things and breaking core kernel
code, let's step back and more clearly think about what to do?

One option would be to implement a new address_space_operation which
provides the new semantics in a well-understood fashion. Let's call it
writepage_prepare(?). Then reimplement write_cache_pages() so that if
->writepage_prepare() is available, it handles it in a sensible fashion
and doesn't break traditional filesystems.

Or simply implement a new, different version of write_cache_pages() for
filesystems which wish to buffer in this fashion. The new
write_cache_pages_prepare()(?) would call ->writepage_prepare().
Internally it might share implementation with write_cache_pages().

There are lots of options. But the way in which write_cache_pages()
was extended to handle this ext4 requirement was rather unclean,
non-obvious and, umm, broken!