2010-05-25 10:54:48

by Dave Chinner

Subject: [PATCH 0/6] writeback: tracing and fixes


This series contains the initial writeback tracing patches from
Jens, as well as the extensions I added to provide visibility into
writeback control structures as they are used by the writeback code.
The visibility given is sufficient to understand what is happening
in the writeback path - what path is writing data, what path is
blocking on congestion, etc. - and to determine the differences in
behaviour for different sync modes and calling contexts. This
tracing really needs to be integrated into mainline so that anyone
can improve the tracing as they use it to track down problems in our
convoluted writeback paths.

The remaining patches are fixes to problems that the new tracing
highlighted.

Version 2:
- included ext4 write_cache_pages separation patch from Ted Ts'o.
- moved CREATE_TRACE_POINTS into fs-writeback.c as suggested by
Christoph Hellwig.
- moved the include of trace/events/writeback.h to after the structure
definitions in fs-writeback.c
- manually revert changes made to write_cache_pages() in
17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4 that caused the
regression. This restores the convention that if the fs writes
back more than a single page, it subtracts (nr_written - 1) from
wbc->nr_to_write, as suggested by Andrew Morton.
- added patch to prevent sync from looping in write_cache_pages
chasing a moving tail when an appending write workload is running
concurrently with sync.



2010-05-25 10:54:34

by Dave Chinner

Subject: [PATCH 2/6] writeback: Add tracing to balance_dirty_pages

From: Dave Chinner <[email protected]>

Tracing high level background writeback events is good, but it doesn't
give the entire picture. Add IO dispatched by foreground throttling to the
writeback events.
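For readers unfamiliar with the throttling path, the control flow these three
tracepoints bracket looks roughly like the sketch below. This is an
illustrative Python model, not the kernel code; FakeBDI and all its fields are
hypothetical stand-ins for the bdi/wbc state the real loop consults:

```python
class FakeBDI:
    """Hypothetical stand-in for a backing device's dirty state."""
    def __init__(self, nr_reclaimable, thresh, pages_per_pass):
        self.nr_reclaimable = nr_reclaimable  # dirty pages awaiting writeback
        self.thresh = thresh                  # per-bdi dirty threshold
        self.pages_per_pass = pages_per_pass  # pages cleaned per writeback pass

    def writeback(self, chunk):
        # Write back up to one pass worth of reclaimable pages.
        done = min(self.pages_per_pass, self.nr_reclaimable, chunk)
        self.nr_reclaimable -= done
        return done

def balance_dirty_pages(bdi, write_chunk, trace):
    """Structural sketch of the foreground throttling loop, recording
    where the new tracepoints would fire."""
    pages_written = 0
    while True:
        trace.append("wbc_balance_dirty_start")
        if bdi.nr_reclaimable > bdi.thresh:
            # Foreground IO dispatch: invisible to tracing before this patch.
            pages_written += bdi.writeback(write_chunk)
            trace.append("wbc_balance_dirty_written")
        if bdi.nr_reclaimable <= bdi.thresh:
            break   # back under the dirty threshold
        if pages_written >= write_chunk:
            break   # we've done our duty
        trace.append("wbc_balance_dirty_wait")
        # the kernel would io_schedule_timeout(pause) here
    return pages_written
```

Running it with a device that cleans 8 pages per pass against a 16-page chunk
shows the start/written/wait interleaving the tracepoints expose.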

Signed-off-by: Dave Chinner <[email protected]>
---
fs/fs-writeback.c | 5 ++
include/trace/events/writeback.h | 80 ++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 4 ++
3 files changed, 89 insertions(+), 0 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b44aba6..efa3d18 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -814,7 +814,11 @@ static long wb_writeback(struct bdi_writeback *wb,
wbc.more_io = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
+
+ trace_wbc_writeback_start(&wbc);
writeback_inodes_wb(wb, &wbc);
+ trace_wbc_writeback_written(&wbc);
+
args->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;

@@ -842,6 +846,7 @@ static long wb_writeback(struct bdi_writeback *wb,
if (!list_empty(&wb->b_more_io)) {
inode = list_entry(wb->b_more_io.prev,
struct inode, i_list);
+ trace_wbc_writeback_wait(&wbc);
inode_wait_for_writeback(inode);
}
spin_unlock(&inode_lock);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index df76457..72c1a12 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -7,6 +7,9 @@
#include <linux/backing-dev.h>
#include <linux/writeback.h>

+struct wb_writeback_args;
+struct bdi_work;
+
TRACE_EVENT(writeback_queue,

TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_args *args),
@@ -165,6 +168,83 @@ TRACE_EVENT(writeback_bdi_register,
TP_printk("%s: %s", __entry->name,
__entry->start ? "registered" : "unregistered")
);
+
+/* pass flags explicitly */
+DECLARE_EVENT_CLASS(wbc_class,
+ TP_PROTO(struct writeback_control *wbc),
+ TP_ARGS(wbc),
+ TP_STRUCT__entry(
+ __field(unsigned int, wbc)
+ __array(char, name, 16)
+ __field(long, nr_to_write)
+ __field(long, pages_skipped)
+ __field(int, sb)
+ __field(int, sync_mode)
+ __field(int, nonblocking)
+ __field(int, encountered_congestion)
+ __field(int, for_kupdate)
+ __field(int, for_background)
+ __field(int, for_reclaim)
+ __field(int, range_cyclic)
+ __field(int, more_io)
+ __field(unsigned long, older_than_this)
+ __field(long, range_start)
+ __field(long, range_end)
+ ),
+
+ TP_fast_assign(
+ char *__name = "(none)";
+
+ __entry->wbc = (unsigned long)wbc & 0xffff;
+ if (wbc->bdi)
+ strncpy(__entry->name, dev_name(wbc->bdi->dev), 16);
+ else
+ strncpy(__entry->name, __name, 16);
+ __entry->nr_to_write = wbc->nr_to_write;
+ __entry->pages_skipped = wbc->pages_skipped;
+ __entry->sb = !!wbc->sb;
+ __entry->sync_mode = wbc->sync_mode;
+ __entry->for_kupdate = wbc->for_kupdate;
+ __entry->for_background = wbc->for_background;
+ __entry->for_reclaim = wbc->for_reclaim;
+ __entry->range_cyclic = wbc->range_cyclic;
+ __entry->more_io = wbc->more_io;
+ __entry->older_than_this = wbc->older_than_this ?
+ *wbc->older_than_this : 0;
+ __entry->range_start = (long)wbc->range_start;
+ __entry->range_end = (long)wbc->range_end;
+ ),
+
+ TP_printk("dev %s wbc=%x towrt=%ld skip=%ld sb=%d mode=%d kupd=%d "
+ "bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+ "start=0x%lx end=0x%lx",
+ __entry->name,
+ __entry->wbc,
+ __entry->nr_to_write,
+ __entry->pages_skipped,
+ __entry->sb,
+ __entry->sync_mode,
+ __entry->for_kupdate,
+ __entry->for_background,
+ __entry->for_reclaim,
+ __entry->range_cyclic,
+ __entry->more_io,
+ __entry->older_than_this,
+ __entry->range_start,
+ __entry->range_end)
+)
+
+#define DEFINE_WBC_EVENT(name) \
+DEFINE_EVENT(wbc_class, name, \
+ TP_PROTO(struct writeback_control *wbc), \
+ TP_ARGS(wbc))
+DEFINE_WBC_EVENT(wbc_writeback_start);
+DEFINE_WBC_EVENT(wbc_writeback_written);
+DEFINE_WBC_EVENT(wbc_writeback_wait);
+DEFINE_WBC_EVENT(wbc_balance_dirty_start);
+DEFINE_WBC_EVENT(wbc_balance_dirty_written);
+DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
+
#endif /* _TRACE_WRITEBACK_H */

/* This part must be outside protection */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b289310..68eb727 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -34,6 +34,7 @@
#include <linux/syscalls.h>
#include <linux/buffer_head.h>
#include <linux/pagevec.h>
+#include <trace/events/writeback.h>

/*
* After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
@@ -536,11 +537,13 @@ static void balance_dirty_pages(struct address_space *mapping,
* threshold otherwise wait until the disk writes catch
* up.
*/
+ trace_wbc_balance_dirty_start(&wbc);
if (bdi_nr_reclaimable > bdi_thresh) {
writeback_inodes_wbc(&wbc);
pages_written += write_chunk - wbc.nr_to_write;
get_dirty_limits(&background_thresh, &dirty_thresh,
&bdi_thresh, bdi);
+ trace_wbc_balance_dirty_written(&wbc);
}

/*
@@ -566,6 +569,7 @@ static void balance_dirty_pages(struct address_space *mapping,
if (pages_written >= write_chunk)
break; /* We've done our duty */

+ trace_wbc_balance_dirty_wait(&wbc);
__set_current_state(TASK_INTERRUPTIBLE);
io_schedule_timeout(pause);

--
1.5.6.5


2010-05-25 10:54:57

by Dave Chinner

Subject: [PATCH 4/6] writeback: pay attention to wbc->nr_to_write in write_cache_pages

From: Dave Chinner <[email protected]>

If a filesystem writes more than one page in ->writepage, write_cache_pages
fails to notice this and continues to attempt writeback when wbc->nr_to_write
has gone negative - this trace was captured from XFS:


wbc_writeback_start: towrt=1024
wbc_writepage: towrt=1024
wbc_writepage: towrt=0
wbc_writepage: towrt=-1
wbc_writepage: towrt=-5
wbc_writepage: towrt=-21
wbc_writepage: towrt=-85

This has adverse effects on filesystem writeback behaviour. write_cache_pages()
needs to terminate after a certain number of pages are written, not after a
certain number of calls to ->writepage are made. This is a regression
introduced by 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4, but cannot be reverted
directly due to subsequent bug fixes that have gone in on top of it.

This commit adds a ->writepage tracepoint inside write_cache_pages() (which is
how the above trace was generated) and does the revert manually, leaving the
subsequent bug fixes intact. ext4 is not affected by this, as a previous commit
in the series stops ext4 from using the generic function.
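The accounting difference is easy to demonstrate outside the kernel. The
sketch below is hypothetical Python, not the kernel code; it uses a fixed
cluster size for simplicity, whereas real XFS clusters vary in size (which is
why the trace above decays irregularly):

```python
def broken_scan(nr_to_write, cluster):
    """Pre-fix accounting: the loop keeps a private counter and charges
    one unit per ->writepage call, while the filesystem charges
    wbc->nr_to_write for every page it actually clusters."""
    wbc_nr = nr_to_write   # wbc->nr_to_write: what the tracepoint prints
    local = nr_to_write    # the loop's private copy
    trace = []
    while local > 0:
        trace.append(wbc_nr)
        wbc_nr -= cluster  # the fs writes a whole cluster per call
        local -= 1
    return trace, wbc_nr

def fixed_scan(nr_to_write, cluster):
    """Post-fix accounting: terminate on wbc->nr_to_write itself, so the
    scan stops after the requested number of pages, not calls."""
    wbc_nr = nr_to_write
    calls = 0
    while wbc_nr > 0:
        wbc_nr -= cluster
        calls += 1
    return calls, wbc_nr
```

With nr_to_write=1024 and 64-page clusters, the broken loop makes 1024
->writepage calls and drives the counter far negative, just as the trace
shows; the fixed loop stops after 16 calls with the counter at zero.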

Signed-off-by: Dave Chinner <[email protected]>
---
include/linux/writeback.h | 9 ---------
include/trace/events/writeback.h | 1 +
mm/page-writeback.c | 16 ++++++----------
3 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a74837e..488ac1c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -55,15 +55,6 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned more_io:1; /* more io to be dispatched */
- /*
- * write_cache_pages() won't update wbc->nr_to_write and
- * mapping->writeback_index if no_nrwrite_index_update
- * is set. write_cache_pages() may write more than we
- * requested and we want to make sure nr_to_write and
- * writeback_index are updated in a consistent manner
- * so we use a single control to update them
- */
- unsigned no_nrwrite_index_update:1;

/*
* For WB_SYNC_ALL, the sb must always be pinned. For WB_SYNC_NONE,
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 72c1a12..5dda40e 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -244,6 +244,7 @@ DEFINE_WBC_EVENT(wbc_writeback_wait);
DEFINE_WBC_EVENT(wbc_balance_dirty_start);
DEFINE_WBC_EVENT(wbc_balance_dirty_written);
DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
+DEFINE_WBC_EVENT(wbc_writepage);

#endif /* _TRACE_WRITEBACK_H */

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 68eb727..0fe713d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -839,7 +839,6 @@ int write_cache_pages(struct address_space *mapping,
pgoff_t done_index;
int cycled;
int range_whole = 0;
- long nr_to_write = wbc->nr_to_write;

pagevec_init(&pvec, 0);
if (wbc->range_cyclic) {
@@ -921,6 +920,7 @@ continue_unlock:
if (!clear_page_dirty_for_io(page))
goto continue_unlock;

+ trace_wbc_writepage(wbc);
ret = (*writepage)(page, wbc, data);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
@@ -939,11 +939,10 @@ continue_unlock:
done = 1;
break;
}
- }
+ }

- if (nr_to_write > 0) {
- nr_to_write--;
- if (nr_to_write == 0 &&
+ if (wbc->nr_to_write > 0) {
+ if (--wbc->nr_to_write == 0 &&
wbc->sync_mode == WB_SYNC_NONE) {
/*
* We stop writing back only if we are
@@ -974,11 +973,8 @@ continue_unlock:
end = writeback_index - 1;
goto retry;
}
- if (!wbc->no_nrwrite_index_update) {
- if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
- mapping->writeback_index = done_index;
- wbc->nr_to_write = nr_to_write;
- }
+ if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
+ mapping->writeback_index = done_index;

return ret;
}
--
1.5.6.5


2010-05-25 10:54:54

by Dave Chinner

Subject: [PATCH 5/6] xfs: remove nr_to_write writeback windup.

From: Dave Chinner <[email protected]>

Now that the background flush code has been fixed, we shouldn't need to
silently multiply the wbc->nr_to_write to get good writeback. Remove
that code.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/linux-2.6/xfs_aops.c | 8 --------
1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 089eaca..4c89db3 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1366,14 +1366,6 @@ xfs_vm_writepage(
if (!page_has_buffers(page))
create_empty_buffers(page, 1 << inode->i_blkbits, 0);

-
- /*
- * VM calculation for nr_to_write seems off. Bump it way
- * up, this gets simple streaming writes zippy again.
- * To be reviewed again after Jens' writeback changes.
- */
- wbc->nr_to_write *= 4;

2010-05-25 10:54:12

by Dave Chinner

Subject: [PATCH 6/6] writeback: limit write_cache_pages integrity scanning to current EOF

From: Dave Chinner <[email protected]>

sync can currently take a really long time if a concurrent writer is
extending a file. The problem is that the dirty pages on the address
space grow in the same direction as write_cache_pages scans, so if
the writer keeps ahead of writeback, the writeback will not
terminate until the writer stops adding dirty pages.

For a data integrity sync, we only need to write the pages dirty at
the time we start the writeback, so we can stop scanning once we get
to the page that was at the end of the file at the time the scan
started.

This prevents operations like copying a large file from blocking sync
indefinitely, as sync will not write back pages that were dirtied
after it started. This does not impact the existing integrity
guarantees, as any dirty page (old or new) within the EOF range at
the start of the scan will still be captured.
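The livelock and its fix can be modelled with a few lines of Python. This is
an illustrative simulation, not the kernel code; sizes are in pages, and the
`limit` guard exists only so the uncapped case terminates in the model:

```python
def integrity_scan(initial_eof, growth_per_page, cap_to_eof, limit=10_000):
    """Simulate a WB_SYNC_ALL-style scan of a file while an appending
    writer dirties `growth_per_page` new pages for every page the scan
    cleans.  Returns (pages_written, terminated); terminated is False
    when the scan hits the safety limit, i.e. it livelocked chasing the
    moving tail."""
    eof = initial_eof
    # With the fix, cap the scan at EOF as sampled when the scan starts.
    end = initial_eof - 1 if cap_to_eof else None
    index = 0
    written = 0
    while index < eof:                  # dirty pages remain ahead of us
        if end is not None and index > end:
            break                       # past the EOF captured at start
        written += 1                    # write this page back
        index += 1
        eof += growth_per_page          # concurrent appending write
        if written >= limit:
            return written, False
    return written, True
```

When the writer keeps pace (one new dirty page per page cleaned), the
uncapped scan never terminates on its own, while the capped scan stops after
exactly the pages that were inside EOF when it started.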

Signed-off-by: Dave Chinner <[email protected]>
---
mm/page-writeback.c | 15 +++++++++++++++
1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0fe713d..c97e973 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -855,7 +855,22 @@ int write_cache_pages(struct address_space *mapping,
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
cycled = 1; /* ignore range_cyclic tests */
+
+ /*
+ * If this is a data integrity sync, cap the writeback to the
+ * current end of file. Any extension to the file that occurs
+ * after this is a new write and we don't need to write those
+ * pages out to fulfil our data integrity requirements. If we
+ * try to write them out, we can get stuck in this scan until
+ * the concurrent writer stops adding dirty pages and extending
+ * EOF.
+ */
+ if (wbc->sync_mode == WB_SYNC_ALL &&
+ wbc->range_end == LLONG_MAX) {
+ end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
+ }
}
+
retry:
done_index = index;
while (!done && (index <= end)) {
--
1.5.6.5

2010-05-25 10:54:07

by Dave Chinner

Subject: [PATCH 1/6] writeback: initial tracing support

From: Jens Axboe <[email protected]>

Trace queue/sched/exec parts of the writeback loop.

Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
---
fs/fs-writeback.c | 40 +++++++--
include/trace/events/writeback.h | 171 ++++++++++++++++++++++++++++++++++++++
mm/backing-dev.c | 3 +
3 files changed, 204 insertions(+), 10 deletions(-)
create mode 100644 include/trace/events/writeback.h

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b1e76ef..b44aba6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -26,15 +26,9 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/buffer_head.h>
+#include <linux/ftrace.h>
#include "internal.h"

-#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
-
-/*
- * We don't actually have pdflush, but this one is exported though /proc...
- */
-int nr_pdflush_threads;
-
/*
* Passed into wb_writeback(), essentially a subset of writeback_control
*/
@@ -63,6 +57,16 @@ struct bdi_work {
unsigned long state; /* flag bits, see WS_* */
};

+#define CREATE_TRACE_POINTS
+#include <trace/events/writeback.h>
+
+#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
+
+/*
+ * We don't actually have pdflush, but this one is exported though /proc...
+ */
+int nr_pdflush_threads;
+
enum {
WS_USED_B = 0,
WS_ONSTACK_B,
@@ -137,6 +141,8 @@ static void wb_work_complete(struct bdi_work *work)

static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
{
+ trace_writeback_clear(work);
+
/*
* The caller has retrieved the work arguments from this work,
* drop our reference. If this is the last ref, delete and free it
@@ -172,12 +178,16 @@ static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
* If the default thread isn't there, make sure we add it. When
* it gets created and wakes up, we'll run this work.
*/
- if (unlikely(list_empty_careful(&bdi->wb_list)))
+ if (unlikely(list_empty_careful(&bdi->wb_list))) {
+ trace_writeback_sched(bdi, work, "default");
wake_up_process(default_backing_dev_info.wb.task);
- else {
+ } else {
struct bdi_writeback *wb = &bdi->wb;
+ struct task_struct *task = wb->task;

- if (wb->task)
+ trace_writeback_sched(bdi, work, task ? "task" : "notask");
+
+ if (task)
wake_up_process(wb->task);
}
}
@@ -205,6 +215,7 @@ static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
work = kmalloc(sizeof(*work), GFP_ATOMIC);
if (work) {
bdi_work_init(work, args);
+ trace_writeback_queue(bdi, args);
bdi_queue_work(bdi, work);
if (wait)
bdi_wait_on_work_clear(work);
@@ -245,6 +256,7 @@ static void bdi_sync_writeback(struct backing_dev_info *bdi,
bdi_work_init(&work, &args);
work.state |= WS_ONSTACK;

+ trace_writeback_queue(bdi, &args);
bdi_queue_work(bdi, &work);
bdi_wait_on_work_clear(&work);
}
@@ -914,6 +926,8 @@ long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
struct wb_writeback_args args = work->args;
int post_clear;

+ trace_writeback_exec(work);
+
/*
* Override sync mode, in case we must wait for completion
*/
@@ -957,9 +971,13 @@ int bdi_writeback_task(struct bdi_writeback *wb)
unsigned long wait_jiffies = -1UL;
long pages_written;

+ trace_writeback_thread_start(1);
+
while (!kthread_should_stop()) {
pages_written = wb_do_writeback(wb, 0);

+ trace_writeback_pages_written(pages_written);
+
if (pages_written)
last_active = jiffies;
else if (wait_jiffies != -1UL) {
@@ -989,6 +1007,8 @@ int bdi_writeback_task(struct bdi_writeback *wb)
try_to_freeze();
}

+ trace_writeback_thread_start(0);
+
return 0;
}

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
new file mode 100644
index 0000000..df76457
--- /dev/null
+++ b/include/trace/events/writeback.h
@@ -0,0 +1,171 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM writeback
+
+#if !defined(_TRACE_WRITEBACK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_WRITEBACK_H
+
+#include <linux/backing-dev.h>
+#include <linux/writeback.h>
+
+TRACE_EVENT(writeback_queue,
+
+ TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_args *args),
+
+ TP_ARGS(bdi, args),
+
+ TP_STRUCT__entry(
+ __array(char, name, 16)
+ __field(long, nr_pages)
+ __field(int, sb)
+ __field(int, sync_mode)
+ __field(int, for_kupdate)
+ __field(int, range_cyclic)
+ __field(int, for_background)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->name, dev_name(bdi->dev), 16);
+ __entry->nr_pages = args->nr_pages;
+ __entry->sb = !!args->sb;
+ __entry->for_kupdate = args->for_kupdate;
+ __entry->range_cyclic = args->range_cyclic;
+ __entry->for_background = args->for_background;
+ ),
+
+ TP_printk("%s: pages=%ld, sb=%d, kupdate=%d, range_cyclic=%d "
+ "for_background=%d", __entry->name, __entry->nr_pages,
+ __entry->sb, __entry->for_kupdate,
+ __entry->range_cyclic, __entry->for_background)
+);
+
+TRACE_EVENT(writeback_sched,
+
+ TP_PROTO(struct backing_dev_info *bdi, struct bdi_work *work,
+ const char *msg),
+
+ TP_ARGS(bdi, work, msg),
+
+ TP_STRUCT__entry(
+ __array(char, name, 16)
+ __field(unsigned int, work)
+ __array(char, task, 8)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->name, dev_name(bdi->dev), 16);
+ __entry->work = (unsigned long) work & 0xffff;
+ snprintf(__entry->task, 8, "%s", msg);
+ ),
+
+ TP_printk("work=%x, task=%s", __entry->work, __entry->task)
+);
+
+TRACE_EVENT(writeback_exec,
+
+ TP_PROTO(struct bdi_work *work),
+
+ TP_ARGS(work),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, work)
+ __field(long, nr_pages)
+ __field(int, sb)
+ __field(int, sync_mode)
+ __field(int, for_kupdate)
+ __field(int, range_cyclic)
+ __field(int, for_background)
+ ),
+
+ TP_fast_assign(
+ __entry->work = (unsigned long) work & 0xffff;
+ __entry->nr_pages = work->args.nr_pages;
+ __entry->sb = !!work->args.sb;
+ __entry->for_kupdate = work->args.for_kupdate;
+ __entry->range_cyclic = work->args.range_cyclic;
+ __entry->for_background = work->args.for_background;
+
+ ),
+
+ TP_printk("work=%x pages=%ld, sb=%d, kupdate=%d, range_cyclic=%d"
+ " for_background=%d", __entry->work,
+ __entry->nr_pages, __entry->sb, __entry->for_kupdate,
+ __entry->range_cyclic, __entry->for_background)
+);
+
+TRACE_EVENT(writeback_clear,
+
+ TP_PROTO(struct bdi_work *work),
+
+ TP_ARGS(work),
+
+ TP_STRUCT__entry(
+ __field(struct bdi_work *, work)
+ __field(int, refs)
+ ),
+
+ TP_fast_assign(
+ __entry->work = work;
+ __entry->refs = atomic_read(&work->pending);
+ ),
+
+ TP_printk("work=%p, refs=%d", __entry->work, __entry->refs)
+);
+
+TRACE_EVENT(writeback_pages_written,
+
+ TP_PROTO(long pages_written),
+
+ TP_ARGS(pages_written),
+
+ TP_STRUCT__entry(
+ __field(long, pages)
+ ),
+
+ TP_fast_assign(
+ __entry->pages = pages_written;
+ ),
+
+ TP_printk("%ld", __entry->pages)
+);
+
+
+TRACE_EVENT(writeback_thread_start,
+
+ TP_PROTO(int start),
+
+ TP_ARGS(start),
+
+ TP_STRUCT__entry(
+ __field(int, start)
+ ),
+
+ TP_fast_assign(
+ __entry->start = start;
+ ),
+
+ TP_printk("%s", __entry->start ? "started" : "exited")
+);
+
+TRACE_EVENT(writeback_bdi_register,
+
+ TP_PROTO(const char *name, int start),
+
+ TP_ARGS(name, start),
+
+ TP_STRUCT__entry(
+ __array(char, name, 16)
+ __field(int, start)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->name, name, 16);
+ __entry->start = start;
+ ),
+
+ TP_printk("%s: %s", __entry->name,
+ __entry->start ? "registered" : "unregistered")
+);
+#endif /* _TRACE_WRITEBACK_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 660a87a..1f7723b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -10,6 +10,7 @@
#include <linux/module.h>
#include <linux/writeback.h>
#include <linux/device.h>
+#include <trace/events/writeback.h>

static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);

@@ -585,6 +586,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,

bdi_debug_register(bdi, dev_name(dev));
set_bit(BDI_registered, &bdi->state);
+ trace_writeback_bdi_register(dev_name(dev), 1);
exit:
return ret;
}
@@ -647,6 +649,7 @@ static void bdi_prune_sb(struct backing_dev_info *bdi)
void bdi_unregister(struct backing_dev_info *bdi)
{
if (bdi->dev) {
+ trace_writeback_bdi_register(dev_name(bdi->dev), 0);
bdi_prune_sb(bdi);

if (!bdi_cap_flush_forker(bdi))
--
1.5.6.5


2010-05-25 10:54:09

by Dave Chinner

Subject: [PATCH 3/6] ext4: Use our own write_cache_pages()

From: Theodore Ts'o <[email protected]>

Make a copy of write_cache_pages() for the benefit of
ext4_da_writepages(). This allows us to simplify the code some, and
will allow us to further customize the code in future patches.

There are some nasty hacks in write_cache_pages(), which Linus has
(correctly) characterized as vile. I've just copied it into
write_cache_pages_da(), without trying to clean those bits up lest I
break something in the ext4's delalloc implementation, which is a bit
fragile right now. This will allow Dave Chinner to clean up
write_cache_pages() in mm/page-writeback.c, without worrying about
breaking ext4. Eventually write_cache_pages_da() will go away when I
rewrite ext4's delayed allocation and create a general
ext4_writepages() which is used for all of ext4's writeback. Until
then, this is the lowest-risk way to clean up the core
write_cache_pages() function.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/inode.c | 141 ++++++++++++++++++++++++++++++++++++-------
include/trace/events/ext4.h | 5 +-
2 files changed, 120 insertions(+), 26 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3e0f6af..cdd4abe 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2423,17 +2423,6 @@ static int __mpage_da_writepage(struct page *page,
struct buffer_head *bh, *head;
sector_t logical;

- if (mpd->io_done) {
- /*
- * Rest of the page in the page_vec
- * redirty then and skip then. We will
- * try to write them again after
- * starting a new transaction
- */
- redirty_page_for_writepage(wbc, page);
- unlock_page(page);
- return MPAGE_DA_EXTENT_TAIL;
- }
/*
* Can we merge this page to current extent?
*/
@@ -2828,6 +2817,124 @@ static int ext4_da_writepages_trans_blocks(struct inode *inode)
return ext4_chunk_trans_blocks(inode, max_blocks);
}

+/*
+ * write_cache_pages_da - walk the list of dirty pages of the given
+ * address space and call the callback function (which usually writes
+ * the pages).
+ *
+ * This is a forked version of write_cache_pages(). Differences:
+ * Range cyclic is ignored.
+ * no_nrwrite_index_update is always presumed true
+ */
+static int write_cache_pages_da(struct address_space *mapping,
+ struct writeback_control *wbc,
+ struct mpage_da_data *mpd)
+{
+ int ret = 0;
+ int done = 0;
+ struct pagevec pvec;
+ int nr_pages;
+ pgoff_t index;
+ pgoff_t end; /* Inclusive */
+ long nr_to_write = wbc->nr_to_write;
+
+ pagevec_init(&pvec, 0);
+ index = wbc->range_start >> PAGE_CACHE_SHIFT;
+ end = wbc->range_end >> PAGE_CACHE_SHIFT;
+
+ while (!done && (index <= end)) {
+ int i;
+
+ nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+ PAGECACHE_TAG_DIRTY,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+ if (nr_pages == 0)
+ break;
+
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pvec.pages[i];
+
+ /*
+ * At this point, the page may be truncated or
+ * invalidated (changing page->mapping to NULL), or
+ * even swizzled back from swapper_space to tmpfs file
+ * mapping. However, page->index will not change
+ * because we have a reference on the page.
+ */
+ if (page->index > end) {
+ done = 1;
+ break;
+ }
+
+ lock_page(page);
+
+ /*
+ * Page truncated or invalidated. We can freely skip it
+ * then, even for data integrity operations: the page
+ * has disappeared concurrently, so there could be no
+ * real expectation of this data integrity operation
+ * even if there is now a new, dirty page at the same
+ * pagecache address.
+ */
+ if (unlikely(page->mapping != mapping)) {
+continue_unlock:
+ unlock_page(page);
+ continue;
+ }
+
+ if (!PageDirty(page)) {
+ /* someone wrote it for us */
+ goto continue_unlock;
+ }
+
+ if (PageWriteback(page)) {
+ if (wbc->sync_mode != WB_SYNC_NONE)
+ wait_on_page_writeback(page);
+ else
+ goto continue_unlock;
+ }
+
+ BUG_ON(PageWriteback(page));
+ if (!clear_page_dirty_for_io(page))
+ goto continue_unlock;
+
+ ret = __mpage_da_writepage(page, wbc, mpd);
+ if (unlikely(ret)) {
+ if (ret == AOP_WRITEPAGE_ACTIVATE) {
+ unlock_page(page);
+ ret = 0;
+ } else {
+ done = 1;
+ break;
+ }
+ }
+
+ if (nr_to_write > 0) {
+ nr_to_write--;
+ if (nr_to_write == 0 &&
+ wbc->sync_mode == WB_SYNC_NONE) {
+ /*
+ * We stop writing back only if we are
+ * not doing integrity sync. In case of
+ * integrity sync we have to keep going
+ * because someone may be concurrently
+ * dirtying pages, and we might have
+ * synced a lot of newly appeared dirty
+ * pages, but have not synced all of the
+ * old dirty pages.
+ */
+ done = 1;
+ break;
+ }
+ }
+ }
+ pagevec_release(&pvec);
+ cond_resched();
+ }
+ return ret;
+}
+
+
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
@@ -2836,7 +2943,6 @@ static int ext4_da_writepages(struct address_space *mapping,
handle_t *handle = NULL;
struct mpage_da_data mpd;
struct inode *inode = mapping->host;
- int no_nrwrite_index_update;
int pages_written = 0;
long pages_skipped;
unsigned int max_pages;
@@ -2916,12 +3022,6 @@ static int ext4_da_writepages(struct address_space *mapping,
mpd.wbc = wbc;
mpd.inode = mapping->host;

- /*
- * we don't want write_cache_pages to update
- * nr_to_write and writeback_index
- */
- no_nrwrite_index_update = wbc->no_nrwrite_index_update;
- wbc->no_nrwrite_index_update = 1;
pages_skipped = wbc->pages_skipped;

retry:
@@ -2963,8 +3063,7 @@ retry:
mpd.io_done = 0;
mpd.pages_written = 0;
mpd.retval = 0;
- ret = write_cache_pages(mapping, wbc, __mpage_da_writepage,
- &mpd);
+ ret = write_cache_pages_da(mapping, wbc, &mpd);
/*
* If we have a contiguous extent of pages and we
* haven't done the I/O yet, map the blocks and submit
@@ -3030,8 +3129,6 @@ retry:
mapping->writeback_index = index;

out_writepages:
- if (!no_nrwrite_index_update)
- wbc->no_nrwrite_index_update = 0;
wbc->nr_to_write -= nr_to_writebump;
wbc->range_start = range_start;
trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 2aa6aa3..fe76c15 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -306,7 +306,6 @@ TRACE_EVENT(ext4_da_writepages_result,
__field( int, pages_written )
__field( long, pages_skipped )
__field( char, more_io )
- __field( char, no_nrwrite_index_update )
__field( pgoff_t, writeback_index )
),

@@ -317,16 +316,14 @@ TRACE_EVENT(ext4_da_writepages_result,
__entry->pages_written = pages_written;
__entry->pages_skipped = wbc->pages_skipped;
__entry->more_io = wbc->more_io;
- __entry->no_nrwrite_index_update = wbc->no_nrwrite_index_update;
__entry->writeback_index = inode->i_mapping->writeback_index;
),

- TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld more_io %d no_nrwrite_index_update %d writeback_index %lu",
+ TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld more_io %d writeback_index %lu",
jbd2_dev_to_name(__entry->dev),
(unsigned long) __entry->ino, __entry->ret,
__entry->pages_written, __entry->pages_skipped,
__entry->more_io,
- __entry->no_nrwrite_index_update,
(unsigned long) __entry->writeback_index)
);

--
1.5.6.5

2010-05-25 11:11:07

by Christoph Hellwig

Subject: Re: [PATCH 4/6] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Tue, May 25, 2010 at 08:54:10PM +1000, Dave Chinner wrote:
> This commit adds a ->writepage tracepoint inside write_cache_pages() (how the
> above trace was generated) and does the revert manually leaving the subsequent

The tracepoint really should be a separate commit, probably ordered
before the ext4 revert.


2010-05-25 11:13:27

by Christoph Hellwig

Subject: Re: [PATCH 1/6] writeback: initial tracing support

> +#define CREATE_TRACE_POINTS
> +#include <trace/events/writeback.h>

It might be worth adding a comment explaining why this is in an unusual place.

Otherwise looks good,


Reviewed-by: Christoph Hellwig <[email protected]>

2010-05-25 11:13:45

by Christoph Hellwig

Subject: Re: [PATCH 2/6] writeback: Add tracing to balance_dirty_pages

On Tue, May 25, 2010 at 08:54:08PM +1000, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> Tracing high level background writeback events is good, but it doesn't
> give the entire picture. Add IO dispatched by foreground throttling to the
> writeback events.
>
> Signed-off-by: Dave Chinner <[email protected]>

Looks good,


Reviewed-by: Christoph Hellwig <[email protected]>


2010-05-25 11:14:14

by Christoph Hellwig

Subject: Re: [PATCH 5/6] xfs: remove nr_to_write writeback windup.

On Tue, May 25, 2010 at 08:54:11PM +1000, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> Now that the background flush code has been fixed, we shouldn't need to
> silently multiply the wbc->nr_to_write to get good writeback. Remove
> that code.
>
> Signed-off-by: Dave Chinner <[email protected]>

Ok,


Reviewed-by: Christoph Hellwig <[email protected]>


2010-05-25 13:06:31

by Theodore Ts'o

Subject: Re: [PATCH 3/6] ext4: Use our own write_cache_pages()

On Tue, May 25, 2010 at 08:54:09PM +1000, Dave Chinner wrote:
> From: Theodore Ts'o <[email protected]>
>
> Make a copy of write_cache_pages() for the benefit of
> ext4_da_writepages(). This allows us to simplify the code some, and
> will allow us to further customize the code in future patches.

I'm going to be pushing this to Linus shortly, so it's likely you'll
be able to drop this from your patch series. I just wanted to give
you a heads up about what I was planning on doing, which hopefully
simplified your life a little.

- Ted

2010-05-25 22:43:10

by Dave Chinner

Subject: Re: [PATCH 3/6] ext4: Use our own write_cache_pages()

On Tue, May 25, 2010 at 09:06:26AM -0400, [email protected] wrote:
> On Tue, May 25, 2010 at 08:54:09PM +1000, Dave Chinner wrote:
> > From: Theodore Ts'o <[email protected]>
> >
> > Make a copy of write_cache_pages() for the benefit of
> > ext4_da_writepages(). This allows us to simplify the code some, and
> > will allow us to further customize the code in future patches.
>
> I'm going to be pushing this to Linus shortly, so it's likely you'll
> be able to drop this from your patch series.

Great!

> I just wanted to give
> you a heads up about what I was planning on doing, which hopefully
> simplified your life a little.

And that's appreciated. I included it to make sure everyone could
see how this all fitted together and others could test it if they so
desired. i.e. make sure I wasn't breaking ext4 by reverting the
changes to write_cache_pages. I'll leave it in the series until the
ext4 change is upstream, then I'll drop it.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-05-27 21:33:14

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/6] writeback: initial tracing support

On Tue, 25 May 2010 20:54:07 +1000
Dave Chinner <[email protected]> wrote:

> From: Jens Axboe <[email protected]>
>
> Trace queue/sched/exec parts of the writeback loop.

It would be most useful if this patchset's description provided sample
tracing output, so we can see what the patch is actually providing us.

> -#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
> -
> -/*
> - * We don't actually have pdflush, but this one is exported though /proc...
> - */
> -int nr_pdflush_threads;
> -
> /*
> * Passed into wb_writeback(), essentially a subset of writeback_control
> */
> @@ -63,6 +57,16 @@ struct bdi_work {
> unsigned long state; /* flag bits, see WS_* */
> };
>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/writeback.h>
> +
> +#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)

Could/should be implemented in C.
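
A static inline version of that macro would look something like the
following userspace sketch (stand-in struct definitions, not the real
kernel types from <linux/fs.h> and <linux/backing-dev.h>):

```c
#include <assert.h>

/* Minimal stand-ins for the kernel structures involved. */
struct backing_dev_info { int id; };
struct address_space { struct backing_dev_info *backing_dev_info; };
struct inode { struct address_space *i_mapping; };

/* The suggestion: replace the macro with a typed static inline so the
 * compiler checks the argument instead of accepting anything. */
static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
{
	return inode->i_mapping->backing_dev_info;
}
```

Same generated code as the macro, but with type checking.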

> +/*
> + * We don't actually have pdflush, but this one is exported though /proc...
> + */
> +int nr_pdflush_threads;

So this is always zero now?

We don't want to keep it forever. Add a
printk_once("nr_pdflush_threads is deprecated") when someone reads it,
remove it in 2014.
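
The suggested deprecation warning can be modeled in userspace like this
(illustrative only - the real printk_once() lives in <linux/printk.h>,
and the /proc read path shown here is hypothetical):

```c
#include <assert.h>

static int warn_count;	/* stand-in for lines reaching the kernel log */

/* Userspace model of printk_once(): each expansion site fires at most
 * once, via a function-local static flag. */
#define printk_once(msg) do {				\
		static int warned;			\
		if (!warned) {				\
			warned = 1;			\
			(void)(msg);			\
			warn_count++;			\
		}					\
	} while (0)

/* Hypothetical read handler for the deprecated /proc value. */
static int read_nr_pdflush_threads(void)
{
	printk_once("nr_pdflush_threads is deprecated");
	return 0;	/* always zero now that pdflush is gone */
}
```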

>
> ...
>
> --- /dev/null
> +++ b/include/trace/events/writeback.h
> @@ -0,0 +1,171 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM writeback
> +
> +#if !defined(_TRACE_WRITEBACK_H) || defined(TRACE_HEADER_MULTI_READ)

Seems wrong. If you define TRACE_HEADER_MULTI_READ then include this
header twice, things explode. Which negates the purpose of
_TRACE_WRITEBACK_H.

> +#define _TRACE_WRITEBACK_H
> +
> +#include <linux/backing-dev.h>
> +#include <linux/writeback.h>
> +
> +TRACE_EVENT(writeback_queue,
> +
> + TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_args *args),
> +
> + TP_ARGS(bdi, args),
> +
> + TP_STRUCT__entry(
> + __array(char, name, 16)
> + __field(long, nr_pages)
> + __field(int, sb)
> + __field(int, sync_mode)
> + __field(int, for_kupdate)
> + __field(int, range_cyclic)
> + __field(int, for_background)
> + ),
> +
> + TP_fast_assign(
> + strncpy(__entry->name, dev_name(bdi->dev), 16);
> + __entry->nr_pages = args->nr_pages;
> + __entry->sb = !!args->sb;
> + __entry->for_kupdate = args->for_kupdate;
> + __entry->range_cyclic = args->range_cyclic;
> + __entry->for_background = args->for_background;
> + ),
> +
> + TP_printk("%s: pages=%ld, sb=%d, kupdate=%d, range_cyclic=%d "
> + "for_background=%d", __entry->name, __entry->nr_pages,
> + __entry->sb, __entry->for_kupdate,
> + __entry->range_cyclic, __entry->for_background)
> +);
> +


2010-05-27 21:33:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 4/6] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Tue, 25 May 2010 20:54:10 +1000
Dave Chinner <[email protected]> wrote:

> From: Dave Chinner <[email protected]>
>
> If a filesystem writes more than one page in ->writepage, write_cache_pages
> fails to notice this and continues to attempt writeback when wbc->nr_to_write
> has gone negative - this trace was captured from XFS:
>
>
> wbc_writeback_start: towrt=1024
> wbc_writepage: towrt=1024
> wbc_writepage: towrt=0
> wbc_writepage: towrt=-1
> wbc_writepage: towrt=-5
> wbc_writepage: towrt=-21
> wbc_writepage: towrt=-85
>
> This has adverse effects on filesystem writeback behaviour. write_cache_pages()
> needs to terminate after a certain number of pages are written, not after a
> certain number of calls to ->writepage are made. This is a regression
> introduced by 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4, but cannot be reverted

It's conventional to identify commits by their title as well as their
hash. So 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4 ("vfs: Add
no_nrwrite_index_update writeback control flag"). Because that commit
might have different hashes in different trees, I think. A Linus idea.

I do this ten times a day - it's a PITA.

> directly due to subsequent bug fixes that have gone in on top of it.
>
> This commit adds a ->writepage tracepoint inside write_cache_pages() (how the
> above trace was generated) and does the revert manually leaving the subsequent
> bug fixes in tact. ext4 is not affected by this as a previous commit in the

"intact".

> series stops ext4 from using the generic function.
>
> - if (nr_to_write > 0) {
> - nr_to_write--;
> - if (nr_to_write == 0 &&
> + if (wbc->nr_to_write > 0) {
> + if (--wbc->nr_to_write == 0 &&
> wbc->sync_mode == WB_SYNC_NONE) {
> /*
> * We stop writing back only if we are
> @@ -974,11 +973,8 @@ continue_unlock:
> end = writeback_index - 1;
> goto retry;
> }
> - if (!wbc->no_nrwrite_index_update) {
> - if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
> - mapping->writeback_index = done_index;
> - wbc->nr_to_write = nr_to_write;
> - }
> + if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
> + mapping->writeback_index = done_index;
>
> return ret;

'bout time we fixed that. I wonder why it took so long to find.
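
The accounting convention being restored - the filesystem subtracts the
extra (nr_written - 1) pages it wrote in one ->writepage call, and
write_cache_pages() decrements nr_to_write once per call - can be
modeled in a few lines of userspace C (illustrative, not the kernel
code):

```c
#include <assert.h>

struct writeback_control { long nr_to_write; };

/* A filesystem ->writepage that clusters pages: it writes nr_written
 * pages in one call and accounts for the extra ones itself. */
static void fs_writepage(struct writeback_control *wbc, long nr_written)
{
	wbc->nr_to_write -= nr_written - 1;
}

/* Simplified write_cache_pages() loop: one decrement per ->writepage
 * call, terminating when the page quota is exhausted. */
static long write_loop(struct writeback_control *wbc, long cluster)
{
	long calls = 0;

	while (wbc->nr_to_write > 0) {
		fs_writepage(wbc, cluster);
		wbc->nr_to_write--;	/* write_cache_pages' own decrement */
		calls++;
	}
	return calls;
}
```

With nr_to_write = 1024 and 64-page clusters the loop now stops after
16 calls instead of running nr_to_write negative as in the XFS trace
above.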

2010-05-27 21:34:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 6/6] writeback: limit write_cache_pages integrity scanning to current EOF

On Tue, 25 May 2010 20:54:12 +1000
Dave Chinner <[email protected]> wrote:

> From: Dave Chinner <[email protected]>
>
> sync can currently take a really long time if a concurrent writer is
> extending a file. The problem is that the dirty pages on the address
> space grow in the same direction as write_cache_pages scans, so if
> the writer keeps ahead of writeback, the writeback will not
> terminate until the writer stops adding dirty pages.

<looks at Jens>

That really was a pretty basic bug. It's writeback 101 to test that case :(

> For a data integrity sync, we only need to write the pages dirty at
> the time we start the writeback, so we can stop scanning once we get
> to the page that was at the end of the file at the time the scan
> started.
>
> This will prevent operations like copying a large file preventing
> sync from completing as it will not write back pages that were
> dirtied after the sync was started. This does not impact the
> existing integrity guarantees, as any dirty page (old or new)
> within the EOF range at the start of the scan will still be
> captured.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> mm/page-writeback.c | 15 +++++++++++++++
> 1 files changed, 15 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 0fe713d..c97e973 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -855,7 +855,22 @@ int write_cache_pages(struct address_space *mapping,
> if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
> range_whole = 1;
> cycled = 1; /* ignore range_cyclic tests */
> +
> + /*
> + * If this is a data integrity sync, cap the writeback to the
> + * current end of file. Any extension to the file that occurs
> + * after this is a new write and we don't need to write those
> + * pages out to fulfil our data integrity requirements. If we
> + * try to write them out, we can get stuck in this scan until
> + * the concurrent writer stops adding dirty pages and extending
> + * EOF.
> + */
> + if (wbc->sync_mode == WB_SYNC_ALL &&
> + wbc->range_end == LLONG_MAX) {
> + end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
> + }
> }
> +

This is somewhat inefficient. It's really trivial and fast to find the
highest-index dirty page by walking straight down the
PAGECACHE_TAG_DIRTY-tagged nodes.

However pagevec_lookup_tag(..., PAGECACHE_TAG_DIRTY) should do a pretty
good job of skipping over the (millions of) pages between the (last
dirty page before `end') and (`end'). So it _should_ be OK. Some thought
and runtime testing would be good.
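
The livelock, and the effect of capping the scan at the EOF sampled
when it starts (the patch's WB_SYNC_ALL case), can be modeled in
userspace - all names and numbers here are illustrative:

```c
#include <assert.h>

/* Userspace model: a writer appends writer_rate dirty pages for each
 * page the scan writes back. Uncapped, the scan never catches the end
 * of the file; capped at the EOF seen at scan start, it terminates. */
static long scan_pages(long eof_at_start, long writer_rate, int capped)
{
	long index = 0, eof = eof_at_start, written = 0;
	long end = capped ? eof_at_start : -1;	/* -1 models LLONG_MAX */

	while (index <= eof) {
		if (end >= 0 && index > end)
			break;		/* past the EOF seen at scan start */
		written++;		/* write back the page at 'index' */
		index++;
		eof += writer_rate;	/* concurrent appender extends EOF */
		if (written > 1000000)
			return -1;	/* livelocked: give up for the demo */
	}
	return written;
}
```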



That being said, I think the patch is insufficient. If I create an
enormous (possibly sparse) file with a 16TB hole (or a run of clean
pages) in the middle and then start busily writing into that hole (run
of clean pages), the problem will still occur.

One obvious fix for that (a) would be to add another radix-tree tag and
do two passes across the radix-tree.

Another fix (b) would be to track the number of dirty pages per
address_space, and only write that number of pages.

Another fix would be to work out how the code handled this situation
before we broke it, and restore that in some fashion. I guess fix (b)
above kinda does that.



2010-05-28 00:44:35

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/6] writeback: initial tracing support

On Thu, May 27, 2010 at 02:32:33PM -0700, Andrew Morton wrote:
> On Tue, 25 May 2010 20:54:07 +1000
> Dave Chinner <[email protected]> wrote:
>
> > From: Jens Axboe <[email protected]>
> >
> > Trace queue/sched/exec parts of the writeback loop.
>
> It would be most useful if this patchset's description provided sample
> tracing output, so we can see what the patch is actually providing us.

This is just a forward port of Jens' patch. I guess I'll have to
clean it up some more...

>
> > -#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
> > -
> > -/*
> > - * We don't actually have pdflush, but this one is exported though /proc...
> > - */
> > -int nr_pdflush_threads;
> > -
> > /*
> > * Passed into wb_writeback(), essentially a subset of writeback_control
> > */
> > @@ -63,6 +57,16 @@ struct bdi_work {
> > unsigned long state; /* flag bits, see WS_* */
> > };
> >
> > +#define CREATE_TRACE_POINTS
> > +#include <trace/events/writeback.h>
> > +
> > +#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
>
> Could/should be implemented in C.

OK.

>
> > +/*
> > + * We don't actually have pdflush, but this one is exported though /proc...
> > + */
> > +int nr_pdflush_threads;
>
> So this is always zero now?

I guess so. I'd forgotten that this was in the original patch....

> We don't want to keep it forever. Add a
> printk_once("nr_pdflush_threads is deprecated") when someone reads it,
> remove it in 2014.

OK.

> > --- /dev/null
> > +++ b/include/trace/events/writeback.h
> > @@ -0,0 +1,171 @@
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM writeback
> > +
> > +#if !defined(_TRACE_WRITEBACK_H) || defined(TRACE_HEADER_MULTI_READ)
>
> Seems wrong. If you define TRACE_HEADER_MULTI_READ then include this
> header twice, things explode. Which negates the purpose of
> _TRACE_WRITEBACK_H.

Every other trace event header does this, so if it's wrong then the
same mistake is in at least 30 files now. I don't know enough about
the tracing code to know why this is done, and I'm not keen to
address such a mistake here...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-05-28 00:56:02

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 4/6] writeback: pay attention to wbc->nr_to_write in write_cache_pages

On Thu, May 27, 2010 at 02:32:51PM -0700, Andrew Morton wrote:
> On Tue, 25 May 2010 20:54:10 +1000
> Dave Chinner <[email protected]> wrote:
>
> > From: Dave Chinner <[email protected]>
> >
> > If a filesystem writes more than one page in ->writepage, write_cache_pages
> > fails to notice this and continues to attempt writeback when wbc->nr_to_write
> > has gone negative - this trace was captured from XFS:
> >
> >
> > wbc_writeback_start: towrt=1024
> > wbc_writepage: towrt=1024
> > wbc_writepage: towrt=0
> > wbc_writepage: towrt=-1
> > wbc_writepage: towrt=-5
> > wbc_writepage: towrt=-21
> > wbc_writepage: towrt=-85
> >
> > This has adverse effects on filesystem writeback behaviour. write_cache_pages()
> > needs to terminate after a certain number of pages are written, not after a
> > certain number of calls to ->writepage are made. This is a regression
> > introduced by 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4, but cannot be reverted
>
> It's conventional to identify commits by their title as well as their
> hash. So 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4 ("vfs: Add
> no_nrwrite_index_update writeback control flag"). Because that commit
> might have different hashes in different trees, I think. A Linus idea.
>
> I do this ten times a day - it's a PITA.

Will fix.

>
> > directly due to subsequent bug fixes that have gone in on top of it.
> >
> > This commit adds a ->writepage tracepoint inside write_cache_pages() (how the
> > above trace was generated) and does the revert manually leaving the subsequent
> > bug fixes in tact. ext4 is not affected by this as a previous commit in the
>
> "intact".

*nod*

> > series stops ext4 from using the generic function.
> >
> > - if (nr_to_write > 0) {
> > - nr_to_write--;
> > - if (nr_to_write == 0 &&
> > + if (wbc->nr_to_write > 0) {
> > + if (--wbc->nr_to_write == 0 &&
> > wbc->sync_mode == WB_SYNC_NONE) {
> > /*
> > * We stop writing back only if we are
> > @@ -974,11 +973,8 @@ continue_unlock:
> > end = writeback_index - 1;
> > goto retry;
> > }
> > - if (!wbc->no_nrwrite_index_update) {
> > - if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
> > - mapping->writeback_index = done_index;
> > - wbc->nr_to_write = nr_to_write;
> > - }
> > + if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
> > + mapping->writeback_index = done_index;
> >
> > return ret;
>
> 'bout time we fixed that. I wonder why it took so long to find.

IMO, because without tracing it's almost impossible to see what is
happening inside this code easily. I wasn't looking for this bug
when I added the tracing - I was trying to discover why sync was
locking up for minutes on end.

Also a significant problem is that writeback changes often come in
through a tree that no filesystem developer is actually testing
(e.g. the block git tree) or a single FS tree (e.g. the ext4 tree),
so problems in the generic code that manifest in only one or two
filesystems slip under the radar all too easily. Hence they are
often only discovered when some other, unrelated, obvious problem is
investigated...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-05-28 01:18:24

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 1/6] writeback: initial tracing support

On Thu, 2010-05-27 at 14:32 -0700, Andrew Morton wrote:

> > --- /dev/null
> > +++ b/include/trace/events/writeback.h
> > @@ -0,0 +1,171 @@
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM writeback
> > +
> > +#if !defined(_TRACE_WRITEBACK_H) || defined(TRACE_HEADER_MULTI_READ)
>
> Seems wrong. If you define TRACE_HEADER_MULTI_READ then include this
> header twice, things explode. Which negates the purpose of
> _TRACE_WRITEBACK_H.

That's intended. It is documented in
samples/trace_events/trace-events-samples.h

The purpose of TRACE_HEADER_MULTI_READ is to read the trace header
multiple times. ;-)
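
Stripped of the tracing machinery, the multi-read pattern is the
classic X-macro trick: the same list is expanded twice under different
macro definitions. A self-contained userspace sketch (event names
illustrative):

```c
#include <assert.h>
#include <string.h>

/* This list plays the role of the trace header's TRACE_EVENT()
 * invocations; re-expanding it with a redefined EVENT() macro is what
 * TRACE_HEADER_MULTI_READ makes possible for real trace headers. */
#define EVENT_LIST		\
	EVENT(writeback_queue)	\
	EVENT(writeback_exec)

/* First read: generate declarations. */
#define EVENT(name) static const char *trace_##name(void);
EVENT_LIST
#undef EVENT

/* Second read: generate definitions. */
#define EVENT(name) \
	static const char *trace_##name(void) { return #name; }
EVENT_LIST
#undef EVENT
```

In the kernel, the second "read" happens where CREATE_TRACE_POINTS is
defined before the include, which is why the header must be includable
twice.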


You can also read about it here:

http://lwn.net/Articles/379903/

here:

http://lwn.net/Articles/381064/

and here:

http://lwn.net/Articles/383362/

-- Steve

>
> > +#define _TRACE_WRITEBACK_H
> > +
> > +#include <linux/backing-dev.h>
> > +#include <linux/writeback.h>
> > +


2010-05-28 01:21:00

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 1/6] writeback: initial tracing support

On Fri, 2010-05-28 at 10:44 +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 02:32:33PM -0700, Andrew Morton wrote:
> >
> > > --- /dev/null
> > > +++ b/include/trace/events/writeback.h
> > > @@ -0,0 +1,171 @@
> > > +#undef TRACE_SYSTEM
> > > +#define TRACE_SYSTEM writeback
> > > +
> > > +#if !defined(_TRACE_WRITEBACK_H) || defined(TRACE_HEADER_MULTI_READ)
> >
> > Seems wrong. If you define TRACE_HEADER_MULTI_READ then include this
> > header twice, things explode. Which negates the purpose of
> > _TRACE_WRITEBACK_H.
>
> Every other trace event header does this, so if it's wrong then the
> same mistake is in at least 30 files now. I don't know enough about
> the tracing code to know why this is done, and I'm not keen to
> address such a mistake here...

It's all part of the CPP voodoo ritual. You don't want to know why, just
do it, otherwise you may find dancing CPP skulls running around your
bathtub.

If you want to be a CPP Voodoo Witch doctor too, read up on LWN, I
describe some of the magic there.

-- Steve



2010-05-28 01:23:38

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 6/6] writeback: limit write_cache_pages integrity scanning to current EOF

On Thu, May 27, 2010 at 02:33:41PM -0700, Andrew Morton wrote:
> On Tue, 25 May 2010 20:54:12 +1000
> Dave Chinner <[email protected]> wrote:
>
> > From: Dave Chinner <[email protected]>
> >
> > sync can currently take a really long time if a concurrent writer is
> > extending a file. The problem is that the dirty pages on the address
> > space grow in the same direction as write_cache_pages scans, so if
> > the writer keeps ahead of writeback, the writeback will not
> > terminate until the writer stops adding dirty pages.
>
> <looks at Jens>
>
> That really was a pretty basic bug. It's writeback 101 to test that case :(
>
> > For a data integrity sync, we only need to write the pages dirty at
> > the time we start the writeback, so we can stop scanning once we get
> > to the page that was at the end of the file at the time the scan
> > started.
> >
> > This will prevent operations like copying a large file preventing
> > sync from completing as it will not write back pages that were
> > dirtied after the sync was started. This does not impact the
> > existing integrity guarantees, as any dirty page (old or new)
> > within the EOF range at the start of the scan will still be
> > captured.
> >
> > Signed-off-by: Dave Chinner <[email protected]>
> > ---
> > mm/page-writeback.c | 15 +++++++++++++++
> > 1 files changed, 15 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 0fe713d..c97e973 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -855,7 +855,22 @@ int write_cache_pages(struct address_space *mapping,
> > if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
> > range_whole = 1;
> > cycled = 1; /* ignore range_cyclic tests */
> > +
> > + /*
> > + * If this is a data integrity sync, cap the writeback to the
> > + * current end of file. Any extension to the file that occurs
> > + * after this is a new write and we don't need to write those
> > + * pages out to fulfil our data integrity requirements. If we
> > + * try to write them out, we can get stuck in this scan until
> > + * the concurrent writer stops adding dirty pages and extending
> > + * EOF.
> > + */
> > + if (wbc->sync_mode == WB_SYNC_ALL &&
> > + wbc->range_end == LLONG_MAX) {
> > + end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
> > + }
> > }
> > +
>
> This is somewhat inefficient. It's really trivial and fast to find the
> highest-index dirty page by walking straight down the
> PAGECACHE_TAG_DIRTY-tagged nodes.
>
> However pagevec_lookup_tag(..., PAGECACHE_TAG_DIRTY) should do a pretty
> good job of skipping over the (millions of) pages between the (last
> dirty page before `end') and (`end'). So it _should_ be OK. Some thought
> and runtime testing would be good.

I've done plenty of testing here - the patch is part of my usual QA
stack so it's probably run through XFSQA a few hundred times now.

The behaviour w.r.t. writes into holes appears to be identical to
the current behaviour (i.e. sync hangs until the write stops). I
can't _see_ any difference in behaviour or increase in overhead, but
we all know that this doesn't mean there isn't one.

The test I ran to determine its effectiveness against extending
writes was running sync during an 8GB sequential write. It would
hang every time until the write completes - typically around 50-60s
- without this patch. With this patch sync returns within 3-4s every
time.

> That being said, I think the patch is insufficient. If I create an
> enormous (possibly sparse) file with a 16TB hole (or a run of clean
> pages) in the middle and then start busily writing into that hole (run
> of clean pages), the problem will still occur.

Yes, that's already been pointed out. This doesn't attempt to
address that problem because....

> One obvious fix for that (a) would be to add another radix-tree tag and
> do two passes across the radix-tree.

.... we're still waiting for Jan's mark and sweep patch that does
this to make progress. This patch is just a simple fix for the most
common cause of the problem and might get into the tree sooner.

Personally I see no problems with using a mark+sweep algorithm for
this - IMO ensuring that data integrity requirements are met is more
important than ultimate performance.

> Another fix (b) would be to track the number of dirty pages per
> address_space, and only write that number of pages.
>
> Another fix would be to work out how the code handled this situation
> before we broke it, and restore that in some fashion. I guess fix (b)
> above kinda does that.

Yes, b) is similar to the old code, but like the old code it doesn't
work for the data integrity case because it doesn't guarantee the
correct pages are written out. This patch does for the append case,
and a) guarantees it in all cases....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-05-28 05:07:15

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 6/6] writeback: limit write_cache_pages integrity scanning to current EOF

On Thu, May 27, 2010 at 02:33:41PM -0700, Andrew Morton wrote:
> On Tue, 25 May 2010 20:54:12 +1000
> Dave Chinner <[email protected]> wrote:
>
> > From: Dave Chinner <[email protected]>
> >
> > sync can currently take a really long time if a concurrent writer is
> > extending a file. The problem is that the dirty pages on the address
> > space grow in the same direction as write_cache_pages scans, so if
> > the writer keeps ahead of writeback, the writeback will not
> > terminate until the writer stops adding dirty pages.

...

> That being said, I think the patch is insufficient. If I create an
> enormous (possibly sparse) file with a 16TB hole (or a run of clean
> pages) in the middle and then start busily writing into that hole (run
> of clean pages), the problem will still occur.

Yep.


> One obvious fix for that (a) would be to add another radix-tree tag and
> do two passes across the radix-tree.

Yes this is the method I tried. Jan has taken it further and should
have the latest patches around. A good test case for the starvation
would be helpful.


> Another fix (b) would be to track the number of dirty pages per
> address_space, and only write that number of pages.
>
> Another fix would be to work out how the code handled this situation
> before we broke it, and restore that in some fashion. I guess fix (b)
> above kinda does that.

I took that out (and offered fix (a) as a replacement, but it was
turned down at the time), because (b) stands for broken.

IIRC we were writing out no more than 2x the dirty pages of the file
during sync. The problem with that is more pages can be dirtied after
we calculate the number, and then we might write out those newly dirty
pages and miss old dirty pages.



2010-05-28 07:45:00

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/6] writeback: initial tracing support

On Thu, May 27, 2010 at 02:32:33PM -0700, Andrew Morton wrote:
> > +#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
>
> Could/should be implemented in C.
>
> > +/*
> > + * We don't actually have pdflush, but this one is exported though /proc...
> > + */
> > +int nr_pdflush_threads;
>
> So this is always zero now?
>
> We don't want to keep it forever. Add a
> printk_once("nr_pdflush_threads is deprecated") when someone reads it,
> remove it in 2014.

These two lines are just moved down a bit by the patch, I think any
cleanups should be left to separate patches.

2010-06-01 15:54:33

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 6/6] writeback: limit write_cache_pages integrity scanning to current EOF

On Thu 27-05-10 14:33:41, Andrew Morton wrote:
> On Tue, 25 May 2010 20:54:12 +1000
> Dave Chinner <[email protected]> wrote:
>
> > From: Dave Chinner <[email protected]>
> >
> > sync can currently take a really long time if a concurrent writer is
> > extending a file. The problem is that the dirty pages on the address
> > space grow in the same direction as write_cache_pages scans, so if
> > the writer keeps ahead of writeback, the writeback will not
> > terminate until the writer stops adding dirty pages.
>
> <looks at Jens>
>
> That really was a pretty basic bug. It's writeback 101 to test that case :(
The code has had this live-lock since Nick fixed data integrity issues in
write_cache_pages, which was (after some digging) commit 05fe478d ("mm:
write_cache_pages integrity fix") in January 2009. Jens just kept the code
as it was...

...
> That being said, I think the patch is insufficient. If I create an
> enormous (possibly sparse) file with a 16TB hole (or a run of clean
> pages) in the middle and then start busily writing into that hole (run
> of clean pages), the problem will still occur.
>
> One obvious fix for that (a) would be to add another radix-tree tag and
> do two passes across the radix-tree.
>
> Another fix (b) would be to track the number of dirty pages per
address_space, and only write that number of pages.
>
> Another fix would be to work out how the code handled this situation
> before we broke it, and restore that in some fashion. I guess fix (b)
> above kinda does that.
(b) does not work for data integrity sync (see changelog of the above
mentioned commit). I sent a patch doing (a) in February, but you in
particular raised concerns about whether it was too expensive... Since
it indeed has some cost (although I was not able to measure any performance
impact) and I didn't know a better solution, I just postponed the patches.
But I guess it's time to revive the series and maybe we'll get further with
it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR