2009-09-14 09:36:31

by Jens Axboe

Subject: [PATCH 0/7] Post merge per-bdi writeback patches v2

Hi,

Since the writeback patches are now (at last!) in mainline, I renamed
this branch from writeback-postmerge to just plain writeback. The first
four patches are identical to the ones from Friday, and I have added
three more. Two of them are cleanups that better separate the
WB_SYNC_NONE and WB_SYNC_ALL writeback paths; the third is a fix for
when a bdi is destroyed whilst inodes are still attached to it. I don't
see a much better way to fix this at the moment: requeueing the inodes
at sync time would complicate the exit path quite a lot. So just move
those entries to the default_backing_dev_info, which can handle
them at any time.

Patches are also available at:

git://git.kernel.dk/linux-2.6-block.git writeback

fs/btrfs/disk-io.c | 1
fs/fs-writeback.c | 181 ++++++++++++++----------------------
fs/fuse/inode.c | 2
fs/super.c | 6 +
fs/sync.c | 9 +
fs/ubifs/super.c | 1
include/linux/backing-dev.h | 1
include/linux/fs.h | 1
mm/backing-dev.c | 88 +++++++++++++----
mm/page-writeback.c | 9 -
10 files changed, 161 insertions(+), 138 deletions(-)

--
Jens Axboe


2009-09-14 09:36:43

by Jens Axboe

Subject: [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE

From: Christoph Hellwig <[email protected]>

Since it's an opportunistic writeback and not a data integrity action,
don't punt to blocking writeback. Just wakeup the thread and it will
flush old data.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 46 ++++++++++++++--------------------------------
1 files changed, 14 insertions(+), 32 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index da86ef5..1873fd0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -75,13 +75,6 @@ static inline void bdi_work_init(struct bdi_work *work,
work->state = WS_USED;
}

-static inline void bdi_work_init_on_stack(struct bdi_work *work,
- struct writeback_control *wbc)
-{
- bdi_work_init(work, wbc);
- work->state |= WS_ONSTACK;
-}
-
/**
* writeback_in_progress - determine whether there is writeback in progress
* @bdi: the device's backing_dev_info structure.
@@ -207,34 +200,23 @@ static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)

void bdi_start_writeback(struct writeback_control *wbc)
{
- const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
- struct bdi_work work_stack, *work = NULL;
-
- if (!must_wait)
- work = bdi_alloc_work(wbc);
+ /*
+ * WB_SYNC_NONE is opportunistic writeback. If this allocation fails,
+ * bdi_queue_work() will wake up the thread and flush old data. This
+ * should ensure some amount of progress in freeing memory.
+ */
+ if (wbc->sync_mode != WB_SYNC_ALL) {
+ struct bdi_work *w = bdi_alloc_work(wbc);

- if (!work) {
- work = &work_stack;
- bdi_work_init_on_stack(work, wbc);
- }
+ bdi_queue_work(wbc->bdi, w);
+ } else {
+ struct bdi_work work;

- bdi_queue_work(wbc->bdi, work);
+ bdi_work_init(&work, wbc);
+ work.state |= WS_ONSTACK;

- /*
- * If the sync mode is WB_SYNC_ALL, block waiting for the work to
- * complete. If not, we only need to wait for the work to be started,
- * if we allocated it on-stack. We use the same mechanism, if the
- * wait bit is set in the bdi_work struct, then threads will not
- * clear pending until after they are done.
- *
- * Note that work == &work_stack if must_wait is true, so we don't
- * need to do call_rcu() here ever, since the completion path will
- * have done that for us.
- */
- if (must_wait || work == &work_stack) {
- bdi_wait_on_work_clear(work);
- if (work != &work_stack)
- call_rcu(&work->rcu_head, bdi_work_free);
+ bdi_queue_work(wbc->bdi, &work);
+ bdi_wait_on_work_clear(&work);
}
}

--
1.6.4.1.207.g68ea

2009-09-14 09:37:39

by Jens Axboe

Subject: [PATCH 2/7] Assign bdi in super_block

We do this automatically in get_sb_bdev() from the set_bdev_super()
callback. Filesystems that have their own private backing_dev_info
must assign that in ->fill_super().

Note that ->s_bdi assignment is required for proper writeback!

Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
fs/btrfs/disk-io.c | 1 +
fs/fuse/inode.c | 2 ++
fs/super.c | 6 ++++++
fs/sync.c | 9 ++++++++-
fs/ubifs/super.c | 1 +
include/linux/fs.h | 1 +
6 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 15831d5..8b81927 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1600,6 +1600,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,

sb->s_blocksize = 4096;
sb->s_blocksize_bits = blksize_bits(4096);
+ sb->s_bdi = &fs_info->bdi;

/*
* we set the i_size on the btree inode to the max possible int.
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 4567db6..e5dbecd 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -894,6 +894,8 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (err)
goto err_put_conn;

+ sb->s_bdi = &fc->bdi;
+
/* Handle umasking inside the fuse code */
if (sb->s_flags & MS_POSIXACL)
fc->dont_mask = 1;
diff --git a/fs/super.c b/fs/super.c
index 9cda337..b03fea8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -707,6 +707,12 @@ static int set_bdev_super(struct super_block *s, void *data)
{
s->s_bdev = data;
s->s_dev = s->s_bdev->bd_dev;
+
+ /*
+ * We set the bdi here to the queue backing, file systems can
+ * overwrite this in ->fill_super()
+ */
+ s->s_bdi = &bdev_get_queue(s->s_bdev)->backing_dev_info;
return 0;
}

diff --git a/fs/sync.c b/fs/sync.c
index 103cc7f..8582c34 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -27,6 +27,13 @@
*/
static int __sync_filesystem(struct super_block *sb, int wait)
{
+ /*
+ * This should be safe, as we require bdi backing to actually
+ * write out data in the first place
+ */
+ if (!sb->s_bdi)
+ return 0;
+
/* Avoid doing twice syncing and cache pruning for quota sync */
if (!wait) {
writeout_quota_sb(sb, -1);
@@ -101,7 +108,7 @@ restart:
spin_unlock(&sb_lock);

down_read(&sb->s_umount);
- if (!(sb->s_flags & MS_RDONLY) && sb->s_root)
+ if (!(sb->s_flags & MS_RDONLY) && sb->s_root && sb->s_bdi)
__sync_filesystem(sb, wait);
up_read(&sb->s_umount);

diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 51763aa..c4af069 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1980,6 +1980,7 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
if (err)
goto out_bdi;

+ sb->s_bdi = &c->bdi;
sb->s_fs_info = c;
sb->s_magic = UBIFS_SUPER_MAGIC;
sb->s_blocksize = UBIFS_BLOCK_SIZE;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a79f483..f998adc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1343,6 +1343,7 @@ struct super_block {
int s_nr_dentry_unused; /* # of dentry on lru */

struct block_device *s_bdev;
+ struct backing_dev_info *s_bdi;
struct mtd_info *s_mtd;
struct list_head s_instances;
struct quota_info s_dquot; /* Diskquota specific options */
--
1.6.4.1.207.g68ea

2009-09-14 09:37:58

by Jens Axboe

Subject: [PATCH 3/7] writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeout

Data integrity writeback must use bdi_start_writeback() and ensure
that wbc->sb and wbc->bdi are set.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 69 ++++++++++------------------------------------------
1 files changed, 14 insertions(+), 55 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1873fd0..5d4bd1c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -187,7 +187,8 @@ static void bdi_wait_on_work_clear(struct bdi_work *work)
TASK_UNINTERRUPTIBLE);
}

-static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
+static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
+ struct writeback_control *wbc)
{
struct bdi_work *work;

@@ -195,7 +196,7 @@ static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
if (work)
bdi_work_init(work, wbc);

- return work;
+ bdi_queue_work(bdi, work);
}

void bdi_start_writeback(struct writeback_control *wbc)
@@ -205,11 +206,9 @@ void bdi_start_writeback(struct writeback_control *wbc)
* bdi_queue_work() will wake up the thread and flush old data. This
* should ensure some amount of progress in freeing memory.
*/
- if (wbc->sync_mode != WB_SYNC_ALL) {
- struct bdi_work *w = bdi_alloc_work(wbc);
-
- bdi_queue_work(wbc->bdi, w);
- } else {
+ if (wbc->sync_mode != WB_SYNC_ALL)
+ bdi_alloc_queue_work(wbc->bdi, wbc);
+ else {
struct bdi_work work;

bdi_work_init(&work, wbc);
@@ -840,67 +839,26 @@ int bdi_writeback_task(struct bdi_writeback *wb)
}

/*
- * Schedule writeback for all backing devices. Expensive! If this is a data
- * integrity operation, writeback will be complete when this returns. If
- * we are simply called for WB_SYNC_NONE, then writeback will merely be
- * scheduled to run.
+ * Schedule writeback for all backing devices. Can only be used for
+ * WB_SYNC_NONE writeback, WB_SYNC_ALL should use bdi_start_writeback()
+ * and pass in the superblock.
*/
static void bdi_writeback_all(struct writeback_control *wbc)
{
- const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
struct backing_dev_info *bdi;
- struct bdi_work *work;
- LIST_HEAD(list);

-restart:
+ WARN_ON(wbc->sync_mode == WB_SYNC_ALL);
+
spin_lock(&bdi_lock);

list_for_each_entry(bdi, &bdi_list, bdi_list) {
- struct bdi_work *work;
-
if (!bdi_has_dirty_io(bdi))
continue;

- /*
- * If work allocation fails, do the writes inline. We drop
- * the lock and restart the list writeout. This should be OK,
- * since this happens rarely and because the writeout should
- * eventually make more free memory available.
- */
- work = bdi_alloc_work(wbc);
- if (!work) {
- struct writeback_control __wbc;
-
- /*
- * Not a data integrity writeout, just continue
- */
- if (!must_wait)
- continue;
-
- spin_unlock(&bdi_lock);
- __wbc = *wbc;
- __wbc.bdi = bdi;
- writeback_inodes_wbc(&__wbc);
- goto restart;
- }
- if (must_wait)
- list_add_tail(&work->wait_list, &list);
-
- bdi_queue_work(bdi, work);
+ bdi_alloc_queue_work(bdi, wbc);
}

spin_unlock(&bdi_lock);
-
- /*
- * If this is for WB_SYNC_ALL, wait for pending work to complete
- * before returning.
- */
- while (!list_empty(&list)) {
- work = list_entry(list.next, struct bdi_work, wait_list);
- list_del(&work->wait_list);
- bdi_wait_on_work_clear(work);
- call_rcu(&work->rcu_head, bdi_work_free);
- }
}

/*
@@ -1157,6 +1115,7 @@ long sync_inodes_sb(struct super_block *sb)
{
struct writeback_control wbc = {
.sb = sb,
+ .bdi = sb->s_bdi,
.sync_mode = WB_SYNC_ALL,
.range_start = 0,
.range_end = LLONG_MAX,
@@ -1164,7 +1123,7 @@ long sync_inodes_sb(struct super_block *sb)
long nr_to_write = LONG_MAX; /* doesn't actually matter */

wbc.nr_to_write = nr_to_write;
- bdi_writeback_all(&wbc);
+ bdi_start_writeback(&wbc);
wait_sb_inodes(&wbc);
return nr_to_write - wbc.nr_to_write;
}
--
1.6.4.1.207.g68ea

2009-09-14 09:37:33

by Jens Axboe

Subject: [PATCH 4/7] writeback: use RCU to protect bdi_list

Now that bdi_writeback_all() no longer handles integrity writeback,
it doesn't have to block anymore. This means that we can switch
bdi_list reader side protection to RCU.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 4 +-
include/linux/backing-dev.h | 1 +
mm/backing-dev.c | 76 +++++++++++++++++++++++++++++++------------
mm/page-writeback.c | 8 ++--
4 files changed, 62 insertions(+), 27 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5d4bd1c..d7592c7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -849,7 +849,7 @@ static void bdi_writeback_all(struct writeback_control *wbc)

WARN_ON(wbc->sync_mode == WB_SYNC_ALL);

- spin_lock(&bdi_lock);
+ rcu_read_lock();

list_for_each_entry(bdi, &bdi_list, bdi_list) {
if (!bdi_has_dirty_io(bdi))
@@ -858,7 +858,7 @@ static void bdi_writeback_all(struct writeback_control *wbc)
bdi_alloc_queue_work(bdi, wbc);
}

- spin_unlock(&bdi_lock);
+ rcu_read_unlock();
}

/*
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f169bcb..859e797 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -59,6 +59,7 @@ struct bdi_writeback {

struct backing_dev_info {
struct list_head bdi_list;
+ struct rcu_head rcu_head;
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned long state; /* Always use atomic bitops on this */
unsigned int capabilities; /* Device capabilities */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d3ca0da..fd93566 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -26,6 +26,12 @@ struct backing_dev_info default_backing_dev_info = {
EXPORT_SYMBOL_GPL(default_backing_dev_info);

static struct class *bdi_class;
+
+/*
+ * bdi_lock protects updates to bdi_list and bdi_pending_list, as well as
+ * reader side protection for bdi_pending_list. bdi_list has RCU reader side
+ * locking.
+ */
DEFINE_SPINLOCK(bdi_lock);
LIST_HEAD(bdi_list);
LIST_HEAD(bdi_pending_list);
@@ -284,9 +290,9 @@ static int bdi_start_fn(void *ptr)
/*
* Add us to the active bdi_list
*/
- spin_lock(&bdi_lock);
- list_add(&bdi->bdi_list, &bdi_list);
- spin_unlock(&bdi_lock);
+ spin_lock_bh(&bdi_lock);
+ list_add_rcu(&bdi->bdi_list, &bdi_list);
+ spin_unlock_bh(&bdi_lock);

bdi_task_init(bdi, wb);

@@ -389,7 +395,7 @@ static int bdi_forker_task(void *ptr)
if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list))
wb_do_writeback(me, 0);

- spin_lock(&bdi_lock);
+ spin_lock_bh(&bdi_lock);

/*
* Check if any existing bdi's have dirty data without
@@ -410,7 +416,7 @@ static int bdi_forker_task(void *ptr)
if (list_empty(&bdi_pending_list)) {
unsigned long wait;

- spin_unlock(&bdi_lock);
+ spin_unlock_bh(&bdi_lock);
wait = msecs_to_jiffies(dirty_writeback_interval * 10);
schedule_timeout(wait);
try_to_freeze();
@@ -426,7 +432,7 @@ static int bdi_forker_task(void *ptr)
bdi = list_entry(bdi_pending_list.next, struct backing_dev_info,
bdi_list);
list_del_init(&bdi->bdi_list);
- spin_unlock(&bdi_lock);
+ spin_unlock_bh(&bdi_lock);

wb = &bdi->wb;
wb->task = kthread_run(bdi_start_fn, wb, "flush-%s",
@@ -445,9 +451,9 @@ static int bdi_forker_task(void *ptr)
* a chance to flush other bdi's to free
* memory.
*/
- spin_lock(&bdi_lock);
+ spin_lock_bh(&bdi_lock);
list_add_tail(&bdi->bdi_list, &bdi_pending_list);
- spin_unlock(&bdi_lock);
+ spin_unlock_bh(&bdi_lock);

bdi_flush_io(bdi);
}
@@ -456,6 +462,24 @@ static int bdi_forker_task(void *ptr)
return 0;
}

+static void bdi_add_to_pending(struct rcu_head *head)
+{
+ struct backing_dev_info *bdi;
+
+ bdi = container_of(head, struct backing_dev_info, rcu_head);
+ INIT_LIST_HEAD(&bdi->bdi_list);
+
+ spin_lock(&bdi_lock);
+ list_add_tail(&bdi->bdi_list, &bdi_pending_list);
+ spin_unlock(&bdi_lock);
+
+ /*
+ * We are now on the pending list, wake up bdi_forker_task()
+ * to finish the job and add us back to the active bdi_list
+ */
+ wake_up_process(default_backing_dev_info.wb.task);
+}
+
/*
* Add the default flusher task that gets created for any bdi
* that has dirty data pending writeout
@@ -478,16 +502,29 @@ void static bdi_add_default_flusher_task(struct backing_dev_info *bdi)
* waiting for previous additions to finish.
*/
if (!test_and_set_bit(BDI_pending, &bdi->state)) {
- list_move_tail(&bdi->bdi_list, &bdi_pending_list);
+ list_del_rcu(&bdi->bdi_list);

/*
- * We are now on the pending list, wake up bdi_forker_task()
- * to finish the job and add us back to the active bdi_list
+ * We must wait for the current RCU period to end before
+ * moving to the pending list. So schedule that operation
+ * from an RCU callback.
*/
- wake_up_process(default_backing_dev_info.wb.task);
+ call_rcu(&bdi->rcu_head, bdi_add_to_pending);
}
}

+/*
+ * Remove bdi from bdi_list, and ensure that it is no longer visible
+ */
+static void bdi_remove_from_list(struct backing_dev_info *bdi)
+{
+ spin_lock_bh(&bdi_lock);
+ list_del_rcu(&bdi->bdi_list);
+ spin_unlock_bh(&bdi_lock);
+
+ synchronize_rcu();
+}
+
int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...)
{
@@ -506,9 +543,9 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
goto exit;
}

- spin_lock(&bdi_lock);
- list_add_tail(&bdi->bdi_list, &bdi_list);
- spin_unlock(&bdi_lock);
+ spin_lock_bh(&bdi_lock);
+ list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
+ spin_unlock_bh(&bdi_lock);

bdi->dev = dev;

@@ -526,9 +563,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
wb->task = NULL;
ret = -ENOMEM;

- spin_lock(&bdi_lock);
- list_del(&bdi->bdi_list);
- spin_unlock(&bdi_lock);
+ bdi_remove_from_list(bdi);
goto exit;
}
}
@@ -565,9 +600,7 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
/*
* Make sure nobody finds us on the bdi_list anymore
*/
- spin_lock(&bdi_lock);
- list_del(&bdi->bdi_list);
- spin_unlock(&bdi_lock);
+ bdi_remove_from_list(bdi);

/*
* Finally, kill the kernel threads. We don't need to be RCU
@@ -599,6 +632,7 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->max_ratio = 100;
bdi->max_prop_frac = PROP_FRAC_BASE;
spin_lock_init(&bdi->wb_lock);
+ INIT_RCU_HEAD(&bdi->rcu_head);
INIT_LIST_HEAD(&bdi->bdi_list);
INIT_LIST_HEAD(&bdi->wb_list);
INIT_LIST_HEAD(&bdi->work_list);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 25e7770..a5f0f76 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -315,7 +315,7 @@ int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
{
int ret = 0;

- spin_lock(&bdi_lock);
+ spin_lock_bh(&bdi_lock);
if (min_ratio > bdi->max_ratio) {
ret = -EINVAL;
} else {
@@ -327,7 +327,7 @@ int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
ret = -EINVAL;
}
}
- spin_unlock(&bdi_lock);
+ spin_unlock_bh(&bdi_lock);

return ret;
}
@@ -339,14 +339,14 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
if (max_ratio > 100)
return -EINVAL;

- spin_lock(&bdi_lock);
+ spin_lock_bh(&bdi_lock);
if (bdi->min_ratio > max_ratio) {
ret = -EINVAL;
} else {
bdi->max_ratio = max_ratio;
bdi->max_prop_frac = (PROP_FRAC_BASE * max_ratio) / 100;
}
- spin_unlock(&bdi_lock);
+ spin_unlock_bh(&bdi_lock);

return ret;
}
--
1.6.4.1.207.g68ea

2009-09-14 09:37:06

by Jens Axboe

Subject: [PATCH 5/7] writeback: inline allocation failure handling in bdi_alloc_queue_work()

This gets rid of work == NULL in bdi_queue_work() and puts the
OOM handling where it belongs.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 55 +++++++++++++++++++++++++++-------------------------
1 files changed, 29 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d7592c7..5cd8b3b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -139,21 +139,19 @@ static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)

static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
{
- if (work) {
- work->seen = bdi->wb_mask;
- BUG_ON(!work->seen);
- atomic_set(&work->pending, bdi->wb_cnt);
- BUG_ON(!bdi->wb_cnt);
+ work->seen = bdi->wb_mask;
+ BUG_ON(!work->seen);
+ atomic_set(&work->pending, bdi->wb_cnt);
+ BUG_ON(!bdi->wb_cnt);

- /*
- * Make sure stores are seen before it appears on the list
- */
- smp_mb();
+ /*
+ * Make sure stores are seen before it appears on the list
+ */
+ smp_mb();

- spin_lock(&bdi->wb_lock);
- list_add_tail_rcu(&work->list, &bdi->work_list);
- spin_unlock(&bdi->wb_lock);
- }
+ spin_lock(&bdi->wb_lock);
+ list_add_tail_rcu(&work->list, &bdi->work_list);
+ spin_unlock(&bdi->wb_lock);

/*
* If the default thread isn't there, make sure we add it. When
@@ -165,14 +163,12 @@ static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
struct bdi_writeback *wb = &bdi->wb;

/*
- * If we failed allocating the bdi work item, wake up the wb
- * thread always. As a safety precaution, it'll flush out
- * everything
+ * End work now if this wb has no dirty IO pending. Otherwise
+ * wakeup the handling thread
*/
- if (!wb_has_dirty_io(wb)) {
- if (work)
- wb_clear_pending(wb, work);
- } else if (wb->task)
+ if (!wb_has_dirty_io(wb))
+ wb_clear_pending(wb, work);
+ else if (wb->task)
wake_up_process(wb->task);
}
}
@@ -192,11 +188,20 @@ static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
{
struct bdi_work *work;

+ /*
+ * This is WB_SYNC_NONE writeback, so if allocation fails just
+ * wakeup the thread for old dirty data writeback
+ */
work = kmalloc(sizeof(*work), GFP_ATOMIC);
- if (work)
+ if (work) {
bdi_work_init(work, wbc);
+ bdi_queue_work(bdi, work);
+ } else {
+ struct bdi_writeback *wb = &bdi->wb;

- bdi_queue_work(bdi, work);
+ if (wb->task)
+ wake_up_process(wb->task);
+ }
}

void bdi_start_writeback(struct writeback_control *wbc)
@@ -852,10 +857,8 @@ static void bdi_writeback_all(struct writeback_control *wbc)
rcu_read_lock();

list_for_each_entry(bdi, &bdi_list, bdi_list) {
- if (!bdi_has_dirty_io(bdi))
- continue;
-
- bdi_alloc_queue_work(bdi, wbc);
+ if (bdi_has_dirty_io(bdi))
+ bdi_alloc_queue_work(bdi, wbc);
}

rcu_read_unlock();
--
1.6.4.1.207.g68ea

2009-09-14 09:36:50

by Jens Axboe

Subject: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

bdi_start_writeback() is currently split into two paths, one for
WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
only WB_SYNC_NONE.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 51 ++++++++++++++++++++++++++++++++++-----------------
mm/page-writeback.c | 1 -
2 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5cd8b3b..64ca471 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -204,24 +204,42 @@ static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
}
}

-void bdi_start_writeback(struct writeback_control *wbc)
+/**
+ * bdi_sync_writeback - start and wait for writeback
+ * @wbc: writeback parameters
+ *
+ * Description:
+ * This does WB_SYNC_ALL data integrity writeback and waits for the
+ * IO to complete. Callers must hold the sb s_umount semaphore for
+ * reading, to avoid having the super disappear before we are done.
+ */
+static void bdi_sync_writeback(struct writeback_control *wbc)
{
- /*
- * WB_SYNC_NONE is opportunistic writeback. If this allocation fails,
- * bdi_queue_work() will wake up the thread and flush old data. This
- * should ensure some amount of progress in freeing memory.
- */
- if (wbc->sync_mode != WB_SYNC_ALL)
- bdi_alloc_queue_work(wbc->bdi, wbc);
- else {
- struct bdi_work work;
+ struct bdi_work work;

- bdi_work_init(&work, wbc);
- work.state |= WS_ONSTACK;
+ wbc->sync_mode = WB_SYNC_ALL;

- bdi_queue_work(wbc->bdi, &work);
- bdi_wait_on_work_clear(&work);
- }
+ bdi_work_init(&work, wbc);
+ work.state |= WS_ONSTACK;
+
+ bdi_queue_work(wbc->bdi, &work);
+ bdi_wait_on_work_clear(&work);
+}
+
+/**
+ * bdi_start_writeback - start writeback
+ * @wbc: writeback parameters
+ *
+ * Description:
+ * This does WB_SYNC_NONE opportunistic writeback. The IO is only
+ * started when this function returns, we make no guarantees on
+ * completion. Caller need not hold sb s_umount semaphore.
+ *
+ */
+void bdi_start_writeback(struct writeback_control *wbc)
+{
+ wbc->sync_mode = WB_SYNC_NONE;
+ bdi_alloc_queue_work(wbc->bdi, wbc);
}

/*
@@ -1119,14 +1137,13 @@ long sync_inodes_sb(struct super_block *sb)
struct writeback_control wbc = {
.sb = sb,
.bdi = sb->s_bdi,
- .sync_mode = WB_SYNC_ALL,
.range_start = 0,
.range_end = LLONG_MAX,
};
long nr_to_write = LONG_MAX; /* doesn't actually matter */

wbc.nr_to_write = nr_to_write;
- bdi_start_writeback(&wbc);
+ bdi_sync_writeback(&wbc);
wait_sb_inodes(&wbc);
return nr_to_write - wbc.nr_to_write;
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a5f0f76..f61f0cc 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -585,7 +585,6 @@ static void balance_dirty_pages(struct address_space *mapping)
> background_thresh))) {
struct writeback_control wbc = {
.bdi = bdi,
- .sync_mode = WB_SYNC_NONE,
.nr_to_write = nr_writeback,
};

--
1.6.4.1.207.g68ea

2009-09-14 09:36:40

by Jens Axboe

Subject: [PATCH 7/7] writeback: splice dirty inode entries to default bdi on bdi_destroy()

We cannot safely ensure that the inodes are all gone at this point
in time, and we must not destroy this bdi while inodes are still
hanging off it. So just splice our entries to the default bdi, since
that one will always persist.

Signed-off-by: Jens Axboe <[email protected]>
---
mm/backing-dev.c | 12 +++++++++++-
1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index fd93566..27f82dc 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -668,7 +668,17 @@ void bdi_destroy(struct backing_dev_info *bdi)
{
int i;

- WARN_ON(bdi_has_dirty_io(bdi));
+ /*
+ * Splice our entries to the default_backing_dev_info, if this
+ * bdi disappears
+ */
+ if (bdi_has_dirty_io(bdi)) {
+ struct bdi_writeback *dst = &default_backing_dev_info.wb;
+
+ list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
+ list_splice(&bdi->wb.b_io, &dst->b_io);
+ list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+ }

bdi_unregister(bdi);

--
1.6.4.1.207.g68ea

2009-09-14 10:56:00

by Jens Axboe

Subject: Re: [PATCH 7/7] writeback: splice dirty inode entries to default bdi on bdi_destroy()

On Mon, Sep 14 2009, Jens Axboe wrote:
> We cannot safely ensure that the inodes are all gone at this point
> in time, and we must not destroy this bdi while inodes are still
> hanging off it. So just splice our entries to the default bdi, since
> that one will always persist.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> mm/backing-dev.c | 12 +++++++++++-
> 1 files changed, 11 insertions(+), 1 deletions(-)
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index fd93566..27f82dc 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -668,7 +668,17 @@ void bdi_destroy(struct backing_dev_info *bdi)
> {
> int i;
>
> - WARN_ON(bdi_has_dirty_io(bdi));
> + /*
> + * Splice our entries to the default_backing_dev_info, if this
> + * bdi disappears
> + */
> + if (bdi_has_dirty_io(bdi)) {
> + struct bdi_writeback *dst = &default_backing_dev_info.wb;
> +
> + list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
> + list_splice(&bdi->wb.b_io, &dst->b_io);
> + list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
> + }

This needs locking, of course...

--
Jens Axboe

2009-09-14 11:10:20

by Minchan Kim

Subject: Re: [PATCH 4/7] writeback: use RCU to protect bdi_list

On Mon, 14 Sep 2009 11:36:31 +0200
Jens Axboe <[email protected]> wrote:

> Now that bdi_writeback_all() no longer handles integrity writeback,
> it doesn't have to block anymore. This means that we can switch
> bdi_list reader side protection to RCU.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> fs/fs-writeback.c | 4 +-
> include/linux/backing-dev.h | 1 +
> mm/backing-dev.c | 76 +++++++++++++++++++++++++++++++------------
> mm/page-writeback.c | 8 ++--
> 4 files changed, 62 insertions(+), 27 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5d4bd1c..d7592c7 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -849,7 +849,7 @@ static void bdi_writeback_all(struct writeback_control *wbc)
>
> WARN_ON(wbc->sync_mode == WB_SYNC_ALL);
>
> - spin_lock(&bdi_lock);
> + rcu_read_lock();
>
> list_for_each_entry(bdi, &bdi_list, bdi_list) {

list_for_each_entry_rcu?

--
Kind regards,
Minchan Kim

2009-09-14 11:11:20

by Jens Axboe

Subject: Re: [PATCH 4/7] writeback: use RCU to protect bdi_list

On Mon, Sep 14 2009, Minchan Kim wrote:
> On Mon, 14 Sep 2009 11:36:31 +0200
> Jens Axboe <[email protected]> wrote:
>
> > Now that bdi_writeback_all() no longer handles integrity writeback,
> > it doesn't have to block anymore. This means that we can switch
> > bdi_list reader side protection to RCU.
> >
> > Signed-off-by: Jens Axboe <[email protected]>
> > ---
> > fs/fs-writeback.c | 4 +-
> > include/linux/backing-dev.h | 1 +
> > mm/backing-dev.c | 76 +++++++++++++++++++++++++++++++------------
> > mm/page-writeback.c | 8 ++--
> > 4 files changed, 62 insertions(+), 27 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 5d4bd1c..d7592c7 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -849,7 +849,7 @@ static void bdi_writeback_all(struct writeback_control *wbc)
> >
> > WARN_ON(wbc->sync_mode == WB_SYNC_ALL);
> >
> > - spin_lock(&bdi_lock);
> > + rcu_read_lock();
> >
> > list_for_each_entry(bdi, &bdi_list, bdi_list) {
>
> list_for_each_entry_rcu?

Indeed, pretty silly. Thanks!

--
Jens Axboe

2009-09-14 12:34:26

by Jan Kara

Subject: Re: [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE

On Mon 14-09-09 11:36:28, Jens Axboe wrote:
> From: Christoph Hellwig <[email protected]>
>
> Since it's an opportunistic writeback and not a data integrity action,
> don't punt to blocking writeback. Just wakeup the thread and it will
> flush old data.
>
> Signed-off-by: Jens Axboe <[email protected]>
Looks good. Acked-by: Jan Kara <[email protected]>
BTW, don't we miss Christoph's Signed-off-by?

Honza

> ---
> fs/fs-writeback.c | 46 ++++++++++++++--------------------------------
> 1 files changed, 14 insertions(+), 32 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index da86ef5..1873fd0 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -75,13 +75,6 @@ static inline void bdi_work_init(struct bdi_work *work,
> work->state = WS_USED;
> }
>
> -static inline void bdi_work_init_on_stack(struct bdi_work *work,
> - struct writeback_control *wbc)
> -{
> - bdi_work_init(work, wbc);
> - work->state |= WS_ONSTACK;
> -}
> -
> /**
> * writeback_in_progress - determine whether there is writeback in progress
> * @bdi: the device's backing_dev_info structure.
> @@ -207,34 +200,23 @@ static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
>
> void bdi_start_writeback(struct writeback_control *wbc)
> {
> - const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
> - struct bdi_work work_stack, *work = NULL;
> -
> - if (!must_wait)
> - work = bdi_alloc_work(wbc);
> + /*
> + * WB_SYNC_NONE is opportunistic writeback. If this allocation fails,
> + * bdi_queue_work() will wake up the thread and flush old data. This
> + * should ensure some amount of progress in freeing memory.
> + */
> + if (wbc->sync_mode != WB_SYNC_ALL) {
> + struct bdi_work *w = bdi_alloc_work(wbc);
>
> - if (!work) {
> - work = &work_stack;
> - bdi_work_init_on_stack(work, wbc);
> - }
> + bdi_queue_work(wbc->bdi, w);
> + } else {
> + struct bdi_work work;
>
> - bdi_queue_work(wbc->bdi, work);
> + bdi_work_init(&work, wbc);
> + work.state |= WS_ONSTACK;
>
> - /*
> - * If the sync mode is WB_SYNC_ALL, block waiting for the work to
> - * complete. If not, we only need to wait for the work to be started,
> - * if we allocated it on-stack. We use the same mechanism, if the
> - * wait bit is set in the bdi_work struct, then threads will not
> - * clear pending until after they are done.
> - *
> - * Note that work == &work_stack if must_wait is true, so we don't
> - * need to do call_rcu() here ever, since the completion path will
> - * have done that for us.
> - */
> - if (must_wait || work == &work_stack) {
> - bdi_wait_on_work_clear(work);
> - if (work != &work_stack)
> - call_rcu(&work->rcu_head, bdi_work_free);
> + bdi_queue_work(wbc->bdi, &work);
> + bdi_wait_on_work_clear(&work);
> }
> }
>
> --
> 1.6.4.1.207.g68ea
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-14 12:39:27

by Jens Axboe

Subject: Re: [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE

On Mon, Sep 14 2009, Jan Kara wrote:
> On Mon 14-09-09 11:36:28, Jens Axboe wrote:
> > From: Christoph Hellwig <[email protected]>
> >
> > Since it's an opportunistic writeback and not a data integrity action,
> > don't punt to blocking writeback. Just wakeup the thread and it will
> > flush old data.
> >
> > Signed-off-by: Jens Axboe <[email protected]>
> Looks good. Acked-by: Jan Kara <[email protected]>

Thanks!

> BTW, don't we miss Christoph's Signed-off-by?

We do, he didn't include one.

--
Jens Axboe

2009-09-14 13:02:30

by Jan Kara

Subject: Re: [PATCH 2/7] Assign bdi in super_block

On Mon 14-09-09 11:36:29, Jens Axboe wrote:
> We do this automatically in get_sb_bdev() from the set_bdev_super()
> callback. Filesystems that have their own private backing_dev_info
> must assign that in ->fill_super().
>
> Note that ->s_bdi assignment is required for proper writeback!
>
> Acked-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Jens Axboe <[email protected]>
Hmm, looking at this again, I'm not sure this will work for NFS. It seems
to set mapping->backing_dev_info to its private backing dev info for
regular files while it leaves it intact for other inodes (e.g.
directories). I'm not sure why it does so but it seems its inodes end up on
two different BDI lists and thus they wouldn't be synced properly. Trond,
do I read the code properly?
Also we definitely need to set *some* bdi in nfs_get_sb as otherwise sync
won't work for it.

Honza
> ---
> fs/btrfs/disk-io.c | 1 +
> fs/fuse/inode.c | 2 ++
> fs/super.c | 6 ++++++
> fs/sync.c | 9 ++++++++-
> fs/ubifs/super.c | 1 +
> include/linux/fs.h | 1 +
> 6 files changed, 19 insertions(+), 1 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 15831d5..8b81927 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1600,6 +1600,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
>
> sb->s_blocksize = 4096;
> sb->s_blocksize_bits = blksize_bits(4096);
> + sb->s_bdi = &fs_info->bdi;
>
> /*
> * we set the i_size on the btree inode to the max possible int.
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 4567db6..e5dbecd 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -894,6 +894,8 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
> if (err)
> goto err_put_conn;
>
> + sb->s_bdi = &fc->bdi;
> +
> /* Handle umasking inside the fuse code */
> if (sb->s_flags & MS_POSIXACL)
> fc->dont_mask = 1;
> diff --git a/fs/super.c b/fs/super.c
> index 9cda337..b03fea8 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -707,6 +707,12 @@ static int set_bdev_super(struct super_block *s, void *data)
> {
> s->s_bdev = data;
> s->s_dev = s->s_bdev->bd_dev;
> +
> + /*
> + * We set the bdi here to the queue backing, file systems can
> + * overwrite this in ->fill_super()
> + */
> + s->s_bdi = &bdev_get_queue(s->s_bdev)->backing_dev_info;
> return 0;
> }
>
> diff --git a/fs/sync.c b/fs/sync.c
> index 103cc7f..8582c34 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -27,6 +27,13 @@
> */
> static int __sync_filesystem(struct super_block *sb, int wait)
> {
> + /*
> + * This should be safe, as we require bdi backing to actually
> + * write out data in the first place
> + */
> + if (!sb->s_bdi)
> + return 0;
> +
> /* Avoid doing twice syncing and cache pruning for quota sync */
> if (!wait) {
> writeout_quota_sb(sb, -1);
> @@ -101,7 +108,7 @@ restart:
> spin_unlock(&sb_lock);
>
> down_read(&sb->s_umount);
> - if (!(sb->s_flags & MS_RDONLY) && sb->s_root)
> + if (!(sb->s_flags & MS_RDONLY) && sb->s_root && sb->s_bdi)
> __sync_filesystem(sb, wait);
> up_read(&sb->s_umount);
>
> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> index 51763aa..c4af069 100644
> --- a/fs/ubifs/super.c
> +++ b/fs/ubifs/super.c
> @@ -1980,6 +1980,7 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
> if (err)
> goto out_bdi;
>
> + sb->s_bdi = &c->bdi;
> sb->s_fs_info = c;
> sb->s_magic = UBIFS_SUPER_MAGIC;
> sb->s_blocksize = UBIFS_BLOCK_SIZE;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a79f483..f998adc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1343,6 +1343,7 @@ struct super_block {
> int s_nr_dentry_unused; /* # of dentry on lru */
>
> struct block_device *s_bdev;
> + struct backing_dev_info *s_bdi;
> struct mtd_info *s_mtd;
> struct list_head s_instances;
> struct quota_info s_dquot; /* Diskquota specific options */
> --
> 1.6.4.1.207.g68ea
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-14 13:12:48

by Jan Kara

Subject: Re: [PATCH 3/7] writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeout

On Mon 14-09-09 11:36:30, Jens Axboe wrote:
> Data integrity writeback must use bdi_start_writeback() and ensure
> that wbc->sb and wbc->bdi are set.
This patch looks good.
Acked-by: Jan Kara <[email protected]>

Honza

>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> fs/fs-writeback.c | 69 ++++++++++------------------------------------------
> 1 files changed, 14 insertions(+), 55 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 1873fd0..5d4bd1c 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -187,7 +187,8 @@ static void bdi_wait_on_work_clear(struct bdi_work *work)
> TASK_UNINTERRUPTIBLE);
> }
>
> -static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
> +static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
> + struct writeback_control *wbc)
> {
> struct bdi_work *work;
>
> @@ -195,7 +196,7 @@ static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
> if (work)
> bdi_work_init(work, wbc);
>
> - return work;
> + bdi_queue_work(bdi, work);
> }
>
> void bdi_start_writeback(struct writeback_control *wbc)
> @@ -205,11 +206,9 @@ void bdi_start_writeback(struct writeback_control *wbc)
> * bdi_queue_work() will wake up the thread and flush old data. This
> * should ensure some amount of progress in freeing memory.
> */
> - if (wbc->sync_mode != WB_SYNC_ALL) {
> - struct bdi_work *w = bdi_alloc_work(wbc);
> -
> - bdi_queue_work(wbc->bdi, w);
> - } else {
> + if (wbc->sync_mode != WB_SYNC_ALL)
> + bdi_alloc_queue_work(wbc->bdi, wbc);
> + else {
> struct bdi_work work;
>
> bdi_work_init(&work, wbc);
> @@ -840,67 +839,26 @@ int bdi_writeback_task(struct bdi_writeback *wb)
> }
>
> /*
> - * Schedule writeback for all backing devices. Expensive! If this is a data
> - * integrity operation, writeback will be complete when this returns. If
> - * we are simply called for WB_SYNC_NONE, then writeback will merely be
> - * scheduled to run.
> + * Schedule writeback for all backing devices. Can only be used for
> + * WB_SYNC_NONE writeback, WB_SYNC_ALL should use bdi_start_writeback()
> + * and pass in the superblock.
> */
> static void bdi_writeback_all(struct writeback_control *wbc)
> {
> - const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
> struct backing_dev_info *bdi;
> - struct bdi_work *work;
> - LIST_HEAD(list);
>
> -restart:
> + WARN_ON(wbc->sync_mode == WB_SYNC_ALL);
> +
> spin_lock(&bdi_lock);
>
> list_for_each_entry(bdi, &bdi_list, bdi_list) {
> - struct bdi_work *work;
> -
> if (!bdi_has_dirty_io(bdi))
> continue;
>
> - /*
> - * If work allocation fails, do the writes inline. We drop
> - * the lock and restart the list writeout. This should be OK,
> - * since this happens rarely and because the writeout should
> - * eventually make more free memory available.
> - */
> - work = bdi_alloc_work(wbc);
> - if (!work) {
> - struct writeback_control __wbc;
> -
> - /*
> - * Not a data integrity writeout, just continue
> - */
> - if (!must_wait)
> - continue;
> -
> - spin_unlock(&bdi_lock);
> - __wbc = *wbc;
> - __wbc.bdi = bdi;
> - writeback_inodes_wbc(&__wbc);
> - goto restart;
> - }
> - if (must_wait)
> - list_add_tail(&work->wait_list, &list);
> -
> - bdi_queue_work(bdi, work);
> + bdi_alloc_queue_work(bdi, wbc);
> }
>
> spin_unlock(&bdi_lock);
> -
> - /*
> - * If this is for WB_SYNC_ALL, wait for pending work to complete
> - * before returning.
> - */
> - while (!list_empty(&list)) {
> - work = list_entry(list.next, struct bdi_work, wait_list);
> - list_del(&work->wait_list);
> - bdi_wait_on_work_clear(work);
> - call_rcu(&work->rcu_head, bdi_work_free);
> - }
> }
>
> /*
> @@ -1157,6 +1115,7 @@ long sync_inodes_sb(struct super_block *sb)
> {
> struct writeback_control wbc = {
> .sb = sb,
> + .bdi = sb->s_bdi,
> .sync_mode = WB_SYNC_ALL,
> .range_start = 0,
> .range_end = LLONG_MAX,
> @@ -1164,7 +1123,7 @@ long sync_inodes_sb(struct super_block *sb)
> long nr_to_write = LONG_MAX; /* doesn't actually matter */
>
> wbc.nr_to_write = nr_to_write;
> - bdi_writeback_all(&wbc);
> + bdi_start_writeback(&wbc);
> wait_sb_inodes(&wbc);
> return nr_to_write - wbc.nr_to_write;
> }
> --
> 1.6.4.1.207.g68ea
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-14 13:13:42

by Jan Kara

Subject: Re: [PATCH 5/7] writeback: inline allocation failure handling in bdi_alloc_queue_work()

On Mon 14-09-09 11:36:32, Jens Axboe wrote:
> This gets rid of work == NULL in bdi_queue_work() and puts the
> OOM handling where it belongs.
Looks good.

Acked-by: Jan Kara <[email protected]>

Honza
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> fs/fs-writeback.c | 55 +++++++++++++++++++++++++++-------------------------
> 1 files changed, 29 insertions(+), 26 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d7592c7..5cd8b3b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -139,21 +139,19 @@ static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
>
> static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
> {
> - if (work) {
> - work->seen = bdi->wb_mask;
> - BUG_ON(!work->seen);
> - atomic_set(&work->pending, bdi->wb_cnt);
> - BUG_ON(!bdi->wb_cnt);
> + work->seen = bdi->wb_mask;
> + BUG_ON(!work->seen);
> + atomic_set(&work->pending, bdi->wb_cnt);
> + BUG_ON(!bdi->wb_cnt);
>
> - /*
> - * Make sure stores are seen before it appears on the list
> - */
> - smp_mb();
> + /*
> + * Make sure stores are seen before it appears on the list
> + */
> + smp_mb();
>
> - spin_lock(&bdi->wb_lock);
> - list_add_tail_rcu(&work->list, &bdi->work_list);
> - spin_unlock(&bdi->wb_lock);
> - }
> + spin_lock(&bdi->wb_lock);
> + list_add_tail_rcu(&work->list, &bdi->work_list);
> + spin_unlock(&bdi->wb_lock);
>
> /*
> * If the default thread isn't there, make sure we add it. When
> @@ -165,14 +163,12 @@ static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
> struct bdi_writeback *wb = &bdi->wb;
>
> /*
> - * If we failed allocating the bdi work item, wake up the wb
> - * thread always. As a safety precaution, it'll flush out
> - * everything
> + * End work now if this wb has no dirty IO pending. Otherwise
> + * wakeup the handling thread
> */
> - if (!wb_has_dirty_io(wb)) {
> - if (work)
> - wb_clear_pending(wb, work);
> - } else if (wb->task)
> + if (!wb_has_dirty_io(wb))
> + wb_clear_pending(wb, work);
> + else if (wb->task)
> wake_up_process(wb->task);
> }
> }
> @@ -192,11 +188,20 @@ static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
> {
> struct bdi_work *work;
>
> + /*
> + * This is WB_SYNC_NONE writeback, so if allocation fails just
> + * wakeup the thread for old dirty data writeback
> + */
> work = kmalloc(sizeof(*work), GFP_ATOMIC);
> - if (work)
> + if (work) {
> bdi_work_init(work, wbc);
> + bdi_queue_work(bdi, work);
> + } else {
> + struct bdi_writeback *wb = &bdi->wb;
>
> - bdi_queue_work(bdi, work);
> + if (wb->task)
> + wake_up_process(wb->task);
> + }
> }
>
> void bdi_start_writeback(struct writeback_control *wbc)
> @@ -852,10 +857,8 @@ static void bdi_writeback_all(struct writeback_control *wbc)
> rcu_read_lock();
>
> list_for_each_entry(bdi, &bdi_list, bdi_list) {
> - if (!bdi_has_dirty_io(bdi))
> - continue;
> -
> - bdi_alloc_queue_work(bdi, wbc);
> + if (bdi_has_dirty_io(bdi))
> + bdi_alloc_queue_work(bdi, wbc);
> }
>
> rcu_read_unlock();
> --
> 1.6.4.1.207.g68ea
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-14 13:19:51

by Christoph Hellwig

Subject: Re: [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE

On Mon, Sep 14, 2009 at 02:39:29PM +0200, Jens Axboe wrote:
>
> > BTW, don't we miss Christoph's Signed-off-by?
>
> We do, he didn't include one.


Signed-off-by: Christoph Hellwig <[email protected]>

also for Jens' slightly improved version.

2009-09-14 13:33:08

by Jan Kara

Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> bdi_start_writeback() is currently split into two paths, one for
> WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> only WB_SYNC_NONE.
What I don't like about this patch is that if somebody sets up
writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
bdi_start_writeback() it will just silently convert his writeback to an
asynchronous one.
So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
it does not match the purpose of the function...

Honza

>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> fs/fs-writeback.c | 51 ++++++++++++++++++++++++++++++++++-----------------
> mm/page-writeback.c | 1 -
> 2 files changed, 34 insertions(+), 18 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5cd8b3b..64ca471 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -204,24 +204,42 @@ static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
> }
> }
>
> -void bdi_start_writeback(struct writeback_control *wbc)
> +/**
> + * bdi_sync_writeback - start and wait for writeback
> + * @wbc: writeback parameters
> + *
> + * Description:
> + * This does WB_SYNC_ALL data integrity writeback and waits for the
> + * IO to complete. Callers must hold the sb s_umount semaphore for
> + * reading, to avoid having the super disappear before we are done.
> + */
> +static void bdi_sync_writeback(struct writeback_control *wbc)
> {
> - /*
> - * WB_SYNC_NONE is opportunistic writeback. If this allocation fails,
> - * bdi_queue_work() will wake up the thread and flush old data. This
> - * should ensure some amount of progress in freeing memory.
> - */
> - if (wbc->sync_mode != WB_SYNC_ALL)
> - bdi_alloc_queue_work(wbc->bdi, wbc);
> - else {
> - struct bdi_work work;
> + struct bdi_work work;
>
> - bdi_work_init(&work, wbc);
> - work.state |= WS_ONSTACK;
> + wbc->sync_mode = WB_SYNC_ALL;
>
> - bdi_queue_work(wbc->bdi, &work);
> - bdi_wait_on_work_clear(&work);
> - }
> + bdi_work_init(&work, wbc);
> + work.state |= WS_ONSTACK;
> +
> + bdi_queue_work(wbc->bdi, &work);
> + bdi_wait_on_work_clear(&work);
> +}
> +
> +/**
> + * bdi_start_writeback - start writeback
> + * @wbc: writeback parameters
> + *
> + * Description:
> + * This does WB_SYNC_NONE opportunistic writeback. The IO is only
> + * started when this function returns, we make no guarantees on
> + * completion. Caller need not hold sb s_umount semaphore.
> + *
> + */
> +void bdi_start_writeback(struct writeback_control *wbc)
> +{
> + wbc->sync_mode = WB_SYNC_NONE;
> + bdi_alloc_queue_work(wbc->bdi, wbc);
> }
>
> /*
> @@ -1119,14 +1137,13 @@ long sync_inodes_sb(struct super_block *sb)
> struct writeback_control wbc = {
> .sb = sb,
> .bdi = sb->s_bdi,
> - .sync_mode = WB_SYNC_ALL,
> .range_start = 0,
> .range_end = LLONG_MAX,
> };
> long nr_to_write = LONG_MAX; /* doesn't actually matter */
>
> wbc.nr_to_write = nr_to_write;
> - bdi_start_writeback(&wbc);
> + bdi_sync_writeback(&wbc);
> wait_sb_inodes(&wbc);
> return nr_to_write - wbc.nr_to_write;
> }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index a5f0f76..f61f0cc 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -585,7 +585,6 @@ static void balance_dirty_pages(struct address_space *mapping)
> > background_thresh))) {
> struct writeback_control wbc = {
> .bdi = bdi,
> - .sync_mode = WB_SYNC_NONE,
> .nr_to_write = nr_writeback,
> };
>
> --
> 1.6.4.1.207.g68ea
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-14 13:42:09

by Christoph Hellwig

Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > bdi_start_writeback() is currently split into two paths, one for
> > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > only WB_SYNC_NONE.
> What I don't like about this patch is that if somebody sets up
> writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> bdi_start_writeback() it will just silently convert his writeback to an
> asynchronous one.
> So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> it does not match the purpose of the function...

Or initialize the wb entirely inside these functions. For the sync case
we really only need a superblock as argument, and for writeback it's
bdi + nr_pages. And also make sure they consistently return void as
no one cares about the return value.

2009-09-14 18:25:40

by Trond Myklebust

Subject: Re: [PATCH 2/7] Assign bdi in super_block

On Mon, 2009-09-14 at 15:02 +0200, Jan Kara wrote:
> On Mon 14-09-09 11:36:29, Jens Axboe wrote:
> > We do this automatically in get_sb_bdev() from the set_bdev_super()
> > callback. Filesystems that have their own private backing_dev_info
> > must assign that in ->fill_super().
> >
> > Note that ->s_bdi assignment is required for proper writeback!
> >
> > Acked-by: Christoph Hellwig <[email protected]>
> > Signed-off-by: Jens Axboe <[email protected]>
> Hmm, looking at this again, I'm not sure this will work for NFS. It seems
> to set mapping->backing_dev_info to its private backing dev info for
> regular files while it leaves it intact for other inodes (e.g.
> directories). I'm not sure why it does so but it seems its inodes end up on
> two different BDI lists and thus they wouldn't be synced properly. Trond,
> do I read the code properly?
> Also we definitely need to set *some* bdi in nfs_get_sb as otherwise sync
> won't work for it.

There hasn't really been a need for a bdi in NFS other than for the
regular file read and writeback code. The main reason for making it
private was to ensure that we could set a per-superblock readahead limit
that was a decent multiple of the server's preferred read block size.

Is there any reason why we couldn't set sb->s_bdi to point to that
private bdi?

Cheers
Trond

2009-09-14 18:36:54

by Jens Axboe

Subject: Re: [PATCH 2/7] Assign bdi in super_block

On Mon, Sep 14 2009, Trond Myklebust wrote:
> On Mon, 2009-09-14 at 15:02 +0200, Jan Kara wrote:
> > On Mon 14-09-09 11:36:29, Jens Axboe wrote:
> > > We do this automatically in get_sb_bdev() from the set_bdev_super()
> > > callback. Filesystems that have their own private backing_dev_info
> > > must assign that in ->fill_super().
> > >
> > > Note that ->s_bdi assignment is required for proper writeback!
> > >
> > > Acked-by: Christoph Hellwig <[email protected]>
> > > Signed-off-by: Jens Axboe <[email protected]>
> > Hmm, looking at this again, I'm not sure this will work for NFS. It seems
> > to set mapping->backing_dev_info to its private backing dev info for
> > regular files while it leaves it intact for other inodes (e.g.
> > directories). I'm not sure why it does so but it seems its inodes end up on
> > two different BDI lists and thus they wouldn't be synced properly. Trond,
> > do I read the code properly?
> > Also we definitely need to set *some* bdi in nfs_get_sb as otherwise sync
> > won't work for it.
>
> There hasn't really been a need for a bdi in NFS other than for the
> regular file read and writeback code. The main reason for making it
> private was to ensure that we could set a per-superblock readahead limit
> that was a decent multiple of the server's preferred read block size.
>
> Is there any reason why we couldn't set sb->s_bdi to point to that
> private bdi?

No, that should work fine. NFS already works fine with the bdi flusher
threads, so you should just point it at that bdi.

If you could take a look at the parent patch and give some input (or an
addition for NFS, even better), then I would much appreciate it.

--
Jens Axboe

2009-09-14 19:28:03

by Jens Axboe

Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Mon, Sep 14 2009, Christoph Hellwig wrote:
> On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > bdi_start_writeback() is currently split into two paths, one for
> > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > only WB_SYNC_NONE.
> > What I don't like about this patch is that if somebody sets up
> > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > bdi_start_writeback() it will just silently convert his writeback to an
> > asynchronous one.
> > So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > it does not match the purpose of the function...
>
> Or initialize the wb entirely inside these functions. For the sync case
> we really only need a superblock as argument, and for writeback it's
> > bdi + nr_pages. And also make sure they consistently return void as
> no one cares about the return value.

Yes, I thought about doing that and like that better than the warning.
Just pass in the needed args and allocate+fill the wbc on stack. I'll
make that change.

--
Jens Axboe

2009-09-14 19:42:41

by Jens Axboe

Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Mon, Sep 14 2009, Jens Axboe wrote:
> On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > bdi_start_writeback() is currently split into two paths, one for
> > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > only WB_SYNC_NONE.
> > > What I don't like about this patch is that if somebody sets up
> > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > bdi_start_writeback() it will just silently convert his writeback to an
> > > asynchronous one.
> > > So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > it does not match the purpose of the function...
> >
> > Or initialize the wb entirely inside these functions. For the sync case
> > we really only need a superblock as argument, and for writeback it's
> > > bdi + nr_pages. And also make sure they consistently return void as
> > no one cares about the return value.
>
> Yes, I thought about doing that and like that better than the warning.
> Just pass in the needed args and allocate+fill the wbc on stack. I'll
> make that change.

That works out much better, imho:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66

--
Jens Axboe

2009-09-15 09:08:49

by Jan Kara

Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Mon 14-09-09 21:42:43, Jens Axboe wrote:
> On Mon, Sep 14 2009, Jens Axboe wrote:
> > On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > > bdi_start_writeback() is currently split into two paths, one for
> > > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > > only WB_SYNC_NONE.
> > > > What I don't like about this patch is that if somebody sets up
> > > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > > bdi_start_writeback() it will just silently convert his writeback to an
> > > > asynchronous one.
> > > > So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > > it does not match the purpose of the function...
> > >
> > > Or initialize the wb entirely inside these functions. For the sync case
> > > we really only need a superblock as argument, and for writeback it's
> > > > bdi + nr_pages. And also make sure they consistently return void as
> > > no one cares about the return value.
> >
> > Yes, I thought about doing that and like that better than the warning.
> > Just pass in the needed args and allocate+fill the wbc on stack. I'll
> > make that change.
>
> That works out much better, imho:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66
Yeah, the code looks better. BTW, how about also converting
bdi_writeback_all() to take the superblock and nr_pages as arguments?
Currently it seems to be the only place "above" the flusher thread which uses
wbc and it's just constructed in the callers of bdi_writeback_all() and
then disassembled inside the function...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-15 09:14:05

by Jens Axboe

Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Tue, Sep 15 2009, Jan Kara wrote:
> On Mon 14-09-09 21:42:43, Jens Axboe wrote:
> > On Mon, Sep 14 2009, Jens Axboe wrote:
> > > On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > > > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > > > bdi_start_writeback() is currently split into two paths, one for
> > > > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > > > only WB_SYNC_NONE.
> > > > > What I don't like about this patch is that if somebody sets up
> > > > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > > > bdi_start_writeback() it will just silently convert his writeback to an
> > > > > asynchronous one.
> > > > > So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > > > it does not match the purpose of the function...
> > > >
> > > > Or initialize the wb entirely inside these functions. For the sync case
> > > > we really only need a superblock as argument, and for writeback it's
> > > > bdi + nr_pages. And also make sure they consistently return void as
> > > > no one cares about the return value.
> > >
> > > Yes, I thought about doing that and like that better than the warning.
> > > Just pass in the needed args and allocate+fill the wbc on stack. I'll
> > > make that change.
> >
> > That works out much better, imho:
> >
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66
> Yeah, the code looks better. BTW, how about also converting
> bdi_writeback_all() to take the superblock and nr_pages as arguments?
> Currently it seems to be the only place "above" the flusher thread which uses
> wbc and it's just constructed in the callers of bdi_writeback_all() and
> then disassembled inside the function...

Yes good point, I'll include that too. Thanks!

--
Jens Axboe

2009-09-15 10:15:00

by Jan Kara

Subject: Re: [PATCH 2/7] Assign bdi in super_block

On Mon 14-09-09 20:36:54, Jens Axboe wrote:
> On Mon, Sep 14 2009, Trond Myklebust wrote:
> > On Mon, 2009-09-14 at 15:02 +0200, Jan Kara wrote:
> > > On Mon 14-09-09 11:36:29, Jens Axboe wrote:
> > > > We do this automatically in get_sb_bdev() from the set_bdev_super()
> > > > callback. Filesystems that have their own private backing_dev_info
> > > > must assign that in ->fill_super().
> > > >
> > > > Note that ->s_bdi assignment is required for proper writeback!
> > > >
> > > > Acked-by: Christoph Hellwig <[email protected]>
> > > > Signed-off-by: Jens Axboe <[email protected]>
> > > Hmm, looking at this again, I'm not sure this will work for NFS. It seems
> > > to set mapping->backing_dev_info to its private backing dev info for
> > > regular files while it leaves it intact for other inodes (e.g.
> > > directories). I'm not sure why it does so but it seems its inodes end up on
> > > two different BDI lists and thus they wouldn't be synced properly. Trond,
> > > do I read the code properly?
> > > Also we definitely need to set *some* bdi in nfs_get_sb as otherwise sync
> > > won't work for it.
> >
> > There hasn't really been a need for a bdi in NFS other than for the
> > regular file read and writeback code. The main reason for making it
> > private was to ensure that we could set a per-superblock readahead limit
> > that was a decent multiple of the server's preferred read block size.
> >
> > Is there any reason why we couldn't set sb->s_bdi to point to that
> > private bdi?
>
> No, that should work fine. NFS already works fine with the bdi flusher
> threads, so you should just point it at that bdi.
But will it really work well? I mean if we sync the superblock on the
client, it will sync only the private BDI. So it won't sync any directory
inodes because they are on the default_backing_dev_info (NFS leaves
sb->s_bdev at NULL).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-15 10:22:34

by Jens Axboe

Subject: Re: [PATCH 2/7] Assign bdi in super_block

On Tue, Sep 15 2009, Jan Kara wrote:
> On Mon 14-09-09 20:36:54, Jens Axboe wrote:
> > On Mon, Sep 14 2009, Trond Myklebust wrote:
> > > On Mon, 2009-09-14 at 15:02 +0200, Jan Kara wrote:
> > > > On Mon 14-09-09 11:36:29, Jens Axboe wrote:
> > > > > We do this automatically in get_sb_bdev() from the set_bdev_super()
> > > > > callback. Filesystems that have their own private backing_dev_info
> > > > > must assign that in ->fill_super().
> > > > >
> > > > > Note that ->s_bdi assignment is required for proper writeback!
> > > > >
> > > > > Acked-by: Christoph Hellwig <[email protected]>
> > > > > Signed-off-by: Jens Axboe <[email protected]>
> > > > Hmm, looking at this again, I'm not sure this will work for NFS. It seems
> > > > to set mapping->backing_dev_info to its private backing dev info for
> > > > regular files while it leaves it intact for other inodes (e.g.
> > > > directories). I'm not sure why it does so but it seems its inodes end up on
> > > > two different BDI lists and thus they wouldn't be synced properly. Trond,
> > > > do I read the code properly?
> > > > Also we definitely need to set *some* bdi in nfs_get_sb as otherwise sync
> > > > won't work for it.
> > >
> > > There hasn't really been a need for a bdi in NFS other than for the
> > > regular file read and writeback code. The main reason for making it
> > > private was to ensure that we could set a per-superblock readahead limit
> > > that was a decent multiple of the server's preferred read block size.
> > >
> > > Is there any reason why we couldn't set sb->s_bdi to point to that
> > > private bdi?
> >
> > No, that should work fine. NFS already works fine with the bdi flusher
> > threads, so you should just point it at that bdi.
> But will it really work well? I mean if we sync the superblock on the
> client, it will sync only the private BDI. So it won't sync any directory
> inodes because they are on the default_backing_dev_info (NFS leaves
> sb->s_bdev at NULL).

No, you are right, it won't be complete. I talked to Chris about this
yesterday, and we agree that the best option is to give NFS the 'btrfs
treatment' bdi-wise.

--
Jens Axboe

2009-09-15 11:44:30

by Jens Axboe

Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Tue, Sep 15 2009, Jens Axboe wrote:
> On Tue, Sep 15 2009, Jan Kara wrote:
> > On Mon 14-09-09 21:42:43, Jens Axboe wrote:
> > > On Mon, Sep 14 2009, Jens Axboe wrote:
> > > > On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > > > > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > > > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > > > > bdi_start_writeback() is currently split into two paths, one for
> > > > > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > > > > only WB_SYNC_NONE.
> > > > > > What I don't like about this patch is that if somebody sets up
> > > > > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > > > > bdi_start_writeback() it will just silently convert his writeback to an
> > > > > > asynchronous one.
> > > > > > So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > > > > it does not match the purpose of the function...
> > > > >
> > > > > Or initialize the wb entirely inside these functions. For the sync case
> > > > > we really only need a superblock as argument, and for writeback it's
> > > > > bdi + nr_pages. And also make sure they consistently return void as
> > > > > no one cares about the return value.
> > > >
> > > > Yes, I thought about doing that and like that better than the warning.
> > > > Just pass in the needed args and allocate+fill the wbc on stack. I'll
> > > > make that change.
> > >
> > > That works out much better, imho:
> > >
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66
> > Yeah, the code looks better. BTW, how about converting also
> > bdi_writeback_all() to get superblock and nr_pages as an argument?
> > Currently it seems to be the only place "above" flusher thread which uses
> > wbc and it's just constructed in the callers of bdi_writeback_all() and
> > then disassembled inside the function...
>
> Yes good point, I'll include that too. Thanks!

One small problem there, though... Currently all queued writeback is
range cyclic; however, with this change we drop that bit from
sync_inodes_sb(), which isn't range_cyclic and instead just specifies the
whole range.

--
Jens Axboe

2009-09-15 12:58:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Tue 15-09-09 13:44:26, Jens Axboe wrote:
> On Tue, Sep 15 2009, Jens Axboe wrote:
> > On Tue, Sep 15 2009, Jan Kara wrote:
> > > On Mon 14-09-09 21:42:43, Jens Axboe wrote:
> > > > On Mon, Sep 14 2009, Jens Axboe wrote:
> > > > > On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > > > > > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > > > > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > > > > > bdi_start_writeback() is currently split into two paths, one for
> > > > > > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > > > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > > > > > only WB_SYNC_NONE.
> > > > > > > What I don't like about this patch is that if somebody sets up
> > > > > > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > > > > > bdi_start_writeback() it will just silently convert his writeback to an
> > > > > > > asynchronous one.
> > > > > > > So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > > > > > it does not match the purpose of the function...
> > > > > >
> > > > > > Or initialize the wb entirely inside these functions. For the sync case
> > > > > > we really only need a superblock as argument, and for writeback it's
> > > > > > bdi + nr_pages. And also make sure they consistently return void as
> > > > > > no one cares about the return value.
> > > > >
> > > > > Yes, I thought about doing that and like that better than the warning.
> > > > > Just pass in the needed args and allocate+fill the wbc on stack. I'll
> > > > > make that change.
> > > >
> > > > That works out much better, imho:
> > > >
> > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66
> > > Yeah, the code looks better. BTW, how about converting also
> > > bdi_writeback_all() to get superblock and nr_pages as an argument?
> > > Currently it seems to be the only place "above" flusher thread which uses
> > > wbc and it's just constructed in the callers of bdi_writeback_all() and
> > > then disassembled inside the function...
> >
> > Yes good point, I'll include that too. Thanks!
>
> One small problem there, though... Currently all queued writeback is
> range cyclic; however, with this change we drop that bit from
> sync_inodes_sb(), which isn't range_cyclic and instead just specifies the
> whole range.
I'm not sure I understand your comment, but I see a problem: even
writeback queued from sync_inodes_sb() is processed by wb_writeback(), which
sets range_cyclic. That's a bug even in your old patchset.
Let's have a look at the flags in wbc:
  nonblocking - Currently only set by direct callers of ->writepage(), BUT
                originally wb_kupdate() and background_writeout() also set
                this flag. Since filesystems and write_cache_pages() use the
                flag, we should set it for equivalent writeouts as well. This
                should be fixed...
  encountered_congestion - Checked only by AFS; we can probably get rid of
                it now.
  for_kupdate - Used; we set it properly in wb_writeback(), so that is fine.
  for_reclaim - Used by direct callers of ->writepage(). OK.
  for_writepages - Only ever set, never tested. Get rid of it.
  range_cyclic - Used. Also set when a caller didn't want it - should be
                fixed.
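The audit can be condensed into a struct sketch. Field names follow the 2.6.31-era struct writeback_control, but this simplified stand-in and the helper below are assumptions for illustration, not the kernel definition or the eventual fix:

```c
#include <string.h>

/* Simplified stand-in for the audited wbc flags (not the kernel struct). */
struct wbc_flags {
    unsigned nonblocking:1;            /* direct ->writepage() callers; set for equivalents too */
    unsigned encountered_congestion:1; /* checked only by AFS; removal candidate */
    unsigned for_kupdate:1;            /* set properly in wb_writeback(): fine */
    unsigned for_reclaim:1;            /* ->writepage() callers from reclaim: OK */
    unsigned for_writepages:1;         /* only ever set, never tested: dead */
    unsigned range_cyclic:1;           /* forced on even when unwanted: to fix */
};

/* Hypothetical helper showing the range_cyclic problem: the flusher
 * forces the bit regardless of what the caller asked for. */
static void buggy_flusher_setup(struct wbc_flags *f, int caller_wanted_cyclic)
{
    memset(f, 0, sizeof(*f));
    (void)caller_wanted_cyclic;  /* ignored: that is the bug */
    f->range_cyclic = 1;
}
```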

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-15 13:04:24

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Tue, Sep 15 2009, Jan Kara wrote:
> On Tue 15-09-09 13:44:26, Jens Axboe wrote:
> > On Tue, Sep 15 2009, Jens Axboe wrote:
> > > On Tue, Sep 15 2009, Jan Kara wrote:
> > > > On Mon 14-09-09 21:42:43, Jens Axboe wrote:
> > > > > On Mon, Sep 14 2009, Jens Axboe wrote:
> > > > > > On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > > > > > > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > > > > > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > > > > > > bdi_start_writeback() is currently split into two paths, one for
> > > > > > > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > > > > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > > > > > > only WB_SYNC_NONE.
> > > > > > > > What I don't like about this patch is that if somebody sets up
> > > > > > > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > > > > > > bdi_start_writeback() it will just silently convert his writeback to an
> > > > > > > > asynchronous one.
> > > > > > > > So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > > > > > > it does not match the purpose of the function...
> > > > > > >
> > > > > > > Or initialize the wb entirely inside these functions. For the sync case
> > > > > > > we really only need a superblock as argument, and for writeback it's
> > > > > > > bdi + nr_pages. And also make sure they consistently return void as
> > > > > > > no one cares about the return value.
> > > > > >
> > > > > > Yes, I thought about doing that and like that better than the warning.
> > > > > > Just pass in the needed args and allocate+fill the wbc on stack. I'll
> > > > > > make that change.
> > > > >
> > > > > That works out much better, imho:
> > > > >
> > > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66
> > > > Yeah, the code looks better. BTW, how about converting also
> > > > bdi_writeback_all() to get superblock and nr_pages as an argument?
> > > > Currently it seems to be the only place "above" flusher thread which uses
> > > > wbc and it's just constructed in the callers of bdi_writeback_all() and
> > > > then disassembled inside the function...
> > >
> > > Yes good point, I'll include that too. Thanks!
> >
> > One small problem there, though... Currently all queued writeback is
> > range cyclic; however, with this change we drop that bit from
> > sync_inodes_sb(), which isn't range_cyclic and instead just specifies the
> > whole range.
> I'm not sure I understand your comment, but I see a problem: even
> writeback queued from sync_inodes_sb() is processed by wb_writeback(), which
> sets range_cyclic. That's a bug even in your old patchset.

Indeed, looks like I have to double check all that again.

> Let's have a look at the flags in wbc:
> nonblocking - Currently only set by direct callers of ->writepage() BUT
> originally wb_kupdate() and background_writeout() also
> set this flag. Since filesystems and write_cache_pages()
> use the flag we should set it for equivalent writeouts as
> well. This should be fixed...

Since this is all handled by the dedicated thread now, dropping the
nonblocking bit was on purpose. What would the point be, except for
stopping pdflush being blocked on request allocation?

> encountered_congestion - Checked only by AFS, probably we can get rid of
> it now.
> for_kupdate - Used, we set it properly in wb_writeback() so that is fine.
> for_reclaim - Used by direct callers of ->writepage(). OK.
> for_writepages - Only set. Get rid of it.

Hmm indeed, that predates the patchset though. But I can queue up a
removal.

> range_cyclic - Used. Set also when a caller didn't want it - should be
> fixed.

Yes, that one wants fixing. Will probably have to pass that in through
the 'work' structure.

--
Jens Axboe

2009-09-15 13:08:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Tue, Sep 15, 2009 at 03:04:19PM +0200, Jens Axboe wrote:
>
> > Let's have a look at the flags in wbc:
> > nonblocking - Currently only set by direct callers of ->writepage() BUT
> > originally wb_kupdate() and background_writeout() also
> > set this flag. Since filesystems and write_cache_pages()
> > use the flag we should set it for equivalent writeouts as
> > well. This should be fixed...
>
> Since this is all handled by the dedicated thread now, dropping the
> nonblocking bit was on purpose. What would the point be, except for
> stopping pdflush being blocked on request allocation?

Note that this flag has traditionally just caused an utter mess. btrfs decided
to ignore it completely and ext4 partially. Removing this check in
XFS speeds up large buffered write loads massively.

Just half-removing it is a bad idea, though - if you don't set it
anymore please kill it entirely.

2009-09-15 13:16:51

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 2/7] Assign bdi in super_block

On Tue, 2009-09-15 at 12:14 +0200, Jan Kara wrote:
> On Mon 14-09-09 20:36:54, Jens Axboe wrote:
> > No, that should work fine. NFS already works fine with the bdi flusher
> > threads, so you should just point it at that bdi.
> But will it really work well? I mean if we sync the superblock on the
> client, it will sync only the private BDI. So it won't sync any directory
> inodes because they are on the default_backing_dev_info (NFS leaves
> sb->s_bdev at NULL).

All directory related operations (link, rename, create, ...) are fully
synchronous in NFS. There should be no need to set up anything to
synchronise directory inodes.

Cheers
Trond

2009-09-15 13:17:11

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Tue, Sep 15 2009, Christoph Hellwig wrote:
> On Tue, Sep 15, 2009 at 03:04:19PM +0200, Jens Axboe wrote:
> >
> > > Let's have a look at the flags in wbc:
> > > nonblocking - Currently only set by direct callers of ->writepage() BUT
> > > originally wb_kupdate() and background_writeout() also
> > > set this flag. Since filesystems and write_cache_pages()
> > > use the flag we should set it for equivalent writeouts as
> > > well. This should be fixed...
> >
> > Since this is all handled by the dedicated thread now, dropping the
> > nonblocking bit was on purpose. What would the point be, except for
> > stopping pdflush being blocked on request allocation?
>
> Note that this flag has traditionally just caused an utter mess. btrfs decided
> to ignore it completely and ext4 partially. Removing this check in
> XFS speeds up large buffered write loads massively.
>
> Just half-removing it is a bad idea, though - if you don't set it
> anymore please kill it entirely.

I haven't touched it, except removing it from the caller where it
doesn't make sense anymore. If you think we should kill it completely,
then let's look at that in a few days. I've got more than enough stuff
queued up for inclusion now that I need to test and verify before doing
even more cleanups/changes :-)

--
Jens Axboe

2009-09-15 14:01:47

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Tue 15-09-09 09:08:29, Christoph Hellwig wrote:
> On Tue, Sep 15, 2009 at 03:04:19PM +0200, Jens Axboe wrote:
> >
> > > Let's have a look at the flags in wbc:
> > > nonblocking - Currently only set by direct callers of ->writepage() BUT
> > > originally wb_kupdate() and background_writeout() also
> > > set this flag. Since filesystems and write_cache_pages()
> > > use the flag we should set it for equivalent writeouts as
> > > well. This should be fixed...
> >
> > Since this is all handled by the dedicated thread now, dropping the
> > nonblocking bit was on purpose. What would the point be, except for
> > stopping pdflush being blocked on request allocation?
>
> Note that this flag has traditionally just caused an utter mess. btrfs decided
> to ignore it completely and ext4 partially. Removing this check in
> XFS speeds up large buffered write loads massively.
>
> Just half-removing it is a bad idea, though - if you don't set it
> anymore please kill it entirely.
The nonblocking flag is still set for writeback done for memory reclaim.
OTOH the only real consumer of this flag now seems to be
__block_write_full_page(), which does a trylock_buffer() in case of
nonblocking writeback. I'm undecided whether it makes sense or not.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-15 14:10:00

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback

On Tue, Sep 15, 2009 at 04:01:45PM +0200, Jan Kara wrote:
> On Tue 15-09-09 09:08:29, Christoph Hellwig wrote:
> > On Tue, Sep 15, 2009 at 03:04:19PM +0200, Jens Axboe wrote:
> > >
> > > > Let's have a look at the flags in wbc:
> > > > nonblocking - Currently only set by direct callers of ->writepage() BUT
> > > > originally wb_kupdate() and background_writeout() also
> > > > set this flag. Since filesystems and write_cache_pages()
> > > > use the flag we should set it for equivalent writeouts as
> > > > well. This should be fixed...
> > >
> > > Since this is all handled by the dedicated thread now, dropping the
> > > nonblocking bit was on purpose. What would the point be, except for
> > > stopping pdflush being blocked on request allocation?
> >
> > Note that this flag has traditionally just caused an utter mess. btrfs decided
> > to ignore it completely and ext4 partially. Removing this check in
> > XFS speeds up large buffered write loads massively.
> >
> > Just half-removing it is a bad idea, though - if you don't set it
> > anymore please kill it entirely.
> The nonblocking flag is still set for writeback done for memory reclaim.
> OTOH the only real consumer of this flag now seems to be
> __block_write_full_page(), which does a trylock_buffer() in case of
> nonblocking writeback. I'm undecided whether it makes sense or not.

Ugh, whether it makes sense is tricky to say. If __block_write_full_page()
does a lock_buffer() instead of a trylock_buffer(), and ext3 is mounted in
data=ordered mode, then it is very possible that we'll end up with a
dirty page with locked buffers.

The buffers will have been locked by ext3 data=ordered writeback and
they won't unlock until the IO is done.

We probably don't want kswapd waiting on that writeback.

-chris