2008-01-15 08:09:56

by Michael Rubin

[permalink] [raw]
Subject: [patch] Converting writeback linked lists to a tree based data structure

From: Michael Rubin <[email protected]>

For those of you who have waited so long: this is the third submission
of the first attempt at this patch. It is a trilogy.

Two changes are in this patch. They are dependent on each other.

In addition we get an unintended performance improvement. Syncing
1,000,000 inodes, each with 4KB of dirty data, takes the original kernel
83 seconds; with the change it takes 77 seconds.

1) Adding a data structure to guarantee fairness when writing
inodes to disk and to simplify the code base.

When inodes are parked for writeback they are parked in the
flush_tree. The flush tree is a data structure based on an rb tree.

Duplicate keys are handled by keeping a list in the tree for each key
value. The next inode to flush is chosen by two fields: first the
earliest dirtied_when value; if there are duplicate dirtied_when
values, then the earliest i_flush_gen value determines which inode
gets flushed next.

The flush tree organizes the dirtied_when keys with the rb_tree. Any
inodes with a duplicate dirtied_when value are linked together in a
list, sorted by the inode's i_flush_gen. When both the dirtied_when
and the i_flush_gen are identical, the order in the linked list
determines the order in which we flush the inodes.

2) Adding an inode flag that allows inodes to be marked so that they
are never written back to disk.

The motivation behind this change is severalfold. The first is
to ensure fairness in the writeback algorithm. The second is to
deal with a bug where writing to large files concurrently
with smaller ones creates a situation where writeback cannot
keep up with traffic and memory balloons until we hit the
threshold watermark. This can result in surprisingly long latency
with respect to disk traffic; this latency can run to minutes. The
flush tree fixes this issue and also fixes several other minor
fairness issues.

Signed-off-by: Michael Rubin <[email protected]>
---

Index: 2624rc7_wb/fs/anon_inodes.c
===================================================================
--- 2624rc7_wb.orig/fs/anon_inodes.c 2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/anon_inodes.c 2008-01-08 13:46:59.000000000 -0800
@@ -154,13 +154,7 @@ static struct inode *anon_inode_mkinode(

inode->i_fop = &anon_inode_fops;

- /*
- * Mark the inode dirty from the very beginning,
- * that way it will never be moved to the dirty
- * list because mark_inode_dirty() will think
- * that it already _is_ on the dirty list.
- */
- inode->i_state = I_DIRTY;
+ inode->i_state = I_WRITEBACK_NEVER;
inode->i_mode = S_IRUSR | S_IWUSR;
inode->i_uid = current->fsuid;
inode->i_gid = current->fsgid;
Index: 2624rc7_wb/fs/fs-writeback.c
===================================================================
--- 2624rc7_wb.orig/fs/fs-writeback.c 2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/fs-writeback.c 2008-01-14 18:53:54.000000000 -0800
@@ -23,8 +23,185 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/buffer_head.h>
+#include <linux/rbtree.h>
#include "internal.h"

+#define rb_to_inode(node) rb_entry((node), struct inode, i_flush_node)
+
+/*
+ * When inodes are parked for writeback they are parked in the
+ * flush_tree. The flush tree is a data structure based on an rb tree.
+ *
+ * Duplicate keys are handled by making a list in the tree for each key
+ * value. The order of how we choose the next inode to flush is decided
+ * by two fields. First the earliest dirtied_when value. If there are
+ * duplicate dirtied_when values then the earliest i_flush_gen value
+ * determines who gets flushed next.
+ *
+ * The flush tree organizes the dirtied_when keys with the rb_tree. Any
+ * inodes with a duplicate dirtied_when value are link listed together. This
+ * link list is sorted by the inode's i_flush_gen. When both the
+ * dirtied_when and the i_flush_gen are identical the order in the
+ * linked list determines the order we flush the inodes.
+ */
+
+/*
+ * Find a rb_node matching the key in the flush tree. There are no duplicate
+ * rb_nodes in the tree. Instead they are chained off the first node.
+ */
+static struct inode *flush_tree_search(struct super_block *sb,
+ unsigned long ts)
+{
+ struct rb_node *n = sb->s_flush_root.rb_node;
+ assert_spin_locked(&inode_lock);
+ while (n) {
+ struct inode *inode = rb_to_inode(n);
+ if (time_before(ts, inode->dirtied_when)) {
+ n = n->rb_left;
+ } else if (time_after(ts, inode->dirtied_when)) {
+ n = n->rb_right;
+ } else {
+ return inode;
+ }
+ }
+ return NULL;
+}
+
+/*
+ * Inserting an inode into the flush tree. The tree is keyed by the
+ * dirtied_when member.
+ *
+ * If there is a duplicate key in the tree already the new inode is put
+ * on the tail of a list of the rb_node.
+ * All inserted inodes must have one of the I_DIRTY flags set.
+ */
+static void flush_tree_insert(struct super_block *sb, struct inode *inode)
+{
+ struct rb_node **new = &(sb->s_flush_root.rb_node);
+ struct rb_node *parent = NULL;
+
+ assert_spin_locked(&inode_lock);
+ BUG_ON((inode->i_state & I_DIRTY) == 0);
+ BUG_ON(inode->i_state & (I_FREEING|I_CLEAR));
+ BUG_ON(RB_LINKED_NODE(&inode->i_flush_node));
+
+ sb->s_flush_count++;
+ inode->i_flush_gen = sb->s_flush_gen;
+
+ list_del_init(&inode->i_list);
+ while (*new) {
+ struct inode *this = rb_to_inode(*new);
+ parent = *new;
+ if (time_before(inode->dirtied_when, this->dirtied_when))
+ new = &((*new)->rb_left);
+ else if (time_after(inode->dirtied_when, this->dirtied_when))
+ new = &((*new)->rb_right);
+ else {
+ list_add_tail(&inode->i_list, &this->i_list);
+ return;
+ }
+ }
+
+ /* Add in the new node and re-balance the tree */
+ rb_link_node(&inode->i_flush_node, parent, new);
+ rb_insert_color(&inode->i_flush_node, &sb->s_flush_root);
+}
+
+/*
+ * Here we return the inode that has the smallest key in the flush tree
+ * that is greater than the parameter "prev_time".
+ */
+static struct inode *flush_tree_min_greater(struct super_block *sb,
+ unsigned long prev_time)
+{
+ struct rb_node *node = sb->s_flush_root.rb_node;
+ struct inode *bsf = NULL;
+ /* best so far */
+ assert_spin_locked(&inode_lock);
+ while (node) {
+ struct inode *data = rb_to_inode(node);
+ /* Just trying to get lucky */
+ if ((prev_time + 1) == data->dirtied_when)
+ return data;
+ /*
+ * If this value is greater than our prev_time and is
+ * less than the best so far, this is our new best so far.
+ */
+ if (time_after(data->dirtied_when, prev_time) &&
+ (bsf ? time_after(bsf->dirtied_when, data->dirtied_when):1))
+ bsf = data;
+
+ /* Search all the way down to the bottom of the tree */
+ if (time_before(prev_time, data->dirtied_when)) {
+ node = node->rb_left;
+ } else if (time_after_eq(prev_time, data->dirtied_when)) {
+ node = node->rb_right;
+ }
+ }
+ return bsf;
+}
+
+/*
+ * Here is where we iterate to find the next inode to process. The
+ * strategy is to first look for any other inodes with the same dirtied_when
+ * value. If we have already processed that node then we need to find
+ * the next highest dirtied_when value in the tree.
+ */
+static struct inode *flush_tree_next(struct super_block *sb,
+ struct inode *prev_inode,
+ unsigned long prev_dirtied,
+ unsigned long long flush_gen)
+{
+ struct inode *inode;
+ assert_spin_locked(&inode_lock);
+
+ /*
+ * First we look to see if there is an inode with the same
+ * dirtied_time as our previous processed inode.
+ */
+ inode = flush_tree_search(sb, prev_dirtied);
+
+ /*
+ * If there is and if it's not the same one as we just processed
+ * and the i_flush_gen is later than our start.
+ */
+ if (inode && (inode->i_flush_gen < flush_gen))
+ return inode;
+
+ /* If not we find the next inode that has been dirtied after this one */
+ return flush_tree_min_greater(sb, prev_dirtied);
+}
+
+/* Removing a node from the flushtree. */
+void flush_tree_remove(struct super_block *sb, struct inode *inode)
+{
+ struct rb_node *rb_node = &inode->i_flush_node;
+ struct rb_root *rb_root = &sb->s_flush_root;
+
+ assert_spin_locked(&inode_lock);
+ BUG_ON((inode->i_state & I_DIRTY) == 0);
+
+ sb->s_flush_count--;
+
+ /* There is no chain on this inode. Just remove it from the tree */
+ if (list_empty(&inode->i_list)) {
+ BUG_ON(!RB_LINKED_NODE(rb_node));
+ rb_erase(rb_node, rb_root);
+ memset(rb_node, 0, sizeof(*rb_node));
+ return;
+ }
+
+ /* This node is on a chain AND is in the rb_tree */
+ if (RB_LINKED_NODE(rb_node)) {
+ struct inode *new = list_entry(inode->i_list.next,
+ struct inode, i_list);
+ rb_replace_node(rb_node, &new->i_flush_node, rb_root);
+ memset(rb_node, 0, sizeof(*rb_node));
+ }
+ /* Take it off the list */
+ list_del_init(&inode->i_list);
+}
+
/**
* __mark_inode_dirty - internal function
* @inode: inode to mark
@@ -32,11 +209,11 @@
* Mark an inode as dirty. Callers should use mark_inode_dirty or
* mark_inode_dirty_sync.
*
- * Put the inode on the super block's dirty list.
+ * Put the inode on the super block's flush_tree.
*
* CAREFUL! We mark it dirty unconditionally, but move it onto the
* dirty list only if it is hashed or if it refers to a blockdev.
- * If it was not hashed, it will never be added to the dirty list
+ * If it was not hashed, it will never be added to the flush_tree
* even if it is later hashed, as it will have been marked dirty already.
*
* In short, make sure you hash any inodes _before_ you start marking
@@ -75,6 +252,10 @@ void __mark_inode_dirty(struct inode *in
if ((inode->i_state & flags) == flags)
return;

+ if (inode->i_state & I_WRITEBACK_NEVER)
+ return;
+
+
if (unlikely(block_dump)) {
struct dentry *dentry = NULL;
const char *name = "?";
@@ -97,34 +278,36 @@ void __mark_inode_dirty(struct inode *in
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;

- inode->i_state |= flags;
-
- /*
- * If the inode is being synced, just update its dirty state.
- * The unlocker will place the inode on the appropriate
- * superblock list, based upon its state.
- */
- if (inode->i_state & I_SYNC)
+ /* If we are freeing this inode we cannot mark it dirty. */
+ if (inode->i_state & (I_FREEING|I_CLEAR))
goto out;

/*
- * Only add valid (hashed) inodes to the superblock's
- * dirty list. Add blockdev inodes as well.
+ * Only add valid (hashed) inodes to the flush_tree
+ * Add blockdev inodes as well.
*/
if (!S_ISBLK(inode->i_mode)) {
if (hlist_unhashed(&inode->i_hash))
goto out;
}
- if (inode->i_state & (I_FREEING|I_CLEAR))
+
+ inode->i_state |= flags;
+
+ /*
+ * If the inode is locked, just update its dirty state.
+ * The unlocker will place the inode into the flush_tree
+ * based upon its state.
+ */
+ if (inode->i_state & I_SYNC)
goto out;

/*
- * If the inode was already on s_dirty/s_io/s_more_io, don't
- * reposition it (that would break s_dirty time-ordering).
+ * If the inode was already in the flush_tree, don't
+ * reposition it (that would break flush_tree time-ordering).
*/
if (!was_dirty) {
inode->dirtied_when = jiffies;
- list_move(&inode->i_list, &sb->s_dirty);
+ flush_tree_insert(sb, inode);
}
}
out:
@@ -140,38 +323,6 @@ static int write_inode(struct inode *ino
return 0;
}

-/*
- * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
- * furthest end of its superblock's dirty-inode list.
- *
- * Before stamping the inode's ->dirtied_when, we check to see whether it is
- * already the most-recently-dirtied inode on the s_dirty list. If that is
- * the case then the inode must have been redirtied while it was being written
- * out and we don't reset its dirtied_when.
- */
-static void redirty_tail(struct inode *inode)
-{
- struct super_block *sb = inode->i_sb;
-
- if (!list_empty(&sb->s_dirty)) {
- struct inode *tail_inode;
-
- tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
- if (!time_after_eq(inode->dirtied_when,
- tail_inode->dirtied_when))
- inode->dirtied_when = jiffies;
- }
- list_move(&inode->i_list, &sb->s_dirty);
-}
-
-/*
- * requeue inode for re-scanning after sb->s_io list is exhausted.
- */
-static void requeue_io(struct inode *inode)
-{
- list_move(&inode->i_list, &inode->i_sb->s_more_io);
-}
-
static void inode_sync_complete(struct inode *inode)
{
/*
@@ -181,38 +332,9 @@ static void inode_sync_complete(struct i
wake_up_bit(&inode->i_state, __I_SYNC);
}

-/*
- * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
- */
-static void move_expired_inodes(struct list_head *delaying_queue,
- struct list_head *dispatch_queue,
- unsigned long *older_than_this)
-{
- while (!list_empty(delaying_queue)) {
- struct inode *inode = list_entry(delaying_queue->prev,
- struct inode, i_list);
- if (older_than_this &&
- time_after(inode->dirtied_when, *older_than_this))
- break;
- list_move(&inode->i_list, dispatch_queue);
- }
-}
-
-/*
- * Queue all expired dirty inodes for io, eldest first.
- */
-static void queue_io(struct super_block *sb,
- unsigned long *older_than_this)
-{
- list_splice_init(&sb->s_more_io, sb->s_io.prev);
- move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
-}
-
int sb_has_dirty_inodes(struct super_block *sb)
{
- return !list_empty(&sb->s_dirty) ||
- !list_empty(&sb->s_io) ||
- !list_empty(&sb->s_more_io);
+ return !RB_EMPTY_ROOT(&sb->s_flush_root);
}
EXPORT_SYMBOL(sb_has_dirty_inodes);

@@ -221,7 +343,7 @@ EXPORT_SYMBOL(sb_has_dirty_inodes);
* If `wait' is set, wait on the writeout.
*
* The whole writeout design is quite complex and fragile. We want to avoid
- * starvation of particular inodes when others are being redirtied, prevent
+ * starvation of particular inodes when others are being re-dirtied, prevent
* livelocks, etc.
*
* Called under inode_lock.
@@ -231,10 +353,13 @@ __sync_single_inode(struct inode *inode,
{
unsigned dirty;
struct address_space *mapping = inode->i_mapping;
+ struct super_block *sb = inode->i_sb;
int wait = wbc->sync_mode == WB_SYNC_ALL;
int ret;
+ long pages_skipped = wbc->pages_skipped;

BUG_ON(inode->i_state & I_SYNC);
+ BUG_ON(RB_LINKED_NODE(&inode->i_flush_node));

/* Set I_SYNC, reset I_DIRTY */
dirty = inode->i_state & I_DIRTY;
@@ -260,60 +385,63 @@ __sync_single_inode(struct inode *inode,

spin_lock(&inode_lock);
inode->i_state &= ~I_SYNC;
- if (!(inode->i_state & I_FREEING)) {
- if (!(inode->i_state & I_DIRTY) &&
+ if (inode->i_state & I_FREEING)
+ goto out;
+
+ if (wbc->pages_skipped != pages_skipped) {
+ /*
+ * writeback is not making progress due to locked
+ * buffers. Skip this inode for now.
+ */
+ inode->i_state |= I_DIRTY_PAGES;
+ flush_tree_insert(sb, inode);
+ } else if (!(inode->i_state & I_DIRTY) &&
mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+ /*
+ * We didn't write back all the pages. nfs_writepages()
+ * sometimes bales out without doing anything. Redirty
+ * the inode; then put it into the flush_tree.
+ */
+ if (wbc->for_kupdate) {
/*
- * We didn't write back all the pages. nfs_writepages()
- * sometimes bales out without doing anything. Redirty
- * the inode; Move it from s_io onto s_more_io/s_dirty.
- */
- /*
- * akpm: if the caller was the kupdate function we put
- * this inode at the head of s_dirty so it gets first
- * consideration. Otherwise, move it to the tail, for
- * the reasons described there. I'm not really sure
- * how much sense this makes. Presumably I had a good
- * reasons for doing it this way, and I'd rather not
- * muck with it at present.
- */
- if (wbc->for_kupdate) {
- /*
- * For the kupdate function we move the inode
- * to s_more_io so it will get more writeout as
- * soon as the queue becomes uncongested.
- */
- inode->i_state |= I_DIRTY_PAGES;
- requeue_io(inode);
- } else {
- /*
- * Otherwise fully redirty the inode so that
- * other inodes on this superblock will get some
- * writeout. Otherwise heavy writing to one
- * file would indefinitely suspend writeout of
- * all the other files.
- */
- inode->i_state |= I_DIRTY_PAGES;
- redirty_tail(inode);
- }
- } else if (inode->i_state & I_DIRTY) {
- /*
- * Someone redirtied the inode while were writing back
- * the pages.
- */
- redirty_tail(inode);
- } else if (atomic_read(&inode->i_count)) {
- /*
- * The inode is clean, inuse
+ * For the kupdate function we leave
+ * dirtied_when field untouched and return
+ * it to the flush_tree. The next iteration
+ * of kupdate will flush more pages when
+ * the queue is no longer congested.
*/
- list_move(&inode->i_list, &inode_in_use);
+ inode->i_state |= I_DIRTY_PAGES;
+ flush_tree_insert(sb, inode);
} else {
/*
- * The inode is clean, unused
+ * Otherwise fully redirty the inode so that
+ * other inodes on this superblock will get some
+ * writeout. Otherwise heavy writing to one
+ * file would indefinitely suspend writeout of
+ * all the other files.
*/
- list_move(&inode->i_list, &inode_unused);
+ inode->i_state |= I_DIRTY_PAGES;
+ inode->dirtied_when = jiffies;
+ flush_tree_insert(sb, inode);
}
+ } else if (inode->i_state & I_DIRTY) {
+ /*
+ * Someone redirtied the inode while were writing back
+ * the pages.
+ */
+ flush_tree_insert(inode->i_sb, inode);
+ } else if (atomic_read(&inode->i_count)) {
+ /*
+ * The inode is clean, inuse
+ */
+ list_move(&inode->i_list, &inode_in_use);
+ } else {
+ /*
+ * The inode is clean, unused
+ */
+ list_move(&inode->i_list, &inode_unused);
}
+out:
inode_sync_complete(inode);
return ret;
}
@@ -333,27 +461,14 @@ __writeback_single_inode(struct inode *i
else
WARN_ON(inode->i_state & I_WILL_FREE);

+ /*
+ * If the inode is locked and we are not going to wait for it
+ * to be unlocked then we can just exit the routine. Since the
+ * inode is marked I_DIRTY it will be inserted into the flush
+ * tree by sync_single_inode when the I_SYNC is released.
+ */
if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_SYNC)) {
- struct address_space *mapping = inode->i_mapping;
- int ret;
-
- /*
- * We're skipping this inode because it's locked, and we're not
- * doing writeback-for-data-integrity. Move it to s_more_io so
- * that writeback can proceed with the other inodes on s_io.
- * We'll have another go at writing back this inode when we
- * completed a full scan of s_io.
- */
- requeue_io(inode);
-
- /*
- * Even if we don't actually write the inode itself here,
- * we can at least start some of the data writeout..
- */
- spin_unlock(&inode_lock);
- ret = do_writepages(mapping, wbc);
- spin_lock(&inode_lock);
- return ret;
+ return 0;
}

/*
@@ -380,12 +495,9 @@ __writeback_single_inode(struct inode *i
* If older_than_this is non-NULL, then only write out inodes which
* had their first dirtying at a time earlier than *older_than_this.
*
- * If we're a pdlfush thread, then implement pdflush collision avoidance
+ * If we're a pdflush thread, then implement pdflush collision avoidance
* against the entire list.
*
- * WB_SYNC_HOLD is a hack for sys_sync(): reattach the inode to sb->s_dirty so
- * that it can be located for waiting on in __writeback_single_inode().
- *
* Called under inode_lock.
*
* If `bdi' is non-zero then we're being asked to writeback a specific queue.
@@ -398,33 +510,35 @@ __writeback_single_inode(struct inode *i
* a queue with that address_space. (Easy: have a global "dirty superblocks"
* list).
*
- * The inodes to be written are parked on sb->s_io. They are moved back onto
- * sb->s_dirty as they are selected for writing. This way, none can be missed
- * on the writer throttling path, and we get decent balancing between many
- * throttled threads: we don't want them all piling up on inode_sync_wait.
+ * The inodes to be written are inserted into the flush_tree.
*/
static void
sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
{
const unsigned long start = jiffies; /* livelock avoidance */
+ struct inode *inode = NULL;
+ unsigned long prev_dirtied = 0;
+ unsigned long long flush_gen;
+
+ spin_lock(&inode_lock);

- if (!wbc->for_kupdate || list_empty(&sb->s_io))
- queue_io(sb, wbc->older_than_this);
+ flush_gen = ++sb->s_flush_gen;

- while (!list_empty(&sb->s_io)) {
- struct inode *inode = list_entry(sb->s_io.prev,
- struct inode, i_list);
+ while ((inode = flush_tree_next(sb, inode,
+ prev_dirtied, flush_gen)) != NULL) {
struct address_space *mapping = inode->i_mapping;
struct backing_dev_info *bdi = mapping->backing_dev_info;
- long pages_skipped;
+
+ flush_tree_remove(sb , inode);
+ prev_dirtied = inode->dirtied_when;

if (!bdi_cap_writeback_dirty(bdi)) {
- redirty_tail(inode);
if (sb_is_blkdev_sb(sb)) {
/*
* Dirty memory-backed blockdev: the ramdisk
* driver does this. Skip just this inode
*/
+ flush_tree_insert(sb, inode);
continue;
}
/*
@@ -439,14 +553,14 @@ sync_sb_inodes(struct super_block *sb, s
wbc->encountered_congestion = 1;
if (!sb_is_blkdev_sb(sb))
break; /* Skip a congested fs */
- requeue_io(inode);
+ flush_tree_insert(sb, inode);
continue; /* Skip a congested blockdev */
}

if (wbc->bdi && bdi != wbc->bdi) {
if (!sb_is_blkdev_sb(sb))
break; /* fs has the wrong queue */
- requeue_io(inode);
+ flush_tree_insert(sb, inode);
continue; /* blockdev has wrong queue */
}

@@ -454,36 +568,33 @@ sync_sb_inodes(struct super_block *sb, s
if (time_after(inode->dirtied_when, start))
break;

+ /* Was this inode dirtied too recently? */
+ if (wbc->older_than_this && time_after(inode->dirtied_when,
+ *wbc->older_than_this))
+ break;
+
+
/* Is another pdflush already flushing this queue? */
if (current_is_pdflush() && !writeback_acquire(bdi))
break;

BUG_ON(inode->i_state & I_FREEING);
__iget(inode);
- pages_skipped = wbc->pages_skipped;
__writeback_single_inode(inode, wbc);
- if (wbc->sync_mode == WB_SYNC_HOLD) {
- inode->dirtied_when = jiffies;
- list_move(&inode->i_list, &sb->s_dirty);
- }
if (current_is_pdflush())
writeback_release(bdi);
- if (wbc->pages_skipped != pages_skipped) {
- /*
- * writeback is not making progress due to locked
- * buffers. Skip this inode for now.
- */
- redirty_tail(inode);
- }
spin_unlock(&inode_lock);
iput(inode);
cond_resched();
spin_lock(&inode_lock);
- if (wbc->nr_to_write <= 0)
+ if (wbc->nr_to_write <= 0) {
+ inode = NULL;
break;
+ }
}
- if (!list_empty(&sb->s_more_io))
- wbc->more_io = 1;
+ if (inode)
+ flush_tree_insert(sb, inode);
+ spin_unlock(&inode_lock);
return; /* Leave any unwritten inodes on s_io */
}

@@ -492,9 +603,9 @@ sync_sb_inodes(struct super_block *sb, s
*
* Note:
* We don't need to grab a reference to superblock here. If it has non-empty
- * ->s_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until the ->s_dirty/s_io/s_more_io lists are all
- * empty. Since __sync_single_inode() regains inode_lock before it finally moves
+ * flush_tree it's hadn't been killed yet and kill_super() won't proceed
+ * past sync_inodes_sb() until the flush_tree is empty. Since
+ * __sync_single_inode() regains inode_lock before it finally moves
* inode from superblock lists we are OK.
*
* If `older_than_this' is non-zero then only flush inodes which have a
@@ -527,9 +638,7 @@ restart:
*/
if (down_read_trylock(&sb->s_umount)) {
if (sb->s_root) {
- spin_lock(&inode_lock);
sync_sb_inodes(sb, wbc);
- spin_unlock(&inode_lock);
}
up_read(&sb->s_umount);
}
@@ -545,8 +654,7 @@ restart:

/*
* writeback and wait upon the filesystem's dirty inodes. The caller will
- * do this in two passes - one to write, and one to wait. WB_SYNC_HOLD is
- * used to park the written inodes on sb->s_dirty for the wait pass.
+ * do this in two passes - one to write, and one to wait.
*
* A finite limit is set on the number of pages which will be written.
* To prevent infinite livelock of sys_sync().
@@ -557,7 +665,7 @@ restart:
void sync_inodes_sb(struct super_block *sb, int wait)
{
struct writeback_control wbc = {
- .sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_HOLD,
+ .sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_NONE,
.range_start = 0,
.range_end = LLONG_MAX,
};
@@ -568,9 +676,7 @@ void sync_inodes_sb(struct super_block *
(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
nr_dirty + nr_unstable;
wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */
- spin_lock(&inode_lock);
sync_sb_inodes(sb, &wbc);
- spin_unlock(&inode_lock);
}

/*
@@ -654,7 +760,7 @@ void sync_inodes(int wait)
*/
int write_inode_now(struct inode *inode, int sync)
{
- int ret;
+ int ret = 0;
struct writeback_control wbc = {
.nr_to_write = LONG_MAX,
.sync_mode = WB_SYNC_ALL,
@@ -666,9 +772,7 @@ int write_inode_now(struct inode *inode,
wbc.nr_to_write = 0;

might_sleep();
- spin_lock(&inode_lock);
- ret = __writeback_single_inode(inode, &wbc);
- spin_unlock(&inode_lock);
+ sync_inode(inode, &wbc);
if (sync)
inode_sync_wait(inode);
return ret;
@@ -688,10 +792,13 @@ EXPORT_SYMBOL(write_inode_now);
*/
int sync_inode(struct inode *inode, struct writeback_control *wbc)
{
- int ret;
+ int ret = 0;

spin_lock(&inode_lock);
- ret = __writeback_single_inode(inode, wbc);
+ if (inode->i_state & I_DIRTY) {
+ flush_tree_remove(inode->i_sb, inode);
+ ret = __writeback_single_inode(inode, wbc);
+ }
spin_unlock(&inode_lock);
return ret;
}
Index: 2624rc7_wb/fs/hugetlbfs/inode.c
===================================================================
--- 2624rc7_wb.orig/fs/hugetlbfs/inode.c 2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/hugetlbfs/inode.c 2008-01-08 13:47:38.000000000 -0800
@@ -383,7 +383,7 @@ static void hugetlbfs_forget_inode(struc
struct super_block *sb = inode->i_sb;

if (!hlist_unhashed(&inode->i_hash)) {
- if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+ if (!(inode->i_state & (I_DIRTY|I_SYNC|I_WRITEBACK_NEVER)))
list_move(&inode->i_list, &inode_unused);
inodes_stat.nr_unused++;
if (!sb || (sb->s_flags & MS_ACTIVE)) {
Index: 2624rc7_wb/fs/inode.c
===================================================================
--- 2624rc7_wb.orig/fs/inode.c 2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/inode.c 2008-01-08 13:54:07.000000000 -0800
@@ -143,6 +143,7 @@ static struct inode *alloc_inode(struct
inode->i_cdev = NULL;
inode->i_rdev = 0;
inode->dirtied_when = 0;
+ memset(&inode->i_flush_node, 0, sizeof(inode->i_flush_node));
if (security_inode_alloc(inode)) {
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
@@ -337,6 +338,10 @@ static int invalidate_list(struct list_h
inode = list_entry(tmp, struct inode, i_sb_list);
invalidate_inode_buffers(inode);
if (!atomic_read(&inode->i_count)) {
+ if (inode->i_state & I_DIRTY) {
+ flush_tree_remove(inode->i_sb, inode);
+ inode->i_state &= ~I_DIRTY;
+ }
list_move(&inode->i_list, dispose);
inode->i_state |= I_FREEING;
count++;
@@ -1043,7 +1048,10 @@ EXPORT_SYMBOL(remove_inode_hash);
void generic_delete_inode(struct inode *inode)
{
const struct super_operations *op = inode->i_sb->s_op;
-
+ if ((inode->i_state & I_DIRTY)) {
+ flush_tree_remove(inode->i_sb, inode);
+ inode->i_state &= ~I_DIRTY;
+ }
list_del_init(&inode->i_list);
list_del_init(&inode->i_sb_list);
inode->i_state |= I_FREEING;
@@ -1080,7 +1088,7 @@ static void generic_forget_inode(struct
struct super_block *sb = inode->i_sb;

if (!hlist_unhashed(&inode->i_hash)) {
- if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+ if (!(inode->i_state & (I_DIRTY|I_SYNC|I_WRITEBACK_NEVER)))
list_move(&inode->i_list, &inode_unused);
inodes_stat.nr_unused++;
if (sb->s_flags & MS_ACTIVE) {
Index: 2624rc7_wb/fs/pipe.c
===================================================================
--- 2624rc7_wb.orig/fs/pipe.c 2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/pipe.c 2008-01-08 13:55:15.000000000 -0800
@@ -930,13 +930,7 @@ static struct inode * get_pipe_inode(voi
pipe->readers = pipe->writers = 1;
inode->i_fop = &rdwr_pipe_fops;

- /*
- * Mark the inode dirty from the very beginning,
- * that way it will never be moved to the dirty
- * list because "mark_inode_dirty()" will think
- * that it already _is_ on the dirty list.
- */
- inode->i_state = I_DIRTY;
+ inode->i_state = I_WRITEBACK_NEVER;
inode->i_mode = S_IFIFO | S_IRUSR | S_IWUSR;
inode->i_uid = current->fsuid;
inode->i_gid = current->fsgid;
Index: 2624rc7_wb/fs/super.c
===================================================================
--- 2624rc7_wb.orig/fs/super.c 2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/super.c 2008-01-08 15:04:11.000000000 -0800
@@ -61,9 +61,7 @@ static struct super_block *alloc_super(s
s = NULL;
goto out;
}
- INIT_LIST_HEAD(&s->s_dirty);
- INIT_LIST_HEAD(&s->s_io);
- INIT_LIST_HEAD(&s->s_more_io);
+ s->s_flush_root = RB_ROOT;
INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_HEAD(&s->s_anon);
@@ -103,6 +101,7 @@ out:
*/
static inline void destroy_super(struct super_block *s)
{
+ mutex_destroy(&s->s_flush_lock);
security_sb_free(s);
kfree(s->s_subtype);
kfree(s);
Index: 2624rc7_wb/include/linux/fs.h
===================================================================
--- 2624rc7_wb.orig/include/linux/fs.h 2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/include/linux/fs.h 2008-01-08 14:57:42.000000000 -0800
@@ -280,6 +280,7 @@ extern int dir_notify_enable;
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/radix-tree.h>
+#include <linux/rbtree.h>
#include <linux/prio_tree.h>
#include <linux/init.h>
#include <linux/pid.h>
@@ -592,6 +593,8 @@ struct inode {
struct hlist_node i_hash;
struct list_head i_list;
struct list_head i_sb_list;
+ struct rb_node i_flush_node;
+ unsigned long long i_flush_gen;
struct list_head i_dentry;
unsigned long i_ino;
atomic_t i_count;
@@ -1003,9 +1006,10 @@ struct super_block {
struct xattr_handler **s_xattr;

struct list_head s_inodes; /* all inodes */
- struct list_head s_dirty; /* dirty inodes */
- struct list_head s_io; /* parked for writeback */
- struct list_head s_more_io; /* parked for more writeback */
+ struct rb_root s_flush_root;
+ unsigned long s_flush_count;
+ unsigned long long s_flush_gen;
+
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
struct list_head s_files;

@@ -1308,6 +1312,8 @@ struct super_operations {
* of inode dirty data. Having a seperate lock for this
* purpose reduces latency and prevents some filesystem-
* specific deadlocks.
+ *I_WRITEBACK_NEVER For inodes that we may dirty data to but never
+ * want written back.
*
* Q: Why does I_DIRTY_DATASYNC exist? It appears as if it could be replaced
* by (I_DIRTY_SYNC|I_DIRTY_PAGES).
@@ -1326,6 +1332,7 @@ struct super_operations {
#define I_LOCK (1 << __I_LOCK)
#define __I_SYNC 8
#define I_SYNC (1 << __I_SYNC)
+#define I_WRITEBACK_NEVER 512

#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

Index: 2624rc7_wb/include/linux/rbtree.h
===================================================================
--- 2624rc7_wb.orig/include/linux/rbtree.h 2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/include/linux/rbtree.h 2008-01-09 01:41:30.000000000 -0800
@@ -135,6 +135,9 @@ static inline void rb_set_color(struct r
#define RB_EMPTY_ROOT(root) ((root)->rb_node == NULL)
#define RB_EMPTY_NODE(node) (rb_parent(node) == node)
#define RB_CLEAR_NODE(node) (rb_set_parent(node, node))
+#define RB_LINKED_NODE(node) ((node)->rb_parent_color || \
+ (node)->rb_left || (node)->rb_right)
+

extern void rb_insert_color(struct rb_node *, struct rb_root *);
extern void rb_erase(struct rb_node *, struct rb_root *);
Index: 2624rc7_wb/include/linux/writeback.h
===================================================================
--- 2624rc7_wb.orig/include/linux/writeback.h 2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/include/linux/writeback.h 2008-01-08 14:04:38.000000000 -0800
@@ -30,7 +30,6 @@ static inline int task_is_pdflush(struct
enum writeback_sync_modes {
WB_SYNC_NONE, /* Don't wait on anything */
WB_SYNC_ALL, /* Wait on every mapping */
- WB_SYNC_HOLD, /* Hold the inode on sb_dirty for sys_sync() */
};

/*
@@ -72,6 +71,8 @@ void writeback_inodes(struct writeback_c
int inode_wait(void *);
void sync_inodes_sb(struct super_block *, int wait);
void sync_inodes(int wait);
+void flush_tree_remove(struct super_block *sb, struct inode *inode);
+

/* writeback.h requires fs.h; it, too, is not included from here. */
static inline void wait_on_inode(struct inode *inode)


2008-01-15 08:46:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure


On Tue, 2008-01-15 at 00:09 -0800, Michael Rubin wrote:
> From: Michael Rubin <[email protected]>
>
> For those of you who have waited so long. This is the third submission
> of the first attempt at this patch. It is a trilogy.

Just a quick question, how does this interact with/depend upon etc.
Fengguang's patches I still have in my mailbox? (Those from Dec 28th)




2008-01-15 17:54:03

by Michael Rubin

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Jan 15, 2008 12:46 AM, Peter Zijlstra <[email protected]> wrote:
> Just a quick question, how does this interact with/depend upon etc.
> Fengguang's patches I still have in my mailbox? (Those from Dec 28th)

They don't. They apply to a 2.6.24rc7 tree. This is a candidate for 2.6.25.

This work was done before Fengguang's patches. I am trying to test
Fengguang's for comparison but am having problems with getting mm1 to
boot on my systems.

mrubin

2008-01-16 03:01:30

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> On Jan 15, 2008 12:46 AM, Peter Zijlstra <[email protected]> wrote:
> > Just a quick question, how does this interact with/depend upon etc.
> > Fengguang's patches I still have in my mailbox? (Those from Dec 28th)
>
> They don't. They apply to a 2.6.24rc7 tree. This is a candidate for 2.6.25.
>
> This work was done before Fengguang's patches. I am trying to test
> Fengguang's for comparison but am having problems with getting mm1 to
> boot on my systems.

Yeah, they are independent ones. The initial motivation is to fix the
bug "sluggish writeback on small+large files". Michael introduced
a new rbtree, and I introduced a new list (s_more_io_wait).

Basically I think rbtree is an overkill to do time based ordering.
Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
provides fair queuing between small/large files, and s_more_io_wait
provides waiting mechanism for blocked inodes.

The time ordered rbtree may delay io for a blocked inode simply by
modifying its dirtied_when and reinserting it. But it would no longer be
that easy if it is to be ordered by location.

If we are going to do location based ordering in the future, the lists
will continue to be useful. It would simply be a matter of switching
from the s_dirty(order by time) to some rbtree or radix tree(order by
location).

We can even provide both orderings at the same time to different
fs/inodes, configurable by the user. Because the s_dirty
and/or rbtree would provide _only_ ordering (not fairness or waiting),
it is interchangeable.
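Fengguang's "ordering is interchangeable" point can be sketched in userspace C. This is illustrative only (the field names mirror the discussion, the rest is made up): if the queue provides nothing but ordering, switching policy is just switching the comparison key, without touching any fairness or waiting logic.

```c
#include <stdlib.h>

/* A dirty inode reduced to the two candidate ordering keys. */
struct kinode {
    unsigned long dirtied_when; /* time key   */
    unsigned long i_ino;        /* location key (as proposed for XFS) */
};

/* Order by time of first dirtying. */
static int by_time(const void *a, const void *b)
{
    const struct kinode *x = a, *y = b;
    return (x->dirtied_when > y->dirtied_when) -
           (x->dirtied_when < y->dirtied_when);
}

/* Order by location, approximated by inode number. */
static int by_location(const void *a, const void *b)
{
    const struct kinode *x = a, *y = b;
    return (x->i_ino > y->i_ino) - (x->i_ino < y->i_ino);
}

/* Order a writeback batch with whichever policy was selected. */
static void order_batch(struct kinode *v, size_t n,
                        int (*cmp)(const void *, const void *))
{
    qsort(v, n, sizeof(*v), cmp);
}
```

The fairness and blocked-inode mechanisms would sit outside `order_batch()`, which is exactly why the ordering structure can be swapped later.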

This patchset could be a good reference. It does location based
ordering with radix tree:

[RFC][PATCH] clustered writeback <http://lkml.org/lkml/2007/8/27/45>

Thank you,
Fengguang

2008-01-16 03:44:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <[email protected]> wrote:

> On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > On Jan 15, 2008 12:46 AM, Peter Zijlstra <[email protected]> wrote:
> > > Just a quick question, how does this interact with/depend upon etc.
> > > Fengguang's patches I still have in my mailbox? (Those from Dec 28th)
> >
> > They don't. They apply to a 2.6.24rc7 tree. This is a candidate for 2.6.25.
> >
> > This work was done before Fengguang's patches. I am trying to test
> > Fengguang's for comparison but am having problems with getting mm1 to
> > boot on my systems.
>
> Yeah, they are independent ones. The initial motivation is to fix the
> bug "sluggish writeback on small+large files". Michael introduced
> a new rbtree, and I introduced a new list (s_more_io_wait).
>
> Basically I think rbtree is an overkill to do time based ordering.
> Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> provides fair queuing between small/large files, and s_more_io_wait
> provides waiting mechanism for blocked inodes.
>
> The time ordered rbtree may delay io for a blocked inode simply by
> modifying its dirtied_when and reinserting it. But it would no longer be
> that easy if it is to be ordered by location.

What does the term "ordered by location" mean? Attempting to sort inodes by
physical disk address? By using their i_ino as a key?

That sounds optimistic.

> If we are going to do location based ordering in the future, the lists
> will continue to be useful. It would simply be a matter of switching
> from the s_dirty(order by time) to some rbtree or radix tree(order by
> location).
>
> We can even provide both ordering at the same time to different
> fs/inodes which is configurable by the user. Because the s_dirty
> and/or rbtree would provide _only_ ordering (not fairness or waiting)
> and hence is interchangeable.
>
> This patchset could be a good reference. It does location based
> ordering with radix tree:
>
> [RFC][PATCH] clustered writeback <http://lkml.org/lkml/2007/8/27/45>

list_heads are just the wrong data structure for this function. Especially
list_heads which are protected by a non-sleeping lock.

2008-01-16 04:26:20

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Tue, Jan 15, 2008 at 07:44:15PM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <[email protected]> wrote:
>
> > On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > > On Jan 15, 2008 12:46 AM, Peter Zijlstra <[email protected]> wrote:
> > > > Just a quick question, how does this interact with/depend upon etc.
> > > > Fengguang's patches I still have in my mailbox? (Those from Dec 28th)
> > >
> > > They don't. They apply to a 2.6.24rc7 tree. This is a candidate for 2.6.25.
> > >
> > > This work was done before Fengguang's patches. I am trying to test
> > > Fengguang's for comparison but am having problems with getting mm1 to
> > > boot on my systems.
> >
> > Yeah, they are independent ones. The initial motivation is to fix the
> > bug "sluggish writeback on small+large files". Michael introduced
> > a new rbtree, and I introduced a new list (s_more_io_wait).
> >
> > Basically I think rbtree is an overkill to do time based ordering.
> > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > provides fair queuing between small/large files, and s_more_io_wait
> > provides waiting mechanism for blocked inodes.
> >
> > The time ordered rbtree may delay io for a blocked inode simply by
> > modifying its dirtied_when and reinserting it. But it would no longer be
> > that easy if it is to be ordered by location.
>
> What does the term "ordered by location" mean? Attempting to sort inodes by
> physical disk address? By using their i_ino as a key?
>
> That sounds optimistic.

Yes, exactly. Think about email servers with lots of dirty files.

> > If we are going to do location based ordering in the future, the lists
> > will continue to be useful. It would simply be a matter of switching
> > from the s_dirty(order by time) to some rbtree or radix tree(order by
> > location).
> >
> > We can even provide both ordering at the same time to different
> > fs/inodes which is configurable by the user. Because the s_dirty
> > and/or rbtree would provide _only_ ordering (not fairness or waiting)
> > and hence is interchangeable.
> >
> > This patchset could be a good reference. It does location based
> > ordering with radix tree:
> >
> > [RFC][PATCH] clustered writeback <http://lkml.org/lkml/2007/8/27/45>
>
> list_heads are just the wrong data structure for this function. Especially
> list_heads which are protected by a non-sleeping lock.

list_heads are OK if we use them for one and only one function. We have
been trying to jam too much into s_dirty in the past. Grabbing a
refcount could be better than locking - anyway, if we split the
functions today, it would be easy to replace the list_heads one by
one in the future.

2008-01-16 04:42:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Wed, 16 Jan 2008 12:25:53 +0800 Fengguang Wu <[email protected]> wrote:

> list_heads are OK if we use them for one and only one function.

Not really. They're inappropriate when you wish to remember your
position in the list while you dropped the lock (as we must do in
writeback).

A data structure which permits us to iterate across the search key rather
than across the actual storage locations is more appropriate.
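Andrew's point can be shown with a minimal userspace sketch (names are illustrative, not kernel API): with a key-indexed structure, a writer can drop the lock, do I/O, and resume by asking for the first key greater than the last one it processed. A remembered node pointer, by contrast, dangles if that node is removed while the lock is dropped.

```c
#include <stddef.h>

/* Model the dirty-inode index as a sorted array of search keys
 * (e.g. dirtied_when values).  next_after() is a key-based cursor:
 * it survives arbitrary insertions and removals made while the
 * caller held no lock, because it never stores a node pointer. */
static const unsigned long *next_after(const unsigned long *keys, size_t n,
                                       unsigned long last_key)
{
    for (size_t i = 0; i < n; i++)   /* keys[] is sorted ascending */
        if (keys[i] > last_key)
            return &keys[i];
    return NULL;                     /* no newer entry: pass complete */
}
```

An rbtree or radix tree provides the same "first key greater than X" lookup in logarithmic time, which is what makes it a better fit than a list_head here.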

2008-01-16 04:56:21

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Tue, Jan 15, 2008 at 08:42:36PM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 12:25:53 +0800 Fengguang Wu <[email protected]> wrote:
>
> > list_heads are OK if we use them for one and only one function.
>
> Not really. They're inappropriate when you wish to remember your
> position in the list while you dropped the lock (as we must do in
> writeback).
>
> A data structure which permits us to iterate across the search key rather
> than across the actual storage locations is more appropriate.

I totally agree with you. What I mean is to first do the split of
functions - into three: ordering, starvation prevention, and blockade
waiting. Then to do better ordering by adopting a radix tree (or rbtree
if the radix tree is not enough), and lastly get rid of the list_heads to
avoid locking. Does it sound like a good path?

2008-01-16 05:52:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Wed, 16 Jan 2008 12:55:07 +0800 Fengguang Wu <[email protected]> wrote:

> On Tue, Jan 15, 2008 at 08:42:36PM -0800, Andrew Morton wrote:
> > On Wed, 16 Jan 2008 12:25:53 +0800 Fengguang Wu <[email protected]> wrote:
> >
> > > list_heads are OK if we use them for one and only one function.
> >
> > Not really. They're inappropriate when you wish to remember your
> > position in the list while you dropped the lock (as we must do in
> > writeback).
> >
> > A data structure which permits us to iterate across the search key rather
> > than across the actual storage locations is more appropriate.
>
> I totally agree with you. What I mean is to first do the split of
> functions - into three: ordering, starvation prevention, and blockade
> waiting.

Does "ordering" here refer to ordering by time-of-first-dirty?

What is "blockade waiting"?

> Then to do better ordering by adopting radix tree(or rbtree
> if radix tree is not enough),

ordering of what?

> and lastly get rid of the list_heads to
> avoid locking. Does it sound like a good path?

I'd have thought that replacing list_heads with another data structure
would be a single commit.

2008-01-16 07:56:18

by David Chinner

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Tue, Jan 15, 2008 at 07:44:15PM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <[email protected]> wrote:
>
> > On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > > On Jan 15, 2008 12:46 AM, Peter Zijlstra <[email protected]> wrote:
> > > > Just a quick question, how does this interact with/depend upon etc.
> > > > Fengguang's patches I still have in my mailbox? (Those from Dec 28th)
> > >
> > > They don't. They apply to a 2.6.24rc7 tree. This is a candidate for 2.6.25.
> > >
> > > This work was done before Fengguang's patches. I am trying to test
> > > Fengguang's for comparison but am having problems with getting mm1 to
> > > boot on my systems.
> >
> > Yeah, they are independent ones. The initial motivation is to fix the
> > bug "sluggish writeback on small+large files". Michael introduced
> > a new rbtree, and I introduced a new list (s_more_io_wait).
> >
> > Basically I think rbtree is an overkill to do time based ordering.
> > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > provides fair queuing between small/large files, and s_more_io_wait
> > provides waiting mechanism for blocked inodes.
> >
> > The time ordered rbtree may delay io for a blocked inode simply by
> > modifying its dirtied_when and reinserting it. But it would no longer be
> > that easy if it is to be ordered by location.
>
> What does the term "ordered by location" mean? Attempting to sort inodes by
> physical disk address? By using their i_ino as a key?
>
> That sounds optimistic.

In XFS, the inode number is an encoding of its location on disk, so
ordering inode writeback by inode number *does* make sense.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-01-16 08:14:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Wed, 16 Jan 2008 18:55:38 +1100 David Chinner <[email protected]> wrote:

> On Tue, Jan 15, 2008 at 07:44:15PM -0800, Andrew Morton wrote:
> > On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <[email protected]> wrote:
> >
> > > On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > > > On Jan 15, 2008 12:46 AM, Peter Zijlstra <[email protected]> wrote:
> > > > > Just a quick question, how does this interact with/depend upon etc.
> > > > > Fengguang's patches I still have in my mailbox? (Those from Dec 28th)
> > > >
> > > > They don't. They apply to a 2.6.24rc7 tree. This is a candidate for 2.6.25.
> > > >
> > > > This work was done before Fengguang's patches. I am trying to test
> > > > Fengguang's for comparison but am having problems with getting mm1 to
> > > > boot on my systems.
> > >
> > > Yeah, they are independent ones. The initial motivation is to fix the
> > > bug "sluggish writeback on small+large files". Michael introduced
> > > a new rbtree, and I introduced a new list (s_more_io_wait).
> > >
> > > Basically I think rbtree is an overkill to do time based ordering.
> > > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > > provides fair queuing between small/large files, and s_more_io_wait
> > > provides waiting mechanism for blocked inodes.
> > >
> > > The time ordered rbtree may delay io for a blocked inode simply by
> > > modifying its dirtied_when and reinserting it. But it would no longer be
> > > that easy if it is to be ordered by location.
> >
> > What does the term "ordered by location" mean? Attempting to sort inodes by
> > physical disk address? By using their i_ino as a key?
> >
> > That sounds optimistic.
>
> In XFS, the inode number is an encoding of its location on disk, so
> ordering inode writeback by inode number *does* make sense.

This code is mainly concerned with writing pagecache data, not inodes.

2008-01-16 09:07:36

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 12:55:07 +0800 Fengguang Wu <[email protected]> wrote:
>
> > On Tue, Jan 15, 2008 at 08:42:36PM -0800, Andrew Morton wrote:
> > > On Wed, 16 Jan 2008 12:25:53 +0800 Fengguang Wu <[email protected]> wrote:
> > >
> > > > list_heads are OK if we use them for one and only one function.
> > >
> > > Not really. They're inappropriate when you wish to remember your
> > > position in the list while you dropped the lock (as we must do in
> > > writeback).
> > >
> > > A data structure which permits us to iterate across the search key rather
> > > than across the actual storage locations is more appropriate.
> >
> > I totally agree with you. What I mean is to first do the split of
> > functions - into three: ordering, starvation prevention, and blockade
> > waiting.
>
> Does "ordering" here refer to ordering by time-of-first-dirty?

Ordering by dirtied_when or i_ino, either is OK.

> What is "blockade waiting"?

Some inodes/pages cannot be synced now for some reason and should be
retried after a while.

> > Then to do better ordering by adopting radix tree(or rbtree
> > if radix tree is not enough),
>
> ordering of what?

Switch from time to location.

> > and lastly get rid of the list_heads to
> > avoid locking. Does it sound like a good path?
>
> I'd have thought that replacing list_heads with another data structure
> would be a single commit.

That would be easy. s_more_io and s_more_io_wait can all be converted
to radix trees.

2008-01-16 13:07:06

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Wed, Jan 16, 2008 at 12:13:43AM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 18:55:38 +1100 David Chinner <[email protected]> wrote:
>
> > On Tue, Jan 15, 2008 at 07:44:15PM -0800, Andrew Morton wrote:
> > > On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <[email protected]> wrote:
> > >
> > > > On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > > > > On Jan 15, 2008 12:46 AM, Peter Zijlstra <[email protected]> wrote:
> > > > > > Just a quick question, how does this interact with/depend upon etc.
> > > > > > Fengguang's patches I still have in my mailbox? (Those from Dec 28th)
> > > > >
> > > > > They don't. They apply to a 2.6.24rc7 tree. This is a candidate for 2.6.25.
> > > > >
> > > > > This work was done before Fengguang's patches. I am trying to test
> > > > > Fengguang's for comparison but am having problems with getting mm1 to
> > > > > boot on my systems.
> > > >
> > > > Yeah, they are independent ones. The initial motivation is to fix the
> > > > bug "sluggish writeback on small+large files". Michael introduced
> > > > a new rbtree, and I introduced a new list (s_more_io_wait).
> > > >
> > > > Basically I think rbtree is an overkill to do time based ordering.
> > > > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > > > provides fair queuing between small/large files, and s_more_io_wait
> > > > provides waiting mechanism for blocked inodes.
> > > >
> > > > The time ordered rbtree may delay io for a blocked inode simply by
> > > > modifying its dirtied_when and reinserting it. But it would no longer be
> > > > that easy if it is to be ordered by location.
> > >
> > > What does the term "ordered by location" mean? Attempting to sort inodes by
> > > physical disk address? By using their i_ino as a key?
> > >
> > > That sounds optimistic.
> >
> > In XFS, the inode number is an encoding of its location on disk, so
> > ordering inode writeback by inode number *does* make sense.
>
> This code is mainly concerned with writing pagecache data, not inodes.

Now I tend to agree with you. It is not easy to sort writeback pages
by disk location. There are three main obstacles:
- there can be multiple filesystems on the same disk;
- ext3/reiserfs/... make use of the blockdev inode, which spans the whole
  partition;
- xfs may store inode metadata/data separately.

2008-01-16 18:55:47

by Michael Rubin

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Jan 15, 2008 7:01 PM, Fengguang Wu <[email protected]> wrote:
> Basically I think rbtree is an overkill to do time based ordering.
> Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> provides fair queuing between small/large files, and s_more_io_wait
> provides waiting mechanism for blocked inodes.

I think the flush_tree (which is a little more than just an rbtree)
provides the same queuing mechanisms that the three or four list
heads do, and manages to do it in one structure. The i_flushed_when
provides the ability to have blocked inodes wait their turn, so to
speak.

Another motivation behind the rbtree patch is to unify the data
structure that handles the priority and mechanism of how we write out
the pages of the inodes. There are some ideas about introducing
priority schemes for QOS and such in the future. I am not saying this
patch is about making that happen, but the idea is, if possible, to
unify the four stages of lists into a single structure to facilitate
efforts like that.
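The ordering policy Michael describes for the flush_tree (earliest dirtied_when first; among duplicates, earliest i_flush_gen first) can be modeled in a small userspace sketch. This is illustrative only, not the kernel patch: a sorted bucket list stands in for the rbtree, since the rbtree only changes lookup cost, not the policy.

```c
#include <stdlib.h>

/* One queued "inode": keyed by dirtied_when; duplicates for one key
 * are chained on a per-key list sorted by i_flush_gen. */
struct fnode {
    unsigned long dirtied_when;
    unsigned long i_flush_gen;
    struct fnode *next;          /* same-key chain, by i_flush_gen */
};

struct fbucket {
    unsigned long key;           /* the dirtied_when value */
    struct fnode *inodes;
    struct fbucket *next;        /* bucket list, by dirtied_when */
};

static void flush_tree_insert(struct fbucket **tree, struct fnode *n)
{
    struct fbucket **bp = tree;
    while (*bp && (*bp)->key < n->dirtied_when)
        bp = &(*bp)->next;
    if (!*bp || (*bp)->key != n->dirtied_when) {
        struct fbucket *b = calloc(1, sizeof(*b));
        b->key = n->dirtied_when;
        b->next = *bp;
        *bp = b;
    }
    /* keep the duplicate-key chain sorted by i_flush_gen */
    struct fnode **np = &(*bp)->inodes;
    while (*np && (*np)->i_flush_gen <= n->i_flush_gen)
        np = &(*np)->next;
    n->next = *np;
    *np = n;
}

/* Next inode to flush: earliest dirtied_when, then earliest i_flush_gen. */
static struct fnode *flush_tree_next(struct fbucket **tree)
{
    struct fbucket *b = *tree;
    if (!b)
        return NULL;
    struct fnode *n = b->inodes;
    b->inodes = n->next;
    if (!b->inodes) {            /* bucket drained: drop it */
        *tree = b->next;
        free(b);
    }
    return n;
}
```

A blocked inode is "requeued" in this scheme by bumping its dirtied_when and reinserting it, which is the behaviour Fengguang noted earlier in the thread.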

mrubin

2008-01-16 22:36:21

by David Chinner

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > Then to do better ordering by adopting radix tree(or rbtree
> > > if radix tree is not enough),
> >
> > ordering of what?
>
> Switch from time to location.

Note that data writeback may be adversely affected by location
based writeback rather than time based writeback - think of
the effect of location based data writeback on an app that
creates lots of short term (<30s) temp files and then removes
them before they are written back.

Also, data writeback location cannot be easily derived from
the inode number in pretty much all cases. "near" in terms
of XFS means the same AG, which means the data could be up to
a TB away from the inode, and if you have >1TB filesystems
using the default inode32 allocator, file data is *never*
placed near the inode - the inodes are in the first TB of
the filesystem, the data is rotored around the rest of the
filesystem.

And with delayed allocation, you don't know where the data is even
going to be written ahead of the filesystem ->writepage call, so you
can't do optimal location ordering for data in this case.

> > > and lastly get rid of the list_heads to
> > > avoid locking. Does it sound like a good path?
> >
> > I'd have thaought that replacing list_heads with another data structure
> > would be a simgle commit.
>
> That would be easy. s_more_io and s_more_io_wait can all be converted
> to radix trees.

Makes sense for location based writeback of the inodes themselves,
but not for data.

Hmmmm - I'm wondering if we'd do better to split data writeback from
inode writeback. i.e. we do two passes. The first pass writes all
the data back in time order, the second pass writes all the inodes
back in location order.

Right now we interleave data and inode writeback, (i.e. we do data,
inode, data, inode, data, inode, ....). I'd much prefer to see all
data written out first, then the inodes. ->writepage often dirties
the inode and hence if we need to do multiple do_writepages() calls
on an inode to flush all the data (e.g. congestion, large amounts of
data to be written, etc), we really shouldn't be calling
write_inode() after every do_writepages() call. The inode
should not be written until all the data is written....
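Dave's two-pass idea can be sketched with a toy model (illustrative names, not the kernel code paths): writing data often redirties the inode metadata, so interleaving data and inode writes can write the inode repeatedly, whereas two passes write each inode exactly once, after its data is out.

```c
/* A fake inode with just enough state to show the effect. */
struct minode {
    int dirty_pages;
    int metadata_dirty;
    int inode_writes;            /* times write_inode() actually wrote */
};

static void do_writepages(struct minode *i)
{
    if (i->dirty_pages) {
        i->dirty_pages = 0;
        i->metadata_dirty = 1;   /* ->writepage dirties the inode */
    }
}

static void write_inode(struct minode *i)
{
    if (i->metadata_dirty) {
        i->metadata_dirty = 0;
        i->inode_writes++;
    }
}

static void two_pass_sync(struct minode *v, int n)
{
    for (int k = 0; k < n; k++)  /* pass 1: all data, in time order */
        do_writepages(&v[k]);
    for (int k = 0; k < n; k++)  /* pass 2: all inodes, e.g. location order */
        write_inode(&v[k]);
}
```

In the interleaved scheme, each do_writepages()/write_inode() pair would bump inode_writes once per chunk; here it is bumped once per inode regardless of how many data chunks were needed.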

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-01-17 03:16:18

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Thu, Jan 17, 2008 at 09:35:10AM +1100, David Chinner wrote:
> On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> > On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > > Then to do better ordering by adopting radix tree(or rbtree
> > > > if radix tree is not enough),
> > >
> > > ordering of what?
> >
> > Switch from time to location.
>
> Note that data writeback may be adversely affected by location
> based writeback rather than time based writeback - think of
> the effect of location based data writeback on an app that
> creates lots of short term (<30s) temp files and then removes
> them before they are written back.

A small (e.g. 5s) time window can still be enforced, but...

> Also, data writeback location cannot be easily derived from
> the inode number in pretty much all cases. "near" in terms
> of XFS means the same AG, which means the data could be up to
> a TB away from the inode, and if you have >1TB filesystems
> using the default inode32 allocator, file data is *never*
> placed near the inode - the inodes are in the first TB of
> the filesystem, the data is rotored around the rest of the
> filesystem.
>
> And with delayed allocation, you don't know where the data is even
> going to be written ahead of the filesystem ->writepage call, so you
> can't do optimal location ordering for data in this case.

Agreed.

> Hmmmm - I'm wondering if we'd do better to split data writeback from
> inode writeback. i.e. we do two passes. The first pass writes all
> the data back in time order, the second pass writes all the inodes
> back in location order.
>
> Right now we interleave data and inode writeback, (i.e. we do data,
> inode, data, inode, data, inode, ....). I'd much prefer to see all
> data written out first, then the inodes. ->writepage often dirties
> the inode and hence if we need to do multiple do_writepages() calls
> on an inode to flush all the data (e.g. congestion, large amounts of
> data to be written, etc), we really shouldn't be calling
> write_inode() after every do_writepages() call. The inode
> should not be written until all the data is written....

That may help XFS. Another case is documented as follows:
"the write_inode() function of a typical fs will perform no I/O, but
will mark buffers in the blockdev mapping as dirty."

2008-01-17 03:31:33

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Wed, Jan 16, 2008 at 10:55:28AM -0800, Michael Rubin wrote:
> On Jan 15, 2008 7:01 PM, Fengguang Wu <[email protected]> wrote:
> > Basically I think rbtree is an overkill to do time based ordering.
> > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > provides fair queuing between small/large files, and s_more_io_wait
> > provides waiting mechanism for blocked inodes.
>
> I think the flush_tree (which is a little more than just an rbtree)
> provides the same queuing mechanisms that the three or four list
> heads do, and manages to do it in one structure. The i_flushed_when
> provides the ability to have blocked inodes wait their turn, so to
> speak.
>
> Another motivation behind the rbtree patch is to unify the data
> structure that handles the priority and mechanism of how we write out
> the pages of the inodes. There are some ideas about introducing
> priority schemes for QOS and such in the future. I am not saying this
> patch is about making that happen, but the idea is, if possible, to
> unify the four stages of lists into a single structure to facilitate
> efforts like that.

Yeah, rbtree is better than list_heads after all. Let's make it happen.

2008-01-17 05:22:16

by David Chinner

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Thu, Jan 17, 2008 at 11:16:00AM +0800, Fengguang Wu wrote:
> On Thu, Jan 17, 2008 at 09:35:10AM +1100, David Chinner wrote:
> > On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> > > On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > > > Then to do better ordering by adopting radix tree(or rbtree
> > > > > if radix tree is not enough),
> > > >
> > > > ordering of what?
> > >
> > > Switch from time to location.
> >
> > Note that data writeback may be adversely affected by location
> > based writeback rather than time based writeback - think of
> > the effect of location based data writeback on an app that
> > creates lots of short term (<30s) temp files and then removes
> > them before they are written back.
>
> A small(e.g. 5s) time window can still be enforced, but...

Yes, you could, but that will then result in non-deterministic
performance for repeated workloads because the order of file
writeback will not be consistent.

e.g. the first run is fast because the output file is at a lower
offset than the temp file, meaning the temp file gets deleted
without being written.

The second run is slow because the location of the files is
reversed and the temp file is written to disk before the
final output file and hence the run is much slower because
it writes much more.

The third run is also slow, but the files are like the first
fast run. However, pdflush tries to write the temp file back
within 5s of it being dirtied so it skips it and writes
the output file first.

The difference between the first and second runs can be explained by
knowing that inode number determines writeback order, but
there is no obvious clue as to why the first and third runs are
different.

This is exactly the sort of non-deterministic behaviour we
want to avoid in a writeback algorithm.

> > Hmmmm - I'm wondering if we'd do better to split data writeback from
> > inode writeback. i.e. we do two passes. The first pass writes all
> > the data back in time order, the second pass writes all the inodes
> > back in location order.
> >
> > Right now we interleave data and inode writeback, (i.e. we do data,
> > inode, data, inode, data, inode, ....). I'd much prefer to see all
> > data written out first, then the inodes. ->writepage often dirties
> > the inode and hence if we need to do multiple do_writepages() calls
> > on an inode to flush all the data (e.g. congestion, large amounts of
> > data to be written, etc), we really shouldn't be calling
> > write_inode() after every do_writepages() call. The inode
> > should not be written until all the data is written....
>
> That may help XFS. Another case is documented as follows:
> "the write_inode() function of a typical fs will perform no I/O, but
> will mark buffers in the blockdev mapping as dirty."

Yup, but in that situation ->write_inode() does not do any I/O, so
it will work with any high level inode writeback ordering or timing
scheme equally well. As a result, that's not the case we need to
optimise at all.

FWIW, the NFS client is likely to work better with split data/
inode writeback as it also has to mark the inode dirty on async
write completion (to get ->write_inode called to issue a commit
RPC). Hence delaying the inode write until after all the data
is written makes sense there as well....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-01-17 09:42:23

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Tue, Jan 15, 2008 at 12:09:21AM -0800, Michael Rubin wrote:
> 1) Adding a data structure to guarantee fairness when writing
> inodes to disk and simplify the code base.
>
> When inodes are parked for writeback they are parked in the
> flush_tree. The flush tree is a data structure based on an rb tree.

The main benefit of rbtree is possibly better support of future policies.
Can you demonstrate an example?

(grumble:
Apart from the benefit of flexibility, I don't think it makes things
simpler, nor does the introduction of an rbtree automatically fix bugs.
Bugs can only be avoided by good understanding of all possible cases.)

The most tricky writeback issues could be starvation prevention
between
- small/large files
- new/old files
- superblocks
Some kind of limit should be applied for each. They used to be:
- requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached
this preempts big files
- refill s_io iff it is drained
this prevents promotion of big/old files
- return from sync_sb_inodes() after one go of s_io
(todo: don't restart from the first superblock in writeback_inodes())
this prevents busy superblock from starving others
and ensures fairness between superblocks
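The first of these rules (requeue after MAX_WRITEBACK_PAGES so big files cannot starve small ones) can be shown with a toy model. Constants and names here are made up for the sketch, not the kernel's:

```c
#define MAX_WRITEBACK_PAGES 3   /* illustrative; kernel value differs */

struct qinode { int dirty; int done_order; };

/* Round-robin writeback over a small FIFO: each inode gets at most
 * MAX_WRITEBACK_PAGES per visit; an inode with pages left over is
 * requeued behind everyone else (the s_more_io idea), so a big file
 * is preempted and a small file queued behind it finishes promptly. */
static void writeback_fair(struct qinode *in, int n)
{
    int queue[64], qh = 0, qt = 0, order = 0;

    for (int k = 0; k < n; k++)
        queue[qt++] = k;                 /* initial s_io contents */
    while (qh < qt) {
        struct qinode *i = &in[queue[qh++]];
        int chunk = i->dirty < MAX_WRITEBACK_PAGES ?
                    i->dirty : MAX_WRITEBACK_PAGES;
        i->dirty -= chunk;               /* "write" one chunk */
        if (i->dirty)
            queue[qt++] = (int)(i - in); /* residue: requeue (s_more_io) */
        else
            i->done_order = ++order;     /* inode fully written */
    }
}
```

With a 10-page file queued ahead of a 1-page file, the small file completes first because the large one is preempted after each 3-page chunk.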

Michael, could you sort out and document the new starvation prevention schemes?

> Duplicate keys are handled by making a list in the tree for each key
> value. The order of how we choose the next inode to flush is decided
> by two fields. First the earliest dirtied_when value. If there are
> duplicate dirtied_when values then the earliest i_flush_gen value
> determines who gets flushed next.
>
> The flush tree organizes the dirtied_when keys with the rb_tree. Any
> inodes with a duplicate dirtied_when value are link listed together. This
> link list is sorted by the inode's i_flush_gen. When both the
> dirtied_when and the i_flush_gen are identical the order in the
> linked list determines the order we flush the inodes.

Introduce i_flush_gen to help restarting from the last inode?
Well, it's not as simple as list_heads.

> 2) Added an inode flag to allow inodes to be marked so that they
> are never written back to disk.
>
> The motivation behind this change is several fold. The first is
> to insure fairness in the writeback algorithm. The second is to

What do you mean by fairness? Why cannot I_WRITEBACK_NEVER be in a
decoupled standalone patch?

> deal with a bug where writing to large files concurrently
> with smaller ones creates a situation where writeback cannot
> keep up with traffic and memory balloons until we hit the
> threshold watermark. This can result in surprisingly long latency
> with respect to disk traffic. This latency can take minutes. The

> flush tree fixes this issue and fixes several other minor issues
> with fairness also.

More details about the fixings, please?

2008-01-17 21:07:28

by Michael Rubin

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Jan 17, 2008 1:41 AM, Fengguang Wu <[email protected]> wrote:
> On Tue, Jan 15, 2008 at 12:09:21AM -0800, Michael Rubin wrote:
> The main benefit of rbtree is possibly better support of future policies.
> Can you demonstrate an example?

These are ill-formed thoughts as of now on my end but the idea was
that keeping one tree sorted via a scheme might be simpler than
multiple list_heads.

> Bugs can only be avoided by good understanding of all possible cases.)

I take the above statement as a tautology. And I am trying my best to do so. :-)

> The most tricky writeback issues could be starvation prevention
> between


> - small/large files
> - new/old files
> - superblocks

So I have written tests and believe I have covered these issues. If
you are concerned in specific on any and have a test case please let
me know.

> Some kind of limit should be applied for each. They used to be:
> - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached
> this preempts big files

The patch uses the same limit.

> - refill s_io iff it is drained
> this prevents promotion of big/old files

Once a big file gets its first do_writepages it is moved behind the
other smaller files via i_flushed_when. And the same in reverse for
big vs old.

> - return from sync_sb_inodes() after one go of s_io

I am not sure how this limit helps things out. Is this for superblock
starvation? Can you elaborate?

> Michael, could you sort out and document the new starvation prevention schemes?

The basic idea behind the writeback algorithm is to handle starvation.
The overarching idea is that we want to preserve the order of writeback
based on when an inode was dirtied and also preserve the dirtied_when
contents until the inode has been written back (partially or fully).

On every sync_sb_inodes we find the least recently dirtied inodes. To
deal with large or small starvation we have an s_flush_gen for each
iteration of sync_sb_inodes; every time we issue a writeback we mark
that the inode cannot be processed until the next s_flush_gen. This
way we don't process one inode and never get to the rest, since we
keep pushing them into subsequent s_flush_gens.

Let me know if you want more detail or structured responses.

> Introduce i_flush_gen to help restarting from the last inode?
> Well, it's not as simple as list_heads.
>
> > 2) Added an inode flag to allow inodes to be marked so that they
> > are never written back to disk.
> >
> > The motivation behind this change is several fold. The first is
> > to insure fairness in the writeback algorithm. The second is to
>
> What do you mean by fairness?

So originally this comment was written when I was trying to fix a bug
in 2.6.23. The one where we were starving large files from being
flushed. There was a fairness issue where small files were being
flushed but the large ones were just ballooning in memory.

> Why cannot I_WRITEBACK_NEVER be in a decoupled standalone patch?

The WRITEBACK_NEVER could be in a patch preceding the rbtree one, but
not in a subsequent patch: the rbtree depends on the WRITEBACK_NEVER
patch, otherwise we run into problems in generic_delete_inode. Now
that you point it out I think I can split this patch into two patches
and put WRITEBACK_NEVER in the first one.

> More details about the fixings, please?

So again this comment was written against 2.6.23. The biggest fix is
the starving of big files. I remember there were other smaller issues,
but there have been so many changes in the patch sets that I need to
go back to quantify.

2008-01-18 04:56:34

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> On Jan 17, 2008 1:41 AM, Fengguang Wu <[email protected]> wrote:
> > On Tue, Jan 15, 2008 at 12:09:21AM -0800, Michael Rubin wrote:
> > The main benefit of rbtree is possibly better support of future policies.
> > Can you demonstrate an example?
>
> These are ill-formed thoughts as of now on my end but the idea was
> that keeping one tree sorted via a scheme might be simpler than
> multiple list_heads.

Suppose we want to grant longer expiration window for temp files,
adding a new list named s_dirty_tmpfile would be a handy solution.

So the question is: should we need more than 3 QoS classes?

> > The most tricky writeback issues could be starvation prevention
> > between
>
>
> > - small/large files
> > - new/old files
> > - superblocks
>
> So I have written tests and believe I have covered these issues. If
> you are concerned in specific on any and have a test case please let
> me know.

OK.

> > Some kind of limit should be applied for each. They used to be:
> > - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached
> > this preempts big files
>
> The patch uses the same limit.
>
> > - refill s_io iff it is drained
> > this prevents promotion of big/old files
>
> Once a big file gets its first do_writepages it is moved behind the
> other smaller files via i_flushed_when. And the same in reverse for
> big vs old.

You mean i_flush_gen? No, sync_sb_inodes() will abort on every
MAX_WRITEBACK_PAGES, and s_flush_gen will be updated accordingly.
Hence the sync will restart from big/old files.

>
> > - return from sync_sb_inodes() after one go of s_io
>
> I am not sure how this limit helps things out. Is this for superblock
> starvation? Can you elaborate?

We should have a way to go to the next superblock even if new dirty inodes
or pages are emerging fast in this superblock. Filling and draining s_io
only once and then aborting helps.

s_io is a stable and bounded working set in one go of superblock.

> > Michael, could you sort out and document the new starvation prevention schemes?
>
> The basic idea behind the writeback algorithm is to handle starvation.
> The overarching idea is that we want to preserve the order of writeback
> based on when an inode was dirtied and also preserve the dirtied_when
> contents until the inode has been written back (partially or fully).
>
> On every sync_sb_inodes we find the least recently dirtied inodes. To
> deal with large or small starvation we have an s_flush_gen for each
> iteration of sync_sb_inodes; every time we issue a writeback we mark
> that the inode cannot be processed until the next s_flush_gen. This
> way we don't process one inode and never get to the rest, since we
> keep pushing them into subsequent s_flush_gens.
>
> Let me know if you want more detail or structured responses.
>
> > Introduce i_flush_gen to help restarting from the last inode?
> > Well, it's not as simple as list_heads.

Basically you make one list_head in each rbtree node.
That list_head is recycled cyclically, and is an analog to the
old-fashioned s_dirty. We need to know 'where we are' and 'where it ends'.
So an extra indicator must be introduced - i_flush_gen. It's awkward.

We are simply repeating the aged list_heads' problem.

> > > 2) Added an inode flag to allow inodes to be marked so that they
> > > are never written back to disk.
> > >
> > > The motivation behind this change is several fold. The first is
> > > to insure fairness in the writeback algorithm. The second is to
> >
> > What do you mean by fairness?
>
> So originally this comment was written when I was trying to fix a bug
> in 2.6.23. The one where we were starving large files from being
> flushed. There was a fairness issue where small files were being
> flushed but the large ones were just ballooning in memory.

In fact the bug is turned around rather than fixed - now the small
files could be starved.

> > Why cannot I_WRITEBACK_NEVER be in a decoupled standalone patch?
>
> The WRITEBACK_NEVER could be in a patch preceding the rbtree one, but
> not in a subsequent patch: the rbtree depends on the WRITEBACK_NEVER
> patch, otherwise we run into problems in generic_delete_inode. Now
> that you point it out I think I can split this patch into two patches
> and put WRITEBACK_NEVER in the first one.

OK.

2008-01-18 05:01:38

by David Chinner

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> > Michael, could you sort out and document the new starvation prevention schemes?
>
> The basic idea behind the writeback algorithm is to handle starvation.
> The overarching idea is that we want to preserve the order of writeback
> based on when an inode was dirtied and also preserve the dirtied_when
> contents until the inode has been written back (partially or fully).
>
> On every sync_sb_inodes we find the least recently dirtied inodes. To
> deal with large or small starvation we have an s_flush_gen for each
> iteration of sync_sb_inodes; every time we issue a writeback we mark
> that the inode cannot be processed until the next s_flush_gen. This
> way we don't process one inode and never get to the rest, since we
> keep pushing them into subsequent s_flush_gens.

This seems suboptimal for large files. If you keep feeding in
new least recently dirtied files, the large files will never
get an unimpeded go at the disk and hence we'll struggle to
get decent bandwidth under anything but pure large file
write loads.

Fairness is a tradeoff between seeks and bandwidth. Ideally, we
want to spend 50% of the *disk* time servicing sequential writes and
50% of the time servicing seeky writes - that way neither get
penalised unfairly by the other type of workload.

Switching inodes during writeback implies a seek to the new write
location, while continuing to write the same inode has no seek
penalty because the writeback is sequential. It follows from this
that allowing large files a disproportionate amount of data
writeback is desirable.

Also, cycling rapidly through all the large files to write 4MB to each is
going to cause us to spend time seeking rather than writing compared
to cycling slower and writing 40MB from each large file at a time.

i.e. servicing one large file for 100ms is going to result in higher
writeback throughput than servicing 10 large files for 10ms each
because there's going to be less seeking and more writing done by
the disks.

That is, think of large file writes like process scheduler batch
jobs - bulk throughput is what matters, so the larger the time slice
you give them the higher the throughput.

IMO, the sort of result we should be looking at is a
writeback design that results in cycling somewhat like:

slice 1: iterate over small files
slice 2: flush large file 1
slice 3: iterate over small files
slice 4: flush large file 2
......
slice n-1: flush large file N
slice n: iterate over small files
slice n+1: flush large file N+1

So that we keep the disk busy with a relatively fair mix of
small and large I/Os while both are necessary.

Furthermore, as disk bandwidth goes up, the relationship
between large file and small file writes changes if we want
to maintain writeback at a significant percentage of the
maximum bandwidth of the drive (which is extremely desirable).
So if we take a 4k page machine and a 1024-page writeback slice,
for different disks, we get a bandwidth slice in terms of disk
seeks like:

slow disk: 20MB/s, 10ms seek (say a laptop drive)
- 4MB write takes 200ms, or equivalent of 20 seeks

normal disk: 60MB/s, 8ms seek (desktop)
- 4MB write takes 66ms, or equivalent of 8 seeks

fast disk: 120MB/s, 5ms seek (15krpm SAS)
- 4MB write takes 33ms, or equivalent of 6 seeks

small RAID5 lun: 200MB/s, 4ms seek
- 4MB write takes 20ms, or equivalent of 5 seeks

Large RAID5 lun: 1GB/s, 2ms seek
- 4MB write takes 4ms, or equivalent of 2 seeks

Put simply:

The higher the bandwidth of the device, the more frequently
we need to be servicing the inodes with large amounts of
dirty data to be written to maintain write throughput at a
significant percentage of the device capability.

The writeback algorithm needs to take this into account for it
to be able to scale effectively for high throughput devices.

BTW, it needs to be recognised that if we are under memory pressure
we can clean much more memory in a short period of time by writing
out all the large files first. This would clearly benefit the system
as a whole as we'd get the most pages available for reclaim as
possible in a short a time as possible. The writeback algorithm
should really have a mode that allows this sort of flush ordering to
occur....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-01-18 05:38:41

by Michael Rubin

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Jan 17, 2008 9:01 PM, David Chinner <[email protected]> wrote:

First off thank you for the very detailed reply. This rocks and gives
me much to think about.

> On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> This seems suboptimal for large files. If you keep feeding in
> new least recently dirtied files, the large files will never
> get an unimpeded go at the disk and hence we'll struggle to
> get decent bandwidth under anything but pure large file
> write loads.

You're right. I understand now. I just changed a dial on my tests,
ran it and found pdflush not keeping up like it should. I need to
address this.

> Switching inodes during writeback implies a seek to the new write
> location, while continuing to write the same inode has no seek
> penalty because the writeback is sequential. It follows from this
> that allowing large files a disproportionate amount of data
> writeback is desirable.
>
> Also, cycling rapidly through all the large files to write 4MB to each is
> going to cause us to spend time seeking rather than writing compared
> to cycling slower and writing 40MB from each large file at a time.
>
> i.e. servicing one large file for 100ms is going to result in higher
> writeback throughput than servicing 10 large files for 10ms each
> because there's going to be less seeking and more writing done by
> the disks.
>
> That is, think of large file writes like process scheduler batch
> jobs - bulk throughput is what matters, so the larger the time slice
> you give them the higher the throughput.
>
> IMO, the sort of result we should be looking at is a
> writeback design that results in cycling somewhat like:
>
> slice 1: iterate over small files
> slice 2: flush large file 1
> slice 3: iterate over small files
> slice 4: flush large file 2
> ......
> slice n-1: flush large file N
> slice n: iterate over small files
> slice n+1: flush large file N+1
>
> So that we keep the disk busy with a relatively fair mix of
> small and large I/Os while both are necessary.

I am getting where you are coming from. But if we are going to make
changes to optimize for seeks maybe we need to be more aggressive in
write back in how we organize both time and location. Right now AFAIK
there is no attention to location in the writeback path.

> The higher the bandwidth of the device, the more frequently
> we need to be servicing the inodes with large amounts of
> dirty data to be written to maintain write throughput at a
> significant percentage of the device capability.
>

Could you expand that to say it's not the inodes of large files but
the ones whose data lets us exploit locality? Often large files are
fragmented. Would it make more sense to pursue cracking the inodes and
grouping their blocks' locations? Or is this all overkill, better
handled at a lower level like the elevator?

> BTW, it needs to be recognised that if we are under memory pressure
> we can clean much more memory in a short period of time by writing
> out all the large files first. This would clearly benefit the system
> as a whole as we'd get the most pages available for reclaim as
> possible in a short a time as possible. The writeback algorithm
> should really have a mode that allows this sort of flush ordering to
> occur....

I completely agree.

mrubin

2008-01-18 05:41:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

Fengguang Wu <[email protected]> writes:
>
> Suppose we want to grant longer expiration window for temp files,
> adding a new list named s_dirty_tmpfile would be a handy solution.

How would the kernel know that a file is a tmp file?

> So the question is: should we need more than 3 QoS classes?

[just a random idea; i have not worked out all the implications]

Would it be possible to derive a writeback priority from the ionice
level of the process originating the IO? e.g. we have long-standing
problems where background jobs, even when niced, can cause
significant slowdowns to foreground processes by starving IO
and pushing out pages. ionice was supposed to help with that
but in practice it does not seem to have helped too much, and I suspect
it needs more prioritization higher up the VM food chain. Adding
such priorities to writeback would seem like a step in the right
direction, although it would of course not solve the problem
completely.

-Andi

2008-01-18 05:42:08

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

> Fairness is a tradeoff between seeks and bandwidth. Ideally, we
> want to spend 50% of the *disk* time servicing sequential writes and
> 50% of the time servicing seeky writes - that way neither get
> penalised unfairly by the other type of workload.
>
> Switching inodes during writeback implies a seek to the new write
> location, while continuing to write the same inode has no seek
> penalty because the writeback is sequential. It follows from this
> that allowing large files a disproportionate amount of data
> writeback is desirable.
>
> Also, cycling rapidly through all the large files to write 4MB to each is
> going to cause us to spend time seeking rather than writing compared
> to cycling slower and writing 40MB from each large file at a time.
>
> i.e. servicing one large file for 100ms is going to result in higher
> writeback throughput than servicing 10 large files for 10ms each
> because there's going to be less seeking and more writing done by
> the disks.
>
> That is, think of large file writes like process scheduler batch
> jobs - bulk throughput is what matters, so the larger the time slice
> you give them the higher the throughput.
>
> IMO, the sort of result we should be looking at is a
> writeback design that results in cycling somewhat like:
>
> slice 1: iterate over small files
> slice 2: flush large file 1
> slice 3: iterate over small files
> slice 4: flush large file 2
> ......
> slice n-1: flush large file N
> slice n: iterate over small files
> slice n+1: flush large file N+1
>
> So that we keep the disk busy with a relatively fair mix of
> small and large I/Os while both are necessary.

If we can sync fast enough, the lower layer would be able to merge
those 4MB requests. Whatever the slice size is, we end up sending the
same set of pages to disk: the writable pages of expired inodes. We
only have to worry about choosing the right set of pages to be written
in one single batch (at every pdflush wakeup). That set is
currently defined by the 5s/30s dirty expiration rules.

> Furthermore, as disk bandwidth goes up, the relationship
> between large file and small file writes changes if we want
> to maintain writeback at a significant percentage of the
> maximum bandwidth of the drive (which is extremely desirable).
> So if we take a 4k page machine and a 1024-page writeback slice,
> for different disks, we get a bandwidth slice in terms of disk
> seeks like:
>
> slow disk: 20MB/s, 10ms seek (say a laptop drive)
> - 4MB write takes 200ms, or equivalent of 20 seeks
>
> normal disk: 60MB/s, 8ms seek (desktop)
> - 4MB write takes 66ms, or equivalent of 8 seeks
>
> fast disk: 120MB/s, 5ms seek (15krpm SAS)
> - 4MB write takes 33ms, or equivalent of 6 seeks
>
> small RAID5 lun: 200MB/s, 4ms seek
> - 4MB write takes 20ms, or equivalent of 5 seeks
>
> Large RAID5 lun: 1GB/s, 2ms seek
> - 4MB write takes 4ms, or equivalent of 2 seeks
>
> Put simply:
>
> The higher the bandwidth of the device, the more frequently
> we need to be servicing the inodes with large amounts of
> dirty data to be written to maintain write throughput at a
> significant percentage of the device capability.
>
> The writeback algorithm needs to take this into account for it
> to be able to scale effectively for high throughput devices.

Slow queues go full first. Currently the writeback code will skip
_and_ congestion_wait() for congested filesystems. The better policy
is to congestion_wait() _after_ all other writable pages have been
synced. The same has been done for blocked inodes/pages in my
patchset.

> BTW, it needs to be recognised that if we are under memory pressure
> we can clean much more memory in a short period of time by writing
> out all the large files first. This would clearly benefit the system
> as a whole as we'd get the most pages available for reclaim as
> possible in a short a time as possible. The writeback algorithm
> should really have a mode that allows this sort of flush ordering to
> occur....

We might do write-behind for large files :-)

2008-01-18 06:43:31

by Michael Rubin

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Jan 17, 2008 8:56 PM, Fengguang Wu <[email protected]> wrote:

Once again thanks for the speedy replies. :-)

> On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> Suppose we want to grant longer expiration window for temp files,
> adding a new list named s_dirty_tmpfile would be a handy solution.

When you say tmp do you mean files that eventually get written to
disk? If not I would just use the WRITEBACK_NEVER flag. If so I am not sure
if that feature is worth making a special case. It seems like the
location based ideas may be more useful.

> So the question is: should we need more than 3 QoS classes?
>
> > > The most tricky writeback issues could be starvation prevention
> > > between
> >
> >
> > > - small/large files
> > > - new/old files
> > > - superblocks
> >
> > So I have written tests and believe I have covered these issues. If
> > you are concerned in specific on any and have a test case please let
> > me know.
>
> OK.
>
> > > Some kind of limit should be applied for each. They used to be:
> > > - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached
> > > this preempts big files
> >
> > The patch uses the same limit.
> >
> > > - refill s_io iff it is drained
> > > this prevents promotion of big/old files
> >
> > Once a big file gets its first do_writepages it is moved behind the
> > other smaller files via i_flushed_when. And the same in reverse for
> > big vs old.
>
> You mean i_flush_gen?

Yeah sorry. It was once called i_flush_when. (sheepish)

> No, sync_sb_inodes() will abort on every
> MAX_WRITEBACK_PAGES, and s_flush_gen will be updated accordingly.
> Hence the sync will restart from big/old files.

If I understand you correctly I am not sure I agree. Here is what I
think happens in the patch:

1) pull big inode off of flush tree
2) sync big inode
3) Hit MAX_WRITEBACK_PAGES
4) Re-insert big inode (without modifying the dirtied_when)
5) update the i_flush_gen on big inode and re-insert behind small
inodes we have not synced yet.

In a subsequent sync_sb_inodes call we end up retrieving the small inode we
had not serviced yet.

> > > - return from sync_sb_inodes() after one go of s_io
> >
> > I am not sure how this limit helps things out. Is this for superblock
> > starvation? Can you elaborate?
>
> We should have a way to go to the next superblock even if new dirty inodes
> or pages are emerging fast in this superblock. Filling and draining s_io
> only once and then aborting helps.

Got it.

> s_io is a stable and bounded working set in one go of superblock.

Is this necessary with MAX_WRITEBACK_PAGES? It feels like a double limit.

> Basically you make one list_head in each rbtree node.
> That list_head is recycled cyclically, and is an analog to the
> old-fashioned s_dirty. We need to know 'where we are' and 'where it ends'.
> So an extra indicator must be introduced - i_flush_gen. It's awkward.
> We are simply repeating the aged list_heads' problem.

To me they both feel a little awkward. I feel like the original
problem in 2.6.23 led to a lot of examination which is bringing new
possibilities to light.

BTW the issue that started me on this whole path (starving large
files) was still present in 2.6.23-rc8 but now looks fixed in
2.6.24-rc3.
Still no idea about your changes in 2.6.24-rc6-mm1. I have given up
trying to get that thing to boot.

mrubin

2008-01-18 07:37:36

by Mike Waychison

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

Fengguang Wu wrote:
> On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
>> On Wed, 16 Jan 2008 12:55:07 +0800 Fengguang Wu <[email protected]> wrote:
>>
>>> On Tue, Jan 15, 2008 at 08:42:36PM -0800, Andrew Morton wrote:
>>>> On Wed, 16 Jan 2008 12:25:53 +0800 Fengguang Wu <[email protected]> wrote:
>>>>
>>>>> list_heads are OK if we use them for one and only function.
>>>> Not really. They're inappropriate when you wish to remember your
>>>> position in the list while you dropped the lock (as we must do in
>>>> writeback).
>>>>
>>>> A data structure which permits us to iterate across the search key rather
>>>> than across the actual storage locations is more appropriate.
>>> I totally agree with you. What I mean is to first do the split of
>>> functions - into three: ordering, starvation prevention, and blockade
>>> waiting.
>> Does "ordering" here refer to ordering by time-of-first-dirty?
>
> Ordering by dirtied_when or i_ino, either is OK.
>
>> What is "blockade waiting"?
>
> Some inodes/pages cannot be synced now for some reason and should be
> retried after a while.
>
>>> Then to do better ordering by adopting radix tree(or rbtree
>>> if radix tree is not enough),
>> ordering of what?
>
> Switch from time to location.
>

Given the way LBAs are located on disk and the fact that rotational
latency is a large factor in changing locations of a drive head, any
attempts to do a C-SCAN pass are pretty much useless. Further
complicating this is any volume management that sits between the fs and
the actual storage.

A nice feature to have longer term is to have the write_inodes paths for
background flushing understand storage congestion _through_ any volume
management. This would allow us to back off background flushing on a
per-spindle basis (when using drives of course) and avoid write congestion
in both the io scheduler and in the drives' write caches, which I
believe, though I don't have hard evidence, get congested today, knocking
the drive into a fifo fashion in firmware.

A data structure that allows us to keep dirtied_when values consistent
across back-offs and blocking allows us to further develop the
background writeout paths to get to this point (though exposing this
congestion information will require more work deeper in the stack).

>>> and lastly get rid of the list_heads to
>>> avoid locking. Does it sound like a good path?
>> I'd have thought that replacing list_heads with another data structure
>> would be a single commit.
>
> That would be easy. s_more_io and s_more_io_wait can all be converted
> to radix trees.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2008-01-18 07:49:41

by Mike Waychison

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

We have some code internally that does just this, though it slightly
abuses struct page by tagging pages with the highest priority that
dirties them.

I'm not sure what a better solution is, though there has been talk about
rewriting it to use a per-mapping radix tree to keep mem_map small.

The idea is to mark current->ioprio into each page as they are dirtied,
higher priority overrides lower priority. Buffer_heads are done in the
same way.

Come IO submission, bios get the highest IO priority associated with
the submitted pages/buffer_heads, and these are passed via the
struct requests into the scheduler.

Similar changes are underway for making sure all AIO get the right ioprios.

It will take some cleaning to get it ready for lkml submission, though
I'm still a bit unsure of what we should do to avoid struct page growth.

Mike Waychison

Andi Kleen wrote:
> Fengguang Wu <[email protected]> writes:
>> Suppose we want to grant longer expiration window for temp files,
>> adding a new list named s_dirty_tmpfile would be a handy solution.
>
> How would the kernel know that a file is a tmp file?
>
>> So the question is: should we need more than 3 QoS classes?
>
> [just a random idea; i have not worked out all the implications]
>
> Would it be possible to derive a writeback priority from the ionice
> level of the process originating the IO? e.g. we have long-standing
> problems where background jobs, even when niced, can cause
> significant slowdowns to foreground processes by starving IO
> and pushing out pages. ionice was supposed to help with that
> but in practice it does not seem to have helped too much, and I suspect
> it needs more prioritization higher up the VM food chain. Adding
> such priorities to writeback would seem like a step in the right
> direction, although it would of course not solve the problem
> completely.
>
> -Andi

2008-01-18 08:18:31

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Fri, Jan 18, 2008 at 06:41:09AM +0100, Andi Kleen wrote:
> Fengguang Wu <[email protected]> writes:
> >
> > Suppose we want to grant longer expiration window for temp files,
> > adding a new list named s_dirty_tmpfile would be a handy solution.
>
> How would the kernel know that a file is a tmp file?

No idea - but it makes a good example ;-)

But for those making different filesystems for /tmp, /var, /data etc,
per-superblock expiration parameters may help.

> > So the question is: should we need more than 3 QoS classes?
>
> [just a random idea; i have not worked out all the implications]
>
> Would it be possible to derive a writeback priority from the ionice
> level of the process originating the IO? e.g. we have long-standing
> problems where background jobs, even when niced, can cause
> significant slowdowns to foreground processes by starving IO
> and pushing out pages. ionice was supposed to help with that
> but in practice it does not seem to have helped too much, and I suspect
> it needs more prioritization higher up the VM food chain. Adding
> such priorities to writeback would seem like a step in the right
> direction, although it would of course not solve the problem
> completely.

Good idea. Michael may well be considering similar interfaces :-)

2008-01-18 08:54:40

by David Chinner

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Thu, Jan 17, 2008 at 09:38:24PM -0800, Michael Rubin wrote:
> On Jan 17, 2008 9:01 PM, David Chinner <[email protected]> wrote:
>
> First off thank you for the very detailed reply. This rocks and gives
> me much to think about.
>
> > On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> > This seems suboptimal for large files. If you keep feeding in
> > new least recently dirtied files, the large files will never
> > get an unimpeded go at the disk and hence we'll struggle to
> > get decent bandwidth under anything but pure large file
> > write loads.
>
> You're right. I understand now. I just changed a dial on my tests,
> ran it and found pdflush not keeping up like it should. I need to
> address this.
>
> > Switching inodes during writeback implies a seek to the new write
> > location, while continuing to write the same inode has no seek
> > penalty because the writeback is sequential. It follows from this
> > that allowing large files a disproportionate amount of data
> > writeback is desirable.
> >
> > Also, cycling rapidly through all the large files to write 4MB to each is
> > going to cause us to spend time seeking rather than writing compared
> > to cycling slower and writing 40MB from each large file at a time.
> >
> > i.e. servicing one large file for 100ms is going to result in higher
> > writeback throughput than servicing 10 large files for 10ms each
> > because there's going to be less seeking and more writing done by
> > the disks.
> >
> > That is, think of large file writes like process scheduler batch
> > jobs - bulk throughput is what matters, so the larger the time slice
> > you give them the higher the throughput.
> >
> > IMO, the sort of result we should be looking at is a
> > writeback design that results in cycling somewhat like:
> >
> > slice 1: iterate over small files
> > slice 2: flush large file 1
> > slice 3: iterate over small files
> > slice 4: flush large file 2
> > ......
> > slice n-1: flush large file N
> > slice n: iterate over small files
> > slice n+1: flush large file N+1
> >
> > So that we keep the disk busy with a relatively fair mix of
> > small and large I/Os while both are necessary.
>
> I am getting where you are coming from. But if we are going to make
> changes to optimize for seeks maybe we need to be more aggressive in
> write back in how we organize both time and location. Right now AFAIK
> there is no attention to location in the writeback path.

True. But IMO, locality ordering really only impacts the small file
data writes and the inodes themselves because there are typically
lots of seeks involved in doing that.

For large sequential writes to a file, writing a significant
chunk of data gives that bit of writeback its own locality
because it does not cause seeks. Hence simply writing large
enough chunks avoids any need to order the writeback by locality.

Hence I see writeback ordering by locality as more a function of
optimising the "iterate over small files" aspect of the writeback.

> > The higher the bandwidth of the device, the more frequently
> > we need to be servicing the inodes with large amounts of
> > dirty data to be written to maintain write throughput at a
> > significant percentage of the device capability.
> >
>
> Could you expand that to say it's not the inodes of large files but
> the ones with data that we can exploit locality?

Not sure I understand what you mean. Can you rephrase that?

> Often large files are fragmented.

Then the filesystem is not doing its job. Fragmentation does
not happen very frequently in XFS for large files - that is one
of the reasons it is extremely good for large files and high
throughput applications...

> Would it make more sense to pursue cracking the inodes and
> grouping their blocks' locations? Or is this all overkill and should
> be handled at a lower level like the elevator?

For large files it is overkill. For filesystems that do delayed
allocation, it is often impossible (no block mapping until
the writeback is executed unless it's an overwrite).

At this point, I'd say it is best to leave it to the filesystem and
the elevator to do their jobs properly.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-01-18 09:26:38

by Michael Rubin

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Jan 18, 2008 12:54 AM, David Chinner <[email protected]> wrote:
> At this point, I'd say it is best to leave it to the filesystem and
> the elevator to do their jobs properly.

Amen.

mrubin

2008-01-18 10:19:19

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Thu, Jan 17, 2008 at 10:43:15PM -0800, Michael Rubin wrote:
> On Jan 17, 2008 8:56 PM, Fengguang Wu <[email protected]> wrote:
> > On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> > Suppose we want to grant longer expiration window for temp files,
> > adding a new list named s_dirty_tmpfile would be a handy solution.
>
> When you say tmp, do you mean files that eventually get written to

Yes, they are disk based and can be synced.

> disk? If not I would just use the WRITEBACK_NEVER. If so I am not sure
> if that feature is worth making a special case. It seems like the
> location based ideas may be more useful.

I'm not interested in WRITEBACK_NEVER or location based writeback
for now :-)

> > > > - refill s_io iff it is drained
> > > > this prevents promotion of big/old files
> > >
> > > Once a big file gets its first do_writepages it is moved behind the
> > > other smaller files via i_flushed_when. And the same in reverse for
> > > big vs old.
> >
> > You mean i_flush_gen?
>
> Yeah sorry. It was once called i_flush_when. (sheepish)
>
> > No, sync_sb_inodes() will abort on every
> > MAX_WRITEBACK_PAGES, and s_flush_gen will be updated accordingly.
> > Hence the sync will restart from big/old files.
>
> If I understand you correctly I am not sure I agree. Here is what I
> think happens in the patch:
>
> 1) pull big inode off of flush tree
> 2) sync big inode
> 3) Hit MAX_WRITEBACK_PAGES
> 4) Re-insert big inode (without modifying the dirtied_when)
> 5) update the i_flush_gen on big inode and re-insert behind small
> inodes we have not synced yet.
>
> In a subsequent sync_sb_inode we end up retrieving the small inode we
> had not serviced yet.

Yes, exactly. And then it will continue to sync the big one again.
It will never be able to move forward to the next dirtied_when before
exhausting the inodes in the current list (with the oldest dirtied_when).

> > > > - return from sync_sb_inodes() after one go of s_io
> > >
> > > I am not sure how this limit helps things out. Is this for superblock
> > > starvation? Can you elaborate?
> >
> > We should have a way to go to the next superblock even if new dirty inodes
> > or pages are emerging fast in this superblock. Fill and drain s_io
> > only once and then abort helps.
>
> Got it.
>
> > s_io is a stable and bounded working set in one go of superblock.
>
> Is this necessary with MAX_WRITEBACK_PAGES? It feels like a double limit.

We need a limit and continuing scheme at each level. It was so hard to
sort them out that I'm really reluctant to restart all the fuss again.

> > Basically you make one list_head in each rbtree node.
> > That list_head is recycled cyclically, and is an analog to the old
> > fashioned s_dirty. We need to know 'where we are' and 'where it ends'.
> > So an extra indicator must be introduced - i_flush_gen. It's awkward.
> > We are simply repeating the aged list_heads' problem.
>
> To me they both feel a little awkward. I feel like the original
> problem in 2.6.23 led to a lot of examination which is bringing new
> possibilities to light.
>
> BTW the issue that started me on this whole path (starving large
> files) was still present in 2.6.23-rc8 but now looks fixed in
> 2.6.24-rc3.
> Still no idea about your changes in 2.6.24-rc6-mm1. I have given up
> trying to get that thing to boot.

Hehe, I guess the bug is still there in 2.6.24-rc3. But should be gone
in the latest patchset.

2008-01-19 02:51:15

by David Chinner

[permalink] [raw]
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Fri, Jan 18, 2008 at 01:41:33PM +0800, Fengguang Wu wrote:
> > That is, think of large file writes like process scheduler batch
> > jobs - bulk throughput is what matters, so the larger the time slice
> > you give them the higher the throughput.
> >
> > IMO, the sort of result we should be looking at is a
> > writeback design that results in cycling somewhat like:
> >
> > slice 1: iterate over small files
> > slice 2: flush large file 1
> > slice 3: iterate over small files
> > slice 4: flush large file 2
> > ......
> > slice n-1: flush large file N
> > slice n: iterate over small files
> > slice n+1: flush large file N+1
> >
> > So that we keep the disk busy with a relatively fair mix of
> > small and large I/Os while both are necessary.
>
> If we can sync fast enough, the lower layer would be able to merge
> those 4MB requests.

No, not necessarily - think of a stripe with a chunk size of 512k.
That 4MB will be split into 8x512k chunks and sent to different
devices (and hence elevator queues). The only way you get elevator
merging in this sort of config is if you send multiple stripe
*width* sized amounts to the device in a very short time period.
I see quite a few filesystems with stripe widths in the tens of MB
range.....

> > Put simply:
> >
> > The higher the bandwidth of the device, the more frequently
> > we need to be servicing the inodes with large amounts of
> > dirty data to be written to maintain write throughput at a
> > significant percentage of the device capability.
> >
> > The writeback algorithm needs to take this into account for it
> > to be able to scale effectively for high throughput devices.
>
> Slow queues go full first. Currently the writeback code will skip
> _and_ congestion_wait() for congested filesystems. The better policy
> is to congestion_wait() _after_ all other writable pages have been
> synced.

Agreed.

The comments I've made are mainly concerned with getting efficient
flushing of a single device occurring. Interactions between multiple
devices are a separable issue....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group