2009-04-28 18:05:08

by Chris Mason

Subject: [PATCH RFC] ext3 data=guarded v5

Hello everyone,

I've rediffed the ext3 data=guarded code against Linus' current git
tree, and worked in most of Jan's suggestions:

The rbtree is gone in favor of a simple list, and O_DIRECT is fixed up.

I didn't consolidate the data=guarded address space operations with
either ordered or writeback, I think that should be a separate patch
later on when everything is working.

I also didn't drop the filemap_write_and_wait before truncate. There is
room for optimization there, and I think we can get rid of it
completely. But this does work for starters.

This is only lightly tested, so please treat it gently for a bit:

ext3 data=ordered mode makes sure that data blocks are on disk before
the metadata that references them, which avoids files full of garbage
or previously deleted data after a crash. It does this by adding every dirty
buffer onto a list of things that must be written before a commit.

This makes every fsync write out all the dirty data on the entire FS, which
has high latencies and is generally much more expensive than it needs to be.

Another way to avoid exposing stale data after a crash is to wait until
after the data buffers are written before updating the on-disk record
of the file's size. If we crash before the data IO is done, i_size
doesn't yet include the new blocks and no stale data is exposed.
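
A rough illustration of that ordering rule (not code from this patch; both
helpers below are made up):

	/*
	 * Illustration only: get the new data blocks onto disk first, then
	 * let the on-disk i_size cover them.  A crash between the two steps
	 * leaves i_size at the old end of file, so no stale data shows up.
	 */
	static void guarded_extend(struct inode *inode, loff_t new_size)
	{
		/* hypothetical helper: write and wait on the new data blocks */
		write_and_wait_new_data(inode, i_size_read(inode), new_size);

		/* hypothetical helper: only now record new_size on disk */
		write_disk_isize(inode, new_size);
	}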

This patch adds the delayed i_size update to ext3, along with a new
mount option (data=guarded) to enable it. The basic mechanism works like
this:

* Change block_write_full_page to take an end_io handler as a parameter.
This allows us to make an end_io handler that queues buffer heads for
a workqueue where the real work of updating the on disk i_size is done.

* Add a list to the in-memory ext3 inode for tracking data=guarded
buffer heads that are waiting to be sent to disk.

* Add an ext3 guarded write_end call to add buffer heads for newly
allocated blocks into the list. If we have a newly allocated block that is
filling a hole inside i_size, this is done as an old style data=ordered write
instead.

* Add an ext3 guarded writepage call that uses a special buffer head
end_io handler for buffers that are marked as guarded. Again, if we find
newly allocated blocks filling holes, they are sent through data=ordered
instead of data=guarded.

* When a guarded IO finishes, kick a per-FS workqueue to do the
on disk i_size updates. The workqueue function must be very careful. We only
update the on disk i_size if all of the IO between the old on disk i_size and
the new on disk i_size is complete. The on disk i_size is incrementally
updated to the largest safe value every time an IO completes; the update
rule is sketched just after this list.

* When we start tracking guarded buffers on a given inode, we put the
inode into ext3's orphan list. This way if we do crash, the file will
be truncated back down to the on disk i_size and we'll free any blocks that
were not completely written. The inode is removed from the orphan list
only after all the guarded buffers are done.
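
For reference, the i_size update rule in the last two items boils down to
roughly this simplified sketch of ext3_ordered_update_i_size() from the
patch below (locking and orphan handling left out):

	/*
	 * Simplified sketch of ext3_ordered_update_i_size() from the patch
	 * below; locking and orphan handling are left out.  Returns 1 when
	 * the inode needs to be logged because the on-disk i_size moved.
	 */
	static int guarded_update_disk_isize(struct inode *inode)
	{
		struct ext3_ordered_buffers *buffers =
					&EXT3_I(inode)->ordered_buffers;
		u64 new_size;

		if (EXT3_I(inode)->i_disksize >= inode->i_size)
			return 0;

		if (list_empty(&buffers->ordered_list)) {
			/* nothing pending, safe to go all the way up */
			new_size = inode->i_size;
		} else {
			/* only safe up to the first pending guarded extent */
			struct ext3_ordered_extent *first;

			first = list_entry(buffers->ordered_list.next,
					   struct ext3_ordered_extent,
					   ordered_list);
			new_size = first->start;
		}
		EXT3_I(inode)->i_disksize = min_t(u64, new_size,
						  i_size_read(inode));
		return 1;
	}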

Signed-off-by: Chris Mason <[email protected]>

---
fs/ext3/Makefile | 3 +-
fs/ext3/fsync.c | 12 +
fs/ext3/inode.c | 582 +++++++++++++++++++++++++++++++++++++++++++-
fs/ext3/namei.c | 3 +-
fs/ext3/ordered-data.c | 235 ++++++++++++++++++
fs/ext3/super.c | 48 ++++-
fs/jbd/transaction.c | 1 +
include/linux/ext3_fs.h | 32 +++-
include/linux/ext3_fs_i.h | 45 ++++
include/linux/ext3_fs_sb.h | 6 +
include/linux/ext3_jbd.h | 11 +
include/linux/jbd.h | 10 +
12 files changed, 968 insertions(+), 20 deletions(-)

diff --git a/fs/ext3/Makefile b/fs/ext3/Makefile
index e77766a..f3a9dc1 100644
--- a/fs/ext3/Makefile
+++ b/fs/ext3/Makefile
@@ -5,7 +5,8 @@
obj-$(CONFIG_EXT3_FS) += ext3.o

ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
- ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o
+ ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o \
+ ordered-data.o

ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index d336341..a50abb4 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -59,6 +59,11 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
* sync_inode() will write the inode if it is dirty. Then the caller's
* filemap_fdatawait() will wait on the pages.
*
+ * data=guarded:
+ * The caller's filemap_fdatawrite will start the IO, and we
+ * use filemap_fdatawait here to make sure all the disk i_size updates
+ * are done before we commit the inode.
+ *
* data=journal:
* filemap_fdatawrite won't do anything (the buffers are clean).
* ext3_force_commit will write the file data into the journal and
@@ -84,6 +89,13 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
+ /*
+ * the new disk i_size must be logged before we commit,
+ * so we wait here for pending writeback
+ */
+ if (ext3_should_guard_data(inode))
+ filemap_write_and_wait(inode->i_mapping);
+
ret = sync_inode(inode, &wbc);
}
out:
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index fcfa243..1e90107 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -38,6 +38,7 @@
#include <linux/bio.h>
#include <linux/fiemap.h>
#include <linux/namei.h>
+#include <linux/workqueue.h>
#include "xattr.h"
#include "acl.h"

@@ -179,6 +180,105 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode)
}

/*
+ * after a data=guarded IO is done, we need to update the
+ * disk i_size to reflect the data we've written. If there are
+ * no more ordered data extents left in the tree, we need to
+ * get rid of the orphan entry making sure the file's
+ * block pointers match the i_size after a crash
+ *
+ * When we aren't in data=guarded mode, this just does an ext3_orphan_del.
+ *
+ * It returns the result of ext3_orphan_del.
+ *
+ * handle may be null if we are just cleaning up the orphan list in
+ * memory.
+ *
+ * pass must_log == 1 when the inode must be logged in order to get
+ * an i_size update on disk
+ */
+static int ordered_orphan_del(handle_t *handle, struct inode *inode,
+ int must_log)
+{
+ int ret = 0;
+
+ /* fast out when data=guarded isn't on */
+ if (!ext3_should_guard_data(inode))
+ return ext3_orphan_del(handle, inode);
+
+ ext3_ordered_lock(inode);
+ if (inode->i_nlink &&
+ list_empty(&EXT3_I(inode)->ordered_buffers.ordered_list)) {
+ ext3_ordered_unlock(inode);
+
+ /*
+ * if we aren't actually on the orphan list, the orphan
+ * del won't log our inode. Log it now to make sure
+ */
+ ext3_mark_inode_dirty(handle, inode);
+
+ ret = ext3_orphan_del(handle, inode);
+ if (ret || !handle)
+ goto err;
+
+ /*
+ * now we check again to see if we might have dropped
+ * the orphan just after someone added a new ordered extent
+ */
+ ext3_ordered_lock(inode);
+ if (!list_empty(&EXT3_I(inode)->ordered_buffers.ordered_list) &&
+ list_empty(&EXT3_I(inode)->i_orphan)) {
+ ext3_ordered_unlock(inode);
+ ret = ext3_orphan_add(handle, inode);
+ if (ret)
+ goto err;
+ } else {
+ ext3_ordered_unlock(inode);
+ }
+ } else if (handle && must_log) {
+ ext3_ordered_unlock(inode);
+
+ /*
+ * we need to make sure any updates done by the data=guarded
+ * code end up in the inode on disk. Log the inode
+ * here
+ */
+ ext3_mark_inode_dirty(handle, inode);
+ } else {
+ ext3_ordered_unlock(inode);
+ }
+
+err:
+ return ret;
+}
+
+/*
+ * Wrapper around ordered_orphan_del that starts a transaction
+ */
+static void ordered_orphan_del_trans(struct inode *inode, int must_log)
+{
+ handle_t *handle;
+
+ handle = ext3_journal_start(inode, 3);
+
+ /*
+ * uhoh, should we flag the FS as readonly here? ext3_dirty_inode
+ * doesn't, which is what we're modeling ourselves after.
+ *
+ * We do need to make sure to get this inode off the ordered list
+ * when the transaction start fails though. ordered_orphan_del
+ * does the right thing.
+ */
+ if (IS_ERR(handle)) {
+ ordered_orphan_del(NULL, inode, 0);
+ return;
+ }
+
+ ordered_orphan_del(handle, inode, must_log);
+ ext3_journal_stop(handle);
+}
+
+
+/*
* Called at the last iput() if i_nlink is zero.
*/
void ext3_delete_inode (struct inode * inode)
@@ -204,6 +304,13 @@ void ext3_delete_inode (struct inode * inode)
if (IS_SYNC(inode))
handle->h_sync = 1;
inode->i_size = 0;
+
+ /*
+ * make sure we clean up any ordered extents that didn't get
+ * IO started on them because i_size shrunk down to zero.
+ */
+ ext3_truncate_ordered_extents(inode, 0);
+
if (inode->i_blocks)
ext3_truncate(inode);
/*
@@ -767,6 +874,24 @@ err_out:
}

/*
+ * This protects the disk i_size with the spinlock for the ordered
+ * extent tree. It returns 1 when the inode needs to be logged
+ * because the i_disksize has been updated.
+ */
+static int maybe_update_disk_isize(struct inode *inode, loff_t new_size)
+{
+ int ret = 0;
+
+ ext3_ordered_lock(inode);
+ if (EXT3_I(inode)->i_disksize < new_size) {
+ EXT3_I(inode)->i_disksize = new_size;
+ ret = 1;
+ }
+ ext3_ordered_unlock(inode);
+ return ret;
+}
+
+/*
* Allocation strategy is simple: if we have to allocate something, we will
* have to go the whole way to leaf. So let's do it before attaching anything
* to tree, set linkage between the newborn blocks, write them if sync is
@@ -815,6 +940,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
if (!partial) {
first_block = le32_to_cpu(chain[depth - 1].key);
clear_buffer_new(bh_result);
+ clear_buffer_datanew(bh_result);
count++;
/*map more blocks*/
while (count < maxblocks && count <= blocks_to_boundary) {
@@ -873,6 +999,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
if (err)
goto cleanup;
clear_buffer_new(bh_result);
+ clear_buffer_datanew(bh_result);
goto got_it;
}
}
@@ -915,14 +1042,18 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
* i_disksize growing is protected by truncate_mutex. Don't forget to
* protect it if you're about to implement concurrent
* ext3_get_block() -bzzz
+ *
+ * extend_disksize is only called for directories, and so
+ * the are not using guarded buffer protection.
*/
- if (!err && extend_disksize && inode->i_size > ei->i_disksize)
- ei->i_disksize = inode->i_size;
+ if (!err && extend_disksize)
+ maybe_update_disk_isize(inode, inode->i_size);
mutex_unlock(&ei->truncate_mutex);
if (err)
goto cleanup;

set_buffer_new(bh_result);
+ set_buffer_datanew(bh_result);
got_it:
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
if (count > blocks_to_boundary)
@@ -1079,6 +1210,77 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
return NULL;
}

+/*
+ * data=guarded updates are handled in a workqueue after the IO
+ * is done. This runs through the list of buffer heads that are pending
+ * processing.
+ */
+void ext3_run_guarded_work(struct work_struct *work)
+{
+ struct ext3_sb_info *sbi =
+ container_of(work, struct ext3_sb_info, guarded_work);
+ struct buffer_head *bh;
+ struct ext3_ordered_extent *ordered;
+ struct inode *inode;
+ struct page *page;
+ int must_log;
+
+ spin_lock_irq(&sbi->guarded_lock);
+ while (!list_empty(&sbi->guarded_buffers)) {
+ ordered = list_entry(sbi->guarded_buffers.next,
+ struct ext3_ordered_extent, work_list);
+
+ list_del(&ordered->work_list);
+
+ bh = ordered->end_io_bh;
+ ordered->end_io_bh = NULL;
+ must_log = 0;
+
+ /* we don't need a reference on the buffer head because
+ * it is locked until the end_io handler is called.
+ *
+ * This means the page can't go away, which means the
+ * inode can't go away
+ */
+ spin_unlock_irq(&sbi->guarded_lock);
+
+ page = bh->b_page;
+ inode = page->mapping->host;
+
+ ext3_ordered_lock(inode);
+ if (ordered->bh) {
+ /*
+ * someone might have decided this buffer didn't
+ * really need to be ordered and removed us from
+ * the list. They set ordered->bh to null
+ * when that happens.
+ */
+ ext3_remove_ordered_extent(inode, ordered);
+ must_log = ext3_ordered_update_i_size(inode);
+ }
+ ext3_ordered_unlock(inode);
+
+ /*
+ * drop the reference taken when this ordered extent was
+ * put onto the guarded_buffers list
+ */
+ ext3_put_ordered_extent(ordered);
+
+ /*
+ * maybe log the inode and/or cleanup the orphan entry
+ */
+ ordered_orphan_del_trans(inode, must_log > 0);
+
+ /*
+ * finally, call the real bh end_io function to do
+ * all the hard work of maintaining page writeback.
+ */
+ end_buffer_async_write(bh, buffer_uptodate(bh));
+ spin_lock_irq(&sbi->guarded_lock);
+ }
+ spin_unlock_irq(&sbi->guarded_lock);
+}
+
static int walk_page_buffers( handle_t *handle,
struct buffer_head *head,
unsigned from,
@@ -1185,6 +1387,7 @@ retry:
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, do_journal_get_write_access);
}
+
write_begin_failed:
if (ret) {
/*
@@ -1212,7 +1415,13 @@ out:

int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
{
- int err = journal_dirty_data(handle, bh);
+ int err;
+
+ /* don't take buffers from the data=guarded list */
+ if (buffer_dataguarded(bh))
+ return 0;
+
+ err = journal_dirty_data(handle, bh);
if (err)
ext3_journal_abort_handle(__func__, __func__,
bh, handle, err);
@@ -1231,6 +1440,89 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
return 0;
}

+/*
+ * Walk the buffers in a page for data=guarded mode. Buffers that
+ * are not marked as datanew are ignored.
+ *
+ * New buffers outside i_size are sent to the data guarded code
+ *
+ * We must do the old data=ordered mode when filling holes in the
+ * file, since i_size doesn't protect these at all.
+ */
+static int journal_dirty_data_guarded_fn(handle_t *handle,
+ struct buffer_head *bh)
+{
+ u64 offset = page_offset(bh->b_page) + bh_offset(bh);
+ struct inode *inode = bh->b_page->mapping->host;
+ int ret = 0;
+
+ /*
+ * Write could have mapped the buffer but it didn't copy the data in
+ * yet. So avoid filing such buffer into a transaction.
+ */
+ if (!buffer_mapped(bh) || !buffer_uptodate(bh))
+ return 0;
+
+ if (test_clear_buffer_datanew(bh)) {
+ /*
+ * if we're filling a hole inside i_size, we need to
+ * fall back to the old style data=ordered
+ */
+ if (offset < inode->i_size) {
+ ret = ext3_journal_dirty_data(handle, bh);
+ goto out;
+ }
+ ret = ext3_add_ordered_extent(inode, offset, bh);
+
+ /* if we crash before the IO is done, i_size will be small
+ * but these blocks will still be allocated to the file.
+ *
+ * So, add an orphan entry for the file, which will truncate it
+ * down to the i_size it finds after the crash.
+ *
+ * The orphan is cleaned up when the IO is done. We
+ * don't add orphans while mount is running the orphan list;
+ * that seems to corrupt the list.
+ */
+ if (ret == 0 && buffer_dataguarded(bh) &&
+ list_empty(&EXT3_I(inode)->i_orphan) &&
+ !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) {
+ ret = ext3_orphan_add(handle, inode);
+ }
+ }
+out:
+ return ret;
+}
+
+/*
+ * Walk the buffers in a page for data=guarded mode for writepage.
+ *
+ * We must do the old data=ordered mode when filling holes in the
+ * file, since i_size doesn't protect these at all.
+ *
+ * This is actually called after writepage is run and so we can't
+ * trust anything other than the buffer head (which we have pinned).
+ *
+ * Any datanew buffer at writepage time is filling a hole, so we don't need
+ * extra tests against the inode size.
+ */
+static int journal_dirty_data_guarded_writepage_fn(handle_t *handle,
+ struct buffer_head *bh)
+{
+ int ret = 0;
+
+ /*
+ * Write could have mapped the buffer but it didn't copy the data in
+ * yet. So avoid filing such buffer into a transaction.
+ */
+ if (!buffer_mapped(bh) || !buffer_uptodate(bh))
+ return 0;
+
+ if (test_clear_buffer_datanew(bh))
+ ret = ext3_journal_dirty_data(handle, bh);
+ return ret;
+}
+
/* For write_end() in data=journal mode */
static int write_end_fn(handle_t *handle, struct buffer_head *bh)
{
@@ -1251,10 +1543,8 @@ static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied)
/* What matters to us is i_disksize. We don't write i_size anywhere */
if (pos + copied > inode->i_size)
i_size_write(inode, pos + copied);
- if (pos + copied > EXT3_I(inode)->i_disksize) {
- EXT3_I(inode)->i_disksize = pos + copied;
+ if (maybe_update_disk_isize(inode, pos + copied))
mark_inode_dirty(inode);
- }
}

/*
@@ -1300,6 +1590,68 @@ static int ext3_ordered_write_end(struct file *file,
return ret ? ret : copied;
}

+static int ext3_guarded_write_end(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ handle_t *handle = ext3_journal_current_handle();
+ struct inode *inode = file->f_mapping->host;
+ unsigned from, to;
+ int ret = 0, ret2;
+
+ copied = block_write_end(file, mapping, pos, len, copied,
+ page, fsdata);
+
+ from = pos & (PAGE_CACHE_SIZE - 1);
+ to = from + copied;
+ ret = walk_page_buffers(handle, page_buffers(page),
+ from, to, NULL, journal_dirty_data_guarded_fn);
+
+ /*
+ * we only update the in-memory i_size. The disk i_size is done
+ * by the end io handlers
+ */
+ if (ret == 0 && pos + copied > inode->i_size) {
+ int must_log;
+
+ /* updated i_size, but we may have raced with a
+ * data=guarded end_io handler.
+ *
+ * All the guarded IO could have ended while i_size was still
+ * small, and if we're just adding bytes into an existing block
+ * in the file, we may not be adding a new guarded IO with this
+ * write. So, do a check on the disk i_size and make sure it
+ * is updated to the highest safe value.
+ *
+ * ext3_ordered_update_i_size tests inode->i_size, so we
+ * make sure to update it with the ordered lock held.
+ */
+ ext3_ordered_lock(inode);
+ i_size_write(inode, pos + copied);
+
+ must_log = ext3_ordered_update_i_size(inode);
+ ext3_ordered_unlock(inode);
+ ordered_orphan_del_trans(inode, must_log > 0);
+ }
+
+ /*
+ * There may be allocated blocks outside of i_size because
+ * we failed to copy some data. Prepare for truncate.
+ */
+ if (pos + len > inode->i_size)
+ ext3_orphan_add(handle, inode);
+ ret2 = ext3_journal_stop(handle);
+ if (!ret)
+ ret = ret2;
+ unlock_page(page);
+ page_cache_release(page);
+
+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ return ret ? ret : copied;
+}
+
static int ext3_writeback_write_end(struct file *file,
struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
@@ -1311,6 +1663,7 @@ static int ext3_writeback_write_end(struct file *file,

copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
update_file_sizes(inode, pos, copied);
+
/*
* There may be allocated blocks outside of i_size because
* we failed to copy some data. Prepare for truncate.
@@ -1574,6 +1927,144 @@ out_fail:
return ret;
}

+/*
+ * Completion handler for block_write_full_page(). This will
+ * kick off the data=guarded workqueue as the IO finishes.
+ */
+static void end_buffer_async_write_guarded(struct buffer_head *bh,
+ int uptodate)
+{
+ struct ext3_sb_info *sbi;
+ struct address_space *mapping;
+ struct ext3_ordered_extent *ordered;
+ unsigned long flags;
+
+ mapping = bh->b_page->mapping;
+ if (!mapping || !bh->b_private || !buffer_dataguarded(bh)) {
+noguard:
+ end_buffer_async_write(bh, uptodate);
+ return;
+ }
+
+ /*
+ * the guarded workqueue function checks the uptodate bit on the
+ * bh and uses that to tell the real end_io handler if things worked
+ * out or not.
+ */
+ if (uptodate)
+ set_buffer_uptodate(bh);
+ else
+ clear_buffer_uptodate(bh);
+
+ sbi = EXT3_SB(mapping->host->i_sb);
+
+ spin_lock_irqsave(&sbi->guarded_lock, flags);
+
+ /*
+ * remove any chance that a truncate raced in and cleared
+ * our dataguard flag, which also freed the ordered extent in
+ * our b_private.
+ */
+ if (!buffer_dataguarded(bh)) {
+ spin_unlock_irqrestore(&sbi->guarded_lock, flags);
+ goto noguard;
+ }
+ ordered = bh->b_private;
+ WARN_ON(ordered->end_io_bh);
+
+ /*
+ * use the special end_io_bh pointer to make sure that
+ * some form of end_io handler is run on this bh, even
+ * if the ordered_extent is removed from the list before
+ * our workqueue ends up processing it.
+ */
+ ordered->end_io_bh = bh;
+ list_add_tail(&ordered->work_list, &sbi->guarded_buffers);
+ ext3_get_ordered_extent(ordered);
+ spin_unlock_irqrestore(&sbi->guarded_lock, flags);
+
+ queue_work(sbi->guarded_wq, &sbi->guarded_work);
+}
+
+static int ext3_guarded_writepage(struct page *page,
+ struct writeback_control *wbc)
+{
+ struct inode *inode = page->mapping->host;
+ struct buffer_head *page_bufs;
+ handle_t *handle = NULL;
+ int ret = 0;
+ int err;
+
+ J_ASSERT(PageLocked(page));
+
+ /*
+ * We give up here if we're reentered, because it might be for a
+ * different filesystem.
+ */
+ if (ext3_journal_current_handle())
+ goto out_fail;
+
+ if (!page_has_buffers(page)) {
+ create_empty_buffers(page, inode->i_sb->s_blocksize,
+ (1 << BH_Dirty)|(1 << BH_Uptodate));
+ page_bufs = page_buffers(page);
+ } else {
+ page_bufs = page_buffers(page);
+ if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
+ NULL, buffer_unmapped)) {
+ /* Provide NULL get_block() to catch bugs if buffers
+ * weren't really mapped */
+ return block_write_full_page_endio(page, NULL, wbc,
+ end_buffer_async_write_guarded);
+ }
+ }
+ handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
+
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out_fail;
+ }
+
+ walk_page_buffers(handle, page_bufs, 0,
+ PAGE_CACHE_SIZE, NULL, bget_one);
+
+ ret = block_write_full_page_endio(page, ext3_get_block, wbc,
+ end_buffer_async_write_guarded);
+
+ /*
+ * The page can become unlocked at any point now, and
+ * truncate can then come in and change things. So we
+ * can't touch *page from now on. But *page_bufs is
+ * safe due to elevated refcount.
+ */
+
+ /*
+ * And attach them to the current transaction. But only if
+ * block_write_full_page() succeeded. Otherwise they are unmapped,
+ * and generally junk.
+ */
+ if (ret == 0) {
+ err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
+ NULL, journal_dirty_data_guarded_writepage_fn);
+ if (!ret)
+ ret = err;
+ }
+ walk_page_buffers(handle, page_bufs, 0,
+ PAGE_CACHE_SIZE, NULL, bput_one);
+ err = ext3_journal_stop(handle);
+ if (!ret)
+ ret = err;
+
+ return ret;
+
+out_fail:
+ redirty_page_for_writepage(wbc, page);
+ unlock_page(page);
+ return ret;
+}
+
+
+
static int ext3_writeback_writepage(struct page *page,
struct writeback_control *wbc)
{
@@ -1747,7 +2238,14 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
goto out;
}
orphan = 1;
- ei->i_disksize = inode->i_size;
+ /* in guarded mode, other code is responsible
+ * for updating i_disksize. Actually in
+ * every mode, ei->i_disksize should be correct,
+ * so I don't understand why it is getting updated
+ * to i_size here.
+ */
+ if (!ext3_should_guard_data(inode))
+ ei->i_disksize = inode->i_size;
ext3_journal_stop(handle);
}
}
@@ -1768,11 +2266,20 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
ret = PTR_ERR(handle);
goto out;
}
+
if (inode->i_nlink)
- ext3_orphan_del(handle, inode);
+ ordered_orphan_del(handle, inode, 0);
+
if (ret > 0) {
loff_t end = offset + ret;
if (end > inode->i_size) {
+ /* i_mutex keeps other file writes from
+ * hopping in at this time, and we
+ * know the O_DIRECT write just put all
+ * those blocks on disk. So, we can
+ * safely update i_disksize here even
+ * in guarded mode
+ */
ei->i_disksize = end;
i_size_write(inode, end);
/*
@@ -1842,6 +2349,21 @@ static const struct address_space_operations ext3_writeback_aops = {
.is_partially_uptodate = block_is_partially_uptodate,
};

+static const struct address_space_operations ext3_guarded_aops = {
+ .readpage = ext3_readpage,
+ .readpages = ext3_readpages,
+ .writepage = ext3_guarded_writepage,
+ .sync_page = block_sync_page,
+ .write_begin = ext3_write_begin,
+ .write_end = ext3_guarded_write_end,
+ .bmap = ext3_bmap,
+ .invalidatepage = ext3_invalidatepage,
+ .releasepage = ext3_releasepage,
+ .direct_IO = ext3_direct_IO,
+ .migratepage = buffer_migrate_page,
+ .is_partially_uptodate = block_is_partially_uptodate,
+};
+
static const struct address_space_operations ext3_journalled_aops = {
.readpage = ext3_readpage,
.readpages = ext3_readpages,
@@ -1860,6 +2382,8 @@ void ext3_set_aops(struct inode *inode)
{
if (ext3_should_order_data(inode))
inode->i_mapping->a_ops = &ext3_ordered_aops;
+ else if (ext3_should_guard_data(inode))
+ inode->i_mapping->a_ops = &ext3_guarded_aops;
else if (ext3_should_writeback_data(inode))
inode->i_mapping->a_ops = &ext3_writeback_aops;
else
@@ -2376,7 +2900,8 @@ void ext3_truncate(struct inode *inode)
if (!ext3_can_truncate(inode))
return;

- if (inode->i_size == 0 && ext3_should_writeback_data(inode))
+ if (inode->i_size == 0 && (ext3_should_writeback_data(inode) ||
+ ext3_should_guard_data(inode)))
ei->i_state |= EXT3_STATE_FLUSH_ON_CLOSE;

/*
@@ -3103,10 +3628,39 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
ext3_journal_stop(handle);
}

+ if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
+ /*
+ * we need to make sure any data=guarded pages
+ * are on disk before we force a new disk i_size
+ * down into the inode. The crucial range is
+ * anything between the disksize on disk now
+ * and the new size we're going to set.
+ *
+ * We're holding i_mutex here, so we know new
+ * ordered extents are not going to appear in the inode
+ *
+ * This must be done both for truncates that make the
+ * file bigger and smaller because truncate messes around
+ * with the orphan inode list in both cases.
+ */
+ if (ext3_should_guard_data(inode)) {
+ filemap_write_and_wait_range(inode->i_mapping,
+ EXT3_I(inode)->i_disksize,
+ (loff_t)-1);
+ /*
+ * we've written everything, make sure all
+ * the ordered extents are really gone.
+ *
+ * This prevents leaking of ordered extents
+ * and it also makes sure the ordered extent code
+ * doesn't mess with the orphan link
+ */
+ ext3_truncate_ordered_extents(inode, 0);
+ }
+ }
if (S_ISREG(inode->i_mode) &&
attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
handle_t *handle;
-
handle = ext3_journal_start(inode, 3);
if (IS_ERR(handle)) {
error = PTR_ERR(handle);
@@ -3114,6 +3668,7 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
}

error = ext3_orphan_add(handle, inode);
+
EXT3_I(inode)->i_disksize = attr->ia_size;
rc = ext3_mark_inode_dirty(handle, inode);
if (!error)
@@ -3125,8 +3680,11 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)

/* If inode_setattr's call to ext3_truncate failed to get a
* transaction handle at all, we need to clean up the in-core
- * orphan list manually. */
- if (inode->i_nlink)
+ * orphan list manually. Because we've finished off all the
+ * guarded IO above, this doesn't hurt anything for the guarded
+ * code
+ */
+ if (inode->i_nlink && (attr->ia_valid & ATTR_SIZE))
ext3_orphan_del(NULL, inode);

if (!rc && (ia_valid & ATTR_MODE))
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 6ff7b97..ac3991a 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2410,7 +2410,8 @@ static int ext3_rename (struct inode * old_dir, struct dentry *old_dentry,
ext3_mark_inode_dirty(handle, new_inode);
if (!new_inode->i_nlink)
ext3_orphan_add(handle, new_inode);
- if (ext3_should_writeback_data(new_inode))
+ if (ext3_should_writeback_data(new_inode) ||
+ ext3_should_guard_data(new_inode))
flush_file = 1;
}
retval = 0;
diff --git a/fs/ext3/ordered-data.c b/fs/ext3/ordered-data.c
new file mode 100644
index 0000000..a6dab2d
--- /dev/null
+++ b/fs/ext3/ordered-data.c
@@ -0,0 +1,235 @@
+/*
+ * Copyright (C) 2009 Oracle. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <linux/pagevec.h>
+#include <linux/buffer_head.h>
+#include <linux/ext3_jbd.h>
+
+/*
+ * simple helper to make sure a new entry we're adding is
+ * at a larger offset in the file than the last entry in the list
+ */
+static void check_ordering(struct ext3_ordered_buffers *buffers,
+ struct ext3_ordered_extent *entry)
+{
+ struct ext3_ordered_extent *last;
+
+ if (list_empty(&buffers->ordered_list))
+ return;
+
+ last = list_entry(buffers->ordered_list.prev,
+ struct ext3_ordered_extent, ordered_list);
+ BUG_ON(last->start >= entry->start);
+}
+
+/* allocate and add a new ordered_extent into the per-inode list.
+ * start is the logical offset in the file
+ *
+ * The list is given a single reference on the ordered extent that was
+ * inserted, and it also takes a reference on the buffer head.
+ */
+int ext3_add_ordered_extent(struct inode *inode, u64 start,
+ struct buffer_head *bh)
+{
+ struct ext3_ordered_buffers *buffers;
+ struct ext3_ordered_extent *entry;
+ int ret = 0;
+
+ lock_buffer(bh);
+
+ /* ordered extent already there, or in old style data=ordered */
+ if (bh->b_private) {
+ ret = 0;
+ goto out;
+ }
+
+ buffers = &EXT3_I(inode)->ordered_buffers;
+ entry = kzalloc(sizeof(*entry), GFP_NOFS);
+ if (!entry) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock(&buffers->lock);
+ entry->start = start;
+
+ get_bh(bh);
+ entry->bh = bh;
+ bh->b_private = entry;
+ set_buffer_dataguarded(bh);
+
+ /* one ref for the list */
+ atomic_set(&entry->refs, 1);
+ INIT_LIST_HEAD(&entry->work_list);
+
+ check_ordering(buffers, entry);
+
+ list_add_tail(&entry->ordered_list, &buffers->ordered_list);
+
+ spin_unlock(&buffers->lock);
+out:
+ unlock_buffer(bh);
+ return ret;
+}
+
+/*
+ * used to drop a reference on an ordered extent. This will free
+ * the extent if the last reference is dropped
+ */
+int ext3_put_ordered_extent(struct ext3_ordered_extent *entry)
+{
+ if (atomic_dec_and_test(&entry->refs)) {
+ WARN_ON(entry->bh);
+ WARN_ON(entry->end_io_bh);
+ kfree(entry);
+ }
+ return 0;
+}
+
+/*
+ * remove an ordered extent from the list. This removes the
+ * reference held by the list on 'entry' and the
+ * reference on the buffer head held by the entry.
+ */
+int ext3_remove_ordered_extent(struct inode *inode,
+ struct ext3_ordered_extent *entry)
+{
+ struct ext3_ordered_buffers *buffers;
+
+ buffers = &EXT3_I(inode)->ordered_buffers;
+
+ /*
+ * the data=guarded end_io handler takes this guarded_lock
+ * before it puts a given buffer head and its ordered extent
+ * into the guarded_buffers list. We need to make sure
+ * we don't race with them, so we take the guarded_lock too.
+ */
+ spin_lock_irq(&EXT3_SB(inode->i_sb)->guarded_lock);
+ clear_buffer_dataguarded(entry->bh);
+ entry->bh->b_private = NULL;
+ brelse(entry->bh);
+ entry->bh = NULL;
+ spin_unlock_irq(&EXT3_SB(inode->i_sb)->guarded_lock);
+
+ /*
+ * we must not clear entry->end_io_bh, that is set by
+ * the end_io handlers and will be cleared by the end_io
+ * workqueue
+ */
+
+ list_del_init(&entry->ordered_list);
+ ext3_put_ordered_extent(entry);
+ return 0;
+}
+
+/*
+ * After an extent is done, call this to conditionally update the on disk
+ * i_size. i_size is updated to cover any fully written part of the file.
+ *
+ * This returns < 0 on error, zero if no action needs to be taken and
+ * 1 if the inode must be logged.
+ */
+int ext3_ordered_update_i_size(struct inode *inode)
+{
+ u64 new_size;
+ u64 disk_size;
+ struct ext3_ordered_extent *test;
+ struct ext3_ordered_buffers *buffers = &EXT3_I(inode)->ordered_buffers;
+ int ret = 0;
+
+ disk_size = EXT3_I(inode)->i_disksize;
+
+ /*
+ * if the disk i_size is already at the inode->i_size, we're done
+ */
+ if (disk_size >= inode->i_size)
+ goto out;
+
+ /*
+ * if the ordered list is empty, push the disk i_size all the way
+ * up to the inode size, otherwise, use the start of the first
+ * ordered extent in the list as the new disk i_size
+ */
+ if (list_empty(&buffers->ordered_list)) {
+ new_size = inode->i_size;
+ } else {
+ test = list_entry(buffers->ordered_list.next,
+ struct ext3_ordered_extent, ordered_list);
+
+ new_size = test->start;
+ }
+
+ new_size = min_t(u64, new_size, i_size_read(inode));
+
+ /* the caller needs to log this inode */
+ ret = 1;
+
+ EXT3_I(inode)->i_disksize = new_size;
+out:
+ return ret;
+}
+
+/*
+ * during a truncate or delete, we need to get rid of pending
+ * ordered extents so there isn't a war over who updates disk i_size first.
+ * This does that, without waiting for any of the IO to actually finish.
+ *
+ * When the IO does finish, it will find the ordered extent removed from the
+ * list and all will work properly.
+ */
+void ext3_truncate_ordered_extents(struct inode *inode, u64 offset)
+{
+ struct ext3_ordered_buffers *buffers = &EXT3_I(inode)->ordered_buffers;
+ struct ext3_ordered_extent *test;
+
+ spin_lock(&buffers->lock);
+ while (!list_empty(&buffers->ordered_list)) {
+
+ test = list_entry(buffers->ordered_list.prev,
+ struct ext3_ordered_extent, ordered_list);
+
+ if (test->start < offset)
+ break;
+ /*
+ * once this is called, the end_io handler won't run,
+ * and we won't update disk_i_size to include this buffer.
+ *
+ * That's ok for truncates because the truncate code is
+ * writing a new i_size.
+ *
+ * This ignores any IO in flight, which is ok
+ * because the guarded_buffers list has a reference
+ * on the ordered extent
+ */
+ ext3_remove_ordered_extent(inode, test);
+ }
+ spin_unlock(&buffers->lock);
+ return;
+
+}
+
+void ext3_ordered_inode_init(struct ext3_inode_info *ei)
+{
+ INIT_LIST_HEAD(&ei->ordered_buffers.ordered_list);
+ spin_lock_init(&ei->ordered_buffers.lock);
+}
+
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 599dbfe..b5a7b42 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -37,6 +37,7 @@
#include <linux/quotaops.h>
#include <linux/seq_file.h>
#include <linux/log2.h>
+#include <linux/workqueue.h>

#include <asm/uaccess.h>

@@ -399,6 +400,9 @@ static void ext3_put_super (struct super_block * sb)
struct ext3_super_block *es = sbi->s_es;
int i, err;

+ flush_workqueue(sbi->guarded_wq);
+ destroy_workqueue(sbi->guarded_wq);
+
ext3_xattr_put_super(sb);
err = journal_destroy(sbi->s_journal);
sbi->s_journal = NULL;
@@ -468,6 +472,8 @@ static struct inode *ext3_alloc_inode(struct super_block *sb)
#endif
ei->i_block_alloc_info = NULL;
ei->vfs_inode.i_version = 1;
+ ext3_ordered_inode_init(ei);
+
return &ei->vfs_inode;
}

@@ -481,6 +487,8 @@ static void ext3_destroy_inode(struct inode *inode)
false);
dump_stack();
}
+ if (!list_empty(&EXT3_I(inode)->ordered_buffers.ordered_list))
+ printk(KERN_INFO "EXT3 ordered tree not empty\n");
kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
}

@@ -528,6 +536,13 @@ static void ext3_clear_inode(struct inode *inode)
EXT3_I(inode)->i_default_acl = EXT3_ACL_NOT_CACHED;
}
#endif
+ /*
+ * If pages got cleaned by truncate, truncate should have
+ * gotten rid of the ordered extents. Just in case, drop them
+ * here.
+ */
+ ext3_truncate_ordered_extents(inode, 0);
+
ext3_discard_reservation(inode);
EXT3_I(inode)->i_block_alloc_info = NULL;
if (unlikely(rsv))
@@ -634,6 +649,8 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, ",data=journal");
else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA)
seq_puts(seq, ",data=ordered");
+ else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA)
+ seq_puts(seq, ",data=guarded");
else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
seq_puts(seq, ",data=writeback");

@@ -790,7 +807,7 @@ enum {
Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
- Opt_data_err_abort, Opt_data_err_ignore,
+ Opt_data_guarded, Opt_data_err_abort, Opt_data_err_ignore,
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
@@ -832,6 +849,7 @@ static const match_table_t tokens = {
{Opt_abort, "abort"},
{Opt_data_journal, "data=journal"},
{Opt_data_ordered, "data=ordered"},
+ {Opt_data_guarded, "data=guarded"},
{Opt_data_writeback, "data=writeback"},
{Opt_data_err_abort, "data_err=abort"},
{Opt_data_err_ignore, "data_err=ignore"},
@@ -1034,6 +1052,9 @@ static int parse_options (char *options, struct super_block *sb,
case Opt_data_ordered:
data_opt = EXT3_MOUNT_ORDERED_DATA;
goto datacheck;
+ case Opt_data_guarded:
+ data_opt = EXT3_MOUNT_GUARDED_DATA;
+ goto datacheck;
case Opt_data_writeback:
data_opt = EXT3_MOUNT_WRITEBACK_DATA;
datacheck:
@@ -1949,11 +1970,23 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
clear_opt(sbi->s_mount_opt, NOBH);
}
}
+
+ /*
+ * setup the guarded work list
+ */
+ INIT_LIST_HEAD(&EXT3_SB(sb)->guarded_buffers);
+ INIT_WORK(&EXT3_SB(sb)->guarded_work, ext3_run_guarded_work);
+ spin_lock_init(&EXT3_SB(sb)->guarded_lock);
+ EXT3_SB(sb)->guarded_wq = create_workqueue("ext3-guard");
+ if (!EXT3_SB(sb)->guarded_wq) {
+ printk(KERN_ERR "EXT3-fs: failed to create workqueue\n");
+ goto failed_mount_guard;
+ }
+
/*
* The journal_load will have done any necessary log recovery,
* so we can safely mount the rest of the filesystem now.
*/
-
root = ext3_iget(sb, EXT3_ROOT_INO);
if (IS_ERR(root)) {
printk(KERN_ERR "EXT3-fs: get root inode failed\n");
@@ -1965,6 +1998,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n");
goto failed_mount4;
}
+
sb->s_root = d_alloc_root(root);
if (!sb->s_root) {
printk(KERN_ERR "EXT3-fs: get root dentry failed\n");
@@ -1974,6 +2008,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
}

ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY);
+
/*
* akpm: core read_super() calls in here with the superblock locked.
* That deadlocks, because orphan cleanup needs to lock the superblock
@@ -1989,9 +2024,10 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
printk (KERN_INFO "EXT3-fs: recovery complete.\n");
ext3_mark_recovery_complete(sb, es);
printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
- test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
- test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
- "writeback");
+ test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal" :
+ test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA ? "guarded" :
+ test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered" :
+ "writeback");

lock_kernel();
return 0;
@@ -2003,6 +2039,8 @@ cantfind_ext3:
goto failed_mount;

failed_mount4:
+ destroy_workqueue(EXT3_SB(sb)->guarded_wq);
+failed_mount_guard:
journal_destroy(sbi->s_journal);
failed_mount3:
percpu_counter_destroy(&sbi->s_freeblocks_counter);
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index ed886e6..1354a55 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -2018,6 +2018,7 @@ zap_buffer_unlocked:
clear_buffer_mapped(bh);
clear_buffer_req(bh);
clear_buffer_new(bh);
+ clear_buffer_datanew(bh);
bh->b_bdev = NULL;
return may_free;
}
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 634a5e5..cf097b7 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -18,6 +18,7 @@

#include <linux/types.h>
#include <linux/magic.h>
+#include <linux/workqueue.h>

/*
* The second extended filesystem constants/structures
@@ -398,7 +399,6 @@ struct ext3_inode {
#define EXT3_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
#define EXT3_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
#define EXT3_MOUNT_ABORT 0x00200 /* Fatal error detected */
-#define EXT3_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
#define EXT3_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
#define EXT3_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
#define EXT3_MOUNT_WRITEBACK_DATA 0x00C00 /* No data ordering */
@@ -414,6 +414,12 @@ struct ext3_inode {
#define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
#define EXT3_MOUNT_DATA_ERR_ABORT 0x400000 /* Abort on file data write
* error in ordered mode */
+#define EXT3_MOUNT_GUARDED_DATA 0x800000 /* guard new writes with
+ i_size */
+#define EXT3_MOUNT_DATA_FLAGS (EXT3_MOUNT_JOURNAL_DATA | \
+ EXT3_MOUNT_ORDERED_DATA | \
+ EXT3_MOUNT_WRITEBACK_DATA | \
+ EXT3_MOUNT_GUARDED_DATA)

/* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
@@ -892,6 +898,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
extern void ext3_set_aops(struct inode *inode);
extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
+void ext3_run_guarded_work(struct work_struct *work);

/* ioctl.c */
extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
@@ -945,7 +952,30 @@ extern const struct inode_operations ext3_special_inode_operations;
extern const struct inode_operations ext3_symlink_inode_operations;
extern const struct inode_operations ext3_fast_symlink_inode_operations;

+/* ordered-data.c */
+int ext3_add_ordered_extent(struct inode *inode, u64 file_offset,
+ struct buffer_head *bh);
+int ext3_put_ordered_extent(struct ext3_ordered_extent *entry);
+int ext3_remove_ordered_extent(struct inode *inode,
+ struct ext3_ordered_extent *entry);
+int ext3_ordered_update_i_size(struct inode *inode);
+void ext3_ordered_inode_init(struct ext3_inode_info *ei);
+void ext3_truncate_ordered_extents(struct inode *inode, u64 offset);
+
+static inline void ext3_ordered_lock(struct inode *inode)
+{
+ spin_lock(&EXT3_I(inode)->ordered_buffers.lock);
+}

+static inline void ext3_ordered_unlock(struct inode *inode)
+{
+ spin_unlock(&EXT3_I(inode)->ordered_buffers.lock);
+}
+
+static inline void ext3_get_ordered_extent(struct ext3_ordered_extent *entry)
+{
+ atomic_inc(&entry->refs);
+}
#endif /* __KERNEL__ */

#endif /* _LINUX_EXT3_FS_H */
diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
index 7894dd0..11dd4d4 100644
--- a/include/linux/ext3_fs_i.h
+++ b/include/linux/ext3_fs_i.h
@@ -65,6 +65,49 @@ struct ext3_block_alloc_info {
#define rsv_end rsv_window._rsv_end

/*
+ * used to prevent garbage in files after a crash by
+ * making sure i_size isn't updated until after the IO
+ * is done.
+ *
+ * See fs/ext3/ordered-data.c for the code that uses these.
+ */
+struct buffer_head;
+struct ext3_ordered_buffers {
+ /* protects the list and disk i_size */
+ spinlock_t lock;
+
+ struct list_head ordered_list;
+};
+
+struct ext3_ordered_extent {
+ /* logical offset of the block in the file
+ * strictly speaking we don't need this
+ * but keep it in the struct for
+ * debugging
+ */
+ u64 start;
+
+ /* buffer head being written */
+ struct buffer_head *bh;
+
+ /*
+ * set at end_io time so we properly
+ * do IO accounting even when this ordered
+ * extent struct has been removed from the
+ * list
+ */
+ struct buffer_head *end_io_bh;
+
+ /* number of refs on this ordered extent */
+ atomic_t refs;
+
+ struct list_head ordered_list;
+
+ /* list of things being processed by the workqueue */
+ struct list_head work_list;
+};
+
+/*
* third extended file system inode data in memory
*/
struct ext3_inode_info {
@@ -141,6 +184,8 @@ struct ext3_inode_info {
* by other means, so we have truncate_mutex.
*/
struct mutex truncate_mutex;
+
+ struct ext3_ordered_buffers ordered_buffers;
struct inode vfs_inode;
};

diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
index f07f34d..5dbdbeb 100644
--- a/include/linux/ext3_fs_sb.h
+++ b/include/linux/ext3_fs_sb.h
@@ -21,6 +21,7 @@
#include <linux/wait.h>
#include <linux/blockgroup_lock.h>
#include <linux/percpu_counter.h>
+#include <linux/workqueue.h>
#endif
#include <linux/rbtree.h>

@@ -82,6 +83,11 @@ struct ext3_sb_info {
char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */
int s_jquota_fmt; /* Format of quota to use */
#endif
+
+ struct workqueue_struct *guarded_wq;
+ struct work_struct guarded_work;
+ struct list_head guarded_buffers;
+ spinlock_t guarded_lock;
};

static inline spinlock_t *
diff --git a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h
index cf82d51..45cb4aa 100644
--- a/include/linux/ext3_jbd.h
+++ b/include/linux/ext3_jbd.h
@@ -212,6 +212,17 @@ static inline int ext3_should_order_data(struct inode *inode)
return 0;
}

+static inline int ext3_should_guard_data(struct inode *inode)
+{
+ if (!S_ISREG(inode->i_mode))
+ return 0;
+ if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL)
+ return 0;
+ if (test_opt(inode->i_sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
+ return 1;
+ return 0;
+}
+
static inline int ext3_should_writeback_data(struct inode *inode)
{
if (!S_ISREG(inode->i_mode))
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index c2049a0..bbb7990 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -291,6 +291,13 @@ enum jbd_state_bits {
BH_State, /* Pins most journal_head state */
BH_JournalHead, /* Pins bh->b_private and jh->b_bh */
BH_Unshadow, /* Dummy bit, for BJ_Shadow wakeup filtering */
+ BH_DataGuarded, /* ext3 data=guarded mode buffer
+ * these have something other than a
+ * journal_head at b_private */
+ BH_DataNew, /* BH_new gets cleared too early for
+ * data=guarded to use it. So,
+ * this gets set instead.
+ */
};

BUFFER_FNS(JBD, jbd)
@@ -302,6 +309,9 @@ TAS_BUFFER_FNS(Revoked, revoked)
BUFFER_FNS(RevokeValid, revokevalid)
TAS_BUFFER_FNS(RevokeValid, revokevalid)
BUFFER_FNS(Freed, freed)
+BUFFER_FNS(DataGuarded, dataguarded)
+BUFFER_FNS(DataNew, datanew)
+TAS_BUFFER_FNS(DataNew, datanew)

static inline struct buffer_head *jh2bh(struct journal_head *jh)
{









2009-04-28 18:33:27

by Mike Galbraith

Subject: Re: [PATCH RFC] ext3 data=guarded v5

On Tue, 2009-04-28 at 14:04 -0400, Chris Mason wrote:

> This patch adds the delayed i_size update to ext3, along with a new
> mount option (data=guarded) to enable it.

Is there a good reason to require a new mount option for this vs just
calling it a fix (or enhancement) to the existing writeback option?

I presume yes (you're the fs guy), but I wonder what that reason is.

-Mike


2009-04-28 18:55:29

by Chris Mason

Subject: Re: [PATCH RFC] ext3 data=guarded v5

On Tue, 2009-04-28 at 20:33 +0200, Mike Galbraith wrote:
> On Tue, 2009-04-28 at 14:04 -0400, Chris Mason wrote:
>
> > This patch adds the delayed i_size update to ext3, along with a new
> > mount option (data=guarded) to enable it.
>
> Is there a good reason to require a new mount option for this vs just
> calling it a fix (or enhancement) to the existing writeback option?
>
> I presume yes (you're the fs guy), but I wonder what that reason is.

It is mostly cutting down on the risk of the patch. The patch tries to
isolate its changes to just data=guarded mode, so you can go back to the
old and crusty modes if things go badly.

-chris



2009-04-29 02:12:23

by KOSAKI Motohiro

Subject: Re: [PATCH RFC] ext3 data=guarded v5

Hi

> Hello everyone,
>
> I've rediffed the ext3 data=guarded code against Linus' current git
> tree, and worked in most of Jan's suggestions:

I'm not an fs expert and I can't review it technically, but please don't
use the name "guarded"; this word implies strong protection.

I think end users could misunderstand this mode as being slower and
providing stronger protection than "data=journal".




2009-04-29 08:56:37

by Jan Kara

Subject: Re: [PATCH RFC] ext3 data=guarded v5

Hi,

On Tue 28-04-09 14:04:00, Chris Mason wrote:
> ext3 data=ordered mode makes sure that data blocks are on disk before
> the metadata that references them, which avoids files full of garbage
> or previously deleted data after a crash. It does this by adding every dirty
> buffer onto a list of things that must be written before a commit.
>
> This makes every fsync write out all the dirty data on the entire FS, which
> has high latencies and is generally much more expensive than it needs to be.
>
> Another way to avoid exposing stale data after a crash is to wait until
> after the data buffers are written before updating the on-disk record
> of the file's size. If we crash before the data IO is done, i_size
> doesn't yet include the new blocks and no stale data is exposed.
>
> This patch adds the delayed i_size update to ext3, along with a new
> mount option (data=guarded) to enable it. The basic mechanism works like
> this:
>
> * Change block_write_full_page to take an end_io handler as a parameter.
> This allows us to make an end_io handler that queues buffer heads for
> a workqueue where the real work of updating the on disk i_size is done.
>
> * Add a list to the in-memory ext3 inode for tracking data=guarded
> buffer heads that are waiting to be sent to disk.
>
> * Add an ext3 guarded write_end call to add buffer heads for newly
> allocated blocks into the list. If we have a newly allocated block that is
> filling a hole inside i_size, this is done as an old style data=ordered write
> instead.
>
> * Add an ext3 guarded writepage call that uses a special buffer head
> end_io handler for buffers that are marked as guarded. Again, if we find
> newly allocated blocks filling holes, they are sent through data=ordered
> instead of data=guarded.
>
> * When a guarded IO finishes, kick a per-FS workqueue to do the
> on disk i_size updates. The workqueue function must be very careful. We only
> update the on disk i_size if all of the IO between the old on disk i_size and
> the new on disk i_size is complete. The on disk i_size is incrementally
> updated to the largest safe value every time an IO completes.
>
> * When we start tracking guarded buffers on a given inode, we put the
> inode into ext3's orphan list. This way if we do crash, the file will
> be truncated back down to the on disk i_size and we'll free any blocks that
> were not completely written. The inode is removed from the orphan list
> only after all the guarded buffers are done.
>
> Signed-off-by: Chris Mason <[email protected]>
Thanks for redoing the patch. Some comments below.

>
> ---
> fs/ext3/Makefile | 3 +-
> fs/ext3/fsync.c | 12 +
> fs/ext3/inode.c | 582 +++++++++++++++++++++++++++++++++++++++++++-
> fs/ext3/namei.c | 3 +-
> fs/ext3/ordered-data.c | 235 ++++++++++++++++++
> fs/ext3/super.c | 48 ++++-
> fs/jbd/transaction.c | 1 +
> include/linux/ext3_fs.h | 32 +++-
> include/linux/ext3_fs_i.h | 45 ++++
> include/linux/ext3_fs_sb.h | 6 +
> include/linux/ext3_jbd.h | 11 +
> include/linux/jbd.h | 10 +
> 12 files changed, 968 insertions(+), 20 deletions(-)
>
> diff --git a/fs/ext3/Makefile b/fs/ext3/Makefile
> index e77766a..f3a9dc1 100644
> --- a/fs/ext3/Makefile
> +++ b/fs/ext3/Makefile
> @@ -5,7 +5,8 @@
> obj-$(CONFIG_EXT3_FS) += ext3.o
>
> ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
> - ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o
> + ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o \
> + ordered-data.o
>
> ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
> ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o
> diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
> index d336341..a50abb4 100644
> --- a/fs/ext3/fsync.c
> +++ b/fs/ext3/fsync.c
> @@ -59,6 +59,11 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> * sync_inode() will write the inode if it is dirty. Then the caller's
> * filemap_fdatawait() will wait on the pages.
> *
> + * data=guarded:
> + * The caller's filemap_fdatawrite will start the IO, and we
> + * use filemap_fdatawait here to make sure all the disk i_size updates
> + * are done before we commit the inode.
> + *
> * data=journal:
> * filemap_fdatawrite won't do anything (the buffers are clean).
> * ext3_force_commit will write the file data into the journal and
> @@ -84,6 +89,13 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> .sync_mode = WB_SYNC_ALL,
> .nr_to_write = 0, /* sys_fsync did this */
> };
> + /*
> + * the new disk i_size must be logged before we commit,
> + * so we wait here for pending writeback
> + */
> + if (ext3_should_guard_data(inode))
> + filemap_write_and_wait(inode->i_mapping);
> +
> ret = sync_inode(inode, &wbc);
> }
> out:
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index fcfa243..1e90107 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -38,6 +38,7 @@
> #include <linux/bio.h>
> #include <linux/fiemap.h>
> #include <linux/namei.h>
> +#include <linux/workqueue.h>
> #include "xattr.h"
> #include "acl.h"
>
> @@ -179,6 +180,105 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode)
> }
>
> /*
> + * after a data=guarded IO is done, we need to update the
> + * disk i_size to reflect the data we've written. If there are
> + * no more ordered data extents left in the tree, we need to
^^^^^^^^ the list
> + * get rid of the orphan entry making sure the file's
> + * block pointers match the i_size after a crash
> + *
> + * When we aren't in data=guarded mode, this just does an ext3_orphan_del.
> + *
> + * It returns the result of ext3_orphan_del.
> + *
> + * handle may be null if we are just cleaning up the orphan list in
> + * memory.
> + *
> + * pass must_log == 1 when the inode must be logged in order to get
> + * an i_size update on disk
> + */
> +static int ordered_orphan_del(handle_t *handle, struct inode *inode,
> + int must_log)
> +{
I'm afraid this function is racy.
1) We probably need i_mutex to protect against unlink happening in parallel
(after we check i_nlink but before we call ext3_orphan_del).
2) We need the superblock lock for the list_empty(&EXT3_I(inode)->i_orphan) check.
3) The function should rather be named ext3_guarded_orphan_del()... At
least "ordered" is really confusing (that's the case for a few other
structs / variables as well).

Hmm, maybe we could make this a bit more readable by introducing
__ext3_orphan_del(), which would already expect the superblock to be locked
and return 0 if it didn't do anything and 1 if it dirtied the inode. So
the result would be:
static int ext3_guarded_orphan_del(handle_t *handle, struct inode *inode)
{
	int remove, ret = 0;

	/* fast out when data=guarded isn't on or this is not a file */
	if (!ext3_should_guard_data(inode))
		return ext3_orphan_del(handle, inode);

	ext3_ordered_lock(inode);
	/* Quick check to avoid heavy locking in common case */
	remove = inode->i_nlink &&
		 list_empty(&EXT3_I(inode)->ordered_buffers.ordered_list);
	ext3_ordered_unlock(inode);

	if (remove) {
		/* i_mutex also avoids any races with new guarded writes */
		mutex_lock(&inode->i_mutex);
		lock_super(inode->i_sb);
		ext3_ordered_lock(inode);
		/* Now the reliable check */
		remove = inode->i_nlink &&
			 list_empty(&EXT3_I(inode)->ordered_buffers.ordered_list);
		ext3_ordered_unlock(inode);
		if (remove)
			ret = __ext3_orphan_del(handle, inode);
		unlock_super(inode->i_sb);
		mutex_unlock(&inode->i_mutex);
	}
	return ret;
}

and use it like:
	err = ext3_guarded_orphan_del(handle, inode);
	if (must_log && !err)
		ext3_mark_inode_dirty(handle, inode);

> + int ret = 0;
> +
> + /* fast out when data=guarded isn't on */
> + if (!ext3_should_guard_data(inode))
> + return ext3_orphan_del(handle, inode);
> +
> + ext3_ordered_lock(inode);
> + if (inode->i_nlink &&
> + list_empty(&EXT3_I(inode)->ordered_buffers.ordered_list)) {
> + ext3_ordered_unlock(inode);
> +
> + /*
> + * if we aren't actually on the orphan list, the orphan
> + * del won't log our inode. Log it now to make sure
> + */
> + ext3_mark_inode_dirty(handle, inode);
> +
> + ret = ext3_orphan_del(handle, inode);
> + if (ret || !handle)
> + goto err;
> +
> + /*
> + * now we check again to see if we might have dropped
> + * the orphan just after someone added a new ordered extent
> + */
> + ext3_ordered_lock(inode);
> + if (!list_empty(&EXT3_I(inode)->ordered_buffers.ordered_list) &&
> + list_empty(&EXT3_I(inode)->i_orphan)) {
> + ext3_ordered_unlock(inode);
> + ret = ext3_orphan_add(handle, inode);
> + if (ret)
> + goto err;
> + } else {
> + ext3_ordered_unlock(inode);
> + }
> + } else if (handle && must_log) {
> + ext3_ordered_unlock(inode);
> +
> + /*
> + * we need to make sure any updates done by the data=guarded
> + * code end up in the inode on disk. Log the inode
> + * here
> + */
> + ext3_mark_inode_dirty(handle, inode);
> + } else {
> + ext3_ordered_unlock(inode);
> + }
> +
> +err:
> + return ret;
> +}
> +
> +/*
> + * Wrapper around ordered_orphan_del that starts a transaction
> + */
> +static void ordered_orphan_del_trans(struct inode *inode, int must_log)
> +{
This function is going to be used only from one place, so consider
opencoding it. I don't have a strong opinion...

> + handle_t *handle;
> +
> + handle = ext3_journal_start(inode, 3);
> +
> + /*
> + * uhoh, should we flag the FS as readonly here? ext3_dirty_inode
> + * doesn't, which is what we're modeling ourselves after.
> + *
> + * We do need to make sure to get this inode off the ordered list
> + * when the transaction start fails though. ordered_orphan_del
> + * does the right thing.
> + */
> + if (IS_ERR(handle)) {
> + ordered_orphan_del(NULL, inode, 0);
> + return;
> + }
> +
> + ordered_orphan_del(handle, inode, must_log);
> + ext3_journal_stop(handle);
> +}
> +
> +
> +/*
> * Called at the last iput() if i_nlink is zero.
> */
> void ext3_delete_inode (struct inode * inode)
> @@ -204,6 +304,13 @@ void ext3_delete_inode (struct inode * inode)
> if (IS_SYNC(inode))
> handle->h_sync = 1;
> inode->i_size = 0;
> +
> + /*
> + * make sure we clean up any ordered extents that didn't get
> + * IO started on them because i_size shrunk down to zero.
> + */
> + ext3_truncate_ordered_extents(inode, 0);
> +
> if (inode->i_blocks)
> ext3_truncate(inode);
> /*
> @@ -767,6 +874,24 @@ err_out:
> }
>
> /*
> + * This protects the disk i_size with the spinlock for the ordered
> + * extent tree. It returns 1 when the inode needs to be logged
> + * because the i_disksize has been updated.
> + */
> +static int maybe_update_disk_isize(struct inode *inode, loff_t new_size)
> +{
> + int ret = 0;
> +
> + ext3_ordered_lock(inode);
> + if (EXT3_I(inode)->i_disksize < new_size) {
> + EXT3_I(inode)->i_disksize = new_size;
> + ret = 1;
> + }
> + ext3_ordered_unlock(inode);
> + return ret;
> +}
> +
> +/*
> * Allocation strategy is simple: if we have to allocate something, we will
> * have to go the whole way to leaf. So let's do it before attaching anything
> * to tree, set linkage between the newborn blocks, write them if sync is
> @@ -815,6 +940,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
> if (!partial) {
> first_block = le32_to_cpu(chain[depth - 1].key);
> clear_buffer_new(bh_result);
> + clear_buffer_datanew(bh_result);
> count++;
> /*map more blocks*/
> while (count < maxblocks && count <= blocks_to_boundary) {
> @@ -873,6 +999,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
> if (err)
> goto cleanup;
> clear_buffer_new(bh_result);
> + clear_buffer_datanew(bh_result);
> goto got_it;
> }
> }
> @@ -915,14 +1042,18 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
> * i_disksize growing is protected by truncate_mutex. Don't forget to
> * protect it if you're about to implement concurrent
> * ext3_get_block() -bzzz
> + *
> + * extend_disksize is only called for directories, and so
> + * the are not using guarded buffer protection.
^^^ The sentence is strange...
> */
> - if (!err && extend_disksize && inode->i_size > ei->i_disksize)
> - ei->i_disksize = inode->i_size;
> + if (!err && extend_disksize)
> + maybe_update_disk_isize(inode, inode->i_size);
So do we really need to take the ordered lock for directories? We could
just leave above two lines as they were.

> mutex_unlock(&ei->truncate_mutex);
> if (err)
> goto cleanup;
>
> set_buffer_new(bh_result);
> + set_buffer_datanew(bh_result);
> got_it:
> map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
> if (count > blocks_to_boundary)
> @@ -1079,6 +1210,77 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
> return NULL;
> }
>
> +/*
> + * data=guarded updates are handled in a workqueue after the IO
> + * is done. This runs through the list of buffer heads that are pending
> + * processing.
> + */
> +void ext3_run_guarded_work(struct work_struct *work)
> +{
> + struct ext3_sb_info *sbi =
> + container_of(work, struct ext3_sb_info, guarded_work);
> + struct buffer_head *bh;
> + struct ext3_ordered_extent *ordered;
> + struct inode *inode;
> + struct page *page;
> + int must_log;
> +
> + spin_lock_irq(&sbi->guarded_lock);
> + while (!list_empty(&sbi->guarded_buffers)) {
> + ordered = list_entry(sbi->guarded_buffers.next,
> + struct ext3_ordered_extent, work_list);
> +
> + list_del(&ordered->work_list);
> +
> + bh = ordered->end_io_bh;
> + ordered->end_io_bh = NULL;
> + must_log = 0;
> +
> + /* we don't need a reference on the buffer head because
> + * it is locked until the end_io handler is called.
> + *
> + * This means the page can't go away, which means the
> + * inode can't go away
> + */
> + spin_unlock_irq(&sbi->guarded_lock);
> +
> + page = bh->b_page;
> + inode = page->mapping->host;
> +
> + ext3_ordered_lock(inode);
> + if (ordered->bh) {
> + /*
> + * someone might have decided this buffer didn't
> + * really need to be ordered and removed us from
> + * the list. They set ordered->bh to null
> + * when that happens.
> + */
> + ext3_remove_ordered_extent(inode, ordered);
> + must_log = ext3_ordered_update_i_size(inode);
> + }
> + ext3_ordered_unlock(inode);
> +
> + /*
> + * drop the reference taken when this ordered extent was
> + * put onto the guarded_buffers list
> + */
> + ext3_put_ordered_extent(ordered);
> +
> + /*
> + * maybe log the inode and/or cleanup the orphan entry
> + */
> + ordered_orphan_del_trans(inode, must_log > 0);
> +
> + /*
> + * finally, call the real bh end_io function to do
> + * all the hard work of maintaining page writeback.
> + */
> + end_buffer_async_write(bh, buffer_uptodate(bh));
> + spin_lock_irq(&sbi->guarded_lock);
> + }
> + spin_unlock_irq(&sbi->guarded_lock);
> +}
> +
> static int walk_page_buffers( handle_t *handle,
> struct buffer_head *head,
> unsigned from,
> @@ -1185,6 +1387,7 @@ retry:
> ret = walk_page_buffers(handle, page_buffers(page),
> from, to, NULL, do_journal_get_write_access);
> }
> +
> write_begin_failed:
> if (ret) {
> /*
> @@ -1212,7 +1415,13 @@ out:
>
> int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
> {
> - int err = journal_dirty_data(handle, bh);
> + int err;
> +
> + /* don't take buffers from the data=guarded list */
> + if (buffer_dataguarded(bh))
> + return 0;
> +
> + err = journal_dirty_data(handle, bh);
But this has a problem that if we do extending write (like from pos 1024
to pos 2048) and then do write from 0 to 1024 and we hit the window while
the buffer is on the work queue list, we won't order this write. Probably
we don't care but I wanted to note this...

> if (err)
> ext3_journal_abort_handle(__func__, __func__,
> bh, handle, err);
> @@ -1231,6 +1440,89 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> return 0;
> }
>
> +/*
> + * Walk the buffers in a page for data=guarded mode. Buffers that
> + * are not marked as datanew are ignored.
> + *
> + * New buffers outside i_size are sent to the data guarded code
> + *
> + * We must do the old data=ordered mode when filling holes in the
> + * file, since i_size doesn't protect these at all.
> + */
> +static int journal_dirty_data_guarded_fn(handle_t *handle,
> + struct buffer_head *bh)
> +{
> + u64 offset = page_offset(bh->b_page) + bh_offset(bh);
> + struct inode *inode = bh->b_page->mapping->host;
> + int ret = 0;
> +
> + /*
> + * Write could have mapped the buffer but it didn't copy the data in
> + * yet. So avoid filing such buffer into a transaction.
> + */
> + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> + return 0;
> +
> + if (test_clear_buffer_datanew(bh)) {
Hmm, if we just extend the file inside the block (e.g. from 100 bytes to
500 bytes), then we won't do the write guarded. But then if we crash before
the block really gets written, user will see zeros at the end of file
instead of data... I don't think we should let this happen so I'd think we
have to guard all the extending writes regardless whether they allocate new
block or not. Which probably makes the buffer_datanew() flag unnecessary
because we just guard all the buffers from max(start of write, i_size) to
end of write.
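
For concreteness, guarding every extending buffer might look roughly like
the sketch below (not the patch itself; the straddling-buffer test, the
reuse of the existing function name and the omission of the orphan-add
logic are simplifications):

static int journal_dirty_data_guarded_fn(handle_t *handle,
					 struct buffer_head *bh)
{
	u64 offset = page_offset(bh->b_page) + bh_offset(bh);
	struct inode *inode = bh->b_page->mapping->host;

	/* mapped but data not copied in yet, don't file it */
	if (!buffer_mapped(bh) || !buffer_uptodate(bh))
		return 0;

	/* anything reaching beyond i_size gets guarded, new block or not */
	if (offset + bh->b_size > i_size_read(inode)) {
		clear_buffer_datanew(bh);
		return ext3_add_ordered_extent(inode, offset, bh);
	}

	/* newly allocated blocks entirely inside i_size are hole fills */
	if (test_clear_buffer_datanew(bh))
		return ext3_journal_dirty_data(handle, bh);

	return 0;
}

The datanew flag would then only matter for the hole-filling case.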

> + /*
> + * if we're filling a hole inside i_size, we need to
> + * fall back to the old style data=ordered
> + */
> + if (offset < inode->i_size) {
> + ret = ext3_journal_dirty_data(handle, bh);
> + goto out;
> + }
> + ret = ext3_add_ordered_extent(inode, offset, bh);
> +
> + /* if we crash before the IO is done, i_size will be small
> + * but these blocks will still be allocated to the file.
> + *
> + * So, add an orphan entry for the file, which will truncate it
> + * down to the i_size it finds after the crash.
> + *
> + * The orphan is cleaned up when the IO is done. We
> + * don't add orphans while mount is running the orphan list,
> + * that seems to corrupt the list.
> + */
> + if (ret == 0 && buffer_dataguarded(bh) &&
> + list_empty(&EXT3_I(inode)->i_orphan) &&
> + !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) {
> + ret = ext3_orphan_add(handle, inode);
> + }
> + }
> +out:
> + return ret;
> +}
> +
> +/*
> + * Walk the buffers in a page for data=guarded mode for writepage.
> + *
> + * We must do the old data=ordered mode when filling holes in the
> + * file, since i_size doesn't protect these at all.
> + *
> + * This is actually called after writepage is run and so we can't
> + * trust anything other than the buffer head (which we have pinned).
> + *
> + * Any datanew buffer at writepage time is filling a hole, so we don't need
> + * extra tests against the inode size.
> + */
> +static int journal_dirty_data_guarded_writepage_fn(handle_t *handle,
> + struct buffer_head *bh)
> +{
> + int ret = 0;
> +
> + /*
> + * Write could have mapped the buffer but it didn't copy the data in
> + * yet. So avoid filing such buffer into a transaction.
> + */
> + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> + return 0;
> +
> + if (test_clear_buffer_datanew(bh))
> + ret = ext3_journal_dirty_data(handle, bh);
> + return ret;
> +}
> +
Hmm, here we use the datanew flag as well. But it's probably not worth
keeping it just for this case. Ordering data in all cases when we get here
should be fine since if the block is already allocated we should not get
here (unless somebody managed to strip buffers from the page but kept the
page but that should be rare enough).

> /* For write_end() in data=journal mode */
> static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> {
> @@ -1251,10 +1543,8 @@ static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied)
> /* What matters to us is i_disksize. We don't write i_size anywhere */
> if (pos + copied > inode->i_size)
> i_size_write(inode, pos + copied);
> - if (pos + copied > EXT3_I(inode)->i_disksize) {
> - EXT3_I(inode)->i_disksize = pos + copied;
> + if (maybe_update_disk_isize(inode, pos + copied))
> mark_inode_dirty(inode);
> - }
> }
>
> /*
> @@ -1300,6 +1590,68 @@ static int ext3_ordered_write_end(struct file *file,
> return ret ? ret : copied;
> }
>
> +static int ext3_guarded_write_end(struct file *file,
> + struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata)
> +{
> + handle_t *handle = ext3_journal_current_handle();
> + struct inode *inode = file->f_mapping->host;
> + unsigned from, to;
> + int ret = 0, ret2;
> +
> + copied = block_write_end(file, mapping, pos, len, copied,
> + page, fsdata);
> +
> + from = pos & (PAGE_CACHE_SIZE - 1);
> + to = from + copied;
> + ret = walk_page_buffers(handle, page_buffers(page),
> + from, to, NULL, journal_dirty_data_guarded_fn);
> +
> + /*
> + * we only update the in-memory i_size. The disk i_size is done
> + * by the end io handlers
> + */
> + if (ret == 0 && pos + copied > inode->i_size) {
> + int must_log;
> +
> + /* updated i_size, but we may have raced with a
> + * data=guarded end_io handler.
> + *
> + * All the guarded IO could have ended while i_size was still
> + * small, and if we're just adding bytes into an existing block
> + * in the file, we may not be adding a new guarded IO with this
> + * write. So, do a check on the disk i_size and make sure it
> + * is updated to the highest safe value.
> + *
> + * ext3_ordered_update_i_size tests inode->i_size, so we
> + * make sure to update it with the ordered lock held.
> + */
This can go away if we guard all the extending writes...

> + ext3_ordered_lock(inode);
> + i_size_write(inode, pos + copied);
> +
> + must_log = ext3_ordered_update_i_size(inode);
> + ext3_ordered_unlock(inode);
> + ordered_orphan_del_trans(inode, must_log > 0);
In case this needs to stay, here we have a transaction started so why not
just directly call ordered_orphan_del()?

> + }
> +
> + /*
> + * There may be allocated blocks outside of i_size because
> + * we failed to copy some data. Prepare for truncate.
> + */
> + if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> + ret2 = ext3_journal_stop(handle);
> + if (!ret)
> + ret = ret2;
> + unlock_page(page);
> + page_cache_release(page);
> +
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> + return ret ? ret : copied;
> +}
> +
> static int ext3_writeback_write_end(struct file *file,
> struct address_space *mapping,
> loff_t pos, unsigned len, unsigned copied,
> @@ -1311,6 +1663,7 @@ static int ext3_writeback_write_end(struct file *file,
>
> copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> update_file_sizes(inode, pos, copied);
> +
> /*
> * There may be allocated blocks outside of i_size because
> * we failed to copy some data. Prepare for truncate.
> @@ -1574,6 +1927,144 @@ out_fail:
> return ret;
> }
>
> +/*
> + * Completion handler for block_write_full_page(). This will
> + * kick off the data=guarded workqueue as the IO finishes.
> + */
> +static void end_buffer_async_write_guarded(struct buffer_head *bh,
> + int uptodate)
> +{
> + struct ext3_sb_info *sbi;
> + struct address_space *mapping;
> + struct ext3_ordered_extent *ordered;
> + unsigned long flags;
> +
> + mapping = bh->b_page->mapping;
> + if (!mapping || !bh->b_private || !buffer_dataguarded(bh)) {
> +noguard:
> + end_buffer_async_write(bh, uptodate);
> + return;
> + }
> +
> + /*
> + * the guarded workqueue function checks the uptodate bit on the
> + * bh and uses that to tell the real end_io handler if things worked
> + * out or not.
> + */
> + if (uptodate)
> + set_buffer_uptodate(bh);
> + else
> + clear_buffer_uptodate(bh);
> +
> + sbi = EXT3_SB(mapping->host->i_sb);
> +
> + spin_lock_irqsave(&sbi->guarded_lock, flags);
> +
> + /*
> + * remove any chance that a truncate raced in and cleared
> + * our dataguard flag, which also freed the ordered extent in
> + * our b_private.
> + */
> + if (!buffer_dataguarded(bh)) {
> + spin_unlock_irqrestore(&sbi->guarded_lock, flags);
> + goto noguard;
> + }
> + ordered = bh->b_private;
> + WARN_ON(ordered->end_io_bh);
> +
> + /*
> + * use the special end_io_bh pointer to make sure that
> + * some form of end_io handler is run on this bh, even
> + * if the ordered_extent is removed from the rb tree before
> + * our workqueue ends up processing it.
> + */
> + ordered->end_io_bh = bh;
> + list_add_tail(&ordered->work_list, &sbi->guarded_buffers);
> + ext3_get_ordered_extent(ordered);
> + spin_unlock_irqrestore(&sbi->guarded_lock, flags);
> +
> + queue_work(sbi->guarded_wq, &sbi->guarded_work);
> +}
> +
> +static int ext3_guarded_writepage(struct page *page,
> + struct writeback_control *wbc)
> +{
> + struct inode *inode = page->mapping->host;
> + struct buffer_head *page_bufs;
> + handle_t *handle = NULL;
> + int ret = 0;
> + int err;
> +
> + J_ASSERT(PageLocked(page));
> +
> + /*
> + * We give up here if we're reentered, because it might be for a
> + * different filesystem.
> + */
> + if (ext3_journal_current_handle())
> + goto out_fail;
> +
> + if (!page_has_buffers(page)) {
> + create_empty_buffers(page, inode->i_sb->s_blocksize,
> + (1 << BH_Dirty)|(1 << BH_Uptodate));
> + page_bufs = page_buffers(page);
> + } else {
> + page_bufs = page_buffers(page);
> + if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
> + NULL, buffer_unmapped)) {
> + /* Provide NULL get_block() to catch bugs if buffers
> + * weren't really mapped */
> + return block_write_full_page_endio(page, NULL, wbc,
> + end_buffer_async_write_guarded);
> + }
> + }
> + handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
> +
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out_fail;
> + }
> +
> + walk_page_buffers(handle, page_bufs, 0,
> + PAGE_CACHE_SIZE, NULL, bget_one);
> +
> + ret = block_write_full_page_endio(page, ext3_get_block, wbc,
> + end_buffer_async_write_guarded);
> +
> + /*
> + * The page can become unlocked at any point now, and
> + * truncate can then come in and change things. So we
> + * can't touch *page from now on. But *page_bufs is
> + * safe due to elevated refcount.
> + */
> +
> + /*
> + * And attach them to the current transaction. But only if
> + * block_write_full_page() succeeded. Otherwise they are unmapped,
> + * and generally junk.
> + */
> + if (ret == 0) {
> + err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
> + NULL, journal_dirty_data_guarded_writepage_fn);
> + if (!ret)
> + ret = err;
> + }
> + walk_page_buffers(handle, page_bufs, 0,
> + PAGE_CACHE_SIZE, NULL, bput_one);
> + err = ext3_journal_stop(handle);
> + if (!ret)
> + ret = err;
> +
> + return ret;
> +
> +out_fail:
> + redirty_page_for_writepage(wbc, page);
> + unlock_page(page);
> + return ret;
> +}
> +
> +
> +
> static int ext3_writeback_writepage(struct page *page,
> struct writeback_control *wbc)
> {
> @@ -1747,7 +2238,14 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
> goto out;
> }
> orphan = 1;
> - ei->i_disksize = inode->i_size;
> + /* in guarded mode, other code is responsible
> + * for updating i_disksize. Actually in
> + * every mode, ei->i_disksize should be correct,
> + * so I don't understand why it is getting updated
> + * to i_size here.
> + */
> + if (!ext3_should_guard_data(inode))
> + ei->i_disksize = inode->i_size;
Hmm, true. When we acquire i_mutex, i_size should be equal to i_disksize
so this seems rather pointless. Probably worth a separate patch to remove
it...

> ext3_journal_stop(handle);
> }
> }
> @@ -1768,11 +2266,20 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
> ret = PTR_ERR(handle);
> goto out;
> }
> +
> if (inode->i_nlink)
> - ext3_orphan_del(handle, inode);
> + ordered_orphan_del(handle, inode, 0);
> +
> if (ret > 0) {
> loff_t end = offset + ret;
> if (end > inode->i_size) {
> + /* i_mutex keeps other file writes from
> + * hopping in at this time, and we
> + * know the O_DIRECT write just put all
> + * those blocks on disk. So, we can
> + * safely update i_disksize here even
> + * in guarded mode
> + */
Not quite - there could be guarded blocks before the place where we did
O_DIRECT write and we need to wait for them...

> ei->i_disksize = end;
> i_size_write(inode, end);
> /*
> @@ -1842,6 +2349,21 @@ static const struct address_space_operations ext3_writeback_aops = {
> .is_partially_uptodate = block_is_partially_uptodate,
> };
>
> +static const struct address_space_operations ext3_guarded_aops = {
> + .readpage = ext3_readpage,
> + .readpages = ext3_readpages,
> + .writepage = ext3_guarded_writepage,
> + .sync_page = block_sync_page,
> + .write_begin = ext3_write_begin,
> + .write_end = ext3_guarded_write_end,
> + .bmap = ext3_bmap,
> + .invalidatepage = ext3_invalidatepage,
> + .releasepage = ext3_releasepage,
> + .direct_IO = ext3_direct_IO,
> + .migratepage = buffer_migrate_page,
> + .is_partially_uptodate = block_is_partially_uptodate,
> +};
> +
> static const struct address_space_operations ext3_journalled_aops = {
> .readpage = ext3_readpage,
> .readpages = ext3_readpages,
> @@ -1860,6 +2382,8 @@ void ext3_set_aops(struct inode *inode)
> {
> if (ext3_should_order_data(inode))
> inode->i_mapping->a_ops = &ext3_ordered_aops;
> + else if (ext3_should_guard_data(inode))
> + inode->i_mapping->a_ops = &ext3_guarded_aops;
> else if (ext3_should_writeback_data(inode))
> inode->i_mapping->a_ops = &ext3_writeback_aops;
> else
> @@ -2376,7 +2900,8 @@ void ext3_truncate(struct inode *inode)
> if (!ext3_can_truncate(inode))
> return;
>
> - if (inode->i_size == 0 && ext3_should_writeback_data(inode))
> + if (inode->i_size == 0 && (ext3_should_writeback_data(inode) ||
> + ext3_should_guard_data(inode)))
> ei->i_state |= EXT3_STATE_FLUSH_ON_CLOSE;
>
> /*
> @@ -3103,10 +3628,39 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
> ext3_journal_stop(handle);
> }
>
> + if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
> + /*
> + * we need to make sure any data=guarded pages
> + * are on disk before we force a new disk i_size
> + * down into the inode. The crucial range is
> + * anything between the disksize on disk now
> + * and the new size we're going to set.
> + *
> + * We're holding i_mutex here, so we know new
> + * ordered extents are not going to appear in the inode
> + *
> + * This must be done both for truncates that make the
> + * file bigger and smaller because truncate messes around
> + * with the orphan inode list in both cases.
> + */
> + if (ext3_should_guard_data(inode)) {
> + filemap_write_and_wait_range(inode->i_mapping,
> + EXT3_I(inode)->i_disksize,
> + (loff_t)-1);
> + /*
> + * we've written everything, make sure all
> + * the ordered extents are really gone.
> + *
> + * This prevents leaking of ordered extents
> + * and it also makes sure the ordered extent code
> + * doesn't mess with the orphan link
> + */
> + ext3_truncate_ordered_extents(inode, 0);
> + }
> + }
> if (S_ISREG(inode->i_mode) &&
> attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
> handle_t *handle;
> -
> handle = ext3_journal_start(inode, 3);
> if (IS_ERR(handle)) {
> error = PTR_ERR(handle);
> @@ -3114,6 +3668,7 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
> }
>
> error = ext3_orphan_add(handle, inode);
> +
> EXT3_I(inode)->i_disksize = attr->ia_size;
> rc = ext3_mark_inode_dirty(handle, inode);
> if (!error)
> @@ -3125,8 +3680,11 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
>
> /* If inode_setattr's call to ext3_truncate failed to get a
> * transaction handle at all, we need to clean up the in-core
> - * orphan list manually. */
> - if (inode->i_nlink)
> + * orphan list manually. Because we've finished off all the
> + * guarded IO above, this doesn't hurt anything for the guarded
> + * code
> + */
> + if (inode->i_nlink && (attr->ia_valid & ATTR_SIZE))
> ext3_orphan_del(NULL, inode);
>
> if (!rc && (ia_valid & ATTR_MODE))
> diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
> index 6ff7b97..ac3991a 100644
> --- a/fs/ext3/namei.c
> +++ b/fs/ext3/namei.c
> @@ -2410,7 +2410,8 @@ static int ext3_rename (struct inode * old_dir, struct dentry *old_dentry,
> ext3_mark_inode_dirty(handle, new_inode);
> if (!new_inode->i_nlink)
> ext3_orphan_add(handle, new_inode);
> - if (ext3_should_writeback_data(new_inode))
> + if (ext3_should_writeback_data(new_inode) ||
> + ext3_should_guard_data(new_inode))
> flush_file = 1;
> }
> retval = 0;
> diff --git a/fs/ext3/ordered-data.c b/fs/ext3/ordered-data.c
> new file mode 100644
> index 0000000..a6dab2d
> --- /dev/null
> +++ b/fs/ext3/ordered-data.c
> @@ -0,0 +1,235 @@
> +/*
> + * Copyright (C) 2009 Oracle. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +
> +#include <linux/gfp.h>
> +#include <linux/slab.h>
> +#include <linux/blkdev.h>
> +#include <linux/writeback.h>
> +#include <linux/pagevec.h>
> +#include <linux/buffer_head.h>
> +#include <linux/ext3_jbd.h>
> +
> +/*
> + * simple helper to make sure a new entry we're adding is
> + * at a larger offset in the file than the last entry in the list
> + */
> +static void check_ordering(struct ext3_ordered_buffers *buffers,
> + struct ext3_ordered_extent *entry)
> +{
> + struct ext3_ordered_extent *last;
> +
> + if (list_empty(&buffers->ordered_list))
> + return;
> +
> + last = list_entry(buffers->ordered_list.prev,
> + struct ext3_ordered_extent, ordered_list);
> + BUG_ON(last->start >= entry->start);
> +}
> +
> +/* allocate and add a new ordered_extent into the per-inode list.
> + * start is the logical offset in the file
> + *
> + * The list is given a single reference on the ordered extent that was
> + * inserted, and it also takes a reference on the buffer head.
> + */
> +int ext3_add_ordered_extent(struct inode *inode, u64 start,
> + struct buffer_head *bh)
> +{
> + struct ext3_ordered_buffers *buffers;
> + struct ext3_ordered_extent *entry;
> + int ret = 0;
> +
> + lock_buffer(bh);
> +
> + /* ordered extent already there, or in old style data=ordered */
> + if (bh->b_private) {
> + ret = 0;
> + goto out;
> + }
> +
> + buffers = &EXT3_I(inode)->ordered_buffers;
> + entry = kzalloc(sizeof(*entry), GFP_NOFS);
> + if (!entry) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + spin_lock(&buffers->lock);
> + entry->start = start;
> +
> + get_bh(bh);
> + entry->bh = bh;
> + bh->b_private = entry;
> + set_buffer_dataguarded(bh);
> +
> + /* one ref for the list */
> + atomic_set(&entry->refs, 1);
> + INIT_LIST_HEAD(&entry->work_list);
> +
> + check_ordering(buffers, entry);
> +
> + list_add_tail(&entry->ordered_list, &buffers->ordered_list);
> +
> + spin_unlock(&buffers->lock);
> +out:
> + unlock_buffer(bh);
> + return ret;
> +}
> +
> +/*
> + * used to drop a reference on an ordered extent. This will free
> + * the extent if the last reference is dropped
> + */
> +int ext3_put_ordered_extent(struct ext3_ordered_extent *entry)
> +{
> + if (atomic_dec_and_test(&entry->refs)) {
> + WARN_ON(entry->bh);
> + WARN_ON(entry->end_io_bh);
> + kfree(entry);
> + }
> + return 0;
> +}
> +
> +/*
> + * remove an ordered extent from the list. This removes the
> + * reference held by the list on 'entry' and the
> + * reference on the buffer head held by the entry.
> + */
> +int ext3_remove_ordered_extent(struct inode *inode,
> + struct ext3_ordered_extent *entry)
> +{
> + struct ext3_ordered_buffers *buffers;
> +
> + buffers = &EXT3_I(inode)->ordered_buffers;
> +
> + /*
> + * the data=guarded end_io handler takes this guarded_lock
> + * before it puts a given buffer head and its ordered extent
> + * into the guarded_buffers list. We need to make sure
> + * we don't race with them, so we take the guarded_lock too.
> + */
> + spin_lock_irq(&EXT3_SB(inode->i_sb)->guarded_lock);
> + clear_buffer_dataguarded(entry->bh);
> + entry->bh->b_private = NULL;
> + brelse(entry->bh);
> + entry->bh = NULL;
> + spin_unlock_irq(&EXT3_SB(inode->i_sb)->guarded_lock);
> +
> + /*
> + * we must not clear entry->end_io_bh, that is set by
> + * the end_io handlers and will be cleared by the end_io
> + * workqueue
> + */
> +
> + list_del_init(&entry->ordered_list);
> + ext3_put_ordered_extent(entry);
> + return 0;
> +}
> +
> +/*
> + * After an extent is done, call this to conditionally update the on disk
> + * i_size. i_size is updated to cover any fully written part of the file.
> + *
> + * This returns < 0 on error, zero if no action needs to be taken and
> + * 1 if the inode must be logged.
> + */
> +int ext3_ordered_update_i_size(struct inode *inode)
> +{
> + u64 new_size;
> + u64 disk_size;
> + struct ext3_ordered_extent *test;
> + struct ext3_ordered_buffers *buffers = &EXT3_I(inode)->ordered_buffers;
> + int ret = 0;
> +
> + disk_size = EXT3_I(inode)->i_disksize;
> +
> + /*
> + * if the disk i_size is already at the inode->i_size, we're done
> + */
> + if (disk_size >= inode->i_size)
^^^ i_size_read() here?
> + goto out;
> +
> + /*
> + * if the ordered list is empty, push the disk i_size all the way
> + * up to the inode size, otherwise, use the start of the first
> + * ordered extent in the list as the new disk i_size
> + */
> + if (list_empty(&buffers->ordered_list)) {
> + new_size = inode->i_size;
> + } else {
> + test = list_entry(buffers->ordered_list.next,
> + struct ext3_ordered_extent, ordered_list);
> +
> + new_size = test->start;
> + }
> +
> + new_size = min_t(u64, new_size, i_size_read(inode));
> +
> + /* the caller needs to log this inode */
> + ret = 1;
> +
> + EXT3_I(inode)->i_disksize = new_size;
> +out:
> + return ret;
> +}
> +
> +/*
> + * during a truncate or delete, we need to get rid of pending
> + * ordered extents so there isn't a war over who updates disk i_size first.
> + * This does that, without waiting for any of the IO to actually finish.
> + *
> + * When the IO does finish, it will find the ordered extent removed from the
> + * list and all will work properly.
> + */
> +void ext3_truncate_ordered_extents(struct inode *inode, u64 offset)
> +{
> + struct ext3_ordered_buffers *buffers = &EXT3_I(inode)->ordered_buffers;
> + struct ext3_ordered_extent *test;
> +
> + spin_lock(&buffers->lock);
> + while (!list_empty(&buffers->ordered_list)) {
> +
> + test = list_entry(buffers->ordered_list.prev,
> + struct ext3_ordered_extent, ordered_list);
> +
> + if (test->start < offset)
> + break;
> + /*
> + * once this is called, the end_io handler won't run,
> + * and we won't update disk_i_size to include this buffer.
> + *
> + * That's ok for truncates because the truncate code is
> + * writing a new i_size.
> + *
> + * This ignores any IO in flight, which is ok
> + * because the guarded_buffers list has a reference
> + * on the ordered extent
> + */
> + ext3_remove_ordered_extent(inode, test);
> + }
> + spin_unlock(&buffers->lock);
> + return;
> +
> +}
> +
> +void ext3_ordered_inode_init(struct ext3_inode_info *ei)
> +{
> + INIT_LIST_HEAD(&ei->ordered_buffers.ordered_list);
> + spin_lock_init(&ei->ordered_buffers.lock);
> +}
> +
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 599dbfe..b5a7b42 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -37,6 +37,7 @@
> #include <linux/quotaops.h>
> #include <linux/seq_file.h>
> #include <linux/log2.h>
> +#include <linux/workqueue.h>
>
> #include <asm/uaccess.h>
>
> @@ -399,6 +400,9 @@ static void ext3_put_super (struct super_block * sb)
> struct ext3_super_block *es = sbi->s_es;
> int i, err;
>
> + flush_workqueue(sbi->guarded_wq);
> + destroy_workqueue(sbi->guarded_wq);
> +
> ext3_xattr_put_super(sb);
> err = journal_destroy(sbi->s_journal);
> sbi->s_journal = NULL;
> @@ -468,6 +472,8 @@ static struct inode *ext3_alloc_inode(struct super_block *sb)
> #endif
> ei->i_block_alloc_info = NULL;
> ei->vfs_inode.i_version = 1;
> + ext3_ordered_inode_init(ei);
> +
> return &ei->vfs_inode;
> }
>
> @@ -481,6 +487,8 @@ static void ext3_destroy_inode(struct inode *inode)
> false);
> dump_stack();
> }
> + if (!list_empty(&EXT3_I(inode)->ordered_buffers.ordered_list))
> + printk(KERN_INFO "EXT3 ordered tree not empty\n");
> kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
> }
>
> @@ -528,6 +536,13 @@ static void ext3_clear_inode(struct inode *inode)
> EXT3_I(inode)->i_default_acl = EXT3_ACL_NOT_CACHED;
> }
> #endif
> + /*
> + * If pages got cleaned by truncate, truncate should have
> + * gotten rid of the ordered extents. Just in case, drop them
> + * here.
> + */
> + ext3_truncate_ordered_extents(inode, 0);
> +
> ext3_discard_reservation(inode);
> EXT3_I(inode)->i_block_alloc_info = NULL;
> if (unlikely(rsv))
> @@ -634,6 +649,8 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
> seq_puts(seq, ",data=journal");
> else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA)
> seq_puts(seq, ",data=ordered");
> + else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA)
> + seq_puts(seq, ",data=guarded");
> else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
> seq_puts(seq, ",data=writeback");
>
> @@ -790,7 +807,7 @@ enum {
> Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
> Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
> Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
> - Opt_data_err_abort, Opt_data_err_ignore,
> + Opt_data_guarded, Opt_data_err_abort, Opt_data_err_ignore,
> Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
> Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
> Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
> @@ -832,6 +849,7 @@ static const match_table_t tokens = {
> {Opt_abort, "abort"},
> {Opt_data_journal, "data=journal"},
> {Opt_data_ordered, "data=ordered"},
> + {Opt_data_guarded, "data=guarded"},
> {Opt_data_writeback, "data=writeback"},
> {Opt_data_err_abort, "data_err=abort"},
> {Opt_data_err_ignore, "data_err=ignore"},
> @@ -1034,6 +1052,9 @@ static int parse_options (char *options, struct super_block *sb,
> case Opt_data_ordered:
> data_opt = EXT3_MOUNT_ORDERED_DATA;
> goto datacheck;
> + case Opt_data_guarded:
> + data_opt = EXT3_MOUNT_GUARDED_DATA;
> + goto datacheck;
> case Opt_data_writeback:
> data_opt = EXT3_MOUNT_WRITEBACK_DATA;
> datacheck:
> @@ -1949,11 +1970,23 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
> clear_opt(sbi->s_mount_opt, NOBH);
> }
> }
> +
> + /*
> + * setup the guarded work list
> + */
> + INIT_LIST_HEAD(&EXT3_SB(sb)->guarded_buffers);
> + INIT_WORK(&EXT3_SB(sb)->guarded_work, ext3_run_guarded_work);
> + spin_lock_init(&EXT3_SB(sb)->guarded_lock);
> + EXT3_SB(sb)->guarded_wq = create_workqueue("ext3-guard");
> + if (!EXT3_SB(sb)->guarded_wq) {
> + printk(KERN_ERR "EXT3-fs: failed to create workqueue\n");
> + goto failed_mount_guard;
> + }
> +
> /*
> * The journal_load will have done any necessary log recovery,
> * so we can safely mount the rest of the filesystem now.
> */
> -
> root = ext3_iget(sb, EXT3_ROOT_INO);
> if (IS_ERR(root)) {
> printk(KERN_ERR "EXT3-fs: get root inode failed\n");
> @@ -1965,6 +1998,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
> printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n");
> goto failed_mount4;
> }
> +
> sb->s_root = d_alloc_root(root);
> if (!sb->s_root) {
> printk(KERN_ERR "EXT3-fs: get root dentry failed\n");
> @@ -1974,6 +2008,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
> }
>
> ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY);
> +
> /*
> * akpm: core read_super() calls in here with the superblock locked.
> * That deadlocks, because orphan cleanup needs to lock the superblock
> @@ -1989,9 +2024,10 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
> printk (KERN_INFO "EXT3-fs: recovery complete.\n");
> ext3_mark_recovery_complete(sb, es);
> printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
> - test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
> - test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
> - "writeback");
> + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal" :
> + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA ? "guarded" :
> + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered" :
> + "writeback");
>
> lock_kernel();
> return 0;
> @@ -2003,6 +2039,8 @@ cantfind_ext3:
> goto failed_mount;
>
> failed_mount4:
> + destroy_workqueue(EXT3_SB(sb)->guarded_wq);
> +failed_mount_guard:
> journal_destroy(sbi->s_journal);
> failed_mount3:
> percpu_counter_destroy(&sbi->s_freeblocks_counter);
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index ed886e6..1354a55 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -2018,6 +2018,7 @@ zap_buffer_unlocked:
> clear_buffer_mapped(bh);
> clear_buffer_req(bh);
> clear_buffer_new(bh);
> + clear_buffer_datanew(bh);
> bh->b_bdev = NULL;
> return may_free;
> }
> diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
> index 634a5e5..cf097b7 100644
> --- a/include/linux/ext3_fs.h
> +++ b/include/linux/ext3_fs.h
> @@ -18,6 +18,7 @@
>
> #include <linux/types.h>
> #include <linux/magic.h>
> +#include <linux/workqueue.h>
>
> /*
> * The second extended filesystem constants/structures
> @@ -398,7 +399,6 @@ struct ext3_inode {
> #define EXT3_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
> #define EXT3_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
> #define EXT3_MOUNT_ABORT 0x00200 /* Fatal error detected */
> -#define EXT3_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
> #define EXT3_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
> #define EXT3_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
> #define EXT3_MOUNT_WRITEBACK_DATA 0x00C00 /* No data ordering */
> @@ -414,6 +414,12 @@ struct ext3_inode {
> #define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
> #define EXT3_MOUNT_DATA_ERR_ABORT 0x400000 /* Abort on file data write
> * error in ordered mode */
> +#define EXT3_MOUNT_GUARDED_DATA 0x800000 /* guard new writes with
> + i_size */
> +#define EXT3_MOUNT_DATA_FLAGS (EXT3_MOUNT_JOURNAL_DATA | \
> + EXT3_MOUNT_ORDERED_DATA | \
> + EXT3_MOUNT_WRITEBACK_DATA | \
> + EXT3_MOUNT_GUARDED_DATA)
>
> /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
> #ifndef _LINUX_EXT2_FS_H
> @@ -892,6 +898,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
> extern void ext3_set_aops(struct inode *inode);
> extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> u64 start, u64 len);
> +void ext3_run_guarded_work(struct work_struct *work);
>
> /* ioctl.c */
> extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
> @@ -945,7 +952,30 @@ extern const struct inode_operations ext3_special_inode_operations;
> extern const struct inode_operations ext3_symlink_inode_operations;
> extern const struct inode_operations ext3_fast_symlink_inode_operations;
>
> +/* ordered-data.c */
> +int ext3_add_ordered_extent(struct inode *inode, u64 file_offset,
> + struct buffer_head *bh);
> +int ext3_put_ordered_extent(struct ext3_ordered_extent *entry);
> +int ext3_remove_ordered_extent(struct inode *inode,
> + struct ext3_ordered_extent *entry);
> +int ext3_ordered_update_i_size(struct inode *inode);
> +void ext3_ordered_inode_init(struct ext3_inode_info *ei);
> +void ext3_truncate_ordered_extents(struct inode *inode, u64 offset);
> +
> +static inline void ext3_ordered_lock(struct inode *inode)
> +{
> + spin_lock(&EXT3_I(inode)->ordered_buffers.lock);
> +}
>
> +static inline void ext3_ordered_unlock(struct inode *inode)
> +{
> + spin_unlock(&EXT3_I(inode)->ordered_buffers.lock);
> +}
> +
> +static inline void ext3_get_ordered_extent(struct ext3_ordered_extent *entry)
> +{
> + atomic_inc(&entry->refs);
> +}
> #endif /* __KERNEL__ */
>
> #endif /* _LINUX_EXT3_FS_H */
> diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
> index 7894dd0..11dd4d4 100644
> --- a/include/linux/ext3_fs_i.h
> +++ b/include/linux/ext3_fs_i.h
> @@ -65,6 +65,49 @@ struct ext3_block_alloc_info {
> #define rsv_end rsv_window._rsv_end
>
> /*
> + * used to prevent garbage in files after a crash by
> + * making sure i_size isn't updated until after the IO
> + * is done.
> + *
> + * See fs/ext3/ordered-data.c for the code that uses these.
> + */
> +struct buffer_head;
> +struct ext3_ordered_buffers {
> + /* protects the list and disk i_size */
> + spinlock_t lock;
> +
> + struct list_head ordered_list;
> +};
> +
> +struct ext3_ordered_extent {
> + /* logical offset of the block in the file
> + * strictly speaking we don't need this
> + * but keep it in the struct for
> + * debugging
> + */
> + u64 start;
> +
> + /* buffer head being written */
> + struct buffer_head *bh;
> +
> + /*
> + * set at end_io time so we properly
> + * do IO accounting even when this ordered
> + * extent struct has been removed from the
> + * list
> + */
> + struct buffer_head *end_io_bh;
> +
> + /* number of refs on this ordered extent */
> + atomic_t refs;
> +
> + struct list_head ordered_list;
> +
> + /* list of things being processed by the workqueue */
> + struct list_head work_list;
> +};
> +
> +/*
> * third extended file system inode data in memory
> */
> struct ext3_inode_info {
> @@ -141,6 +184,8 @@ struct ext3_inode_info {
> * by other means, so we have truncate_mutex.
> */
> struct mutex truncate_mutex;
> +
> + struct ext3_ordered_buffers ordered_buffers;
> struct inode vfs_inode;
> };
>
> diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
> index f07f34d..5dbdbeb 100644
> --- a/include/linux/ext3_fs_sb.h
> +++ b/include/linux/ext3_fs_sb.h
> @@ -21,6 +21,7 @@
> #include <linux/wait.h>
> #include <linux/blockgroup_lock.h>
> #include <linux/percpu_counter.h>
> +#include <linux/workqueue.h>
> #endif
> #include <linux/rbtree.h>
>
> @@ -82,6 +83,11 @@ struct ext3_sb_info {
> char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */
> int s_jquota_fmt; /* Format of quota to use */
> #endif
> +
> + struct workqueue_struct *guarded_wq;
> + struct work_struct guarded_work;
> + struct list_head guarded_buffers;
> + spinlock_t guarded_lock;
> };
>
> static inline spinlock_t *
> diff --git a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h
> index cf82d51..45cb4aa 100644
> --- a/include/linux/ext3_jbd.h
> +++ b/include/linux/ext3_jbd.h
> @@ -212,6 +212,17 @@ static inline int ext3_should_order_data(struct inode *inode)
> return 0;
> }
>
> +static inline int ext3_should_guard_data(struct inode *inode)
> +{
> + if (!S_ISREG(inode->i_mode))
> + return 0;
> + if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL)
> + return 0;
> + if (test_opt(inode->i_sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
> + return 1;
> + return 0;
> +}
> +
> static inline int ext3_should_writeback_data(struct inode *inode)
> {
> if (!S_ISREG(inode->i_mode))
> diff --git a/include/linux/jbd.h b/include/linux/jbd.h
> index c2049a0..bbb7990 100644
> --- a/include/linux/jbd.h
> +++ b/include/linux/jbd.h
> @@ -291,6 +291,13 @@ enum jbd_state_bits {
> BH_State, /* Pins most journal_head state */
> BH_JournalHead, /* Pins bh->b_private and jh->b_bh */
> BH_Unshadow, /* Dummy bit, for BJ_Shadow wakeup filtering */
> + BH_DataGuarded, /* ext3 data=guarded mode buffer
> + * these have something other than a
> + * journal_head at b_private */
> + BH_DataNew, /* BH_new gets cleared too early for
> + * data=guarded to use it. So,
> + * this gets set instead.
> + */
> };
>
> BUFFER_FNS(JBD, jbd)
> @@ -302,6 +309,9 @@ TAS_BUFFER_FNS(Revoked, revoked)
> BUFFER_FNS(RevokeValid, revokevalid)
> TAS_BUFFER_FNS(RevokeValid, revokevalid)
> BUFFER_FNS(Freed, freed)
> +BUFFER_FNS(DataGuarded, dataguarded)
> +BUFFER_FNS(DataNew, datanew)
> +TAS_BUFFER_FNS(DataNew, datanew)
>
> static inline struct buffer_head *jh2bh(struct journal_head *jh)
> {

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-04-29 14:09:18

by Chris Mason

Subject: Re: [PATCH RFC] ext3 data=guarded v5

On Wed, 2009-04-29 at 10:56 +0200, Jan Kara wrote:

> > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > index fcfa243..1e90107 100644
> > --- a/fs/ext3/inode.c
> > +++ b/fs/ext3/inode.c
> > @@ -38,6 +38,7 @@
> > #include <linux/bio.h>
> > #include <linux/fiemap.h>
> > #include <linux/namei.h>
> > +#include <linux/workqueue.h>
> > #include "xattr.h"
> > #include "acl.h"
> >
> > @@ -179,6 +180,105 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode)
> > }
> >
> > /*
> > + * after a data=guarded IO is done, we need to update the
> > + * disk i_size to reflect the data we've written. If there are
> > + * no more ordered data extents left in the tree, we need to
> ^^^^^^^^ the list
> > + * get rid of the orphan entry making sure the file's
> > + * block pointers match the i_size after a crash
> > + *
> > + * When we aren't in data=guarded mode, this just does an ext3_orphan_del.
> > + *
> > + * It returns the result of ext3_orphan_del.
> > + *
> > + * handle may be null if we are just cleaning up the orphan list in
> > + * memory.
> > + *
> > + * pass must_log == 1 when the inode must be logged in order to get
> > + * an i_size update on disk
> > + */
> > +static int ordered_orphan_del(handle_t *handle, struct inode *inode,
> > + int must_log)
> > +{
> I'm afraid this function is racy.
> 1) We probably need i_mutex to protect against unlink happening in parallel
> (after we check i_nlink but before we call ext3_orphan_del).

This would mean IO completion (clearing PG_writeback) would have to wait
on the inode mutex, which we can't quite do in O_SYNC and O_DIRECT.
But, what I can do is check i_nlink after the ext3_orphan_del call and
put the inode back on the orphan list if it has gone to zero.

> 2) We need superblock lock for the check list_empty(&EXT3_I(inode)->i_orphan).

How about I take the guarded spinlock when doing the list_add instead?
I'm trying to avoid the superblock lock as much as I can.

> 3) The function should rather have name ext3_guarded_orphan_del()... At
> least "ordered" is really confusing (that's the case for a few other
> structs / variables as well).

My long-term plan was to replace ordered with guarded, but I can rename
this one to guarded if you think it'll make it clearer.

> > +/*
> > + * Wrapper around ordered_orphan_del that starts a transaction
> > + */
> > +static void ordered_orphan_del_trans(struct inode *inode, int must_log)
> > +{
> This function is going to be used only from one place, so consider
> opencoding it. I don't have a strong opinion...


Yeah, I think it keeps the code a little more readable to have it
separate....gcc will inline the thing for us anyway.

> > + *
> > + * extend_disksize is only called for directories, and so
> > + * the are not using guarded buffer protection.
> ^^^ The sentence is strange...

Thanks

> > */
> > - if (!err && extend_disksize && inode->i_size > ei->i_disksize)
> > - ei->i_disksize = inode->i_size;
> > + if (!err && extend_disksize)
> > + maybe_update_disk_isize(inode, inode->i_size);
> So do we really need to take the ordered lock for directories? We could
> just leave above two lines as they were.

Good point

>
> > mutex_unlock(&ei->truncate_mutex);
> > if (err)
> > goto cleanup;
> >
> > set_buffer_new(bh_result);
> > + set_buffer_datanew(bh_result);
> > got_it:
> > map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
> > if (count > blocks_to_boundary)
> > @@ -1079,6 +1210,77 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
> > return NULL;
> > }
> >
> > +/*
> > + * data=guarded updates are handled in a workqueue after the IO
> > + * is done. This runs through the list of buffer heads that are pending
> > + * processing.
> > + */
> > +void ext3_run_guarded_work(struct work_struct *work)
> > +{
> > + struct ext3_sb_info *sbi =
> > + container_of(work, struct ext3_sb_info, guarded_work);
> > + struct buffer_head *bh;
> > + struct ext3_ordered_extent *ordered;
> > + struct inode *inode;
> > + struct page *page;
> > + int must_log;
> > +
> > + spin_lock_irq(&sbi->guarded_lock);
> > + while (!list_empty(&sbi->guarded_buffers)) {
> > + ordered = list_entry(sbi->guarded_buffers.next,
> > + struct ext3_ordered_extent, work_list);
> > +
> > + list_del(&ordered->work_list);
> > +
> > + bh = ordered->end_io_bh;
> > + ordered->end_io_bh = NULL;
> > + must_log = 0;
> > +
> > + /* we don't need a reference on the buffer head because
> > + * it is locked until the end_io handler is called.
> > + *
> > + * This means the page can't go away, which means the
> > + * inode can't go away
> > + */
> > + spin_unlock_irq(&sbi->guarded_lock);
> > +
> > + page = bh->b_page;
> > + inode = page->mapping->host;
> > +
> > + ext3_ordered_lock(inode);
> > + if (ordered->bh) {
> > + /*
> > + * someone might have decided this buffer didn't
> > + * really need to be ordered and removed us from
> > + * the list. They set ordered->bh to null
> > + * when that happens.
> > + */
> > + ext3_remove_ordered_extent(inode, ordered);
> > + must_log = ext3_ordered_update_i_size(inode);
> > + }
> > + ext3_ordered_unlock(inode);
> > +
> > + /*
> > + * drop the reference taken when this ordered extent was
> > + * put onto the guarded_buffers list
> > + */
> > + ext3_put_ordered_extent(ordered);
> > +
> > + /*
> > + * maybe log the inode and/or cleanup the orphan entry
> > + */
> > + ordered_orphan_del_trans(inode, must_log > 0);
> > +
> > + /*
> > + * finally, call the real bh end_io function to do
> > + * all the hard work of maintaining page writeback.
> > + */
> > + end_buffer_async_write(bh, buffer_uptodate(bh));
> > + spin_lock_irq(&sbi->guarded_lock);
> > + }
> > + spin_unlock_irq(&sbi->guarded_lock);
> > +}
> > +
> > static int walk_page_buffers( handle_t *handle,
> > struct buffer_head *head,
> > unsigned from,
> > @@ -1185,6 +1387,7 @@ retry:
> > ret = walk_page_buffers(handle, page_buffers(page),
> > from, to, NULL, do_journal_get_write_access);
> > }
> > +
> > write_begin_failed:
> > if (ret) {
> > /*
> > @@ -1212,7 +1415,13 @@ out:
> >
> > int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
> > {
> > - int err = journal_dirty_data(handle, bh);
> > + int err;
> > +
> > + /* don't take buffers from the data=guarded list */
> > + if (buffer_dataguarded(bh))
> > + return 0;
> > +
> > + err = journal_dirty_data(handle, bh);
> But this has a problem that if we do extending write (like from pos 1024
> to pos 2048) and then do write from 0 to 1024 and we hit the window while
> the buffer is on the work queue list, we won't order this write. Probably
> we don't care but I wanted to note this...

Yeah, in this case the guarded IO should protect i_size, and this write
won't really be ordered. The block could have zeros from 0-1024 if we
crash.

>
> > if (err)
> > ext3_journal_abort_handle(__func__, __func__,
> > bh, handle, err);
> > @@ -1231,6 +1440,89 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> > return 0;
> > }
> >
> > +/*
> > + * Walk the buffers in a page for data=guarded mode. Buffers that
> > + * are not marked as datanew are ignored.
> > + *
> > + * New buffers outside i_size are sent to the data guarded code
> > + *
> > + * We must do the old data=ordered mode when filling holes in the
> > + * file, since i_size doesn't protect these at all.
> > + */
> > +static int journal_dirty_data_guarded_fn(handle_t *handle,
> > + struct buffer_head *bh)
> > +{
> > + u64 offset = page_offset(bh->b_page) + bh_offset(bh);
> > + struct inode *inode = bh->b_page->mapping->host;
> > + int ret = 0;
> > +
> > + /*
> > + * Write could have mapped the buffer but it didn't copy the data in
> > + * yet. So avoid filing such buffer into a transaction.
> > + */
> > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > + return 0;
> > +
> > + if (test_clear_buffer_datanew(bh)) {
> Hmm, if we just extend the file inside the block (e.g. from 100 bytes to
> 500 bytes), then we won't do the write guarded. But then if we crash before
> the block really gets written, user will see zeros at the end of file
> instead of data...

You see something like this:

create(file)
write(file, 100 bytes) # create guarded IO
fsync(file)
write(file, 400 more bytes) # buffer isn't guarded, i_size goes to 500


> I don't think we should let this happen so I'd think we
> have to guard all the extending writes regardless whether they allocate new
> block or not.

My main concern was avoiding stale data from the disk after a crash;
zeros from partially written blocks are not as big a problem. But,
you're right that we can easily avoid this, so I'll update the patch to
do all extending writes as guarded.

> Which probably makes the buffer_datanew() flag unnecessary
> because we just guard all the buffers from max(start of write, i_size) to
> end of write.

But, we still want buffer_datanew to decide when writes that fill holes
should go through data=ordered.

> > +/*
> > + * Walk the buffers in a page for data=guarded mode for writepage.
> > + *
> > + * We must do the old data=ordered mode when filling holes in the
> > + * file, since i_size doesn't protect these at all.
> > + *
> > + * This is actually called after writepage is run and so we can't
> > + * trust anything other than the buffer head (which we have pinned).
> > + *
> > + * Any datanew buffer at writepage time is filling a hole, so we don't need
> > + * extra tests against the inode size.
> > + */
> > +static int journal_dirty_data_guarded_writepage_fn(handle_t *handle,
> > + struct buffer_head *bh)
> > +{
> > + int ret = 0;
> > +
> > + /*
> > + * Write could have mapped the buffer but it didn't copy the data in
> > + * yet. So avoid filing such buffer into a transaction.
> > + */
> > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > + return 0;
> > +
> > + if (test_clear_buffer_datanew(bh))
> > + ret = ext3_journal_dirty_data(handle, bh);
> > + return ret;
> > +}
> > +
> Hmm, here we use the datanew flag as well. But it's probably not worth
> keeping it just for this case. Ordering data in all cases when we get here
> should be fine since if the block is already allocated we should not get
> here (unless somebody managed to strip buffers from the page but kept the
> page but that should be rare enough).
>

I'd keep it for the common case of filling holes with write(), so then
the code in writepage is gravy.

> > @@ -1300,6 +1590,68 @@ static int ext3_ordered_write_end(struct file *file,
> > return ret ? ret : copied;
> > }
> >
> > +static int ext3_guarded_write_end(struct file *file,
> > + struct address_space *mapping,
> > + loff_t pos, unsigned len, unsigned copied,
> > + struct page *page, void *fsdata)
> > +{
> > + handle_t *handle = ext3_journal_current_handle();
> > + struct inode *inode = file->f_mapping->host;
> > + unsigned from, to;
> > + int ret = 0, ret2;
> > +
> > + copied = block_write_end(file, mapping, pos, len, copied,
> > + page, fsdata);
> > +
> > + from = pos & (PAGE_CACHE_SIZE - 1);
> > + to = from + copied;
> > + ret = walk_page_buffers(handle, page_buffers(page),
> > + from, to, NULL, journal_dirty_data_guarded_fn);
> > +
> > + /*
> > + * we only update the in-memory i_size. The disk i_size is done
> > + * by the end io handlers
> > + */
> > + if (ret == 0 && pos + copied > inode->i_size) {
> > + int must_log;
> > +
> > + /* updated i_size, but we may have raced with a
> > + * data=guarded end_io handler.
> > + *
> > + * All the guarded IO could have ended while i_size was still
> > + * small, and if we're just adding bytes into an existing block
> > + * in the file, we may not be adding a new guarded IO with this
> > + * write. So, do a check on the disk i_size and make sure it
> > + * is updated to the highest safe value.
> > + *
> > + * ext3_ordered_update_i_size tests inode->i_size, so we
> > + * make sure to update it with the ordered lock held.
> > + */
> This can go away if we guard all the extending writes...

Yes, good point.

>
> > + ext3_ordered_lock(inode);
> > + i_size_write(inode, pos + copied);
> > +
> > + must_log = ext3_ordered_update_i_size(inode);
> > + ext3_ordered_unlock(inode);
> > + ordered_orphan_del_trans(inode, must_log > 0);
> In case this needs to stay, here we have a transaction started so why not
> just directly call ordered_orphan_del()?
>

Thanks

> > @@ -1747,7 +2238,14 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
> > goto out;
> > }
> > orphan = 1;
> > - ei->i_disksize = inode->i_size;
> > + /* in guarded mode, other code is responsible
> > + * for updating i_disksize. Actually in
> > + * every mode, ei->i_disksize should be correct,
> > + * so I don't understand why it is getting updated
> > + * to i_size here.
> > + */
> > + if (!ext3_should_guard_data(inode))
> > + ei->i_disksize = inode->i_size;
> Hmm, true. When we acquire i_mutex, i_size should be equal to i_disksize
> so this seems rather pointless. Probably worth a separate patch to remove
> it...

Yeah, I didn't want to go around messing with O_DIRECT in this
patchset ;)

>
> > ext3_journal_stop(handle);
> > }
> > }
> > @@ -1768,11 +2266,20 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
> > ret = PTR_ERR(handle);
> > goto out;
> > }
> > +
> > if (inode->i_nlink)
> > - ext3_orphan_del(handle, inode);
> > + ordered_orphan_del(handle, inode, 0);
> > +
> > if (ret > 0) {
> > loff_t end = offset + ret;
> > if (end > inode->i_size) {
> > + /* i_mutex keeps other file writes from
> > + * hopping in at this time, and we
> > + * know the O_DIRECT write just put all
> > + * those blocks on disk. So, we can
> > + * safely update i_disksize here even
> > + * in guarded mode
> > + */
> Not quite - there could be guarded blocks before the place where we did
> O_DIRECT write and we need to wait for them...

Hmmm, O_DIRECT is only waiting on the blocks it actually wrote, isn't it?
Good point, will fix.
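
Roughly, the fix is to let the guarded code pick the largest safe value
instead of assigning i_disksize directly, something like (names from this
patchset, sketch only):

        if (ext3_should_guard_data(inode)) {
                /* guarded writes at lower offsets may still be in
                 * flight, so only advance i_disksize as far as is safe */
                i_size_write(inode, end);
                ext3_ordered_update_i_size(inode);
        } else {
                ei->i_disksize = end;
                i_size_write(inode, end);
        }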

-chris



2009-04-29 14:44:18

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH RFC] ext3 data=guarded v5

On Wed, 2009-04-29 at 10:08 -0400, Chris Mason wrote:
> On Wed, 2009-04-29 at 10:56 +0200, Jan Kara wrote:
>
> > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > > index fcfa243..1e90107 100644
> > > --- a/fs/ext3/inode.c
> > > +++ b/fs/ext3/inode.c
> > > @@ -38,6 +38,7 @@
> > > #include <linux/bio.h>
> > > #include <linux/fiemap.h>
> > > #include <linux/namei.h>
> > > +#include <linux/workqueue.h>
> > > #include "xattr.h"
> > > #include "acl.h"
> > >
> > > @@ -179,6 +180,105 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode)
> > > }
> > >
> > > /*
> > > + * after a data=guarded IO is done, we need to update the
> > > + * disk i_size to reflect the data we've written. If there are
> > > + * no more ordered data extents left in the tree, we need to
> > ^^^^^^^^ the list
> > > + * get rid of the orphan entry making sure the file's
> > > + * block pointers match the i_size after a crash
> > > + *
> > > + * When we aren't in data=guarded mode, this just does an ext3_orphan_del.
> > > + *
> > > + * It returns the result of ext3_orphan_del.
> > > + *
> > > + * handle may be null if we are just cleaning up the orphan list in
> > > + * memory.
> > > + *
> > > + * pass must_log == 1 when the inode must be logged in order to get
> > > + * an i_size update on disk
> > > + */
> > > +static int ordered_orphan_del(handle_t *handle, struct inode *inode,
> > > + int must_log)
> > > +{
> > I'm afraid this function is racy.
> > 1) We probably need i_mutex to protect against unlink happening in parallel
> > (after we check i_nlink but before we call ext3_orphan_del).
>
> This would mean IO completion (clearing PG_writeback) would have to wait
> on the inode mutex, which we can't quite do in O_SYNC and O_DIRECT.
> But, what I can do is check i_nlink after the ext3_orphan_del call and
> put the inode back on the orphan list if it has gone to zero.

Ugh, that won't work, we'll just race with link and risk an orphan that
never gets removed. I'll make a version of ext3_orphan_del that expects
the super lock held and use that instead.
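
Sketch of that split (the locked variant keeps the current body, minus the
lock_super/unlock_super calls):

        int ext3_orphan_del(handle_t *handle, struct inode *inode)
        {
                int ret;

                lock_super(inode->i_sb);
                ret = ext3_orphan_del_locked(handle, inode);
                unlock_super(inode->i_sb);
                return ret;
        }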

-chris



2009-04-29 19:15:37

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC] ext3 data=guarded v5

On Wed 29-04-09 10:08:13, Chris Mason wrote:
> On Wed, 2009-04-29 at 10:56 +0200, Jan Kara wrote:
>
> > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > > index fcfa243..1e90107 100644
> > > --- a/fs/ext3/inode.c
> > > +++ b/fs/ext3/inode.c
> > > @@ -38,6 +38,7 @@
> > > #include <linux/bio.h>
> > > #include <linux/fiemap.h>
> > > #include <linux/namei.h>
> > > +#include <linux/workqueue.h>
> > > #include "xattr.h"
> > > #include "acl.h"
> > >
> > > @@ -179,6 +180,105 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode)
> > > }
> > >
> > > /*
> > > + * after a data=guarded IO is done, we need to update the
> > > + * disk i_size to reflect the data we've written. If there are
> > > + * no more ordered data extents left in the tree, we need to
> > ^^^^^^^^ the list
> > > + * get rid of the orphan entry making sure the file's
> > > + * block pointers match the i_size after a crash
> > > + *
> > > + * When we aren't in data=guarded mode, this just does an ext3_orphan_del.
> > > + *
> > > + * It returns the result of ext3_orphan_del.
> > > + *
> > > + * handle may be null if we are just cleaning up the orphan list in
> > > + * memory.
> > > + *
> > > + * pass must_log == 1 when the inode must be logged in order to get
> > > + * an i_size update on disk
> > > + */
> > > +static int ordered_orphan_del(handle_t *handle, struct inode *inode,
> > > + int must_log)
> > > +{
> > I'm afraid this function is racy.
> > 1) We probably need i_mutex to protect against unlink happening in parallel
> > (after we check i_nlink but before we call ext3_orphan_del).
>
> This would mean IO completion (clearing PG_writeback) would have to wait
> on the inode mutex, which we can't quite do in O_SYNC and O_DIRECT.
> But, what I can do is check i_nlink after the ext3_orphan_del call and
> put the inode back on the orphan list if it has gone to zero.
Ah, good point. But doing it without i_mutex is icky. Strictly speaking
you should have memory barriers in the code to make sure that you fetch
a recent value of i_nlink (although other locking around those places
probably does the work for you but *proving* the correctness is complex).
Hmm, I can't help it - the idea of updating i_disksize and calling
end_page_writeback() directly from the end_io handler, and doing just the
mark_inode_dirty() and orphan deletion from the workqueue, still keeps
coming back to me ;-) It would solve this problem as well. We just have to
somehow pin the inode so that VFS cannot remove it before we manage to file
the i_disksize update into the transaction, which, I agree, has some issues
as well, as you wrote in some email... The main advantage of the current
scheme probably is that the PG_writeback bit naturally tells the VM that we
still have some pinned resources, so it throttles writers if needed... But
still, I find mine better ;).


> > 2) We need superblock lock for the check
> > list_empty(&EXT3_I(inode)->i_orphan).
>
> How about I take the guarded spinlock when doing the list_add instead?
> I'm trying to avoid the superblock lock as much as I can.
Well, ext3_orphan_del() takes it anyway so it's not like you introduce a
new lock dependency. Actually in the code I wrote you have exactly as much
superblock locking as there is now. So IMO this is not an issue if we
somehow solve problem 1).

> > 3) The function should rather have name ext3_guarded_orphan_del()... At
> > least "ordered" is really confusing (that's the case for a few other
> > structs / variables as well).
>
> My long-term plan was to replace ordered with guarded, but I can rename
> this one to guarded if you think it'll make it clearer.
Ah, OK. My feeling is that there's going to be at least some period of
coexistence of these two modes (also because the performance numbers for
ordered mode were considerably better in some loads), so I'd vote for
clearly separating their names.

> > > if (err)
> > > ext3_journal_abort_handle(__func__, __func__,
> > > bh, handle, err);
> > > @@ -1231,6 +1440,89 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> > > return 0;
> > > }
> > >
> > > +/*
> > > + * Walk the buffers in a page for data=guarded mode. Buffers that
> > > + * are not marked as datanew are ignored.
> > > + *
> > > + * New buffers outside i_size are sent to the data guarded code
> > > + *
> > > + * We must do the old data=ordered mode when filling holes in the
> > > + * file, since i_size doesn't protect these at all.
> > > + */
> > > +static int journal_dirty_data_guarded_fn(handle_t *handle,
> > > + struct buffer_head *bh)
> > > +{
> > > + u64 offset = page_offset(bh->b_page) + bh_offset(bh);
> > > + struct inode *inode = bh->b_page->mapping->host;
> > > + int ret = 0;
> > > +
> > > + /*
> > > + * Write could have mapped the buffer but it didn't copy the data in
> > > + * yet. So avoid filing such buffer into a transaction.
> > > + */
> > > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > > + return 0;
> > > +
> > > + if (test_clear_buffer_datanew(bh)) {
> > Hmm, if we just extend the file inside the block (e.g. from 100 bytes to
> > 500 bytes), then we won't do the write guarded. But then if we crash before
> > the block really gets written, user will see zeros at the end of file
> > instead of data...
>
> You see something like this:
>
> create(file)
> write(file, 100 bytes) # create guarded IO
> fsync(file)
> write(file, 400 more bytes) # buffer isn't guarded, i_size goes to 500
Yes.

> > I don't think we should let this happen so I'd think we
> > have to guard all the extending writes regardless whether they allocate new
> > block or not.
>
> My main concern was avoiding stale data from the disk after a crash,
> zeros from partially written blocks are not as big a problem. But,
> you're right that we can easily avoid this, so I'll update the patch to
> do all extending writes as guarded.
Yes, ext3 actually does some work to avoid exposing zeros at the
end of file when we fail to copy in new data (source page was swapped out
in the mean time) and we crash before we manage to swap the page in again.
So it would be stupid to introduce the same problem here again...

> > Which probably makes the buffer_datanew() flag unnecessary
> > because we just guard all the buffers from max(start of write, i_size) to
> > end of write.
>
> But, we still want buffer_datanew to decide when writes that fill holes
> should go through data=ordered.
See below.

> > > +/*
> > > + * Walk the buffers in a page for data=guarded mode for writepage.
> > > + *
> > > + * We must do the old data=ordered mode when filling holes in the
> > > + * file, since i_size doesn't protect these at all.
> > > + *
> > > + * This is actually called after writepage is run and so we can't
> > > + * trust anything other than the buffer head (which we have pinned).
> > > + *
> > > + * Any datanew buffer at writepage time is filling a hole, so we don't need
> > > + * extra tests against the inode size.
> > > + */
> > > +static int journal_dirty_data_guarded_writepage_fn(handle_t *handle,
> > > + struct buffer_head *bh)
> > > +{
> > > + int ret = 0;
> > > +
> > > + /*
> > > + * Write could have mapped the buffer but it didn't copy the data in
> > > + * yet. So avoid filing such buffer into a transaction.
> > > + */
> > > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > > + return 0;
> > > +
> > > + if (test_clear_buffer_datanew(bh))
> > > + ret = ext3_journal_dirty_data(handle, bh);
> > > + return ret;
> > > +}
> > > +
> > Hmm, here we use the datanew flag as well. But it's probably not worth
> > keeping it just for this case. Ordering data in all cases when we get here
> > should be fine since if the block is already allocated we should not get
> > here (unless somebody managed to strip buffers from the page but kept the
> > page, but that should be rare enough).
> >
>
> I'd keep it for the common case of filling holes with write(), so then
> the code in writepage is gravy.
I'm not sure I understand. I wanted to suggest to change
if (test_clear_buffer_datanew(bh))
ret = ext3_journal_dirty_data(handle, bh);
to
ret = ext3_journal_dirty_data(handle, bh);

So to always order the data. Actually the only case when we would get to
journal_dirty_data_guarded_writepage_fn() and buffer_datanew() is not set
would be when
a) blocksize < pagesize, page already has some blocks allocated and we
add more blocks to fill the hole. Then with your code we would not order
all blocks in the page, with my code we would. But I don't think it makes a
big difference.
b) someone removed buffers from the page.
In all other cases we should take the fast path in writepage and submit the
IO without starting the transaction.
So IMO datanew can be removed...
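
I.e. the writepage helper would shrink to something like this (sketch only,
same names as in the patch):

        static int journal_dirty_data_guarded_writepage_fn(handle_t *handle,
                                                           struct buffer_head *bh)
        {
                /* write may have mapped the buffer without copying data
                 * in yet, so don't file such buffers into a transaction */
                if (!buffer_mapped(bh) || !buffer_uptodate(bh))
                        return 0;

                /* anything reaching this slow path is filling a hole (or
                 * had its buffers stripped), so always order it */
                return ext3_journal_dirty_data(handle, bh);
        }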

> > > ext3_journal_stop(handle);
> > > }
> > > }
> > > @@ -1768,11 +2266,20 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
> > > ret = PTR_ERR(handle);
> > > goto out;
> > > }
> > > +
> > > if (inode->i_nlink)
> > > - ext3_orphan_del(handle, inode);
> > > + ordered_orphan_del(handle, inode, 0);
> > > +
> > > if (ret > 0) {
> > > loff_t end = offset + ret;
> > > if (end > inode->i_size) {
> > > + /* i_mutex keeps other file writes from
> > > + * hopping in at this time, and we
> > > + * know the O_DIRECT write just put all
> > > + * those blocks on disk. So, we can
> > > + * safely update i_disksize here even
> > > + * in guarded mode
> > > + */
> > Not quite - there could be guarded blocks before the place where we did
> > O_DIRECT write and we need to wait for them...
>
> Hmmm, O_DIRECT is only waiting on the blocks it actually wrote, isn't it?
> Good point, will fix.
Yes.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-04-29 19:42:24

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH RFC] ext3 data=guarded v5

On Wed, 2009-04-29 at 21:15 +0200, Jan Kara wrote:
> On Wed 29-04-09 10:08:13, Chris Mason wrote:
> > On Wed, 2009-04-29 at 10:56 +0200, Jan Kara wrote:
> >
> > > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > > > index fcfa243..1e90107 100644
> > > > --- a/fs/ext3/inode.c
> > > > +++ b/fs/ext3/inode.c
> > > > @@ -38,6 +38,7 @@
> > > > #include <linux/bio.h>
> > > > #include <linux/fiemap.h>
> > > > #include <linux/namei.h>
> > > > +#include <linux/workqueue.h>
> > > > #include "xattr.h"
> > > > #include "acl.h"
> > > >
> > > > @@ -179,6 +180,105 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode)
> > > > }
> > > >
> > > > /*
> > > > + * after a data=guarded IO is done, we need to update the
> > > > + * disk i_size to reflect the data we've written. If there are
> > > > + * no more ordered data extents left in the tree, we need to
> > > ^^^^^^^^ the list
> > > > + * get rid of the orphan entry making sure the file's
> > > > + * block pointers match the i_size after a crash
> > > > + *
> > > > + * When we aren't in data=guarded mode, this just does an ext3_orphan_del.
> > > > + *
> > > > + * It returns the result of ext3_orphan_del.
> > > > + *
> > > > + * handle may be null if we are just cleaning up the orphan list in
> > > > + * memory.
> > > > + *
> > > > + * pass must_log == 1 when the inode must be logged in order to get
> > > > + * an i_size update on disk
> > > > + */
> > > > +static int ordered_orphan_del(handle_t *handle, struct inode *inode,
> > > > + int must_log)
> > > > +{
> > > I'm afraid this function is racy.
> > > 1) We probably need i_mutex to protect against unlink happening in parallel
> > > (after we check i_nlink but before we call ext3_orphan_del).
> >
> > This would mean IO completion (clearing PG_writeback) would have to wait
> > on the inode mutex, which we can't quite do in O_SYNC and O_DIRECT.
> > But, what I can do is check i_nlink after the ext3_orphan_del call and
> > put the inode back on the orphan list if it has gone to zero.
> Ah, good point. But doing it without i_mutex is icky. Strictly speaking
> you should have memory barriers in the code to make sure that you fetch
> a recent value of i_nlink (although other locking around those places
> probably does the work for you but *proving* the correctness is complex).
> Hmm, I can't help it - the idea of updating i_disksize and calling
> end_page_writeback() directly from the end_io handler, and doing just the
> mark_inode_dirty() and orphan deletion from the workqueue, still keeps
> coming back to me ;-) It would solve this problem as well. We just have to
> somehow pin the inode so that VFS cannot remove it before we manage to file
> the i_disksize update into the transaction, which, I agree, has some issues
> as well, as you wrote in some email... The main advantage of the current
> scheme probably is that the PG_writeback bit naturally tells the VM that we
> still have some pinned resources, so it throttles writers if needed... But
> still, I find mine better ;).
>

I think my latest patch has this nailed down without the mutex.
Basically it checks i_nlink with super lock held and then calls
ext3_orphan_del. If we race with unlink, we'll either find the new
nlink count and skip the orphan del or unlink will come in after us and
add the orphan back.
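
Condensed, the recheck in the new orphan_del looks something like this
(sketch, see the patch below for the full version):

        lock_super(inode->i_sb);
        /* make sure we see a current i_nlink and ordered list */
        smp_mb();
        if (inode->i_nlink && list_empty(ordered_list)) {
                /* still safe: log the inode and drop the orphan entry */
                ext3_mark_inode_dirty(handle, inode);
                ret = ext3_orphan_del_locked(handle, inode);
        }
        /* otherwise unlink or a new guarded IO beat us to it, do nothing */
        unlock_super(inode->i_sb);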

(new patch on the way)

>
> > > 2) We need superblock lock for the check
> > > list_empty(&EXT3_I(inode)->i_orphan).
> >
> > How about I take the guarded spinlock when doing the list_add instead?
> > I'm trying to avoid the superblock lock as much as I can.
> Well, ext3_orphan_del() takes it anyway so it's not like you introduce a
> new lock dependency. Actually in the code I wrote you have exactly as much
> superblock locking as there is now. So IMO this is not an issue if we
> somehow solve problem 1).
>

Yeah, I agree.

> > > 3) The function should rather have name ext3_guarded_orphan_del()... At
> > > least "ordered" is really confusing (that's the case for a few other
> > > structs / variables as well).
> >
> > My long-term plan was to replace ordered with guarded, but I can rename
> > this one to guarded if you think it'll make it clearer.
> Ah, OK. My feeling is that there's going to be at least some period of
> coexistence of these two modes (also because the performance numbers for
> ordered mode were considerably better in some loads), so I'd vote for
> clearly separating their names.
>
> > > > if (err)
> > > > ext3_journal_abort_handle(__func__, __func__,
> > > > bh, handle, err);
> > > > @@ -1231,6 +1440,89 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> > > > return 0;
> > > > }
> > > >
> > > > +/*
> > > > + * Walk the buffers in a page for data=guarded mode. Buffers that
> > > > + * are not marked as datanew are ignored.
> > > > + *
> > > > + * New buffers outside i_size are sent to the data guarded code
> > > > + *
> > > > + * We must do the old data=ordered mode when filling holes in the
> > > > + * file, since i_size doesn't protect these at all.
> > > > + */
> > > > +static int journal_dirty_data_guarded_fn(handle_t *handle,
> > > > + struct buffer_head *bh)
> > > > +{
> > > > + u64 offset = page_offset(bh->b_page) + bh_offset(bh);
> > > > + struct inode *inode = bh->b_page->mapping->host;
> > > > + int ret = 0;
> > > > +
> > > > + /*
> > > > + * Write could have mapped the buffer but it didn't copy the data in
> > > > + * yet. So avoid filing such buffer into a transaction.
> > > > + */
> > > > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > > > + return 0;
> > > > +
> > > > + if (test_clear_buffer_datanew(bh)) {
> > > Hmm, if we just extend the file inside the block (e.g. from 100 bytes to
> > > 500 bytes), then we won't do the write guarded. But then if we crash before
> > > the block really gets written, user will see zeros at the end of file
> > > instead of data...
> >
> > You see something like this:
> >
> > create(file)
> > write(file, 100 bytes) # create guarded IO
> > fsync(file)
> > write(file, 400 more bytes) # buffer isn't guarded, i_size goes to 500
> Yes.
>
> > > I don't think we should let this happen so I'd think we
> > > have to guard all the extending writes regardless whether they allocate new
> > > block or not.
> >
> > My main concern was avoiding stale data from the disk after a crash,
> > zeros from partially written blocks are not as big a problem. But,
> > you're right that we can easily avoid this, so I'll update the patch to
> > do all extending writes as guarded.
> Yes, ext3 actually does some work to avoid exposing zeros at the
> end of file when we fail to copy in new data (source page was swapped out
> in the mean time) and we crash before we manage to swap the page in again.
> So it would be stupid to introduce the same problem here again...

This should be fixed in my current version as well. The buffer is
likely to be added as an ordered buffer in that case, but either way
there won't be zeros.

>
> > > Which probably makes the buffer_datanew() flag unnecessary
> > > because we just guard all the buffers from max(start of write, i_size) to
> > > end of write.
> >
> > But, we still want buffer_datanew to decide when writes that fill holes
> > should go through data=ordered.
> See below.
>
> > > > +/*
> > > > + * Walk the buffers in a page for data=guarded mode for writepage.
> > > > + *
> > > > + * We must do the old data=ordered mode when filling holes in the
> > > > + * file, since i_size doesn't protect these at all.
> > > > + *
> > > > + * This is actually called after writepage is run and so we can't
> > > > + * trust anything other than the buffer head (which we have pinned).
> > > > + *
> > > > + * Any datanew buffer at writepage time is filling a hole, so we don't need
> > > > + * extra tests against the inode size.
> > > > + */
> > > > +static int journal_dirty_data_guarded_writepage_fn(handle_t *handle,
> > > > + struct buffer_head *bh)
> > > > +{
> > > > + int ret = 0;
> > > > +
> > > > + /*
> > > > + * Write could have mapped the buffer but it didn't copy the data in
> > > > + * yet. So avoid filing such buffer into a transaction.
> > > > + */
> > > > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > > > + return 0;
> > > > +
> > > > + if (test_clear_buffer_datanew(bh))
> > > > + ret = ext3_journal_dirty_data(handle, bh);
> > > > + return ret;
> > > > +}
> > > > +
> > > Hmm, here we use the datanew flag as well. But it's probably not worth
> > > keeping it just for this case. Ordering data in all cases when we get here
> > > should be fine since if the block is already allocated we should not get
> > > here (unless somebody managed to strip buffers from the page but kept the
> > > page, but that should be rare enough).
> > >
> >
> > I'd keep it for the common case of filling holes with write(), so then
> > the code in writepage is gravy.
> I'm not sure I understand. I wanted to suggest to change
> if (test_clear_buffer_datanew(bh))
> ret = ext3_journal_dirty_data(handle, bh);
> to
> ret = ext3_journal_dirty_data(handle, bh);
>
> So to always order the data. Actually the only case when we would get to
> journal_dirty_data_guarded_writepage_fn() and buffer_datanew() is not set
> would be when
> a) blocksize < pagesize, page already has some blocks allocated and we
> add more blocks to fill the hole. Then with your code we would not order
> all blocks in the page, with my code we would. But I don't think it makes a
> big difference.
> b) someone removed buffers from the page.
> In all other cases we should take the fast path in writepage and submit the
> IO without starting the transaction.
> So IMO datanew can be removed...

What we don't want to do is have a call to write() over existing blocks
in the file add new things to the data=ordered list. I don't see how we
can avoid that without datanew.
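
In my current version the write_end path keys off both datanew and i_size,
roughly (sketch of journal_dirty_data_guarded_fn):

        was_new = test_clear_buffer_datanew(bh);

        if (offset < inode->i_size) {
                /* inside i_size: only newly allocated blocks (hole fills)
                 * need old-style data=ordered; overwrites of existing
                 * blocks stay off the ordered list entirely */
                if (was_new)
                        ret = ext3_journal_dirty_data(handle, bh);
        } else {
                /* extending write: guard it whether or not the block is new */
                ret = ext3_add_ordered_extent(inode, offset, bh);
        }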

-chris



2009-04-29 19:47:48

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH RFC] ext3 data=guarded v5

On Apr 29, 2009 10:43 -0400, Chris Mason wrote:
> On Wed, 2009-04-29 at 10:08 -0400, Chris Mason wrote:
> > This would mean IO completion (clearing PG_writeback) would have to wait
> > on the inode mutex, which we can't quite do in O_SYNC and O_DIRECT.
> > But, what I can do is check i_nlink after the ext3_orphan_del call and
> > put the inode back on the orphan list if it has gone to zero.
>
> Ugh, that won't work, we'll just race with link and risk an orphan that
> never gets removed. I'll make a version of ext3_orphan_del that expects
> the super lock held and use that instead.

It looks like ext3_link() checks for i_nlink == 0 and returns -ENOENT to
avoid this race.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-04-29 19:52:44

by Chris Mason

[permalink] [raw]
Subject: [PATCH RFC] ext3 data=guarded v6

Hello everyone,

Here is v6 based on Jan's review:

* Fixup locking while deleting an orphan entry. The idea here is to
take the super lock and then check our link count and ordered list. If
we race with unlink or another process adding another guarded IO, both
will wait on the super lock while they do the orphan add.

* Fixup O_DIRECT disk i_size updates

* Do either a guarded or ordered IO for any write past i_size.

ext3 data=ordered mode makes sure that data blocks are on disk before
the metadata that references them, which avoids files full of garbage
or previously deleted data after a crash. It does this by adding every dirty
buffer onto a list of things that must be written before a commit.

This makes every fsync write out all the dirty data on the entire FS, which
has high latencies and is generally much more expensive than it needs to be.

Another way to avoid exposing stale data after a crash is to wait until
after the data buffers are written before updating the on-disk record
of the file's size. If we crash before the data IO is done, i_size
doesn't yet include the new blocks and no stale data is exposed.

This patch adds the delayed i_size update to ext3, along with a new
mount option (data=guarded) to enable it. The basic mechanism works like
this:

* Change block_write_full_page to take an end_io handler as a parameter.
This allows us to make an end_io handler that queues buffer heads for
a workqueue where the real work of updating the on disk i_size is done.

* Add a list to the in-memory ext3 inode for tracking data=guarded
buffer heads that are waiting to be sent to disk.

* Add an ext3 guarded write_end call to add buffer heads for newly
allocated blocks into the list. If we have a newly allocated block that is
filling a hole inside i_size, this is done as an old style data=ordered write
instead.

* Add an ext3 guarded writepage call that uses a special buffer head
end_io handler for buffers that are marked as guarded. Again, if we find
newly allocated blocks filling holes, they are sent through data=ordered
instead of data=guarded.

* When a guarded IO finishes, kick a per-FS workqueue to do the
on disk i_size updates. The workqueue function must be very careful. We only
update the on disk i_size if all of the IO between the old on disk i_size and
the new on disk i_size is complete. The on disk i_size is incrementally
updated to the largest safe value every time an IO completes.

* When we start tracking guarded buffers on a given inode, we put the
inode into ext3's orphan list. This way if we do crash, the file will
be truncated back down to the on disk i_size and we'll free any blocks that
were not completely written. The inode is removed from the orphan list
only after all the guarded buffers are done.

Signed-off-by: Chris Mason <[email protected]>

---
fs/ext3/Makefile | 3 +-
fs/ext3/fsync.c | 12 +
fs/ext3/inode.c | 604 +++++++++++++++++++++++++++++++++++++++++++-
fs/ext3/namei.c | 21 +-
fs/ext3/ordered-data.c | 235 +++++++++++++++++
fs/ext3/super.c | 48 +++-
fs/jbd/transaction.c | 1 +
include/linux/ext3_fs.h | 33 +++-
include/linux/ext3_fs_i.h | 45 ++++
include/linux/ext3_fs_sb.h | 6 +
include/linux/ext3_jbd.h | 11 +
include/linux/jbd.h | 10 +
12 files changed, 1002 insertions(+), 27 deletions(-)

diff --git a/fs/ext3/Makefile b/fs/ext3/Makefile
index e77766a..f3a9dc1 100644
--- a/fs/ext3/Makefile
+++ b/fs/ext3/Makefile
@@ -5,7 +5,8 @@
obj-$(CONFIG_EXT3_FS) += ext3.o

ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
- ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o
+ ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o \
+ ordered-data.o

ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index d336341..a50abb4 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -59,6 +59,11 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
* sync_inode() will write the inode if it is dirty. Then the caller's
* filemap_fdatawait() will wait on the pages.
*
+ * data=guarded:
+ * The caller's filemap_fdatawrite will start the IO, and we
+ * use filemap_fdatawait here to make sure all the disk i_size updates
+ * are done before we commit the inode.
+ *
* data=journal:
* filemap_fdatawrite won't do anything (the buffers are clean).
* ext3_force_commit will write the file data into the journal and
@@ -84,6 +89,13 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
+ /*
+ * the new disk i_size must be logged before we commit,
+ * so we wait here for pending writeback
+ */
+ if (ext3_should_guard_data(inode))
+ filemap_write_and_wait(inode->i_mapping);
+
ret = sync_inode(inode, &wbc);
}
out:
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index fcfa243..1a43178 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -38,6 +38,7 @@
#include <linux/bio.h>
#include <linux/fiemap.h>
#include <linux/namei.h>
+#include <linux/workqueue.h>
#include "xattr.h"
#include "acl.h"

@@ -179,6 +180,106 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode)
}

/*
+ * after a data=guarded IO is done, we need to update the
+ * disk i_size to reflect the data we've written. If there are
+ * no more ordered data extents left in the list, we need to
+ * get rid of the orphan entry making sure the file's
+ * block pointers match the i_size after a crash
+ *
+ * When we aren't in data=guarded mode, this just does an ext3_orphan_del.
+ *
+ * It returns the result of ext3_orphan_del.
+ *
+ * handle may be null if we are just cleaning up the orphan list in
+ * memory.
+ *
+ * pass must_log == 1 when the inode must be logged in order to get
+ * an i_size update on disk
+ */
+static int orphan_del(handle_t *handle, struct inode *inode, int must_log)
+{
+ int ret = 0;
+ struct list_head *ordered_list;
+
+ ordered_list = &EXT3_I(inode)->ordered_buffers.ordered_list;
+
+ /* fast out when data=guarded isn't on */
+ if (!ext3_should_guard_data(inode))
+ return ext3_orphan_del(handle, inode);
+
+ ext3_ordered_lock(inode);
+ if (inode->i_nlink && list_empty(ordered_list)) {
+ ext3_ordered_unlock(inode);
+
+ lock_super(inode->i_sb);
+
+ /*
+ * now that we have the lock make sure we are allowed to
+ * get rid of the orphan. This way we make sure our
+ * test isn't happening concurrently with someone else
+ * adding an orphan. Memory barrier for the ordered list.
+ */
+ smp_mb();
+ if (inode->i_nlink == 0 || !list_empty(ordered_list)) {
+ ext3_ordered_unlock(inode);
+ unlock_super(inode->i_sb);
+ goto out;
+ }
+
+ /*
+ * if we aren't actually on the orphan list, the orphan
+ * del won't log our inode. Log it now to make sure
+ */
+ ext3_mark_inode_dirty(handle, inode);
+
+ ret = ext3_orphan_del_locked(handle, inode);
+
+ unlock_super(inode->i_sb);
+ } else if (handle && must_log) {
+ ext3_ordered_unlock(inode);
+
+ /*
+ * we need to make sure any updates done by the data=guarded
+ * code end up in the inode on disk. Log the inode
+ * here
+ */
+ ext3_mark_inode_dirty(handle, inode);
+ } else {
+ ext3_ordered_unlock(inode);
+ }
+
+out:
+ return ret;
+}
+
+/*
+ * Wrapper around orphan_del that starts a transaction
+ */
+static void orphan_del_trans(struct inode *inode, int must_log)
+{
+ handle_t *handle;
+
+ handle = ext3_journal_start(inode, 3);
+
+ /*
+ * uhoh, should we flag the FS as readonly here? ext3_dirty_inode
+ * doesn't, which is what we're modeling ourselves after.
+ *
+ * We do need to make sure to get this inode off the ordered list
+ * when the transaction start fails though. orphan_del
+ * does the right thing.
+ */
+ if (IS_ERR(handle)) {
+ orphan_del(NULL, inode, 0);
+ return;
+ }
+
+ orphan_del(handle, inode, must_log);
+ ext3_journal_stop(handle);
+}
+
+
+/*
* Called at the last iput() if i_nlink is zero.
*/
void ext3_delete_inode (struct inode * inode)
@@ -204,6 +305,13 @@ void ext3_delete_inode (struct inode * inode)
if (IS_SYNC(inode))
handle->h_sync = 1;
inode->i_size = 0;
+
+ /*
+ * make sure we clean up any ordered extents that didn't get
+ * IO started on them because i_size shrunk down to zero.
+ */
+ ext3_truncate_ordered_extents(inode, 0);
+
if (inode->i_blocks)
ext3_truncate(inode);
/*
@@ -767,6 +875,24 @@ err_out:
}

/*
+ * This protects the disk i_size with the spinlock for the ordered
+ * extent tree. It returns 1 when the inode needs to be logged
+ * because the i_disksize has been updated.
+ */
+static int maybe_update_disk_isize(struct inode *inode, loff_t new_size)
+{
+ int ret = 0;
+
+ ext3_ordered_lock(inode);
+ if (EXT3_I(inode)->i_disksize < new_size) {
+ EXT3_I(inode)->i_disksize = new_size;
+ ret = 1;
+ }
+ ext3_ordered_unlock(inode);
+ return ret;
+}
+
+/*
* Allocation strategy is simple: if we have to allocate something, we will
* have to go the whole way to leaf. So let's do it before attaching anything
* to tree, set linkage between the newborn blocks, write them if sync is
@@ -815,6 +941,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
if (!partial) {
first_block = le32_to_cpu(chain[depth - 1].key);
clear_buffer_new(bh_result);
+ clear_buffer_datanew(bh_result);
count++;
/*map more blocks*/
while (count < maxblocks && count <= blocks_to_boundary) {
@@ -873,6 +1000,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
if (err)
goto cleanup;
clear_buffer_new(bh_result);
+ clear_buffer_datanew(bh_result);
goto got_it;
}
}
@@ -915,14 +1043,18 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
* i_disksize growing is protected by truncate_mutex. Don't forget to
* protect it if you're about to implement concurrent
* ext3_get_block() -bzzz
+ *
+ * extend_disksize is only called for directories, and so
+ * it is not using guarded buffer protection.
*/
- if (!err && extend_disksize && inode->i_size > ei->i_disksize)
+ if (!err && extend_disksize)
ei->i_disksize = inode->i_size;
mutex_unlock(&ei->truncate_mutex);
if (err)
goto cleanup;

set_buffer_new(bh_result);
+ set_buffer_datanew(bh_result);
got_it:
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
if (count > blocks_to_boundary)
@@ -1079,6 +1211,77 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
return NULL;
}

+/*
+ * data=guarded updates are handled in a workqueue after the IO
+ * is done. This runs through the list of buffer heads that are pending
+ * processing.
+ */
+void ext3_run_guarded_work(struct work_struct *work)
+{
+ struct ext3_sb_info *sbi =
+ container_of(work, struct ext3_sb_info, guarded_work);
+ struct buffer_head *bh;
+ struct ext3_ordered_extent *ordered;
+ struct inode *inode;
+ struct page *page;
+ int must_log;
+
+ spin_lock_irq(&sbi->guarded_lock);
+ while (!list_empty(&sbi->guarded_buffers)) {
+ ordered = list_entry(sbi->guarded_buffers.next,
+ struct ext3_ordered_extent, work_list);
+
+ list_del(&ordered->work_list);
+
+ bh = ordered->end_io_bh;
+ ordered->end_io_bh = NULL;
+ must_log = 0;
+
+ /* we don't need a reference on the buffer head because
+ * it is locked until the end_io handler is called.
+ *
+ * This means the page can't go away, which means the
+ * inode can't go away
+ */
+ spin_unlock_irq(&sbi->guarded_lock);
+
+ page = bh->b_page;
+ inode = page->mapping->host;
+
+ ext3_ordered_lock(inode);
+ if (ordered->bh) {
+ /*
+ * someone might have decided this buffer didn't
+ * really need to be ordered and removed us from
+ * the list. They set ordered->bh to null
+ * when that happens.
+ */
+ ext3_remove_ordered_extent(inode, ordered);
+ must_log = ext3_ordered_update_i_size(inode);
+ }
+ ext3_ordered_unlock(inode);
+
+ /*
+ * drop the reference taken when this ordered extent was
+ * put onto the guarded_buffers list
+ */
+ ext3_put_ordered_extent(ordered);
+
+ /*
+ * maybe log the inode and/or cleanup the orphan entry
+ */
+ orphan_del_trans(inode, must_log > 0);
+
+ /*
+ * finally, call the real bh end_io function to do
+ * all the hard work of maintaining page writeback.
+ */
+ end_buffer_async_write(bh, buffer_uptodate(bh));
+ spin_lock_irq(&sbi->guarded_lock);
+ }
+ spin_unlock_irq(&sbi->guarded_lock);
+}
+
static int walk_page_buffers( handle_t *handle,
struct buffer_head *head,
unsigned from,
@@ -1185,6 +1388,7 @@ retry:
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, do_journal_get_write_access);
}
+
write_begin_failed:
if (ret) {
/*
@@ -1212,7 +1416,13 @@ out:

int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
{
- int err = journal_dirty_data(handle, bh);
+ int err;
+
+ /* don't take buffers from the data=guarded list */
+ if (buffer_dataguarded(bh))
+ return 0;
+
+ err = journal_dirty_data(handle, bh);
if (err)
ext3_journal_abort_handle(__func__, __func__,
bh, handle, err);
@@ -1231,6 +1441,98 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
return 0;
}

+/*
+ * Walk the buffers in a page for data=guarded mode. Buffers that
+ * are not marked as datanew are ignored.
+ *
+ * New buffers outside i_size are sent to the data guarded code
+ *
+ * We must do the old data=ordered mode when filling holes in the
+ * file, since i_size doesn't protect these at all.
+ */
+static int journal_dirty_data_guarded_fn(handle_t *handle,
+ struct buffer_head *bh)
+{
+ u64 offset = page_offset(bh->b_page) + bh_offset(bh);
+ struct inode *inode = bh->b_page->mapping->host;
+ int ret = 0;
+ int was_new;
+
+ /*
+ * Write could have mapped the buffer but it didn't copy the data in
+ * yet. So avoid filing such buffer into a transaction.
+ */
+ if (!buffer_mapped(bh) || !buffer_uptodate(bh))
+ return 0;
+
+ was_new = test_clear_buffer_datanew(bh);
+
+ if (offset < inode->i_size) {
+ /*
+ * if we're filling a hole inside i_size, we need to
+ * fall back to the old style data=ordered
+ */
+ if (was_new)
+ ret = ext3_journal_dirty_data(handle, bh);
+ goto out;
+ }
+ ret = ext3_add_ordered_extent(inode, offset, bh);
+
+ /* if we crash before the IO is done, i_size will be small
+ * but these blocks will still be allocated to the file.
+ *
+ * So, add an orphan entry for the file, which will truncate it
+ * down to the i_size it finds after the crash.
+ *
+ * The orphan is cleaned up when the IO is done. We
+ * don't add orphans while mount is running the orphan list,
+ * that seems to corrupt the list.
+ *
+ * We're testing list_empty on the i_orphan list, but
+ * right here we have i_mutex held. So the only place that
+ * is going to race around and remove us from the orphan
+ * list is the work queue to process completed guarded
+ * buffers. That will find the ordered_extent we added
+ * above and leave us on the orphan list.
+ */
+ if (ret == 0 && buffer_dataguarded(bh) &&
+ list_empty(&EXT3_I(inode)->i_orphan) &&
+ !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) {
+ ret = ext3_orphan_add(handle, inode);
+ }
+out:
+ return ret;
+}
+
+/*
+ * Walk the buffers in a page for data=guarded mode for writepage.
+ *
+ * We must do the old data=ordered mode when filling holes in the
+ * file, since i_size doesn't protect these at all.
+ *
+ * This is actually called after writepage is run and so we can't
+ * trust anything other than the buffer head (which we have pinned).
+ *
+ * Any datanew buffer at writepage time is filling a hole, so we don't need
+ * extra tests against the inode size.
+ */
+static int journal_dirty_data_guarded_writepage_fn(handle_t *handle,
+ struct buffer_head *bh)
+{
+ int ret = 0;
+
+ /*
+ * Write could have mapped the buffer but it didn't copy the data in
+ * yet. So avoid filing such buffer into a transaction.
+ */
+ if (!buffer_mapped(bh) || !buffer_uptodate(bh))
+ return 0;
+
+ if (test_clear_buffer_datanew(bh))
+ ret = ext3_journal_dirty_data(handle, bh);
+ return ret;
+}
+
/* For write_end() in data=journal mode */
static int write_end_fn(handle_t *handle, struct buffer_head *bh)
{
@@ -1251,10 +1553,8 @@ static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied)
/* What matters to us is i_disksize. We don't write i_size anywhere */
if (pos + copied > inode->i_size)
i_size_write(inode, pos + copied);
- if (pos + copied > EXT3_I(inode)->i_disksize) {
- EXT3_I(inode)->i_disksize = pos + copied;
+ if (maybe_update_disk_isize(inode, pos + copied))
mark_inode_dirty(inode);
- }
}

/*
@@ -1300,6 +1600,73 @@ static int ext3_ordered_write_end(struct file *file,
return ret ? ret : copied;
}

+static int ext3_guarded_write_end(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ handle_t *handle = ext3_journal_current_handle();
+ struct inode *inode = file->f_mapping->host;
+ unsigned from, to;
+ int ret = 0, ret2;
+
+ copied = block_write_end(file, mapping, pos, len, copied,
+ page, fsdata);
+
+ from = pos & (PAGE_CACHE_SIZE - 1);
+ to = from + copied;
+ ret = walk_page_buffers(handle, page_buffers(page),
+ from, to, NULL, journal_dirty_data_guarded_fn);
+
+ /*
+ * we only update the in-memory i_size. The disk i_size is done
+ * by the end io handlers
+ */
+ if (ret == 0 && pos + copied > inode->i_size) {
+ int must_log;
+
+ /* updated i_size, but we may have raced with a
+ * data=guarded end_io handler.
+ *
+ * All the guarded IO could have ended while i_size was still
+ * small, and if we're just adding bytes into an existing block
+ * in the file, we may not be adding a new guarded IO with this
+ * write. So, do a check on the disk i_size and make sure it
+ * is updated to the highest safe value.
+ *
+ * This may also be required if the
+ * journal_dirty_data_guarded_fn chose to do a fully
+ * ordered write of this buffer instead of a guarded
+ * write.
+ *
+ * ext3_ordered_update_i_size tests inode->i_size, so we
+ * make sure to update it with the ordered lock held.
+ */
+ ext3_ordered_lock(inode);
+ i_size_write(inode, pos + copied);
+ must_log = ext3_ordered_update_i_size(inode);
+ ext3_ordered_unlock(inode);
+
+ orphan_del_trans(inode, must_log > 0);
+ }
+
+ /*
+ * There may be allocated blocks outside of i_size because
+ * we failed to copy some data. Prepare for truncate.
+ */
+ if (pos + len > inode->i_size)
+ ext3_orphan_add(handle, inode);
+ ret2 = ext3_journal_stop(handle);
+ if (!ret)
+ ret = ret2;
+ unlock_page(page);
+ page_cache_release(page);
+
+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ return ret ? ret : copied;
+}
+
static int ext3_writeback_write_end(struct file *file,
struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
@@ -1311,6 +1678,7 @@ static int ext3_writeback_write_end(struct file *file,

copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
update_file_sizes(inode, pos, copied);
+
/*
* There may be allocated blocks outside of i_size because
* we failed to copy some data. Prepare for truncate.
@@ -1574,6 +1942,144 @@ out_fail:
return ret;
}

+/*
+ * Completion handler for block_write_full_page(). This will
+ * kick off the data=guarded workqueue as the IO finishes.
+ */
+static void end_buffer_async_write_guarded(struct buffer_head *bh,
+ int uptodate)
+{
+ struct ext3_sb_info *sbi;
+ struct address_space *mapping;
+ struct ext3_ordered_extent *ordered;
+ unsigned long flags;
+
+ mapping = bh->b_page->mapping;
+ if (!mapping || !bh->b_private || !buffer_dataguarded(bh)) {
+noguard:
+ end_buffer_async_write(bh, uptodate);
+ return;
+ }
+
+ /*
+ * the guarded workqueue function checks the uptodate bit on the
+ * bh and uses that to tell the real end_io handler if things worked
+ * out or not.
+ */
+ if (uptodate)
+ set_buffer_uptodate(bh);
+ else
+ clear_buffer_uptodate(bh);
+
+ sbi = EXT3_SB(mapping->host->i_sb);
+
+ spin_lock_irqsave(&sbi->guarded_lock, flags);
+
+ /*
+ * remove any chance that a truncate raced in and cleared
+ * our dataguard flag, which also freed the ordered extent in
+ * our b_private.
+ */
+ if (!buffer_dataguarded(bh)) {
+ spin_unlock_irqrestore(&sbi->guarded_lock, flags);
+ goto noguard;
+ }
+ ordered = bh->b_private;
+ WARN_ON(ordered->end_io_bh);
+
+ /*
+ * use the special end_io_bh pointer to make sure that
+ * some form of end_io handler is run on this bh, even
+ * if the ordered_extent is removed from the ordered list before
+ * our workqueue ends up processing it.
+ */
+ ordered->end_io_bh = bh;
+ list_add_tail(&ordered->work_list, &sbi->guarded_buffers);
+ ext3_get_ordered_extent(ordered);
+ spin_unlock_irqrestore(&sbi->guarded_lock, flags);
+
+ queue_work(sbi->guarded_wq, &sbi->guarded_work);
+}
+
+static int ext3_guarded_writepage(struct page *page,
+ struct writeback_control *wbc)
+{
+ struct inode *inode = page->mapping->host;
+ struct buffer_head *page_bufs;
+ handle_t *handle = NULL;
+ int ret = 0;
+ int err;
+
+ J_ASSERT(PageLocked(page));
+
+ /*
+ * We give up here if we're reentered, because it might be for a
+ * different filesystem.
+ */
+ if (ext3_journal_current_handle())
+ goto out_fail;
+
+ if (!page_has_buffers(page)) {
+ create_empty_buffers(page, inode->i_sb->s_blocksize,
+ (1 << BH_Dirty)|(1 << BH_Uptodate));
+ page_bufs = page_buffers(page);
+ } else {
+ page_bufs = page_buffers(page);
+ if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
+ NULL, buffer_unmapped)) {
+ /* Provide NULL get_block() to catch bugs if buffers
+ * weren't really mapped */
+ return block_write_full_page_endio(page, NULL, wbc,
+ end_buffer_async_write_guarded);
+ }
+ }
+ handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
+
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out_fail;
+ }
+
+ walk_page_buffers(handle, page_bufs, 0,
+ PAGE_CACHE_SIZE, NULL, bget_one);
+
+ ret = block_write_full_page_endio(page, ext3_get_block, wbc,
+ end_buffer_async_write_guarded);
+
+ /*
+ * The page can become unlocked at any point now, and
+ * truncate can then come in and change things. So we
+ * can't touch *page from now on. But *page_bufs is
+ * safe due to elevated refcount.
+ */
+
+ /*
+ * And attach them to the current transaction. But only if
+ * block_write_full_page() succeeded. Otherwise they are unmapped,
+ * and generally junk.
+ */
+ if (ret == 0) {
+ err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
+ NULL, journal_dirty_data_guarded_writepage_fn);
+ if (!ret)
+ ret = err;
+ }
+ walk_page_buffers(handle, page_bufs, 0,
+ PAGE_CACHE_SIZE, NULL, bput_one);
+ err = ext3_journal_stop(handle);
+ if (!ret)
+ ret = err;
+
+ return ret;
+
+out_fail:
+ redirty_page_for_writepage(wbc, page);
+ unlock_page(page);
+ return ret;
+}
+
+
+
static int ext3_writeback_writepage(struct page *page,
struct writeback_control *wbc)
{
@@ -1747,7 +2253,14 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
goto out;
}
orphan = 1;
- ei->i_disksize = inode->i_size;
+ /* in guarded mode, other code is responsible
+ * for updating i_disksize. Actually in
+ * every mode, ei->i_disksize should be correct,
+ * so I don't understand why it is getting updated
+ * to i_size here.
+ */
+ if (!ext3_should_guard_data(inode))
+ ei->i_disksize = inode->i_size;
ext3_journal_stop(handle);
}
}
@@ -1768,13 +2281,27 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
ret = PTR_ERR(handle);
goto out;
}
+
if (inode->i_nlink)
- ext3_orphan_del(handle, inode);
+ orphan_del(handle, inode, 0);
+
if (ret > 0) {
loff_t end = offset + ret;
if (end > inode->i_size) {
- ei->i_disksize = end;
- i_size_write(inode, end);
+ /* i_mutex keeps other file writes from
+ * hopping in at this time, and we
+ * know the O_DIRECT write just put all
+ * those blocks on disk. But, there
+ * may be guarded writes at lower offsets
+ * in the file that were not forced down.
+ */
+ if (ext3_should_guard_data(inode)) {
+ i_size_write(inode, end);
+ ext3_ordered_update_i_size(inode);
+ } else {
+ ei->i_disksize = end;
+ i_size_write(inode, end);
+ }
/*
* We're going to return a positive `ret'
* here due to non-zero-length I/O, so there's
@@ -1842,6 +2369,21 @@ static const struct address_space_operations ext3_writeback_aops = {
.is_partially_uptodate = block_is_partially_uptodate,
};

+static const struct address_space_operations ext3_guarded_aops = {
+ .readpage = ext3_readpage,
+ .readpages = ext3_readpages,
+ .writepage = ext3_guarded_writepage,
+ .sync_page = block_sync_page,
+ .write_begin = ext3_write_begin,
+ .write_end = ext3_guarded_write_end,
+ .bmap = ext3_bmap,
+ .invalidatepage = ext3_invalidatepage,
+ .releasepage = ext3_releasepage,
+ .direct_IO = ext3_direct_IO,
+ .migratepage = buffer_migrate_page,
+ .is_partially_uptodate = block_is_partially_uptodate,
+};
+
static const struct address_space_operations ext3_journalled_aops = {
.readpage = ext3_readpage,
.readpages = ext3_readpages,
@@ -1860,6 +2402,8 @@ void ext3_set_aops(struct inode *inode)
{
if (ext3_should_order_data(inode))
inode->i_mapping->a_ops = &ext3_ordered_aops;
+ else if (ext3_should_guard_data(inode))
+ inode->i_mapping->a_ops = &ext3_guarded_aops;
else if (ext3_should_writeback_data(inode))
inode->i_mapping->a_ops = &ext3_writeback_aops;
else
@@ -2376,7 +2920,8 @@ void ext3_truncate(struct inode *inode)
if (!ext3_can_truncate(inode))
return;

- if (inode->i_size == 0 && ext3_should_writeback_data(inode))
+ if (inode->i_size == 0 && (ext3_should_writeback_data(inode) ||
+ ext3_should_guard_data(inode)))
ei->i_state |= EXT3_STATE_FLUSH_ON_CLOSE;

/*
@@ -3103,10 +3648,39 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
ext3_journal_stop(handle);
}

+ if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
+ /*
+ * we need to make sure any data=guarded pages
+ * are on disk before we force a new disk i_size
+ * down into the inode. The crucial range is
+ * anything between the disksize on disk now
+ * and the new size we're going to set.
+ *
+ * We're holding i_mutex here, so we know new
+ * ordered extents are not going to appear in the inode
+ *
+ * This must be done both for truncates that make the
+ * file bigger and smaller because truncate messes around
+ * with the orphan inode list in both cases.
+ */
+ if (ext3_should_guard_data(inode)) {
+ filemap_write_and_wait_range(inode->i_mapping,
+ EXT3_I(inode)->i_disksize,
+ (loff_t)-1);
+ /*
+ * we've written everything, make sure all
+ * the ordered extents are really gone.
+ *
+ * This prevents leaking of ordered extents
+ * and it also makes sure the ordered extent code
+ * doesn't mess with the orphan link
+ */
+ ext3_truncate_ordered_extents(inode, 0);
+ }
+ }
if (S_ISREG(inode->i_mode) &&
attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
handle_t *handle;
-
handle = ext3_journal_start(inode, 3);
if (IS_ERR(handle)) {
error = PTR_ERR(handle);
@@ -3114,6 +3688,7 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
}

error = ext3_orphan_add(handle, inode);
+
EXT3_I(inode)->i_disksize = attr->ia_size;
rc = ext3_mark_inode_dirty(handle, inode);
if (!error)
@@ -3125,8 +3700,11 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)

/* If inode_setattr's call to ext3_truncate failed to get a
* transaction handle at all, we need to clean up the in-core
- * orphan list manually. */
- if (inode->i_nlink)
+ * orphan list manually. Because we've finished off all the
+ * guarded IO above, this doesn't hurt anything for the guarded
+ * code
+ */
+ if (inode->i_nlink && (attr->ia_valid & ATTR_SIZE))
ext3_orphan_del(NULL, inode);

if (!rc && (ia_valid & ATTR_MODE))
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 6ff7b97..711549a 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1973,11 +1973,21 @@ out_unlock:
return err;
}

+int ext3_orphan_del(handle_t *handle, struct inode *inode)
+{
+ int ret;
+
+ lock_super(inode->i_sb);
+ ret = ext3_orphan_del_locked(handle, inode);
+ unlock_super(inode->i_sb);
+ return ret;
+}
+
/*
* ext3_orphan_del() removes an unlinked or truncated inode from the list
* of such inodes stored on disk, because it is finally being cleaned up.
*/
-int ext3_orphan_del(handle_t *handle, struct inode *inode)
+int ext3_orphan_del_locked(handle_t *handle, struct inode *inode)
{
struct list_head *prev;
struct ext3_inode_info *ei = EXT3_I(inode);
@@ -1986,11 +1996,8 @@ int ext3_orphan_del(handle_t *handle, struct inode *inode)
struct ext3_iloc iloc;
int err = 0;

- lock_super(inode->i_sb);
- if (list_empty(&ei->i_orphan)) {
- unlock_super(inode->i_sb);
+ if (list_empty(&ei->i_orphan))
return 0;
- }

ino_next = NEXT_ORPHAN(inode);
prev = ei->i_orphan.prev;
@@ -2040,7 +2047,6 @@ int ext3_orphan_del(handle_t *handle, struct inode *inode)
out_err:
ext3_std_error(inode->i_sb, err);
out:
- unlock_super(inode->i_sb);
return err;

out_brelse:
@@ -2410,7 +2416,8 @@ static int ext3_rename (struct inode * old_dir, struct dentry *old_dentry,
ext3_mark_inode_dirty(handle, new_inode);
if (!new_inode->i_nlink)
ext3_orphan_add(handle, new_inode);
- if (ext3_should_writeback_data(new_inode))
+ if (ext3_should_writeback_data(new_inode) ||
+ ext3_should_guard_data(new_inode))
flush_file = 1;
}
retval = 0;
diff --git a/fs/ext3/ordered-data.c b/fs/ext3/ordered-data.c
new file mode 100644
index 0000000..a6dab2d
--- /dev/null
+++ b/fs/ext3/ordered-data.c
@@ -0,0 +1,235 @@
+/*
+ * Copyright (C) 2009 Oracle. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <linux/pagevec.h>
+#include <linux/buffer_head.h>
+#include <linux/ext3_jbd.h>
+
+/*
+ * simple helper to make sure a new entry we're adding is
+ * at a larger offset in the file than the last entry in the list
+ */
+static void check_ordering(struct ext3_ordered_buffers *buffers,
+ struct ext3_ordered_extent *entry)
+{
+ struct ext3_ordered_extent *last;
+
+ if (list_empty(&buffers->ordered_list))
+ return;
+
+ last = list_entry(buffers->ordered_list.prev,
+ struct ext3_ordered_extent, ordered_list);
+ BUG_ON(last->start >= entry->start);
+}
+
+/* allocate and add a new ordered_extent into the per-inode list.
+ * start is the logical offset in the file
+ *
+ * The list is given a single reference on the ordered extent that was
+ * inserted, and it also takes a reference on the buffer head.
+ */
+int ext3_add_ordered_extent(struct inode *inode, u64 start,
+ struct buffer_head *bh)
+{
+ struct ext3_ordered_buffers *buffers;
+ struct ext3_ordered_extent *entry;
+ int ret = 0;
+
+ lock_buffer(bh);
+
+ /* ordered extent already there, or in old style data=ordered */
+ if (bh->b_private) {
+ ret = 0;
+ goto out;
+ }
+
+ buffers = &EXT3_I(inode)->ordered_buffers;
+ entry = kzalloc(sizeof(*entry), GFP_NOFS);
+ if (!entry) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock(&buffers->lock);
+ entry->start = start;
+
+ get_bh(bh);
+ entry->bh = bh;
+ bh->b_private = entry;
+ set_buffer_dataguarded(bh);
+
+ /* one ref for the list */
+ atomic_set(&entry->refs, 1);
+ INIT_LIST_HEAD(&entry->work_list);
+
+ check_ordering(buffers, entry);
+
+ list_add_tail(&entry->ordered_list, &buffers->ordered_list);
+
+ spin_unlock(&buffers->lock);
+out:
+ unlock_buffer(bh);
+ return ret;
+}
+
+/*
+ * used to drop a reference on an ordered extent. This will free
+ * the extent if the last reference is dropped
+ */
+int ext3_put_ordered_extent(struct ext3_ordered_extent *entry)
+{
+ if (atomic_dec_and_test(&entry->refs)) {
+ WARN_ON(entry->bh);
+ WARN_ON(entry->end_io_bh);
+ kfree(entry);
+ }
+ return 0;
+}
+
+/*
+ * remove an ordered extent from the list. This removes the
+ * reference held by the list on 'entry' and the
+ * reference on the buffer head held by the entry.
+ */
+int ext3_remove_ordered_extent(struct inode *inode,
+ struct ext3_ordered_extent *entry)
+{
+ struct ext3_ordered_buffers *buffers;
+
+ buffers = &EXT3_I(inode)->ordered_buffers;
+
+ /*
+ * the data=guarded end_io handler takes this guarded_lock
+ * before it puts a given buffer head and its ordered extent
+ * into the guarded_buffers list. We need to make sure
+ * we don't race with them, so we take the guarded_lock too.
+ */
+ spin_lock_irq(&EXT3_SB(inode->i_sb)->guarded_lock);
+ clear_buffer_dataguarded(entry->bh);
+ entry->bh->b_private = NULL;
+ brelse(entry->bh);
+ entry->bh = NULL;
+ spin_unlock_irq(&EXT3_SB(inode->i_sb)->guarded_lock);
+
+ /*
+ * we must not clear entry->end_io_bh, that is set by
+ * the end_io handlers and will be cleared by the end_io
+ * workqueue
+ */
+
+ list_del_init(&entry->ordered_list);
+ ext3_put_ordered_extent(entry);
+ return 0;
+}
+
+/*
+ * After an extent is done, call this to conditionally update the on disk
+ * i_size. i_size is updated to cover any fully written part of the file.
+ *
+ * This returns < 0 on error, zero if no action needs to be taken and
+ * 1 if the inode must be logged.
+ */
+int ext3_ordered_update_i_size(struct inode *inode)
+{
+ u64 new_size;
+ u64 disk_size;
+ struct ext3_ordered_extent *test;
+ struct ext3_ordered_buffers *buffers = &EXT3_I(inode)->ordered_buffers;
+ int ret = 0;
+
+ disk_size = EXT3_I(inode)->i_disksize;
+
+ /*
+ * if the disk i_size is already at the inode->i_size, we're done
+ */
+ if (disk_size >= inode->i_size)
+ goto out;
+
+ /*
+ * if the ordered list is empty, push the disk i_size all the way
+ * up to the inode size, otherwise, use the start of the first
+ * ordered extent in the list as the new disk i_size
+ */
+ if (list_empty(&buffers->ordered_list)) {
+ new_size = inode->i_size;
+ } else {
+ test = list_entry(buffers->ordered_list.next,
+ struct ext3_ordered_extent, ordered_list);
+
+ new_size = test->start;
+ }
+
+ new_size = min_t(u64, new_size, i_size_read(inode));
+
+ /* the caller needs to log this inode */
+ ret = 1;
+
+ EXT3_I(inode)->i_disksize = new_size;
+out:
+ return ret;
+}
+
+/*
+ * during a truncate or delete, we need to get rid of pending
+ * ordered extents so there isn't a war over who updates disk i_size first.
+ * This does that, without waiting for any of the IO to actually finish.
+ *
+ * When the IO does finish, it will find the ordered extent removed from the
+ * list and all will work properly.
+ */
+void ext3_truncate_ordered_extents(struct inode *inode, u64 offset)
+{
+ struct ext3_ordered_buffers *buffers = &EXT3_I(inode)->ordered_buffers;
+ struct ext3_ordered_extent *test;
+
+ spin_lock(&buffers->lock);
+ while (!list_empty(&buffers->ordered_list)) {
+
+ test = list_entry(buffers->ordered_list.prev,
+ struct ext3_ordered_extent, ordered_list);
+
+ if (test->start < offset)
+ break;
+ /*
+ * once this is called, the end_io handler won't run,
+ * and we won't update disk_i_size to include this buffer.
+ *
+ * That's ok for truncates because the truncate code is
+ * writing a new i_size.
+ *
+ * This ignores any IO in flight, which is ok
+ * because the guarded_buffers list has a reference
+ * on the ordered extent
+ */
+ ext3_remove_ordered_extent(inode, test);
+ }
+ spin_unlock(&buffers->lock);
+ return;
+
+}
+
+void ext3_ordered_inode_init(struct ext3_inode_info *ei)
+{
+ INIT_LIST_HEAD(&ei->ordered_buffers.ordered_list);
+ spin_lock_init(&ei->ordered_buffers.lock);
+}
+
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 599dbfe..1e0eff8 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -37,6 +37,7 @@
#include <linux/quotaops.h>
#include <linux/seq_file.h>
#include <linux/log2.h>
+#include <linux/workqueue.h>

#include <asm/uaccess.h>

@@ -399,6 +400,9 @@ static void ext3_put_super (struct super_block * sb)
struct ext3_super_block *es = sbi->s_es;
int i, err;

+ flush_workqueue(sbi->guarded_wq);
+ destroy_workqueue(sbi->guarded_wq);
+
ext3_xattr_put_super(sb);
err = journal_destroy(sbi->s_journal);
sbi->s_journal = NULL;
@@ -468,6 +472,8 @@ static struct inode *ext3_alloc_inode(struct super_block *sb)
#endif
ei->i_block_alloc_info = NULL;
ei->vfs_inode.i_version = 1;
+ ext3_ordered_inode_init(ei);
+
return &ei->vfs_inode;
}

@@ -481,6 +487,8 @@ static void ext3_destroy_inode(struct inode *inode)
false);
dump_stack();
}
+ if (!list_empty(&EXT3_I(inode)->ordered_buffers.ordered_list))
+ printk(KERN_INFO "EXT3 ordered list not empty\n");
kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
}

@@ -528,6 +536,13 @@ static void ext3_clear_inode(struct inode *inode)
EXT3_I(inode)->i_default_acl = EXT3_ACL_NOT_CACHED;
}
#endif
+ /*
+ * If pages got cleaned by truncate, truncate should have
+ * gotten rid of the ordered extents. Just in case, drop them
+ * here.
+ */
+ ext3_truncate_ordered_extents(inode, 0);
+
ext3_discard_reservation(inode);
EXT3_I(inode)->i_block_alloc_info = NULL;
if (unlikely(rsv))
@@ -634,6 +649,8 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, ",data=journal");
else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA)
seq_puts(seq, ",data=ordered");
+ else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA)
+ seq_puts(seq, ",data=guarded");
else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
seq_puts(seq, ",data=writeback");

@@ -790,7 +807,7 @@ enum {
Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
- Opt_data_err_abort, Opt_data_err_ignore,
+ Opt_data_guarded, Opt_data_err_abort, Opt_data_err_ignore,
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
@@ -832,6 +849,7 @@ static const match_table_t tokens = {
{Opt_abort, "abort"},
{Opt_data_journal, "data=journal"},
{Opt_data_ordered, "data=ordered"},
+ {Opt_data_guarded, "data=guarded"},
{Opt_data_writeback, "data=writeback"},
{Opt_data_err_abort, "data_err=abort"},
{Opt_data_err_ignore, "data_err=ignore"},
@@ -1034,6 +1052,9 @@ static int parse_options (char *options, struct super_block *sb,
case Opt_data_ordered:
data_opt = EXT3_MOUNT_ORDERED_DATA;
goto datacheck;
+ case Opt_data_guarded:
+ data_opt = EXT3_MOUNT_GUARDED_DATA;
+ goto datacheck;
case Opt_data_writeback:
data_opt = EXT3_MOUNT_WRITEBACK_DATA;
datacheck:
@@ -1949,11 +1970,23 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
clear_opt(sbi->s_mount_opt, NOBH);
}
}
+
+ /*
+ * setup the guarded work list
+ */
+ INIT_LIST_HEAD(&EXT3_SB(sb)->guarded_buffers);
+ INIT_WORK(&EXT3_SB(sb)->guarded_work, ext3_run_guarded_work);
+ spin_lock_init(&EXT3_SB(sb)->guarded_lock);
+ EXT3_SB(sb)->guarded_wq = create_workqueue("ext3-guard");
+ if (!EXT3_SB(sb)->guarded_wq) {
+ printk(KERN_ERR "EXT3-fs: failed to create workqueue\n");
+ goto failed_mount_guard;
+ }
+
/*
* The journal_load will have done any necessary log recovery,
* so we can safely mount the rest of the filesystem now.
*/
-
root = ext3_iget(sb, EXT3_ROOT_INO);
if (IS_ERR(root)) {
printk(KERN_ERR "EXT3-fs: get root inode failed\n");
@@ -1965,6 +1998,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n");
goto failed_mount4;
}
+
sb->s_root = d_alloc_root(root);
if (!sb->s_root) {
printk(KERN_ERR "EXT3-fs: get root dentry failed\n");
@@ -1974,6 +2008,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
}

ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY);
+
/*
* akpm: core read_super() calls in here with the superblock locked.
* That deadlocks, because orphan cleanup needs to lock the superblock
@@ -1989,9 +2024,10 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
printk (KERN_INFO "EXT3-fs: recovery complete.\n");
ext3_mark_recovery_complete(sb, es);
printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
- test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
- test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
- "writeback");
+ test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal" :
+ test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA ? "guarded" :
+ test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered" :
+ "writeback");

lock_kernel();
return 0;
@@ -2003,6 +2039,8 @@ cantfind_ext3:
goto failed_mount;

failed_mount4:
+ destroy_workqueue(EXT3_SB(sb)->guarded_wq);
+failed_mount_guard:
journal_destroy(sbi->s_journal);
failed_mount3:
percpu_counter_destroy(&sbi->s_freeblocks_counter);
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index ed886e6..1354a55 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -2018,6 +2018,7 @@ zap_buffer_unlocked:
clear_buffer_mapped(bh);
clear_buffer_req(bh);
clear_buffer_new(bh);
+ clear_buffer_datanew(bh);
bh->b_bdev = NULL;
return may_free;
}
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 634a5e5..a20bd4f 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -18,6 +18,7 @@

#include <linux/types.h>
#include <linux/magic.h>
+#include <linux/workqueue.h>

/*
* The second extended filesystem constants/structures
@@ -398,7 +399,6 @@ struct ext3_inode {
#define EXT3_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
#define EXT3_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
#define EXT3_MOUNT_ABORT 0x00200 /* Fatal error detected */
-#define EXT3_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
#define EXT3_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
#define EXT3_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
#define EXT3_MOUNT_WRITEBACK_DATA 0x00C00 /* No data ordering */
@@ -414,6 +414,12 @@ struct ext3_inode {
#define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
#define EXT3_MOUNT_DATA_ERR_ABORT 0x400000 /* Abort on file data write
* error in ordered mode */
+#define EXT3_MOUNT_GUARDED_DATA 0x800000 /* guard new writes with
+ i_size */
+#define EXT3_MOUNT_DATA_FLAGS (EXT3_MOUNT_JOURNAL_DATA | \
+ EXT3_MOUNT_ORDERED_DATA | \
+ EXT3_MOUNT_WRITEBACK_DATA | \
+ EXT3_MOUNT_GUARDED_DATA)

/* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
@@ -892,6 +898,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
extern void ext3_set_aops(struct inode *inode);
extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
+void ext3_run_guarded_work(struct work_struct *work);

/* ioctl.c */
extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
@@ -900,6 +907,7 @@ extern long ext3_compat_ioctl(struct file *, unsigned int, unsigned long);
/* namei.c */
extern int ext3_orphan_add(handle_t *, struct inode *);
extern int ext3_orphan_del(handle_t *, struct inode *);
+extern int ext3_orphan_del_locked(handle_t *, struct inode *);
extern int ext3_htree_fill_tree(struct file *dir_file, __u32 start_hash,
__u32 start_minor_hash, __u32 *next_hash);

@@ -945,7 +953,30 @@ extern const struct inode_operations ext3_special_inode_operations;
extern const struct inode_operations ext3_symlink_inode_operations;
extern const struct inode_operations ext3_fast_symlink_inode_operations;

+/* ordered-data.c */
+int ext3_add_ordered_extent(struct inode *inode, u64 file_offset,
+ struct buffer_head *bh);
+int ext3_put_ordered_extent(struct ext3_ordered_extent *entry);
+int ext3_remove_ordered_extent(struct inode *inode,
+ struct ext3_ordered_extent *entry);
+int ext3_ordered_update_i_size(struct inode *inode);
+void ext3_ordered_inode_init(struct ext3_inode_info *ei);
+void ext3_truncate_ordered_extents(struct inode *inode, u64 offset);
+
+static inline void ext3_ordered_lock(struct inode *inode)
+{
+ spin_lock(&EXT3_I(inode)->ordered_buffers.lock);
+}

+static inline void ext3_ordered_unlock(struct inode *inode)
+{
+ spin_unlock(&EXT3_I(inode)->ordered_buffers.lock);
+}
+
+static inline void ext3_get_ordered_extent(struct ext3_ordered_extent *entry)
+{
+ atomic_inc(&entry->refs);
+}
#endif /* __KERNEL__ */

#endif /* _LINUX_EXT3_FS_H */
diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
index 7894dd0..11dd4d4 100644
--- a/include/linux/ext3_fs_i.h
+++ b/include/linux/ext3_fs_i.h
@@ -65,6 +65,49 @@ struct ext3_block_alloc_info {
#define rsv_end rsv_window._rsv_end

/*
+ * used to prevent garbage in files after a crash by
+ * making sure i_size isn't updated until after the IO
+ * is done.
+ *
+ * See fs/ext3/ordered-data.c for the code that uses these.
+ */
+struct buffer_head;
+struct ext3_ordered_buffers {
+ /* protects the list and disk i_size */
+ spinlock_t lock;
+
+ struct list_head ordered_list;
+};
+
+struct ext3_ordered_extent {
+ /* logical offset of the block in the file
+ * strictly speaking we don't need this
+ * but keep it in the struct for
+ * debugging
+ */
+ u64 start;
+
+ /* buffer head being written */
+ struct buffer_head *bh;
+
+ /*
+ * set at end_io time so we properly
+ * do IO accounting even when this ordered
+ * extent struct has been removed from the
+ * list
+ */
+ struct buffer_head *end_io_bh;
+
+ /* number of refs on this ordered extent */
+ atomic_t refs;
+
+ struct list_head ordered_list;
+
+ /* list of things being processed by the workqueue */
+ struct list_head work_list;
+};
+
+/*
* third extended file system inode data in memory
*/
struct ext3_inode_info {
@@ -141,6 +184,8 @@ struct ext3_inode_info {
* by other means, so we have truncate_mutex.
*/
struct mutex truncate_mutex;
+
+ struct ext3_ordered_buffers ordered_buffers;
struct inode vfs_inode;
};

diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
index f07f34d..5dbdbeb 100644
--- a/include/linux/ext3_fs_sb.h
+++ b/include/linux/ext3_fs_sb.h
@@ -21,6 +21,7 @@
#include <linux/wait.h>
#include <linux/blockgroup_lock.h>
#include <linux/percpu_counter.h>
+#include <linux/workqueue.h>
#endif
#include <linux/rbtree.h>

@@ -82,6 +83,11 @@ struct ext3_sb_info {
char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */
int s_jquota_fmt; /* Format of quota to use */
#endif
+
+ struct workqueue_struct *guarded_wq;
+ struct work_struct guarded_work;
+ struct list_head guarded_buffers;
+ spinlock_t guarded_lock;
};

static inline spinlock_t *
diff --git a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h
index cf82d51..45cb4aa 100644
--- a/include/linux/ext3_jbd.h
+++ b/include/linux/ext3_jbd.h
@@ -212,6 +212,17 @@ static inline int ext3_should_order_data(struct inode *inode)
return 0;
}

+static inline int ext3_should_guard_data(struct inode *inode)
+{
+ if (!S_ISREG(inode->i_mode))
+ return 0;
+ if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL)
+ return 0;
+ if (test_opt(inode->i_sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
+ return 1;
+ return 0;
+}
+
static inline int ext3_should_writeback_data(struct inode *inode)
{
if (!S_ISREG(inode->i_mode))
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index c2049a0..bbb7990 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -291,6 +291,13 @@ enum jbd_state_bits {
BH_State, /* Pins most journal_head state */
BH_JournalHead, /* Pins bh->b_private and jh->b_bh */
BH_Unshadow, /* Dummy bit, for BJ_Shadow wakeup filtering */
+ BH_DataGuarded, /* ext3 data=guarded mode buffer
+ * these have something other than a
+ * journal_head at b_private */
+ BH_DataNew, /* BH_new gets cleared too early for
+ * data=guarded to use it. So,
+ * this gets set instead.
+ */
};

BUFFER_FNS(JBD, jbd)
@@ -302,6 +309,9 @@ TAS_BUFFER_FNS(Revoked, revoked)
BUFFER_FNS(RevokeValid, revokevalid)
TAS_BUFFER_FNS(RevokeValid, revokevalid)
BUFFER_FNS(Freed, freed)
+BUFFER_FNS(DataGuarded, dataguarded)
+BUFFER_FNS(DataNew, datanew)
+TAS_BUFFER_FNS(DataNew, datanew)

static inline struct buffer_head *jh2bh(struct journal_head *jh)
{
--





2009-04-29 20:04:14

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC] ext3 data=guarded v5

On Wed 29-04-09 15:41:29, Chris Mason wrote:
> On Wed, 2009-04-29 at 21:15 +0200, Jan Kara wrote:
> > On Wed 29-04-09 10:08:13, Chris Mason wrote:
> > > On Wed, 2009-04-29 at 10:56 +0200, Jan Kara wrote:
> > >
> > > > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > > > > index fcfa243..1e90107 100644
> > > > > --- a/fs/ext3/inode.c
> > > > > +++ b/fs/ext3/inode.c
> > > > > @@ -38,6 +38,7 @@
> > > > > #include <linux/bio.h>
> > > > > #include <linux/fiemap.h>
> > > > > #include <linux/namei.h>
> > > > > +#include <linux/workqueue.h>
> > > > > #include "xattr.h"
> > > > > #include "acl.h"
> > > > >
> > > > > @@ -179,6 +180,105 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode)
> > > > > }
> > > > >
> > > > > /*
> > > > > + * after a data=guarded IO is done, we need to update the
> > > > > + * disk i_size to reflect the data we've written. If there are
> > > > > + * no more ordered data extents left in the tree, we need to
> > > > ^^^^^^^^ the list
> > > > > + * get rid of the orphan entry making sure the file's
> > > > > + * block pointers match the i_size after a crash
> > > > > + *
> > > > > + * When we aren't in data=guarded mode, this just does an ext3_orphan_del.
> > > > > + *
> > > > > + * It returns the result of ext3_orphan_del.
> > > > > + *
> > > > > + * handle may be null if we are just cleaning up the orphan list in
> > > > > + * memory.
> > > > > + *
> > > > > + * pass must_log == 1 when the inode must be logged in order to get
> > > > > + * an i_size update on disk
> > > > > + */
> > > > > +static int ordered_orphan_del(handle_t *handle, struct inode *inode,
> > > > > + int must_log)
> > > > > +{
> > > > I'm afraid this function is racy.
> > > > 1) We probably need i_mutex to protect against unlink happening in parallel
> > > > (after we check i_nlink but before we call ext3_orphan_del).
> > >
> > > This would mean IO completion (clearing PG_writeback) would have to wait
> > > on the inode mutex, which we can't quite do in O_SYNC and O_DIRECT.
> > > But, what I can do is check i_nlink after the ext3_orphan_del call and
> > > put the inode back on the orphan list if it has gone to zero.
> > Ah, good point. But doing it without i_mutex is icky. Strictly speaking
> > you should have memory barriers in the code to make sure that you fetch
> > a recent value of i_nlink (although other locking around those places
> > probably does the work for you but *proving* the correctness is complex).
> > Hmm, I can't help it; the idea of updating i_disksize and calling
> > end_page_writeback() directly from the end_io handler, and doing just the
> > mark_inode_dirty() and orphan deletion from the workqueue, keeps coming back
> > to me ;-) It would solve this problem as well. We just have to somehow pin the
> > inode so that the VFS cannot remove it before we manage to file the i_disksize
> > update into the transaction, which, I agree, has some issues as well, as you
> > wrote in an earlier mail... The main advantage of the current scheme is
> > probably that the PG_writeback bit naturally tells the VM that we still have
> > some pinned resources, so it throttles writers if needed... But still, I find
> > mine better ;).
> >
> I think my latest patch has this nailed down without the mutex.
> Basically it checks i_nlink with super lock held and then calls
> ext3_orphan_del. If we race with unlink, we'll either find the new
> nlink count and skip the orphan del or unlink will come in after us and
> add the orphan back.
Ah, OK. That should reasonably work.
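Roughly, the check-under-lock pattern being described looks like this (just a
sketch, with the error handling, the inode logging, and the smp_mb() left out;
the real code is orphan_del() in the v6 patch below):

        /* ordered_list is &EXT3_I(inode)->ordered_buffers.ordered_list */
        ext3_ordered_lock(inode);
        if (inode->i_nlink && list_empty(ordered_list)) {
                ext3_ordered_unlock(inode);
                lock_super(inode->i_sb);
                /* re-check now that concurrent orphan adds are excluded */
                if (inode->i_nlink == 0 || !list_empty(ordered_list)) {
                        unlock_super(inode->i_sb);
                        return 0;
                }
                ext3_orphan_del_locked(handle, inode);
                unlock_super(inode->i_sb);
        } else {
                ext3_ordered_unlock(inode);
        }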

> > > > 2) We need the superblock lock for the check
> > > > list_empty(&EXT3_I(inode)->i_orphan).
> > >
> > > How about I take the guarded spinlock when doing the list_add instead?
> > > I'm trying to avoid the superblock lock as much as I can.
> > Well, ext3_orphan_del() takes it anyway so it's not like you introduce a
> > new lock dependency. Actually, in the code I wrote you have exactly as much
> > superblock locking as there is now. So IMO this is not an issue if we
> > somehow solve problem 1).
> >
>
> Yeah, I agree.
>
> > > > 3) The function should rather be named ext3_guarded_orphan_del()... At
> > > > least "ordered" is really confusing (that's the case for a few other
> > > > structs / variables as well).
> > >
> > > My long term plan was to replace ordered with guarded, but I can rename
> > > this one to guarded if you think it'll make it more clear.
> > Ah, OK. My feeling is that there's going to be at least some time of
> > coexistence of these two modes (also because the performance numbers for
> > ordered mode were considerably better in some loads) so I'd vote for
> > clearly separating their names.
> >
> > > > > if (err)
> > > > > ext3_journal_abort_handle(__func__, __func__,
> > > > > bh, handle, err);
> > > > > @@ -1231,6 +1440,89 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +/*
> > > > > + * Walk the buffers in a page for data=guarded mode. Buffers that
> > > > > + * are not marked as datanew are ignored.
> > > > > + *
> > > > > + * New buffers outside i_size are sent to the data guarded code
> > > > > + *
> > > > > + * We must do the old data=ordered mode when filling holes in the
> > > > > + * file, since i_size doesn't protect these at all.
> > > > > + */
> > > > > +static int journal_dirty_data_guarded_fn(handle_t *handle,
> > > > > + struct buffer_head *bh)
> > > > > +{
> > > > > + u64 offset = page_offset(bh->b_page) + bh_offset(bh);
> > > > > + struct inode *inode = bh->b_page->mapping->host;
> > > > > + int ret = 0;
> > > > > +
> > > > > + /*
> > > > > + * Write could have mapped the buffer but it didn't copy the data in
> > > > > + * yet. So avoid filing such buffer into a transaction.
> > > > > + */
> > > > > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > > > > + return 0;
> > > > > +
> > > > > + if (test_clear_buffer_datanew(bh)) {
> > > > Hmm, if we just extend the file inside the block (e.g. from 100 bytes to
> > > > 500 bytes), then we won't do the write guarded. But then if we crash before
> > > > the block really gets written, the user will see zeros at the end of the file
> > > > instead of data...
> > >
> > > You see something like this:
> > >
> > > create(file)
> > > write(file, 100 bytes) # create guarded IO
> > > fsync(file)
> > > write(file, 400 more bytes) # buffer isn't guarded, i_size goes to 500
> > Yes.
> >
> > > > I don't think we should let this happen, so I'd think we
> > > > have to guard all the extending writes regardless of whether they allocate a
> > > > new block or not.
> > >
> > > My main concern was avoiding stale data from the disk after a crash;
> > > zeros from partially written blocks are not as big a problem. But,
> > > you're right that we can easily avoid this, so I'll update the patch to
> > > do all extending writes as guarded.
> > Yes, ext3 actually does some work to avoid exposing zeros at the
> > end of the file when we fail to copy in new data (the source page was swapped
> > out in the meantime) and we crash before we manage to swap the page in again.
> > So it would be stupid to introduce the same problem here again...
>
> This should be fixed in my current version as well. The buffer is
> likely to be added as an ordered buffer in that case, but either way
> there won't be zeros.
>
> >
> > > > Which probably makes the buffer_datanew() flag unnecessary
> > > > because we just guard all the buffers from max(start of write, i_size) to
> > > > end of write.
> > >
> > > But, we still want buffer_datanew to decide when writes that fill holes
> > > should go through data=ordered.
> > See below.
> >
> > > > > +/*
> > > > > + * Walk the buffers in a page for data=guarded mode for writepage.
> > > > > + *
> > > > > + * We must do the old data=ordered mode when filling holes in the
> > > > > + * file, since i_size doesn't protect these at all.
> > > > > + *
> > > > > + * This is actually called after writepage is run and so we can't
> > > > > + * trust anything other than the buffer head (which we have pinned).
> > > > > + *
> > > > > + * Any datanew buffer at writepage time is filling a hole, so we don't need
> > > > > + * extra tests against the inode size.
> > > > > + */
> > > > > +static int journal_dirty_data_guarded_writepage_fn(handle_t *handle,
> > > > > + struct buffer_head *bh)
> > > > > +{
> > > > > + int ret = 0;
> > > > > +
> > > > > + /*
> > > > > + * Write could have mapped the buffer but it didn't copy the data in
> > > > > + * yet. So avoid filing such buffer into a transaction.
> > > > > + */
> > > > > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > > > > + return 0;
> > > > > +
> > > > > + if (test_clear_buffer_datanew(bh))
> > > > > + ret = ext3_journal_dirty_data(handle, bh);
> > > > > + return ret;
> > > > > +}
> > > > > +
> > > > Hmm, here we use the datanew flag as well. But it's probably not worth
> > > > keeping it just for this case. Ordering data in all cases when we get here
> > > > should be fine since if the block is already allocated we should not get
> > > > here (unless somebody managed to strip buffers from the page but kept the
> > > > page, but that should be rare enough).
> > > >
> > >
> > > I'd keep it for the common case of filling holes with write(), so then
> > > the code in writepage is gravy.
> > I'm not sure I understand. I wanted to suggest changing
> > if (test_clear_buffer_datanew(bh))
> > ret = ext3_journal_dirty_data(handle, bh);
> > to
> > ret = ext3_journal_dirty_data(handle, bh);
> >
> > So as to always order the data. Actually, the only cases where we would get to
> > journal_dirty_data_guarded_writepage_fn() with buffer_datanew() not set
> > would be when
> > a) blocksize < pagesize, the page already has some blocks allocated and we
> > add more blocks to fill the hole. Then with your code we would not order
> > all blocks in the page, with my code we would. But I don't think it makes a
> > big difference.
> > b) someone removed buffers from the page.
> > In all other cases we should take the fast path in writepage and submit the
> > IO without starting the transaction.
> > So IMO datanew can be removed...
>
> What we don't want to do is have a call to write() over existing blocks
> in the file add new things to the data=ordered list. I don't see how we
> can avoid that without datanew.
Yes, what I suggest would do exactly that:
In ordered_writepage() in the beginning we do:
        page_bufs = page_buffers(page);
        if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
                               NULL, buffer_unmapped)) {
                return block_write_full_page(page, NULL, wbc);
        }
So we only get to starting a transaction and filing some buffers if some buffer
in the page is unmapped. Write() maps / allocates all buffers in write_begin()
so they are never added to ordered lists in writepage(). We rely on write_end
to do it. So the only case where not all buffers in the page are mapped is
when we have to allocate in writepage() (mmaped write) or the two cases I
describe above.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-04-29 20:21:07

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC] ext3 data=guarded v6

On Wed 29-04-09 15:51:46, Chris Mason wrote:
> Hello everyone,
>
> Here is v6 based on Jan's review:
>
> * Fixup locking while deleting an orphan entry. The idea here is to
> take the super lock and then check our link count and ordered list. If
> we race with unlink or another process adding another guarded IO, both
> will wait on the super lock while they do the orphan add.
>
> * Fixup O_DIRECT disk i_size updates
>
> * Do either a guarded or ordered IO for any write past i_size.
>
> ext3 data=ordered mode makes sure that data blocks are on disk before
> the metadata that references them, which avoids files full of garbage
> or previously deleted data after a crash. It does this by adding every dirty
> buffer onto a list of things that must be written before a commit.
>
> This makes every fsync write out all the dirty data on the entire FS, which
> has high latencies and is generally much more expensive than it needs to be.
>
> Another way to avoid exposing stale data after a crash is to wait until
> after the data buffers are written before updating the on-disk record
> of the file's size. If we crash before the data IO is done, i_size
> doesn't yet include the new blocks and no stale data is exposed.
>
> This patch adds the delayed i_size update to ext3, along with a new
> mount option (data=guarded) to enable it. The basic mechanism works like
> this:
>
> * Change block_write_full_page to take an end_io handler as a parameter.
> This allows us to make an end_io handler that queues buffer heads for
> a workqueue where the real work of updating the on disk i_size is done.
>
> * Add a list to the in-memory ext3 inode for tracking data=guarded
> buffer heads that are waiting to be sent to disk.
>
> * Add an ext3 guarded write_end call to add buffer heads for newly
> allocated blocks into the rbtree. If we have a newly allocated block that is
^^^^^^ ;)

> filling a hole inside i_size, this is done as an old style data=ordered write
> instead.
>
> * Add an ext3 guarded writepage call that uses a special buffer head
> end_io handler for buffers that are marked as guarded. Again, if we find
> newly allocated blocks filling holes, they are sent through data=ordered
> instead of data=guarded.
>
> * When a guarded IO finishes, kick a per-FS workqueue to do the
> on disk i_size updates. The workqueue function must be very careful. We only
> update the on disk i_size if all of the IO between the old on disk i_size and
> the new on disk i_size is complete. The on disk i_size is incrementally
> updated to the largest safe value every time an IO completes.
>
> * When we start tracking guarded buffers on a given inode, we put the
> inode into ext3's orphan list. This way if we do crash, the file will
> be truncated back down to the on disk i_size and we'll free any blocks that
> were not completely written. The inode is removed from the orphan list
> only after all the guarded buffers are done.
>
> Signed-off-by: Chris Mason <[email protected]>
>
> ---
> fs/ext3/Makefile | 3 +-
> fs/ext3/fsync.c | 12 +
> fs/ext3/inode.c | 604 +++++++++++++++++++++++++++++++++++++++++++-
> fs/ext3/namei.c | 21 +-
> fs/ext3/ordered-data.c | 235 +++++++++++++++++
> fs/ext3/super.c | 48 +++-
> fs/jbd/transaction.c | 1 +
> include/linux/ext3_fs.h | 33 +++-
> include/linux/ext3_fs_i.h | 45 ++++
> include/linux/ext3_fs_sb.h | 6 +
> include/linux/ext3_jbd.h | 11 +
> include/linux/jbd.h | 10 +
> 12 files changed, 1002 insertions(+), 27 deletions(-)
>
> diff --git a/fs/ext3/Makefile b/fs/ext3/Makefile
> index e77766a..f3a9dc1 100644
> --- a/fs/ext3/Makefile
> +++ b/fs/ext3/Makefile
> @@ -5,7 +5,8 @@
> obj-$(CONFIG_EXT3_FS) += ext3.o
>
> ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
> - ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o
> + ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o \
> + ordered-data.o
>
> ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
> ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o
> diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
> index d336341..a50abb4 100644
> --- a/fs/ext3/fsync.c
> +++ b/fs/ext3/fsync.c
> @@ -59,6 +59,11 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> * sync_inode() will write the inode if it is dirty. Then the caller's
> * filemap_fdatawait() will wait on the pages.
> *
> + * data=guarded:
> + * The caller's filemap_fdatawrite will start the IO, and we
> + * use filemap_fdatawait here to make sure all the disk i_size updates
> + * are done before we commit the inode.
> + *
> * data=journal:
> * filemap_fdatawrite won't do anything (the buffers are clean).
> * ext3_force_commit will write the file data into the journal and
> @@ -84,6 +89,13 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> .sync_mode = WB_SYNC_ALL,
> .nr_to_write = 0, /* sys_fsync did this */
> };
> + /*
> + * the new disk i_size must be logged before we commit,
> + * so we wait here for pending writeback
> + */
> + if (ext3_should_guard_data(inode))
> + filemap_write_and_wait(inode->i_mapping);
> +
> ret = sync_inode(inode, &wbc);
> }
> out:
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index fcfa243..1a43178 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -38,6 +38,7 @@
> #include <linux/bio.h>
> #include <linux/fiemap.h>
> #include <linux/namei.h>
> +#include <linux/workqueue.h>
> #include "xattr.h"
> #include "acl.h"
>
> @@ -179,6 +180,106 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode)
> }
>
> /*
> + * after a data=guarded IO is done, we need to update the
> + * disk i_size to reflect the data we've written. If there are
> + * no more ordered data extents left in the list, we need to
> + * get rid of the orphan entry making sure the file's
> + * block pointers match the i_size after a crash
> + *
> + * When we aren't in data=guarded mode, this just does an ext3_orphan_del.
> + *
> + * It returns the result of ext3_orphan_del.
> + *
> + * handle may be null if we are just cleaning up the orphan list in
> + * memory.
> + *
> + * pass must_log == 1 when the inode must be logged in order to get
> + * an i_size update on disk
> + */
> +static int orphan_del(handle_t *handle, struct inode *inode, int must_log)
> +{
> + int ret = 0;
> + struct list_head *ordered_list;
> +
> + ordered_list = &EXT3_I(inode)->ordered_buffers.ordered_list;
> +
> + /* fast out when data=guarded isn't on */
> + if (!ext3_should_guard_data(inode))
> + return ext3_orphan_del(handle, inode);
> +
> + ext3_ordered_lock(inode);
> + if (inode->i_nlink && list_empty(ordered_list)) {
> + ext3_ordered_unlock(inode);
> +
> + lock_super(inode->i_sb);
> +
> + /*
> + * now that we have the lock make sure we are allowed to
> + * get rid of the orphan. This way we make sure our
> + * test isn't happening concurrently with someone else
> + * adding an orphan. Memory barrier for the ordered list.
> + */
> + smp_mb();
> + if (inode->i_nlink == 0 || !list_empty(ordered_list)) {
> + ext3_ordered_unlock(inode);
Unlock here is superfluous... Otherwise it looks correct.
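That is, the re-check branch only needs to drop the super lock, since the
ordered lock was already released in the outer branch (a sketch of just that
branch, everything around it unchanged):

                smp_mb();
                if (inode->i_nlink == 0 || !list_empty(ordered_list)) {
                        unlock_super(inode->i_sb);
                        goto out;
                }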

> + unlock_super(inode->i_sb);
> + goto out;
> + }
> +
> + /*
> + * if we aren't actually on the orphan list, the orphan
> + * del won't log our inode. Log it now to make sure
> + */
> + ext3_mark_inode_dirty(handle, inode);
> +
> + ret = ext3_orphan_del_locked(handle, inode);
> +
> + unlock_super(inode->i_sb);
> + } else if (handle && must_log) {
> + ext3_ordered_unlock(inode);
> +
> + /*
> + * we need to make sure any updates done by the data=guarded
> + * code end up in the inode on disk. Log the inode
> + * here
> + */
> + ext3_mark_inode_dirty(handle, inode);
> + } else {
> + ext3_ordered_unlock(inode);
> + }
> +
> +out:
> + return ret;
> +}
> +
> +/*
> + * Wrapper around orphan_del that starts a transaction
> + */
> +static void orphan_del_trans(struct inode *inode, int must_log)
> +{
> + handle_t *handle;
> +
> + handle = ext3_journal_start(inode, 3);
> +
> + /*
> + * uhoh, should we flag the FS as readonly here? ext3_dirty_inode
> + * doesn't, which is what we're modeling ourselves after.
> + *
> + * We do need to make sure to get this inode off the ordered list
> + * when the transaction start fails though. orphan_del
> + * does the right thing.
> + */
> + if (IS_ERR(handle)) {
> + orphan_del(NULL, inode, 0);
> + return;
> + }
> +
> + orphan_del(handle, inode, must_log);
> + ext3_journal_stop(handle);
> +}
> +
> +
> +/*
> * Called at the last iput() if i_nlink is zero.
> */
> void ext3_delete_inode (struct inode * inode)
> @@ -204,6 +305,13 @@ void ext3_delete_inode (struct inode * inode)
> if (IS_SYNC(inode))
> handle->h_sync = 1;
> inode->i_size = 0;
> +
> + /*
> + * make sure we clean up any ordered extents that didn't get
> + * IO started on them because i_size shrunk down to zero.
> + */
> + ext3_truncate_ordered_extents(inode, 0);
> +
> if (inode->i_blocks)
> ext3_truncate(inode);
> /*
> @@ -767,6 +875,24 @@ err_out:
> }
>
> /*
> + * This protects the disk i_size with the spinlock for the ordered
> + * extent tree. It returns 1 when the inode needs to be logged
> + * because the i_disksize has been updated.
> + */
> +static int maybe_update_disk_isize(struct inode *inode, loff_t new_size)
> +{
> + int ret = 0;
> +
> + ext3_ordered_lock(inode);
> + if (EXT3_I(inode)->i_disksize < new_size) {
> + EXT3_I(inode)->i_disksize = new_size;
> + ret = 1;
> + }
> + ext3_ordered_unlock(inode);
> + return ret;
> +}
> +
> +/*
> * Allocation strategy is simple: if we have to allocate something, we will
> * have to go the whole way to leaf. So let's do it before attaching anything
> * to tree, set linkage between the newborn blocks, write them if sync is
> @@ -815,6 +941,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
> if (!partial) {
> first_block = le32_to_cpu(chain[depth - 1].key);
> clear_buffer_new(bh_result);
> + clear_buffer_datanew(bh_result);
> count++;
> /*map more blocks*/
> while (count < maxblocks && count <= blocks_to_boundary) {
> @@ -873,6 +1000,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
> if (err)
> goto cleanup;
> clear_buffer_new(bh_result);
> + clear_buffer_datanew(bh_result);
> goto got_it;
> }
> }
> @@ -915,14 +1043,18 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
> * i_disksize growing is protected by truncate_mutex. Don't forget to
> * protect it if you're about to implement concurrent
> * ext3_get_block() -bzzz
> + *
> + * extend_disksize is only called for directories, and so
> + * it is not using guarded buffer protection.
> */
> - if (!err && extend_disksize && inode->i_size > ei->i_disksize)
> + if (!err && extend_disksize)
> ei->i_disksize = inode->i_size;
> mutex_unlock(&ei->truncate_mutex);
> if (err)
> goto cleanup;
>
> set_buffer_new(bh_result);
> + set_buffer_datanew(bh_result);
> got_it:
> map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
> if (count > blocks_to_boundary)
> @@ -1079,6 +1211,77 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
> return NULL;
> }
>
> +/*
> + * data=guarded updates are handled in a workqueue after the IO
> + * is done. This runs through the list of buffer heads that are pending
> + * processing.
> + */
> +void ext3_run_guarded_work(struct work_struct *work)
> +{
> + struct ext3_sb_info *sbi =
> + container_of(work, struct ext3_sb_info, guarded_work);
> + struct buffer_head *bh;
> + struct ext3_ordered_extent *ordered;
> + struct inode *inode;
> + struct page *page;
> + int must_log;
> +
> + spin_lock_irq(&sbi->guarded_lock);
> + while (!list_empty(&sbi->guarded_buffers)) {
> + ordered = list_entry(sbi->guarded_buffers.next,
> + struct ext3_ordered_extent, work_list);
> +
> + list_del(&ordered->work_list);
> +
> + bh = ordered->end_io_bh;
> + ordered->end_io_bh = NULL;
> + must_log = 0;
> +
> + /* we don't need a reference on the buffer head because
> + * it is locked until the end_io handler is called.
> + *
> + * This means the page can't go away, which means the
> + * inode can't go away
> + */
> + spin_unlock_irq(&sbi->guarded_lock);
> +
> + page = bh->b_page;
> + inode = page->mapping->host;
> +
> + ext3_ordered_lock(inode);
> + if (ordered->bh) {
> + /*
> + * someone might have decided this buffer didn't
> + * really need to be ordered and removed us from
> + * the list. They set ordered->bh to null
> + * when that happens.
> + */
> + ext3_remove_ordered_extent(inode, ordered);
> + must_log = ext3_ordered_update_i_size(inode);
> + }
> + ext3_ordered_unlock(inode);
> +
> + /*
> + * drop the reference taken when this ordered extent was
> + * put onto the guarded_buffers list
> + */
> + ext3_put_ordered_extent(ordered);
> +
> + /*
> + * maybe log the inode and/or cleanup the orphan entry
> + */
> + orphan_del_trans(inode, must_log > 0);
> +
> + /*
> + * finally, call the real bh end_io function to do
> + * all the hard work of maintaining page writeback.
> + */
> + end_buffer_async_write(bh, buffer_uptodate(bh));
> + spin_lock_irq(&sbi->guarded_lock);
> + }
> + spin_unlock_irq(&sbi->guarded_lock);
> +}
> +
> static int walk_page_buffers( handle_t *handle,
> struct buffer_head *head,
> unsigned from,
> @@ -1185,6 +1388,7 @@ retry:
> ret = walk_page_buffers(handle, page_buffers(page),
> from, to, NULL, do_journal_get_write_access);
> }
> +
> write_begin_failed:
> if (ret) {
> /*
> @@ -1212,7 +1416,13 @@ out:
>
> int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
> {
> - int err = journal_dirty_data(handle, bh);
> + int err;
> +
> + /* don't take buffers from the data=guarded list */
> + if (buffer_dataguarded(bh))
> + return 0;
> +
> + err = journal_dirty_data(handle, bh);
> if (err)
> ext3_journal_abort_handle(__func__, __func__,
> bh, handle, err);
> @@ -1231,6 +1441,98 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> return 0;
> }
>
> +/*
> + * Walk the buffers in a page for data=guarded mode. Buffers that
> + * are not marked as datanew are ignored.
> + *
> + * New buffers outside i_size are sent to the data guarded code
> + *
> + * We must do the old data=ordered mode when filling holes in the
> + * file, since i_size doesn't protect these at all.
> + */
> +static int journal_dirty_data_guarded_fn(handle_t *handle,
> + struct buffer_head *bh)
> +{
> + u64 offset = page_offset(bh->b_page) + bh_offset(bh);
> + struct inode *inode = bh->b_page->mapping->host;
> + int ret = 0;
> + int was_new;
> +
> + /*
> + * Write could have mapped the buffer but it didn't copy the data in
> + * yet. So avoid filing such buffer into a transaction.
> + */
> + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> + return 0;
> +
> + was_new = test_clear_buffer_datanew(bh);
> +
> + if (offset < inode->i_size) {
> + /*
> + * if we're filling a hole inside i_size, we need to
> + * fall back to the old style data=ordered
> + */
> + if (was_new)
> + ret = ext3_journal_dirty_data(handle, bh);
> + goto out;
> + }
> + ret = ext3_add_ordered_extent(inode, offset, bh);
> +
> + /* if we crash before the IO is done, i_size will be small
> + * but these blocks will still be allocated to the file.
> + *
> + * So, add an orphan entry for the file, which will truncate it
> + * down to the i_size it finds after the crash.
> + *
> + * The orphan is cleaned up when the IO is done. We
> + * don't add orphans while mount is running the orphan list,
> + * that seems to corrupt the list.
> + *
> + * We're testing list_empty on the i_orphan list, but
> + * right here we have i_mutex held. So the only place that
> + * is going to race around and remove us from the orphan
> + * list is the work queue to process completed guarded
> + * buffers. That will find the ordered_extent we added
> + * above and leave us on the orphan list.
> + */
> + if (ret == 0 && buffer_dataguarded(bh) &&
> + list_empty(&EXT3_I(inode)->i_orphan) &&
> + !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) {
> + ret = ext3_orphan_add(handle, inode);
> + }
OK, looks fine but it's subtle...

> +out:
> + return ret;
> +}
> +
> +/*
> + * Walk the buffers in a page for data=guarded mode for writepage.
> + *
> + * We must do the old data=ordered mode when filling holes in the
> + * file, since i_size doesn't protect these at all.
> + *
> + * This is actually called after writepage is run and so we can't
> + * trust anything other than the buffer head (which we have pinned).
> + *
> + * Any datanew buffer at writepage time is filling a hole, so we don't need
> + * extra tests against the inode size.
> + */
> +static int journal_dirty_data_guarded_writepage_fn(handle_t *handle,
> + struct buffer_head *bh)
> +{
> + int ret = 0;
> +
> + /*
> + * Write could have mapped the buffer but it didn't copy the data in
> + * yet. So avoid filing such buffer into a transaction.
> + */
> + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> + return 0;
> +
> + if (test_clear_buffer_datanew(bh))
> + ret = ext3_journal_dirty_data(handle, bh);
> + return ret;
> +}
> +
> /* For write_end() in data=journal mode */
> static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> {
> @@ -1251,10 +1553,8 @@ static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied)
> /* What matters to us is i_disksize. We don't write i_size anywhere */
> if (pos + copied > inode->i_size)
> i_size_write(inode, pos + copied);
> - if (pos + copied > EXT3_I(inode)->i_disksize) {
> - EXT3_I(inode)->i_disksize = pos + copied;
> + if (maybe_update_disk_isize(inode, pos + copied))
> mark_inode_dirty(inode);
> - }
> }
>
> /*
> @@ -1300,6 +1600,73 @@ static int ext3_ordered_write_end(struct file *file,
> return ret ? ret : copied;
> }
>
> +static int ext3_guarded_write_end(struct file *file,
> + struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata)
> +{
> + handle_t *handle = ext3_journal_current_handle();
> + struct inode *inode = file->f_mapping->host;
> + unsigned from, to;
> + int ret = 0, ret2;
> +
> + copied = block_write_end(file, mapping, pos, len, copied,
> + page, fsdata);
> +
> + from = pos & (PAGE_CACHE_SIZE - 1);
> + to = from + copied;
> + ret = walk_page_buffers(handle, page_buffers(page),
> + from, to, NULL, journal_dirty_data_guarded_fn);
> +
> + /*
> + * we only update the in-memory i_size. The disk i_size is done
> + * by the end io handlers
> + */
> + if (ret == 0 && pos + copied > inode->i_size) {
> + int must_log;
> +
> + /* updated i_size, but we may have raced with a
> + * data=guarded end_io handler.
> + *
> + * All the guarded IO could have ended while i_size was still
> + * small, and if we're just adding bytes into an existing block
> + * in the file, we may not be adding a new guarded IO with this
> + * write. So, do a check on the disk i_size and make sure it
> + * is updated to the highest safe value.
> + *
> + * This may also be required if the
> + * journal_dirty_data_guarded_fn chose to do an fully
> + * ordered write of this buffer instead of a guarded
> + * write.
> + *
> + * ext3_ordered_update_i_size tests inode->i_size, so we
> + * make sure to update it with the ordered lock held.
> + */
> + ext3_ordered_lock(inode);
> + i_size_write(inode, pos + copied);
> + must_log = ext3_ordered_update_i_size(inode);
> + ext3_ordered_unlock(inode);
> +
> + orphan_del_trans(inode, must_log > 0);
> + }
> +
> + /*
> + * There may be allocated blocks outside of i_size because
> + * we failed to copy some data. Prepare for truncate.
> + */
> + if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> + ret2 = ext3_journal_stop(handle);
> + if (!ret)
> + ret = ret2;
> + unlock_page(page);
> + page_cache_release(page);
> +
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> + return ret ? ret : copied;
> +}
> +
> static int ext3_writeback_write_end(struct file *file,
> struct address_space *mapping,
> loff_t pos, unsigned len, unsigned copied,
> @@ -1311,6 +1678,7 @@ static int ext3_writeback_write_end(struct file *file,
>
> copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> update_file_sizes(inode, pos, copied);
> +
> /*
> * There may be allocated blocks outside of i_size because
> * we failed to copy some data. Prepare for truncate.
> @@ -1574,6 +1942,144 @@ out_fail:
> return ret;
> }
>
> +/*
> + * Completion handler for block_write_full_page(). This will
> + * kick off the data=guarded workqueue as the IO finishes.
> + */
> +static void end_buffer_async_write_guarded(struct buffer_head *bh,
> + int uptodate)
> +{
> + struct ext3_sb_info *sbi;
> + struct address_space *mapping;
> + struct ext3_ordered_extent *ordered;
> + unsigned long flags;
> +
> + mapping = bh->b_page->mapping;
> + if (!mapping || !bh->b_private || !buffer_dataguarded(bh)) {
> +noguard:
> + end_buffer_async_write(bh, uptodate);
> + return;
> + }
> +
> + /*
> + * the guarded workqueue function checks the uptodate bit on the
> + * bh and uses that to tell the real end_io handler if things worked
> + * out or not.
> + */
> + if (uptodate)
> + set_buffer_uptodate(bh);
> + else
> + clear_buffer_uptodate(bh);
> +
> + sbi = EXT3_SB(mapping->host->i_sb);
> +
> + spin_lock_irqsave(&sbi->guarded_lock, flags);
> +
> + /*
> + * remove any chance that a truncate raced in and cleared
> + * our dataguard flag, which also freed the ordered extent in
> + * our b_private.
> + */
> + if (!buffer_dataguarded(bh)) {
> + spin_unlock_irqrestore(&sbi->guarded_lock, flags);
> + goto noguard;
> + }
> + ordered = bh->b_private;
> + WARN_ON(ordered->end_io_bh);
> +
> + /*
> + * use the special end_io_bh pointer to make sure that
> + * some form of end_io handler is run on this bh, even
> + * if the ordered_extent is removed from the rb tree before
> + * our workqueue ends up processing it.
> + */
> + ordered->end_io_bh = bh;
> + list_add_tail(&ordered->work_list, &sbi->guarded_buffers);
> + ext3_get_ordered_extent(ordered);
> + spin_unlock_irqrestore(&sbi->guarded_lock, flags);
> +
> + queue_work(sbi->guarded_wq, &sbi->guarded_work);
> +}
> +
> +static int ext3_guarded_writepage(struct page *page,
> + struct writeback_control *wbc)
> +{
> + struct inode *inode = page->mapping->host;
> + struct buffer_head *page_bufs;
> + handle_t *handle = NULL;
> + int ret = 0;
> + int err;
> +
> + J_ASSERT(PageLocked(page));
> +
> + /*
> + * We give up here if we're reentered, because it might be for a
> + * different filesystem.
> + */
> + if (ext3_journal_current_handle())
> + goto out_fail;
> +
> + if (!page_has_buffers(page)) {
> + create_empty_buffers(page, inode->i_sb->s_blocksize,
> + (1 << BH_Dirty)|(1 << BH_Uptodate));
> + page_bufs = page_buffers(page);
> + } else {
> + page_bufs = page_buffers(page);
> + if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
> + NULL, buffer_unmapped)) {
> + /* Provide NULL get_block() to catch bugs if buffers
> + * weren't really mapped */
> + return block_write_full_page_endio(page, NULL, wbc,
> + end_buffer_async_write_guarded);
> + }
> + }
> + handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
> +
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out_fail;
> + }
> +
> + walk_page_buffers(handle, page_bufs, 0,
> + PAGE_CACHE_SIZE, NULL, bget_one);
> +
> + ret = block_write_full_page_endio(page, ext3_get_block, wbc,
> + end_buffer_async_write_guarded);
> +
> + /*
> + * The page can become unlocked at any point now, and
> + * truncate can then come in and change things. So we
> + * can't touch *page from now on. But *page_bufs is
> + * safe due to elevated refcount.
> + */
> +
> + /*
> + * And attach them to the current transaction. But only if
> + * block_write_full_page() succeeded. Otherwise they are unmapped,
> + * and generally junk.
> + */
> + if (ret == 0) {
> + err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
> + NULL, journal_dirty_data_guarded_writepage_fn);
> + if (!ret)
> + ret = err;
> + }
> + walk_page_buffers(handle, page_bufs, 0,
> + PAGE_CACHE_SIZE, NULL, bput_one);
> + err = ext3_journal_stop(handle);
> + if (!ret)
> + ret = err;
> +
> + return ret;
> +
> +out_fail:
> + redirty_page_for_writepage(wbc, page);
> + unlock_page(page);
> + return ret;
> +}
> +
> +
> +
> static int ext3_writeback_writepage(struct page *page,
> struct writeback_control *wbc)
> {
> @@ -1747,7 +2253,14 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
> goto out;
> }
> orphan = 1;
> - ei->i_disksize = inode->i_size;
> + /* in guarded mode, other code is responsible
> + * for updating i_disksize. Actually in
> + * every mode, ei->i_disksize should be correct,
> + * so I don't understand why it is getting updated
> + * to i_size here.
> + */
> + if (!ext3_should_guard_data(inode))
> + ei->i_disksize = inode->i_size;
> ext3_journal_stop(handle);
> }
> }
> @@ -1768,13 +2281,27 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
> ret = PTR_ERR(handle);
> goto out;
> }
> +
> if (inode->i_nlink)
> - ext3_orphan_del(handle, inode);
> + orphan_del(handle, inode, 0);
> +
> if (ret > 0) {
> loff_t end = offset + ret;
> if (end > inode->i_size) {
> - ei->i_disksize = end;
> - i_size_write(inode, end);
> + /* i_mutex keeps other file writes from
> + * hopping in at this time, and we
> + * know the O_DIRECT write just put all
> + * those blocks on disk. But, there
> + * may be guarded writes at lower offsets
> + * in the file that were not forced down.
> + */
> + if (ext3_should_guard_data(inode)) {
> + i_size_write(inode, end);
> + ext3_ordered_update_i_size(inode);
> + } else {
> + ei->i_disksize = end;
> + i_size_write(inode, end);
> + }
Move i_size_write() before the if?
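That is, roughly (a sketch of the suggested rearrangement, with the comment
block omitted):

                        i_size_write(inode, end);
                        if (ext3_should_guard_data(inode))
                                ext3_ordered_update_i_size(inode);
                        else
                                ei->i_disksize = end;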

> /*
> * We're going to return a positive `ret'
> * here due to non-zero-length I/O, so there's
> @@ -1842,6 +2369,21 @@ static const struct address_space_operations ext3_writeback_aops = {
> .is_partially_uptodate = block_is_partially_uptodate,
> };
>
> +static const struct address_space_operations ext3_guarded_aops = {
> + .readpage = ext3_readpage,
> + .readpages = ext3_readpages,
> + .writepage = ext3_guarded_writepage,
> + .sync_page = block_sync_page,
> + .write_begin = ext3_write_begin,
> + .write_end = ext3_guarded_write_end,
> + .bmap = ext3_bmap,
> + .invalidatepage = ext3_invalidatepage,
> + .releasepage = ext3_releasepage,
> + .direct_IO = ext3_direct_IO,
> + .migratepage = buffer_migrate_page,
> + .is_partially_uptodate = block_is_partially_uptodate,
> +};
> +
> static const struct address_space_operations ext3_journalled_aops = {
> .readpage = ext3_readpage,
> .readpages = ext3_readpages,
> @@ -1860,6 +2402,8 @@ void ext3_set_aops(struct inode *inode)
> {
> if (ext3_should_order_data(inode))
> inode->i_mapping->a_ops = &ext3_ordered_aops;
> + else if (ext3_should_guard_data(inode))
> + inode->i_mapping->a_ops = &ext3_guarded_aops;
> else if (ext3_should_writeback_data(inode))
> inode->i_mapping->a_ops = &ext3_writeback_aops;
> else
> @@ -2376,7 +2920,8 @@ void ext3_truncate(struct inode *inode)
> if (!ext3_can_truncate(inode))
> return;
>
> - if (inode->i_size == 0 && ext3_should_writeback_data(inode))
> + if (inode->i_size == 0 && (ext3_should_writeback_data(inode) ||
> + ext3_should_guard_data(inode)))
> ei->i_state |= EXT3_STATE_FLUSH_ON_CLOSE;
>
> /*
> @@ -3103,10 +3648,39 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
> ext3_journal_stop(handle);
> }
>
> + if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
> + /*
> + * we need to make sure any data=guarded pages
> + * are on disk before we force a new disk i_size
> + * down into the inode. The crucial range is
> + * anything between the disksize on disk now
> + * and the new size we're going to set.
> + *
> + * We're holding i_mutex here, so we know new
> + * ordered extents are not going to appear in the inode
> + *
> + * This must be done both for truncates that make the
> + * file bigger and smaller because truncate messes around
> + * with the orphan inode list in both cases.
> + */
> + if (ext3_should_guard_data(inode)) {
> + filemap_write_and_wait_range(inode->i_mapping,
> + EXT3_I(inode)->i_disksize,
> + (loff_t)-1);
> + /*
> + * we've written everything, make sure all
> + * the ordered extents are really gone.
> + *
> + * This prevents leaking of ordered extents
> + * and it also makes sure the ordered extent code
> + * doesn't mess with the orphan link
> + */
> + ext3_truncate_ordered_extents(inode, 0);
> + }
> + }
> if (S_ISREG(inode->i_mode) &&
> attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
> handle_t *handle;
> -
> handle = ext3_journal_start(inode, 3);
> if (IS_ERR(handle)) {
> error = PTR_ERR(handle);
> @@ -3114,6 +3688,7 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
> }
>
> error = ext3_orphan_add(handle, inode);
> +
> EXT3_I(inode)->i_disksize = attr->ia_size;
> rc = ext3_mark_inode_dirty(handle, inode);
> if (!error)
> @@ -3125,8 +3700,11 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
>
> /* If inode_setattr's call to ext3_truncate failed to get a
> * transaction handle at all, we need to clean up the in-core
> - * orphan list manually. */
> - if (inode->i_nlink)
> + * orphan list manually. Because we've finished off all the
> + * guarded IO above, this doesn't hurt anything for the guarded
> + * code
> + */
> + if (inode->i_nlink && (attr->ia_valid & ATTR_SIZE))
> ext3_orphan_del(NULL, inode);
>
> if (!rc && (ia_valid & ATTR_MODE))
> diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
> index 6ff7b97..711549a 100644
> --- a/fs/ext3/namei.c
> +++ b/fs/ext3/namei.c
> @@ -1973,11 +1973,21 @@ out_unlock:
> return err;
> }
>
> +int ext3_orphan_del(handle_t *handle, struct inode *inode)
> +{
> + int ret;
> +
> + lock_super(inode->i_sb);
> + ret = ext3_orphan_del_locked(handle, inode);
> + unlock_super(inode->i_sb);
> + return ret;
> +}
> +
> /*
> * ext3_orphan_del() removes an unlinked or truncated inode from the list
> * of such inodes stored on disk, because it is finally being cleaned up.
> */
> -int ext3_orphan_del(handle_t *handle, struct inode *inode)
> +int ext3_orphan_del_locked(handle_t *handle, struct inode *inode)
> {
> struct list_head *prev;
> struct ext3_inode_info *ei = EXT3_I(inode);
> @@ -1986,11 +1996,8 @@ int ext3_orphan_del(handle_t *handle, struct inode *inode)
> struct ext3_iloc iloc;
> int err = 0;
>
> - lock_super(inode->i_sb);
> - if (list_empty(&ei->i_orphan)) {
> - unlock_super(inode->i_sb);
> + if (list_empty(&ei->i_orphan))
> return 0;
> - }
>
> ino_next = NEXT_ORPHAN(inode);
> prev = ei->i_orphan.prev;
> @@ -2040,7 +2047,6 @@ int ext3_orphan_del(handle_t *handle, struct inode *inode)
> out_err:
> ext3_std_error(inode->i_sb, err);
> out:
> - unlock_super(inode->i_sb);
> return err;
>
> out_brelse:
> @@ -2410,7 +2416,8 @@ static int ext3_rename (struct inode * old_dir, struct dentry *old_dentry,
> ext3_mark_inode_dirty(handle, new_inode);
> if (!new_inode->i_nlink)
> ext3_orphan_add(handle, new_inode);
> - if (ext3_should_writeback_data(new_inode))
> + if (ext3_should_writeback_data(new_inode) ||
> + ext3_should_guard_data(new_inode))
> flush_file = 1;
> }
> retval = 0;
> diff --git a/fs/ext3/ordered-data.c b/fs/ext3/ordered-data.c
> new file mode 100644
> index 0000000..a6dab2d
> --- /dev/null
> +++ b/fs/ext3/ordered-data.c
> @@ -0,0 +1,235 @@
> +/*
> + * Copyright (C) 2009 Oracle. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +
> +#include <linux/gfp.h>
> +#include <linux/slab.h>
> +#include <linux/blkdev.h>
> +#include <linux/writeback.h>
> +#include <linux/pagevec.h>
> +#include <linux/buffer_head.h>
> +#include <linux/ext3_jbd.h>
> +
> +/*
> + * simple helper to make sure a new entry we're adding is
> + * at a larger offset in the file than the last entry in the list
> + */
> +static void check_ordering(struct ext3_ordered_buffers *buffers,
> + struct ext3_ordered_extent *entry)
> +{
> + struct ext3_ordered_extent *last;
> +
> + if (list_empty(&buffers->ordered_list))
> + return;
> +
> + last = list_entry(buffers->ordered_list.prev,
> + struct ext3_ordered_extent, ordered_list);
> + BUG_ON(last->start >= entry->start);
> +}
> +
> +/* allocate and add a new ordered_extent into the per-inode list.
> + * start is the logical offset in the file
> + *
> + * The list is given a single reference on the ordered extent that was
> + * inserted, and it also takes a reference on the buffer head.
> + */
> +int ext3_add_ordered_extent(struct inode *inode, u64 start,
> + struct buffer_head *bh)
> +{
> + struct ext3_ordered_buffers *buffers;
> + struct ext3_ordered_extent *entry;
> + int ret = 0;
> +
> + lock_buffer(bh);
> +
> + /* ordered extent already there, or in old style data=ordered */
> + if (bh->b_private) {
> + ret = 0;
> + goto out;
> + }
> +
> + buffers = &EXT3_I(inode)->ordered_buffers;
> + entry = kzalloc(sizeof(*entry), GFP_NOFS);
> + if (!entry) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + spin_lock(&buffers->lock);
> + entry->start = start;
> +
> + get_bh(bh);
> + entry->bh = bh;
> + bh->b_private = entry;
> + set_buffer_dataguarded(bh);
> +
> + /* one ref for the list */
> + atomic_set(&entry->refs, 1);
> + INIT_LIST_HEAD(&entry->work_list);
> +
> + check_ordering(buffers, entry);
> +
> + list_add_tail(&entry->ordered_list, &buffers->ordered_list);
> +
> + spin_unlock(&buffers->lock);
> +out:
> + unlock_buffer(bh);
> + return ret;
> +}
> +
> +/*
> + * used to drop a reference on an ordered extent. This will free
> + * the extent if the last reference is dropped
> + */
> +int ext3_put_ordered_extent(struct ext3_ordered_extent *entry)
> +{
> + if (atomic_dec_and_test(&entry->refs)) {
> + WARN_ON(entry->bh);
> + WARN_ON(entry->end_io_bh);
> + kfree(entry);
> + }
> + return 0;
> +}
> +
> +/*
> + * remove an ordered extent from the list. This removes the
> + * reference held by the list on 'entry' and the
> + * reference on the buffer head held by the entry.
> + */
> +int ext3_remove_ordered_extent(struct inode *inode,
> + struct ext3_ordered_extent *entry)
> +{
> + struct ext3_ordered_buffers *buffers;
> +
> + buffers = &EXT3_I(inode)->ordered_buffers;
> +
> + /*
> + * the data=guarded end_io handler takes this guarded_lock
> + * before it puts a given buffer head and its ordered extent
> + * into the guarded_buffers list. We need to make sure
> + * we don't race with them, so we take the guarded_lock too.
> + */
> + spin_lock_irq(&EXT3_SB(inode->i_sb)->guarded_lock);
> + clear_buffer_dataguarded(entry->bh);
> + entry->bh->b_private = NULL;
> + brelse(entry->bh);
> + entry->bh = NULL;
> + spin_unlock_irq(&EXT3_SB(inode->i_sb)->guarded_lock);
> +
> + /*
> + * we must not clear entry->end_io_bh, that is set by
> + * the end_io handlers and will be cleared by the end_io
> + * workqueue
> + */
> +
> + list_del_init(&entry->ordered_list);
> + ext3_put_ordered_extent(entry);
> + return 0;
> +}
> +
> +/*
> + * After an extent is done, call this to conditionally update the on disk
> + * i_size. i_size is updated to cover any fully written part of the file.
> + *
> + * This returns < 0 on error, zero if no action needs to be taken and
> + * 1 if the inode must be logged.
> + */
> +int ext3_ordered_update_i_size(struct inode *inode)
> +{
> + u64 new_size;
> + u64 disk_size;
> + struct ext3_ordered_extent *test;
> + struct ext3_ordered_buffers *buffers = &EXT3_I(inode)->ordered_buffers;
> + int ret = 0;
> +
> + disk_size = EXT3_I(inode)->i_disksize;
> +
> + /*
> + * if the disk i_size is already at the inode->i_size, we're done
> + */
> + if (disk_size >= inode->i_size)
> + goto out;
> +
> + /*
> + * if the ordered list is empty, push the disk i_size all the way
> + * up to the inode size, otherwise, use the start of the first
> + * ordered extent in the list as the new disk i_size
> + */
> + if (list_empty(&buffers->ordered_list)) {
> + new_size = inode->i_size;
> + } else {
> + test = list_entry(buffers->ordered_list.next,
> + struct ext3_ordered_extent, ordered_list);
> +
> + new_size = test->start;
> + }
> +
> + new_size = min_t(u64, new_size, i_size_read(inode));
> +
> + /* the caller needs to log this inode */
> + ret = 1;
> +
> + EXT3_I(inode)->i_disksize = new_size;
> +out:
> + return ret;
> +}
> +
> +/*
> + * during a truncate or delete, we need to get rid of pending
> + * ordered extents so there isn't a war over who updates disk i_size first.
> + * This does that, without waiting for any of the IO to actually finish.
> + *
> + * When the IO does finish, it will find the ordered extent removed from the
> + * list and all will work properly.
> + */
> +void ext3_truncate_ordered_extents(struct inode *inode, u64 offset)
> +{
> + struct ext3_ordered_buffers *buffers = &EXT3_I(inode)->ordered_buffers;
> + struct ext3_ordered_extent *test;
> +
> + spin_lock(&buffers->lock);
> + while (!list_empty(&buffers->ordered_list)) {
> +
> + test = list_entry(buffers->ordered_list.prev,
> + struct ext3_ordered_extent, ordered_list);
> +
> + if (test->start < offset)
> + break;
> + /*
> + * once this is called, the end_io handler won't run,
> + * and we won't update disk_i_size to include this buffer.
> + *
> + * That's ok for truncates because the truncate code is
> + * writing a new i_size.
> + *
> + * This ignores any IO in flight, which is ok
> + * because the guarded_buffers list has a reference
> + * on the ordered extent
> + */
> + ext3_remove_ordered_extent(inode, test);
> + }
> + spin_unlock(&buffers->lock);
> + return;
> +
> +}
> +
> +void ext3_ordered_inode_init(struct ext3_inode_info *ei)
> +{
> + INIT_LIST_HEAD(&ei->ordered_buffers.ordered_list);
> + spin_lock_init(&ei->ordered_buffers.lock);
> +}
> +
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 599dbfe..1e0eff8 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -37,6 +37,7 @@
> #include <linux/quotaops.h>
> #include <linux/seq_file.h>
> #include <linux/log2.h>
> +#include <linux/workqueue.h>
>
> #include <asm/uaccess.h>
>
> @@ -399,6 +400,9 @@ static void ext3_put_super (struct super_block * sb)
> struct ext3_super_block *es = sbi->s_es;
> int i, err;
>
> + flush_workqueue(sbi->guarded_wq);
> + destroy_workqueue(sbi->guarded_wq);
> +
> ext3_xattr_put_super(sb);
> err = journal_destroy(sbi->s_journal);
> sbi->s_journal = NULL;
> @@ -468,6 +472,8 @@ static struct inode *ext3_alloc_inode(struct super_block *sb)
> #endif
> ei->i_block_alloc_info = NULL;
> ei->vfs_inode.i_version = 1;
> + ext3_ordered_inode_init(ei);
> +
> return &ei->vfs_inode;
> }
>
> @@ -481,6 +487,8 @@ static void ext3_destroy_inode(struct inode *inode)
> false);
> dump_stack();
> }
> + if (!list_empty(&EXT3_I(inode)->ordered_buffers.ordered_list))
> + printk(KERN_INFO "EXT3 ordered list not empty\n");
> kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
> }
>
> @@ -528,6 +536,13 @@ static void ext3_clear_inode(struct inode *inode)
> EXT3_I(inode)->i_default_acl = EXT3_ACL_NOT_CACHED;
> }
> #endif
> + /*
> + * If pages got cleaned by truncate, truncate should have
> + * gotten rid of the ordered extents. Just in case, drop them
> + * here.
> + */
> + ext3_truncate_ordered_extents(inode, 0);
> +
> ext3_discard_reservation(inode);
> EXT3_I(inode)->i_block_alloc_info = NULL;
> if (unlikely(rsv))
> @@ -634,6 +649,8 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
> seq_puts(seq, ",data=journal");
> else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA)
> seq_puts(seq, ",data=ordered");
> + else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA)
> + seq_puts(seq, ",data=guarded");
> else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
> seq_puts(seq, ",data=writeback");
>
> @@ -790,7 +807,7 @@ enum {
> Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
> Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
> Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
> - Opt_data_err_abort, Opt_data_err_ignore,
> + Opt_data_guarded, Opt_data_err_abort, Opt_data_err_ignore,
> Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
> Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
> Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
> @@ -832,6 +849,7 @@ static const match_table_t tokens = {
> {Opt_abort, "abort"},
> {Opt_data_journal, "data=journal"},
> {Opt_data_ordered, "data=ordered"},
> + {Opt_data_guarded, "data=guarded"},
> {Opt_data_writeback, "data=writeback"},
> {Opt_data_err_abort, "data_err=abort"},
> {Opt_data_err_ignore, "data_err=ignore"},
> @@ -1034,6 +1052,9 @@ static int parse_options (char *options, struct super_block *sb,
> case Opt_data_ordered:
> data_opt = EXT3_MOUNT_ORDERED_DATA;
> goto datacheck;
> + case Opt_data_guarded:
> + data_opt = EXT3_MOUNT_GUARDED_DATA;
> + goto datacheck;
> case Opt_data_writeback:
> data_opt = EXT3_MOUNT_WRITEBACK_DATA;
> datacheck:
> @@ -1949,11 +1970,23 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
> clear_opt(sbi->s_mount_opt, NOBH);
> }
> }
> +
> + /*
> + * setup the guarded work list
> + */
> + INIT_LIST_HEAD(&EXT3_SB(sb)->guarded_buffers);
> + INIT_WORK(&EXT3_SB(sb)->guarded_work, ext3_run_guarded_work);
> + spin_lock_init(&EXT3_SB(sb)->guarded_lock);
> + EXT3_SB(sb)->guarded_wq = create_workqueue("ext3-guard");
> + if (!EXT3_SB(sb)->guarded_wq) {
> + printk(KERN_ERR "EXT3-fs: failed to create workqueue\n");
> + goto failed_mount_guard;
> + }
> +
> /*
> * The journal_load will have done any necessary log recovery,
> * so we can safely mount the rest of the filesystem now.
> */
> -
> root = ext3_iget(sb, EXT3_ROOT_INO);
> if (IS_ERR(root)) {
> printk(KERN_ERR "EXT3-fs: get root inode failed\n");
> @@ -1965,6 +1998,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
> printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n");
> goto failed_mount4;
> }
> +
> sb->s_root = d_alloc_root(root);
> if (!sb->s_root) {
> printk(KERN_ERR "EXT3-fs: get root dentry failed\n");
> @@ -1974,6 +2008,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
> }
>
> ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY);
> +
> /*
> * akpm: core read_super() calls in here with the superblock locked.
> * That deadlocks, because orphan cleanup needs to lock the superblock
> @@ -1989,9 +2024,10 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
> printk (KERN_INFO "EXT3-fs: recovery complete.\n");
> ext3_mark_recovery_complete(sb, es);
> printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
> - test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
> - test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
> - "writeback");
> + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal" :
> + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA ? "guarded" :
> + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered" :
> + "writeback");
>
> lock_kernel();
> return 0;
> @@ -2003,6 +2039,8 @@ cantfind_ext3:
> goto failed_mount;
>
> failed_mount4:
> + destroy_workqueue(EXT3_SB(sb)->guarded_wq);
> +failed_mount_guard:
> journal_destroy(sbi->s_journal);
> failed_mount3:
> percpu_counter_destroy(&sbi->s_freeblocks_counter);
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index ed886e6..1354a55 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -2018,6 +2018,7 @@ zap_buffer_unlocked:
> clear_buffer_mapped(bh);
> clear_buffer_req(bh);
> clear_buffer_new(bh);
> + clear_buffer_datanew(bh);
> bh->b_bdev = NULL;
> return may_free;
> }
> diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
> index 634a5e5..a20bd4f 100644
> --- a/include/linux/ext3_fs.h
> +++ b/include/linux/ext3_fs.h
> @@ -18,6 +18,7 @@
>
> #include <linux/types.h>
> #include <linux/magic.h>
> +#include <linux/workqueue.h>
>
> /*
> * The second extended filesystem constants/structures
> @@ -398,7 +399,6 @@ struct ext3_inode {
> #define EXT3_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
> #define EXT3_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
> #define EXT3_MOUNT_ABORT 0x00200 /* Fatal error detected */
> -#define EXT3_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
> #define EXT3_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
> #define EXT3_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
> #define EXT3_MOUNT_WRITEBACK_DATA 0x00C00 /* No data ordering */
> @@ -414,6 +414,12 @@ struct ext3_inode {
> #define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
> #define EXT3_MOUNT_DATA_ERR_ABORT 0x400000 /* Abort on file data write
> * error in ordered mode */
> +#define EXT3_MOUNT_GUARDED_DATA 0x800000 /* guard new writes with
> + i_size */
> +#define EXT3_MOUNT_DATA_FLAGS (EXT3_MOUNT_JOURNAL_DATA | \
> + EXT3_MOUNT_ORDERED_DATA | \
> + EXT3_MOUNT_WRITEBACK_DATA | \
> + EXT3_MOUNT_GUARDED_DATA)
>
> /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
> #ifndef _LINUX_EXT2_FS_H
> @@ -892,6 +898,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
> extern void ext3_set_aops(struct inode *inode);
> extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> u64 start, u64 len);
> +void ext3_run_guarded_work(struct work_struct *work);
>
> /* ioctl.c */
> extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
> @@ -900,6 +907,7 @@ extern long ext3_compat_ioctl(struct file *, unsigned int, unsigned long);
> /* namei.c */
> extern int ext3_orphan_add(handle_t *, struct inode *);
> extern int ext3_orphan_del(handle_t *, struct inode *);
> +extern int ext3_orphan_del_locked(handle_t *, struct inode *);
> extern int ext3_htree_fill_tree(struct file *dir_file, __u32 start_hash,
> __u32 start_minor_hash, __u32 *next_hash);
>
> @@ -945,7 +953,30 @@ extern const struct inode_operations ext3_special_inode_operations;
> extern const struct inode_operations ext3_symlink_inode_operations;
> extern const struct inode_operations ext3_fast_symlink_inode_operations;
>
> +/* ordered-data.c */
> +int ext3_add_ordered_extent(struct inode *inode, u64 file_offset,
> + struct buffer_head *bh);
> +int ext3_put_ordered_extent(struct ext3_ordered_extent *entry);
> +int ext3_remove_ordered_extent(struct inode *inode,
> + struct ext3_ordered_extent *entry);
> +int ext3_ordered_update_i_size(struct inode *inode);
> +void ext3_ordered_inode_init(struct ext3_inode_info *ei);
> +void ext3_truncate_ordered_extents(struct inode *inode, u64 offset);
> +
> +static inline void ext3_ordered_lock(struct inode *inode)
> +{
> + spin_lock(&EXT3_I(inode)->ordered_buffers.lock);
> +}
>
> +static inline void ext3_ordered_unlock(struct inode *inode)
> +{
> + spin_unlock(&EXT3_I(inode)->ordered_buffers.lock);
> +}
> +
> +static inline void ext3_get_ordered_extent(struct ext3_ordered_extent *entry)
> +{
> + atomic_inc(&entry->refs);
> +}
> #endif /* __KERNEL__ */
>
> #endif /* _LINUX_EXT3_FS_H */
> diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
> index 7894dd0..11dd4d4 100644
> --- a/include/linux/ext3_fs_i.h
> +++ b/include/linux/ext3_fs_i.h
> @@ -65,6 +65,49 @@ struct ext3_block_alloc_info {
> #define rsv_end rsv_window._rsv_end
>
> /*
> + * used to prevent garbage in files after a crash by
> + * making sure i_size isn't updated until after the IO
> + * is done.
> + *
> + * See fs/ext3/ordered-data.c for the code that uses these.
> + */
> +struct buffer_head;
> +struct ext3_ordered_buffers {
> + /* protects the list and disk i_size */
> + spinlock_t lock;
> +
> + struct list_head ordered_list;
> +};
> +
> +struct ext3_ordered_extent {
> + /* logical offset of the block in the file
> + * strictly speaking we don't need this
> + * but keep it in the struct for
> + * debugging
> + */
> + u64 start;
> +
> + /* buffer head being written */
> + struct buffer_head *bh;
> +
> + /*
> + * set at end_io time so we properly
> + * do IO accounting even when this ordered
> + * extent struct has been removed from the
> + * list
> + */
> + struct buffer_head *end_io_bh;
> +
> + /* number of refs on this ordered extent */
> + atomic_t refs;
> +
> + struct list_head ordered_list;
> +
> + /* list of things being processed by the workqueue */
> + struct list_head work_list;
> +};
> +
> +/*
> * third extended file system inode data in memory
> */
> struct ext3_inode_info {
> @@ -141,6 +184,8 @@ struct ext3_inode_info {
> * by other means, so we have truncate_mutex.
> */
> struct mutex truncate_mutex;
> +
> + struct ext3_ordered_buffers ordered_buffers;
> struct inode vfs_inode;
> };
>
> diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
> index f07f34d..5dbdbeb 100644
> --- a/include/linux/ext3_fs_sb.h
> +++ b/include/linux/ext3_fs_sb.h
> @@ -21,6 +21,7 @@
> #include <linux/wait.h>
> #include <linux/blockgroup_lock.h>
> #include <linux/percpu_counter.h>
> +#include <linux/workqueue.h>
> #endif
> #include <linux/rbtree.h>
>
> @@ -82,6 +83,11 @@ struct ext3_sb_info {
> char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */
> int s_jquota_fmt; /* Format of quota to use */
> #endif
> +
> + struct workqueue_struct *guarded_wq;
> + struct work_struct guarded_work;
> + struct list_head guarded_buffers;
> + spinlock_t guarded_lock;
> };
>
> static inline spinlock_t *
> diff --git a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h
> index cf82d51..45cb4aa 100644
> --- a/include/linux/ext3_jbd.h
> +++ b/include/linux/ext3_jbd.h
> @@ -212,6 +212,17 @@ static inline int ext3_should_order_data(struct inode *inode)
> return 0;
> }
>
> +static inline int ext3_should_guard_data(struct inode *inode)
> +{
> + if (!S_ISREG(inode->i_mode))
> + return 0;
> + if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL)
> + return 0;
> + if (test_opt(inode->i_sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
> + return 1;
> + return 0;
> +}
> +
> static inline int ext3_should_writeback_data(struct inode *inode)
> {
> if (!S_ISREG(inode->i_mode))
> diff --git a/include/linux/jbd.h b/include/linux/jbd.h
> index c2049a0..bbb7990 100644
> --- a/include/linux/jbd.h
> +++ b/include/linux/jbd.h
> @@ -291,6 +291,13 @@ enum jbd_state_bits {
> BH_State, /* Pins most journal_head state */
> BH_JournalHead, /* Pins bh->b_private and jh->b_bh */
> BH_Unshadow, /* Dummy bit, for BJ_Shadow wakeup filtering */
> + BH_DataGuarded, /* ext3 data=guarded mode buffer
> + * these have something other than a
> + * journal_head at b_private */
> + BH_DataNew, /* BH_new gets cleared too early for
> + * data=guarded to use it. So,
> + * this gets set instead.
> + */
> };
>
> BUFFER_FNS(JBD, jbd)
> @@ -302,6 +309,9 @@ TAS_BUFFER_FNS(Revoked, revoked)
> BUFFER_FNS(RevokeValid, revokevalid)
> TAS_BUFFER_FNS(RevokeValid, revokevalid)
> BUFFER_FNS(Freed, freed)
> +BUFFER_FNS(DataGuarded, dataguarded)
> +BUFFER_FNS(DataNew, datanew)
> +TAS_BUFFER_FNS(DataNew, datanew)
>
> static inline struct buffer_head *jh2bh(struct journal_head *jh)
> {
> --

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-04-29 20:38:05

by Chris Mason

Subject: Re: [PATCH RFC] ext3 data=guarded v5

On Wed, 2009-04-29 at 22:04 +0200, Jan Kara wrote:

> > What we don't want to do is have a call to write() over existing blocks
> > in the file add new things to the data=ordered list. I don't see how we
> > can avoid that without datanew.
> Yes, what I suggest would do exactly that:
> In ordered_writepage() in the beginning we do:
> page_bufs = page_buffers(page);
> if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
> NULL, buffer_unmapped)) {
> return block_write_full_page(page, NULL, wbc);
> }
> So we only get to starting a transaction and file some buffers if some buffer
> in the page is unmapped. Write() maps / allocates all buffers in write_begin()
> so they are never added to ordered lists in writepage().

Right, writepage doesn't really need datanew.

> We rely on write_end
> to do it. So the only case where not all buffers in the page are mapped is
> when we have to allocate in writepage() (mmaped write) or the two cases I
> describe above.

But I still think write_end does need datanew. That's where 99% of the
ordered buffers are going to come from when we overwrite the contents of
an existing file.
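
To make that concrete, here is a stripped-down sketch of the per-buffer
decision write_end has to make. The helper name is made up for
illustration; the real callback is journal_dirty_data_guarded_fn in the
patch below, which also handles the orphan bookkeeping and error paths:

static int guarded_write_end_decision(handle_t *handle, struct buffer_head *bh)
{
        struct inode *inode = bh->b_page->mapping->host;
        u64 offset = page_offset(bh->b_page) + bh_offset(bh);
        int was_new;

        /* write may have mapped the buffer without copying the data in yet */
        if (!buffer_mapped(bh) || !buffer_uptodate(bh))
                return 0;

        was_new = test_clear_buffer_datanew(bh);

        if (offset < inode->i_size) {
                /* newly allocated block filling a hole inside i_size:
                 * i_size can't protect it, fall back to data=ordered */
                if (was_new)
                        return ext3_journal_dirty_data(handle, bh);
                /* plain overwrite of an existing block: nothing to order */
                return 0;
        }

        /* block beyond i_size: guard it, and let the end_io workqueue
         * push the on-disk i_size up once the data really is on disk */
        return ext3_add_ordered_extent(inode, offset, bh);
}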

-chris



2009-04-29 20:54:32

by Chris Mason

Subject: [PATCH RFC] ext3 data=guarded v7

And here's v7, which is similar to v6 but fixes the extra spin_unlock
that Jan Kara noticed. This also has a small reordering of the i_size
updates in O_DIRECT. I'll hammer on it overnight.

Thanks to Jan for all the review.

-chris

ext3 data=ordered mode makes sure that data blocks are on disk before
the metadata that references them, which avoids files full of garbage
or previously deleted data after a crash. It does this by adding every dirty
buffer onto a list of things that must be written before a commit.

This makes every fsync write out all the dirty data on the entire FS, which
has high latencies and is generally much more expensive than it needs to be.

Another way to avoid exposing stale data after a crash is to wait until
after the data buffers are written before updating the on-disk record
of the file's size. If we crash before the data IO is done, i_size
doesn't yet include the new blocks and no stale data is exposed.

This patch adds the delayed i_size update to ext3, along with a new
mount option (data=guarded) to enable it. The basic mechanism works like
this:

* Change block_write_full_page to take an end_io handler as a parameter.
This allows us to make an end_io handler that queues buffer heads for
a workqueue where the real work of updating the on disk i_size is done.

* Add a list to the in-memory ext3 inode for tracking data=guarded
buffer heads that are waiting to be sent to disk.

* Add an ext3 guarded write_end call to add buffer heads for newly
allocated blocks into the list. If we have a newly allocated block that is
filling a hole inside i_size, this is done as an old style data=ordered write
instead.

* Add an ext3 guarded writepage call that uses a special buffer head
end_io handler for buffers that are marked as guarded. Again, if we find
newly allocated blocks filling holes, they are sent through data=ordered
instead of data=guarded.

* When a guarded IO finishes, kick a per-FS workqueue to do the
on disk i_size updates. The workqueue function must be very careful. We only
update the on disk i_size if all of the IO between the old on disk i_size and
the new on disk i_size is complete. The on disk i_size is incrementally
updated to the largest safe value every time an IO completes (see the sketch after this list).

* When we start tracking guarded buffers on a given inode, we put the
inode into ext3's orphan list. This way if we do crash, the file will
be truncated back down to the on disk i_size and we'll free any blocks that
were not completely written. The inode is removed from the orphan list
only after all the guarded buffers are done.
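
The "largest safe value" rule above is the heart of the workqueue. With
the locking, the orphan handling and the i_size race checks stripped
out, it amounts to something like this hypothetical helper (the real
version is ext3_ordered_update_i_size() in ordered-data.c below):

static int update_disk_i_size(struct inode *inode)
{
        struct ext3_ordered_buffers *buffers = &EXT3_I(inode)->ordered_buffers;
        struct ext3_ordered_extent *first;
        u64 new_size;

        /* the on-disk size already matches the in-memory size */
        if (EXT3_I(inode)->i_disksize >= inode->i_size)
                return 0;

        if (list_empty(&buffers->ordered_list)) {
                /* every guarded IO has finished, go all the way to i_size */
                new_size = inode->i_size;
        } else {
                /* stop just short of the first extent still in flight */
                first = list_entry(buffers->ordered_list.next,
                                   struct ext3_ordered_extent, ordered_list);
                new_size = first->start;
        }

        EXT3_I(inode)->i_disksize = min_t(u64, new_size, i_size_read(inode));
        return 1;       /* the caller must log the inode */
}

So if guarded writes are pending at offsets 100k and 200k and the 200k IO
completes first, the on disk i_size only advances to 100k; once the 100k
IO completes as well the list is empty and it jumps up to i_size.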

Signed-off-by: Chris Mason <[email protected]>

---
fs/ext3/Makefile | 3 +-
fs/ext3/fsync.c | 12 +
fs/ext3/inode.c | 599 +++++++++++++++++++++++++++++++++++++++++++-
fs/ext3/namei.c | 21 +-
fs/ext3/ordered-data.c | 235 +++++++++++++++++
fs/ext3/super.c | 48 ++++-
fs/jbd/transaction.c | 1 +
include/linux/ext3_fs.h | 33 +++-
include/linux/ext3_fs_i.h | 45 ++++
include/linux/ext3_fs_sb.h | 6 +
include/linux/ext3_jbd.h | 11 +
include/linux/jbd.h | 10 +
12 files changed, 998 insertions(+), 26 deletions(-)

diff --git a/fs/ext3/Makefile b/fs/ext3/Makefile
index e77766a..f3a9dc1 100644
--- a/fs/ext3/Makefile
+++ b/fs/ext3/Makefile
@@ -5,7 +5,8 @@
obj-$(CONFIG_EXT3_FS) += ext3.o

ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
- ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o
+ ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o \
+ ordered-data.o

ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index d336341..a50abb4 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -59,6 +59,11 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
* sync_inode() will write the inode if it is dirty. Then the caller's
* filemap_fdatawait() will wait on the pages.
*
+ * data=guarded:
+ * The caller's filemap_fdatawrite will start the IO, and we
+ * use filemap_fdatawait here to make sure all the disk i_size updates
+ * are done before we commit the inode.
+ *
* data=journal:
* filemap_fdatawrite won't do anything (the buffers are clean).
* ext3_force_commit will write the file data into the journal and
@@ -84,6 +89,13 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
+ /*
+ * the new disk i_size must be logged before we commit,
+ * so we wait here for pending writeback
+ */
+ if (ext3_should_guard_data(inode))
+ filemap_write_and_wait(inode->i_mapping);
+
ret = sync_inode(inode, &wbc);
}
out:
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index fcfa243..edeb71d 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -38,6 +38,7 @@
#include <linux/bio.h>
#include <linux/fiemap.h>
#include <linux/namei.h>
+#include <linux/workqueue.h>
#include "xattr.h"
#include "acl.h"

@@ -179,6 +180,105 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode)
}

/*
+ * after a data=guarded IO is done, we need to update the
+ * disk i_size to reflect the data we've written. If there are
+ * no more ordered data extents left in the list, we need to
+ * get rid of the orphan entry making sure the file's
+ * block pointers match the i_size after a crash
+ *
+ * When we aren't in data=guarded mode, this just does an ext3_orphan_del.
+ *
+ * It returns the result of ext3_orphan_del.
+ *
+ * handle may be null if we are just cleaning up the orphan list in
+ * memory.
+ *
+ * pass must_log == 1 when the inode must be logged in order to get
+ * an i_size update on disk
+ */
+static int orphan_del(handle_t *handle, struct inode *inode, int must_log)
+{
+ int ret = 0;
+ struct list_head *ordered_list;
+
+ ordered_list = &EXT3_I(inode)->ordered_buffers.ordered_list;
+
+ /* fast out when data=guarded isn't on */
+ if (!ext3_should_guard_data(inode))
+ return ext3_orphan_del(handle, inode);
+
+ ext3_ordered_lock(inode);
+ if (inode->i_nlink && list_empty(ordered_list)) {
+ ext3_ordered_unlock(inode);
+
+ lock_super(inode->i_sb);
+
+ /*
+ * now that we have the lock make sure we are allowed to
+ * get rid of the orphan. This way we make sure our
+ * test isn't happening concurrently with someone else
+ * adding an orphan. Memory barrier for the ordered list.
+ */
+ smp_mb();
+ if (inode->i_nlink == 0 || !list_empty(ordered_list)) {
+ unlock_super(inode->i_sb);
+ goto out;
+ }
+
+ /*
+ * if we aren't actually on the orphan list, the orphan
+ * del won't log our inode. Log it now to make sure
+ */
+ ext3_mark_inode_dirty(handle, inode);
+
+ ret = ext3_orphan_del_locked(handle, inode);
+
+ unlock_super(inode->i_sb);
+ } else if (handle && must_log) {
+ ext3_ordered_unlock(inode);
+
+ /*
+ * we need to make sure any updates done by the data=guarded
+ * code end up in the inode on disk. Log the inode
+ * here
+ */
+ ext3_mark_inode_dirty(handle, inode);
+ } else {
+ ext3_ordered_unlock(inode);
+ }
+
+out:
+ return ret;
+}
+
+/*
+ * Wrapper around orphan_del that starts a transaction
+ */
+static void orphan_del_trans(struct inode *inode, int must_log)
+{
+ handle_t *handle;
+
+ handle = ext3_journal_start(inode, 3);
+
+ /*
+ * uhoh, should we flag the FS as readonly here? ext3_dirty_inode
+ * doesn't, which is what we're modeling ourselves after.
+ *
+ * We do need to make sure to get this inode off the ordered list
+ * when the transaction start fails though. orphan_del
+ * does the right thing.
+ */
+ if (IS_ERR(handle)) {
+ orphan_del(NULL, inode, 0);
+ return;
+ }
+
+ orphan_del(handle, inode, must_log);
+ ext3_journal_stop(handle);
+}
+
+
+/*
* Called at the last iput() if i_nlink is zero.
*/
void ext3_delete_inode (struct inode * inode)
@@ -204,6 +304,13 @@ void ext3_delete_inode (struct inode * inode)
if (IS_SYNC(inode))
handle->h_sync = 1;
inode->i_size = 0;
+
+ /*
+ * make sure we clean up any ordered extents that didn't get
+ * IO started on them because i_size shrunk down to zero.
+ */
+ ext3_truncate_ordered_extents(inode, 0);
+
if (inode->i_blocks)
ext3_truncate(inode);
/*
@@ -767,6 +874,24 @@ err_out:
}

/*
+ * This protects the disk i_size with the spinlock for the ordered
+ * extent tree. It returns 1 when the inode needs to be logged
+ * because the i_disksize has been updated.
+ */
+static int maybe_update_disk_isize(struct inode *inode, loff_t new_size)
+{
+ int ret = 0;
+
+ ext3_ordered_lock(inode);
+ if (EXT3_I(inode)->i_disksize < new_size) {
+ EXT3_I(inode)->i_disksize = new_size;
+ ret = 1;
+ }
+ ext3_ordered_unlock(inode);
+ return ret;
+}
+
+/*
* Allocation strategy is simple: if we have to allocate something, we will
* have to go the whole way to leaf. So let's do it before attaching anything
* to tree, set linkage between the newborn blocks, write them if sync is
@@ -815,6 +940,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
if (!partial) {
first_block = le32_to_cpu(chain[depth - 1].key);
clear_buffer_new(bh_result);
+ clear_buffer_datanew(bh_result);
count++;
/*map more blocks*/
while (count < maxblocks && count <= blocks_to_boundary) {
@@ -873,6 +999,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
if (err)
goto cleanup;
clear_buffer_new(bh_result);
+ clear_buffer_datanew(bh_result);
goto got_it;
}
}
@@ -915,14 +1042,18 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
* i_disksize growing is protected by truncate_mutex. Don't forget to
* protect it if you're about to implement concurrent
* ext3_get_block() -bzzz
+ *
+ * extend_disksize is only called for directories, and so
+ * it is not using guarded buffer protection.
*/
- if (!err && extend_disksize && inode->i_size > ei->i_disksize)
+ if (!err && extend_disksize)
ei->i_disksize = inode->i_size;
mutex_unlock(&ei->truncate_mutex);
if (err)
goto cleanup;

set_buffer_new(bh_result);
+ set_buffer_datanew(bh_result);
got_it:
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
if (count > blocks_to_boundary)
@@ -1079,6 +1210,77 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
return NULL;
}

+/*
+ * data=guarded updates are handled in a workqueue after the IO
+ * is done. This runs through the list of buffer heads that are pending
+ * processing.
+ */
+void ext3_run_guarded_work(struct work_struct *work)
+{
+ struct ext3_sb_info *sbi =
+ container_of(work, struct ext3_sb_info, guarded_work);
+ struct buffer_head *bh;
+ struct ext3_ordered_extent *ordered;
+ struct inode *inode;
+ struct page *page;
+ int must_log;
+
+ spin_lock_irq(&sbi->guarded_lock);
+ while (!list_empty(&sbi->guarded_buffers)) {
+ ordered = list_entry(sbi->guarded_buffers.next,
+ struct ext3_ordered_extent, work_list);
+
+ list_del(&ordered->work_list);
+
+ bh = ordered->end_io_bh;
+ ordered->end_io_bh = NULL;
+ must_log = 0;
+
+ /* we don't need a reference on the buffer head because
+ * it is locked until the end_io handler is called.
+ *
+ * This means the page can't go away, which means the
+ * inode can't go away
+ */
+ spin_unlock_irq(&sbi->guarded_lock);
+
+ page = bh->b_page;
+ inode = page->mapping->host;
+
+ ext3_ordered_lock(inode);
+ if (ordered->bh) {
+ /*
+ * someone might have decided this buffer didn't
+ * really need to be ordered and removed us from
+ * the list. They set ordered->bh to null
+ * when that happens.
+ */
+ ext3_remove_ordered_extent(inode, ordered);
+ must_log = ext3_ordered_update_i_size(inode);
+ }
+ ext3_ordered_unlock(inode);
+
+ /*
+ * drop the reference taken when this ordered extent was
+ * put onto the guarded_buffers list
+ */
+ ext3_put_ordered_extent(ordered);
+
+ /*
+ * maybe log the inode and/or cleanup the orphan entry
+ */
+ orphan_del_trans(inode, must_log > 0);
+
+ /*
+ * finally, call the real bh end_io function to do
+ * all the hard work of maintaining page writeback.
+ */
+ end_buffer_async_write(bh, buffer_uptodate(bh));
+ spin_lock_irq(&sbi->guarded_lock);
+ }
+ spin_unlock_irq(&sbi->guarded_lock);
+}
+
static int walk_page_buffers( handle_t *handle,
struct buffer_head *head,
unsigned from,
@@ -1185,6 +1387,7 @@ retry:
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, do_journal_get_write_access);
}
+
write_begin_failed:
if (ret) {
/*
@@ -1212,7 +1415,13 @@ out:

int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
{
- int err = journal_dirty_data(handle, bh);
+ int err;
+
+ /* don't take buffers from the data=guarded list */
+ if (buffer_dataguarded(bh))
+ return 0;
+
+ err = journal_dirty_data(handle, bh);
if (err)
ext3_journal_abort_handle(__func__, __func__,
bh, handle, err);
@@ -1231,6 +1440,98 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
return 0;
}

+/*
+ * Walk the buffers in a page for data=guarded mode. Buffers that
+ * are not marked as datanew are ignored.
+ *
+ * New buffers outside i_size are sent to the data guarded code
+ *
+ * We must do the old data=ordered mode when filling holes in the
+ * file, since i_size doesn't protect these at all.
+ */
+static int journal_dirty_data_guarded_fn(handle_t *handle,
+ struct buffer_head *bh)
+{
+ u64 offset = page_offset(bh->b_page) + bh_offset(bh);
+ struct inode *inode = bh->b_page->mapping->host;
+ int ret = 0;
+ int was_new;
+
+ /*
+ * Write could have mapped the buffer but it didn't copy the data in
+ * yet. So avoid filing such buffer into a transaction.
+ */
+ if (!buffer_mapped(bh) || !buffer_uptodate(bh))
+ return 0;
+
+ was_new = test_clear_buffer_datanew(bh);
+
+ if (offset < inode->i_size) {
+ /*
+ * if we're filling a hole inside i_size, we need to
+ * fall back to the old style data=ordered
+ */
+ if (was_new)
+ ret = ext3_journal_dirty_data(handle, bh);
+ goto out;
+ }
+ ret = ext3_add_ordered_extent(inode, offset, bh);
+
+ /* if we crash before the IO is done, i_size will be small
+ * but these blocks will still be allocated to the file.
+ *
+ * So, add an orphan entry for the file, which will truncate it
+ * down to the i_size it finds after the crash.
+ *
+ * The orphan is cleaned up when the IO is done. We
+ * don't add orphans while mount is running the orphan list,
+ * that seems to corrupt the list.
+ *
+ * We're testing list_empty on the i_orphan list, but
+ * right here we have i_mutex held. So the only place that
+ * is going to race around and remove us from the orphan
+ * list is the work queue to process completed guarded
+ * buffers. That will find the ordered_extent we added
+ * above and leave us on the orphan list.
+ */
+ if (ret == 0 && buffer_dataguarded(bh) &&
+ list_empty(&EXT3_I(inode)->i_orphan) &&
+ !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) {
+ ret = ext3_orphan_add(handle, inode);
+ }
+out:
+ return ret;
+}
+
+/*
+ * Walk the buffers in a page for data=guarded mode for writepage.
+ *
+ * We must do the old data=ordered mode when filling holes in the
+ * file, since i_size doesn't protect these at all.
+ *
+ * This is actually called after writepage is run and so we can't
+ * trust anything other than the buffer head (which we have pinned).
+ *
+ * Any datanew buffer at writepage time is filling a hole, so we don't need
+ * extra tests against the inode size.
+ */
+static int journal_dirty_data_guarded_writepage_fn(handle_t *handle,
+ struct buffer_head *bh)
+{
+ int ret = 0;
+
+ /*
+ * Write could have mapped the buffer but it didn't copy the data in
+ * yet. So avoid filing such buffer into a transaction.
+ */
+ if (!buffer_mapped(bh) || !buffer_uptodate(bh))
+ return 0;
+
+ if (test_clear_buffer_datanew(bh))
+ ret = ext3_journal_dirty_data(handle, bh);
+ return ret;
+}
+
/* For write_end() in data=journal mode */
static int write_end_fn(handle_t *handle, struct buffer_head *bh)
{
@@ -1251,10 +1552,8 @@ static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied)
/* What matters to us is i_disksize. We don't write i_size anywhere */
if (pos + copied > inode->i_size)
i_size_write(inode, pos + copied);
- if (pos + copied > EXT3_I(inode)->i_disksize) {
- EXT3_I(inode)->i_disksize = pos + copied;
+ if (maybe_update_disk_isize(inode, pos + copied))
mark_inode_dirty(inode);
- }
}

/*
@@ -1300,6 +1599,73 @@ static int ext3_ordered_write_end(struct file *file,
return ret ? ret : copied;
}

+static int ext3_guarded_write_end(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ handle_t *handle = ext3_journal_current_handle();
+ struct inode *inode = file->f_mapping->host;
+ unsigned from, to;
+ int ret = 0, ret2;
+
+ copied = block_write_end(file, mapping, pos, len, copied,
+ page, fsdata);
+
+ from = pos & (PAGE_CACHE_SIZE - 1);
+ to = from + copied;
+ ret = walk_page_buffers(handle, page_buffers(page),
+ from, to, NULL, journal_dirty_data_guarded_fn);
+
+ /*
+ * we only update the in-memory i_size. The disk i_size is done
+ * by the end io handlers
+ */
+ if (ret == 0 && pos + copied > inode->i_size) {
+ int must_log;
+
+ /* updated i_size, but we may have raced with a
+ * data=guarded end_io handler.
+ *
+ * All the guarded IO could have ended while i_size was still
+ * small, and if we're just adding bytes into an existing block
+ * in the file, we may not be adding a new guarded IO with this
+ * write. So, do a check on the disk i_size and make sure it
+ * is updated to the highest safe value.
+ *
+ * This may also be required if the
+ * journal_dirty_data_guarded_fn chose to do a fully
+ * ordered write of this buffer instead of a guarded
+ * write.
+ *
+ * ext3_ordered_update_i_size tests inode->i_size, so we
+ * make sure to update it with the ordered lock held.
+ */
+ ext3_ordered_lock(inode);
+ i_size_write(inode, pos + copied);
+ must_log = ext3_ordered_update_i_size(inode);
+ ext3_ordered_unlock(inode);
+
+ orphan_del_trans(inode, must_log > 0);
+ }
+
+ /*
+ * There may be allocated blocks outside of i_size because
+ * we failed to copy some data. Prepare for truncate.
+ */
+ if (pos + len > inode->i_size)
+ ext3_orphan_add(handle, inode);
+ ret2 = ext3_journal_stop(handle);
+ if (!ret)
+ ret = ret2;
+ unlock_page(page);
+ page_cache_release(page);
+
+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ return ret ? ret : copied;
+}
+
static int ext3_writeback_write_end(struct file *file,
struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
@@ -1311,6 +1677,7 @@ static int ext3_writeback_write_end(struct file *file,

copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
update_file_sizes(inode, pos, copied);
+
/*
* There may be allocated blocks outside of i_size because
* we failed to copy some data. Prepare for truncate.
@@ -1574,6 +1941,144 @@ out_fail:
return ret;
}

+/*
+ * Completion handler for block_write_full_page(). This will
+ * kick off the data=guarded workqueue as the IO finishes.
+ */
+static void end_buffer_async_write_guarded(struct buffer_head *bh,
+ int uptodate)
+{
+ struct ext3_sb_info *sbi;
+ struct address_space *mapping;
+ struct ext3_ordered_extent *ordered;
+ unsigned long flags;
+
+ mapping = bh->b_page->mapping;
+ if (!mapping || !bh->b_private || !buffer_dataguarded(bh)) {
+noguard:
+ end_buffer_async_write(bh, uptodate);
+ return;
+ }
+
+ /*
+ * the guarded workqueue function checks the uptodate bit on the
+ * bh and uses that to tell the real end_io handler if things worked
+ * out or not.
+ */
+ if (uptodate)
+ set_buffer_uptodate(bh);
+ else
+ clear_buffer_uptodate(bh);
+
+ sbi = EXT3_SB(mapping->host->i_sb);
+
+ spin_lock_irqsave(&sbi->guarded_lock, flags);
+
+ /*
+ * remove any chance that a truncate raced in and cleared
+ * our dataguard flag, which also freed the ordered extent in
+ * our b_private.
+ */
+ if (!buffer_dataguarded(bh)) {
+ spin_unlock_irqrestore(&sbi->guarded_lock, flags);
+ goto noguard;
+ }
+ ordered = bh->b_private;
+ WARN_ON(ordered->end_io_bh);
+
+ /*
+ * use the special end_io_bh pointer to make sure that
+ * some form of end_io handler is run on this bh, even
+ * if the ordered_extent is removed from the list before
+ * our workqueue ends up processing it.
+ */
+ ordered->end_io_bh = bh;
+ list_add_tail(&ordered->work_list, &sbi->guarded_buffers);
+ ext3_get_ordered_extent(ordered);
+ spin_unlock_irqrestore(&sbi->guarded_lock, flags);
+
+ queue_work(sbi->guarded_wq, &sbi->guarded_work);
+}
+
+static int ext3_guarded_writepage(struct page *page,
+ struct writeback_control *wbc)
+{
+ struct inode *inode = page->mapping->host;
+ struct buffer_head *page_bufs;
+ handle_t *handle = NULL;
+ int ret = 0;
+ int err;
+
+ J_ASSERT(PageLocked(page));
+
+ /*
+ * We give up here if we're reentered, because it might be for a
+ * different filesystem.
+ */
+ if (ext3_journal_current_handle())
+ goto out_fail;
+
+ if (!page_has_buffers(page)) {
+ create_empty_buffers(page, inode->i_sb->s_blocksize,
+ (1 << BH_Dirty)|(1 << BH_Uptodate));
+ page_bufs = page_buffers(page);
+ } else {
+ page_bufs = page_buffers(page);
+ if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
+ NULL, buffer_unmapped)) {
+ /* Provide NULL get_block() to catch bugs if buffers
+ * weren't really mapped */
+ return block_write_full_page_endio(page, NULL, wbc,
+ end_buffer_async_write_guarded);
+ }
+ }
+ handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
+
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out_fail;
+ }
+
+ walk_page_buffers(handle, page_bufs, 0,
+ PAGE_CACHE_SIZE, NULL, bget_one);
+
+ ret = block_write_full_page_endio(page, ext3_get_block, wbc,
+ end_buffer_async_write_guarded);
+
+ /*
+ * The page can become unlocked at any point now, and
+ * truncate can then come in and change things. So we
+ * can't touch *page from now on. But *page_bufs is
+ * safe due to elevated refcount.
+ */
+
+ /*
+ * And attach them to the current transaction. But only if
+ * block_write_full_page() succeeded. Otherwise they are unmapped,
+ * and generally junk.
+ */
+ if (ret == 0) {
+ err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
+ NULL, journal_dirty_data_guarded_writepage_fn);
+ if (!ret)
+ ret = err;
+ }
+ walk_page_buffers(handle, page_bufs, 0,
+ PAGE_CACHE_SIZE, NULL, bput_one);
+ err = ext3_journal_stop(handle);
+ if (!ret)
+ ret = err;
+
+ return ret;
+
+out_fail:
+ redirty_page_for_writepage(wbc, page);
+ unlock_page(page);
+ return ret;
+}
+
+
+
static int ext3_writeback_writepage(struct page *page,
struct writeback_control *wbc)
{
@@ -1747,7 +2252,14 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
goto out;
}
orphan = 1;
- ei->i_disksize = inode->i_size;
+ /* in guarded mode, other code is responsible
+ * for updating i_disksize. Actually in
+ * every mode, ei->i_disksize should be correct,
+ * so I don't understand why it is getting updated
+ * to i_size here.
+ */
+ if (!ext3_should_guard_data(inode))
+ ei->i_disksize = inode->i_size;
ext3_journal_stop(handle);
}
}
@@ -1768,13 +2280,25 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
ret = PTR_ERR(handle);
goto out;
}
+
if (inode->i_nlink)
- ext3_orphan_del(handle, inode);
+ orphan_del(handle, inode, 0);
+
if (ret > 0) {
loff_t end = offset + ret;
if (end > inode->i_size) {
- ei->i_disksize = end;
+ /* i_mutex keeps other file writes from
+ * hopping in at this time, and we
+ * know the O_DIRECT write just put all
+ * those blocks on disk. But, there
+ * may be guarded writes at lower offsets
+ * in the file that were not forced down.
+ */
i_size_write(inode, end);
+ if (ext3_should_guard_data(inode))
+ ext3_ordered_update_i_size(inode);
+ else
+ ei->i_disksize = end;
/*
* We're going to return a positive `ret'
* here due to non-zero-length I/O, so there's
@@ -1842,6 +2366,21 @@ static const struct address_space_operations ext3_writeback_aops = {
.is_partially_uptodate = block_is_partially_uptodate,
};

+static const struct address_space_operations ext3_guarded_aops = {
+ .readpage = ext3_readpage,
+ .readpages = ext3_readpages,
+ .writepage = ext3_guarded_writepage,
+ .sync_page = block_sync_page,
+ .write_begin = ext3_write_begin,
+ .write_end = ext3_guarded_write_end,
+ .bmap = ext3_bmap,
+ .invalidatepage = ext3_invalidatepage,
+ .releasepage = ext3_releasepage,
+ .direct_IO = ext3_direct_IO,
+ .migratepage = buffer_migrate_page,
+ .is_partially_uptodate = block_is_partially_uptodate,
+};
+
static const struct address_space_operations ext3_journalled_aops = {
.readpage = ext3_readpage,
.readpages = ext3_readpages,
@@ -1860,6 +2399,8 @@ void ext3_set_aops(struct inode *inode)
{
if (ext3_should_order_data(inode))
inode->i_mapping->a_ops = &ext3_ordered_aops;
+ else if (ext3_should_guard_data(inode))
+ inode->i_mapping->a_ops = &ext3_guarded_aops;
else if (ext3_should_writeback_data(inode))
inode->i_mapping->a_ops = &ext3_writeback_aops;
else
@@ -2376,7 +2917,8 @@ void ext3_truncate(struct inode *inode)
if (!ext3_can_truncate(inode))
return;

- if (inode->i_size == 0 && ext3_should_writeback_data(inode))
+ if (inode->i_size == 0 && (ext3_should_writeback_data(inode) ||
+ ext3_should_guard_data(inode)))
ei->i_state |= EXT3_STATE_FLUSH_ON_CLOSE;

/*
@@ -3103,10 +3645,39 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
ext3_journal_stop(handle);
}

+ if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
+ /*
+ * we need to make sure any data=guarded pages
+ * are on disk before we force a new disk i_size
+ * down into the inode. The crucial range is
+ * anything between the disksize on disk now
+ * and the new size we're going to set.
+ *
+ * We're holding i_mutex here, so we know new
+ * ordered extents are not going to appear in the inode
+ *
+ * This must be done both for truncates that make the
+ * file bigger and smaller because truncate messes around
+ * with the orphan inode list in both cases.
+ */
+ if (ext3_should_guard_data(inode)) {
+ filemap_write_and_wait_range(inode->i_mapping,
+ EXT3_I(inode)->i_disksize,
+ (loff_t)-1);
+ /*
+ * we've written everything, make sure all
+ * the ordered extents are really gone.
+ *
+ * This prevents leaking of ordered extents
+ * and it also makes sure the ordered extent code
+ * doesn't mess with the orphan link
+ */
+ ext3_truncate_ordered_extents(inode, 0);
+ }
+ }
if (S_ISREG(inode->i_mode) &&
attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
handle_t *handle;
-
handle = ext3_journal_start(inode, 3);
if (IS_ERR(handle)) {
error = PTR_ERR(handle);
@@ -3114,6 +3685,7 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
}

error = ext3_orphan_add(handle, inode);
+
EXT3_I(inode)->i_disksize = attr->ia_size;
rc = ext3_mark_inode_dirty(handle, inode);
if (!error)
@@ -3125,8 +3697,11 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)

/* If inode_setattr's call to ext3_truncate failed to get a
* transaction handle at all, we need to clean up the in-core
- * orphan list manually. */
- if (inode->i_nlink)
+ * orphan list manually. Because we've finished off all the
+ * guarded IO above, this doesn't hurt anything for the guarded
+ * code
+ */
+ if (inode->i_nlink && (attr->ia_valid & ATTR_SIZE))
ext3_orphan_del(NULL, inode);

if (!rc && (ia_valid & ATTR_MODE))
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 6ff7b97..711549a 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1973,11 +1973,21 @@ out_unlock:
return err;
}

+int ext3_orphan_del(handle_t *handle, struct inode *inode)
+{
+ int ret;
+
+ lock_super(inode->i_sb);
+ ret = ext3_orphan_del_locked(handle, inode);
+ unlock_super(inode->i_sb);
+ return ret;
+}
+
/*
* ext3_orphan_del() removes an unlinked or truncated inode from the list
* of such inodes stored on disk, because it is finally being cleaned up.
*/
-int ext3_orphan_del(handle_t *handle, struct inode *inode)
+int ext3_orphan_del_locked(handle_t *handle, struct inode *inode)
{
struct list_head *prev;
struct ext3_inode_info *ei = EXT3_I(inode);
@@ -1986,11 +1996,8 @@ int ext3_orphan_del(handle_t *handle, struct inode *inode)
struct ext3_iloc iloc;
int err = 0;

- lock_super(inode->i_sb);
- if (list_empty(&ei->i_orphan)) {
- unlock_super(inode->i_sb);
+ if (list_empty(&ei->i_orphan))
return 0;
- }

ino_next = NEXT_ORPHAN(inode);
prev = ei->i_orphan.prev;
@@ -2040,7 +2047,6 @@ int ext3_orphan_del(handle_t *handle, struct inode *inode)
out_err:
ext3_std_error(inode->i_sb, err);
out:
- unlock_super(inode->i_sb);
return err;

out_brelse:
@@ -2410,7 +2416,8 @@ static int ext3_rename (struct inode * old_dir, struct dentry *old_dentry,
ext3_mark_inode_dirty(handle, new_inode);
if (!new_inode->i_nlink)
ext3_orphan_add(handle, new_inode);
- if (ext3_should_writeback_data(new_inode))
+ if (ext3_should_writeback_data(new_inode) ||
+ ext3_should_guard_data(new_inode))
flush_file = 1;
}
retval = 0;
diff --git a/fs/ext3/ordered-data.c b/fs/ext3/ordered-data.c
new file mode 100644
index 0000000..a6dab2d
--- /dev/null
+++ b/fs/ext3/ordered-data.c
@@ -0,0 +1,235 @@
+/*
+ * Copyright (C) 2009 Oracle. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <linux/pagevec.h>
+#include <linux/buffer_head.h>
+#include <linux/ext3_jbd.h>
+
+/*
+ * simple helper to make sure a new entry we're adding is
+ * at a larger offset in the file than the last entry in the list
+ */
+static void check_ordering(struct ext3_ordered_buffers *buffers,
+ struct ext3_ordered_extent *entry)
+{
+ struct ext3_ordered_extent *last;
+
+ if (list_empty(&buffers->ordered_list))
+ return;
+
+ last = list_entry(buffers->ordered_list.prev,
+ struct ext3_ordered_extent, ordered_list);
+ BUG_ON(last->start >= entry->start);
+}
+
+/* allocate and add a new ordered_extent into the per-inode list.
+ * start is the logical offset in the file
+ *
+ * The list is given a single reference on the ordered extent that was
+ * inserted, and it also takes a reference on the buffer head.
+ */
+int ext3_add_ordered_extent(struct inode *inode, u64 start,
+ struct buffer_head *bh)
+{
+ struct ext3_ordered_buffers *buffers;
+ struct ext3_ordered_extent *entry;
+ int ret = 0;
+
+ lock_buffer(bh);
+
+ /* ordered extent already there, or in old style data=ordered */
+ if (bh->b_private) {
+ ret = 0;
+ goto out;
+ }
+
+ buffers = &EXT3_I(inode)->ordered_buffers;
+ entry = kzalloc(sizeof(*entry), GFP_NOFS);
+ if (!entry) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock(&buffers->lock);
+ entry->start = start;
+
+ get_bh(bh);
+ entry->bh = bh;
+ bh->b_private = entry;
+ set_buffer_dataguarded(bh);
+
+ /* one ref for the list */
+ atomic_set(&entry->refs, 1);
+ INIT_LIST_HEAD(&entry->work_list);
+
+ check_ordering(buffers, entry);
+
+ list_add_tail(&entry->ordered_list, &buffers->ordered_list);
+
+ spin_unlock(&buffers->lock);
+out:
+ unlock_buffer(bh);
+ return ret;
+}
+
+/*
+ * used to drop a reference on an ordered extent. This will free
+ * the extent if the last reference is dropped
+ */
+int ext3_put_ordered_extent(struct ext3_ordered_extent *entry)
+{
+ if (atomic_dec_and_test(&entry->refs)) {
+ WARN_ON(entry->bh);
+ WARN_ON(entry->end_io_bh);
+ kfree(entry);
+ }
+ return 0;
+}
+
+/*
+ * remove an ordered extent from the list. This removes the
+ * reference held by the list on 'entry' and the
+ * reference on the buffer head held by the entry.
+ */
+int ext3_remove_ordered_extent(struct inode *inode,
+ struct ext3_ordered_extent *entry)
+{
+ struct ext3_ordered_buffers *buffers;
+
+ buffers = &EXT3_I(inode)->ordered_buffers;
+
+ /*
+ * the data=guarded end_io handler takes this guarded_lock
+ * before it puts a given buffer head and its ordered extent
+ * into the guarded_buffers list. We need to make sure
+ * we don't race with them, so we take the guarded_lock too.
+ */
+ spin_lock_irq(&EXT3_SB(inode->i_sb)->guarded_lock);
+ clear_buffer_dataguarded(entry->bh);
+ entry->bh->b_private = NULL;
+ brelse(entry->bh);
+ entry->bh = NULL;
+ spin_unlock_irq(&EXT3_SB(inode->i_sb)->guarded_lock);
+
+ /*
+ * we must not clear entry->end_io_bh, that is set by
+ * the end_io handlers and will be cleared by the end_io
+ * workqueue
+ */
+
+ list_del_init(&entry->ordered_list);
+ ext3_put_ordered_extent(entry);
+ return 0;
+}
+
+/*
+ * After an extent is done, call this to conditionally update the on disk
+ * i_size. i_size is updated to cover any fully written part of the file.
+ *
+ * This returns < 0 on error, zero if no action needs to be taken and
+ * 1 if the inode must be logged.
+ */
+int ext3_ordered_update_i_size(struct inode *inode)
+{
+ u64 new_size;
+ u64 disk_size;
+ struct ext3_ordered_extent *test;
+ struct ext3_ordered_buffers *buffers = &EXT3_I(inode)->ordered_buffers;
+ int ret = 0;
+
+ disk_size = EXT3_I(inode)->i_disksize;
+
+ /*
+ * if the disk i_size is already at the inode->i_size, we're done
+ */
+ if (disk_size >= inode->i_size)
+ goto out;
+
+ /*
+ * if the ordered list is empty, push the disk i_size all the way
+ * up to the inode size, otherwise, use the start of the first
+ * ordered extent in the list as the new disk i_size
+ */
+ if (list_empty(&buffers->ordered_list)) {
+ new_size = inode->i_size;
+ } else {
+ test = list_entry(buffers->ordered_list.next,
+ struct ext3_ordered_extent, ordered_list);
+
+ new_size = test->start;
+ }
+
+ new_size = min_t(u64, new_size, i_size_read(inode));
+
+ /* the caller needs to log this inode */
+ ret = 1;
+
+ EXT3_I(inode)->i_disksize = new_size;
+out:
+ return ret;
+}
+
+/*
+ * during a truncate or delete, we need to get rid of pending
+ * ordered extents so there isn't a war over who updates disk i_size first.
+ * This does that, without waiting for any of the IO to actually finish.
+ *
+ * When the IO does finish, it will find the ordered extent removed from the
+ * list and all will work properly.
+ */
+void ext3_truncate_ordered_extents(struct inode *inode, u64 offset)
+{
+	struct ext3_ordered_buffers *buffers = &EXT3_I(inode)->ordered_buffers;
+	struct ext3_ordered_extent *test;
+
+	spin_lock(&buffers->lock);
+	while (!list_empty(&buffers->ordered_list)) {
+
+		test = list_entry(buffers->ordered_list.prev,
+				  struct ext3_ordered_extent, ordered_list);
+
+		if (test->start < offset)
+			break;
+		/*
+		 * once this is called, the end_io handler won't run,
+		 * and we won't update disk_i_size to include this buffer.
+		 *
+		 * That's ok for truncates because the truncate code is
+		 * writing a new i_size.
+		 *
+		 * This ignores any IO in flight, which is ok
+		 * because the guarded_buffers list has a reference
+		 * on the ordered extent
+		 */
+		ext3_remove_ordered_extent(inode, test);
+	}
+	spin_unlock(&buffers->lock);
+	return;
+}
+
+void ext3_ordered_inode_init(struct ext3_inode_info *ei)
+{
+	INIT_LIST_HEAD(&ei->ordered_buffers.ordered_list);
+	spin_lock_init(&ei->ordered_buffers.lock);
+}
+
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 599dbfe..1e0eff8 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -37,6 +37,7 @@
#include <linux/quotaops.h>
#include <linux/seq_file.h>
#include <linux/log2.h>
+#include <linux/workqueue.h>

#include <asm/uaccess.h>

@@ -399,6 +400,9 @@ static void ext3_put_super (struct super_block * sb)
struct ext3_super_block *es = sbi->s_es;
int i, err;

+	flush_workqueue(sbi->guarded_wq);
+	destroy_workqueue(sbi->guarded_wq);
+
ext3_xattr_put_super(sb);
err = journal_destroy(sbi->s_journal);
sbi->s_journal = NULL;
@@ -468,6 +472,8 @@ static struct inode *ext3_alloc_inode(struct super_block *sb)
#endif
ei->i_block_alloc_info = NULL;
ei->vfs_inode.i_version = 1;
+ ext3_ordered_inode_init(ei);
+
return &ei->vfs_inode;
}

@@ -481,6 +487,8 @@ static void ext3_destroy_inode(struct inode *inode)
false);
dump_stack();
}
+	if (!list_empty(&EXT3_I(inode)->ordered_buffers.ordered_list))
+		printk(KERN_INFO "EXT3 ordered list not empty\n");
kmem_cache_free(ext3_inode_cachep, EXT3_I(inode));
}

@@ -528,6 +536,13 @@ static void ext3_clear_inode(struct inode *inode)
EXT3_I(inode)->i_default_acl = EXT3_ACL_NOT_CACHED;
}
#endif
+	/*
+	 * If pages got cleaned by truncate, truncate should have
+	 * gotten rid of the ordered extents. Just in case, drop them
+	 * here.
+	 */
+	ext3_truncate_ordered_extents(inode, 0);
+
ext3_discard_reservation(inode);
EXT3_I(inode)->i_block_alloc_info = NULL;
if (unlikely(rsv))
@@ -634,6 +649,8 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, ",data=journal");
else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA)
seq_puts(seq, ",data=ordered");
+	else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA)
+		seq_puts(seq, ",data=guarded");
else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
seq_puts(seq, ",data=writeback");

@@ -790,7 +807,7 @@ enum {
Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
- Opt_data_err_abort, Opt_data_err_ignore,
+ Opt_data_guarded, Opt_data_err_abort, Opt_data_err_ignore,
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
@@ -832,6 +849,7 @@ static const match_table_t tokens = {
{Opt_abort, "abort"},
{Opt_data_journal, "data=journal"},
{Opt_data_ordered, "data=ordered"},
+ {Opt_data_guarded, "data=guarded"},
{Opt_data_writeback, "data=writeback"},
{Opt_data_err_abort, "data_err=abort"},
{Opt_data_err_ignore, "data_err=ignore"},
@@ -1034,6 +1052,9 @@ static int parse_options (char *options, struct super_block *sb,
case Opt_data_ordered:
data_opt = EXT3_MOUNT_ORDERED_DATA;
goto datacheck;
+		case Opt_data_guarded:
+			data_opt = EXT3_MOUNT_GUARDED_DATA;
+			goto datacheck;
case Opt_data_writeback:
data_opt = EXT3_MOUNT_WRITEBACK_DATA;
datacheck:
@@ -1949,11 +1970,23 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
clear_opt(sbi->s_mount_opt, NOBH);
}
}
+
+	/*
+	 * setup the guarded work list
+	 */
+	INIT_LIST_HEAD(&EXT3_SB(sb)->guarded_buffers);
+	INIT_WORK(&EXT3_SB(sb)->guarded_work, ext3_run_guarded_work);
+	spin_lock_init(&EXT3_SB(sb)->guarded_lock);
+	EXT3_SB(sb)->guarded_wq = create_workqueue("ext3-guard");
+	if (!EXT3_SB(sb)->guarded_wq) {
+		printk(KERN_ERR "EXT3-fs: failed to create workqueue\n");
+		goto failed_mount_guard;
+	}
+
/*
* The journal_load will have done any necessary log recovery,
* so we can safely mount the rest of the filesystem now.
*/
-
root = ext3_iget(sb, EXT3_ROOT_INO);
if (IS_ERR(root)) {
printk(KERN_ERR "EXT3-fs: get root inode failed\n");
@@ -1965,6 +1998,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n");
goto failed_mount4;
}
+
sb->s_root = d_alloc_root(root);
if (!sb->s_root) {
printk(KERN_ERR "EXT3-fs: get root dentry failed\n");
@@ -1974,6 +2008,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
}

ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY);
+
/*
* akpm: core read_super() calls in here with the superblock locked.
* That deadlocks, because orphan cleanup needs to lock the superblock
@@ -1989,9 +2024,10 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
printk (KERN_INFO "EXT3-fs: recovery complete.\n");
ext3_mark_recovery_complete(sb, es);
printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
- test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
- test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
- "writeback");
+		test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal" :
+		test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA ? "guarded" :
+		test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered" :
+		"writeback");

lock_kernel();
return 0;
@@ -2003,6 +2039,8 @@ cantfind_ext3:
goto failed_mount;

failed_mount4:
+	destroy_workqueue(EXT3_SB(sb)->guarded_wq);
+failed_mount_guard:
journal_destroy(sbi->s_journal);
failed_mount3:
percpu_counter_destroy(&sbi->s_freeblocks_counter);
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index ed886e6..1354a55 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -2018,6 +2018,7 @@ zap_buffer_unlocked:
clear_buffer_mapped(bh);
clear_buffer_req(bh);
clear_buffer_new(bh);
+	clear_buffer_datanew(bh);
bh->b_bdev = NULL;
return may_free;
}
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 634a5e5..a20bd4f 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -18,6 +18,7 @@

#include <linux/types.h>
#include <linux/magic.h>
+#include <linux/workqueue.h>

/*
* The second extended filesystem constants/structures
@@ -398,7 +399,6 @@ struct ext3_inode {
#define EXT3_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
#define EXT3_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
#define EXT3_MOUNT_ABORT 0x00200 /* Fatal error detected */
-#define EXT3_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
#define EXT3_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
#define EXT3_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
#define EXT3_MOUNT_WRITEBACK_DATA 0x00C00 /* No data ordering */
@@ -414,6 +414,12 @@ struct ext3_inode {
#define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
#define EXT3_MOUNT_DATA_ERR_ABORT 0x400000 /* Abort on file data write
* error in ordered mode */
+#define EXT3_MOUNT_GUARDED_DATA		0x800000 /* guard new writes with
+						    i_size */
+#define EXT3_MOUNT_DATA_FLAGS		(EXT3_MOUNT_JOURNAL_DATA | \
+					 EXT3_MOUNT_ORDERED_DATA | \
+					 EXT3_MOUNT_WRITEBACK_DATA | \
+					 EXT3_MOUNT_GUARDED_DATA)

/* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
@@ -892,6 +898,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
extern void ext3_set_aops(struct inode *inode);
extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
+void ext3_run_guarded_work(struct work_struct *work);

/* ioctl.c */
extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
@@ -900,6 +907,7 @@ extern long ext3_compat_ioctl(struct file *, unsigned int, unsigned long);
/* namei.c */
extern int ext3_orphan_add(handle_t *, struct inode *);
extern int ext3_orphan_del(handle_t *, struct inode *);
+extern int ext3_orphan_del_locked(handle_t *, struct inode *);
extern int ext3_htree_fill_tree(struct file *dir_file, __u32 start_hash,
__u32 start_minor_hash, __u32 *next_hash);

@@ -945,7 +953,30 @@ extern const struct inode_operations ext3_special_inode_operations;
extern const struct inode_operations ext3_symlink_inode_operations;
extern const struct inode_operations ext3_fast_symlink_inode_operations;

+/* ordered-data.c */
+int ext3_add_ordered_extent(struct inode *inode, u64 file_offset,
+			    struct buffer_head *bh);
+int ext3_put_ordered_extent(struct ext3_ordered_extent *entry);
+int ext3_remove_ordered_extent(struct inode *inode,
+			       struct ext3_ordered_extent *entry);
+int ext3_ordered_update_i_size(struct inode *inode);
+void ext3_ordered_inode_init(struct ext3_inode_info *ei);
+void ext3_truncate_ordered_extents(struct inode *inode, u64 offset);
+
+static inline void ext3_ordered_lock(struct inode *inode)
+{
+	spin_lock(&EXT3_I(inode)->ordered_buffers.lock);
+}

+static inline void ext3_ordered_unlock(struct inode *inode)
+{
+	spin_unlock(&EXT3_I(inode)->ordered_buffers.lock);
+}
+
+static inline void ext3_get_ordered_extent(struct ext3_ordered_extent *entry)
+{
+	atomic_inc(&entry->refs);
+}
#endif /* __KERNEL__ */

#endif /* _LINUX_EXT3_FS_H */
diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
index 7894dd0..11dd4d4 100644
--- a/include/linux/ext3_fs_i.h
+++ b/include/linux/ext3_fs_i.h
@@ -65,6 +65,49 @@ struct ext3_block_alloc_info {
#define rsv_end rsv_window._rsv_end

/*
+ * used to prevent garbage in files after a crash by
+ * making sure i_size isn't updated until after the IO
+ * is done.
+ *
+ * See fs/ext3/ordered-data.c for the code that uses these.
+ */
+struct buffer_head;
+struct ext3_ordered_buffers {
+	/* protects the list and disk i_size */
+	spinlock_t lock;
+
+	struct list_head ordered_list;
+};
+
+struct ext3_ordered_extent {
+	/* logical offset of the block in the file
+	 * strictly speaking we don't need this
+	 * but keep it in the struct for
+	 * debugging
+	 */
+	u64 start;
+
+	/* buffer head being written */
+	struct buffer_head *bh;
+
+	/*
+	 * set at end_io time so we properly
+	 * do IO accounting even when this ordered
+	 * extent struct has been removed from the
+	 * list
+	 */
+	struct buffer_head *end_io_bh;
+
+	/* number of refs on this ordered extent */
+	atomic_t refs;
+
+	struct list_head ordered_list;
+
+	/* list of things being processed by the workqueue */
+	struct list_head work_list;
+};
+
+/*
* third extended file system inode data in memory
*/
struct ext3_inode_info {
@@ -141,6 +184,8 @@ struct ext3_inode_info {
* by other means, so we have truncate_mutex.
*/
struct mutex truncate_mutex;
+
+	struct ext3_ordered_buffers ordered_buffers;
struct inode vfs_inode;
};

diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
index f07f34d..5dbdbeb 100644
--- a/include/linux/ext3_fs_sb.h
+++ b/include/linux/ext3_fs_sb.h
@@ -21,6 +21,7 @@
#include <linux/wait.h>
#include <linux/blockgroup_lock.h>
#include <linux/percpu_counter.h>
+#include <linux/workqueue.h>
#endif
#include <linux/rbtree.h>

@@ -82,6 +83,11 @@ struct ext3_sb_info {
char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */
int s_jquota_fmt; /* Format of quota to use */
#endif
+
+	struct workqueue_struct *guarded_wq;
+	struct work_struct guarded_work;
+	struct list_head guarded_buffers;
+	spinlock_t guarded_lock;
};

static inline spinlock_t *
diff --git a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h
index cf82d51..45cb4aa 100644
--- a/include/linux/ext3_jbd.h
+++ b/include/linux/ext3_jbd.h
@@ -212,6 +212,17 @@ static inline int ext3_should_order_data(struct inode *inode)
return 0;
}

+static inline int ext3_should_guard_data(struct inode *inode)
+{
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+	if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL)
+		return 0;
+	if (test_opt(inode->i_sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA)
+		return 1;
+	return 0;
+}
+
static inline int ext3_should_writeback_data(struct inode *inode)
{
if (!S_ISREG(inode->i_mode))
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index c2049a0..bbb7990 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -291,6 +291,13 @@ enum jbd_state_bits {
BH_State, /* Pins most journal_head state */
BH_JournalHead, /* Pins bh->b_private and jh->b_bh */
BH_Unshadow, /* Dummy bit, for BJ_Shadow wakeup filtering */
+	BH_DataGuarded,	/* ext3 data=guarded mode buffer
+			 * these have something other than a
+			 * journal_head at b_private */
+	BH_DataNew,	/* BH_new gets cleared too early for
+			 * data=guarded to use it. So,
+			 * this gets set instead.
+			 */
};

BUFFER_FNS(JBD, jbd)
@@ -302,6 +309,9 @@ TAS_BUFFER_FNS(Revoked, revoked)
BUFFER_FNS(RevokeValid, revokevalid)
TAS_BUFFER_FNS(RevokeValid, revokevalid)
BUFFER_FNS(Freed, freed)
+BUFFER_FNS(DataGuarded, dataguarded)
+BUFFER_FNS(DataNew, datanew)
+TAS_BUFFER_FNS(DataNew, datanew)

static inline struct buffer_head *jh2bh(struct journal_head *jh)
{
--
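
For anyone wanting to try this, the option parsing above means the new mode
is selected at mount time like the other ext3 data modes. A minimal usage
sketch (the device and mount point below are placeholders, not taken from
the patch):

    # mount -t ext3 -o data=guarded /dev/sdXN /mnt
    # dmesg | tail -n 1
    EXT3-fs: mounted filesystem with guarded data mode.

The dmesg line is just the mount-time printk from ext3_fill_super() shown
in the hunk above.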




2009-04-29 21:53:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH RFC] ext3 data=guarded v5

On Wed, Apr 29, 2009 at 03:41:29PM -0400, Chris Mason wrote:
>
> I think my latest patch has this nailed down without the mutex.
> Basically it checks i_nlink with super lock held and then calls
> ext3_orphan_del. If we race with unlink, we'll either find the new
> nlink count and skip the orphan del or unlink will come in after us and
> add the orphan back.

Can you make sure you mark any lock_super()'s with a comment saying
what it's protecting? Eventually I suspect we'll want to forward port
this to ext4, and I have a patch in the ext4 patch queue (which I mean
to backport to ext3) that introduces an explicit i_orphan_lock mutex
and eliminates most of the calls to lock/unlock_super(), in support of
a cleanup Christoph is planning. So it'll make life easier if
you annotate any use of lock_super(), since it's going to be going
away in both ext3 and ext4 in the near future.

Thanks!!

- Ted
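
As a rough illustration of the annotation being asked for here: each
lock_super() call in the guarded code could say that it is standing in
for an orphan-list lock. The snippet below is only a sketch with
simplified surrounding logic, not code taken from the posted patch:

	/*
	 * lock_super() protects the on-disk orphan list while we
	 * re-check i_nlink and decide whether this inode can be
	 * removed from it; an explicit i_orphan_lock mutex would
	 * replace this later.
	 */
	lock_super(inode->i_sb);
	if (inode->i_nlink)
		ext3_orphan_del_locked(handle, inode);
	unlock_super(inode->i_sb);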

2009-04-30 11:38:27

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC] ext3 data=guarded v5

On Wed 29-04-09 16:37:01, Chris Mason wrote:
> On Wed, 2009-04-29 at 22:04 +0200, Jan Kara wrote:
>
> > > What we don't want to do is have a call to write() over existing blocks
> > > in the file add new things to the data=ordered list. I don't see how we
> > > can avoid that without datanew.
> > Yes, what I suggest would do exactly that:
> > In ordered_writepage() in the beginning we do:
> > page_bufs = page_buffers(page);
> > if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
> > NULL, buffer_unmapped)) {
> > return block_write_full_page(page, NULL, wbc);
> > }
> > So we only get to starting a transaction and file some buffers if some buffer
> > in the page is unmapped. Write() maps / allocates all buffers in write_begin()
> > so they are never added to ordered lists in writepage().
>
> Right, writepage doesn't really need datanew.
>
> > We rely on write_end
> > to do it. So the only case where not all buffers in the page are mapped is
> > when we have to allocate in writepage() (mmaped write) or the two cases I
> > describe above.
>
> But I still think write_end does need datanew. That's where 99% of the
> ordered buffers are going to come from when we overwrite the contents of
> an existing file.
Ah, true, buffer_new() can be cleared in __block_prepare_write() in some
cases. Frankly, I don't see a reason why that happens but that's another
story.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
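
For reference, the buffer_unmapped() callback in Jan's snippet is just a
per-buffer predicate handed to walk_page_buffers(); written out only for
illustration, it amounts to:

	static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
	{
		/* true for buffers writepage would still have to map/allocate */
		return !buffer_mapped(bh);
	}

With that check, ordered_writepage() can fall straight through to
block_write_full_page() when every buffer is already mapped, and only
start a transaction and file buffers for ordering when it actually has
to allocate (the mmap write case discussed above).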

2009-04-30 11:52:09

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC] ext3 data=guarded v7

On Wed 29-04-09 16:53:44, Chris Mason wrote:
...
> +static int ext3_guarded_write_end(struct file *file,
> + struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata)
> +{
> + handle_t *handle = ext3_journal_current_handle();
> + struct inode *inode = file->f_mapping->host;
> + unsigned from, to;
> + int ret = 0, ret2;
> +
> + copied = block_write_end(file, mapping, pos, len, copied,
> + page, fsdata);
> +
> + from = pos & (PAGE_CACHE_SIZE - 1);
> + to = from + copied;
> + ret = walk_page_buffers(handle, page_buffers(page),
> + from, to, NULL, journal_dirty_data_guarded_fn);
> +
> + /*
> + * we only update the in-memory i_size. The disk i_size is done
> + * by the end io handlers
> + */
> + if (ret == 0 && pos + copied > inode->i_size) {
> + int must_log;
> +
> + /* updated i_size, but we may have raced with a
> + * data=guarded end_io handler.
> + *
> + * All the guarded IO could have ended while i_size was still
> + * small, and if we're just adding bytes into an existing block
> + * in the file, we may not be adding a new guarded IO with this
> + * write. So, do a check on the disk i_size and make sure it
> + * is updated to the highest safe value.
> + *
> + * This may also be required if the
> + * journal_dirty_data_guarded_fn chose to do a fully
> + * ordered write of this buffer instead of a guarded
> + * write.
> + *
> + * ext3_ordered_update_i_size tests inode->i_size, so we
> + * make sure to update it with the ordered lock held.
> + */
> + ext3_ordered_lock(inode);
> + i_size_write(inode, pos + copied);
> + must_log = ext3_ordered_update_i_size(inode);
> + ext3_ordered_unlock(inode);
> +
> + orphan_del_trans(inode, must_log > 0);
> + }
Didn't we agree that only "i_size_write" should remain from the above
"if" after you changed the journal_dirty_data_guarded_fn() function?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-04-30 13:18:35

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH RFC] ext3 data=guarded v7

On Thu, 2009-04-30 at 13:52 +0200, Jan Kara wrote:
> On Wed 29-04-09 16:53:44, Chris Mason wrote:
> ...
> > +static int ext3_guarded_write_end(struct file *file,
> > + struct address_space *mapping,
> > + loff_t pos, unsigned len, unsigned copied,
> > + struct page *page, void *fsdata)
> > +{
> > + handle_t *handle = ext3_journal_current_handle();
> > + struct inode *inode = file->f_mapping->host;
> > + unsigned from, to;
> > + int ret = 0, ret2;
> > +
> > + copied = block_write_end(file, mapping, pos, len, copied,
> > + page, fsdata);
> > +
> > + from = pos & (PAGE_CACHE_SIZE - 1);
> > + to = from + copied;
> > + ret = walk_page_buffers(handle, page_buffers(page),
> > + from, to, NULL, journal_dirty_data_guarded_fn);
> > +
> > + /*
> > + * we only update the in-memory i_size. The disk i_size is done
> > + * by the end io handlers
> > + */
> > + if (ret == 0 && pos + copied > inode->i_size) {
> > + int must_log;
> > +
> > + /* updated i_size, but we may have raced with a
> > + * data=guarded end_io handler.
> > + *
> > + * All the guarded IO could have ended while i_size was still
> > + * small, and if we're just adding bytes into an existing block
> > + * in the file, we may not be adding a new guarded IO with this
> > + * write. So, do a check on the disk i_size and make sure it
> > + * is updated to the highest safe value.
> > + *
> > + * This may also be required if the
> > + * journal_dirty_data_guarded_fn chose to do a fully
> > + * ordered write of this buffer instead of a guarded
> > + * write.
> > + *
> > + * ext3_ordered_update_i_size tests inode->i_size, so we
> > + * make sure to update it with the ordered lock held.
> > + */
> > + ext3_ordered_lock(inode);
> > + i_size_write(inode, pos + copied);
> > + must_log = ext3_ordered_update_i_size(inode);
> > + ext3_ordered_unlock(inode);
> > +
> > + orphan_del_trans(inode, must_log > 0);
> > + }
> Didn't we agree that only "i_size_write" should remain from the above
> "if" after you changed the journal_dirty_data_guarded_fn() function?

It sounded like a really good idea at the time ;) But it doesn't cover
all the cases. If journal_dirty_data_guarded_fn decided to do a full
ordering of a buffer because the start of the buffer was inside of
i_size, we might not have updated disk_i_size.

Basically something like this:

dd if=/dev/zero of=foo bs=2k count=1 # makes an ordered buffer
sync # disk_i_size is now 2k
dd if=/dev/zero of=foo bs=3k count=1 conv=notrunc

This will become an ordered write. There isn't a really good way to do
it as a guarded IO because that buffer head may already have a guarded
IO attached. I could add some complexity to cover all the cases where
an ordered IO is in flight and so on, but it doesn't seem worth it for
the small append case.

Since it is an ordered write, there is no guarded IO to update the disk
i_size later. The update needs to happen during write_end(). I did try
to update the comments to reflect that, but it might not be as clear as
it should be.
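
Concretely, walking through the dd example above (assuming 4k blocks):
the first write puts 2k into block 0, and the sync pushes the disk
i_size to 2k. The second write rewrites bytes 0-3071 of that same,
already allocated block, so the buffer's start is below i_size and
journal_dirty_data_guarded_fn files it as an ordered buffer. No guarded
end_io will ever run for it, so nothing later lifts the disk i_size from
2k to 3k, which is why the ext3_ordered_update_i_size() call has to stay
in the guarded write_end path.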

-chris