2009-10-27 12:48:46

by Jan Kara

[permalink] [raw]
Subject: [PATCH 0/4] Fix fsync on ext3 and ext4 (v2)


Hi,

this is a second try for a patchset which makes ext3 and ext4 properly force
a transaction commit when needed. We now do not rely on buffer dirty bits
(which does not work as pdflush can just clear them without forcing a
transaction commit) but rather keep transaction ids that need to be committed
for each inode.
Since last version, I've implemented Aneesh's and Curt's comments and also
fixed a missing initialization of the fields. I've tested that now the patch
works correctly for uninitialized extents as well as for standard writes. If
noone objects, would you merge the ext4 part Ted? I'll take care of the ext3
patch.

Honza


2009-10-27 12:48:46

by Jan Kara

[permalink] [raw]
Subject: [PATCH 3/4] ext4: Fix error handling ext4_ind_get_blocks

When an error happened in ext4_splice_branch we failed to notice that
in ext4_ind_get_blocks and mapped the buffer anyway. Fix the problem
by checking for error properly.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c5bc5d..9105f40 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1020,7 +1020,7 @@ static int ext4_ind_get_blocks(handle_t *handle, struct inode *inode,
if (!err)
err = ext4_splice_branch(handle, inode, iblock,
partial, indirect_blks, count);
- else
+ if (err)
goto cleanup;

set_buffer_new(bh_result);
--
1.6.0.2


2009-10-27 12:48:46

by Jan Kara

[permalink] [raw]
Subject: [PATCH 1/4] ext3: Wait for proper transaction commit on fsync

We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/fsync.c | 36 ++++++++++++++++--------------------
fs/ext3/inode.c | 32 +++++++++++++++++++++++++++++++-
fs/ext3/super.c | 2 ++
include/linux/ext3_fs_i.h | 8 ++++++++
4 files changed, 57 insertions(+), 21 deletions(-)

diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index 451d166..8209f26 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -46,19 +46,21 @@
int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
int ret = 0;
+ tid_t commit_tid;
+
+ if (inode->i_sb->s_flags & MS_RDONLY)
+ return 0;

J_ASSERT(ext3_journal_current_handle() == NULL);

/*
- * data=writeback:
+ * data=writeback,ordered:
* The caller's filemap_fdatawrite()/wait will sync the data.
- * sync_inode() will sync the metadata
- *
- * data=ordered:
- * The caller's filemap_fdatawrite() will write the data and
- * sync_inode() will write the inode if it is dirty. Then the caller's
- * filemap_fdatawait() will wait on the pages.
+ * Metadata is in the journal, we wait for a proper transaction
+ * to commit here.
*
* data=journal:
* filemap_fdatawrite won't do anything (the buffers are clean).
@@ -73,22 +75,16 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
goto out;
}

- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto flush;
+ if (datasync)
+ commit_tid = atomic_read(&ei->i_datasync_tid);
+ else
+ commit_tid = atomic_read(&ei->i_sync_tid);

- /*
- * The VFS has written the file data. If the inode is unaltered
- * then we need not start a commit.
- */
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_ALL,
- .nr_to_write = 0, /* sys_fsync did this */
- };
- ret = sync_inode(inode, &wbc);
+ if (log_start_commit(journal, commit_tid)) {
+ log_wait_commit(journal, commit_tid);
goto out;
}
-flush:
+
/*
* In case we didn't commit a transaction, we have to flush
* disk caches manually so that data really is on persistent
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index acf1b14..74b6146 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -699,8 +699,9 @@ static int ext3_splice_branch(handle_t *handle, struct inode *inode,
int err = 0;
struct ext3_block_alloc_info *block_i;
ext3_fsblk_t current_block;
+ struct ext3_inode_info *ei = EXT3_I(inode);

- block_i = EXT3_I(inode)->i_block_alloc_info;
+ block_i = ei->i_block_alloc_info;
/*
* If we're splicing into a [td]indirect block (as opposed to the
* inode) then we need to get write access to the [td]indirect block
@@ -741,6 +742,8 @@ static int ext3_splice_branch(handle_t *handle, struct inode *inode,

inode->i_ctime = CURRENT_TIME_SEC;
ext3_mark_inode_dirty(handle, inode);
+ /* ext3_mark_inode_dirty already updated i_sync_tid */
+ atomic_set(&ei->i_datasync_tid, handle->h_transaction->t_tid);

/* had we spliced it onto indirect block? */
if (where->bh) {
@@ -2750,6 +2753,8 @@ struct inode *ext3_iget(struct super_block *sb, unsigned long ino)
struct ext3_inode_info *ei;
struct buffer_head *bh;
struct inode *inode;
+ journal_t *journal = EXT3_SB(sb)->s_journal;
+ transaction_t *transaction;
long ret;
int block;

@@ -2827,6 +2832,30 @@ struct inode *ext3_iget(struct super_block *sb, unsigned long ino)
ei->i_data[block] = raw_inode->i_block[block];
INIT_LIST_HEAD(&ei->i_orphan);

+ /*
+ * Set transaction id's of transactions that have to be committed
+ * to finish f[data]sync. We set them to currently running transaction
+ * as we cannot be sure that the inode or some of its metadata isn't
+ * part of the transaction - the inode could have been reclaimed and
+ * now it is reread from disk.
+ */
+ if (journal) {
+ tid_t tid;
+
+ spin_lock(&journal->j_state_lock);
+ if (journal->j_running_transaction)
+ transaction = journal->j_running_transaction;
+ else
+ transaction = journal->j_committing_transaction;
+ if (transaction)
+ tid = transaction->t_tid;
+ else
+ tid = journal->j_commit_sequence;
+ spin_unlock(&journal->j_state_lock);
+ atomic_set(&ei->i_sync_tid, tid);
+ atomic_set(&ei->i_datasync_tid, tid);
+ }
+
if (inode->i_ino >= EXT3_FIRST_INO(inode->i_sb) + 1 &&
EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) {
/*
@@ -3011,6 +3040,7 @@ again:
err = rc;
ei->i_state &= ~EXT3_STATE_NEW;

+ atomic_set(&ei->i_sync_tid, handle->h_transaction->t_tid);
out_brelse:
brelse (bh);
ext3_std_error(inode->i_sb, err);
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 7a520a8..427496c 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -466,6 +466,8 @@ static struct inode *ext3_alloc_inode(struct super_block *sb)
return NULL;
ei->i_block_alloc_info = NULL;
ei->vfs_inode.i_version = 1;
+ atomic_set(&ei->i_datasync_tid, 0);
+ atomic_set(&ei->i_sync_tid, 0);
return &ei->vfs_inode;
}

diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
index ca1bfe9..93e7428 100644
--- a/include/linux/ext3_fs_i.h
+++ b/include/linux/ext3_fs_i.h
@@ -137,6 +137,14 @@ struct ext3_inode_info {
* by other means, so we have truncate_mutex.
*/
struct mutex truncate_mutex;
+
+ /*
+ * Transactions that contain inode's metadata needed to complete
+ * fsync and fdatasync, respectively.
+ */
+ atomic_t i_sync_tid;
+ atomic_t i_datasync_tid;
+
struct inode vfs_inode;
};

--
1.6.0.2


2009-10-27 12:48:46

by Jan Kara

[permalink] [raw]
Subject: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 7 +++++++
fs/ext4/ext4_jbd2.h | 14 ++++++++++++++
fs/ext4/extents.c | 14 ++++++++++++--
fs/ext4/fsync.c | 40 +++++++++++++++++-----------------------
fs/ext4/inode.c | 29 +++++++++++++++++++++++++++++
fs/ext4/super.c | 2 ++
6 files changed, 81 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 984ca0c..5639f30 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -702,6 +702,13 @@ struct ext4_inode_info {
struct list_head i_aio_dio_complete_list;
/* current io_end structure for async DIO write*/
ext4_io_end_t *cur_aio_dio;
+
+ /*
+ * Transactions that contain inode's metadata needed to complete
+ * fsync and fdatasync, respectively.
+ */
+ atomic_t i_sync_tid;
+ atomic_t i_datasync_tid;
};

/*
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index a286598..4389941 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -254,6 +254,20 @@ static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
return 0;
}

+static inline void ext4_update_inode_fsync_trans(handle_t *handle,
+ struct inode *inode,
+ int datasync)
+{
+ if (ext4_handle_valid(handle)) {
+ atomic_set(&EXT4_I(inode)->i_sync_tid,
+ handle->h_transaction->t_tid);
+ if (datasync) {
+ atomic_set(&EXT4_I(inode)->i_datasync_tid,
+ handle->h_transaction->t_tid);
+ }
+ }
+}
+
/* super.c */
int ext4_force_commit(struct super_block *sb);

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 10539e3..d3ba4a9 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3057,6 +3057,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
if (flags == EXT4_GET_BLOCKS_DIO_CONVERT_EXT) {
ret = ext4_convert_unwritten_extents_dio(handle, inode,
path);
+ if (ret >= 0)
+ ext4_update_inode_fsync_trans(handle, inode, 1);
goto out2;
}
/* buffered IO case */
@@ -3084,6 +3086,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
ret = ext4_ext_convert_to_initialized(handle, inode,
path, iblock,
max_blocks);
+ if (ret >= 0)
+ ext4_update_inode_fsync_trans(handle, inode, 1);
out:
if (ret <= 0) {
err = ret;
@@ -3316,10 +3320,16 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
allocated = ext4_ext_get_actual_len(&newex);
set_buffer_new(bh_result);

- /* Cache only when it is _not_ an uninitialized extent */
- if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0)
+ /*
+ * Cache the extent and update transaction to commit on fdatasync only
+ * when it is _not_ an uninitialized extent.
+ */
+ if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) {
ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
EXT4_EXT_CACHE_EXTENT);
+ ext4_update_inode_fsync_trans(handle, inode, 1);
+ } else
+ ext4_update_inode_fsync_trans(handle, inode, 0);
out:
if (allocated > max_blocks)
allocated = max_blocks;
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 2bf9413..03e0fe9 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -51,25 +51,26 @@
int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ struct ext4_inode_info *ei = EXT4_I(inode);
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
- int err, ret = 0;
+ int ret = 0;
+ tid_t commit_tid;

J_ASSERT(ext4_journal_current_handle() == NULL);

trace_ext4_sync_file(file, dentry, datasync);

+ if (inode->i_sb->s_flags & MS_RDONLY)
+ goto out;
+
ret = flush_aio_dio_completed_IO(inode);
if (ret < 0)
goto out;
/*
- * data=writeback:
+ * data=writeback,ordered:
* The caller's filemap_fdatawrite()/wait will sync the data.
- * sync_inode() will sync the metadata
- *
- * data=ordered:
- * The caller's filemap_fdatawrite() will write the data and
- * sync_inode() will write the inode if it is dirty. Then the caller's
- * filemap_fdatawait() will wait on the pages.
+ * Metadata is in the journal, we wait for proper transaction to
+ * commit here.
*
* data=journal:
* filemap_fdatawrite won't do anything (the buffers are clean).
@@ -87,23 +88,16 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
if (!journal)
ret = sync_mapping_buffers(inode->i_mapping);

- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto flush;
+ if (datasync)
+ commit_tid = atomic_read(&ei->i_datasync_tid);
+ else
+ commit_tid = atomic_read(&ei->i_sync_tid);

- /*
- * The VFS has written the file data. If the inode is unaltered
- * then we need not start a commit.
- */
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_ALL,
- .nr_to_write = 0, /* sys_fsync did this */
- };
- err = sync_inode(inode, &wbc);
- if (ret == 0)
- ret = err;
+ if (jbd2_log_start_commit(journal, commit_tid)) {
+ jbd2_log_wait_commit(journal, commit_tid);
+ goto out;
}
-flush:
+
if (journal && (journal->j_flags & JBD2_BARRIER))
blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
out:
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9105f40..546f175 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1024,6 +1024,8 @@ static int ext4_ind_get_blocks(handle_t *handle, struct inode *inode,
goto cleanup;

set_buffer_new(bh_result);
+
+ ext4_update_inode_fsync_trans(handle, inode, 1);
got_it:
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
if (count > blocks_to_boundary)
@@ -4777,6 +4779,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
struct ext4_inode_info *ei;
struct buffer_head *bh;
struct inode *inode;
+ journal_t *journal = EXT4_SB(sb)->s_journal;
long ret;
int block;

@@ -4842,6 +4845,31 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
ei->i_data[block] = raw_inode->i_block[block];
INIT_LIST_HEAD(&ei->i_orphan);

+ /*
+ * Set transaction id's of transactions that have to be committed
+ * to finish f[data]sync. We set them to currently running transaction
+ * as we cannot be sure that the inode or some of its metadata isn't
+ * part of the transaction - the inode could have been reclaimed and
+ * now it is reread from disk.
+ */
+ if (journal) {
+ transaction_t *transaction;
+ tid_t tid;
+
+ spin_lock(&journal->j_state_lock);
+ if (journal->j_running_transaction)
+ transaction = journal->j_running_transaction;
+ else
+ transaction = journal->j_committing_transaction;
+ if (transaction)
+ tid = transaction->t_tid;
+ else
+ tid = journal->j_commit_sequence;
+ spin_unlock(&journal->j_state_lock);
+ atomic_set(&ei->i_sync_tid, tid);
+ atomic_set(&ei->i_datasync_tid, tid);
+ }
+
if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
@@ -5102,6 +5130,7 @@ static int ext4_do_update_inode(handle_t *handle,
err = rc;
ei->i_state &= ~EXT4_STATE_NEW;

+ ext4_update_inode_fsync_trans(handle, inode, 0);
out_brelse:
brelse(bh);
ext4_std_error(inode->i_sb, err);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 312211e..9a6716f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -704,6 +704,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
spin_lock_init(&(ei->i_block_reservation_lock));
INIT_LIST_HEAD(&ei->i_aio_dio_complete_list);
ei->cur_aio_dio = NULL;
+ atomic_set(&ei->i_datasync_tid, 0);
+ atomic_set(&ei->i_sync_tid, 0);

return &ei->vfs_inode;
}
--
1.6.0.2


2009-10-27 12:48:46

by Jan Kara

[permalink] [raw]
Subject: [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path

There is no reason why we should issue IO barriers when we
just hit an error in ext4_sync_file. So move out: label to
just return and introduce flush: label to those places which
really need it.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/fsync.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 2b15312..2bf9413 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -88,7 +88,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
ret = sync_mapping_buffers(inode->i_mapping);

if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ goto flush;

/*
* The VFS has written the file data. If the inode is unaltered
@@ -103,8 +103,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
if (ret == 0)
ret = err;
}
-out:
+flush:
if (journal && (journal->j_flags & JBD2_BARRIER))
blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+out:
return ret;
}
--
1.6.0.2


2009-11-04 14:57:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

Jan Kara <[email protected]> writes:
> ext4_io_end_t *cur_aio_dio;
> +
> + /*
> + * Transactions that contain inode's metadata needed to complete
> + * fsync and fdatasync, respectively.
> + */
> + atomic_t i_sync_tid;
> + atomic_t i_datasync_tid;

This might be a stupid question, but the atomic implies you don't hold
any kind of reference to the transaction. So what prevents these IDs
from wrapping while in there? Given it would be probably take a long
time today, but might be not fully future proof to very fast future IO
devices.

-Andi


--
[email protected] -- Speaking for myself only.

2009-11-04 15:53:40

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

On Wed 04-11-09 15:57:22, Andi Kleen wrote:
> Jan Kara <[email protected]> writes:
> > ext4_io_end_t *cur_aio_dio;
> > +
> > + /*
> > + * Transactions that contain inode's metadata needed to complete
> > + * fsync and fdatasync, respectively.
> > + */
> > + atomic_t i_sync_tid;
> > + atomic_t i_datasync_tid;
>
> This might be a stupid question, but the atomic implies you don't hold
> any kind of reference to the transaction. So what prevents these IDs
> from wrapping while in there? Given it would be probably take a long
> time today, but might be not fully future proof to very fast future IO
> devices.
Yes, IDs can wrap. But journalling layer actually compares them so that
wrapping does not matter unless you manage to do 2^31 transactions
inbetween. So we are still a few orders of magnitude safe with current hw.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-11-05 13:03:16

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/4] Fix fsync on ext3 and ext4 (v2)

Hi,

> this is a second try for a patchset which makes ext3 and ext4 properly force
> a transaction commit when needed. We now do not rely on buffer dirty bits
> (which does not work as pdflush can just clear them without forcing a
> transaction commit) but rather keep transaction ids that need to be committed
> for each inode.
> Since last version, I've implemented Aneesh's and Curt's comments and also
> fixed a missing initialization of the fields. I've tested that now the patch
> works correctly for uninitialized extents as well as for standard writes. If
> noone objects, would you merge the ext4 part Ted? I'll take care of the ext3
> patch.
Aneesh, Ted, is the second version of the patchset fine with you?

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2009-11-05 16:35:11

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 0/4] Fix fsync on ext3 and ext4 (v2)

On Thu, Nov 05, 2009 at 02:02:50PM +0100, Jan Kara wrote:
> Hi,
>
> > this is a second try for a patchset which makes ext3 and ext4 properly force
> > a transaction commit when needed. We now do not rely on buffer dirty bits
> > (which does not work as pdflush can just clear them without forcing a
> > transaction commit) but rather keep transaction ids that need to be committed
> > for each inode.
> > Since last version, I've implemented Aneesh's and Curt's comments and also
> > fixed a missing initialization of the fields. I've tested that now the patch
> > works correctly for uninitialized extents as well as for standard writes. If
> > noone objects, would you merge the ext4 part Ted? I'll take care of the ext3
> > patch.
> Aneesh, Ted, is the second version of the patchset fine with you?

The patches looks good.

Reviewed-by: Aneesh Kumar K.V <[email protected]>

-aneesh

2009-11-11 14:17:09

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/4] Fix fsync on ext3 and ext4 (v2)

On Thu 05-11-09 22:05:10, Aneesh Kumar K.V wrote:
> On Thu, Nov 05, 2009 at 02:02:50PM +0100, Jan Kara wrote:
> > Hi,
> >
> > > this is a second try for a patchset which makes ext3 and ext4 properly force
> > > a transaction commit when needed. We now do not rely on buffer dirty bits
> > > (which does not work as pdflush can just clear them without forcing a
> > > transaction commit) but rather keep transaction ids that need to be committed
> > > for each inode.
> > > Since last version, I've implemented Aneesh's and Curt's comments and also
> > > fixed a missing initialization of the fields. I've tested that now the patch
> > > works correctly for uninitialized extents as well as for standard writes. If
> > > noone objects, would you merge the ext4 part Ted? I'll take care of the ext3
> > > patch.
> > Aneesh, Ted, is the second version of the patchset fine with you?
>
> The patches looks good.
>
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
Thanks for the review! I've merged the ext3 patch and will push it to
Linus soon. Ted, will you take care of ext4 changes? Thanks.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-11-16 00:48:38

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

On Tue, Oct 27, 2009 at 01:48:49PM +0100, Jan Kara wrote:
> + /*
> + * Transactions that contain inode's metadata needed to complete
> + * fsync and fdatasync, respectively.
> + */
> + atomic_t i_sync_tid;
> + atomic_t i_datasync_tid;

Why do we need an atomic_t here at all? It's not clear it's
necessary. What specific race are you worried about?

- Ted

2009-11-16 00:48:38

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path

On Tue, Oct 27, 2009 at 01:48:47PM +0100, Jan Kara wrote:
> There is no reason why we should issue IO barriers when we
> just hit an error in ext4_sync_file. So move out: label to
> just return and introduce flush: label to those places which
> really need it.
>
> Signed-off-by: Jan Kara <[email protected]>

Once the out: label is moved to just before the "return ret;", it's
actually cleaner to simply replace all of the "goto out" with "return
ret;" statements.....

- Ted

2009-11-16 02:40:46

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] ext4: Avoid issuing unnecessary barriers

We don't to issue an I/O barrier on an error or if we force commit
because we are doing data journaling.

Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Jan Kara <[email protected]>
---
This patch should be equivalent to Jan's "ext4: Avoid issuing barriers
on error recovery path", but it removes more lines than it adds. :-)

fs/ext4/fsync.c | 8 +++-----
1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 2b15312..a3c2507 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -60,7 +60,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)

ret = flush_aio_dio_completed_IO(inode);
if (ret < 0)
- goto out;
+ return ret;
/*
* data=writeback:
* The caller's filemap_fdatawrite()/wait will sync the data.
@@ -79,10 +79,8 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
* (they were dirtied by commit). But that's OK - the blocks are
* safe in-journal, which is all fsync() needs to ensure.
*/
- if (ext4_should_journal_data(inode)) {
- ret = ext4_force_commit(inode->i_sb);
- goto out;
- }
+ if (ext4_should_journal_data(inode))
+ return ext4_force_commit(inode->i_sb);

if (!journal)
ret = sync_mapping_buffers(inode->i_mapping);
--
1.6.5.216.g5288a.dirty


2009-11-16 02:57:04

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/4] ext4: Fix error handling ext4_ind_get_blocks

On Tue, Oct 27, 2009 at 01:48:48PM +0100, Jan Kara wrote:
> When an error happened in ext4_splice_branch we failed to notice that
> in ext4_ind_get_blocks and mapped the buffer anyway. Fix the problem
> by checking for error properly.
>
> Signed-off-by: Jan Kara <[email protected]>

Added to the ext4 patch queue, thanks.

- Ted

2009-11-16 10:29:07

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: Avoid issuing unnecessary barriers

On Sun 15-11-09 21:40:42, Theodore Ts'o wrote:
> We don't to issue an I/O barrier on an error or if we force commit
> because we are doing data journaling.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> Cc: Jan Kara <[email protected]>
> ---
> This patch should be equivalent to Jan's "ext4: Avoid issuing barriers
> on error recovery path", but it removes more lines than it adds. :-)
Yeah, it looks fine. Acked-by: Jan Kara <[email protected]>

Honza
>
> fs/ext4/fsync.c | 8 +++-----
> 1 files changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 2b15312..a3c2507 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -60,7 +60,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
>
> ret = flush_aio_dio_completed_IO(inode);
> if (ret < 0)
> - goto out;
> + return ret;
> /*
> * data=writeback:
> * The caller's filemap_fdatawrite()/wait will sync the data.
> @@ -79,10 +79,8 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
> * (they were dirtied by commit). But that's OK - the blocks are
> * safe in-journal, which is all fsync() needs to ensure.
> */
> - if (ext4_should_journal_data(inode)) {
> - ret = ext4_force_commit(inode->i_sb);
> - goto out;
> - }
> + if (ext4_should_journal_data(inode))
> + return ext4_force_commit(inode->i_sb);
>
> if (!journal)
> ret = sync_mapping_buffers(inode->i_mapping);
> --
> 1.6.5.216.g5288a.dirty
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-11-16 10:43:26

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

On Sun 15-11-09 19:46:41, Theodore Tso wrote:
> On Tue, Oct 27, 2009 at 01:48:49PM +0100, Jan Kara wrote:
> > + /*
> > + * Transactions that contain inode's metadata needed to complete
> > + * fsync and fdatasync, respectively.
> > + */
> > + atomic_t i_sync_tid;
> > + atomic_t i_datasync_tid;
>
> Why do we need an atomic_t here at all? It's not clear it's
> necessary. What specific race are you worried about?
Well, i_[data]sync_tid are set and read from several places which aren't
currently synchronized against each other and I din't want to introduce any
synchronization. So atomic_t seemed like a clean way of making clear that
loads / stores from those fields are atomic, regardless of architecture,
memory alignment or whatever insanity that might make us see just half
overwritten field. On all archs where this means just plain stores / loads
(such as x86), there's no performance hit since the operations are
implemented as such.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-12-09 03:51:17

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

On Mon, Nov 16, 2009 at 11:43:31AM +0100, Jan Kara wrote:
> > Why do we need an atomic_t here at all? It's not clear it's
> > necessary. What specific race are you worried about?
> Well, i_[data]sync_tid are set and read from several places which aren't
> currently synchronized against each other and I din't want to introduce any
> synchronization. So atomic_t seemed like a clean way of making clear that
> loads / stores from those fields are atomic, regardless of architecture,
> memory alignment or whatever insanity that might make us see just half
> overwritten field. On all archs where this means just plain stores / loads
> (such as x86), there's no performance hit since the operations are
> implemented as such.

Sorry for not responding to this one sooner, but see this URL:

http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/

- Ted



2009-12-09 04:54:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

Here's a revised version of your patch that I've included in the ext4
patch queue. I've removed the use of the atomic_t data type. I've
also cleaned up the fsync() function some more to make it easier to
follow, and I've fixed the handling in no-journal mode (we can't
depend on the forcing a commit if there is no journal, so we can't
drop the use of sync_inode() --- we can use simple_fsync(), though,
which does everything we need if there is no journal).

- Ted


commit b436b9bef84de6893e86346d8fbf7104bc520645
Author: Jan Kara <[email protected]>
Date: Tue Dec 8 23:51:10 2009 -0500

ext4: Wait for proper transaction commit on fsync

We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4cfc2f0..ab31e65 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -709,6 +709,13 @@ struct ext4_inode_info {
struct list_head i_aio_dio_complete_list;
/* current io_end structure for async DIO write*/
ext4_io_end_t *cur_aio_dio;
+
+ /*
+ * Transactions that contain inode's metadata needed to complete
+ * fsync and fdatasync, respectively.
+ */
+ tid_t i_sync_tid;
+ tid_t i_datasync_tid;
};

/*
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 2c2b262..05eca81 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -249,6 +249,19 @@ static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
return 0;
}

+static inline void ext4_update_inode_fsync_trans(handle_t *handle,
+ struct inode *inode,
+ int datasync)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ if (ext4_handle_valid(handle)) {
+ ei->i_sync_tid = handle->h_transaction->t_tid;
+ if (datasync)
+ ei->i_datasync_tid = handle->h_transaction->t_tid;
+ }
+}
+
/* super.c */
int ext4_force_commit(struct super_block *sb);

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5967f18..700206e 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3058,6 +3058,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
if (flags == EXT4_GET_BLOCKS_DIO_CONVERT_EXT) {
ret = ext4_convert_unwritten_extents_dio(handle, inode,
path);
+ if (ret >= 0)
+ ext4_update_inode_fsync_trans(handle, inode, 1);
goto out2;
}
/* buffered IO case */
@@ -3085,6 +3087,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
ret = ext4_ext_convert_to_initialized(handle, inode,
path, iblock,
max_blocks);
+ if (ret >= 0)
+ ext4_update_inode_fsync_trans(handle, inode, 1);
out:
if (ret <= 0) {
err = ret;
@@ -3323,10 +3327,16 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
allocated = ext4_ext_get_actual_len(&newex);
set_buffer_new(bh_result);

- /* Cache only when it is _not_ an uninitialized extent */
- if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0)
+ /*
+ * Cache the extent and update transaction to commit on fdatasync only
+ * when it is _not_ an uninitialized extent.
+ */
+ if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) {
ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
EXT4_EXT_CACHE_EXTENT);
+ ext4_update_inode_fsync_trans(handle, inode, 1);
+ } else
+ ext4_update_inode_fsync_trans(handle, inode, 0);
out:
if (allocated > max_blocks)
allocated = max_blocks;
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index a3c2507..0b22497 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -51,25 +51,30 @@
int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ struct ext4_inode_info *ei = EXT4_I(inode);
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
- int err, ret = 0;
+ int ret;
+ tid_t commit_tid;

J_ASSERT(ext4_journal_current_handle() == NULL);

trace_ext4_sync_file(file, dentry, datasync);

+ if (inode->i_sb->s_flags & MS_RDONLY)
+ return 0;
+
ret = flush_aio_dio_completed_IO(inode);
if (ret < 0)
return ret;
+
+ if (!journal)
+ return simple_fsync(file, dentry, datasync);
+
/*
- * data=writeback:
+ * data=writeback,ordered:
* The caller's filemap_fdatawrite()/wait will sync the data.
- * sync_inode() will sync the metadata
- *
- * data=ordered:
- * The caller's filemap_fdatawrite() will write the data and
- * sync_inode() will write the inode if it is dirty. Then the caller's
- * filemap_fdatawait() will wait on the pages.
+ * Metadata is in the journal, we wait for proper transaction to
+ * commit here.
*
* data=journal:
* filemap_fdatawrite won't do anything (the buffers are clean).
@@ -82,27 +87,10 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
if (ext4_should_journal_data(inode))
return ext4_force_commit(inode->i_sb);

- if (!journal)
- ret = sync_mapping_buffers(inode->i_mapping);
-
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
-
- /*
- * The VFS has written the file data. If the inode is unaltered
- * then we need not start a commit.
- */
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_ALL,
- .nr_to_write = 0, /* sys_fsync did this */
- };
- err = sync_inode(inode, &wbc);
- if (ret == 0)
- ret = err;
- }
-out:
- if (journal && (journal->j_flags & JBD2_BARRIER))
+ commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
+ if (jbd2_log_start_commit(journal, commit_tid))
+ jbd2_log_wait_commit(journal, commit_tid);
+ else if (journal->j_flags & JBD2_BARRIER)
blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
return ret;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 958c3ff..f1bc1e3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -983,6 +983,8 @@ static int ext4_ind_get_blocks(handle_t *handle, struct inode *inode,
goto cleanup;

set_buffer_new(bh_result);
+
+ ext4_update_inode_fsync_trans(handle, inode, 1);
got_it:
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
if (count > blocks_to_boundary)
@@ -4738,6 +4740,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
struct ext4_inode *raw_inode;
struct ext4_inode_info *ei;
struct inode *inode;
+ journal_t *journal = EXT4_SB(sb)->s_journal;
long ret;
int block;

@@ -4802,6 +4805,31 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
ei->i_data[block] = raw_inode->i_block[block];
INIT_LIST_HEAD(&ei->i_orphan);

+ /*
+ * Set transaction id's of transactions that have to be committed
+ * to finish f[data]sync. We set them to currently running transaction
+ * as we cannot be sure that the inode or some of its metadata isn't
+ * part of the transaction - the inode could have been reclaimed and
+ * now it is reread from disk.
+ */
+ if (journal) {
+ transaction_t *transaction;
+ tid_t tid;
+
+ spin_lock(&journal->j_state_lock);
+ if (journal->j_running_transaction)
+ transaction = journal->j_running_transaction;
+ else
+ transaction = journal->j_committing_transaction;
+ if (transaction)
+ tid = transaction->t_tid;
+ else
+ tid = journal->j_commit_sequence;
+ spin_unlock(&journal->j_state_lock);
+ ei->i_sync_tid = tid;
+ ei->i_datasync_tid = tid;
+ }
+
if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
@@ -5056,6 +5084,7 @@ static int ext4_do_update_inode(handle_t *handle,
err = rc;
ei->i_state &= ~EXT4_STATE_NEW;

+ ext4_update_inode_fsync_trans(handle, inode, 0);
out_brelse:
brelse(bh);
ext4_std_error(inode->i_sb, err);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8ab0c95..2b13dcf 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -706,6 +706,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
spin_lock_init(&(ei->i_block_reservation_lock));
INIT_LIST_HEAD(&ei->i_aio_dio_complete_list);
ei->cur_aio_dio = NULL;
+ ei->i_sync_tid = 0;
+ ei->i_datasync_tid = 0;

return &ei->vfs_inode;
}

2009-12-09 11:29:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

On Tue 08-12-09 23:54:24, [email protected] wrote:
> Here's a revised version of your patch that I've included in the ext4
> patch queue. I've removed the use of the atomic_t data type. I've
OK, I'm just unsure: Isn't even 32-bit read non-atomic if it is not
properly aligned e.g. on powerPC? So shouldn't we make sure those inode
fields are aligned to 32-bit boundary (or at least add a comment about
that)?

> also cleaned up the fsync() function some more to make it easier to
> follow, and I've fixed the handling in no-journal mode (we can't
> depend on the forcing a commit if there is no journal, so we can't
> drop the use of sync_inode() --- we can use simple_fsync(), though,
> which does everything we need if there is no journal).
Good catch. Thanks.

Honza

> commit b436b9bef84de6893e86346d8fbf7104bc520645
> Author: Jan Kara <[email protected]>
> Date: Tue Dec 8 23:51:10 2009 -0500
>
> ext4: Wait for proper transaction commit on fsync
>
> We cannot rely on buffer dirty bits during fsync because pdflush can come
> before fsync is called and clear dirty bits without forcing a transaction
> commit. What we do is that we track which transaction has last changed
> the inode and which transaction last changed allocation and force it to
> disk on fsync.
>
> Signed-off-by: Jan Kara <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 4cfc2f0..ab31e65 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -709,6 +709,13 @@ struct ext4_inode_info {
> struct list_head i_aio_dio_complete_list;
> /* current io_end structure for async DIO write*/
> ext4_io_end_t *cur_aio_dio;
> +
> + /*
> + * Transactions that contain inode's metadata needed to complete
> + * fsync and fdatasync, respectively.
> + */
> + tid_t i_sync_tid;
> + tid_t i_datasync_tid;
> };
>
> /*
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 2c2b262..05eca81 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -249,6 +249,19 @@ static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
> return 0;
> }
>
> +static inline void ext4_update_inode_fsync_trans(handle_t *handle,
> + struct inode *inode,
> + int datasync)
> +{
> + struct ext4_inode_info *ei = EXT4_I(inode);
> +
> + if (ext4_handle_valid(handle)) {
> + ei->i_sync_tid = handle->h_transaction->t_tid;
> + if (datasync)
> + ei->i_datasync_tid = handle->h_transaction->t_tid;
> + }
> +}
> +
> /* super.c */
> int ext4_force_commit(struct super_block *sb);
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 5967f18..700206e 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3058,6 +3058,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
> if (flags == EXT4_GET_BLOCKS_DIO_CONVERT_EXT) {
> ret = ext4_convert_unwritten_extents_dio(handle, inode,
> path);
> + if (ret >= 0)
> + ext4_update_inode_fsync_trans(handle, inode, 1);
> goto out2;
> }
> /* buffered IO case */
> @@ -3085,6 +3087,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
> ret = ext4_ext_convert_to_initialized(handle, inode,
> path, iblock,
> max_blocks);
> + if (ret >= 0)
> + ext4_update_inode_fsync_trans(handle, inode, 1);
> out:
> if (ret <= 0) {
> err = ret;
> @@ -3323,10 +3327,16 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
> allocated = ext4_ext_get_actual_len(&newex);
> set_buffer_new(bh_result);
>
> - /* Cache only when it is _not_ an uninitialized extent */
> - if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0)
> + /*
> + * Cache the extent and update transaction to commit on fdatasync only
> + * when it is _not_ an uninitialized extent.
> + */
> + if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) {
> ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
> EXT4_EXT_CACHE_EXTENT);
> + ext4_update_inode_fsync_trans(handle, inode, 1);
> + } else
> + ext4_update_inode_fsync_trans(handle, inode, 0);
> out:
> if (allocated > max_blocks)
> allocated = max_blocks;
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index a3c2507..0b22497 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -51,25 +51,30 @@
> int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
> {
> struct inode *inode = dentry->d_inode;
> + struct ext4_inode_info *ei = EXT4_I(inode);
> journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> - int err, ret = 0;
> + int ret;
> + tid_t commit_tid;
>
> J_ASSERT(ext4_journal_current_handle() == NULL);
>
> trace_ext4_sync_file(file, dentry, datasync);
>
> + if (inode->i_sb->s_flags & MS_RDONLY)
> + return 0;
> +
> ret = flush_aio_dio_completed_IO(inode);
> if (ret < 0)
> return ret;
> +
> + if (!journal)
> + return simple_fsync(file, dentry, datasync);
> +
> /*
> - * data=writeback:
> + * data=writeback,ordered:
> * The caller's filemap_fdatawrite()/wait will sync the data.
> - * sync_inode() will sync the metadata
> - *
> - * data=ordered:
> - * The caller's filemap_fdatawrite() will write the data and
> - * sync_inode() will write the inode if it is dirty. Then the caller's
> - * filemap_fdatawait() will wait on the pages.
> + * Metadata is in the journal, we wait for proper transaction to
> + * commit here.
> *
> * data=journal:
> * filemap_fdatawrite won't do anything (the buffers are clean).
> @@ -82,27 +87,10 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
> if (ext4_should_journal_data(inode))
> return ext4_force_commit(inode->i_sb);
>
> - if (!journal)
> - ret = sync_mapping_buffers(inode->i_mapping);
> -
> - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> - goto out;
> -
> - /*
> - * The VFS has written the file data. If the inode is unaltered
> - * then we need not start a commit.
> - */
> - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> - struct writeback_control wbc = {
> - .sync_mode = WB_SYNC_ALL,
> - .nr_to_write = 0, /* sys_fsync did this */
> - };
> - err = sync_inode(inode, &wbc);
> - if (ret == 0)
> - ret = err;
> - }
> -out:
> - if (journal && (journal->j_flags & JBD2_BARRIER))
> + commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
> + if (jbd2_log_start_commit(journal, commit_tid))
> + jbd2_log_wait_commit(journal, commit_tid);
> + else if (journal->j_flags & JBD2_BARRIER)
> blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> return ret;
> }
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 958c3ff..f1bc1e3 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -983,6 +983,8 @@ static int ext4_ind_get_blocks(handle_t *handle, struct inode *inode,
> goto cleanup;
>
> set_buffer_new(bh_result);
> +
> + ext4_update_inode_fsync_trans(handle, inode, 1);
> got_it:
> map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
> if (count > blocks_to_boundary)
> @@ -4738,6 +4740,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> struct ext4_inode *raw_inode;
> struct ext4_inode_info *ei;
> struct inode *inode;
> + journal_t *journal = EXT4_SB(sb)->s_journal;
> long ret;
> int block;
>
> @@ -4802,6 +4805,31 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> ei->i_data[block] = raw_inode->i_block[block];
> INIT_LIST_HEAD(&ei->i_orphan);
>
> + /*
> + * Set transaction id's of transactions that have to be committed
> + * to finish f[data]sync. We set them to currently running transaction
> + * as we cannot be sure that the inode or some of its metadata isn't
> + * part of the transaction - the inode could have been reclaimed and
> + * now it is reread from disk.
> + */
> + if (journal) {
> + transaction_t *transaction;
> + tid_t tid;
> +
> + spin_lock(&journal->j_state_lock);
> + if (journal->j_running_transaction)
> + transaction = journal->j_running_transaction;
> + else
> + transaction = journal->j_committing_transaction;
> + if (transaction)
> + tid = transaction->t_tid;
> + else
> + tid = journal->j_commit_sequence;
> + spin_unlock(&journal->j_state_lock);
> + ei->i_sync_tid = tid;
> + ei->i_datasync_tid = tid;
> + }
> +
> if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
> ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
> if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
> @@ -5056,6 +5084,7 @@ static int ext4_do_update_inode(handle_t *handle,
> err = rc;
> ei->i_state &= ~EXT4_STATE_NEW;
>
> + ext4_update_inode_fsync_trans(handle, inode, 0);
> out_brelse:
> brelse(bh);
> ext4_std_error(inode->i_sb, err);
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8ab0c95..2b13dcf 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -706,6 +706,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
> spin_lock_init(&(ei->i_block_reservation_lock));
> INIT_LIST_HEAD(&ei->i_aio_dio_complete_list);
> ei->cur_aio_dio = NULL;
> + ei->i_sync_tid = 0;
> + ei->i_datasync_tid = 0;
>
> return &ei->vfs_inode;
> }
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-12-09 11:30:14

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

On Tue 08-12-09 22:51:20, [email protected] wrote:
> On Mon, Nov 16, 2009 at 11:43:31AM +0100, Jan Kara wrote:
> > > Why do we need an atomic_t here at all? It's not clear it's
> > > necessary. What specific race are you worried about?
> > Well, i_[data]sync_tid are set and read from several places which aren't
> > currently synchronized against each other and I din't want to introduce any
> > synchronization. So atomic_t seemed like a clean way of making clear that
> > loads / stores from those fields are atomic, regardless of architecture,
> > memory alignment or whatever insanity that might make us see just half
> > overwritten field. On all archs where this means just plain stores / loads
> > (such as x86), there's no performance hit since the operations are
> > implemented as such.
>
> Sorry for not responding to this one sooner, but see this URL:
>
> http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/
Thanks for the pointer :) It's a good reading...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-12-09 14:32:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

On Wed, Dec 09, 2009 at 12:29:40PM +0100, Jan Kara wrote:
> On Tue 08-12-09 23:54:24, [email protected] wrote:
> > Here's a revised version of your patch that I've included in the ext4
> > patch queue. I've removed the use of the atomic_t data type. I've
> OK, I'm just unsure: Isn't even 32-bit read non-atomic if it is not
> properly aligned e.g. on powerPC? So shouldn't we make sure those inode
> fields are aligned to 32-bit boundary (or at least add a comment about
> that)?

But gcc guarantees such alignments, unless you do something horrible
from a performance point of view, such as using
__attribute__((packed)). And in fact, if things are not aligned, on
some platforms unaligned access will trigger a software trap, and the
kernel trap handler has to fix up the unaligned access. (I believe
Power is one of these platforms if memory serves correctly.) Even on
platforms that don't, the performance hit where the hardware has to do
the unaligned access so horrible that gcc avoids the misalignment
automatically.

That's why sometimes I'll go around and try to adjust structure member
orders, since something like this:

struct foo {
short int a;
int b;
short int c;
} foo_array[2];

Will consume 24 bytes.... while this:

struct foo {
short int a;
short int c;
int b;
} foo_array[2];

... will consume 16 bytes. For structures that are used in huge
amounts (like inode slab caches), it can mean a measurable memory
savings.

- Ted