2009-10-20 07:25:40

by Jan Kara

[permalink] [raw]
Subject: [PATCH 0/4] Fix fsync bug in ext3 and ext4

Hi,

the patch series below fixes the problem ext3 and ext4 has that we
rely on inode dirty bits for correct fsync. This has two problems:
a) ext3 and ext4 often does not set them (as they call ext?_mark_inode_dirty)
and thus only quota code sets the dirty bit as a sideeffect.
b) pdflush can come in and clear the dirty bit any time (thanks to Chris
for pointing out this).

The second and third patch in the series are just minor bug fixes to ext4.
I did some basic testing to verify that the code does what I'd expect it to
do but an extra pair of eyes checking the new code would be helpful...

Honza



2009-10-20 07:25:50

by Jan Kara

[permalink] [raw]
Subject: [PATCH 1/4] ext3: Wait for proper transaction commit on fsync

We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.

CC: Andrew Morton <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/fsync.c | 36 ++++++++++++++++--------------------
fs/ext3/inode.c | 35 ++++++++++++++++++++++++++++++++++-
include/linux/ext3_fs_i.h | 8 ++++++++
3 files changed, 58 insertions(+), 21 deletions(-)

diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index 451d166..8209f26 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -46,19 +46,21 @@
int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
int ret = 0;
+ tid_t commit_tid;
+
+ if (inode->i_sb->s_flags & MS_RDONLY)
+ return 0;

J_ASSERT(ext3_journal_current_handle() == NULL);

/*
- * data=writeback:
+ * data=writeback,ordered:
* The caller's filemap_fdatawrite()/wait will sync the data.
- * sync_inode() will sync the metadata
- *
- * data=ordered:
- * The caller's filemap_fdatawrite() will write the data and
- * sync_inode() will write the inode if it is dirty. Then the caller's
- * filemap_fdatawait() will wait on the pages.
+ * Metadata is in the journal, we wait for a proper transaction
+ * to commit here.
*
* data=journal:
* filemap_fdatawrite won't do anything (the buffers are clean).
@@ -73,22 +75,16 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
goto out;
}

- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto flush;
+ if (datasync)
+ commit_tid = atomic_read(&ei->i_datasync_tid);
+ else
+ commit_tid = atomic_read(&ei->i_sync_tid);

- /*
- * The VFS has written the file data. If the inode is unaltered
- * then we need not start a commit.
- */
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_ALL,
- .nr_to_write = 0, /* sys_fsync did this */
- };
- ret = sync_inode(inode, &wbc);
+ if (log_start_commit(journal, commit_tid)) {
+ log_wait_commit(journal, commit_tid);
goto out;
}
-flush:
+
/*
* In case we didn't commit a transaction, we have to flush
* disk caches manually so that data really is on persistent
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index acf1b14..0a956a8 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -699,8 +699,9 @@ static int ext3_splice_branch(handle_t *handle, struct inode *inode,
int err = 0;
struct ext3_block_alloc_info *block_i;
ext3_fsblk_t current_block;
+ struct ext3_inode_info *ei = EXT3_I(inode);

- block_i = EXT3_I(inode)->i_block_alloc_info;
+ block_i = ei->i_block_alloc_info;
/*
* If we're splicing into a [td]indirect block (as opposed to the
* inode) then we need to get write access to the [td]indirect block
@@ -741,6 +742,8 @@ static int ext3_splice_branch(handle_t *handle, struct inode *inode,

inode->i_ctime = CURRENT_TIME_SEC;
ext3_mark_inode_dirty(handle, inode);
+ /* ext3_mark_inode_dirty already updated i_sync_tid */
+ atomic_set(&ei->i_datasync_tid, handle->h_transaction->t_tid);

/* had we spliced it onto indirect block? */
if (where->bh) {
@@ -2750,6 +2753,8 @@ struct inode *ext3_iget(struct super_block *sb, unsigned long ino)
struct ext3_inode_info *ei;
struct buffer_head *bh;
struct inode *inode;
+ journal_t *journal = EXT3_SB(sb)->s_journal;
+ transaction_t *transaction;
long ret;
int block;

@@ -2827,6 +2832,33 @@ struct inode *ext3_iget(struct super_block *sb, unsigned long ino)
ei->i_data[block] = raw_inode->i_block[block];
INIT_LIST_HEAD(&ei->i_orphan);

+ /*
+ * Set transaction id's of transactions that have to be committed
+ * to finish f[data]sync. We set them to currently running transaction
+ * as we cannot be sure that the inode or some of its metadata isn't
+ * part of the transaction - the inode could have been reclaimed and
+ * now it is reread from disk.
+ */
+ if (journal) {
+ tid_t tid;
+
+ spin_lock(&journal->j_state_lock);
+ if (journal->j_running_transaction)
+ transaction = journal->j_running_transaction;
+ else
+ transaction = journal->j_committing_transaction;
+ if (transaction)
+ tid = transaction->t_tid;
+ else
+ tid = journal->j_commit_sequence;
+ spin_unlock(&journal->j_state_lock);
+ atomic_set(&ei->i_sync_tid, tid);
+ atomic_set(&ei->i_datasync_tid, tid);
+ } else {
+ atomic_set(&ei->i_sync_tid, 0);
+ atomic_set(&ei->i_datasync_tid, 0);
+ }
+
if (inode->i_ino >= EXT3_FIRST_INO(inode->i_sb) + 1 &&
EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) {
/*
@@ -3011,6 +3043,7 @@ again:
err = rc;
ei->i_state &= ~EXT3_STATE_NEW;

+ atomic_set(&ei->i_sync_tid, handle->h_transaction->t_tid);
out_brelse:
brelse (bh);
ext3_std_error(inode->i_sb, err);
diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
index ca1bfe9..93e7428 100644
--- a/include/linux/ext3_fs_i.h
+++ b/include/linux/ext3_fs_i.h
@@ -137,6 +137,14 @@ struct ext3_inode_info {
* by other means, so we have truncate_mutex.
*/
struct mutex truncate_mutex;
+
+ /*
+ * Transactions that contain inode's metadata needed to complete
+ * fsync and fdatasync, respectively.
+ */
+ atomic_t i_sync_tid;
+ atomic_t i_datasync_tid;
+
struct inode vfs_inode;
};

--
1.6.0.2


2009-10-20 07:26:10

by Jan Kara

[permalink] [raw]
Subject: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 7 +++++++
fs/ext4/extents.c | 5 +++++
fs/ext4/fsync.c | 40 +++++++++++++++++-----------------------
fs/ext4/inode.c | 34 ++++++++++++++++++++++++++++++++++
4 files changed, 63 insertions(+), 23 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 984ca0c..5639f30 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -702,6 +702,13 @@ struct ext4_inode_info {
struct list_head i_aio_dio_complete_list;
/* current io_end structure for async DIO write*/
ext4_io_end_t *cur_aio_dio;
+
+ /*
+ * Transactions that contain inode's metadata needed to complete
+ * fsync and fdatasync, respectively.
+ */
+ atomic_t i_sync_tid;
+ atomic_t i_datasync_tid;
};

/*
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 10539e3..3e167f6 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3315,6 +3315,11 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
newblock = ext_pblock(&newex);
allocated = ext4_ext_get_actual_len(&newex);
set_buffer_new(bh_result);
+
+ atomic_set(&EXT4_I(inode)->i_sync_tid, handle->h_transaction->t_tid);
+ atomic_set(&EXT4_I(inode)->i_datasync_tid,
+ handle->h_transaction->t_tid);
+ printk("Datasync tid %u\n", handle->h_transaction->t_tid);

/* Cache only when it is _not_ an uninitialized extent */
if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 2bf9413..03e0fe9 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -51,25 +51,26 @@
int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ struct ext4_inode_info *ei = EXT4_I(inode);
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
- int err, ret = 0;
+ int ret = 0;
+ tid_t commit_tid;

J_ASSERT(ext4_journal_current_handle() == NULL);

trace_ext4_sync_file(file, dentry, datasync);

+ if (inode->i_sb->s_flags & MS_RDONLY)
+ goto out;
+
ret = flush_aio_dio_completed_IO(inode);
if (ret < 0)
goto out;
/*
- * data=writeback:
+ * data=writeback,ordered:
* The caller's filemap_fdatawrite()/wait will sync the data.
- * sync_inode() will sync the metadata
- *
- * data=ordered:
- * The caller's filemap_fdatawrite() will write the data and
- * sync_inode() will write the inode if it is dirty. Then the caller's
- * filemap_fdatawait() will wait on the pages.
+ * Metadata is in the journal, we wait for proper transaction to
+ * commit here.
*
* data=journal:
* filemap_fdatawrite won't do anything (the buffers are clean).
@@ -87,23 +88,16 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
if (!journal)
ret = sync_mapping_buffers(inode->i_mapping);

- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto flush;
+ if (datasync)
+ commit_tid = atomic_read(&ei->i_datasync_tid);
+ else
+ commit_tid = atomic_read(&ei->i_sync_tid);

- /*
- * The VFS has written the file data. If the inode is unaltered
- * then we need not start a commit.
- */
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_ALL,
- .nr_to_write = 0, /* sys_fsync did this */
- };
- err = sync_inode(inode, &wbc);
- if (ret == 0)
- ret = err;
+ if (jbd2_log_start_commit(journal, commit_tid)) {
+ jbd2_log_wait_commit(journal, commit_tid);
+ goto out;
}
-flush:
+
if (journal && (journal->j_flags & JBD2_BARRIER))
blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
out:
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9105f40..412de4e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1024,6 +1024,10 @@ static int ext4_ind_get_blocks(handle_t *handle, struct inode *inode,
goto cleanup;

set_buffer_new(bh_result);
+
+ atomic_set(&EXT4_I(inode)->i_sync_tid, handle->h_transaction->t_tid);
+ atomic_set(&EXT4_I(inode)->i_datasync_tid,
+ handle->h_transaction->t_tid);
got_it:
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
if (count > blocks_to_boundary)
@@ -4777,6 +4781,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
struct ext4_inode_info *ei;
struct buffer_head *bh;
struct inode *inode;
+ journal_t *journal = EXT4_SB(sb)->s_journal;
long ret;
int block;

@@ -4842,6 +4847,34 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
ei->i_data[block] = raw_inode->i_block[block];
INIT_LIST_HEAD(&ei->i_orphan);

+ /*
+ * Set transaction id's of transactions that have to be committed
+ * to finish f[data]sync. We set them to currently running transaction
+ * as we cannot be sure that the inode or some of its metadata isn't
+ * part of the transaction - the inode could have been reclaimed and
+ * now it is reread from disk.
+ */
+ if (journal) {
+ transaction_t *transaction;
+ tid_t tid;
+
+ spin_lock(&journal->j_state_lock);
+ if (journal->j_running_transaction)
+ transaction = journal->j_running_transaction;
+ else
+ transaction = journal->j_committing_transaction;
+ if (transaction)
+ tid = transaction->t_tid;
+ else
+ tid = journal->j_commit_sequence;
+ spin_unlock(&journal->j_state_lock);
+ atomic_set(&ei->i_sync_tid, tid);
+ atomic_set(&ei->i_datasync_tid, tid);
+ } else {
+ atomic_set(&ei->i_sync_tid, 0);
+ atomic_set(&ei->i_datasync_tid, 0);
+ }
+
if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
@@ -5102,6 +5135,7 @@ static int ext4_do_update_inode(handle_t *handle,
err = rc;
ei->i_state &= ~EXT4_STATE_NEW;

+ atomic_set(&ei->i_sync_tid, handle->h_transaction->t_tid);
out_brelse:
brelse(bh);
ext4_std_error(inode->i_sb, err);
--
1.6.0.2


2009-10-20 07:26:00

by Jan Kara

[permalink] [raw]
Subject: [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path

There is no reason why we should issue IO barriers when we
just hit an error in ext4_sync_file. So move out: label to
just return and introduce flush: label to those places which
really need it.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/fsync.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 2b15312..2bf9413 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -88,7 +88,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
ret = sync_mapping_buffers(inode->i_mapping);

if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ goto flush;

/*
* The VFS has written the file data. If the inode is unaltered
@@ -103,8 +103,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
if (ret == 0)
ret = err;
}
-out:
+flush:
if (journal && (journal->j_flags & JBD2_BARRIER))
blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+out:
return ret;
}
--
1.6.0.2


2009-10-20 07:26:25

by Jan Kara

[permalink] [raw]
Subject: [PATCH 3/4] ext4: Fix error handling ext4_ind_get_blocks

When an error happened in ext4_splice_branch we failed to notice that
in ext4_ind_get_blocks and mapped the buffer anyway. Fix the problem
by checking for error properly.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c5bc5d..9105f40 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1020,7 +1020,7 @@ static int ext4_ind_get_blocks(handle_t *handle, struct inode *inode,
if (!err)
err = ext4_splice_branch(handle, inode, iblock,
partial, indirect_blks, count);
- else
+ if (err)
goto cleanup;

set_buffer_new(bh_result);
--
1.6.0.2


2009-10-20 12:31:36

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

On Tue, Oct 20, 2009 at 09:24:38AM +0200, Jan Kara wrote:
> We cannot rely on buffer dirty bits during fsync because pdflush can come
> before fsync is called and clear dirty bits without forcing a transaction
> commit. What we do is that we track which transaction has last changed
> the inode and which transaction last changed allocation and force it to
> disk on fsync.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/ext4.h | 7 +++++++
> fs/ext4/extents.c | 5 +++++
> fs/ext4/fsync.c | 40 +++++++++++++++++-----------------------
> fs/ext4/inode.c | 34 ++++++++++++++++++++++++++++++++++
> 4 files changed, 63 insertions(+), 23 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 984ca0c..5639f30 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -702,6 +702,13 @@ struct ext4_inode_info {
> struct list_head i_aio_dio_complete_list;
> /* current io_end structure for async DIO write*/
> ext4_io_end_t *cur_aio_dio;
> +
> + /*
> + * Transactions that contain inode's metadata needed to complete
> + * fsync and fdatasync, respectively.
> + */
> + atomic_t i_sync_tid;
> + atomic_t i_datasync_tid;
> };
>
> /*
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 10539e3..3e167f6 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3315,6 +3315,11 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
> newblock = ext_pblock(&newex);
> allocated = ext4_ext_get_actual_len(&newex);
> set_buffer_new(bh_result);
> +
> + atomic_set(&EXT4_I(inode)->i_sync_tid, handle->h_transaction->t_tid);
> + atomic_set(&EXT4_I(inode)->i_datasync_tid,
> + handle->h_transaction->t_tid);
> + printk("Datasync tid %u\n", handle->h_transaction->t_tid);

The printk need to be removed ?

Also i am wondering wether we need to update i_datasync_tid only if we
allocate new blocks ? How about writing to an fallocate area. I guess
we need to track the transaction in which we are marking an extent
initialized.

-aneesh

2009-10-20 14:36:56

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

On Tue, Oct 20, 2009 at 12:24 AM, Jan Kara <[email protected]> wrote:
> We cannot rely on buffer dirty bits during fsync because pdflush can come
> before fsync is called and clear dirty bits without forcing a transaction
> commit. What we do is that we track which transaction has last changed
> the inode and which transaction last changed allocation and force it to
> disk on fsync.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> ?fs/ext4/ext4.h ? ?| ? ?7 +++++++
> ?fs/ext4/extents.c | ? ?5 +++++
> ?fs/ext4/fsync.c ? | ? 40 +++++++++++++++++-----------------------
> ?fs/ext4/inode.c ? | ? 34 ++++++++++++++++++++++++++++++++++
> ?4 files changed, 63 insertions(+), 23 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 984ca0c..5639f30 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -702,6 +702,13 @@ struct ext4_inode_info {
> ? ? ? ?struct list_head i_aio_dio_complete_list;
> ? ? ? ?/* current io_end structure for async DIO write*/
> ? ? ? ?ext4_io_end_t *cur_aio_dio;
> +
> + ? ? ? /*
> + ? ? ? ?* Transactions that contain inode's metadata needed to complete
> + ? ? ? ?* fsync and fdatasync, respectively.
> + ? ? ? ?*/
> + ? ? ? atomic_t i_sync_tid;
> + ? ? ? atomic_t i_datasync_tid;
> ?};
>
> ?/*
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 10539e3..3e167f6 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3315,6 +3315,11 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
> ? ? ? ?newblock = ext_pblock(&newex);
> ? ? ? ?allocated = ext4_ext_get_actual_len(&newex);
> ? ? ? ?set_buffer_new(bh_result);
> +
> + ? ? ? atomic_set(&EXT4_I(inode)->i_sync_tid, handle->h_transaction->t_tid);
> + ? ? ? atomic_set(&EXT4_I(inode)->i_datasync_tid,
> + ? ? ? ? ? ? ? ? ?handle->h_transaction->t_tid);
> + ? ? ? printk("Datasync tid %u\n", handle->h_transaction->t_tid);

Both here and in ext4_ind_get_blocks() below, I think you need to
guard the atomic_set() calls with ext4_handle_valid().

Curt

>
> ? ? ? ?/* Cache only when it is _not_ an uninitialized extent */
> ? ? ? ?if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0)
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 2bf9413..03e0fe9 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -51,25 +51,26 @@
> ?int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
> ?{
> ? ? ? ?struct inode *inode = dentry->d_inode;
> + ? ? ? struct ext4_inode_info *ei = EXT4_I(inode);
> ? ? ? ?journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> - ? ? ? int err, ret = 0;
> + ? ? ? int ret = 0;
> + ? ? ? tid_t commit_tid;
>
> ? ? ? ?J_ASSERT(ext4_journal_current_handle() == NULL);
>
> ? ? ? ?trace_ext4_sync_file(file, dentry, datasync);
>
> + ? ? ? if (inode->i_sb->s_flags & MS_RDONLY)
> + ? ? ? ? ? ? ? goto out;
> +
> ? ? ? ?ret = flush_aio_dio_completed_IO(inode);
> ? ? ? ?if (ret < 0)
> ? ? ? ? ? ? ? ?goto out;
> ? ? ? ?/*
> - ? ? ? ?* data=writeback:
> + ? ? ? ?* data=writeback,ordered:
> ? ? ? ? * ?The caller's filemap_fdatawrite()/wait will sync the data.
> - ? ? ? ?* ?sync_inode() will sync the metadata
> - ? ? ? ?*
> - ? ? ? ?* data=ordered:
> - ? ? ? ?* ?The caller's filemap_fdatawrite() will write the data and
> - ? ? ? ?* ?sync_inode() will write the inode if it is dirty. ?Then the caller's
> - ? ? ? ?* ?filemap_fdatawait() will wait on the pages.
> + ? ? ? ?* ?Metadata is in the journal, we wait for proper transaction to
> + ? ? ? ?* ?commit here.
> ? ? ? ? *
> ? ? ? ? * data=journal:
> ? ? ? ? * ?filemap_fdatawrite won't do anything (the buffers are clean).
> @@ -87,23 +88,16 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
> ? ? ? ?if (!journal)
> ? ? ? ? ? ? ? ?ret = sync_mapping_buffers(inode->i_mapping);
>
> - ? ? ? if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> - ? ? ? ? ? ? ? goto flush;
> + ? ? ? if (datasync)
> + ? ? ? ? ? ? ? commit_tid = atomic_read(&ei->i_datasync_tid);
> + ? ? ? else
> + ? ? ? ? ? ? ? commit_tid = atomic_read(&ei->i_sync_tid);
>
> - ? ? ? /*
> - ? ? ? ?* The VFS has written the file data. ?If the inode is unaltered
> - ? ? ? ?* then we need not start a commit.
> - ? ? ? ?*/
> - ? ? ? if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> - ? ? ? ? ? ? ? struct writeback_control wbc = {
> - ? ? ? ? ? ? ? ? ? ? ? .sync_mode = WB_SYNC_ALL,
> - ? ? ? ? ? ? ? ? ? ? ? .nr_to_write = 0, /* sys_fsync did this */
> - ? ? ? ? ? ? ? };
> - ? ? ? ? ? ? ? err = sync_inode(inode, &wbc);
> - ? ? ? ? ? ? ? if (ret == 0)
> - ? ? ? ? ? ? ? ? ? ? ? ret = err;
> + ? ? ? if (jbd2_log_start_commit(journal, commit_tid)) {
> + ? ? ? ? ? ? ? jbd2_log_wait_commit(journal, commit_tid);
> + ? ? ? ? ? ? ? goto out;
> ? ? ? ?}
> -flush:
> +
> ? ? ? ?if (journal && (journal->j_flags & JBD2_BARRIER))
> ? ? ? ? ? ? ? ?blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> ?out:
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 9105f40..412de4e 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1024,6 +1024,10 @@ static int ext4_ind_get_blocks(handle_t *handle, struct inode *inode,
> ? ? ? ? ? ? ? ?goto cleanup;
>
> ? ? ? ?set_buffer_new(bh_result);
> +
> + ? ? ? atomic_set(&EXT4_I(inode)->i_sync_tid, handle->h_transaction->t_tid);
> + ? ? ? atomic_set(&EXT4_I(inode)->i_datasync_tid,
> + ? ? ? ? ? ? ? ? ?handle->h_transaction->t_tid);
> ?got_it:
> ? ? ? ?map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
> ? ? ? ?if (count > blocks_to_boundary)
> @@ -4777,6 +4781,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> ? ? ? ?struct ext4_inode_info *ei;
> ? ? ? ?struct buffer_head *bh;
> ? ? ? ?struct inode *inode;
> + ? ? ? journal_t *journal = EXT4_SB(sb)->s_journal;
> ? ? ? ?long ret;
> ? ? ? ?int block;
>
> @@ -4842,6 +4847,34 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> ? ? ? ? ? ? ? ?ei->i_data[block] = raw_inode->i_block[block];
> ? ? ? ?INIT_LIST_HEAD(&ei->i_orphan);
>
> + ? ? ? /*
> + ? ? ? ?* Set transaction id's of transactions that have to be committed
> + ? ? ? ?* to finish f[data]sync. We set them to currently running transaction
> + ? ? ? ?* as we cannot be sure that the inode or some of its metadata isn't
> + ? ? ? ?* part of the transaction - the inode could have been reclaimed and
> + ? ? ? ?* now it is reread from disk.
> + ? ? ? ?*/
> + ? ? ? if (journal) {
> + ? ? ? ? ? ? ? transaction_t *transaction;
> + ? ? ? ? ? ? ? tid_t tid;
> +
> + ? ? ? ? ? ? ? spin_lock(&journal->j_state_lock);
> + ? ? ? ? ? ? ? if (journal->j_running_transaction)
> + ? ? ? ? ? ? ? ? ? ? ? transaction = journal->j_running_transaction;
> + ? ? ? ? ? ? ? else
> + ? ? ? ? ? ? ? ? ? ? ? transaction = journal->j_committing_transaction;
> + ? ? ? ? ? ? ? if (transaction)
> + ? ? ? ? ? ? ? ? ? ? ? tid = transaction->t_tid;
> + ? ? ? ? ? ? ? else
> + ? ? ? ? ? ? ? ? ? ? ? tid = journal->j_commit_sequence;
> + ? ? ? ? ? ? ? spin_unlock(&journal->j_state_lock);
> + ? ? ? ? ? ? ? atomic_set(&ei->i_sync_tid, tid);
> + ? ? ? ? ? ? ? atomic_set(&ei->i_datasync_tid, tid);
> + ? ? ? } else {
> + ? ? ? ? ? ? ? atomic_set(&ei->i_sync_tid, 0);
> + ? ? ? ? ? ? ? atomic_set(&ei->i_datasync_tid, 0);
> + ? ? ? }
> +
> ? ? ? ?if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
> ? ? ? ? ? ? ? ?ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
> ? ? ? ? ? ? ? ?if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
> @@ -5102,6 +5135,7 @@ static int ext4_do_update_inode(handle_t *handle,
> ? ? ? ? ? ? ? ?err = rc;
> ? ? ? ?ei->i_state &= ~EXT4_STATE_NEW;
>
> + ? ? ? atomic_set(&ei->i_sync_tid, handle->h_transaction->t_tid);
> ?out_brelse:
> ? ? ? ?brelse(bh);
> ? ? ? ?ext4_std_error(inode->i_sb, err);
> --
> 1.6.0.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>

2009-10-20 15:53:44

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

On Tue 20-10-09 07:36:55, Curt Wohlgemuth wrote:
> On Tue, Oct 20, 2009 at 12:24 AM, Jan Kara <[email protected]> wrote:
> > We cannot rely on buffer dirty bits during fsync because pdflush can come
> > before fsync is called and clear dirty bits without forcing a transaction
> > commit. What we do is that we track which transaction has last changed
> > the inode and which transaction last changed allocation and force it to
> > disk on fsync.
<snip>
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 10539e3..3e167f6 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -3315,6 +3315,11 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
> > ? ? ? ?newblock = ext_pblock(&newex);
> > ? ? ? ?allocated = ext4_ext_get_actual_len(&newex);
> > ? ? ? ?set_buffer_new(bh_result);
> > +
> > + ? ? ? atomic_set(&EXT4_I(inode)->i_sync_tid, handle->h_transaction->t_tid);
> > + ? ? ? atomic_set(&EXT4_I(inode)->i_datasync_tid,
> > + ? ? ? ? ? ? ? ? ?handle->h_transaction->t_tid);
> > + ? ? ? printk("Datasync tid %u\n", handle->h_transaction->t_tid);
>
> Both here and in ext4_ind_get_blocks() below, I think you need to
> guard the atomic_set() calls with ext4_handle_valid().
Ah, good point. Will fix.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-10-20 16:06:15

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync

On Tue 20-10-09 18:01:31, Aneesh Kumar K.V wrote:
> On Tue, Oct 20, 2009 at 09:24:38AM +0200, Jan Kara wrote:
> > We cannot rely on buffer dirty bits during fsync because pdflush can come
> > before fsync is called and clear dirty bits without forcing a transaction
> > commit. What we do is that we track which transaction has last changed
> > the inode and which transaction last changed allocation and force it to
> > disk on fsync.
<snip>
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 10539e3..3e167f6 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -3315,6 +3315,11 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
> > newblock = ext_pblock(&newex);
> > allocated = ext4_ext_get_actual_len(&newex);
> > set_buffer_new(bh_result);
> > +
> > + atomic_set(&EXT4_I(inode)->i_sync_tid, handle->h_transaction->t_tid);
> > + atomic_set(&EXT4_I(inode)->i_datasync_tid,
> > + handle->h_transaction->t_tid);
> > + printk("Datasync tid %u\n", handle->h_transaction->t_tid);
>
> The printk need to be removed ?
Ah, missed that debugging aid. Will fix.

> Also i am wondering wether we need to update i_datasync_tid only if we
> allocate new blocks ? How about writing to an fallocate area. I guess
Yes, we need to update it only if we allocate new blocks but that should
be what I've done (but maybe I screwed up...).

> we need to track the transaction in which we are marking an extent
> initialized.
You are right that i_datasync_tid needs to be updated also when we
convert uninitialized extents to initialized ones. I'll fix this case.
Thanks for review.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR