2009-12-11 08:16:08

by Oleg Drokin

Subject: [PATCH] Possible data loss on ext[34], reiserfs with external journal

Hello!

It seems that when an external journal device is used for ext3, ext4 or reiserfs
(possibly others, but I can only readily confirm these three) and the
main filesystem device has its writeback cache enabled, very real
data loss is possible, because we never flush the main device's cache on commit.
As a result, if we just wrote some files in ordered mode and the
transaction was then committed, the journal data makes it to the disk
(provided that barriers are in use), but the actual file data may have made
it only to the device cache. A sudden loss of power at this stage
would leave the files in place, but with their content replaced by
whatever happened to be in those blocks before.

This simple patch at the end should remedy the problem.

Also, I wonder if there would be strong opposition to adding an async version
of blkdev_issue_flush()? Essentially it would be the current blkdev_issue_flush()
returning straight after the submit_bio() call, handing back the bio itself
as a void *cookie for later waiting by the caller. (Naturally, the completion
would need to be allocated on the stack.)

This would allow an extra optimization: to avoid dead waiting, we could
submit the barrier to the data device right after we have scheduled all the data
blocks, and only wait for its completion right before we submit the commit
record, with the actual transaction preparation and writing happening in between.

Signed-off-by: Oleg Drokin <[email protected]>

jbd/commit.c | 7 +++++++
jbd2/commit.c | 8 ++++++++
reiserfs/journal.c | 5 +++++
3 files changed, 20 insertions(+)

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 4bd8825..60c190c 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -21,6 +21,7 @@
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/bio.h>
+#include <linux/blkdev.h>

/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -787,6 +788,12 @@ wait_for_iobuf:

jbd_debug(3, "JBD: commit phase 6\n");

+ /* Flush the external data device first */
+ if ((journal->j_fs_dev != journal->j_dev) &&
+ journal->j_flags & JFS_BARRIER)
+ blkdev_issue_flush(journal->j_fs_dev, NULL);
+
+
if (journal_write_commit_record(journal, commit_transaction))
err = -EIO;

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 8896c1d..e653d72 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -712,6 +712,10 @@ start_journal_io:

if (JBD2_HAS_INCOMPAT_FEATURE(journal,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+ /* Flush the external data device first */
+ if ((journal->j_fs_dev != journal->j_dev) &&
+ journal->j_flags & JBD2_BARRIER)
+ blkdev_issue_flush(journal->j_fs_dev, NULL);
err = journal_submit_commit_record(journal, commit_transaction,
&cbh, crc32_sum);
if (err)
@@ -841,6 +845,10 @@ wait_for_iobuf:

if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+ /* Flush the external data device first */
+ if ((journal->j_fs_dev != journal->j_dev) &&
+ journal->j_flags & JBD2_BARRIER)
+ blkdev_issue_flush(journal->j_fs_dev, NULL);
err = journal_submit_commit_record(journal, commit_transaction,
&cbh, crc32_sum);
if (err)
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 2f8a7e7..c49f5a3 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -1104,6 +1104,11 @@ static int flush_commit_list(struct super_block *s,
barrier = reiserfs_barrier_flush(s);
if (barrier) {
int ret;
+
+ /* Flush the data device first */
+ if (s->s_bdev != journal->j_dev_bd)
+ blkdev_issue_flush(s->s_bdev, NULL);
+
lock_buffer(jl->j_commit_bh);
ret = submit_barrier_buffer(jl->j_commit_bh);
if (ret == -EOPNOTSUPP) {


2009-12-15 05:32:07

by Theodore Ts'o

Subject: Re: [PATCH] Possible data loss on ext[34], reiserfs with external journal

On Fri, Dec 11, 2009 at 11:16:08AM +0300, Oleg Drokin wrote:
> Hello!
>
> It seems that when an external journal device is used for ext3, ext4 or reiserfs
> (possibly others, but I can only readily confirm these three) and the
> main filesystem device has its writeback cache enabled, very real
> data loss is possible, because we never flush the main device's cache on commit.
> As a result, if we just wrote some files in ordered mode and the
> transaction was then committed, the journal data makes it to the disk
> (provided that barriers are in use), but the actual file data may have made
> it only to the device cache. A sudden loss of power at this stage
> would leave the files in place, but with their content replaced by
> whatever happened to be in those blocks before.
>
> This simple patch at the end should remedy the problem.

Can you separate this patch into separate ones for each file system?
I think we can actually do better for ext4; for example, in the case
of data=journal or data=writeback, it's not necessary to flush the fs
data device. It's only the case of data=ordered that we need to send
a barrier. With that optimization, we do need to add a barrier in the
case of fsync() and when we are doing a journal checkpoint, but the
extra code complexity is worth not having to force a barrier for the
fs data device for every single commit, especially in the data=journal
mode with a heavy fsync workload.

Do you have a test case that will allow you to easily try out this
patch, in all of ext4's various journalling modes? Thanks!!

- Ted

commit 4ba96b277f26952097d15736f6e960bc84cf4c0c
Author: Theodore Ts'o <[email protected]>
Date: Tue Dec 15 00:31:12 2009 -0500

ext4, jbd2: Add barriers for file systems with external journals

This is a bit complicated because we are trying to optimize when we
send barriers to the fs data disk. We could just throw in an extra
barrier to the data disk whenever we send a barrier to the journal
disk, but that's not always strictly necessary.

We only need to send a barrier during a commit when there are data
blocks which must be written out due to an inode written in
ordered mode, or if fsync() depends on the commit to force data blocks
to disk. Finally, before we drop transactions from the beginning of
the journal during a checkpoint operation, we need to guarantee that
any blocks that were flushed out to the data disk are firmly on the
rust platter before we drop the transaction from the journal.

Thanks to Oleg Drokin for pointing out this flaw in ext3/ext4.

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 0b22497..98bd140 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -88,9 +88,21 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
return ext4_force_commit(inode->i_sb);

commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
- if (jbd2_log_start_commit(journal, commit_tid))
+ if (jbd2_log_start_commit(journal, commit_tid)) {
+ /*
+ * When the journal is on a different device than the
+ * fs data disk, we need to issue the barrier in
+ * writeback mode. (In ordered mode, the jbd2 layer
+ * will take care of issuing the barrier. In
+ * data=journal, all of the data blocks are written to
+ * the journal device.)
+ */
+ if (ext4_should_writeback_data(inode) &&
+ (journal->j_fs_dev != journal->j_dev) &&
+ (journal->j_flags & JBD2_BARRIER))
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
jbd2_log_wait_commit(journal, commit_tid);
- else if (journal->j_flags & JBD2_BARRIER)
+ } else if (journal->j_flags & JBD2_BARRIER)
blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
return ret;
}
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index ca0f5eb..8868493 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -22,6 +22,7 @@
#include <linux/jbd2.h>
#include <linux/errno.h>
#include <linux/slab.h>
+#include <linux/blkdev.h>
#include <trace/events/jbd2.h>

/*
@@ -515,6 +516,20 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
journal->j_tail_sequence = first_tid;
journal->j_tail = blocknr;
spin_unlock(&journal->j_state_lock);
+
+ /*
+ * If there is an external journal, we need to make sure that
+ * any data blocks that were recently written out --- perhaps
+ * by jbd2_log_do_checkpoint() --- are flushed out before we
+ * drop the transactions from the external journal. It's
+ * unlikely this will be necessary, especially with an
+ * appropriately sized journal, but we need this to guarantee
+ * correctness. Fortunately jbd2_cleanup_journal_tail()
+ * doesn't get called all that often.
+ */
+ if ((journal->j_fs_dev != journal->j_dev) &&
+ (journal->j_flags & JBD2_BARRIER))
+ blkdev_issue_flush(journal->j_fs_dev, NULL);
if (!(journal->j_flags & JBD2_ABORT))
jbd2_journal_update_superblock(journal, 1);
return 0;
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6a10238..5b78f9a 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -259,6 +259,7 @@ static int journal_submit_data_buffers(journal_t *journal,
ret = err;
spin_lock(&journal->j_list_lock);
J_ASSERT(jinode->i_transaction == commit_transaction);
+ commit_transaction->t_flushed_data_blocks = 1;
jinode->i_flags &= ~JI_COMMIT_RUNNING;
wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
}
@@ -277,6 +278,16 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
struct jbd2_inode *jinode, *next_i;
int err, ret = 0;

+ /*
+ * If the journal is not located on the file system device,
+ * then we must flush the file system device before we issue
+ * the commit record
+ */
+ if (commit_transaction->t_flushed_data_blocks &&
+ (journal->j_fs_dev != journal->j_dev) &&
+ (journal->j_flags & JBD2_BARRIER))
+ blkdev_issue_flush(journal->j_fs_dev, NULL);
+
/* For locking, see the comment in journal_submit_data_buffers() */
spin_lock(&journal->j_list_lock);
list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index f1011f7..638ce45 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -653,6 +653,7 @@ struct transaction_s
* waiting for it to finish.
*/
unsigned int t_synchronous_commit:1;
+ unsigned int t_flushed_data_blocks:1;

/*
* For use by the filesystem to store fs-specific data

2009-12-15 06:33:12

by Oleg Drokin

Subject: Re: [PATCH] Possible data loss on ext[34], reiserfs with external journal

Hello!

On Dec 15, 2009, at 12:32 AM, [email protected] wrote:
> Can you separate this patch into separate ones for each file system?

Sure.

> I think we can actually do better for ext4; for example, in the case
> of data=journal or data=writeback, it's not necessary to flush the fs
> data device. It's only the case of data=ordered that we need to send
> a barrier. With that optimization, we do need to add a barrier in the
> case of fsync() and when we are doing a journal checkpoint, but the
> extra code complexity is worth not having to force a barrier for the
> fs data device for every single commit, especially in the data=journal
> mode with a heavy fsync workload.

Indeed, this is a good idea too.
I think we can still squeeze a bit more juice out of it if we
submit the data device flush right after submitting all the data, but
only do the waiting for it before issuing the journal device flush in normal
journaling mode (we need to do the waiting before writing the commit block
in async journaling mode).

> Do you have a test case that will allow you to easily try out this
> patch, in all of ext4's various journalling modes? Thanks!!

Not for vanilla kernel, but I'll see if I can construct something easily
for it.

> @@ -277,6 +278,16 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
> struct jbd2_inode *jinode, *next_i;
> int err, ret = 0;
>
> + /*
> + * If the journal is not located on the file system device,
> + * then we must flush the file system device before we issue
> + * the commit record
> + */
> + if (commit_transaction->t_flushed_data_blocks &&
> + (journal->j_fs_dev != journal->j_dev) &&
> + (journal->j_flags & JBD2_BARRIER))
> + blkdev_issue_flush(journal->j_fs_dev, NULL);
> +

I am afraid this is not enough. This code is called after the journal has been flushed in
the async commit case, so it leaves a race window where the journal transaction is already
on disk and complete, but the data is still in a cache somewhere.

Also, the call site has this comment, which I think is misleading:
/*
* This is the right place to wait for data buffers both for ASYNC
* and !ASYNC commit. If commit is ASYNC, we need to wait only after
* the commit block went to disk (which happens above). If commit is
* SYNC, we need to wait for data buffers before we start writing
* commit block, which happens below in such setting.
*/

In fact it is only safe to wait here in ASYNC mode (and with an internal journal only), because
the device was already flushed by the blkdev_issue_flush (and I hope all device drivers drain
the queue too; if not, even this might be problematic).

Bye,
Oleg

2009-12-15 16:45:11

by Theodore Ts'o

Subject: Re: [PATCH] Possible data loss on ext[34], reiserfs with external journal

On Tue, Dec 15, 2009 at 01:19:57AM -0500, Oleg Drokin wrote:
> > + /*
> > + * If the journal is not located on the file system device,
> > + * then we must flush the file system device before we issue
> > + * the commit record
> > + */
> > + if (commit_transaction->t_flushed_data_blocks &&
> > + (journal->j_fs_dev != journal->j_dev) &&
> > + (journal->j_flags & JBD2_BARRIER))
> > + blkdev_issue_flush(journal->j_fs_dev, NULL);
> > +
>
> I am afraid this is not enough. This code is called after journal
> was flushed for async commit case, so it leaves a race window where
> journal transaction is already on disk and complete, but the data is
> still in cache somewhere.

No, that's actually fine. In the ASYNC_COMMIT case, the commit won't
be valid until the checksum is correct, and we won't have written any
descriptor blocks yet at this point. So there is no race because
during that window, the commit is written but we won't write any
descriptor blocks until after the barrier returns.

> Also the callsite has this comment which is misleading, I think:
> /*
> * This is the right place to wait for data buffers both for ASYNC
> * and !ASYNC commit. If commit is ASYNC, we need to wait only after
> * the commit block went to disk (which happens above). If commit is
> * SYNC, we need to wait for data buffers before we start writing
> * commit block, which happens below in such setting.
> */

Yeah, that comment is confusing and not entirely accurate. I thought
about cleaning it up, and then decided to do that in a separate patch.

- Ted

2009-12-15 20:01:42

by Oleg Drokin

Subject: Re: [PATCH] Possible data loss on ext[34], reiserfs with external journal

Hello!

On Dec 15, 2009, at 11:45 AM, [email protected] wrote:
>>> + /*
>>> + * If the journal is not located on the file system device,
>>> + * then we must flush the file system device before we issue
>>> + * the commit record
>>> + */
>>> + if (commit_transaction->t_flushed_data_blocks &&
>>> + (journal->j_fs_dev != journal->j_dev) &&
>>> + (journal->j_flags & JBD2_BARRIER))
>>> + blkdev_issue_flush(journal->j_fs_dev, NULL);
>>> +
>> I am afraid this is not enough. This code is called after journal
>> was flushed for async commit case, so it leaves a race window where
>> journal transaction is already on disk and complete, but the data is
>> still in cache somewhere.
> No, that's actually fine. In the ASYNC_COMMIT case, the commit won't
> be valid until the checksum is correct, and we won't have written any
> descriptor blocks yet at this point. So there is no race because

What do you mean by descriptor blocks? Actual logged blocks?

> during that window, the commit is written but we won't write any
> descriptor blocks until after the barrier returns.

From my reading of the code, in jbd2_journal_commit_transaction:
After the first "JBD: commit phase 2" message we submit the actual data buffers.
Then after the second "JBD: commit phase 2" message (in the
"while (commit_transaction->t_buffers) {" loop) we iterate
through all the metadata changes and submit them to disk (after the
start_journal_io: label).
Then journal_submit_commit_record writes the commit record.
From this point on we have the entire transaction + commit
block in the write queue (or partially already on disk) +
all the data blocks in the write queue.
Next, blkdev_issue_flush on the journal device is done, cementing the transaction
on permanent storage, but the actual data is still potentially not
there, and it is not until the call to journal_finish_inode_data_buffers()
that the data is flushed too.

If you refer to the code after "JBD: commit phase 3" that waits on the actual
transaction block buffers to make sure they made it to the disk,
my reading of blkdev_issue_flush is that it adds the flush at the end of the
queue, i.e. after all of the journalled blocks that are sitting in there,
so they would all already be completed by the time blkdev_issue_flush returns.

Bye,
Oleg

2009-12-23 12:10:31

by Theodore Ts'o

Subject: Re: [PATCH] Possible data loss on ext[34], reiserfs with external journal

After taking a look at things, you're right. We need to flush the
data disk before the first call to journal_submit_commit_record() to
eliminate the race condition in the ASYNC_COMMIT case.

- Ted

commit 82dd90c0e74797d1cea2ebfb9370ba2d6a3e12bf
Author: Theodore Ts'o <[email protected]>
Date: Wed Dec 23 06:52:08 2009 -0500

ext4, jbd2: Add barriers for file systems with external journals

This is a bit complicated because we are trying to optimize when we
send barriers to the fs data disk. We could just throw in an extra
barrier to the data disk whenever we send a barrier to the journal
disk, but that's not always strictly necessary.

We only need to send a barrier during a commit when there are data
blocks which must be written out due to an inode written in
ordered mode, or if fsync() depends on the commit to force data blocks
to disk. Finally, before we drop transactions from the beginning of
the journal during a checkpoint operation, we need to guarantee that
any blocks that were flushed out to the data disk are firmly on the
rust platter before we drop the transaction from the journal.

Thanks to Oleg Drokin for pointing out this flaw in ext3/ext4.

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 0b22497..98bd140 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -88,9 +88,21 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
return ext4_force_commit(inode->i_sb);

commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
- if (jbd2_log_start_commit(journal, commit_tid))
+ if (jbd2_log_start_commit(journal, commit_tid)) {
+ /*
+ * When the journal is on a different device than the
+ * fs data disk, we need to issue the barrier in
+ * writeback mode. (In ordered mode, the jbd2 layer
+ * will take care of issuing the barrier. In
+ * data=journal, all of the data blocks are written to
+ * the journal device.)
+ */
+ if (ext4_should_writeback_data(inode) &&
+ (journal->j_fs_dev != journal->j_dev) &&
+ (journal->j_flags & JBD2_BARRIER))
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
jbd2_log_wait_commit(journal, commit_tid);
- else if (journal->j_flags & JBD2_BARRIER)
+ } else if (journal->j_flags & JBD2_BARRIER)
blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
return ret;
}
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index ca0f5eb..8868493 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -22,6 +22,7 @@
#include <linux/jbd2.h>
#include <linux/errno.h>
#include <linux/slab.h>
+#include <linux/blkdev.h>
#include <trace/events/jbd2.h>

/*
@@ -515,6 +516,20 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
journal->j_tail_sequence = first_tid;
journal->j_tail = blocknr;
spin_unlock(&journal->j_state_lock);
+
+ /*
+ * If there is an external journal, we need to make sure that
+ * any data blocks that were recently written out --- perhaps
+ * by jbd2_log_do_checkpoint() --- are flushed out before we
+ * drop the transactions from the external journal. It's
+ * unlikely this will be necessary, especially with an
+ * appropriately sized journal, but we need this to guarantee
+ * correctness. Fortunately jbd2_cleanup_journal_tail()
+ * doesn't get called all that often.
+ */
+ if ((journal->j_fs_dev != journal->j_dev) &&
+ (journal->j_flags & JBD2_BARRIER))
+ blkdev_issue_flush(journal->j_fs_dev, NULL);
if (!(journal->j_flags & JBD2_ABORT))
jbd2_journal_update_superblock(journal, 1);
return 0;
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6a10238..1bc74b6 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -259,6 +259,7 @@ static int journal_submit_data_buffers(journal_t *journal,
ret = err;
spin_lock(&journal->j_list_lock);
J_ASSERT(jinode->i_transaction == commit_transaction);
+ commit_transaction->t_flushed_data_blocks = 1;
jinode->i_flags &= ~JI_COMMIT_RUNNING;
wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
}
@@ -708,8 +709,17 @@ start_journal_io:
}
}

- /* Done it all: now write the commit record asynchronously. */
+ /*
+ * If the journal is not located on the file system device,
+ * then we must flush the file system device before we issue
+ * the commit record
+ */
+ if (commit_transaction->t_flushed_data_blocks &&
+ (journal->j_fs_dev != journal->j_dev) &&
+ (journal->j_flags & JBD2_BARRIER))
+ blkdev_issue_flush(journal->j_fs_dev, NULL);

+ /* Done it all: now write the commit record asynchronously. */
if (JBD2_HAS_INCOMPAT_FEATURE(journal,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
err = journal_submit_commit_record(journal, commit_transaction,
@@ -720,13 +730,6 @@ start_journal_io:
blkdev_issue_flush(journal->j_dev, NULL);
}

- /*
- * This is the right place to wait for data buffers both for ASYNC
- * and !ASYNC commit. If commit is ASYNC, we need to wait only after
- * the commit block went to disk (which happens above). If commit is
- * SYNC, we need to wait for data buffers before we start writing
- * commit block, which happens below in such setting.
- */
err = journal_finish_inode_data_buffers(journal, commit_transaction);
if (err) {
printk(KERN_WARNING
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index f1011f7..638ce45 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -653,6 +653,7 @@ struct transaction_s
* waiting for it to finish.
*/
unsigned int t_synchronous_commit:1;
+ unsigned int t_flushed_data_blocks:1;

/*
* For use by the filesystem to store fs-specific data